Vision-Language-Action Models: From Chatbots to Interaction with the Physical World

FORMAT: Talk LEVEL: Intermediate LANGUAGE: Spanish

LLM-powered chatbots marked a turning point in artificial intelligence, enabling systems that understand and generate natural language fluently. More recently, multimodal models have extended these capabilities to images, audio, and video, bringing AI closer to a fuller understanding of its environment. In this talk we'll explore Vision-Language-Action (VLA) models, architectures that combine computer vision, natural language, and decision-making so that intelligent agents can interpret their environment and execute actions in the physical world. We'll also see how the Python ecosystem has become a key part of developing these solutions, through modern tools such as PyTorch, Hugging Face, robotic simulators, and open source frameworks currently used in robotics and multimodal artificial intelligence.

Speaker

Gerardo Vilcamiza Espinoza

Senior AI Engineer @ NTT DATA

Hi! My name is Gerardo and I'm a Mechatronics Engineer with a Master's in Embedded Artificial Intelligence. I currently work as a Senior AI Engineer at the technology consultancy NTT DATA, where I lead generative AI projects that apply text, audio, and image generation models to solutions for the banking and insurance sectors across Latin America. I'm also a research professor at Universidad de Buenos Aires, where I teach Deep Learning and Computer Vision courses. Additionally, I lead research projects at the Embedded Systems Laboratory focused on robotics and satellite systems.
