Vision-Language-Action Models: From Chatbots to Interaction with the Physical World

LLM-powered chatbots marked a turning point in artificial intelligence, enabling systems that understand and generate natural language fluently. More recently, multimodal models extended these capabilities to images, audio, and video, bringing AI closer to a fuller understanding of its environment. In this talk we'll explore Vision-Language-Action (VLA) models, architectures that combine computer vision, natural language, and decision-making so that intelligent agents can interpret their environment and execute actions in the physical world. We'll also see how the Python ecosystem has become central to building these systems, through tools such as PyTorch, Hugging Face, robotics simulators, and the open source frameworks currently used in robotics and multimodal AI.
