Code we created during this video:
🟢 https://github.com/Grigorij-Dudnik/Ro...
We're making our XLeRobot fully autonomous by giving it an AI agent that makes its own decisions and controls its movements. This video shows how we connected an LLM agent to our XLeRobot so it can see through a camera, decide where to go, and respond to voice commands. We tested different LLMs, including GPT and Gemini, before settling on Gemini Robotics, which is designed specifically for robotics applications. The LangChain framework helped us build the AI agent, which runs in a decision loop: the robot takes a photo, the LLM decides what to do next, the agent executes a tool like "move forward" or "turn right", and the cycle repeats. We also added a microphone for voice control, implemented speech-to-text transcription, and figured out how to make the Raspberry Pi robot listen only when we're actually talking to it instead of transcribing everything constantly.
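For readers who want to picture the decision loop described above, here is a minimal sketch in plain Python. It is not the code from our repository and it does not use LangChain: `FakeRobot`, `scripted_llm`, and `run_agent` are hypothetical names made up for illustration, and the "LLM" is just a scripted list of actions standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeRobot:
    """Hypothetical stand-in for the robot: tracks heading and distance only."""
    heading_deg: int = 0
    distance_cm: int = 0
    log: List[str] = field(default_factory=list)

    def take_photo(self) -> str:
        # A real robot would return a camera frame; here it is a text summary.
        return f"frame(heading={self.heading_deg}, distance={self.distance_cm})"

    def move_forward(self) -> None:
        self.distance_cm += 20  # assumed step size, for illustration only
        self.log.append("move_forward")

    def turn_right(self) -> None:
        self.heading_deg = (self.heading_deg + 30) % 360  # assumed turn angle
        self.log.append("turn_right")


def scripted_llm(plan: List[str]) -> Callable[[str], str]:
    """Return a fake 'LLM' that replays a fixed plan, then answers 'finish'."""
    steps = iter(plan)

    def decide(photo: str) -> str:
        return next(steps, "finish")

    return decide


def run_agent(robot: FakeRobot, decide: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Photo -> decision -> tool call, repeated until 'finish' or max_steps."""
    tools: Dict[str, Callable[[], None]] = {
        "move_forward": robot.move_forward,
        "turn_right": robot.turn_right,
    }
    for _ in range(max_steps):
        photo = robot.take_photo()
        action = decide(photo)
        if action == "finish":
            break
        tool = tools.get(action)
        if tool is None:
            # Unknown tool names are recorded and skipped, not executed.
            robot.log.append(f"unknown:{action}")
            continue
        tool()
    return robot.log


if __name__ == "__main__":
    robot = FakeRobot()
    print(run_agent(robot, scripted_llm(["move_forward", "turn_right", "move_forward"])))
```

In a real setup the `decide` function would send the photo to the model and get back a tool name. The `max_steps` cap is one simple way to keep the robot from looping forever.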
Getting this autonomous robot to work involved solving a bunch of hardware problems with the Raspberry Pi robot setup, like USB ports randomly changing names and camera images not giving the XLeRobot's AI agent enough spatial information to understand angles. We fixed the angle problem by augmenting the camera feed with a grid overlay so the robot vision system could see exactly which direction to turn. Throughout the video we run different tests: making the embodied AI approach humans, find exits, navigate corridors, and locate objects. Some tests passed, like finding a person in the corridor, while others failed spectacularly, like trying to use the ramp or finding a backpack around a corner. The DIY robot struggles with telling left from right and with judging its own body size, but it's actually making autonomous decisions and moving around. We open sourced all the code from this project so anyone building an open source robot can use it on their own XLeRobot. Next video we'll be adding arms and implementing vision action models so this AI robotics platform can actually pick up and manipulate objects.
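To give a rough idea of how a grid overlay can turn pixels into angles, here is a small Python sketch that computes where vertical grid lines would go for a given camera. It is only an illustration under stated assumptions, not the overlay code from our repository: `grid_lines` is a hypothetical helper, the camera is assumed to follow a simple pinhole model with no lens distortion, and the 90 degree horizontal field of view in the example is a made-up value. Actually drawing the lines and labels onto the frame is left out.

```python
import math
from typing import List, Tuple


def grid_lines(width_px: int, hfov_deg: float, step_deg: int) -> List[Tuple[int, int]]:
    """Return (x_pixel, angle_deg) pairs for vertical grid lines.

    Angle 0 is the image center, negative angles are to the left.
    Assumes a pinhole camera with no lens distortion.
    """
    half = int(hfov_deg // 2)
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (width_px / 2) / math.tan(math.radians(hfov_deg / 2))
    # Start at the leftmost multiple of step_deg that fits in the view.
    first = -(half // step_deg) * step_deg
    lines = []
    for angle in range(first, half + 1, step_deg):
        x = round(width_px / 2 + focal_px * math.tan(math.radians(angle)))
        lines.append((x, angle))
    return lines


if __name__ == "__main__":
    # Example values only: a 640 px wide frame and an assumed 90 degree view.
    for x, angle in grid_lines(640, 90, 15):
        print(f"draw vertical line at x={x}, label '{angle} deg'")
```

With labeled lines like these in the image, the LLM can read off a turn angle directly instead of guessing it from the raw picture.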
0:00 - Intro
1:39 - Equipping AI with movement tools
3:05 - Connecting camera feed
3:36 - How does the agent work?
4:40 - Choosing agentic framework
5:36 - First tests
7:37 - Choosing LLM
8:37 - Image augmentation
9:32 - Swapping device names (Udev rules)
10:48 - Enabling listening for voice commands
13:00 - Waking up transcription
15:11 - Listening to own motors problem
16:44 - Little testing
17:41 - Emergency opened
18:11 - Navigation lag problem
21:12 - Test of the ramp
22:27 - Searching for a human test
22:50 - Searching for a backpack around the corner test
23:17 - Searching for a backpack in the hallway
24:28 - Outro