Embodied LLM Agent Uses SmolVLA to Manipulate Objects

Published: 20 May 2026
Channel: Greg's Tech

🦾 RoboCrew lib (our code): https://github.com/Grigorij-Dudnik/Ro...

In this video, we take on the next challenge: teaching our XLeRobot to grab objects with its arms. Moving around on wheels is easy for an embodied LLM agent, but robot arm manipulation is a whole different beast, since the agent cannot directly control all six joints at once. To solve this, we expose VLA models to the LLM agent as tools, letting the robot brain decide exactly when to trigger the arms. This is a huge step toward building a truly autonomous robot, and as far as I know, no one has tried connecting an embodied AI agent and a VLA model quite like this before.
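To make the "VLA as a tool" idea concrete, here is a minimal sketch in plain Python. It is not the RoboCrew API: `Tool`, `make_vla_tool`, `dispatch`, and `fake_grab_step` are hypothetical names invented for illustration, and the stand-in policy just pretends to finish after three steps. The point is only the shape: the LLM agent picks a tool by name, and the tool runs a VLA policy loop that drives the arm until it reports completion.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A named action the LLM agent can choose to call (hypothetical type)."""
    name: str
    description: str
    run: Callable[[], str]


def make_vla_tool(name: str, description: str,
                  policy_step: Callable[[int], bool], max_steps: int) -> Tool:
    """Wrap a VLA policy rollout as a single tool call.

    policy_step(step) is assumed to execute one control step on the arm
    and return True once the manipulation is finished.
    """
    def run() -> str:
        for step in range(max_steps):
            if policy_step(step):
                return f"{name}: finished after {step + 1} steps"
        return f"{name}: stopped after {max_steps} steps"

    return Tool(name, description, run)


def dispatch(tools: Dict[str, Tool], tool_name: str) -> str:
    """Run the tool the agent asked for and return a text result for the LLM."""
    tool = tools.get(tool_name)
    if tool is None:
        return f"unknown tool: {tool_name}"
    return tool.run()


def fake_grab_step(step: int) -> bool:
    """Stand-in for a real VLA policy: reports success on the third step."""
    return step >= 2


grab = make_vla_tool(
    "grab_notebook",
    "Grab the notebook in front of the robot.",
    fake_grab_step,
    max_steps=10,
)
tools = {grab.name: grab}

# In a real agent the tool name would come from the LLM's tool-call output.
result = dispatch(tools, "grab_notebook")
print(result)
```

Returning a short text result lets the agent see whether the grab finished or timed out before it plans the next move.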

Getting this DIY robot to actually grab things meant we had to record a ton of data using robot teleoperation, first with VR and then with a leader arm, and yes, you really do have to train the human operator before you can train the robot. After filtering out the bad episodes, we trained our VLA model on the dataset, but running it on a Raspberry Pi robot brought some serious hardware limits to light. We got a lightweight VLA model running locally and also tested SmolVLA, but the Raspberry Pi does not have enough compute for smooth movement: SmolVLA took about three minutes per step.

Despite the compute struggles, we finally got all the pieces working together for our embodied LLM agent to combine movement and robot arm manipulation in a single run. It took some troubleshooting, like fixing camera sharing between the agent and the VLA model, but watching the XLeRobot successfully approach, grab, and hand over a notebook is incredible. If you want to try this out on your own DIY robot, we have open-sourced the code in the RoboCrew library, so you can easily get your XLeRobot running its own robot brain with SmolVLA and LLM agent tools.
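The camera-sharing problem can be pictured with a small sketch. This is one possible approach and not necessarily the fix used in the video or in RoboCrew: it assumes the camera can only be held by one consumer at a time, so the agent and the VLA policy take turns through a lock. `SharedCamera`, `open_camera`, and `close_camera` are hypothetical names, and the open/close functions are fakes that only log what happens.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, List, Optional


class SharedCamera:
    """Hand one camera to one consumer at a time (hypothetical helper)."""

    def __init__(self, open_fn: Callable[[], str],
                 close_fn: Callable[[str], None]) -> None:
        self._open_fn = open_fn
        self._close_fn = close_fn
        self._lock = threading.Lock()
        self.owner: Optional[str] = None

    @contextmanager
    def acquire(self, owner: str) -> Iterator[str]:
        # The lock blocks a second consumer until the first one releases.
        with self._lock:
            handle = self._open_fn()
            self.owner = owner
            try:
                yield handle
            finally:
                # Always release the device, even if the consumer raised.
                self.owner = None
                self._close_fn(handle)


events: List[str] = []


def open_camera() -> str:
    """Fake device open; a real version would open the camera stream."""
    events.append("open")
    return "handle"


def close_camera(handle: str) -> None:
    """Fake device close."""
    events.append("close")


camera = SharedCamera(open_camera, close_camera)

# The agent looks at the scene first, then hands the camera to the VLA tool.
with camera.acquire("agent") as cam:
    events.append(f"agent uses {cam}")
with camera.acquire("vla") as cam:
    events.append(f"vla uses {cam}")

print(events)
```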

0:00 - Intro
1:12 - What are VLAs?
2:03 - How to connect LLM and VLA?
2:24 - Creating Script for a Dataset Recording
3:23 - Dataset Recording
5:14 - Dataset Cleaning
5:42 - Running Training
6:12 - Test of Trained Model
6:38 - Implementation of VLA as a Tool
7:48 - Camera Access Problems
9:34 - Implementing VLA as a Tool
10:47 - Creating Remote Server
11:50 - Recording New Dataset with Leader Arm
13:28 - Inferencing ACT Policy
14:00 - Inferencing SmolVLA Policy
14:27 - Giving Notebook to Human Dataset
14:54 - Overheat Problems
15:58 - Giving Notebook Test
16:24 - Connecting LLM and VLAs
17:03 - Approach and Grab a Notebook
20:00 - RoboCrew - the Code we Did
20:45 - Approach and Give a Notebook
21:17 - Conclusions
21:57 - Outro