I put three different VLA models to the test - the older ACT, the mid-generation SmolVLA, and the brand new Pi0.5 - to see which one actually works best for robot control. I ran them on both my real robot and in a Unity simulation environment, comparing their success rates on pick and place tasks, how much GPU memory they needed, and how fast they could calculate robot actions. Spoiler alert: getting these VLA models running wasn't always smooth sailing, especially with Pi0.5, but I learned a ton about imitation learning and embodied AI in the process. I even threw in a bonus test with the Pi-0 robot model at the end to give you more data points for your own DIY robot projects.
For the real robot experiments with LeRobot, I trained the ACT transformer and SmolVLA on my local machine, then had to rent a cloud GPU to train Pi0.5 since my hardware couldn't handle it. Let me tell you, the whole "just rent a cloud GPU" thing isn't as magical as people make it sound - I went through seven different instances dealing with compatibility issues, disk space problems, and unexpected charges before finally getting the training to complete. The lightweight ACT transformer absolutely crushed SmolVLA in the pick and place task despite being the older model, using only 2GB of VRAM compared to SmolVLA's 4GB. Pi0.5 refused to work at all on the real robot, though the attempt did show that it needs around 8GB of GPU memory.
The Unity simulation tests told a different story though. SmolVLA actually outperformed ACT this time on the same pick and place challenge, though it took 20 times longer to calculate each action - I even had to reduce the FPS to make it more stable, which weirdly improved its performance. Pi0.5 still wouldn't cooperate due to what seem to be LeRobot library bugs, but here's where it gets interesting: the older Pi-0 robot model ran without problems in simulation, matching SmolVLA's 30% success rate while using similar resources to Pi0.5. For anyone doing embodied AI work or building their own DIY robot, my takeaway is that the ACT transformer is your friend for edge devices and limited hardware, SmolVLA is worth trying for better performance if you have the GPU power, and the Pi models show serious promise once those library issues get sorted out - which hopefully they already are by the time you're watching this VLA model comparison.
This video is for you if you are passionate about embodied AI and like to train VLA models.
00:00 - Intro
1:03 - Dataset recording
2:03 - Training
3:01 - My opinion on cloud GPUs
4:39 - Experiment 1 (Real robot) intro
4:54 - ACT test
6:39 - Extended action horizon experiment
7:24 - SmolVLA test
10:16 - Pi0.5 test
10:58 - Real robot experiment results
11:38 - Experiment 2 (simulation) intro
11:59 - ACT test sim
13:54 - SmolVLA test sim
16:01 - Pi0.5 test sim
17:07 - Simulation experiment results
17:29 - Bonus
18:27 - Conclusions
19:27 - Outro