7. π0: A Vision-Language-Action Flow Model for General Robot Control

AI Paper By Hand

Nov 08, 2024

π0 is a prototype model intended to enable a capable and dexterous generalist robot policy. It is obtained by combining large-scale multi-task and multi-robot data collection with a new network architecture.

The way π0 works can be summarized in three main steps:
1. The training framework uses a weighted combination of the authors' own dexterous manipulation datasets, collected on 7 different robot configurations for 68 different tasks, and the entire OXE dataset, which contains data from 22 robots. It also uses diverse language labels, combining task names and segment annotations. The paper calls this the pre-training mixture, which the authors suggest helps train a base model with broad capabilities and generalization.
2. It uses PaliGemma-3B as the base vision-language model (VLM), which handles image and text inputs. Alongside the VLM, it adds a 300M-parameter action expert (for robotics-specific inputs and outputs), which is initialized from scratch.
3. Based on the above, the model can be prompted for zero-shot control or fine-tuned on high-quality data to enable complex multi-stage tasks.
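
To make the "weighted combination" in step 1 concrete, here is a minimal sketch of how a data mixture can be sampled. It is not the authors' pipeline: the dataset sizes, the names `in_house_dexterous` and `oxe`, and the exponent `alpha` are all assumptions chosen for illustration. The idea is to weight each source by its size raised to a power below 1, so that very large sources are down-weighted relative to plain proportional sampling.

```python
import random


def mixture_weights(sizes, alpha=0.5):
    """Return normalized sampling weights proportional to size ** alpha.

    alpha = 1.0 samples proportionally to dataset size,
    alpha = 0.0 samples every source equally,
    values in between down-weight the largest sources.
    """
    raw = {name: n ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def sample_sources(weights, k, seed=0):
    """Draw k source names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Hypothetical sizes, NOT taken from the paper.
sizes = {"in_house_dexterous": 900_000, "oxe": 100_000}
weights = mixture_weights(sizes, alpha=0.5)
batch_sources = sample_sources(weights, k=8)
```

With these made-up sizes, proportional sampling would draw 90% of examples from the larger source, while `alpha=0.5` lowers that share to 75%, so the smaller source is seen more often during training.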

- Across all tasks and all comparisons, even the “parity” version of the model outperforms all baselines (OpenVLA & Octo), and the full version of the model achieves the best results by a large margin.
- The full pre-trained π0 model attains more than 50% of the maximum score across all of the tasks with especially significant improvements on the hardest tasks.


Paper : https://www.physicalintelligence.company/download/pi0.pdf
Website : https://www.physicalintelligence.company/blog/pi0

