OpenVLA Generalist Robot Policy
🔗 Source: arXiv
OpenVLA: An Open-Source Vision-Language-Action Model
🚀 Technical Novelty
- Mechanism: Feeds fused DINOv2 and SigLIP visual features into a Llama 2 7B backbone, then fine-tunes the model end to end on 970k real-world robot episodes to predict actions as discrete language tokens. Supports LoRA for parameter-efficient adaptation and int4 quantization for low-memory inference.
- Nuance: Offers an open alternative to closed, much larger VLAs (e.g., the 55B-parameter RT-2-X) and to modular architectures: a single end-to-end model that achieves higher success rates with 7x fewer parameters and can be deployed on commodity GPUs.
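To make "actions as language tokens" concrete, here is a minimal sketch of per-dimension action discretization. It assumes 256 uniform bins and clipping bounds taken from the 1st and 99th percentiles of the training actions. The function names (`fit_bounds`, `encode`, `decode`) are hypothetical, and the step that ties bin ids to tokenizer vocabulary entries is omitted. It illustrates the general technique and is not the released implementation.

```python
import numpy as np

N_BINS = 256  # bins per action dimension (assumption for this sketch)


def fit_bounds(actions, low_q=1.0, high_q=99.0):
    """Per-dimension clipping bounds from quantiles of the training actions.

    actions: array of shape (num_samples, action_dim).
    """
    low = np.percentile(actions, low_q, axis=0)
    high = np.percentile(actions, high_q, axis=0)
    return low, high


def encode(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer bin ids in [0, n_bins - 1]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # each entry now lies in [0, 1]
    # The upper bound itself would land in bin n_bins, so cap at n_bins - 1.
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)


def decode(bins, low, high, n_bins=N_BINS):
    """Map bin ids back to continuous values at the bin centers."""
    return low + (bins + 0.5) / n_bins * (high - low)
```

Under these assumptions, a round trip through `encode` and `decode` loses at most half a bin width per dimension, which is the resolution limit that token-based action prediction accepts. The sketch also assumes every action dimension varies in the training data, so `high - low` is nonzero.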
💡 Yield
- Surpasses RT-2-X by 16.5 percentage points in absolute task success rate across 29 tasks and multiple robot embodiments.
- Matches full fine-tuning performance while training only 1.4% of the model's parameters via LoRA, cutting training compute by 8x.
- Retains bfloat16-level task performance under int4 quantization, halving VRAM requirements while maintaining a viable control frequency (~3 Hz).
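The small trainable fraction follows from how LoRA is built: the base weight stays frozen and only two low-rank factors are trained. The sketch below shows a single LoRA-augmented linear layer in NumPy and counts its trainable fraction. The layer width (4096) and rank (32) in the usage note are illustrative assumptions, not values stated in this summary, and `lora_forward` and `lora_fraction` are hypothetical helper names.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Forward pass of a linear layer with a LoRA update.

    x: (batch, d_in) input
    W: (d_out, d_in) frozen base weight
    A: (r, d_in) trainable down-projection
    B: (d_out, r) trainable up-projection, usually initialized to zero
    alpha: scaling hyperparameter; the update is scaled by alpha / r
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def lora_fraction(d_in, d_out, r):
    """Trainable LoRA parameters as a fraction of the frozen weight's size."""
    return r * (d_in + d_out) / (d_in * d_out)
```

For a square 4096 x 4096 layer with rank 32, `lora_fraction(4096, 4096, 32)` evaluates to 0.015625, or about 1.6% of that layer's weights. That is the same order of magnitude as the 1.4% figure above, although the exact model-wide number depends on which layers receive adapters.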
⚠️ Limitations
- Restricted to single-image observations; lacks support for multiple camera views, proprioceptive inputs, observation history, or other heterogeneous sensory inputs.
- Inference throughput remains insufficient for high-frequency control setups (e.g., 50 Hz ALOHA systems).
- Success rates typically stay below 90% on the tested tasks, highlighting reliability gaps in complex real-world scenarios.
- Compute constraints left foundational design questions (e.g., base VLM scaling, internet-scale co-training benefits) unexplored.
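The throughput gap can be put in numbers with some simple arithmetic. The sketch below compares per-step time budgets at the two control rates mentioned in this summary (~3 Hz and 50 Hz). It also estimates how many actions one inference call would need to emit to keep up, under the strong assumption that a chunk of actions costs about the same as a single action. The helper names are hypothetical.

```python
import math


def step_budget_ms(hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / hz


def required_speedup(current_hz, target_hz):
    """Factor by which inference must get faster to reach the target rate."""
    return target_hz / current_hz


def min_chunk_size(current_hz, target_hz):
    """Actions each inference call must emit to sustain the target rate,
    assuming a chunk takes as long to produce as a single action."""
    return math.ceil(target_hz / current_hz)
```

With these definitions, a 50 Hz controller leaves 20 ms per step while a ~3 Hz policy takes roughly 333 ms, so closing the gap needs about a 17x speedup or, under the stated assumption, chunks of at least 17 actions per call.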