Vision-Language-Action (VLA) models represent a critical milestone toward embodied intelligence in robotic manipulation. To support their training, recent research has developed high-performance simulation engines for data synthesis. However, their effectiveness is still significantly limited by the simulation-to-reality (Sim2Real) gap, as policies trained on synthetic data often fail to generalize reliably to the real world.
To address this challenge, we present Sim2Real-VLA, a generalist robot control model trained exclusively on synthetic data yet capable of transferring seamlessly to real-world manipulation tasks. Sim2Real-VLA features a dual-system architecture: a high-level planner that infers chains-of-affordances, and a low-level actor that executes and validates these plans in real time through a tokenized action space. This design filters out manipulation-irrelevant features and prioritizes motion-critical dynamics, thereby improving Sim2Real domain transfer. Moreover, Sim2Real-VLA integrates tightly with automated data generation for manipulation skills, eliminating the need for manual fine-tuning and enabling scalable, hands-free training.
Empirical evaluations across bimanual, dexterous, and long-horizon tasks show that Sim2Real-VLA consistently outperforms previous VLA baselines under diverse real-world environments and domain shifts.
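The dual-system control loop described above can be sketched in a few lines. This is a minimal illustration, not code from the Sim2Real-VLA release: all class names (`HighLevelPlanner`, `LowLevelActor`, `Affordance`), the toy skill vocabulary, and the stubbed planner output are assumptions for exposition only.

```python
# Illustrative sketch of a dual-system VLA loop: a planner emits a
# chain-of-affordances; an actor executes each one via action tokens.
# Names and behavior are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass


@dataclass
class Affordance:
    """One step in the planner's chain-of-affordances."""
    skill: str   # e.g. "grasp", "lift", "place"
    target: str  # object the skill acts on


class HighLevelPlanner:
    """Infers a chain of affordances from a task instruction (stubbed)."""

    def plan(self, instruction: str) -> list[Affordance]:
        # A real planner would condition on vision and language;
        # here we return a fixed chain for illustration.
        return [
            Affordance("grasp", "cup"),
            Affordance("lift", "cup"),
            Affordance("place", "cup"),
        ]


class LowLevelActor:
    """Executes affordances through a discrete, tokenized action space."""

    VOCAB = {"grasp": 0, "lift": 1, "place": 2}  # toy token vocabulary

    def tokenize(self, aff: Affordance) -> int:
        return self.VOCAB[aff.skill]

    def execute(self, aff: Affordance) -> bool:
        token = self.tokenize(aff)
        # A real actor would decode `token` into motor commands and
        # validate the outcome in real time; this stub always succeeds.
        return token in self.VOCAB.values()


def run_episode(instruction: str) -> bool:
    """Run one plan-then-act episode; True if every affordance succeeds."""
    planner, actor = HighLevelPlanner(), LowLevelActor()
    return all(actor.execute(aff) for aff in planner.plan(instruction))


print(run_episode("put the cup on the shelf"))  # True under this stub
```

The key design point this sketch mirrors is the separation of concerns: the planner reasons only over affordances (no pixels, no joint angles), while the actor sees only discrete action tokens, which is one way the model can abstract away manipulation-irrelevant appearance features.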
Sim2Real-VLA dual-system architecture: high-level planner with chains-of-affordances and low-level actor with tokenized action space for zero-shot Sim2Real transfer.
Success rates in simulation (Sim.) and the real world (Real.), and average step counts (Steps), on six robot manipulation tasks. Sim2Real-VLA achieves the best performance across all tasks and metrics.
Per-task results under domain shift (supplementary). Each of the six tasks is evaluated under one standard success condition and four gap conditions: background (bg), object (obj), table, and table+obj.
If you find this work useful, please cite:
@inproceedings{zhao2026sim2realvla,
  title={Sim2Real-VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation},
  author={Runyi Zhao and Sheng Xu and Ruixing Jin and Yueci Deng and Yunxin Tai and Kui Jia and Guiliang Liu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
}