Sim2Real-VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation

Runyi Zhao1, Sheng Xu1, Ruixing Jin1, Yueci Deng1, Yunxin Tai2, Kui Jia1,2, Guiliang Liu1,3*
1School of Data Science, The Chinese University of Hong Kong, Shenzhen
2DexForce Technology
3Shenzhen Loop Area Institute
*Corresponding author: liuguiliang@cuhk.edu.cn

Abstract

Vision-Language-Action (VLA) models represent a critical milestone toward embodied intelligence in robotic manipulation. To support their training, recent research has developed high-performance simulation engines for data synthesis. However, the effectiveness of such synthetic data is still significantly limited by the simulation-to-reality (Sim2Real) gap: policies trained purely in simulation often fail to generalize reliably to the real world.

To address this challenge, we present Sim2Real-VLA, a generalist robot control model trained exclusively on synthetic data, yet capable of transferring seamlessly to real-world manipulation tasks. Sim2Real-VLA features a dual-system architecture: a high-level planner that infers chains-of-affordances, and a low-level actor that executes and validates these plans in real time via a tokenized action space. This design filters out manipulation-irrelevant features and prioritizes motion-critical dynamics, thereby enhancing Sim2Real domain transfer. Moreover, a notable advantage of Sim2Real-VLA lies in its tight integration with automated data generation for manipulation skills, eliminating the need for manual fine-tuning and enabling scalable, hands-free training.

Empirical evaluations across bimanual, dexterous, and long-horizon tasks show that Sim2Real-VLA consistently outperforms previous VLA baselines under diverse real-world environments and domain shifts.

Video

Method

Sim2Real-VLA architecture

Sim2Real-VLA dual-system architecture: high-level planner with chains-of-affordances and low-level actor with tokenized action space for zero-shot Sim2Real transfer.
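The dual-system control flow described above (planner produces a chain-of-affordances; actor decodes discrete action tokens per affordance) can be sketched as follows. This is a minimal illustrative sketch only: every class, function, and heuristic below is hypothetical and not the authors' actual API, and the stand-in planner/actor are placeholders for the learned VLA components.

```python
# Hypothetical sketch of the dual-system loop: a high-level planner maps an
# instruction to a chain-of-affordances, and a low-level actor emits a chunk
# of discrete action tokens per affordance. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Affordance:
    """One step in the planner's chain-of-affordances."""
    name: str    # e.g. "grasp"
    target: str  # object the affordance applies to

def plan_affordances(instruction: str) -> list[Affordance]:
    """Stand-in high-level planner; a real system also conditions on images."""
    if "pour" in instruction:
        return [Affordance("grasp", "cup"),
                Affordance("tilt", "cup"),
                Affordance("place", "cup")]
    return [Affordance("grasp", "object"), Affordance("place", "object")]

def actor_step(affordance: Affordance, chunk_len: int = 8) -> list[int]:
    """Stand-in low-level actor: emit one fixed-length chunk of action tokens.

    In practice a transformer decodes tokens from observations; here we just
    derive a dummy token id in a 256-token vocabulary for each step.
    """
    return [hash((affordance.name, affordance.target, i)) % 256
            for i in range(chunk_len)]

def run_episode(instruction: str) -> list[list[int]]:
    """Execute the affordance chain, one token chunk per affordance."""
    chain = plan_affordances(instruction)
    return [actor_step(a) for a in chain]

chunks = run_episode("pour water into the bowl")
print(len(chunks))  # one token chunk per affordance in the chain
```

A real deployment would additionally validate each chunk against observations before execution, as the abstract describes; that feedback loop is omitted here for brevity.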

Experimental Results

Main Results Table

Success rates in simulation (Sim.) and the real world (Real.), plus step counts (Steps), on six robot manipulation tasks. Sim2Real-VLA achieves the best performance across all tasks and metrics.

Domain Gap (Sim2Real)

Per-task results under domain shift (supplementary). Each task is evaluated under one no-gap (Success) condition and four domain-gap conditions.

Schematic

Domain gap schematic

Test Results

Domain gap test results

Each task is tested under five conditions: Success (no domain gap), Gap: bg (background shift), Gap: obj (object shift), Gap: table (table shift), and Gap: table+obj (combined table and object shift).

Tasks:
- Table Rearrangement
- Single-Arm Water Pouring
- Dual-Arm Water Pouring
- Basket Pick-and-Place
- Items Hand-Over and Place
- Pan Open and Place
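The per-task protocol above (one no-gap condition plus four gap conditions, each run for repeated trials) reduces to a simple per-condition success-rate tally. The sketch below is illustrative only: the condition names mirror the page, but the trial counts and outcomes are made up, not the paper's numbers.

```python
# Illustrative per-condition success-rate tally for the domain-gap protocol.
# Outcomes below are dummy data; real results are in the paper's tables.
CONDITIONS = ["success", "gap:bg", "gap:obj", "gap:table", "gap:table+obj"]

def success_rate(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Fraction of successful rollouts per evaluation condition."""
    return {c: sum(trials) / len(trials) for c, trials in outcomes.items()}

# Dummy rollout outcomes for one task (10 trials per condition, alternating
# success/failure, so every condition tallies to 0.5).
outcomes = {c: [i % 2 == 0 for i in range(10)] for c in CONDITIONS}
rates = success_rate(outcomes)
print(rates["success"])  # 0.5
```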

Citation

If you find this work useful, please cite:

@inproceedings{zhao2026sim2realvla,
  title={Sim2Real-VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation},
  author={Runyi Zhao and Sheng Xu and Ruixing Jin and Yueci Deng and Yunxin Tai and Kui Jia and Guiliang Liu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
}