Sim2Real-VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation

Runyi Zhao1, Sheng Xu1, Ruixing Jin1, Yueci Deng1, Yunxin Tai2, Kui Jia1,2, Guiliang Liu1,3*
1School of Data Science, The Chinese University of Hong Kong, Shenzhen
2DexForce Technology
3Shenzhen Loop Area Institute
*Corresponding author: liuguiliang@cuhk.edu.cn

Abstract

Vision-Language-Action (VLA) models represent a critical milestone toward embodied intelligence in robotic manipulation. To support their training, recent research has developed high-performance simulation engines for data synthesis. However, the effectiveness of such synthetic data is still significantly limited by the simulation-to-reality (Sim2Real) gap: policies trained purely in simulation often fail to generalize reliably to the real world.

To address this challenge, we present Sim2Real-VLA, a generalist robot control model trained exclusively on synthetic data, yet capable of transferring seamlessly to real-world manipulation tasks. Sim2Real-VLA features a dual-system architecture: a high-level planner that infers chains-of-affordances, and a low-level actor that executes and validates these plans in real time via a tokenized action space. This design filters out manipulation-irrelevant features and prioritizes motion-critical dynamics, thereby enhancing Sim2Real domain transfer. Moreover, a notable advantage of Sim2Real-VLA lies in its tight integration with automated data generation for manipulation skills, eliminating the need for manual fine-tuning and enabling scalable, hands-free training.

Empirical evaluations across bimanual, dexterous, and long-horizon tasks show that Sim2Real-VLA consistently outperforms previous VLA baselines under diverse real-world environments and domain shifts.

Video

Method

Sim2Real-VLA architecture

Sim2Real-VLA dual-system architecture: high-level planner with chains-of-affordances and low-level actor with tokenized action space for zero-shot Sim2Real transfer.
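The dual-system control flow described above (planner produces a chain-of-affordances; actor decodes discrete action tokens per affordance) can be sketched as follows. This is a minimal illustrative sketch only: every class, function, and heuristic below is hypothetical and not the authors' actual API, and the stand-in planner/actor are placeholders for the learned VLA components.

```python
# Hypothetical sketch of the dual-system loop: a high-level planner maps an
# instruction to a chain-of-affordances, and a low-level actor emits a chunk
# of discrete action tokens per affordance. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Affordance:
    """One step in the planner's chain-of-affordances."""
    name: str    # e.g. "grasp"
    target: str  # object the affordance applies to

def plan_affordances(instruction: str) -> list[Affordance]:
    """Stand-in high-level planner; a real system also conditions on images."""
    if "pour" in instruction:
        return [Affordance("grasp", "cup"),
                Affordance("tilt", "cup"),
                Affordance("place", "cup")]
    return [Affordance("grasp", "object"), Affordance("place", "object")]

def actor_step(affordance: Affordance, chunk_len: int = 8) -> list[int]:
    """Stand-in low-level actor: emit one fixed-length chunk of action tokens.

    In practice a transformer decodes tokens from observations; here we just
    derive a dummy token id in a 256-token vocabulary for each step.
    """
    return [hash((affordance.name, affordance.target, i)) % 256
            for i in range(chunk_len)]

def run_episode(instruction: str) -> list[list[int]]:
    """Execute the affordance chain, one token chunk per affordance."""
    chain = plan_affordances(instruction)
    return [actor_step(a) for a in chain]

chunks = run_episode("pour water into the bowl")
print(len(chunks))  # one token chunk per affordance in the chain
```

A real deployment would additionally validate each chunk against observations before execution, as the abstract describes; that feedback loop is omitted here for brevity.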

Experimental Results

Main Results Table

Success rates in simulation (Sim.) and the real world (Real.), plus step counts (Steps), on six robot manipulation tasks. Sim2Real-VLA achieves the best performance across all tasks and metrics.

Domain Gap (Sim2Real)

Per-task results under domain shift (supplementary). Each task is evaluated under one no-gap (Success) condition and four domain-gap conditions.

Schematic

Domain gap schematic

Test Results

Domain gap test results

Each task is tested under five conditions: Success (no domain gap), Gap: bg (background shift), Gap: obj (object shift), Gap: table (table shift), and Gap: table+obj (combined table and object shift).

Tasks:
- Table Rearrangement
- Single-Arm Water Pouring
- Dual-Arm Water Pouring
- Basket Pick-and-Place
- Items Hand-Over and Place
- Pan Open and Place
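The per-task protocol above (one no-gap condition plus four gap conditions, each run for repeated trials) reduces to a simple per-condition success-rate tally. The sketch below is illustrative only: the condition names mirror the page, but the trial counts and outcomes are made up, not the paper's numbers.

```python
# Illustrative per-condition success-rate tally for the domain-gap protocol.
# Outcomes below are dummy data; real results are in the paper's tables.
CONDITIONS = ["success", "gap:bg", "gap:obj", "gap:table", "gap:table+obj"]

def success_rate(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Fraction of successful rollouts per evaluation condition."""
    return {c: sum(trials) / len(trials) for c, trials in outcomes.items()}

# Dummy rollout outcomes for one task (10 trials per condition, alternating
# success/failure, so every condition tallies to 0.5).
outcomes = {c: [i % 2 == 0 for i in range(10)] for c in CONDITIONS}
rates = success_rate(outcomes)
print(rates["success"])  # 0.5
```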

Citation

If you find this work useful, please cite:

@inproceedings{zhao2026sim2realvla,
  title={Sim2Real-VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation},
  author={Runyi Zhao and Sheng Xu and Ruixing Jin and Yueci Deng and Yunxin Tai and Kui Jia and Guiliang Liu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
}