DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

Sixu Lin1,2 Yunpeng Qing3 Litao Liu4 Ming Zhou5 Ruixing Jin1 Xiaoyi Fan2 Guiliang Liu1,6,†
1 School of Data Science, The Chinese University of Hong Kong (Shenzhen) 2 Jiangxing Intelligence Technology Inc. 3 Zhejiang University 4 Rutgers University-New Brunswick 5 Shanghai AI Laboratory 6 Shenzhen Loop Area Institute
† Corresponding author: Guiliang Liu, liuguiliang@cuhk.edu.cn

Pipeline

Pipeline figure from the DyGRO-VLA paper.
Figure 1: Method pipeline. DyGRO-VLA follows a two-stage training recipe. In the offline stage, the VLA backbone is trained to predict actions while learning a compact latent representation via an information-bottleneck objective. In the online stage, the VLA backbone is frozen and the residual MoE is optimized in multi-task settings as a residual compensation module to improve multi-task ability and generalization.

Abstract

Recent progress in Reinforcement Learning (RL) has shown strong potential for enhancing the fine-grained manipulation capabilities of Vision-Language-Action (VLA) models. However, existing RL fine-tuning methods are typically task-specific, which can disrupt the original cross-task representations learned by generalist VLA models and hinder their scalability across diverse manipulation tasks. In this work, we propose DyGRO-VLA (Dynamic Grouped Residual Optimization), a novel two-stage framework for cross-task RL fine-tuning of VLA models. In the first stage, we introduce an information-theoretic objective to learn compact and task-relevant latent representations that preserve cross-task knowledge. In the second stage, we design a Mixture-of-RL-Residuals (MoRR) module, which dynamically routes tasks to specialized residual policy experts based on learned task embeddings. This enables efficient online policy optimization while mitigating negative transfer across tasks. Extensive experiments on LIBERO, RoboTwin2, and real-world robotic manipulation tasks demonstrate that DyGRO-VLA consistently improves multi-task success rates and generalization performance compared to strong VLA and RL fine-tuning baselines.
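As a rough illustration of the second stage, the sketch below shows one way a frozen base action could be combined with a gated sum of residual experts selected from a task embedding. It is not the authors' implementation. The class name `ResidualMoE`, the function `act`, the linear experts, the softmax (soft) routing, and all dimensions are assumptions made for this example; the paper's MoRR module may use a different router, expert architecture, or RL objective.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


class ResidualMoE:
    """Hypothetical residual mixture-of-experts head (illustrative only)."""

    def __init__(self, embed_dim, latent_dim, action_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Router: maps a task embedding to one logit per expert (assumed linear).
        self.router_w = rng.normal(scale=0.1, size=(num_experts, embed_dim))
        # Experts: each maps the latent to an action-sized residual (assumed linear).
        self.expert_w = rng.normal(
            scale=0.01, size=(num_experts, action_dim, latent_dim)
        )

    def gate(self, task_embedding):
        """Soft routing weights over experts; non-negative and summing to 1."""
        return softmax(self.router_w @ task_embedding)

    def residual(self, task_embedding, latent):
        """Gate-weighted sum of the per-expert residual actions."""
        weights = self.gate(task_embedding)   # shape: (num_experts,)
        per_expert = self.expert_w @ latent   # shape: (num_experts, action_dim)
        return weights @ per_expert           # shape: (action_dim,)


def act(base_action, moe, task_embedding, latent, scale=1.0):
    """Final action = frozen backbone action + scaled residual correction."""
    return base_action + scale * moe.residual(task_embedding, latent)


# Toy usage with made-up sizes: 7-DoF action, 16-D latent, 8-D task embedding.
moe = ResidualMoE(embed_dim=8, latent_dim=16, action_dim=7, num_experts=4)
rng = np.random.default_rng(1)
task_embedding = rng.normal(size=8)
latent = rng.normal(size=16)
base_action = np.zeros(7)  # stands in for the frozen VLA backbone's output
action = act(base_action, moe, task_embedding, latent)
```

In this sketch only `router_w` and `expert_w` would be trained online, which mirrors the idea of keeping the backbone frozen so that its cross-task representation is left untouched while the residual experts absorb task-specific corrections.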

LIBERO Evaluation Videos

LIBERO-Spatial pick up the black bowl between the plate and the ramekin and place it on the plate

LIBERO-Spatial pick up the black bowl on the wooden cabinet and place it on the plate

LIBERO-Object pick up the alphabet soup and place it in the basket

LIBERO-Object pick up the orange juice and place it in the basket

LIBERO-Goal open the middle drawer of the cabinet

LIBERO-Goal put the wine bottle on the rack

LIBERO-10 put the white mug on the plate and put the chocolate pudding to the right of the plate

LIBERO-10 put both moka pots on the stove

RoboTwin Videos

Pick Dual Bottles

Simulation Pick Dual Bottles

Real-world Pick Dual Bottles

Place Empty Cup

Simulation Place Empty Cup

Real-world Place Empty Cup

Stack Bowls Two

Simulation Stack Bowls Two

Real-world Stack Bowls Two

Beat Block Hammer

Simulation Beat Block Hammer

Real-world Beat Block Hammer

Conclusion

We investigate the scalability challenge of RL post-training for vision-language-action models, where multi-task online optimization can induce cross-task interference and catastrophic forgetting. To address this, we propose DyGRO-VLA, a two-stage framework that learns task-shared representations from offline demonstrations and refines behavior online via dynamically routed residual RL experts. Across LIBERO, RoboTwin2, and real-world Sim2Real experiments, DyGRO-VLA improves multi-task performance and robustness over strong baselines, with notable gains on challenging tasks. These results highlight the value of preserving shared representations and using dynamic residual modularization for scalable VLA post-training. An important direction for future work is extending DyGRO-VLA to mobile manipulation tasks that involve locomotion before object manipulation.