GVLA: Gripper-aware Vision Language Action Models

🎉 Accepted at ECCV 2026 🎉

Hanyi Zhang¹, Zihong Luo¹, Tianyu Li¹, Khang Nguyen², Basu Hela³, Shreyas Kumar³, Ngoc Duy Tran³, Feng Dai⁴, Charith Munasinghe⁵, Jorge Peña Queralta⁵, Giovanni Toffetti⁵, Khoa Vo⁶, Ngan Le⁶, Ravi Prakash³, Quan Vuong⁷, Tung D. Ta⁴, Long Hu⁸, Anh Nguyen¹, Baoru Huang¹
¹University of Liverpool, ²Mohamed bin Zayed University of Artificial Intelligence, ³Indian Institute of Science, ⁴The University of Tokyo, ⁵Zürcher Hochschule für Angewandte Wissenschaften (ZHAW), ⁶University of Arkansas, ⁷Physical Intelligence, ⁸Huazhong University of Science and Technology
GVLA and MiGA overview showing multi-gripper demonstrations in simulation and real-world settings

Abstract

Vision language action models (VLAs) have advanced general-purpose robotic grasping and manipulation by enabling robots to interpret visual observations and natural language instructions and generate executable action sequences. However, existing VLAs often implicitly assume gripper invariance, even though grasping strategies are inherently embodiment-dependent. Different gripper types, such as parallel-jaw and suction, usually require distinct interaction strategies to achieve the same grasping objective. Moreover, current datasets for VLAs predominantly rely on parallel-jaw grippers, which limits gripper-aware learning. To address this gap, we introduce MiGA, a multi-gripper-aware dataset spanning five distinct gripper types across multiple robots with 103,000 demonstrations, explicitly capturing strategy divergence under shared task objectives. We further propose GVLA, which combines a new multi-gripper tokenizer with adapter-based policy routing. Our new gripper encoding induces structured embedding information that balances parameter sharing and strategy differentiation, while layer-wise probing confirms meaningful gripper-conditioned representations for VLAs. Extensive experiments in both simulation and on real-world robots show that GVLA outperforms current baselines across the evaluated settings. Our method also improves zero-shot generalization to new objects and few-shot adaptation to unseen tasks, and enables more efficient gripper adaptation.

MiGA Dataset

MiGA is a multi-gripper-aware dataset with 103K demonstrations across 36 tasks, 5 gripper types, and both simulation and real-world robot setups. It captures how identical task objectives require different contact choices, approach directions, and execution strategies across gripper embodiments.

103K Trajectories
5 Gripper types
36 Tasks
Real + Sim Domains
Gripper-specific strategy variations across MiGA task categories
Gripper-specific strategy variation across singulated, stacked, constrained, and semantic tasks.
MiGA dataset statistics
MiGA balances task categories and gripper coverage across the dataset.

GVLA Method

GVLA conditions a VLA backbone on gripper embodiment through multi-gripper tokenization and a Dual Mixture-of-Adapters. Platform-, type-, and instance-level gripper tokens provide structured conditioning, while gripper and platform adapter pools route computation toward embodiment-specific policies.

Multi-gripper tokenization: Encodes robot platform, gripper type, and gripper instance as learnable tokens.
Dual Mixture-of-Adapters: Routes action generation through platform- and gripper-aware adapter experts.
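
To make the three-level tokenization concrete, the following is a minimal Python sketch, not the GVLA implementation. It assumes each level (platform, gripper type, gripper instance) has its own lookup table of learnable vectors and that the three resulting tokens are prepended to the vision-language token sequence. The class names, the embedding width, the vocabulary entries `other_arm` and `generic_suction_cup`, and the prepending step are all hypothetical choices made for illustration.

```python
import random

EMBED_DIM = 8  # hypothetical embedding width, chosen only for this sketch


class TokenTable:
    """Lookup table holding one vector per vocabulary entry.

    In a real model these vectors would be trained; here they are
    only randomly initialised.
    """

    def __init__(self, names, dim, rng):
        self.index = {name: i for i, name in enumerate(names)}
        self.weights = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in names]

    def __call__(self, name):
        return self.weights[self.index[name]]


class MultiGripperTokenizer:
    """Hypothetical tokenizer with one table per level of the hierarchy."""

    def __init__(self, platforms, gripper_types, instances, dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        self.platform = TokenTable(platforms, dim, rng)
        self.gripper_type = TokenTable(gripper_types, dim, rng)
        self.instance = TokenTable(instances, dim, rng)

    def encode(self, platform, gripper_type, instance):
        # Return the three tokens ordered from coarse (platform) to fine (instance).
        return [
            self.platform(platform),
            self.gripper_type(gripper_type),
            self.instance(instance),
        ]


tokenizer = MultiGripperTokenizer(
    platforms=["ur5", "other_arm"],                         # "other_arm" is a placeholder
    gripper_types=["parallel_jaw", "suction"],
    instances=["robotiq_2f_85", "generic_suction_cup"],     # second entry is a placeholder
)

gripper_tokens = tokenizer.encode("ur5", "parallel_jaw", "robotiq_2f_85")

# Assumption: gripper tokens are prepended to the vision-language tokens.
vision_language_tokens = [[0.0] * EMBED_DIM for _ in range(4)]
sequence = gripper_tokens + vision_language_tokens
```

In this sketch, two grippers mounted on the same platform reuse the same platform vector while keeping separate type and instance vectors, which is one simple way to share parameters at the coarse level and differentiate at the fine level.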
GVLA architecture overview
GVLA injects gripper-aware tokens and adapter routing into the VLA action-generation pipeline.
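
The adapter routing can be sketched in a similar spirit. The example below is an illustrative assumption rather than the authors' design: it models each adapter as a small bottleneck layer, keeps one pool of adapters for platforms and one for grippers, and mixes the experts in each pool with softmax gates computed from a dot product between a conditioning token and a per-expert key. The gating rule, the residual sum, and every name and size in the code are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class Adapter:
    """Bottleneck adapter: project down to `rank`, apply ReLU, project back up."""

    def __init__(self, dim, rank, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(dim)]

    def __call__(self, hidden):
        low = [max(0.0, dot(row, hidden)) for row in self.down]
        return [dot(row, low) for row in self.up]


class AdapterPool:
    """A pool of adapter experts with soft routing (assumed gating rule)."""

    def __init__(self, num_experts, dim, rank, rng):
        self.experts = [Adapter(dim, rank, rng) for _ in range(num_experts)]
        self.keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]

    def route(self, token):
        # One gate per expert; gates are non-negative and sum to 1.
        return softmax([dot(key, token) for key in self.keys])

    def __call__(self, hidden, token):
        gates = self.route(token)
        outputs = [expert(hidden) for expert in self.experts]
        return [
            sum(g * out[i] for g, out in zip(gates, outputs))
            for i in range(len(hidden))
        ]


class DualMixtureOfAdapters:
    """Adds platform-routed and gripper-routed adapter outputs to the hidden state."""

    def __init__(self, dim=8, rank=2, num_experts=3, seed=0):
        rng = random.Random(seed)
        self.platform_pool = AdapterPool(num_experts, dim, rank, rng)
        self.gripper_pool = AdapterPool(num_experts, dim, rank, rng)

    def __call__(self, hidden, platform_token, gripper_token):
        from_platform = self.platform_pool(hidden, platform_token)
        from_gripper = self.gripper_pool(hidden, gripper_token)
        return [h + p + g for h, p, g in zip(hidden, from_platform, from_gripper)]


layer = DualMixtureOfAdapters()
hidden = [0.1 * i for i in range(8)]           # stand-in for a backbone hidden state
platform_token = [1.0] + [0.0] * 7             # stand-in conditioning tokens
gripper_token = [0.0, 1.0] + [0.0] * 6

updated = layer(hidden, platform_token, gripper_token)
gates = layer.gripper_pool.route(gripper_token)
```

Keeping the two pools separate means that, under these assumptions, a change of gripper on the same robot only alters the gripper-side gates, while the platform-side contribution is unchanged.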

Results

Across simulation and real-world validation, GVLA improves gripper-aware manipulation, zero-shot object generalization, and few-shot adaptation to new tasks or grippers.

Baseline comparison across GVLA task categories
GVLA outperforms gripper-agnostic baselines across task categories.
Cross-object generalization success rates
GVLA maintains stronger zero-shot transfer across unseen objects and grippers.
Task, gripper, and mixed-data adaptation results
Gripper-aware conditioning enables more efficient few-shot task and gripper adaptation.
Real-world robotic validation setup and results
Real-world validation shows improved adaptation on a UR5 robot with a Robotiq 2F-85 gripper.

BibTeX

@inproceedings{zhang2026gvla,
      title={Gripper-aware Vision Language Action Models},
      author={Zhang, Hanyi and Luo, Zihong and Li, Tianyu and Nguyen, Khang and Hela, Basu and Kumar, Shreyas and Tran, Ngoc Duy and Dai, Feng and Munasinghe, Charith and Pe{\~n}a Queralta, Jorge and Toffetti, Giovanni and Vo, Khoa and Le, Ngan and Prakash, Ravi and Vuong, Quan and Ta, Tung D. and Hu, Long and Nguyen, Anh and Huang, Baoru},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2026},
      organization={Springer}
    }