YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale


Takehiko Ohkawa1*   Jumpei Arima2*   Yuki Noguchi2   Masatoshi Tateno1,3   Makoto Sugiura1   Takuya Okubo1   Yuki Wakayama2   Naoaki Kanazawa1   Tatsuya Matsushima1,3   Yohishiro Okumatsu2†   Kei Ota1†

1AI Robot Association (AIRoA)   2Toyota Motor Corporation   3The University of Tokyo

ICRA 2026 Workshop on Beyond Teleoperation

*Equal contribution   †Equal supervision




Yielding Universal Bidigital Interface (YUBI). Our lightweight, finger-aligned gripper offers intuitive control by mirroring human digital kinematics for dexterous manipulation. Leveraging high-precision VR-based tracking, YUBI facilitates the curation of a large-scale, high-quality bimanual dataset to advance robotic foundation models.

Abstract

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data curation for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) have lowered the barrier for in-the-wild data collection, their bulky pistol-grip designs can present ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this limitation, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion, allowing the jaws to naturally follow the operator's grip. This intuitive interface bridges the gap between human intent and robotic execution, facilitating more precise fingertip motor control. Furthermore, compared with the SLAM-based tracking used in the original UMI, our system enhances the fidelity of 6-DoF gripper tracking via a rig-based operation setup integrated with VR systems. We validate the system's efficacy by providing an unprecedented UMI-based dataset, comprising 2730 hours of interaction data across 300K episodes and 40 distinct tasks. Our experiments demonstrate that YUBI offers advantages over the original UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. Collectively, the YUBI framework establishes a foundation for massive, high-fidelity data acquisition toward robotic foundation models.

YUBI Gripper Design

YUBI replaces the pistol-grip interface with a yielding, finger-aligned actuation design: the thumb drives one jaw and the index and middle fingers drive the other. Each jaw yields to its finger, so the aperture follows the operator's natural pinch without motor resistance, which improves haptic transparency and dexterity. An integrated support grip acts as a fulcrum for loads up to 2 kg, while a miniaturized build cuts handheld mass to ~319 g (from ~780–900 g), reducing wrist fatigue.

Exploded view of the YUBI gripper. The bidigital mechanism uses internal gears to actuate the jaws, supported by an ergonomic grip and flap. A fisheye camera is attached for task observation, and the Quest controller provides high-frequency 6-DoF trajectory tracking.

Operation Setup

A fixed desktop setup supports sustained bimanual operation. Each hand holds a YUBI device equipped with a wrist camera, a Quest controller, and a magnetic aperture encoder. VR-based gripper tracking uses the Quest 3S to obtain 6-DoF poses, which is more reliable than drift-prone SLAM. The headset is rig-mounted to avoid neck fatigue. A fixed RealSense D435 stereo camera adds a top-down workspace view for extra supervision, and a foot pedal enables hands-free sub-action annotation.
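As a rough illustration of how a tracked controller pose could be turned into a gripper pose, the sketch below applies a fixed rigid offset to a 6-DoF controller pose. The offset values, the (w, x, y, z) quaternion convention, and all function names are assumptions made for illustration. The page does not describe the actual calibration procedure.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), conj)
    return (rx, ry, rz)


def controller_to_gripper(ctrl_pos, ctrl_quat, offset_pos, offset_quat):
    """Compose a controller pose with a fixed controller-to-gripper offset.

    offset_pos and offset_quat are hypothetical calibration constants,
    expressed in the controller's own frame.
    """
    moved = rotate(ctrl_quat, offset_pos)
    grip_pos = tuple(p + d for p, d in zip(ctrl_pos, moved))
    grip_quat = quat_mul(ctrl_quat, offset_quat)
    return grip_pos, grip_quat


# Example: controller yawed 90 degrees about z, gripper tip 1 unit along controller x.
yaw90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
tip_pos, tip_quat = controller_to_gripper(
    (1.0, 2.0, 3.0), yaw90, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)
)
```

With a fixed offset like this, any tracking error in the controller pose carries over directly to the gripper pose, which is one reason low-drift tracking matters for this kind of setup.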

System overview. Bimanual YUBI-based demonstrations are collected from 104 operators at 20 desks in parallel. The setup features a stereo top-view camera for stable workspace observation, a rig-mounted VR system for 6-DoF gripper tracking, and foot-pedal-based action segmentation.
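To make the foot-pedal-based segmentation concrete, here is a minimal sketch that splits a recorded episode into sub-action segments at each pedal press. The function name, the half-open index ranges, and the rule that a press starts a new segment at the first frame at or after the press time are assumptions. The page does not specify how pedal events are aligned with frames.

```python
from bisect import bisect_left


def segment_by_pedal(frame_times, pedal_times):
    """Split frame indices into sub-action segments at each pedal press.

    frame_times: sorted frame timestamps in seconds.
    pedal_times: timestamps of pedal presses in seconds (any order).
    Returns a list of half-open (start_idx, end_idx) index ranges.
    """
    if not frame_times:
        return []
    n = len(frame_times)
    # Map each press to the first frame at or after it; drop duplicates.
    cuts = sorted({bisect_left(frame_times, t) for t in pedal_times})
    # Ignore presses that fall before the first frame or after the last one.
    bounds = [0] + [c for c in cuts if 0 < c < n] + [n]
    return list(zip(bounds[:-1], bounds[1:]))


# Example: 10 frames at 10 Hz with pedal presses at 0.25 s and 0.6 s.
frames = [i / 10 for i in range(10)]
segments = segment_by_pedal(frames, [0.25, 0.6])
```

Each returned range can then be paired with a sub-action label, so annotation happens during recording rather than in a separate labeling pass.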

Large-Scale Bimanual Dataset

Collected across 20 desks running 24/7 for one month by 104 operators, the dataset comprises 2730 hours across 300K episodes and 40 tasks, far larger than Fast-UMI (~60 h, 22 tasks) and the original UMI (12 h, 4 tasks). Tasks span seven domains (industrial, kitchen, toy, desk work, clothing, appliance, personal care) and six skill types (placement, assembly, insertion, deformation, sorting, writing), and most combine several skills in practice. All streams are converted to the LeRobot format at 30 Hz, with a detector cascade filtering defective episodes.
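The sketch below illustrates the two processing steps mentioned above: aligning an asynchronous stream to a uniform 30 Hz clock, and passing each episode through a cascade of detectors that rejects it at the first failure. It does not use the LeRobot API. The nearest-neighbor alignment, the two detectors, their thresholds, and the episode dictionary layout are all hypothetical stand-ins, since the page does not list the actual detectors.

```python
import math
from bisect import bisect_left

TARGET_HZ = 30       # output rate stated in the text
MIN_FRAMES = 30      # hypothetical: reject episodes shorter than 1 s
MAX_STEP_M = 0.2     # hypothetical: largest plausible per-frame motion in meters


def resample_nearest(times, values, start, end, hz=TARGET_HZ):
    """Sample a stream at uniform ticks in [start, end) using the nearest value in time."""
    if not times:
        raise ValueError("stream is empty")
    out = []
    for k in range(int(round((end - start) * hz))):
        t = start + k / hz
        i = bisect_left(times, t)
        # Step back when the previous sample is closer, or when t is past the end.
        if i == len(times) or (i > 0 and t - times[i - 1] <= times[i] - t):
            i -= 1
        out.append(values[i])
    return out


def too_short(episode):
    return len(episode["positions"]) < MIN_FRAMES


def pose_jump(episode):
    pos = episode["positions"]
    return any(math.dist(a, b) > MAX_STEP_M for a, b in zip(pos, pos[1:]))


# Cheap checks run first so most defective episodes are rejected early.
DETECTORS = [("too_short", too_short), ("pose_jump", pose_jump)]


def first_failure(episode, detectors=DETECTORS):
    """Return the name of the first detector that flags the episode, or None."""
    for name, detect in detectors:
        if detect(episode):
            return name
    return None
```

Returning the name of the failing detector, rather than a bare boolean, makes it easy to tally why episodes were dropped.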

Domain distribution (left) across seven categories and skill distribution (right) across six primary skill types of the 40 tasks.
Sample episodes from the YUBI dataset across representative tasks. (Placeholder videos; replace with final clips.)

Deployment Results

Placeholder section. Policies trained on the YUBI dataset will be deployed on real robots, with rollout videos shown below. (Content to be added.)

Deployment rollouts. (Placeholder videos; replace with final clips.)

BibTeX

@inproceedings{ohkawa2026yubi,
  author      = {Takehiko Ohkawa and Jumpei Arima and Yuki Noguchi and
                 Masatoshi Tateno and Makoto Sugiura and Takuya Okubo and
                 Yuki Wakayama and Naoaki Kanazawa and Tatsuya Matsushima and
                 Yohishiro Okumatsu and Kei Ota},
  title       = {{YUBI}: Yielding Universal Bidigital Interface for
                 Bimanual Dexterous Manipulation at Scale},
  booktitle   = {IEEE International Conference on Robotics and Automation
                 (ICRA) Workshop on Beyond Teleoperation},
  year        = {2026},
}