YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

Takehiko Ohkawa¹* Jumpei Arima²* Yuki Noguchi² Masatoshi Tateno¹,⁴ Makoto Sugiura¹ Takuya Okubo¹ Kengo Ikeuchi¹,⁴ Yuma Shin¹,⁵ Hiroki Nishizawa¹,⁶ Naoaki Kanazawa¹ Yuki Wakayama² Daiki Fukunaga² Koshi Makihara³ Tomohiro Motoda³ Floris Erich³ Yukiyasu Domae³ Tatsuya Matsushima¹,⁴ Yohishiro Okumatsu² Kei Ota¹

¹AI Robot Association (AIRoA) ²Toyota Motor Corporation ³National Institute of Advanced Industrial Science and Technology (AIST)
⁴The University of Tokyo ⁵Institute of Science Tokyo ⁶Waseda University

ICRA 2026 Workshop on Beyond Teleoperation

[Paper] [Hardware] [Software] [Dataset (Coming Soon!)]

*Project Co-Lead

Yielding Universal Bidigital Interface (YUBI). Our lightweight, finger-aligned gripper offers intuitive control by mirroring human digital kinematics for dexterous manipulation. Leveraging high-precision VR-based tracking, YUBI facilitates the curation of a large-scale, high-quality bimanual dataset to advance robotic foundation models.

Abstract

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

YUBI Gripper Design

YUBI replaces the pistol-grip interface with a yielding, finger-aligned actuation design. The thumb drives one jaw and the index and middle fingers drive the other, so each jaw yields to its finger and the aperture follows the operator's natural pinch without motor resistance, which improves haptic transparency and dexterity. An integrated support grip acts as a fulcrum for loads up to 2 kg, and the miniaturized build cuts handheld mass to ~319 g (200 g gripper + 119 g VR controller), down from ~780–900 g, which reduces wrist fatigue.
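The mass figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the paragraph; the function names are illustrative and not part of any released YUBI software.

```python
# Masses in grams, taken from the paragraph above.
GRIPPER_G = 200
CONTROLLER_G = 119
BASELINE_RANGE_G = (780, 900)  # the quoted "down from" range


def handheld_mass(gripper_g: float, controller_g: float) -> float:
    """Total mass the operator holds in one hand."""
    return gripper_g + controller_g


def reduction_pct(new_g: float, old_g: float) -> float:
    """Relative mass reduction, in percent."""
    return 100.0 * (1.0 - new_g / old_g)


total_g = handheld_mass(GRIPPER_G, CONTROLLER_G)
low_pct, high_pct = (reduction_pct(total_g, b) for b in BASELINE_RANGE_G)

print(f"handheld mass: {total_g} g")
print(f"reduction vs. quoted range: {low_pct:.1f}% to {high_pct:.1f}%")
```

Under these numbers the handheld mass is 319 g, a reduction of roughly 59% to 65% relative to the quoted range.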

Overview and exploded view of the YUBI gripper. The bidigital mechanism uses internal gears to actuate the jaws, supported by an ergonomic grip and flap. A fisheye camera is attached for task observation, and the Quest controller provides high-frequency 6 DoF trajectory tracking.

Operation Setup

YUBI supports two complementary data-collection modes: a stationary tabletop rig and a portable handheld configuration. The stationary setup, shown below, is our primary configuration and accounts for the majority of the collected data. Each hand holds a YUBI device with a wrist camera, Quest controller, and magnetic aperture encoder. The 6 DoF gripper poses come from Quest 3S VR tracking, which is more reliable than drift-prone SLAM. The headset is rig-mounted to avoid neck fatigue. A fixed RealSense D435 stereo camera adds a top-down workspace view for extra supervision, and a foot pedal enables hands-free sub-action annotation.

Stationary setup. Bimanual YUBI-based demonstrations are collected at 22 desks in parallel from 179 operators. The setup features a stereo top-view camera for stable workspace observation, a rig-mounted VR system for 6 DoF gripper tracking, and foot-pedal-based action segmentation.

The portable mode keeps the gripper fully self-contained, dropping the fixed rig, headset stand, and external cameras so data can be collected in the wild. This unlocks whole-body and mobile scenarios such as household chores or tasks spanning multiple rooms, while preserving the same finger-aligned control and on-board 6 DoF tracking, with an egocentric camera supplying the workspace view.

Portable setup. A fully self-contained YUBI configuration for in-the-wild collection with whole-body motion, requiring no fixed rig, headset mount, or external cameras.

Large-Scale Bimanual Dataset

Collected across 22 desks running 24/7 for over two months by 179 operators (125 male, 54 female), the dataset comprises 8434 hours across 1.20M episodes and 119 tasks, far larger than Fast-UMI (~60 h, 22 tasks) and the original UMI (12 h, 4 tasks). Tasks span seven domains (industrial, kitchen, toy, desk work, clothing, appliance, personal care) and six skill types (placement, insertion, sorting, assembly, deformation, tool use), and most combine several skills in practice. All streams are converted to the LeRobot format at 30 Hz, with a detector cascade filtering defective episodes.
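The text states only that all streams are converted to 30 Hz; it does not say how the resampling is done. As a hedged illustration, the sketch below linearly interpolates one scalar stream (for example, a gripper aperture reading) onto a uniform 30 Hz clock. The function name, the 100 Hz input rate, and the sample values are assumptions made for this example. Orientation streams would need rotation-aware interpolation, which is not shown.

```python
from bisect import bisect_left


def resample_linear(timestamps, values, rate_hz=30.0):
    """Linearly interpolate a scalar stream onto a uniform clock.

    timestamps: strictly increasing times in seconds.
    values: one scalar per timestamp.
    Returns samples at t0, t0 + 1/rate_hz, ... up to the last timestamp.
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two matching samples")
    t0, t_end = timestamps[0], timestamps[-1]
    n = int((t_end - t0) * rate_hz + 1e-9) + 1
    out = []
    for k in range(n):
        t = t0 + k / rate_hz
        i = bisect_left(timestamps, t)
        if i == 0:
            out.append(values[0])       # at or before the first sample
        elif i >= len(timestamps):
            out.append(values[-1])      # past the last sample (rounding)
        else:
            ta, tb = timestamps[i - 1], timestamps[i]
            w = (t - ta) / (tb - ta)
            out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out


# Hypothetical input: one second of aperture readings at 100 Hz,
# opening linearly from 0 mm to 80 mm.
encoder_t = [i / 100 for i in range(101)]
encoder_mm = [80.0 * t for t in encoder_t]

aperture_30hz = resample_linear(encoder_t, encoder_mm, rate_hz=30.0)
print(len(aperture_30hz), "samples at 30 Hz")
```

Resampling every stream onto one shared clock means that each frame of an episode holds an image, a pose, and an aperture value referring to the same instant.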

Domain distribution (left) across seven categories and skill distribution (right) across six primary skill types of the 119 tasks.
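For a sense of scale, the headline numbers in this section imply the following derived quantities. This is a back-of-the-envelope sketch based only on the figures quoted above. The episode count is the rounded "1.20M", so the results are approximate, and the per-task and per-operator values are plain averages, not reported statistics.

```python
HOURS = 8434
EPISODES = 1_200_000   # "1.20M" as quoted, so a rounded value
TASKS = 119
OPERATORS = 179
FPS = 30
FAST_UMI_HOURS = 60    # quoted as "~60 h"
UMI_HOURS = 12

total_seconds = HOURS * 3600
mean_episode_s = total_seconds / EPISODES
frames_per_stream = total_seconds * FPS
episodes_per_task = EPISODES / TASKS
hours_per_operator = HOURS / OPERATORS

print(f"mean episode length: {mean_episode_s:.1f} s")
print(f"frames per stream at {FPS} Hz: {frames_per_stream:,}")
print(f"average episodes per task: {episodes_per_task:,.0f}")
print(f"average hours per operator: {hours_per_operator:.1f}")
print(f"scale vs. Fast-UMI: {HOURS / FAST_UMI_HOURS:.0f}x")
print(f"scale vs. UMI: {HOURS / UMI_HOURS:.0f}x")
```

With these inputs the mean episode lasts about 25 s, and the dataset is roughly 141 times larger than Fast-UMI and 703 times larger than the original UMI dataset by recorded hours.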

Usability Study

We recruited a gender-balanced group of 10 operators with no prior experience using either UMI or YUBI, and compared the two devices on a dexterity test and an operational efficiency test.

Dexterity test. Operators ran single-attempt pick-and-place of six hex nuts (M10–M3, largest to smallest). Both devices approach the ceiling on large nuts (≥94% at M8–M10), but diverge as the diameter shrinks: YUBI leads UMI by +20 and +10 pp at M6 and M5, and by roughly 3× at the smallest M3 nut, indicating substantially better precision under tight tolerances.

Operational efficiency test. Operators performed five tasks under three conditions (direct hand, UMI, and YUBI) with counterbalanced ordering. YUBI is consistently faster than UMI, with per-task speed-ups from 1.37× (domino arrangement) to 4.19× (phone charging), substantially narrowing the gap to direct hand operation even for precision tasks.
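The text does not specify how the condition order was counterbalanced. One common scheme, shown here purely as an illustration, is a cyclic Latin square: each condition appears once in every position across the rows, and operators are assigned rows in rotation. The function names are hypothetical. With 10 operators and 3 conditions, the balance can only be approximate.

```python
from collections import Counter

CONDITIONS = ("direct hand", "UMI", "YUBI")


def latin_square_orders(conditions):
    """Cyclic Latin square: row r is the condition list rotated by r."""
    n = len(conditions)
    return [tuple(conditions[(r + c) % n] for c in range(n)) for r in range(n)]


def assign_orders(num_operators, conditions=CONDITIONS):
    """Give operator i the row i mod n of the Latin square."""
    rows = latin_square_orders(conditions)
    return [rows[i % len(rows)] for i in range(num_operators)]


orders = assign_orders(10)
first_counts = Counter(order[0] for order in orders)

for i, order in enumerate(orders, start=1):
    print(f"operator {i}: {' -> '.join(order)}")
print("first-condition counts:", dict(first_counts))
```

Rotating the order this way spreads practice and fatigue effects across the three conditions instead of letting them accumulate on whichever device is always used last.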

Task UI

During collection, operators work from a web-based task console that shows live multi-view camera feeds (left-arm, top-view, and right-arm cameras), the current task with its ordered sub-task checklist, real-time 6 DoF tracking, and the recording state. A foot pedal starts and stops episode recording and segments sub-actions hands-free, so the operator can keep both hands on the YUBI grippers. This console is provided by yubi-app, the open-source software stack we released. The demo below walks through a typical recording session.
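The pedal behavior described above can be pictured as a small state machine. The sketch below is hypothetical and is not the yubi-app implementation. It assumes, for illustration only, that a long press toggles recording and a short press marks the boundary between consecutive sub-tasks. The class name, the method names, and the sub-task labels are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PedalSegmenter:
    """Turns pedal presses into labeled (sub-task, start, end) segments."""

    subtasks: List[str]
    recording: bool = False
    segments: List[Tuple[str, float, float]] = field(default_factory=list)
    _index: int = 0
    _seg_start: Optional[float] = None

    def __post_init__(self) -> None:
        if not self.subtasks:
            raise ValueError("at least one sub-task is required")

    def long_press(self, t: float) -> None:
        """Assumed mapping: a long press starts or stops an episode."""
        if not self.recording:
            self.recording = True
            self.segments = []
            self._index = 0
            self._seg_start = t
        else:
            self._close_segment(t)  # the final sub-task ends with the episode
            self.recording = False

    def short_press(self, t: float) -> None:
        """Assumed mapping: a short press advances to the next sub-task."""
        if self.recording and self._index < len(self.subtasks) - 1:
            self._close_segment(t)

    def _close_segment(self, t: float) -> None:
        self.segments.append((self.subtasks[self._index], self._seg_start, t))
        self._seg_start = t
        self._index += 1


# Hypothetical two-step checklist and press times (seconds).
pedal = PedalSegmenter(["pick up tape", "place tape in box"])
pedal.long_press(0.0)    # start recording
pedal.short_press(4.2)   # first sub-task done
pedal.long_press(9.0)    # stop recording
print(pedal.segments)
```

Keeping the segment boundaries as timestamps, not frame indices, lets them be matched to any stream after resampling.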

Operation Examples

Using YUBI, operators collect a wide range of everyday bimanual tasks spanning multiple domains (industrial, kitchen, desk work, clothing, personal care) and skills (placement, insertion, deformation, tool use). A few representative examples are shown below.

Operation examples. Representative bimanual tasks collected with YUBI across everyday domains and skills.

Robot Policy Deployment

To test whether the YUBI dataset translates into real-world capability, we train a multi-task vision-language-action policy (π_0.5-based) on YUBI's wrist data and deploy it across three bimanual robot platforms (UR, Franka, and Toyota's semi-humanoid ELEY), each fitted with the YUBI gripper as a common end-effector. Because the policy is trained on the gripper end-effector trajectory rather than robot-specific joint space, a single dataset transfers across kinematically distinct arms without retargeting. A single multi-task policy carries out a range of complex bimanual tasks, from placing a ball in a basket and stacking cups to unfolding glasses, picking and placing socks, and taping a box, across all three robots. This confirms that end-effector-space supervision from YUBI transfers across embodiments and generalizes to dexterous bimanual manipulation.
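To illustrate why end-effector-space supervision is independent of the arm, the sketch below expresses a demonstrated motion as a relative gripper pose change and replays it from a different starting pose. The text says only that the policy is trained on the gripper end-effector trajectory, so the relative-action parameterization, the function names, and the example poses are assumptions made for this illustration. Each robot would still need its own inverse kinematics to reach the resulting target pose.

```python
import math


def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def matvec3(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relative_action(pose_t, pose_next):
    """Pose change from t to t+1, expressed in the gripper frame at t."""
    (R_t, p_t), (R_n, p_n) = pose_t, pose_next
    R_t_inv = transpose3(R_t)  # the inverse of a rotation is its transpose
    delta = [p_n[i] - p_t[i] for i in range(3)]
    return matmul3(R_t_inv, R_n), matvec3(R_t_inv, delta)


def apply_action(pose, action):
    """Apply a gripper-frame relative action to any current gripper pose."""
    (R, p), (dR, dp) = pose, action
    moved = matvec3(R, dp)
    return matmul3(R, dR), [p[i] + moved[i] for i in range(3)]


# Hypothetical consecutive gripper poses from a handheld demonstration.
demo_t = (rot_z(0.30), [0.40, 0.10, 0.20])
demo_next = (rot_z(0.35), [0.42, 0.10, 0.21])
action = relative_action(demo_t, demo_next)

# The same action replayed from a different (hypothetical) robot gripper pose.
robot_pose = (rot_z(1.00), [0.00, 0.50, 0.30])
robot_target = apply_action(robot_pose, action)

# Sanity check: replaying from the demo pose recovers the next demo pose.
replayed = apply_action(demo_t, action)
```

Nothing in `action` refers to joint angles or link lengths, which is the property that lets the same supervision drive kinematically different arms that share the gripper.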

UR — Ball in basket

UR — Stack cup pyramid

UR — Unfold glasses

Franka — Pick-and-place socks

Franka — Tape in box

ELEY — Cup placement

Robot deployment rollouts. A single multi-task policy trained on YUBI data, deployed across three robots and six tasks with the YUBI gripper as a shared end-effector.

BibTeX

@techreport{ohkawa2026yubi,
  author      = {Takehiko Ohkawa and Jumpei Arima and Yuki Noguchi and
                 Masatoshi Tateno and Makoto Sugiura and Takuya Okubo and
                 Kengo Ikeuchi and Yuma Shin and Hiroki Nishizawa and
                 Naoaki Kanazawa and Yuki Wakayama and Daiki Fukunaga and
                 Koshi Makihara and Tomohiro Motoda and Floris Erich and
                 Yukiyasu Domae and Tatsuya Matsushima and Yohishiro Okumatsu and
                 Kei Ota},
  title       = {{YUBI}: Yielding Universal Bidigital Interface for
                 Bimanual Dexterous Manipulation at Scale},
  year        = {2026},
}

Acknowledgments

This work is based on results obtained from a project, JPNP25015, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).