
We introduce the Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data curation for bimanual dexterous manipulation. While handheld data collection systems such as the Universal Manipulation Interface (UMI) have lowered the barrier for in-the-wild data collection, their bulky pistol-grip designs can present ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this limitation, YUBI adopts a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion, allowing the jaws to naturally follow the operator's grip. This interface narrows the gap between human intent and robotic execution and facilitates more precise fingertip motor control. Furthermore, compared to the SLAM-based tracking used in the original UMI, our system improves the fidelity of 6-DoF gripper tracking via a rig-based operation setup integrated with VR systems. We validate the system's efficacy by collecting a UMI-based dataset of unprecedented scale, comprising 2730 hours of interaction data across 300K episodes and 40 distinct tasks. Our experiments demonstrate that YUBI offers advantages over the original UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. Collectively, the YUBI framework establishes a foundation for massive, high-fidelity data acquisition toward robotic foundation models.
YUBI shifts away from the conventional pistol-grip interface of prior UMI grippers toward a yielding, finger-aligned actuation design. One jaw is actuated by the thumb while the opposing jaw is driven by the coordinated motion of the index and middle fingers, so each jaw yields directly to its driving finger. The gripper aperture therefore follows the operator's natural pinch motion without motor-driven resistance, mitigating the control mismatch and poor haptic transparency of pistol-grip designs and enabling operators to directly leverage their inherent dexterity. To preserve precision while supporting loads of up to 2 kg, an integrated support grip serves as a mechanical fulcrum and the remaining fingers stabilize the grip to distribute load. A miniaturized gripper architecture with a lightweight camera module reduces the handheld mass to ~319 g (200 g gripper + 119 g controller), down from ~780–900 g in prior UMI/VR designs, alleviating wrist fatigue over long-duration sessions.
We design a fixed desktop operation setup tailored for sustained bimanual manipulation. Each hand grasps one YUBI device equipped with an onboard wrist camera, a Quest controller, and a magnetic encoder that measures the gripper aperture. For VR-based gripper tracking, a Quest 3S headset tracks the 6-DoF trajectory of the controller mounted on each YUBI, yielding higher fidelity than drift-prone SLAM. Rather than being worn on the head as in typical VR setups, the heavy headset is mounted on the fixed rig, which reduces neck fatigue while maintaining tracking coverage. A rigidly mounted RealSense D435 stereo camera provides a stable top-down view of the workspace for additional supervision signals (object tracking, fine-grained action annotation). A laptop-based task UI aggregates all sensor streams, and a foot pedal lets operators annotate task transitions and sub-action boundaries hands-free.
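The foot-pedal annotation can be illustrated with a minimal sketch. It assumes that each pedal press is logged as a timestamp in seconds and that each press marks one boundary between consecutive sub-actions; the source does not specify the logging format, so the function names `segments_from_pedal` and `frame_segment_ids` and this press-equals-boundary convention are hypothetical.

```python
from bisect import bisect_right


def segments_from_pedal(press_times, start, end):
    """Split the recording [start, end] into consecutive (begin, finish) segments.

    Assumption: every pedal press strictly inside the recording is one boundary.
    Presses outside the recording are ignored.
    """
    cuts = sorted(t for t in press_times if start < t < end)
    edges = [start, *cuts, end]
    return list(zip(edges, edges[1:]))


def frame_segment_ids(n_frames, fps, press_times):
    """Label each frame with the index of the segment it falls in.

    Frame i is taken to occur at time i / fps. A frame that coincides
    exactly with a press is assigned to the segment that starts there.
    """
    cuts = sorted(press_times)
    return [bisect_right(cuts, i / fps) for i in range(n_frames)]


# Two presses during an 8-second recording give three sub-action segments.
print(segments_from_pedal([2.0, 5.5], start=0.0, end=8.0))
```

Storing boundaries as timestamps rather than frame indices keeps the annotation independent of the frame rate chosen later for the sensor streams.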
We collected YUBI-based manipulation data at scale across 20 desks, operated 24/7 over one month by 104 operators (73 male, 31 female). The resulting dataset comprises 2730 hours of interaction across 300K episodes and 40 distinct tasks, making it substantially larger than prior UMI-based datasets such as Fast-UMI (~60 hours, 22 tasks) and the original UMI (12 hours, 4 tasks). The tasks span seven domains (industrial, kitchen, toy, desk work, clothing, appliance, personal care) and six primary skill types (placement, assembly, insertion, deformation, sorting, writing), reflecting YUBI's target scope of precise, heavy, and everyday object handling. Most tasks combine multiple skills in practice. For example, "writing on a whiteboard" requires pick-and-place of the marker, tactile-sensitive writing and erasing, and inserting the cap. All sensor streams are converted to the LeRobot format at 30 Hz, and a cascade of detectors filters defective episodes (too-short recordings, stuck pose/aperture signals, kinematically implausible jumps).
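For scale, 2730 hours over 300K episodes averages to roughly 33 seconds, or about 980 frames at 30 Hz, per episode. The detector cascade can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: every threshold is a made-up value, positions are assumed to be (x, y, z) tuples in metres, orientation is ignored for brevity, and the function name `first_failed_check` is hypothetical.

```python
import math

FPS = 30  # the source states that streams are converted at 30 Hz

# All thresholds below are illustrative assumptions, not values from the source.
MIN_FRAMES = 1 * FPS      # shorter episodes count as "too short"
STUCK_FRAMES = 5 * FPS    # a signal frozen for this many frames counts as "stuck"
POS_EPS_M = 1e-4          # per-frame position change treated as zero (metres)
APERTURE_EPS_M = 1e-5     # per-frame aperture change treated as zero (metres)
MAX_SPEED_M_S = 3.0       # faster motion is treated as a tracking jump


def longest_static_run(steps, eps):
    """Length of the longest run of consecutive per-frame changes below eps."""
    longest = run = 0
    for step in steps:
        run = run + 1 if step < eps else 0
        longest = max(longest, run)
    return longest


def position_steps(positions):
    """Euclidean distance between consecutive (x, y, z) samples."""
    return [math.dist(a, b) for a, b in zip(positions, positions[1:])]


def aperture_steps(aperture):
    """Absolute change between consecutive aperture samples."""
    return [abs(b - a) for a, b in zip(aperture, aperture[1:])]


def first_failed_check(positions, aperture):
    """Run the checks in order; return the first failure name, or None."""
    if len(positions) < MIN_FRAMES:
        return "too_short"
    pos_steps = position_steps(positions)
    if longest_static_run(pos_steps, POS_EPS_M) >= STUCK_FRAMES:
        return "stuck_pose"
    if longest_static_run(aperture_steps(aperture), APERTURE_EPS_M) >= STUCK_FRAMES:
        return "stuck_aperture"
    if any(step * FPS > MAX_SPEED_M_S for step in pos_steps):
        return "implausible_jump"
    return None
```

Ordering the checks from cheapest to most specific lets an episode be rejected as early as possible, and returning the name of the failed check makes it easy to count how many episodes each detector removes.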
Placeholder section. Policies trained on the YUBI dataset will be deployed on real robots, with rollout videos shown below. (Content to be added.)
@inproceedings{ohkawa2026yubi,
author = {Takehiko Ohkawa and Jumpei Arima and Yuki Noguchi and
Masatoshi Tateno and Makoto Sugiura and Takuya Okubo and
Yuki Wakayama and Naoaki Kanazawa and Tatsuya Matsushima and
Yohishiro Okumatsu and Kei Ota},
title = {{YUBI}: Yielding Universal Bidigital Interface for
Bimanual Dexterous Manipulation at Scale},
booktitle = {IEEE International Conference on Robotics and Automation
(ICRA) Workshop on Beyond Teleoperation},
year = {2026},
}