The Dexterous Revolution: How Robots Are Learning to Use Their Hands

A founder’s guide to the architectures, datasets, and open questions shaping the next frontier of robot manipulation.

For decades, robots could pick and place. They could weld, sort, and move with precision — but only in the narrow corridors engineers carved out for them. Anything resembling a human hand, with its 27 degrees of freedom and extraordinary sensorimotor intelligence, remained stubbornly out of reach. That is changing. Rapidly. The convergence of large-scale simulation, diffusion-based policy learning, and rich manipulation datasets has produced a crop of architectures that can grasp arbitrary objects, adapt to cluttered scenes, and begin to generalize across tasks in ways that would have seemed implausible three years ago.

01 — STATE OF RESEARCH

Where Dexterous Manipulation Stands Today

Dexterous manipulation (controlling a multi-fingered robotic hand to grasp, reorient, and manipulate objects with human-like adaptability) sits at the intersection of three fields: robotics, computer vision, and machine learning. Publications tagged “dexterous manipulation” on Scopus and IEEE Xplore have roughly tripled since 2020, a pace accelerated by breakthroughs in simulation fidelity and the emergence of diffusion models as policy representations.

The field has historically split into two problems: grasp synthesis (generating a target hand pose for an object) and grasp execution (actually moving there with a closed-loop controller). Most early systems solved these independently and sequentially, which made them brittle. Modern approaches increasingly collapse this gap, learning end-to-end policies that produce trajectories, not just poses.

88.8% · Success rate on 3,200+ unseen objects (ResDex, ICLR 2025)

9.3B · Annotated grasps in the latest real-world dataset (GraspClutter6D)

90.7% · Zero-shot sim-to-real transfer success (DexGraspNet 2.0, CoRL 2024)

Three paradigms dominate the current architectural landscape: reinforcement learning with compositional policy structures, diffusion-based imitation learning with 3D representations, and interactive imitation learning with human-in-the-loop feedback. Each offers a different trade-off between data efficiency, generalization, and deployment readiness.

“The question is no longer whether robots can learn dexterous skills; it is whether they can learn them cheaply enough, quickly enough, and generally enough to matter outside the lab.”

What has changed most dramatically is the scale and quality of available data. Just three years ago, the largest dexterous grasping datasets contained tens of thousands of annotated grasps. Today’s benchmarks count in the billions, and real-world cluttered-scene datasets have finally begun to match the complexity that practitioners actually encounter in warehouses, kitchens, and operating rooms.

02 — PROMINENT ARCHITECTURES

Three Architectures Defining the Field

The following three papers represent the most consequential architectural directions in dexterous manipulation published in the 2024–2025 window. They are not simply incremental — each challenges a foundational assumption about how dexterous policies should be designed.

Architecture · ICLR 2025

ResDex: Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping

Huang, Yuan, Fu & Lu — ICLR 2025 · arXiv:2410.02475

ResDex attacks a persistent bottleneck in multi-task RL for dexterous hands: learning a single policy that generalizes across thousands of geometrically diverse objects without expensive curriculum design. The key insight is elegantly compositional. Rather than training one monolithic policy, ResDex maintains a set of geometry-agnostic base policies, each mastered independently on individual objects and surprisingly portable to unseen ones. A mixture-of-experts (MoE) gating network then learns to blend these base policies, while a residual action head corrects for the inevitable gap between what the experts collectively predict and what the actual object geometry demands.

This residual formulation is the crux. By decomposing the policy into a shared prior (the expert mixture) and an object-specific correction (the residual), ResDex sidesteps the optimization challenges that defeat naive multi-task RL on thousands of objects. The result is a policy trained across 3,200 objects on a single GPU in under 12 hours, with no measurable generalization gap to novel categories.

88.8% success · 3,200 objects · 12 hrs / single GPU
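
The action composition described above (a gated blend of base-policy actions plus a residual correction) can be sketched in a few lines. This is a minimal illustration, not ResDex's implementation: the function names, the two-dimensional toy actions, and the `residual_scale` parameter are assumptions made for the example, and in the real system the gate logits and residual would come from a learned network rather than being passed in by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compose_action(expert_actions, gate_logits, residual, residual_scale=0.1):
    """Blend expert actions with softmax gate weights, then add a scaled residual.

    expert_actions: one action vector per base policy (hypothetical inputs).
    gate_logits:    one unnormalized score per base policy.
    residual:       correction vector with the same length as an action.
    """
    weights = softmax(gate_logits)
    blended = [
        sum(w * action[i] for w, action in zip(weights, expert_actions))
        for i in range(len(residual))
    ]
    return [b + residual_scale * r for b, r in zip(blended, residual)]

# Toy usage: two experts that disagree, equal gate scores, small correction.
expert_actions = [[1.0, 0.0], [0.0, 1.0]]
action = compose_action(expert_actions, gate_logits=[0.0, 0.0], residual=[0.5, -0.5])
print(action)
```

Keeping the residual small relative to the blended action is one plausible way to preserve the shared prior while still allowing object-specific adjustment; the scale used here is arbitrary.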

Architecture · RSS 2024

3D Diffusion Policy (DP3): Generalizable Visuomotor Policy Learning via Simple 3D Representations

Ze, Zhang, Zhang, Hu, Wang & Xu — RSS 2024 · arXiv:2403.03954

If ResDex is a story about RL efficiency, DP3 is a story about what happens when you give a diffusion model the right visual input. The central claim is provocative in its simplicity: replace the 2D image encoders used in prior visuomotor policies with a compact 3D point cloud representation, encoded by a lightweight MLP, and the improvement in both accuracy and generalization is substantial.

Diffusion policies had already proven their ability to model multimodal action distributions in manipulation, capturing the “which of several valid motions” ambiguity that simpler Gaussian policies collapse. What DP3 adds is spatial grounding — the policy conditions on a representation that explicitly encodes object geometry in three dimensions, making it robust to viewpoint changes, lighting variation, and novel object instances that defeat 2D methods. Across 72 simulated tasks and 4 real-world evaluations with an Allegro hand, DP3 achieves 85% real-robot success with just 40 demonstrations per task — a remarkable data efficiency figure for high-DOF manipulation.

85% real-robot success · 72 sim tasks · 40 demos per task
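
To make the "compact 3D representation" idea concrete, here is a rough sketch of a point cloud encoder: a small per-point layer followed by max pooling, which yields one fixed-size feature regardless of point order. The pooling choice, the layer sizes, and the hand-picked `WEIGHTS` and `BIAS` values are assumptions for illustration, not DP3's published architecture. In a full policy, a feature like this would condition the diffusion action head, which is omitted here.

```python
def linear(vec, weights, bias):
    """Apply one dense layer: weights is a list of rows, one per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def encode_point_cloud(points, weights, bias):
    """Per-point layer with ReLU, then channel-wise max over all points."""
    per_point = [[max(0.0, h) for h in linear(p, weights, bias)] for p in points]
    return [max(channel) for channel in zip(*per_point)]

# Hypothetical fixed parameters: 3 input coordinates mapped to 4 channels.
WEIGHTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]
BIAS = [0.0, 0.0, 0.0, 0.5]

cloud = [[0.1, 0.2, 0.3], [0.4, 0.0, 0.1], [0.0, 0.5, 0.2]]
feature = encode_point_cloud(cloud, WEIGHTS, BIAS)
print(len(feature), feature)
```

Because the max is taken over points, shuffling the cloud leaves the feature unchanged, which is one reason this style of encoder suits unordered depth-camera output.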

Survey · Frontiers 2025

Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives

Frontiers in Robotics and AI · 2025 · DOI: 10.3389/frobt.2025.1682437

This comprehensive survey maps the emerging paradigm of Interactive Imitation Learning (IIL) — a class of approaches where human feedback is woven into the policy learning loop rather than collected upfront in a static dataset. The distinction matters enormously for dexterous tasks, where covariate shift (the policy encountering states it never saw in demonstrations) is a persistent failure mode.

IIL methods range from simple human corrections and interventions during rollout, to preference-based feedback, to reward shaping from human observation. The survey offers the most rigorous taxonomy of these approaches to date, situating diffusion policies, transformer-based architectures, and RL-from-human-feedback methods within a unified framework. Crucially, it identifies the open tensions: interactive methods improve data efficiency but require real hardware and human availability — friction that must be reduced before these techniques scale to production deployments.

Comprehensive survey · Diffusion + RL + IIL taxonomy · 2025
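
A toy version of the intervention loop helps show why interactive feedback targets covariate shift. The sketch below is in the spirit of intervention-based methods such as HG-DAgger, not a method taken from the survey: the one-dimensional state, the linear learner, the `expert_action` stand-in for the human, and the intervention threshold are all assumptions chosen to keep the example short.

```python
def expert_action(x):
    """Stand-in for the human supervisor: push the state halfway toward zero."""
    return -0.5 * x

def fit_gain(dataset):
    """Least-squares fit of action = gain * state over the collected corrections."""
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den if den > 0 else 0.0

def run_interactive_rounds(initial_gain, rounds=5, steps=20, threshold=1.0):
    """Roll out the learner, let the human take over in risky states, then refit."""
    gain = initial_gain
    dataset = []
    for _ in range(rounds):
        x = 2.0  # every rollout starts away from the goal at zero
        for _ in range(steps):
            action = gain * x
            if abs(x) > threshold:
                # Human intervention: override the learner and record the correction.
                action = expert_action(x)
                dataset.append((x, action))
            x = x + action
        if dataset:
            gain = fit_gain(dataset)
    return gain, len(dataset)

gain, corrections = run_interactive_rounds(initial_gain=0.2)
print(gain, corrections)
```

The learner starts with a gain that drives the state away from the goal, so its own rollouts visit states a static demonstration set might never contain. Corrections are recorded exactly in those states, which is the property that distinguishes interactive collection from upfront demonstration.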

Architectural Comparison

| Architecture | Paradigm | Key Strength | Data Requirement | Sim-to-Real |
| --- | --- | --- | --- | --- |
| ResDex | RL | Universal generalization, training efficiency | Sim only (no human demos) | Demonstrated |
| DP3 | Imitation Learning | Sample efficiency, spatial generalization | ~40 demos/task | Strong (85% real) |
| IIL (Survey) | Interactive IL | Covariate shift correction, human priors | Few demos + corrections | Deployment focus |

03 — Dexterous Datasets

The Data Infrastructure Behind the Progress

Policy learning scales with data quality as much as architectural ingenuity. For most of the past decade, dexterous manipulation lagged parallel-jaw grasping not because of a shortage of algorithms, but a shortage of data — particularly diverse, high-quality, real-world data at the scale that modern deep learning requires. The three datasets below represent the current frontier.

Dataset · RA-L 2025

GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes

Back et al. — IEEE Robotics and Automation Letters 2025 · arXiv:2504.06866

GraspClutter6D is the most realistic large-scale grasping benchmark yet released. Its defining characteristic is clutter density: the 1,000 scenes average 14.1 objects per scene with 62.6% occlusion — numbers that reflect actual bin-picking and warehouse environments, not the simplified tabletop arrangements that dominate most benchmarks. All data is captured in the real world (not simulation) using four RGB-D cameras, providing 736K annotated 6D object poses and 9.3 billion feasible grasp annotations across 52K images.

The benchmark evaluations are sobering: state-of-the-art methods trained on existing datasets degrade significantly in GraspClutter6D scenes, while models trained on this data show clear improvement. The dataset spans 200 objects across 75 environment configurations — bins, shelves, and tables — making it the first benchmark that realistically exercises perception systems under the occlusion and viewpoint variation of deployed systems.

Real-world · 1,000 scenes · 9.3B grasps · 62.6% occlusion

Dataset · CoRL 2024

DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes

Zhang, Liu et al. — CoRL 2024 · pku-epic.github.io/DexGraspNet2.0

DexGraspNet 2.0 is the synthetic counterpart to GraspClutter6D, and its scale is staggering. It covers 1,319 objects across 8,270 cluttered scenes with 427 million annotated grasps. That is more than 300 times the grasp count of its predecessor, the original DexGraspNet (1.32M grasps on 5,355 objects), which targeted single objects rather than cluttered scenes. The key methodological contribution is a diffusion model conditioned on local geometry that learns to generate collision-free, physically plausible grasps at scale.

What makes DexGraspNet 2.0 particularly consequential is its sim-to-real transfer story: using depth-based point cloud inputs and a test-time depth restoration technique, the method achieves 90.7% real-world success without any real-world training data. This is the zero-shot sim-to-real result the field has been building toward, and it sets a new baseline for synthetic dataset utility.

Synthetic · 427M grasps · 8,270 scenes · 90.7% zero-shot real
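
The scale comparison with the original DexGraspNet is easy to check from the figures quoted in this article. The short calculation below uses only those numbers; the variable names are illustrative.

```python
import math

# Figures as quoted in this article.
grasps_v1 = 1.32e6   # original DexGraspNet
grasps_v2 = 427e6    # DexGraspNet 2.0
scenes_v2 = 8270

ratio = grasps_v2 / grasps_v1            # how many times more grasps
orders_of_magnitude = math.log10(ratio)  # the same ratio on a log scale
grasps_per_scene = grasps_v2 / scenes_v2

print(round(ratio, 1), round(orders_of_magnitude, 2), round(grasps_per_scene))
```

The ratio comes out a little above 323, or roughly two and a half orders of magnitude in grasp count, with an average of about 51,600 annotated grasps per scene.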

Dataset · ICLR 2025

VTDexManip: A Dataset and Benchmark for Visual-Tactile Pretraining and Dexterous Manipulation

Liu, Cui et al. — ICLR 2025 · arXiv:2406.09657

VTDexManip fills a gap that most datasets ignore: the tactile modality. While vision-based manipulation has benefited from years of large-scale dataset curation, tactile sensing — arguably even more important for fine-grained dexterous tasks like cap-turning, faucet-screwing, and in-hand reorientation — has lacked a corresponding benchmark. VTDexManip changes this by providing the first visual-tactile dataset collected from human demonstrations, covering 10 task categories across 182 objects with 20 force sensors on the dexterous hand. The dataset is paired with a benchmark that evaluates pretrained and non-pretrained representations (CLIP, R3M, MVP, Voltron, ResNet18) on six complex manipulation tasks in simulation, providing a rigorous apples-to-apples comparison of how different visual and tactile pretraining strategies transfer to dexterous skill learning. For startups building systems that need to handle contact-rich tasks — assembly, screwing, compliant manipulation — this is required reading.

Visual-tactile · 182 objects · 20 force sensors · 6 benchmark tasks

Dataset Landscape at a Glance

| Dataset | Type | Scale | Key Differentiator |
| --- | --- | --- | --- |
| GraspClutter6D | Real-world | 9.3B grasps, 1K scenes | Highest real-world clutter density (14.1 obj/scene) |
| DexGraspNet 2.0 | Synthetic | 427M grasps, 8.27K scenes | Best zero-shot sim-to-real (90.7%) |
| VTDexManip | Real (human demo) | 182 objects, 6 tasks | First visual-tactile benchmark for dexterous manipulation |

Note: The original DexGraspNet (ICRA 2023; 1.32M grasps, 5,355 objects) remains the standard RL training benchmark; DexGraspNet 2.0 extends this to scene-level cluttered grasping. GraspNet-1Billion (8.9 objects/scene, 35.2% occlusion) is the dominant prior baseline that GraspClutter6D supersedes in realism.

04 — Future Research Directions

What Comes Next

Despite remarkable progress, dexterous manipulation remains a fundamentally unsolved problem for general deployment. The gap between benchmark success rates and reliable real-world performance is narrowing, but it has not closed. Here are the five research directions where the most consequential advances are likely to emerge over the next two to three years.

01
Tactile Intelligence and Multimodal Sensing

Vision alone is insufficient for many dexterous tasks. VTDexManip has opened the benchmark infrastructure; what follows is a push toward learned representations that fuse tactile, proprioceptive, and visual signals in a unified policy. The key technical challenge is heterogeneous sensor fusion: tactile sensors produce high-frequency, spatially localized contact maps that are fundamentally unlike image data. Architectures that handle this gracefully — perhaps through cross-attention between modalities — are an open research problem with high commercial value in assembly, medical devices, and food handling.
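
As one possible shape for the cross-attention idea mentioned above, the sketch below lets visual tokens query tactile tokens with scaled dot-product attention. It is a bare illustration under stated assumptions: there are no learned projection matrices, the token values are made up, and nothing here is drawn from VTDexManip or any specific published fusion architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Hypothetical tokens: two visual queries attend over three tactile contacts.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
tactile_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tactile_values = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

fused = cross_attention(visual_tokens, tactile_keys, tactile_values)
print(fused)
```

The appeal of this pattern for heterogeneous sensors is that the two modalities can keep different token counts and update rates, since only the key and query dimensions need to match.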

02
Scaling Sim-to-Real Transfer Beyond Grasping

DexGraspNet 2.0's 90.7% zero-shot sim-to-real result is striking, but it applies to a relatively clean task: pick up an object. Extending this fidelity to in-hand manipulation, tool use, and bimanual coordination requires simulation environments that accurately model contact dynamics, deformable objects, and the subtle friction forces that make or break a grasp. The next generation of physics simulators — and the policies trained on them — will need to close the dynamics gap that still limits transfer for contact-rich, high-DOF tasks.

03
Foundation Models for Manipulation

The NLP community's playbook — pretrain at scale, fine-tune cheaply — is being ported to robotics. Early examples like RT-2 and π0 show that vision-language-action models can acquire broad manipulation priors from internet-scale data, then specialize to specific tasks with minimal demonstrations. For dexterous hands specifically, the challenge is that most robot video data features parallel grippers, not multi-fingered hands — a domain gap that requires targeted data collection strategies and cross-embodiment transfer methods that are still nascent.

04
Long-Horizon Dexterous Planning

Current architectures excel at single-grasp tasks: pick this object, reorient this object. Real-world tasks — assembling a product, preparing food, organizing a cluttered space — require sequences of dexterous actions over extended horizons. The challenge is compounding: error propagates, objects move, and the state space expands dramatically. Hierarchical architectures that separate high-level task planning from low-level dexterous control, combined with recovery behaviors when sub-tasks fail, are a critical missing layer.
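
The separation between task-level planning and low-level skills, with recovery on failure, can be sketched as a small executor. Everything in this example is hypothetical: the skill names, the `make_flaky_skill` helper that simulates a skill failing a fixed number of times, and the retry limit are stand-ins for real controllers and a real planner.

```python
def make_flaky_skill(failures_before_success):
    """Build a simulated low-level skill that fails a set number of times first."""
    remaining = [failures_before_success]

    def skill():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return skill

def run_plan(plan, skills, recover, max_retries=2):
    """Execute subtasks in order; after each failure, run recovery and retry."""
    log = []
    for subtask in plan:
        for attempt in range(max_retries + 1):
            if skills[subtask]():
                log.append((subtask, attempt, "ok"))
                break
            log.append((subtask, attempt, "failed"))
            recover(subtask)
        else:
            # Retries exhausted: stop so a higher-level planner can replan.
            return False, log
    return True, log

recoveries = []
skills = {
    "pick": make_flaky_skill(0),
    "reorient": make_flaky_skill(1),
    "insert": make_flaky_skill(0),
}
succeeded, log = run_plan(["pick", "reorient", "insert"], skills, recoveries.append)
print(succeeded, log, recoveries)
```

Returning the log alongside the success flag gives the high-level layer enough information to decide whether to replan, which is the kind of hook a long-horizon system would need.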

05
Hardware-Software Co-Design

The best algorithm cannot compensate for a hand that lacks the degrees of freedom, actuation bandwidth, or sensor coverage required for a task. The most interesting work at this frontier treats hardware design as a learnable variable — using differentiable simulation to co-optimize hand morphology and control policy jointly. Coupled with the increasing availability of low-cost dexterous hands (LEAP Hand, Shadow Hand variants, custom tendon-driven designs), this opens a design space that startups are well-positioned to explore.

05 — Perspective

Why This Moment Matters

Dexterous manipulation has been “five years away” for most of the past thirty years. What is different now is not a single breakthrough but an infrastructure shift: simulation quality has improved to the point where billion-scale data generation is tractable; diffusion models have given us policy representations expressive enough to model the multimodal, contact-rich structure of manipulation; and real-world benchmarks like GraspClutter6D are finally honest about the gap between lab performance and deployment reality.

The architectures covered here — ResDex’s compositional RL, DP3’s 3D-grounded diffusion, and IIL’s human-in-the-loop learning — are not competing solutions. They are complementary tools, each suited to different phases of deployment: RL for scaling without human data, diffusion imitation for fast task acquisition from demonstrations, and interactive learning for continuous improvement in production. The most capable manipulation systems of the near future will likely use all three.

For those of us building at this frontier: the data infrastructure is finally catching up to the algorithmic ambition. The next constraint is not benchmark scores — it is the gap between what a robot can do in a controlled setting and what it can do reliably in the messy, variable, unstructured world where it actually needs to work. That gap is closable. The research is pointing the way.

References

[1] Huang Z., Yuan H., Fu Y., Lu Z. (2025). Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping. ICLR 2025. arXiv:2410.02475

[2] Ze Y., Zhang G., Zhang K., Hu C., Wang M., Xu H. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. RSS 2024. arXiv:2403.03954

[3] Freiberg et al. (2025). Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives — A Survey. Frontiers in Robotics and AI. doi:10.3389/frobt.2025.1682437

[4] Back S. et al. (2025). GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes. RA-L 2025. arXiv:2504.06866

[5] Zhang J., Liu H. et al. (2024). DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes. CoRL 2024. pku-epic.github.io/DexGraspNet2.0

[6] Liu Q., Cui Y. et al. (2025). VTDexManip: A Dataset and Benchmark for Visual-Tactile Pretraining and Dexterous Manipulation with Reinforcement Learning. ICLR 2025. arXiv:2406.09657

About Us

Cerebel builds dexterous AI workforces that bring human-like precision to physical industries.
