Dexterous manipulation, the control of a multi-fingered robotic hand to grasp, reorient, and manipulate objects with human-like adaptability, sits at the intersection of three fields: robotics, computer vision, and machine learning. Publications tagged “dexterous manipulation” on Scopus and IEEE Xplore have roughly tripled since 2020, a pace accelerated by breakthroughs in simulation fidelity and the emergence of diffusion models as policy representations.
The field has historically split into two problems: grasp synthesis (generating a target hand pose for an object) and grasp execution (actually moving there with a closed-loop controller). Most early systems solved these independently and sequentially, which made them brittle. Modern approaches increasingly collapse this gap, learning end-to-end policies that produce trajectories, not just poses.
88.8% · Success rate on 3,200+ unseen objects (ResDex, ICLR 2025)
9.3B · Annotated grasps in the latest real-world dataset (GraspClutter6D)
90.7% · Zero-shot sim-to-real transfer success (DexGraspNet 2.0, CoRL 2024)
Three paradigms dominate the current architectural landscape: reinforcement learning with compositional policy structures, diffusion-based imitation learning with 3D representations, and interactive imitation learning with human-in-the-loop feedback. Each offers a different trade-off between data efficiency, generalization, and deployment readiness.
“The question is no longer whether robots can learn dexterous skills; it is whether they can learn them cheaply enough, quickly enough, and generally enough to matter outside the lab.”
What has changed most dramatically is the scale and quality of available data. Just three years ago, the largest dexterous grasping datasets contained tens of thousands of annotated grasps. Today’s benchmarks count in the billions, and real-world cluttered-scene datasets have finally begun to match the complexity that practitioners actually encounter in warehouses, kitchens, and operating rooms.
ResDex attacks a persistent bottleneck in multi-task RL for dexterous hands: learning a single policy that generalizes across thousands of geometrically diverse objects without expensive curriculum design. The key insight is elegantly compositional. Rather than training one monolithic policy, ResDex maintains a set of geometry-agnostic base policies, each mastered independently on individual objects and surprisingly portable to unseen ones. A mixture-of-experts (MoE) gating network then learns to blend these base policies, while a residual action head corrects for the inevitable gap between what the experts collectively predict and what the actual object geometry demands.
This residual formulation is the crux. By decomposing the policy into a shared prior (the expert mixture) and an object-specific correction (the residual), ResDex sidesteps the optimization challenges that defeat naive multi-task RL on thousands of objects. The result is a policy trained across 3,200 objects on a single GPU in under 12 hours, with no measurable generalization gap to novel categories.
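The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the ResDex implementation: the function names, the softmax gating, and the additive residual are assumptions, and in a real system the gate logits and residual would come from a learned network conditioned on the observation rather than being fixed numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compose_action(base_actions, gate_logits, residual):
    """Blend base-policy actions with gate weights, then add a residual correction."""
    weights = softmax(gate_logits)
    dim = len(residual)
    blended = [sum(w * a[i] for w, a in zip(weights, base_actions))
               for i in range(dim)]
    return [b + r for b, r in zip(blended, residual)]

# Hypothetical values: three frozen base policies proposing 2-DOF actions.
base_actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_logits = [0.0, 0.0, 0.0]   # equal logits give uniform mixture weights
residual = [0.1, -0.1]          # object-specific correction (assumed additive)
action = compose_action(base_actions, gate_logits, residual)
```

With uniform gating the blended action is the mean of the three proposals, and the residual shifts it slightly. The point of the structure is that the residual only has to learn a small correction on top of a shared prior.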
88.8% success · 3,200 objects · 12 hrs / single GPU
If ResDex is a story about RL efficiency, DP3 is a story about what happens when you give a diffusion model the right visual input. The central claim is provocative in its simplicity: replace the 2D image encoders used in prior visuomotor policies with a compact 3D point cloud representation, encoded by a lightweight MLP, and the improvement in both accuracy and generalization is substantial.
Diffusion policies had already proven their ability to model multimodal action distributions in manipulation, capturing the “which of several valid motions” ambiguity that simpler Gaussian policies collapse. What DP3 adds is spatial grounding — the policy conditions on a representation that explicitly encodes object geometry in three dimensions, making it robust to viewpoint changes, lighting variation, and novel object instances that defeat 2D methods. Across 72 simulated tasks and 4 real-world evaluations with an Allegro hand, DP3 achieves 85% real-robot success with just 40 demonstrations per task — a remarkable data efficiency figure for high-DOF manipulation.
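To make "a compact 3D point cloud representation, encoded by a lightweight MLP" concrete, here is a rough sketch of a point-set encoder: the same small MLP is applied to every point, and the per-point features are max-pooled into one vector. The layer sizes, random weights, and the max-pooling step are assumptions for illustration and are not taken from the DP3 paper. The example checks only order-independence, which any point-set encoder needs; it does not demonstrate viewpoint robustness.

```python
import random

def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def encode_point_cloud(points, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every point, then max-pool over points."""
    feats = [linear(relu(linear(p, W1, b1)), W2, b2) for p in points]
    return [max(f[j] for f in feats) for j in range(len(b2))]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 3-D points -> 8 hidden units -> 4-D embedding.
W1, b1 = rand_matrix(8, 3), [0.0] * 8
W2, b2 = rand_matrix(4, 8), [0.0] * 4
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]

z = encode_point_cloud(cloud, W1, b1, W2, b2)

# Reordering the points must not change the embedding.
shuffled = cloud[:]
rng.shuffle(shuffled)
z_shuffled = encode_point_cloud(shuffled, W1, b1, W2, b2)
```

In a diffusion policy, a vector like `z` would be one of the conditioning inputs to the denoising network that produces the action sequence.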
85% real-robot success · 72 sim tasks · 40 demos per task
The survey by Freiberg et al. (2025) maps the emerging paradigm of Interactive Imitation Learning (IIL), a class of approaches where human feedback is woven into the policy learning loop rather than collected upfront in a static dataset. The distinction matters enormously for dexterous tasks, where covariate shift (the policy encountering states it never saw in demonstrations) is a persistent failure mode.
IIL methods range from simple human corrections and interventions during rollout, to preference-based feedback, to reward shaping from human observation. The survey offers the most rigorous taxonomy of these approaches to date, situating diffusion policies, transformer-based architectures, and RL-from-human-feedback methods within a unified framework. Crucially, it identifies the open tensions: interactive methods improve data efficiency but require real hardware and human availability — friction that must be reduced before these techniques scale to production deployments.
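The simplest member of this family, corrections during rollout, can be illustrated with a toy loop. This sketch is not a method from the survey: the 1-D state, the scripted `expert_action` standing in for a human, the linear learner, and the disagreement threshold are all hypothetical. What it shows is the structural difference from static imitation: labels are collected on states reached while the learner is in the loop.

```python
def expert_action(state):
    # Hypothetical stand-in for the human: drive the state toward zero.
    return -0.5 * state

def fit_gain(dataset):
    """Least-squares fit of a linear policy a = k * s."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, a in dataset)
    return num / den if den else 0.0

def interactive_round(gain, start_state, steps, tolerance):
    """Roll out the learner; record a correction wherever it disagrees with the expert."""
    corrections = []
    state = start_state
    for _ in range(steps):
        proposed = gain * state
        target = expert_action(state)
        if abs(proposed - target) > tolerance:
            corrections.append((state, target))
            proposed = target  # the human takes over for this step
        state = state + proposed
    return corrections

dataset = []
gain = 0.0  # untrained learner
for _ in range(3):
    dataset += interactive_round(gain, start_state=4.0, steps=5, tolerance=0.05)
    gain = fit_gain(dataset)
```

In this toy setting the first round triggers a correction at every step, and later rounds trigger none because the refit policy already matches the expert. Real systems are far noisier, which is where the hardware and human-availability costs noted above come in.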
Comprehensive survey · Diffusion + RL + IIL taxonomy · 2025
| Architecture | Paradigm | Key Strength | Data Requirement | Sim-to-Real |
|---|---|---|---|---|
| ResDex | RL | Universal generalization, training efficiency | Sim only (no human demos) | Demonstrated |
| DP3 | Imitation Learning | Sample efficiency, spatial generalization | ~40 demos/task | Strong (85% real) |
| IIL (Survey) | Interactive IL | Covariate shift correction, human priors | Few demos + corrections | Deployment focus |
Policy learning scales with data quality as much as architectural ingenuity. For most of the past decade, dexterous manipulation lagged parallel-jaw grasping not because of a shortage of algorithms, but a shortage of data — particularly diverse, high-quality, real-world data at the scale that modern deep learning requires. The three datasets below represent the current frontier.
GraspClutter6D is the most realistic large-scale grasping benchmark yet released. Its defining characteristic is clutter density: the 1,000 scenes average 14.1 objects per scene with 62.6% occlusion — numbers that reflect actual bin-picking and warehouse environments, not the simplified tabletop arrangements that dominate most benchmarks. All data is captured in the real world (not simulation) using four RGB-D cameras, providing 736K annotated 6D object poses and 9.3 billion feasible grasp annotations across 52K images.
The benchmark evaluations are sobering: state-of-the-art methods trained on existing datasets degrade significantly in GraspClutter6D scenes, while models trained on this data show clear improvement. The dataset spans 200 objects across 75 environment configurations — bins, shelves, and tables — making it the first benchmark that realistically exercises perception systems under the occlusion and viewpoint variation of deployed systems.
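A quick back-of-the-envelope check ties the headline figures together. The inputs below are the rounded values quoted above, so the derived ratios are approximate; treating every image as annotating the full scene is an assumption.

```python
# Rounded figures as quoted in the text.
scenes = 1_000
images = 52_000
poses = 736_000
grasps = 9.3e9

images_per_scene = images / scenes   # views captured per scene
poses_per_image = poses / images     # annotated objects visible per image
grasps_per_image = grasps / images   # grasp annotations per image

summary = (
    f"{images_per_scene:.0f} images/scene, "
    f"{poses_per_image:.1f} poses/image, "
    f"{grasps_per_image:,.0f} grasps/image"
)
```

The poses-per-image ratio comes out at roughly 14, in line with the reported average of 14.1 objects per scene, and each image carries on the order of 180 thousand grasp annotations.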
Real-world · 1,000 scenes · 9.3B grasps · 62.6% occlusion
Synthetic · 427M grasps · 8,270 scenes · 90.7% zero-shot real
Visual-tactile · 182 objects · 20 force sensors · 6 benchmark tasks
| Dataset | Type | Scale | Key Differentiator |
|---|---|---|---|
| GraspClutter6D | Real-world | 9.3B grasps, 1K scenes | Highest real-world clutter density (14.1 obj/scene) |
| DexGraspNet 2.0 | Synthetic | 427M grasps, 8.27K scenes | Best zero-shot sim-to-real (90.7%) |
| VTDexManip | Real (human demo) | 182 objects, 6 tasks | First visual-tactile benchmark for dexterous manipulation |
Despite remarkable progress, dexterous manipulation remains a fundamentally unsolved problem for general deployment. The gap between benchmark success rates and reliable real-world performance is narrowing, but it has not closed. Here are the five research directions where the most consequential advances are likely to emerge over the next two to three years.
Vision alone is insufficient for many dexterous tasks. VTDexManip has opened the benchmark infrastructure; what follows is a push toward learned representations that fuse tactile, proprioceptive, and visual signals in a unified policy. The key technical challenge is heterogeneous sensor fusion: tactile sensors produce high-frequency, spatially localized contact maps that are fundamentally unlike image data. Architectures that handle this gracefully — perhaps through cross-attention between modalities — are an open research problem with high commercial value in assembly, medical devices, and food handling.
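As one possible reading of "cross-attention between modalities", the sketch below lets visual tokens query tactile tokens with single-head scaled dot-product attention. It is an illustration under stated assumptions rather than a proposed architecture: the token features are made up, and the learned query, key, and value projections that a real model would use are omitted for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical features: 2 visual tokens query 3 tactile tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
tactile_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
tactile_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused = cross_attention(visual_tokens, tactile_keys, tactile_values)
```

Because the values here are one-hot, each row of `fused` is simply the attention weights, so it is easy to see that each visual token pulls mostly from the tactile token whose key it matches. The two modalities can keep different token counts and sampling rates, which is part of the appeal for heterogeneous sensors.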
DexGraspNet 2.0's 90.7% zero-shot sim-to-real result is striking, but it applies to a relatively clean task: pick up an object. Extending this fidelity to in-hand manipulation, tool use, and bimanual coordination requires simulation environments that accurately model contact dynamics, deformable objects, and the subtle friction forces that make or break a grasp. The next generation of physics simulators — and the policies trained on them — will need to close the dynamics gap that still limits transfer for contact-rich, high-DOF tasks.
The NLP community's playbook — pretrain at scale, fine-tune cheaply — is being ported to robotics. Early examples like RT-2 and π0 show that vision-language-action models can acquire broad manipulation priors from internet-scale data, then specialize to specific tasks with minimal demonstrations. For dexterous hands specifically, the challenge is that most robot video data features parallel grippers, not multi-fingered hands — a domain gap that requires targeted data collection strategies and cross-embodiment transfer methods that are still nascent.
Current architectures excel at single-grasp tasks: pick this object, reorient this object. Real-world tasks — assembling a product, preparing food, organizing a cluttered space — require sequences of dexterous actions over extended horizons. The challenge is compounding: error propagates, objects move, and the state space expands dramatically. Hierarchical architectures that separate high-level task planning from low-level dexterous control, combined with recovery behaviors when sub-tasks fail, are a critical missing layer.
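The separation between planning and control with recovery can be sketched as a small executor. Everything here is hypothetical: the skill names, the retry budget, and the convention that a skill returns `True` on success. In practice each entry in `skills` would wrap a learned low-level policy and a success detector.

```python
def run_task(plan, skills, max_retries=2):
    """Execute sub-tasks in order; on failure, run a recovery skill and retry."""
    log = []
    for step in plan:
        for attempt in range(max_retries + 1):
            ok = skills[step](attempt)
            log.append((step, attempt, ok))
            if ok:
                break
            skills["recover"](attempt)  # e.g. regrasp or back off before retrying
        else:
            # Retries exhausted: hand control back to the high-level planner.
            return False, log
    return True, log

# Hypothetical skills: "insert" fails on its first attempt, then succeeds.
skills = {
    "grasp": lambda attempt: True,
    "reorient": lambda attempt: True,
    "insert": lambda attempt: attempt >= 1,
    "recover": lambda attempt: True,
}

success, log = run_task(["grasp", "reorient", "insert"], skills)
```

The log makes the compounding-error problem visible: without the retry branch, one failed sub-task would end the whole sequence, and the chance of that grows with the length of the plan.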
The best algorithm cannot compensate for a hand that lacks the degrees of freedom, actuation bandwidth, or sensor coverage required for a task. The most interesting work at this frontier treats hardware design as a learnable variable — using differentiable simulation to co-optimize hand morphology and control policy jointly. Coupled with the increasing availability of low-cost dexterous hands (LEAP Hand, Shadow Hand variants, custom tendon-driven designs), this opens a design space that startups are well-positioned to explore.
Dexterous manipulation has been “five years away” for most of the past thirty years. What is different now is not a single breakthrough but an infrastructure shift: simulation quality has improved to the point where billion-scale data generation is tractable; diffusion models have given us policy representations expressive enough to model the multimodal, contact-rich structure of manipulation; and real-world benchmarks like GraspClutter6D are finally honest about the gap between lab performance and deployment reality.
The architectures covered here — ResDex’s compositional RL, DP3’s 3D-grounded diffusion, and IIL’s human-in-the-loop learning — are not competing solutions. They are complementary tools, each suited to different phases of deployment: RL for scaling without human data, diffusion imitation for fast task acquisition from demonstrations, and interactive learning for continuous improvement in production. The most capable manipulation systems of the near future will likely use all three.
For those of us building at this frontier: the data infrastructure is finally catching up to the algorithmic ambition. The next constraint is not benchmark scores — it is the gap between what a robot can do in a controlled setting and what it can do reliably in the messy, variable, unstructured world where it actually needs to work. That gap is closable. The research is pointing the way.
[1] Huang Z., Yuan H., Fu Y., Lu Z. (2025). Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping. ICLR 2025. arXiv:2410.02475
[2] Ze Y., Zhang G., Zhang K., Hu C., Wang M., Xu H. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. RSS 2024. arXiv:2403.03954
[3] Freiberg et al. (2025). Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives — A Survey. Frontiers in Robotics and AI. doi:10.3389/frobt.2025.1682437
[4] Back S. et al. (2025). GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes. RA-L 2025. arXiv:2504.06866
[5] Zhang J., Liu H. et al. (2024). DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes. CoRL 2024. Project Page
[6] Liu Q., Cui Y. et al. (2025). VTDexManip: A Dataset and Benchmark for Visual-Tactile Pretraining and Dexterous Manipulation with Reinforcement Learning. ICLR 2025. arXiv:2406.09657
Cerebel builds dexterous AI workforces that bring human-like precision to physical industries.
© 2026 Cerebel. All Rights Reserved.