AI Learns to Follow Human Instructions for Complex Object Manipulation

Robots and virtual assistants have long struggled to perform everyday tasks like opening a drawer or adjusting glasses, but a new AI framework bridges this gap by translating natural language into precise hand movements for articulated objects. Published at ICLR 2026, the SynHLMA system enables machines to understand and execute instructions such as 'help open the middle drawer of the cabinet' or 'please help with my eyeglasses,' advancing applications in robotics, VR, and AR where dexterous manipulation is critical. This breakthrough addresses a core challenge in embodied AI: moving beyond rigid objects to handle items with moving parts, which require coordinated grasps and deformations over time.

SynHLMA generates, predicts, and interpolates hand-object interactions by discretizing movements into tokens, much as language models process words. Given a 3D object point cloud and a text query, the system outputs realistic hand poses that align with the instruction, and it outperforms prior methods on tasks such as generating full manipulation sequences from scratch or completing partial actions. For example, it synthesizes motions for opening a laptop lid or rotating scissors, as validated on the HAOI-Lang dataset, which includes over 500,000 static grasps and 50,000 sequences annotated with GPT-4 descriptions.
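To make the token analogy concrete, here is a minimal Python sketch of how per-frame codebook indices might be written out as word-like tokens and read back. The stage names, the token spelling (for example "<global_3>"), and the start and end markers are assumptions made for illustration; the article does not describe SynHLMA's actual vocabulary.

```python
# Illustrative sketch only: stage names and token spelling are assumptions,
# not the vocabulary used by SynHLMA.
STAGES = ("global", "local", "refine")


def indices_to_tokens(frames):
    """Serialize per-frame codebook indices into word-like tokens.

    frames: list of dicts mapping each stage name to an integer index.
    """
    tokens = ["<hoi_start>"]
    for frame in frames:
        for stage in STAGES:
            tokens.append(f"<{stage}_{frame[stage]}>")
    tokens.append("<hoi_end>")
    return tokens


def tokens_to_indices(tokens):
    """Invert indices_to_tokens and recover the per-frame indices."""
    body = tokens[1:-1]  # drop the start and end markers
    frames = []
    for i in range(0, len(body), len(STAGES)):
        frame = {}
        for tok in body[i:i + len(STAGES)]:
            stage, idx = tok.strip("<>").rsplit("_", 1)
            frame[stage] = int(idx)
        frames.append(frame)
    return frames


# Two hypothetical frames of a manipulation sequence.
example_frames = [
    {"global": 3, "local": 17, "refine": 5},
    {"global": 3, "local": 18, "refine": 2},
]
example_tokens = indices_to_tokens(example_frames)
```

Once motion is expressed this way, a language model can treat a manipulation sequence like a sentence: generation produces a sequence from scratch, prediction continues a partial one, and interpolation fills in missing frames.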

The method combines a multi-stage vector-quantized variational autoencoder (VQ-VAE) with a language model, Vicuna-7B, which is fine-tuned with Low-Rank Adaptation (LoRA) for efficiency. SynHLMA discretizes hand and object parameters (global pose, local articulation, and refinement details) into separate codebooks, giving fine-grained control over each. An articulation-aware loss function enforces physical plausibility by penalizing penetrations, maintaining pose consistency, and aligning joint configurations, while a physics-based simulator supports realistic training. This approach allows the model to capture dynamic variations in articulated joints, such as hinges or sliders, without relying on extensive manual programming.
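The codebook idea can be sketched in a few lines of Python. This is a toy illustration of only the lookup step in vector quantization, where each feature vector is replaced by the index of its nearest codebook entry and each stage has its own codebook. The stage names, vector sizes, and codebook values are invented for the example. In an actual VQ-VAE the encoder, decoder, and codebooks are learned jointly, which this sketch does not attempt.

```python
# Toy sketch of per-stage codebook lookup. All names and values here are
# assumptions for illustration; they do not come from the SynHLMA paper.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_stages(features, codebooks):
    """Quantize each stage's feature vector with that stage's own codebook."""
    return {
        stage: nearest_code(vec, codebooks[stage])
        for stage, vec in features.items()
    }


def decode_stages(indices, codebooks):
    """Map indices back to the codebook vectors they stand for."""
    return {stage: codebooks[stage][idx] for stage, idx in indices.items()}


# Hypothetical two-stage setup with tiny codebooks.
toy_codebooks = {
    "global": [[0.0, 0.0], [1.0, 1.0]],
    "local": [[0.0], [0.5], [1.0]],
}
toy_features = {"global": [0.9, 0.8], "local": [0.4]}
toy_indices = quantize_stages(toy_features, toy_codebooks)
toy_decoded = decode_stages(toy_indices, toy_codebooks)
```

Keeping one codebook per stage means a coarse quantity such as global pose and a fine one such as refinement detail do not compete for the same set of discrete codes, which is one plausible reading of why the article credits separate codebooks with fine-grained control.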

Experimental results show that SynHLMA outperforms state-of-the-art baselines like Text2HOI and MotionGPT across key metrics. In generation tasks, it improved Fréchet Inception Distance (FID) by 4.919% and diversity by 13.986%, indicating more realistic and varied outputs. For prediction, where only 20% of a sequence is provided, it achieved a 14.64% FID improvement and a 19.572% gain in diversity. The framework's robustness stems from its hierarchical token representation and shared embedding space, which align language instructions with manipulation actions. Ablation studies confirmed that removing joint-aware tokens or using a single codebook degrades performance, highlighting the importance of discrete, semantic-driven stages.
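The article does not say how these percentages were computed. One common convention, assumed here, is relative improvement over the baseline, with the sign flipped for metrics where lower is better (FID) versus higher is better (diversity). The Python sketch below shows that convention. The function name is hypothetical and the numbers in the usage lines are made up; they are not results from the paper.

```python
def relative_improvement(baseline, ours, lower_is_better):
    """Percentage improvement of `ours` over `baseline`.

    Assumes the relative-change convention; the article does not confirm
    that the paper's percentages were computed this way.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * change / abs(baseline)


# Made-up scores, purely to show the direction of each metric.
fid_gain = relative_improvement(2.0, 1.5, lower_is_better=True)
diversity_gain = relative_improvement(10.0, 12.0, lower_is_better=False)
```

Under this convention a positive value always means the new method is better, whichever direction the underlying metric runs.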

This research matters because it brings AI closer to human-like dexterity in real-world scenarios, from household robotics to industrial automation. By enabling machines to interpret and act on natural language commands for complex objects, SynHLMA reduces the need for specialized programming and could enhance assistive technologies, such as robots aiding people with disabilities. The team demonstrated its practical utility by transferring learned manipulations to a ShadowHand robot, showing that synthesized poses can guide robot execution in simulated environments.

Limitations include the model's reliance on simulated data from the HAOI-Lang dataset, which, though comprehensive, may not capture all real-world variations. The paper notes that future work should explore bimanual coordination and finer-grained manipulations, as the current framework focuses on single-hand interactions. Additionally, while SynHLMA excels in interpolation and prediction, its performance in entirely unseen object categories remains to be tested, pointing to areas for further refinement in generalization.

About the Author

Guilherme A.