
AI Fills in Full-Body Motion for VR with Just Head and Hands

A new lightweight AI method generates smooth, realistic full-body movements in real-time using only data from head-mounted displays and hand controllers, overcoming a major barrier to immersive virtual reality.

AI Research
March 26, 2026
4 min read

Virtual and augmented reality experiences often feel incomplete because headsets can track only your head and hands, leaving the rest of your body as a stiff, unnatural avatar. This gap breaks immersion, making interactions in digital worlds feel less realistic. A new AI approach developed by researchers at Samsung R&D Institute UK and CERTH tackles this problem head-on, enabling real-time generation of full-body motion from the sparse data provided by standard VR headsets and controllers. Their method, called Mem-MLP, not only achieves high accuracy but runs efficiently on mobile devices like the Meta Quest 3, hitting 72 frames per second, the threshold needed for seamless interaction.

The key finding is that Mem-MLP can reconstruct smooth, plausible full-body movements using only three sensor inputs: the head and both hands. As shown in Figure 1, it outperforms state-of-the-art methods in balancing accuracy and speed, reducing the mean per-joint position error by 26% compared to AvatarPoser, another real-time-capable model. For instance, in Scenario-1 testing on the AMASS dataset, Mem-MLP achieved a mean per-joint position error of 3.08 cm and a jitter metric of 6.03 m/s³, indicating smooth motion. This allows VR avatars to mirror complex actions like walking, sitting, and jumping in real time, as visualized in Figure 5, without requiring additional body sensors.

The methodology centers on a multi-layer perceptron (MLP) backbone enhanced with a novel component called the Memory-Block. This block uses trainable code-vectors to represent missing sensor data from joints like the pelvis and legs, combining them with sparse signals from previous time instances to improve temporal consistency. Specifically, the Memory-Block integrates a frozen VQ-VAE model to generate these code-vectors, which encode plausible motion priors. During training, it blends features from sparse inputs, ground-truth motions, and code-vectors; at inference, it operates autoregressively, using past predictions to inform current frames. Additionally, the researchers formulated the problem as multi-task learning, with separate branches predicting joint rotations and positions, leveraging a loss weighting mechanism based on homoscedastic uncertainty to balance the two objectives.
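The homoscedastic-uncertainty weighting follows the well-known scheme of Kendall et al., in which each task loss is scaled by a learned log-variance. A minimal sketch of that scheme, with illustrative names and fixed numbers standing in for what would be trainable parameters in the actual network:

```python
import math

def weighted_multitask_loss(task_losses, log_vars):
    """Combine per-task losses L_i using log-variances s_i:
        total = sum_i( exp(-s_i) * L_i + s_i )
    A branch whose s_i grows is down-weighted (exp(-s_i) shrinks),
    while the '+ s_i' term keeps s_i from growing without bound."""
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_vars))

# e.g. a rotation-branch loss and a position-branch loss,
# both starting at s_i = 0 (unit weights):
total = weighted_multitask_loss([0.8, 1.4], [0.0, 0.0])
print(round(total, 6))  # 2.2
```

The appeal of this formulation is that the network learns the relative weighting of rotation and position errors on its own, rather than requiring hand-tuned coefficients for each branch.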

Extensive experiments on the AMASS dataset demonstrate Mem-MLP's superiority. In Scenario-1, it achieved competitive metrics, such as a mean per-joint rotation error of 2.57 degrees and a mean per-joint velocity error of 15.02 cm/s, while maintaining the lowest computational cost at 0.25 GFLOPs. Ablation studies in Table 4 show that the Memory-Block alone improved the jitter metric from 22.70 to 8.42, and adding the multi-head predictor further reduced it to 6.03. Figure 1 illustrates the trade-off between accuracy and inference time, with Mem-MLP reaching 72 FPS on a Quest 3 headset, far surpassing alternatives like AGRoL-Diffusion, which managed only 3.5 FPS. The model also showed robustness across different scenarios, though performance dipped in Scenario-2 due to limited motion diversity in the training data, highlighting its dependency on comprehensive datasets.

The implications are significant for everyday VR and AR users, as this technology enables more immersive experiences without cumbersome extra gear. By generating full-body motions in real time, it could enhance applications in gaming, virtual social interactions, and professional training simulations, making digital avatars move naturally. The model's efficiency on mobile hardware means it can be deployed widely, potentially lowering barriers to high-quality VR. Moreover, the approach of using code-vectors to fill in missing data could inspire similar techniques in other fields where sparse sensor data is a challenge, such as robotics or biomechanics.

However, the study acknowledges limitations. The model's performance relies heavily on the diversity of the training data; as noted, accuracy decreased in Scenario-2, where certain motion types like crawling or dancing were underrepresented. This suggests that the model may struggle with unusual or highly dynamic movements not well covered in its training data. Additionally, while Mem-MLP achieves real-time speeds, its autoregressive design means errors could accumulate over time if initial predictions are off. The researchers also caution that the VQ-VAE component, though frozen during training, adds complexity, and its effectiveness depends on the quality of the learned code-vectors. Future work might address these issues by expanding dataset variety or refining the memory mechanism for better generalization.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn