
Foundry: How SuperTokens Are Making 3D AI Models Practical for Edge Devices

AI Research
March 26, 2026
3 min read

The rise of large-scale foundation models has transformed machine learning, particularly in 3D vision, where models pre-trained on massive datasets serve as powerful, general-purpose feature extractors for robotics, autonomous driving, and AR/VR applications. However, their immense computational demands—with hundreds of millions of parameters and quadratic attention complexity—create a critical deployment bottleneck. As researchers demonstrate, even modern GPUs can fail to process moderately sized point clouds of 300k points, let alone the million-point scenes common in real-world use. This barrier prevents these advanced models from reaching resource-constrained edge devices where they're often most needed, highlighting a pressing need for efficient compression without sacrificing versatility.

Traditional compression techniques like knowledge distillation typically create efficient 'specialist' models by training a student to mimic a teacher's outputs on a specific task, but this sacrifices the general-purpose representational power that makes foundation models valuable. Other approaches, such as direct feature mimicry from vision-language models like CLIP, still produce students specialized for narrow capabilities like zero-shot classification. To address this, the researchers introduce Foundation Model Distillation (FMD), a new paradigm aimed at creating compact, portable proxies that retain the original model's broad utility. Their implementation, Foundry, is the first FMD framework for 3D point clouds, centered on a novel compress-and-reconstruct objective using learnable SuperTokens.

At its core, Foundry trains a lightweight student model to compress the teacher's dense token embeddings into a small, fixed-size set of SuperTokens, then reconstruct the original embeddings from this compressed representation. This process involves a Dynamic Supertoken Optimization (DSO) module that uses cross-attention to aggregate information from input tokens into SuperTokens, acting as a shared memory of semantic and geometric concepts. A Cross-Attention Upsampling (CAU) module then reconstructs the teacher's full latent space, with the entire system trained via a Smooth L1 loss to minimize reconstruction error. This forces the SuperTokens to become an efficient basis for the teacher's representational manifold, resulting in a standalone student that acts as a miniature foundation model, capable of cheap fine-tuning for diverse downstream tasks without needing the original teacher.
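The compress-and-reconstruct pipeline described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: single-head, unparameterized cross-attention stands in for the DSO and CAU modules, the SuperTokens are random placeholders rather than learned queries, and all dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product cross-attention; keys and values share one tensor
    scores = queries @ keys_values.T / np.sqrt(d)      # (M, N)
    return softmax(scores, axis=-1) @ keys_values      # (M, d)

rng = np.random.default_rng(0)
d, n_tokens, n_super = 64, 512, 16

teacher_tokens = rng.standard_normal((n_tokens, d))  # dense teacher embeddings
supertokens = rng.standard_normal((n_super, d))      # learnable queries (random init here)

# DSO-style compression: SuperTokens attend to the teacher's dense tokens,
# aggregating them into a small fixed-size representation
compressed = cross_attention(supertokens, teacher_tokens, d)      # (16, 64)

# CAU-style reconstruction: token-position queries (here, the teacher tokens
# themselves, a simplification) attend back to the compressed SuperTokens
reconstructed = cross_attention(teacher_tokens, compressed, d)    # (512, 64)

# Smooth L1 (Huber) reconstruction loss, matching the paper's stated objective
diff = np.abs(reconstructed - teacher_tokens)
smooth_l1 = np.where(diff < 1.0, 0.5 * diff**2, diff - 0.5).mean()
print(compressed.shape, reconstructed.shape)
```

In the real system the SuperTokens and attention projections are trained end-to-end so that minimizing this reconstruction loss forces the SuperTokens to span the teacher's representational manifold.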

Experimental results validate the FMD paradigm's effectiveness. A single Foundry student distilled on the general-purpose ShapeNet55 dataset maintains high performance when fine-tuned on both classification (89.95% accuracy on ShapeNet55) and segmentation tasks, while specialist students trained with traditional knowledge distillation see performance collapse when transferred outside their native task. In few-shot learning scenarios on ModelNet40, the distilled student retains remarkable capability, achieving 91.8% top-1 accuracy in a 10-shot setting even with extreme compression down to a single SuperToken. The SuperToken mechanism itself proves crucial, outperforming baselines like static K-Means clustering (which causes a 13% accuracy drop) and Farthest Point Sampling, demonstrating that learned semantic compression captures richer information than simple geometric pre-sampling.

Benchmarking across six classification datasets and one segmentation dataset shows Foundry maintains accuracy within 1-2% of the full teacher model on synthetic data and within a few points on challenging real-world datasets, despite compressing latent representations to as few as 1-16 SuperTokens. Computational analysis reveals significant efficiency gains: Foundry reduces FLOPs from 478 G to 137-178 G on object-level inference and enables processing of large-scale scenes that exceed baseline VRAM limits, with a 4.0 GB footprint and 4.6-second forward time on an RTX A3000 GPU. The framework also supports dynamic, budget-aware inference via a gating mechanism, allowing on-the-fly trade-offs between computational cost and accuracy. However, limitations include a focus on a single 3D self-supervised teacher (Point-JEPA) and the need for further validation across other foundation models and modalities, though the compress-and-reconstruct design shows promising transferability.
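The budget-aware inference idea mentioned above can be illustrated with a toy selection rule. This is a hypothetical sketch of how a gate might trade compute for accuracy: the gate scores, the top-k rule, and the budget values are all assumptions, not the paper's mechanism.

```python
import numpy as np

def select_supertokens(supertokens, gate_scores, budget_k):
    # Keep the budget_k highest-scoring SuperTokens (hypothetical gating rule);
    # fewer active SuperTokens means less downstream attention compute
    idx = np.argsort(gate_scores)[::-1][:budget_k]
    return supertokens[idx]

rng = np.random.default_rng(1)
supertokens = rng.standard_normal((16, 64))
gate_scores = rng.random(16)  # in a real system, outputs of a learned gate

for k in (1, 4, 16):  # coarse-to-fine compute budgets
    active = select_supertokens(supertokens, gate_scores, k)
    print(k, active.shape)
```

The point of such a rule is that a single trained student can serve multiple deployment targets, from a 1-SuperToken edge configuration to the full 16-SuperToken setting, without retraining.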

Reference: Letellier et al., 2025, arXiv:2511.20721


About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
