The rise of large-scale foundation models has transformed machine learning, particularly in 3D vision, where models pre-trained on massive datasets serve as powerful, general-purpose feature extractors for robotics, autonomous driving, and AR/VR applications. However, their immense computational demands—with hundreds of millions of parameters and quadratic attention complexity—create a critical deployment bottleneck. As researchers demonstrate, even modern GPUs can fail to process moderately sized point clouds of 300k points, let alone the million-point scenes common in real-world use. This barrier prevents these advanced models from reaching resource-constrained edge devices where they're often most needed, highlighting a pressing need for efficient compression without sacrificing versatility.
Traditional compression techniques like knowledge distillation typically create efficient 'specialist' models by training a student to mimic a teacher's outputs on a specific task, but this sacrifices the general-purpose representational power that makes foundation models valuable. Other approaches, such as direct feature mimicry from vision-language models like CLIP, still produce students specialized for narrow capabilities like zero-shot classification. To address this, researchers introduce Foundation Model Distillation (FMD), a new paradigm aimed at creating compact, portable proxies that retain the original model's broad utility. Their implementation, Foundry, is the first FMD framework for 3D point clouds, centered on a novel compress-and-reconstruct objective using learnable SuperTokens.
At its core, Foundry trains a lightweight student model to compress the teacher's dense token embeddings into a small, fixed-size set of SuperTokens, then reconstruct the original embeddings from this compressed representation. This process involves a Dynamic Supertoken Optimization (DSO) module that uses cross-attention to aggregate information from input tokens into SuperTokens, acting as a shared memory of semantic and geometric concepts. A Cross-Attention Upsampling (CAU) module then reconstructs the teacher's full latent space, with the entire system trained via a Smooth L1 loss to minimize reconstruction error. This forces the SuperTokens to become an efficient basis for the teacher's representational manifold, resulting in a standalone student that acts as a miniature foundation model, capable of cheap fine-tuning for diverse downstream tasks without needing the original teacher.
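The pipeline above can be sketched in a few lines. The following is a minimal NumPy illustration of the compress-and-reconstruct objective, not the paper's implementation: it uses single-head attention without learned projections, and the function names (`cross_attention`, `smooth_l1`), token counts, and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Simplified single-head attention: keys and values share one matrix,
    # and the learned projection layers of a real module are omitted.
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

def smooth_l1(pred, target, beta=1.0):
    # Smooth L1 (Huber-style) reconstruction loss used to train the student.
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

rng = np.random.default_rng(0)
N, K, D = 128, 16, 64                       # input tokens, SuperTokens, embed dim (illustrative)
teacher_tokens = rng.normal(size=(N, D))    # stand-in for dense teacher embeddings
supertokens = rng.normal(size=(K, D))       # learnable SuperTokens (would be trained parameters)

# DSO-style step: SuperTokens act as queries that aggregate the teacher tokens
# into a small fixed-size compressed representation of shape (K, D).
compressed = cross_attention(supertokens, teacher_tokens)

# CAU-style step: the original token positions query the SuperTokens to
# reconstruct the teacher's full latent space, shape (N, D).
reconstructed = cross_attention(teacher_tokens, compressed)

loss = smooth_l1(reconstructed, teacher_tokens)
```

In training, gradients of this reconstruction loss would flow back into the SuperTokens and attention weights, which is what pushes the SuperTokens toward an efficient basis for the teacher's representational manifold.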
Experiments validate the FMD paradigm's effectiveness. A single Foundry student distilled on the general-purpose ShapeNet55 dataset maintains high performance when fine-tuned on both classification (89.95% accuracy on ShapeNet55) and segmentation tasks, while specialist students trained with traditional knowledge distillation see performance collapse when transferred outside their native task. In few-shot learning scenarios on ModelNet40, the distilled student retains remarkable capability, achieving 91.8% top-1 accuracy in a 10-shot setting even with extreme compression down to a single SuperToken. The SuperToken mechanism itself proves crucial, outperforming baselines like static K-Means clustering (which causes a 13% accuracy drop) and Farthest Point Sampling, demonstrating that learned semantic compression captures richer information than simple geometric pre-sampling.
Benchmarking across six classification datasets and one segmentation dataset shows Foundry maintains accuracy within 1-2% of the full teacher model on synthetic data and within a few points on challenging real-world datasets, despite compressing latent representations to as few as 1-16 SuperTokens. Computational analysis reveals significant efficiency gains: Foundry reduces FLOPs from 478 G to 137-178 G on object-level inference and enables processing of large-scale scenes that exceed baseline VRAM limits, with a 4.0 GB footprint and 4.6-second forward time on an RTX A3000 GPU. The framework also supports dynamic, budget-aware inference via a gating mechanism, allowing on-the-fly trade-offs between computational cost and accuracy. However, limitations include a focus on a single 3D self-supervised teacher (Point-JEPA) and the need for further validation across other foundation models and modalities, though the compress-and-reconstruct design shows promising transferability.
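The budget-aware gating idea can be illustrated with a simple selection rule. This sketch is a hypothetical cost model, not the paper's gating mechanism: only the 137 G and 178 G endpoints come from the reported range, and the intermediate SuperToken counts and their costs are placeholder assumptions.

```python
def select_supertoken_count(budget_gflops, cost_table):
    """Pick the largest SuperToken count whose estimated FLOPs fit the budget.

    cost_table maps SuperToken count -> estimated GFLOPs per forward pass.
    Falls back to the cheapest configuration if nothing fits the budget.
    """
    affordable = [k for k, cost in cost_table.items() if cost <= budget_gflops]
    return max(affordable) if affordable else min(cost_table)

# Hypothetical cost table; only the 137 G and 178 G endpoints are reported
# in the article (for 1 and 16 SuperTokens respectively), the rest is made up.
COSTS = {1: 137.0, 4: 150.0, 16: 178.0}

tight = select_supertoken_count(140.0, COSTS)   # only the 1-SuperToken config fits
roomy = select_supertoken_count(200.0, COSTS)   # everything fits, take the largest
```

A real gating module would make this decision per input (e.g. from a learned confidence signal) rather than from a static table, but the trade-off it navigates is the same: fewer SuperTokens, fewer FLOPs, lower accuracy.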
Reference: Letellier et al., 2025, arXiv:2511.20721
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.