AI Speeds Up Without Losing Accuracy

Vision-language models (VLMs) are AI systems that process both images and text, enabling applications like generating captions for photos or answering questions about visual content. However, their high computational demands and slow response times have limited their use on resource-constrained devices like smartphones. Researchers have introduced FastVLM, a framework that speeds up VLM inference significantly while maintaining accuracy, addressing a key bottleneck in deploying these models in real-world scenarios.

The key finding is that FastVLM speeds up inference by 1.55x to 1.85x compared to standard decoding, as demonstrated on models like BLIP-2 and LLaVA-1.5. This improvement comes without notable drops in performance metrics such as BLEU-4 scores for captioning tasks, so output quality remains high. For example, on the MS-COCO dataset, BLEU-4 scores rose above the baseline configurations to as high as 43.6 with the full FastVLM approach, showing that the speed gains do not compromise accuracy.

The method builds on self-speculative decoding (SSD), in which a lightweight draft model quickly generates tentative output tokens and a verification step checks them using the full model. FastVLM enhances this by adding an imitation network that learns to mimic the deeper layers of the model, capturing the complex features that vision-language tasks depend on. This network is trained with cosine similarity and knowledge distillation to align its outputs with the full model's representations, which decouples the draft and verification roles and avoids a performance trade-off between them. The process also reuses key-value caches to reduce memory overhead, making it efficient on devices with limited resources.
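
The draft-then-verify idea can be sketched in a few lines of Python. This is a minimal illustration of generic greedy speculative decoding, not FastVLM's actual implementation: the `draft` and `full` callables are hypothetical stand-ins for the lightweight draft path and the full model, and a real system would verify all draft positions in one batched forward pass with a shared key-value cache rather than calling the full model once per position.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_decode(
    draft: NextToken,
    full: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop.

    Returns the generated sequence (prompt included) and the number of
    verification steps, each of which stands for one full-model pass.
    """
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    verify_steps = 0
    while len(out) < target_len:
        # 1. Draft: the cheap model proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft(out + proposal))

        # 2. Verify: keep draft tokens while they match the full model.
        verify_steps += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = full(out + accepted)
            if tok != expected:
                # First mismatch: take the full model's token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted, so the same verification
            # pass also yields one extra token from the full model.
            if len(out) + len(accepted) < target_len:
                accepted.append(full(out + accepted))
        out.extend(accepted)
    return out, verify_steps
```

Because every kept token is one the full model would have chosen itself, the output of this greedy variant matches ordinary full-model decoding; the savings come from needing fewer verification steps than generated tokens.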

Experiments on datasets including MS-COCO, NoCaps, VisDial, MM-Vet, and LLaVA-Wild show consistent speedups. In captioning tasks, for instance, FastVLM increased the acceptance rate of draft tokens, which reduces the number of costly full-model calls. Figure 5 in the paper illustrates how acceptance rates grow with context length, allowing the number of draft tokens to be adjusted dynamically. The framework maintained or improved scores across metrics, with BLIP-2 models showing BLEU-4 improvements from around 65.5 to 83.8 in some configurations, highlighting its effectiveness in real-world evaluations.
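
To see why a higher acceptance rate means fewer full-model calls, consider a simplified model that is common in analyses of speculative decoding and is not taken from the FastVLM paper: assume each draft token is accepted independently with probability `alpha`, and that `gamma` tokens are drafted per verification. The expected number of tokens produced per verification is then (1 - alpha^(gamma+1)) / (1 - alpha). The function name and parameters below are illustrative.

```python
def expected_tokens_per_verification(alpha: float, gamma: int) -> float:
    """Expected tokens gained per full-model verification pass.

    Assumes each of the gamma draft tokens is accepted independently with
    probability alpha, and that the verification pass always contributes
    one token of its own (the correction or the extra token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum when every draft token is accepted.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

Under these assumptions, an acceptance rate of 0 gives exactly one token per pass, the same as standard decoding, while a rate of 1 gives `gamma + 1` tokens per pass.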

This advance matters because it makes advanced AI more accessible on everyday devices, enabling faster image descriptions, dialogue systems, and complex reasoning without high latency. For everyday users, this could mean quicker responses in apps that use AI for visual assistance, improving the experience in fields like education, accessibility, and entertainment. By reducing inference time, FastVLM supports broader adoption of VLMs in mobile and edge computing environments.

Limitations include reduced effectiveness on tasks with very short outputs, such as visual question answering (VQA), where answers are often one token long. In these cases, the speedup is minimal because the draft model generates few tokens, and verification calls become comparable to standard decoding. The paper notes that while FastVLM still offers improvements over baseline methods due to shared parameters, its benefits are most pronounced in long-form generation tasks.
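
A back-of-the-envelope calculation shows why short outputs gain little. The sketch below is an illustrative approximation rather than a measurement from the paper: it assumes each verification pass yields a fixed average number of tokens (the hypothetical `tokens_per_verification` parameter) and ignores the cost of drafting.

```python
import math


def full_model_calls(output_len: int, tokens_per_verification: float) -> int:
    """Approximate full-model passes needed to emit output_len tokens.

    tokens_per_verification is the assumed average number of tokens gained
    per pass; 1.0 corresponds to standard token-by-token decoding.
    """
    if output_len < 0:
        raise ValueError("output_len must be non-negative")
    if tokens_per_verification < 1.0:
        raise ValueError("each pass yields at least one token")
    return math.ceil(output_len / tokens_per_verification)
```

With an assumed average of 3 tokens per pass, a 30-token caption needs about 10 passes instead of 30, but a one-token VQA answer needs one pass either way, so any drafting work is pure overhead.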

About the Author

Guilherme A.