Artificial intelligence systems typically require extensive computational resources to process multimedia data, but a new approach could change that. Researchers have developed TEMPEST, a method that enables AI models to learn directly from compressed files without the need for decoding, significantly reducing processing time and memory usage while maintaining competitive performance. This breakthrough addresses a fundamental bottleneck in applying transformer architectures to audio and image data, where large file sizes lead to prohibitive computational demands.
The key finding is that TEMPEST achieves state-of-the-art results on classification tasks while using substantially fewer computational resources. By exploiting the inherent structure of compressed file formats like MP3 and JPEG, the method reduces the number of tokens processed by transformers by a factor of 3 and cuts the attention matrix size by an order of magnitude. On the Speech Commands V2 dataset, TEMPEST achieved 91.15% accuracy compared to 91.92% for conventional methods, while representing each second of audio with only 108 tokens, versus the thousands of tokens typical of approaches that operate on decoded media.
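The relationship between the two efficiency figures follows from the quadratic cost of self-attention: cutting tokens by 3× shrinks the attention matrix by roughly 3² ≈ 9×, about an order of magnitude. A minimal sketch of that arithmetic, where the 324-token baseline is an illustrative value (simply 3× TEMPEST's reported 108 tokens), not a number from the paper:

```python
def attention_cost(num_tokens):
    # Self-attention builds a num_tokens x num_tokens score matrix,
    # so its memory and compute scale quadratically with sequence length.
    return num_tokens * num_tokens

baseline_tokens = 324  # illustrative: 3x TEMPEST's token count
tempest_tokens = 108   # tokens per second of audio reported for TEMPEST

ratio = attention_cost(baseline_tokens) / attention_cost(tempest_tokens)
print(ratio)  # 9.0 -> roughly an order of magnitude
```

This is why a modest 3× token reduction translates into a much larger saving in attention computation.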
The methodology leverages the block-based organization common to compressed formats. Instead of decoding files into raw media, TEMPEST treats compressed blocks as atomic units. Each block is mapped to a compact embedding through a lightweight transformer network; these embeddings are then processed by a classification network similar to Vision Transformers. The system includes three components: a block embedding network that converts compressed bytes into feature representations, a classification network that makes predictions, and a reconstruction network that regularizes the embeddings by attempting to reconstruct the original byte sequences.
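The three-component pipeline can be sketched in a few lines of NumPy. This is a simplified stand-in, not the authors' implementation: plain linear maps replace the two transformer networks, and all dimensions are illustrative (418 bytes is a typical MP3 frame size at 128 kbps, 108 blocks roughly one second of audio):

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK_BYTES, EMB_DIM, NUM_CLASSES = 418, 64, 35  # illustrative dimensions

# 1) Block embedding network: maps each compressed block to a feature vector.
W_embed = rng.standard_normal((BLOCK_BYTES, EMB_DIM)) * 0.01
# 2) Classification network: predicts a class from the pooled block embeddings.
W_cls = rng.standard_normal((EMB_DIM, NUM_CLASSES)) * 0.01
# 3) Reconstruction network: regularizes embeddings by predicting the bytes back.
W_rec = rng.standard_normal((EMB_DIM, BLOCK_BYTES)) * 0.01

def forward(blocks):
    """blocks: (num_blocks, BLOCK_BYTES) array of raw byte values in [0, 255]."""
    x = blocks.astype(np.float32) / 255.0  # normalize compressed bytes
    emb = x @ W_embed                      # per-block embeddings
    logits = emb.mean(axis=0) @ W_cls      # pooled classification logits
    recon = emb @ W_rec                    # attempted byte reconstruction
    recon_loss = float(np.mean((recon - x) ** 2))
    return logits, recon_loss

blocks = rng.integers(0, 256, size=(108, BLOCK_BYTES))  # ~1 s of MP3 frames
logits, recon_loss = forward(blocks)
print(logits.shape)  # (35,)
```

The key structural point survives the simplification: the model never decodes the bytes into audio or pixels; the compressed blocks themselves are the input tokens.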
Results show TEMPEST maintains competitive accuracy across multiple domains. On audio classification tasks using MP3 files, it achieved 58.98% accuracy on ESC-50 and 91.15% on Speech Commands V2. For images, despite the challenge of inexact block boundaries in JPEG files, it reached 95.79% accuracy on MNIST digit recognition. The method also demonstrated robustness to varying compression rates—training with multiple bitrates improved ESC-50 accuracy from 56.66% to 58.98%, showing that exposure to diverse compressed representations enhances generalization.
This approach matters because compressed files dominate digital storage and transmission. By bypassing decoding steps, TEMPEST could enable faster processing of multimedia content in applications ranging from content moderation to medical imaging. The efficiency gains are particularly significant for systems handling millions of files, where reduced computational requirements translate to lower energy consumption and faster response times. The method's ability to work with multiple compression formats suggests broad applicability across different media types.
Limitations include challenges with formats that lack clear block boundaries, such as JPEG, where the method must approximate Minimum Coded Unit partitions. The current implementation also underperforms on Opus audio codec compared to MP3, achieving only 49.36% accuracy versus 58.98% for MP3 on ESC-50. Future work could explore modality-agnostic approaches and architectural optimizations to further enhance efficiency across diverse compression schemes.
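The block-boundary issue is concrete: MP3 frames begin with an explicit, byte-aligned sync pattern (11 set bits), so block boundaries can be found by scanning the raw stream, whereas JPEG's Minimum Coded Units are entropy-coded and not byte-aligned, which is why TEMPEST must approximate them. A minimal sync-scan sketch (a real parser would also validate the bitrate and sample-rate fields in each header):

```python
def mp3_frame_offsets(data: bytes):
    """Return byte offsets of candidate MP3 frame headers.

    An MP3 frame header starts with 11 sync bits: a 0xFF byte
    followed by a byte whose top three bits are all set.
    """
    offsets = []
    for i in range(len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
            offsets.append(i)
    return offsets

# Two fake sync words embedded in filler bytes (illustrative, not a valid MP3 stream).
sample = b"\x00\x01\xff\xfb\x90\x00" + b"\x00" * 4 + b"\xff\xfb\x90\x00"
print(mp3_frame_offsets(sample))  # [2, 10]
```

No comparable byte-level scan exists for JPEG MCUs, which helps explain the accuracy gap between the cleanly-framed audio formats and images.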
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.