Deploying large AI models has become increasingly difficult due to their massive memory requirements, often exceeding the capacity of standard hardware. Foundation models with tens to hundreds of billions of parameters can require over 140 GB of memory at full precision, making them impractical for most organizations to run. This bottleneck has created a pressing need for effective compression techniques that can reduce model size without sacrificing performance. A new open-source tool called OneComp addresses this by automating the entire quantization process, transforming a fragmented landscape of manual techniques into a streamlined, hardware-aware pipeline.
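For scale, the 140 GB figure corresponds to roughly 70 billion parameters stored at 16-bit precision. A quick back-of-the-envelope sketch (the 70B parameter count is an illustrative assumption, not a figure from the paper):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Gigabytes (1 GB = 1e9 bytes) needed to store the weights alone."""
    return num_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at common precisions: 16-bit matches the ~140 GB
# full-precision figure, and 4-bit yields the 4x reduction quantization aims for.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Even before any accuracy considerations, this arithmetic shows why sub-8-bit formats are the difference between needing multiple datacenter GPUs and fitting on a single commodity card.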
OneComp works by reducing the precision of model parameters through a technique called post-training quantization (PTQ), which compresses weights after training without requiring expensive retraining. The framework automatically inspects a model, plans mixed-precision assignments using its AutoBit component, and executes progressive quantization stages. These stages range from layer-wise compression, which processes one linear layer at a time with minimal memory, to block-wise refinement that optimizes entire Transformer blocks, and finally global refinement that coordinates adjustments across all layers. A key innovation is treating the first quantized checkpoint as a deployable pivot, ensuring each subsequent stage improves the same model and that quality increases as more computational resources are invested.
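The staged design can be sketched as a simple loop over refinement stages. Everything below (the stage names, memory figures, and the multiplicative quality model) is a hypothetical illustration of the resource-adaptive, monotonically improving behavior described above, not OneComp's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Stage:
    name: str
    memory_needed_gb: float
    run: Callable[[float], float]  # previous perplexity -> refined perplexity

def run_progressive_ptq(stages: List[Stage], budget_gb: float,
                        baseline_ppl: float) -> Tuple[float, List[str]]:
    """Run stages in order, stopping when the GPU budget is exceeded.

    The first completed stage already yields a deployable checkpoint (the
    "pivot"); each later stage refines the same model, and a stage's result
    is only accepted if it does not regress quality (monotonic improvement).
    """
    best, completed = baseline_ppl, []
    for stage in stages:
        if stage.memory_needed_gb > budget_gb:
            break  # remaining stages need more memory than we have
        candidate = stage.run(best)
        if candidate <= best:
            best = candidate
            completed.append(stage.name)
    return best, completed

# Illustrative numbers only: cheaper stages fit smaller budgets.
stages = [
    Stage("layer-wise", 8, lambda p: p * 0.7),
    Stage("block-wise", 24, lambda p: p * 0.9),
    Stage("global", 80, lambda p: p * 0.95),
]
ppl, done = run_progressive_ptq(stages, budget_gb=40, baseline_ppl=12.0)
# With a 40 GB budget the global stage is skipped: done lists only the
# layer-wise and block-wise stages, and ppl is approximately 7.56.
```

The key property this loop captures is that stopping early never hurts: a user with less memory or time simply gets the checkpoint from the last stage that fit.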
The methodology behind OneComp is built on three core principles: a path-to-plan API that dynamically derives workflows from model architecture and GPU budget, a resource-adaptive engine providing monotonically improving quality, and an extensible refiner architecture that allows new algorithms to be integrated seamlessly. For optimization, OneComp incorporates techniques like Quantization Error Propagation (QEP), which corrects targets to compensate for errors from upstream quantized layers, and Submodule-Aware Coordinate Descent (LPCD), which jointly optimizes coupled layers within functional units. The framework supports various quantization formats, including JointQ for 3-4 bit regimes, which jointly optimizes scales and integer weights, and structured binary-factor formats like DBF and MDBF for extreme 1-2 bit compression.
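Of these techniques, QEP is the easiest to illustrate. The toy sketch below reflects one common reading of error propagation: solve for corrected weights that reproduce the full-precision layer's outputs given the activations the layer will actually receive from quantized upstream layers. The shapes, the additive noise standing in for upstream quantization error, and the least-squares solve are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(16, 32))           # layer weight (out_features x in_features)
X_fp = rng.normal(size=(32, 256))       # activations from full-precision upstream
X_q = X_fp + 0.05 * rng.normal(size=X_fp.shape)  # after upstream quantization (toy noise)

target = W @ X_fp  # what the full-precision model produces at this layer

# Naive: keep the original weights, but feed them quantized-upstream inputs.
err_naive = np.linalg.norm(W @ X_q - target)

# QEP-style target correction: solve min_W' ||W' X_q - target||_F, so the
# corrected weights absorb the error introduced by upstream quantization.
W_qep = target @ np.linalg.pinv(X_q)
err_qep = np.linalg.norm(W_qep @ X_q - target)

# err_qep < err_naive: the corrected layer compensates for upstream error.
```

In a real pipeline the corrected weights would then be quantized themselves; this sketch isolates only the target-correction step to show why it reduces compounding error across layers.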
Experimental results demonstrate OneComp's effectiveness across models like LLaMA and Qwen. In tests on Llama-3-8B, activation-aware mixed-precision allocation with AutoBit achieved a perplexity of 6.85 at 4.16 average bits per weight, close to the full-precision baseline of 6.14, while uniform assignment degraded to 9.53. For layer-wise PTQ, QEP correction reduced perplexity from 12.66 to 6.66 in 4-bit settings, and JointQ consistently outperformed standard GPTQ, especially in per-channel quantization where GPTQ showed significant degradation. In extreme low-bit regimes, MDBF maintained lower perplexity than DBF across models, with improvements increasing at 1.00 bits per weight and for larger models, showing that structured formats can preserve meaningful accuracy where uniform quantization fails.
The implications of OneComp are significant for making advanced AI more accessible. By automating compression, it lowers the barrier for organizations to deploy foundation models on commodity hardware, potentially reducing memory footprints by factors of four to eight. This could enable wider adoption in resource-constrained environments like edge devices or smaller research labs. The framework's progressive refinement also allows users to balance quality and computational cost, stopping at earlier stages for faster results or investing more for higher accuracy. However, the paper notes limitations, such as the current focus on weight-only quantization, leaving activation and KV-cache compression for future work, and the challenge of maintaining accuracy at very low bit-widths, where performance gaps from full precision remain. Planned extensions include support for MDBF and global PTQ stages, aiming to further bridge the gap between algorithmic innovation and practical deployment.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn