AI Compresses Scientific Data Better Than Ever

AI Research
November 14, 2025
3 min read

As high-resolution simulations and environmental monitoring systems generate massive datasets, storing and transmitting this information efficiently has become a critical bottleneck in scientific research. A new study introduces a method that uses large language models (LLMs) to compress scientific data with guaranteed accuracy, achieving up to 30% better compression than existing techniques. This advancement could accelerate discoveries in fields like climate science by making vast datasets more manageable without sacrificing precision.

The researchers developed LLMCOMP, a paradigm that treats scientific data compression as a language modeling task. Instead of compressing numbers directly, it converts continuous data values into discrete tokens, similar to how words are tokenized in text processing. These tokens are arranged into sequences using a Z-order space-filling curve to preserve spatial and temporal relationships. A decoder-only transformer model is then trained to predict the next token in these sequences, learning the underlying patterns in the data.
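The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration, assuming uniform binning for the quantizer and a 2-D grid; the paper's exact quantization scheme and curve parameters may differ, and the function names here are invented for the example.

```python
import numpy as np

def quantize_to_tokens(values, vmin, vmax, vocab_size=1024):
    """Map continuous values onto a discrete token vocabulary by
    uniform binning over the data range (illustrative scheme)."""
    bins = np.linspace(vmin, vmax, vocab_size + 1)
    return np.clip(np.digitize(values, bins) - 1, 0, vocab_size - 1)

def z_order_index(x, y, bits=4):
    """Interleave the bits of 2-D grid coordinates into a Z-order
    (Morton) index, so spatially nearby points stay nearby in the
    1-D token sequence."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)
        idx |= ((y >> b) & 1) << (2 * b + 1)
    return idx

# Tokenize a toy 16x16 field and order its tokens along the Z-curve.
H = W = 16
field = np.random.rand(H, W).astype(np.float32)
tokens = quantize_to_tokens(field, field.min(), field.max())
codes = np.array([z_order_index(x, y) for y in range(H) for x in range(W)])
sequence = tokens.reshape(-1)[np.argsort(codes)]  # input to the transformer
```

The Z-order pass matters because a naive row-major flattening separates vertically adjacent grid points by a full row, which makes spatial correlations harder for a next-token predictor to exploit.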

During compression, the model autoregressively predicts tokens step by step. If the prediction matches the actual data within a top-k list of candidates, only the index is stored; otherwise, the true value is recorded as a correction. This ensures that the decompressed data stays within a user-specified error bound, such as a relative tolerance of 10^-4, meaning no reconstructed point deviates from the original by more than that fraction. Training optimizes a combination of cross-entropy loss, for token accuracy, and mean-squared error, for numerical fidelity, via gradient descent.
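The encode/decode loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: the transformer's ranked top-k output is mocked as fixed lists, and `topk_encode`, `topk_decode`, and the toy vocabulary are names invented for the example.

```python
import numpy as np

def topk_encode(original, pred_topk, dequant, rel_tol):
    """Store a cheap rank index when some top-k candidate's dequantized
    value meets the relative error bound; otherwise store the true
    value as a literal correction. pred_topk[t] stands in for the
    model's ranked candidate tokens at step t."""
    stream = []
    for t, true_val in enumerate(original):
        rank = next((r for r, cand in enumerate(pred_topk[t])
                     if abs(dequant(cand) - true_val) <= rel_tol * abs(true_val)),
                    None)
        stream.append(("rank", rank) if rank is not None else ("literal", true_val))
    return stream

def topk_decode(stream, pred_topk, dequant):
    """Replay the same predictions to reconstruct the data; every point
    satisfies the error bound by construction."""
    return [dequant(pred_topk[t][v]) if kind == "rank" else v
            for t, (kind, v) in enumerate(stream)]

# Toy vocabulary: 16 tokens mapped to evenly spaced bin centers.
centers = np.linspace(0.0, 1.0, 16)
dequant = lambda tok: centers[tok]
original = [0.2, 0.47, 0.8]
pred_topk = [[3, 4], [2, 5], [12, 0]]  # mocked model output per step
stream = topk_encode(original, pred_topk, dequant, rel_tol=0.05)
recon = topk_decode(stream, pred_topk, dequant)
```

In this toy run the middle value falls outside every candidate's tolerance, so it is stored as a literal correction while the other two compress to small rank indices; that fallback is what makes the error bound a hard guarantee rather than an average-case property.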

Experiments on datasets like the Red Sea Reanalysis and ERA5 reanalysis—which include variables such as temperature, humidity, and wind speed—show that LLMCOMP outperforms state-of-the-art compressors like SZ3.1, SPERR, ZFP, and HPEZ across various error bounds. For instance, on the ERA5-2023-T temperature dataset at a relative error of 10^-5, it achieved a compression ratio of 37.8, meaning the compressed data is about 38 times smaller than the original, compared to 32.6 for the next best method. The approach maintains high fidelity, with decompression errors tightly clustered near zero, as shown in error distribution analyses.

This method matters because it enables scientists to handle multi-terabyte datasets more effectively, facilitating faster data sharing and analysis in applications like weather forecasting and oceanography. By leveraging AI, it adapts to diverse data types and resolutions, overcoming limitations of traditional compressors that struggle with nonlinear dynamics and long-range dependencies. For the general public, this could lead to more accurate climate models and timely environmental insights, supporting global efforts in sustainability and disaster preparedness.

Limitations include the computational cost of training, which took up to 72 hours for longer context lengths in tests, and the need for careful tuning of parameters like vocabulary size and top-k values to avoid overfitting or underperformance. The paper notes that overly large models or vocabularies can reduce compression efficiency, indicating that optimal performance requires balancing model capacity with resource constraints. Future work may focus on improving efficiency through lightweight architectures and adaptive sampling strategies.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
