Artificial intelligence is transforming weather prediction, but a new study reveals these systems perform worse in low-income countries, raising concerns about global fairness. Researchers from Brown University developed a tool called SAFE (Stratified Assessments of Forecasts over Earth) to evaluate AI weather models across different geographic and economic regions. Their findings show that as forecasts extend beyond a week, accuracy disparities grow, with the poorest nations experiencing the largest errors. This gap could impact everything from agriculture to disaster preparedness in vulnerable communities.
The key discovery is that AI-based weather prediction models exhibit significant performance variations when assessed by country, income level, and land type. Using SAFE, the team calculated root mean square error (RMSE) for temperature at 850 hPa (T850) and geopotential height at 500 hPa (Z500) across 121 models. They found that models like FuXi and NeuralGCM showed the smallest fairness gaps initially, but all systems displayed increasing errors in low-income regions over time. For example, when predicting T850, the greatest absolute difference in RMSE between strata widened notably after 72 hours, with low-income areas consistently facing higher inaccuracies.
To conduct this analysis, the researchers stratified global weather data by four attributes: country territories, United Nations subregions, World Bank income classifications, and land cover (land or water). They used ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts as ground truth, applying latitude-weighted RMSE so that grid cells near the poles, which cover less area on a regular latitude-longitude grid than cells near the equator, do not dominate the error. The SAFE package computed per-stratum errors for lead times from 12 to 240 hours, filtering outliers using local outlier factor detection to ensure robust results. This methodology allowed precise comparison of model performance across 195 countries and territories, avoiding the coarse rectangular regions common in prior evaluations.
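To make the stratified metric concrete, here is a minimal sketch of a latitude-weighted RMSE computed per stratum. It is not the SAFE package's API: the function name `stratified_lat_weighted_rmse`, the nested-list grid layout, and the choice to normalize by the sum of cosine-latitude weights inside each stratum are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def stratified_lat_weighted_rmse(forecast, truth, lats_deg, strata):
    """Latitude-weighted RMSE for each stratum (hypothetical helper, not SAFE's API).

    forecast, truth: 2D lists indexed [lat][lon] holding one field (e.g. T850).
    lats_deg: latitude in degrees for each row of the grid.
    strata: 2D list of labels [lat][lon], e.g. an income group per grid cell.
    """
    weighted_sq_err = defaultdict(float)  # sum of w * error^2 per stratum
    weight_total = defaultdict(float)     # sum of w per stratum

    for i, lat in enumerate(lats_deg):
        # Assumption: cos(latitude) approximates relative grid-cell area.
        w = math.cos(math.radians(lat))
        for j, label in enumerate(strata[i]):
            err = forecast[i][j] - truth[i][j]
            weighted_sq_err[label] += w * err * err
            weight_total[label] += w

    return {
        label: math.sqrt(weighted_sq_err[label] / weight_total[label])
        for label in weighted_sq_err
        if weight_total[label] > 0
    }


# Toy 2x2 grid with made-up values (not data from the study).
toy_rmse = stratified_lat_weighted_rmse(
    forecast=[[1.0, 1.0], [2.0, 2.0]],
    truth=[[0.0, 0.0], [0.0, 0.0]],
    lats_deg=[60.0, 0.0],
    strata=[["low", "high"], ["low", "high"]],
)
```

Because each grid cell carries a label, the same loop works for any of the four attributes: swapping the `strata` grid from income groups to countries or subregions changes the grouping without touching the error computation.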
The results, detailed in tables throughout the paper, demonstrate that variance in RMSE increases with lead time, particularly for income and subregion attributes. In one benchmark, the greatest absolute difference for T850 predictions in low-income strata reached over 2.0 K after 240 hours, compared to under 0.1 K at 12 hours. Similarly, for Z500, differences escalated from around 10 m²/s² to over 600 m²/s² in the same period. These disparities highlight that globally averaged metrics mask critical local weaknesses, with models often blurring extreme weather events that are vital for safety and economic planning.
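The two summary statistics discussed here, the greatest absolute difference between strata and the variance across strata, can be sketched as follows. This is an illustrative reading of those terms rather than the authors' code: `fairness_summary` is a hypothetical name, the greatest absolute difference is taken as the largest per-stratum RMSE minus the smallest, and the input numbers are invented placeholders, not results from the paper.

```python
from statistics import pvariance


def fairness_summary(rmse_by_lead):
    """Summarize the spread of per-stratum RMSE at each lead time.

    rmse_by_lead: {lead_hours: {stratum_label: rmse}} (assumed layout).
    Returns {lead_hours: {"greatest_abs_diff": ..., "variance": ...}}.
    """
    summary = {}
    for lead, per_stratum in sorted(rmse_by_lead.items()):
        values = list(per_stratum.values())
        summary[lead] = {
            # Largest gap between any two strata equals max minus min.
            "greatest_abs_diff": max(values) - min(values),
            # Population variance of RMSE across the strata.
            "variance": pvariance(values),
        }
    return summary


# Placeholder RMSE values in kelvin, chosen only to show the mechanics.
example = {
    12: {"low income": 1.0, "high income": 0.9},
    240: {"low income": 4.0, "high income": 2.0},
}
spread = fairness_summary(example)
```

A gap that grows from the first lead time to the last, as in this toy input, is the pattern the article describes: a single global average would hide it, while the per-stratum spread exposes it.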
This research matters because AI weather models are increasingly used by organizations like NOAA and in public apps, influencing decisions from farming to emergency response. Inaccurate forecasts in developing regions could exacerbate climate vulnerabilities, affecting food security and disaster resilience. The study underscores that fairness in AI isn't just a social issue but a practical one, where biased predictions may lead to unequal access to reliable information. By exposing these gaps, the work urges developers to incorporate stratified assessments during model training to improve equity.
The study has limitations: SAFE currently operates at a fixed spatial resolution and does not address how training data imbalances might perpetuate these errors. The authors note that future work should integrate implicit neural representations to better handle coastlines and islands, and explore how objective functions could be adjusted to reduce disparities. As AI becomes central to forecasting, this study calls for a shift from global averages to localized, fair evaluations to ensure all communities benefit equally from technological advances.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.