As artificial intelligence tools become embedded in climate science and policy, a critical question emerges: do the environmental benefits of specialized AI systems outweigh their own energy footprints? A new study examines the real-world energy consumption of climate-domain chatbots compared to generic models, revealing that design choices can dramatically increase power use without necessarily improving answer quality. This research shifts the focus from model training to inference-time energy, which scales with usage and could offset the climate advantages these tools aim to support.
The researchers found that domain-specific retrieval-augmented generation (RAG) systems, such as the climate chatbots ChatNetZero and ChatNDC, do not inherently use less energy than a generic model like GPT-4o-mini. In fact, ChatNDC consumed the most energy per query at 4.53 × 10⁻³ kWh, which is about four times higher than GPT-4o-mini's 1.13 × 10⁻³ kWh and over 10 times higher than ChatNetZero's 4.08 × 10⁻⁴ kWh. These results show that energy consumption varies widely with workflow complexity, with more agentic designs drawing substantially more power. The study also linked energy use to output length, noting that longer responses tend to increase both energy demand and embellishment—non-factual statements—without clear gains in factual accuracy.
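The per-query figures above make the ratios easy to verify. A minimal sketch (system names and values taken from the study's reported numbers):

```python
# Per-query energy reported in the study, in kWh.
ENERGY_KWH = {
    "ChatNDC": 4.53e-3,
    "GPT-4o-mini": 1.13e-3,
    "ChatNetZero": 4.08e-4,
}

def energy_ratio(system_a: str, system_b: str) -> float:
    """How many times more energy system_a uses per query than system_b."""
    return ENERGY_KWH[system_a] / ENERGY_KWH[system_b]

print(round(energy_ratio("ChatNDC", "GPT-4o-mini"), 1))  # about 4x
print(round(energy_ratio("ChatNDC", "ChatNetZero"), 1))  # over 10x
```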
To assess energy consumption, the team developed a query-level methodology that decomposes RAG workflows into retrieval, generation, and hallucination-checking components. They tested 102 domain-specific climate questions across four pipelines: ChatNetZero, ChatNDC, GPT-4o-mini, and a constrained version of GPT-4o-mini limited to 200 words. Energy estimates were based on end-to-end response times, using a refined version of an existing API-based framework that incorporates factors like GPU utilization and power usage effectiveness. Experiments were conducted at different times of day from the Netherlands to account for network latency variations, with all models set to a temperature of 0 for reproducibility.
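The core of a time-based estimate like this multiplies measured response time by assumed hardware power, utilization, and data-center overhead. A minimal sketch, with illustrative constants (the paper's refined framework uses its own values, which are not reproduced here):

```python
def estimate_energy_kwh(response_time_s: float,
                        gpu_power_w: float = 300.0,    # assumed GPU board power
                        gpu_utilization: float = 0.5,  # assumed average utilization
                        pue: float = 1.2) -> float:    # assumed power usage effectiveness
    """Rough per-query energy: time x power x utilization x PUE, converted J -> kWh."""
    joules = response_time_s * gpu_power_w * gpu_utilization * pue
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

# A 10-second end-to-end response under these assumptions:
print(estimate_energy_kwh(10.0))  # 0.0005 kWh
```

The PUE factor folds in cooling and facility overhead on top of the GPU's own draw, which is why API-time-based estimates remain approximations rather than server-side measurements.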
The data shows that inference accounts for the largest share of energy use across all systems, but hallucination checking becomes a major contributor in more complex designs. In ChatNDC, which uses an additional LLM call for verification, hallucination checking made up 30.9% of total energy, compared to only 8.4% in ChatNetZero, which relies on a lighter cosine-similarity check. Factual accuracy scores were similar across models, with mean scores around 0.6 to 0.7, but embellishment varied significantly: ChatNetZero had the lowest mean embellishment score at 0.15, while ChatNDC and the GPT models scored between 0.55 and 0.62. This suggests that adding energy-intensive verification steps does not automatically yield better quality, as seen in Figure 8, where higher energy use did not correlate with higher factual scores.
These findings have immediate implications for AI developers and climate researchers aiming to balance accuracy with sustainability. The study highlights that output length and workflow design are key drivers of energy consumption, with agentic RAG systems potentially wasting power on unnecessary steps. For instance, ChatNDC's requirement for 400-600 word responses led to higher embellishment, indicating that forcing longer answers can degrade quality while increasing energy use. The researchers suggest that future systems could improve efficiency by routing queries based on complexity—using lightweight pipelines for simple facts and reserving more compute-intensive workflows for complex reasoning—akin to how smart thermostats adjust energy use based on need.
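Complexity-based routing of this kind can be as simple as a heuristic gate in front of the pipelines. A toy sketch (the cue words and length cutoff are illustrative assumptions, not from the study):

```python
def route_query(query: str) -> str:
    """Send short, fact-style questions to a lightweight pipeline and
    longer analytical ones to the heavier agentic RAG workflow."""
    analytical_cues = ("why", "compare", "explain", "trade-off", "how does")
    q = query.lower()
    if len(q.split()) <= 8 and not any(cue in q for cue in analytical_cues):
        return "lightweight"
    return "agentic-rag"

print(route_query("What is a Nationally Determined Contribution?"))
# -> lightweight
print(route_query("Compare national net-zero pledges and explain their trade-offs."))
# -> agentic-rag
```

In practice such a gate would more likely be a small classifier, but even a crude rule avoids spending a multi-call agentic workflow on a one-line factual lookup.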
However, the study has limitations. Energy estimates rely on API response times rather than direct server-side measurements, making them approximations rather than precise figures. The experiments focused on single-query interactions and a limited set of models, primarily from the GPT family, which may not capture broader trends across different providers or hardware setups. Additionally, the study did not account for multi-turn conversations or live search features, which could further increase energy use in real-world applications. Future research should expand to include more models, locations, and session-level analyses to better understand the full environmental impact of AI in climate domains.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn