As AI agents take on more complex, multi-step tasks like software development and data analysis, their ability to manage time becomes crucial for effective planning and coordination. However, a new study reveals a fundamental limitation: these models cannot accurately estimate how long their own actions will take. This gap in temporal self-awareness poses significant risks for deploying AI in time-sensitive applications, from emergency response to resource allocation, where misjudging duration can have direct operational consequences. The research, conducted across 68 tasks and four model families, shows that even state-of-the-art models like GPT-5 and GPT-4o consistently overestimate task durations by large margins, relying on heuristics rather than genuine self-knowledge.
The key finding is that large language models lack experiential grounding in their own inference time, leading to systematic errors in temporal self-estimation. In pre-task estimates, models overshoot actual duration by 4–7 times, with a median ratio of 6.11× for GPT-5 and 3.60× for GPT-4o, as shown in Table 2. This means they predict human-scale minutes for tasks that complete in seconds. More critically, on relative ordering tests with counter-intuitive task pairs—where harder-labeled tasks actually finish faster—GPT-5 scored only 18% accuracy, significantly below chance, indicating reliance on superficial complexity cues rather than actual processing time. Post-hoc recall is similarly disconnected, with estimates diverging from reality by an order of magnitude in either direction, and these failures persist in multi-step agentic settings with errors of 5–10 times.
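The overestimation metric described above is a per-task ratio of estimated to actual duration, summarized by its median. A minimal sketch, using illustrative numbers rather than the study's data:

```python
# Sketch: per-task overestimation ratios (estimated / actual duration)
# and their median, mirroring the paper's pre-task calibration metric.
# The durations below are hypothetical, not the study's measurements.
import statistics

def median_overestimation_ratio(estimated_s, actual_s):
    """Median of estimated/actual duration across tasks."""
    ratios = [est / act for est, act in zip(estimated_s, actual_s)]
    return statistics.median(ratios)

# Hypothetical model estimates (seconds) vs. measured wall-clock times.
estimates = [120.0, 60.0, 300.0, 45.0]
actuals = [20.0, 12.0, 50.0, 9.0]

print(median_overestimation_ratio(estimates, actuals))  # prints 5.5
```

A median above 1 means the model systematically predicts longer durations than it actually takes; the paper reports medians of 6.11× and 3.60× for GPT-5 and GPT-4o.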
The methodology involved four experiments designed to probe different aspects of temporal self-estimation. First, absolute calibration tested whether pre-task estimates correlate with actual execution time across 68 tasks spanning categories like code generation, debugging, and reasoning. Models were prompted to estimate duration in seconds before execution, with wall-clock time measured from API request to response completion. Second, relative ordering evaluated whether models could identify which of two tasks takes longer, using 26 hard pairs curated to defeat surface heuristics, including near-identical and counter-intuitive pairs. Third, post-hoc recall assessed whether models could report duration after task completion without external timing information. Finally, agentic tasks extended the analysis to multi-step scenarios with tool use, such as building a landing page or debugging a project, using a ReAct agent with bash and Python tools.
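The absolute-calibration setup can be sketched as a simple timing harness: prompt for an estimate, run the task, and measure wall-clock time around the call. This is an illustrative skeleton, assuming a hypothetical `call_model` client in place of a real API:

```python
# Minimal sketch of a timing harness like the one described above:
# ask the model for a pre-task estimate, run the task, and measure
# wall-clock time from request to response. `call_model` is a
# hypothetical stand-in for a real API client.
import time

def call_model(prompt: str) -> str:
    # Placeholder for an actual model API call.
    return "stub response"

def timed_run(task_prompt: str):
    estimate_reply = call_model(
        f"Before solving, estimate in seconds how long you will take:\n{task_prompt}"
    )
    start = time.perf_counter()
    answer = call_model(task_prompt)
    actual_s = time.perf_counter() - start
    return estimate_reply, answer, actual_s
```

Using `time.perf_counter()` rather than `time.time()` avoids clock adjustments distorting short measurements.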
Analysis, detailed in figures and tables, reveals consistent patterns across model families. Frontier models like GPT-5 and GPT-4o show weak positive correlation with actuals (r = 0.55 and 0.35, respectively), but with substantial bias, as illustrated in Figure 1. Open-source models like OLMo3-7B and Qwen3-8B show no significant correlation, clustering around arbitrary values. Counter-intuitive pairs provided the clearest evidence: GPT-5’s 18% accuracy on 11 such pairs, as shown in Figure 2, demonstrates systematic failure when complexity labels mislead. Post-hoc estimates varied widely, with GPT-4o claiming 42 seconds for tasks completing in 8 seconds, a 5.2× overestimation. In agentic tasks, pre-task estimates were 5–10× off, and post-hoc estimates were even more disconnected, with GPT-4o claiming 30 seconds for tasks that ran 10 minutes.
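The two headline analyses, correlation between estimated and actual durations and accuracy on longer/shorter pair ordering, reduce to short computations. A sketch on made-up data (the values are not the study's):

```python
# Sketch of the two analyses: Pearson correlation between estimated and
# actual durations, and accuracy on longer/shorter ordering of task pairs.
# All data below is illustrative.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ordering_accuracy(pairs):
    """pairs: (predicted_first_is_longer, actual_first_is_longer) booleans."""
    return sum(pred == actual for pred, actual in pairs) / len(pairs)

est = [120, 60, 300, 45, 90]       # hypothetical estimates (s)
actual = [20, 12, 50, 9, 30]       # hypothetical wall-clock times (s)
print(round(pearson_r(est, actual), 2))
print(ordering_accuracy([(True, True), (True, False), (False, False)]))
```

Note that a high correlation is compatible with large bias: a model can rank tasks roughly correctly while still overshooting every duration, which is what the frontier models exhibit.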
The implications of this limitation are practical and far-reaching for real-world AI deployment. In settings where latency, deadlines, or scarce resources constrain behavior—such as medical triage, emergency response, or interactive computer use—agents that cannot estimate their own duration cannot schedule themselves reliably. This affects planning, delegation, and coordination in multi-agent systems, as timing errors can compound across levels. The study suggests that effective solutions require external timing infrastructure, historical logging, and system-level timeouts, rather than relying on model self-regulation. Future work should explore training with explicit timing signals and architectures that better retain temporally grounded state to address this gap.
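The suggested mitigations—external timeouts and historical duration logging instead of model self-reports—can be sketched as a thin wrapper around agent steps. Names and defaults here are illustrative, not from the paper:

```python
# Sketch of system-level mitigations: enforce an external timeout around each
# agent step and log measured durations, so future estimates come from
# history rather than the model's self-reports. Names are illustrative.
import concurrent.futures
import statistics
import time

duration_log: dict[str, list[float]] = {}

def run_with_timeout(task_name, fn, timeout_s):
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        result = future.result(timeout=timeout_s)  # raises TimeoutError if exceeded
    duration_log.setdefault(task_name, []).append(time.perf_counter() - start)
    return result

def historical_estimate(task_name, default_s=60.0):
    """Estimate duration from logged history instead of model self-reports."""
    history = duration_log.get(task_name)
    return statistics.median(history) if history else default_s
```

The key design choice is that the scheduler never asks the model how long it took; duration is always measured externally and accumulated per task type.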
Limitations of the research include its focus on specific models and 68 English-language tasks, which may not generalize to other domains or languages. The study primarily uses correlation as a metric, and future investigations could examine whether architectural changes, such as timing tokens or compute-aware training, could provide the missing grounding. Additionally, the counter-intuitive pairs, while diagnostic, represent a curated subset, and broader task suites might reveal different patterns. The findings underscore that while models possess propositional knowledge about duration from training data, they lack the experiential grounding needed for accurate self-estimation, a gap that must be addressed for robust AI agent systems.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.