AI Agents Fail at Real-World Remote Work

As artificial intelligence systems advance rapidly, their potential to automate jobs has sparked widespread societal concern. However, a new study reveals that current AI agents struggle to handle the complexity of real-world remote work, achieving success in only a tiny fraction of tasks. This finding provides crucial empirical evidence for policymakers, businesses, and the public to understand AI's current limitations and prepare for future workforce changes.

The researchers introduced the Remote Labor Index (RLI), a benchmark designed to measure AI's ability to complete practical, economically valuable remote work. They found that the highest-performing AI agents, including models like ChatGPT and Gemini Pro, achieved an automation rate of just 2.5% on the RLI. In other words, these systems failed to deliver work at the level of a human professional on more than 97% of projects, highlighting a significant gap between AI capabilities and the demands of the digital labor market.

To build the RLI, the team sourced 240 real-world projects from freelancing platforms like Upwork, covering diverse fields such as video animation, graphic design, architecture, and software development. Each project included a brief, input files, and a gold-standard deliverable produced by a human freelancer, ensuring the benchmark reflects actual market transactions. The researchers then evaluated AI agents by having them generate deliverables based on the same inputs, with human evaluators comparing AI outputs to the human standards using a rigorous manual process.
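To make the headline figure concrete, the short Python sketch below computes an automation rate as the share of projects where the AI deliverable was accepted. It assumes every one of the 240 projects counts toward the rate, in which case 2.5% works out to about six projects. The function name and the list of outcomes are hypothetical and used only for illustration; they are not taken from the study.

```python
def automation_rate(outcomes):
    """Share of projects whose AI deliverable was judged acceptable.

    `outcomes` is a list of booleans, one per project (hypothetical data).
    """
    if not outcomes:
        raise ValueError("no projects evaluated")
    return sum(outcomes) / len(outcomes)


# Illustrative outcomes only: 6 accepted deliverables out of 240 projects.
outcomes = [True] * 6 + [False] * 234
print(f"{automation_rate(outcomes):.1%}")  # prints 2.5%
```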

The results show that AI agents consistently underperformed across most project types. Common failure modes included corrupted files, incomplete deliverables, and poor-quality outputs, such as child-like drawings or inconsistent 3D renderings. For example, in tasks requiring video production or architectural plans, AI agents often submitted truncated videos or designs that did not match the supplied sketches. Despite these shortcomings, an Elo-based scoring system that tracks relative performance showed steady progress, indicating that newer models are gradually improving, though they remain far from human-level competence.
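For readers unfamiliar with Elo ratings, the sketch below shows the textbook Elo update applied to a single head-to-head comparison between two agents. It is only an illustration of the general idea: the study's exact scoring variant is not described here, and the starting ratings, the K-factor of 32, and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1.0 if A's deliverable was preferred, 0.0 if B's was,
    and 0.5 for a tie. k is an assumed K-factor, not a value from the study.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two equally rated agents; A's deliverable is preferred.
print(update(1000, 1000, 1.0))  # prints (1016.0, 984.0)
```

Because ratings of this kind only move relative to one another, a rising score shows that a newer model beats older ones more often, not that it has reached human-level quality.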

This research matters because it grounds discussions about AI automation in empirical data, helping stakeholders assess risks and opportunities for the remote workforce. By demonstrating that current AIs cannot autonomously handle complex, creative, or technical tasks, the study suggests that jobs requiring these skills may be less immediately threatened than previously thought. However, the steady improvement in AI performance signals that continuous monitoring is essential to anticipate future shifts.

Limitations of the study include its exclusion of certain types of work, such as jobs requiring physical presence or client interaction, and its focus on individual tasks rather than team-based projects. Additionally, the cost data from freelancers may not account for inflation, potentially underestimating the economic value of the work. Because of these gaps, a high automation rate on the RLI would not necessarily mean full human replacement in all remote roles, which underscores the need for broader evaluations in future research.

About the Author

Guilherme A.