Scale AI benchmark finds agents complete under 5% of real tasks

TL;DR

Scale AI's Remote Labor Index tests AI agents on real freelance tasks across 23 professional sectors. Six months after launch, the best agents still complete fewer than 5% of them to a client's standard.

The most honest number in AI right now might be 2.5%. That is the share of professional freelance tasks a top agent completed to a paying client's standard when Scale AI and the Center for AI Safety launched their Remote Labor Index in late 2025. Six months later, HRD Australia reports the ceiling still sits below 5%.

Measuring the gap between benchmark scores and real work has been harder than it should be. Most benchmarks dominating model leaderboards test isolated skills: coding on well-defined problems, answering factual questions, summarizing documents. The Remote Labor Index, or RLI, was designed differently. It asks whether an agent can take a task from briefing to deliverable, end to end, and produce output a client would actually accept.

"After six months we are still seeing less than 5%," said Udari Madhushani Sehwag, security and policy research lead at Scale AI and a contributor to the benchmark. "They started very low initially and we are still in this very low region."

The full-task methodology

Tasks were drawn from real postings on digital labor platforms including Upwork, spanning 23 professional sectors, among them video editing, logo design, architecture, data analysis, jewelry design, and game development. Evaluators then compared AI output to human-produced work on a single criterion: would a paying client accept this deliverable?

That framing matters more than it might seem. An agent might produce a window design that looks geometrically coherent, Sehwag noted, while missing structural specifications any trained architect would flag immediately. Passing a component-level test and completing a professional deliverable require different capabilities. Agents must manage context across many steps, anticipate domain constraints, and handle dependencies that do not appear in the initial brief. Current systems consistently fail before the finish line.

The RLI's slow progress over six months suggests something structural, not just a gap waiting to close with the next model release. Long-horizon task completion requires stable state management, error recovery, and coherent reasoning across extended sequences. These properties are fundamentally harder to improve than accuracy on fixed evaluation sets.

The benchmark inflation problem

There is a well-documented pattern in AI evaluation: benchmarks saturate quickly, models tune toward them, and scores climb while practical utility lags. The RLI resists this dynamic because it uses human economic judgment as the criterion. A client either pays for the output or does not.

New models have been claiming top positions on narrow tasks at a rapid pace. Crypto Briefing recently covered Z.ai's GLM-5.2, which reportedly outperforms GPT-5.5 on long-horizon coding benchmarks at significantly lower cost. Model release trackers like Price Per Token log dozens of new deployments monthly, reflecting an industry concentrated on raw capability gains rather than the end-to-end task completion enterprise buyers actually need.

Enterprise deployment has not slowed in response. CNBC has tracked accelerating AI infrastructure investment through mid-2026, with major vendors racing to expand model offerings even as evidence of meaningful task automation remains thin. The RLI result suggests the deployment curve and the capability curve are running on separate tracks.

For ML engineers and applied scientists building or evaluating agent systems, the practical implication is direct: isolated benchmark scores are not predictive of full-task success in professional settings. End-to-end evaluation against representative domain tasks is necessary, not optional.
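One way to make that distinction concrete in an internal evaluation is to report two numbers side by side: how often an agent passes narrow component checks, and how often a reviewer accepts the whole deliverable. The sketch below is a minimal illustration only. The `TaskResult` record, its field names, and the sample data are all hypothetical; they are not the RLI's schema, scoring code, or results.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent attempt at a full task. All fields are hypothetical."""
    sector: str
    component_checks: list[bool]  # narrow checks on individual steps
    client_accepted: bool         # reviewer verdict on the whole deliverable


def component_pass_rate(results: list[TaskResult]) -> float:
    """Share of individual component checks passed, pooled across tasks."""
    checks = [c for r in results for c in r.component_checks]
    return sum(checks) / len(checks) if checks else 0.0


def end_to_end_rate(results: list[TaskResult]) -> float:
    """Share of tasks whose full deliverable was accepted."""
    if not results:
        return 0.0
    return sum(r.client_accepted for r in results) / len(results)


# Made-up sample data for illustration; not drawn from the benchmark.
results = [
    TaskResult("architecture", [True, True, False], False),
    TaskResult("data analysis", [True, True, True], True),
    TaskResult("logo design", [True, False, True], False),
    TaskResult("video editing", [True, True, True], False),
]

print(f"component pass rate: {component_pass_rate(results):.0%}")  # 83%
print(f"end-to-end rate:     {end_to_end_rate(results):.0%}")      # 25%
```

In this toy data the agent passes most component checks yet delivers only one acceptable result, which is the pattern the article describes. Note the last row: passing every component check still does not guarantee that a reviewer accepts the deliverable.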

The RLI will continue tracking agents as the model landscape evolves. Whether the next six months show more movement than the first remains, as of mid-2026, an open question worth watching.

FAQ

What is the Remote Labor Index (RLI)?
A benchmark developed jointly by Scale AI and the Center for AI Safety that measures AI agent performance on real paid digital work tasks drawn from freelance platforms, evaluating whether output meets a paying client's standard rather than a fixed test-set answer.

Why do AI agents fail at professional tasks despite strong benchmark scores?
Most benchmarks test isolated skills. Completing a full professional task requires managing multi-step context, domain-specific constraints, and error recovery across extended workflows. These properties are harder to optimize than accuracy on isolated skills, and current agent frameworks have not cracked them at production scale.

Which sectors does the RLI benchmark cover?
The benchmark spans 23 sectors including video editing, logo design, architecture, data analysis, jewelry design, and game development, all sourced from real digital labor platform postings.

What does a sub-5% AI agent task completion rate mean for enterprise deployments?
Agent performance on internal demos or narrow evaluations is likely not representative of production outcomes on complex professional work. Organizations should run end-to-end evaluations on domain-specific tasks before deploying at scale, rather than relying on leaderboard position as a proxy for real-world utility.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
