TL;DR
Developers report Claude failing on complex workflows after Anthropic quietly reduced reasoning depth, raising transparency concerns ahead of a potential IPO.
Anthropic's Claude models are underperforming on complex engineering tasks, and the developer community has made its displeasure known. Reports of instruction-following failures, premature shortcuts, and elevated error rates on multi-step workflows have mounted steadily over recent weeks, pointing toward a deliberate but undisclosed change in how the models operate.
The suspected mechanism is specific: AOL Finance reports that Anthropic quietly reduced Claude's default "effort" level, cutting the number of tokens processed per response to lower per-query compute costs. The tradeoff is well understood by anyone who builds inference pipelines. Fewer tokens mean shallower reasoning, and shallower reasoning degrades exactly the tasks that make frontier models worth paying for: complex code generation, multi-step agent workflows, and instruction-dense prompts.
The performance gap
For ML engineers running Claude through production pipelines, the degradation surfaces in predictable ways. The model exits reasoning chains earlier than expected, defaults to heuristic shortcuts over systematic problem decomposition, and loses coherence over longer instruction sequences. No single failure is catastrophic. The cumulative effect, for teams that had calibrated their systems against prior Claude behavior, is a model that no longer matches the baseline they built against.
Speculation about the underlying cause points toward compute scarcity. Anthropic has announced significantly fewer large data center deals than OpenAI or Google, even as its user adoption accelerated sharply in recent months. Sustaining inference quality across a fast-growing user base, without equivalent hardware headroom, creates obvious pressure to reduce per-request resource consumption. Cutting token budgets is a blunt but immediate lever. Anthropic has not confirmed this publicly, so the compute-constraint theory remains well-grounded inference rather than disclosed fact.
The transparency problem
What transformed a performance complaint into a reputational crisis is the distance between this behavior and Anthropic's public identity. The company has positioned itself more consistently than any other frontier lab as a safety-focused, transparent organization aligned with its users' interests. Quietly adjusting model behavior in ways that affect output quality, without publishing a changelog or notifying enterprise customers, cuts directly against that positioning.
Anthropic is valued at $380 billion and reportedly on a path toward an IPO, according to the same AOL Finance report. Developer trust is a line item in that valuation. Enterprises evaluating long-term AI infrastructure commitments weight reliability and candor heavily. A surge of public complaints about undisclosed model changes, circulating as the company courts institutional investors, is a reputational liability that compounds the operational one.
The Mythos contrast
The timing grows stranger alongside Anthropic's simultaneous rollout of Claude Mythos, a new model that PBS NewsHour describes as capable of sustained autonomous tasks comparable to those a skilled security researcher might run across an entire workday. Mythos is being tested under strict access controls by roughly 40 vetted companies, with no public release planned, specifically because of its ability to identify exploitable software vulnerabilities at scale.
LLM Stats dates the Mythos preview to April 7, arriving amid a dense cluster of releases from Google, Meta, and Zhipu AI. The contrast is hard to ignore: Anthropic's most capable model sits behind tight restrictions while its widely deployed production models appear to be running at reduced capacity. Price Per Token noted that the same period also saw Anthropic expand its managed agent features, a move that raises per-task compute demands even as base-model effort appears compressed.
What practitioners should do now
Developers building on Claude's API need to treat current model behavior as a fresh baseline, not a continuation of prior configurations. Regression test suites calibrated two or three months ago may no longer reflect production reality. Prompts that rely on extended chain-of-thought or multi-step reasoning should be stress-tested explicitly against current model versions. Where output depth is critical, explicit token budget guidance in the system prompt may recover some of the lost fidelity.
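One way to make that re-baselining concrete is a lightweight depth check on responses. The sketch below is a hypothetical, provider-agnostic harness, not any vendor's API: the class name, step-counting heuristic, and thresholds are all illustrative assumptions a team would tune against its own recorded baselines.

```python
# Hypothetical behavioral regression check for LLM responses.
# All names and thresholds here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class DepthBaseline:
    min_chars: int  # minimum response length observed before the change
    min_steps: int  # minimum enumerated reasoning steps expected


def count_steps(text: str) -> int:
    """Count lines that look like enumerated steps ('1.', '2.', ...)."""
    return sum(
        1
        for line in text.splitlines()
        if line.strip()[:2].rstrip(".").isdigit() and "." in line.strip()[:3]
    )


def meets_baseline(response: str, baseline: DepthBaseline) -> bool:
    """Flag responses shorter or shallower than the recorded baseline."""
    return (
        len(response) >= baseline.min_chars
        and count_steps(response) >= baseline.min_steps
    )


# A reply that collapsed from several steps to two should fail the check.
baseline = DepthBaseline(min_chars=200, min_steps=4)
short_reply = "1. Install deps.\n2. Done."
print(meets_baseline(short_reply, baseline))  # False
```

Checks like this belong in CI next to functional tests, run against frozen prompts, so a server-side change in effort shows up as a failing assertion rather than a production incident.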
The structural risk extends beyond Anthropic. Any frontier model running as a managed service can be tuned server-side without notice. That is industry-wide, not a uniquely Anthropic failure. But practitioners systematically underweight it when building production pipelines against hosted models they do not control or self-host.
Anthropic now faces a credibility test with two distinct audiences: the developer community that built its user base, and the institutional investors it needs to fund the next generation of infrastructure. How the company chooses to respond in the coming weeks will reveal which group it actually treats as its primary constituency.
FAQ
What caused Claude's recent performance decline?
Reports indicate Anthropic reduced Claude's default token processing depth, which limits how much reasoning the model applies per request. This appears designed to cut compute costs, though Anthropic has not officially confirmed this explanation.
Does the degradation affect all Claude models equally?
Reports focus on production Claude models available through the API and consumer products. A server-side reduction in default effort would apply broadly across versions unless specific tiers are exempted, which Anthropic has not clarified.
What is Claude Mythos, and how does it fit into this story?
Mythos is Anthropic's newest and most capable model, currently in limited preview with around 40 vetted organizations. It is not publicly available due to its advanced ability to find exploitable software vulnerabilities. The contrast between Mythos's capabilities and the apparent throttling of production Claude has sharpened criticism of how Anthropic allocates its compute resources.
Should teams consider switching model providers over this?
The core issue is not capability but behavioral stability: Claude's output characteristics shifted without notice. Any managed model service carries this risk. Teams with strict consistency requirements should either evaluate self-hosted open-weight alternatives or implement rigorous behavioral regression testing, regardless of provider.
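At the batch level, that regression testing can be as simple as comparing summary statistics of current outputs against a stored baseline corpus. This is a minimal sketch under assumed names and tolerances; real deployments would track richer signals than mean length (step counts, tool-call rates, eval scores).

```python
# Hypothetical drift detector: compare a current batch of model outputs
# against a baseline batch. Function names and the 25% tolerance are
# illustrative assumptions, not a standard API.

from statistics import mean


def length_drift(baseline_outputs: list[str], current_outputs: list[str]) -> float:
    """Relative change in mean output length; negative means outputs shrank."""
    base = mean(len(o) for o in baseline_outputs)
    curr = mean(len(o) for o in current_outputs)
    return (curr - base) / base


def flag_regression(baseline_outputs: list[str],
                    current_outputs: list[str],
                    tolerance: float = 0.25) -> bool:
    """True if mean output length dropped by more than `tolerance`."""
    return length_drift(baseline_outputs, current_outputs) < -tolerance


# Outputs that shrank by half trip the alarm.
print(flag_regression(["a" * 400, "a" * 600], ["a" * 200, "a" * 300]))  # True
```

Run on the same frozen prompt set before and after each suspected provider change, a check like this turns "the model feels shallower" into a measurable, alertable number.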
About the Author
Guilherme A.
Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.