TL;DR
A 630-line Python script is doing what most ML research teams cannot: running hundreds of experiments overnight without human supervision.
The headline from Karpathy's AutoResearch project is easy to sensationalize: an AI agent ran 700 machine learning experiments in two days and identified 20 improvements worth keeping. The more useful framing is what the number means in context. A standard academic ML team might run 20 experiments in a week. AutoResearch ran 700 in two days, roughly the time it takes to sleep on a problem.
Andrej Karpathy published the project on GitHub in March 2026. The repository accumulated over 21,000 stars within days. A post about the results reached 8.6 million views in 48 hours — exceptional even for an AI announcement in a year when exceptional AI announcements have become routine.
How the loop works
AutoResearch is 630 lines of Python. The core structure is a loop: an AI coding agent receives a research objective, proposes a modification to a training pipeline, implements it, runs the experiment, reads the output, and decides what to try next. The agent is not iterating randomly. It accumulates results across hundreds of runs and adjusts its strategy based on what has and has not worked.
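The propose, run, read, decide cycle described above can be sketched in a few lines. This is a minimal illustration, not code from the AutoResearch repository: the names `Loop`, `Result`, `stub_propose`, and `make_stub_evaluate` are hypothetical, the agent and the training run are replaced by stubs, and the keep rule (keep a change only if it beats the current best metric, lower being better) is an assumption.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Result:
    change: str    # description of the modification that was tried
    metric: float  # lower is better, e.g. a validation loss
    kept: bool


@dataclass
class Loop:
    baseline: float
    history: list[Result] = field(default_factory=list)

    def step(self, propose, evaluate) -> Result:
        # The proposer sees the full history, so it can avoid repeating itself.
        change = propose(self.history)
        metric = evaluate(change)
        kept = metric < self.baseline
        if kept:
            # An accepted change becomes the new starting point.
            self.baseline = metric
        result = Result(change, metric, kept)
        self.history.append(result)
        return result


def stub_propose(history):
    # Stand-in for the AI coding agent: picks a change it has not tried yet.
    tried = {r.change for r in history}
    options = [f"tweak-{i}" for i in range(100) if f"tweak-{i}" not in tried]
    return options[0]


def make_stub_evaluate(seed=0):
    rng = random.Random(seed)

    def evaluate(change):
        # Stand-in for a training run: returns a noisy loss around 1.0.
        return 1.0 + rng.uniform(-0.05, 0.05)

    return evaluate


def run(n_experiments=20, seed=0):
    loop = Loop(baseline=1.0)
    evaluate = make_stub_evaluate(seed)
    for _ in range(n_experiments):
        loop.step(stub_propose, evaluate)
    return loop
```

In a real setup, `propose` would call a coding agent with the history as context and `evaluate` would launch an actual training job. The shape of the loop is the part that carries over.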
The fixed constraints are deliberate. Each experiment is bounded by a compute budget — in the published version, a small number of GPU-minutes per run — which forces the agent to explore efficiently rather than exhaustively. The 700 experiments did not require a datacenter. They ran within a defined resource ceiling.
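One simple way to enforce a per-run ceiling is a wall-clock deadline around the training loop. The sketch below is an assumption about how such a budget could be implemented, not a description of how AutoResearch does it. The function name `run_with_budget` and its parameters are hypothetical, and wall-clock seconds stand in for GPU-minutes.

```python
import time


def run_with_budget(step, budget_seconds, clock=time.monotonic):
    """Call step() repeatedly until the time budget is spent.

    Returns the number of steps that were started within the budget.
    The clock is injectable so the behaviour can be checked without waiting.
    """
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:
        step()
        steps += 1
    return steps
```

Because every run gets the same budget, results from different proposals stay comparable: a change that only helps when given more compute cannot win simply by running longer.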
The 20 improvements that survived yielded an aggregate 11% reduction in training time on the target task. That number is not revolutionary in isolation. The point is that a human researcher did not generate any of those optimizations. The agent did.
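The reported figures imply a low hit rate and small individual gains. The arithmetic below uses only the numbers quoted above. The per-change estimate additionally assumes that all 20 improvements were equal in size and compounded multiplicatively, which is an illustrative simplification rather than anything the project reported.

```python
def keep_rate(kept, experiments):
    # Fraction of experiments whose change was worth keeping.
    return kept / experiments


def equal_per_change_reduction(total_reduction, kept):
    # Assumes each kept change cuts training time by the same fraction r,
    # so that (1 - r) ** kept == 1 - total_reduction.
    return 1 - (1 - total_reduction) ** (1 / kept)


rate = keep_rate(20, 700)                         # about 0.029, i.e. 2.9%
per_change = equal_per_change_reduction(0.11, 20)  # about 0.0058, i.e. 0.58%
```

Under that assumption, roughly 1 experiment in 35 paid off, and each payoff was worth a little over half a percent. Gains of that size are hard to justify chasing by hand, which is part of why unattended volume matters.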
Shopify's overnight test
The most widely cited external validation came from Tobias Lutke, the CEO of Shopify. Lutke ran a variant of the setup on his own systems and reported 37 experiments overnight with a 19% performance gain. That result does not replicate Karpathy's experiment directly — different infrastructure, different task, different agent configuration — but it establishes that the basic approach is not an artifact of a single carefully chosen benchmark.
Lutke's public post triggered a wave of replications across the ML community. The pattern that emerged was consistent: systems that ran unattended overnight produced optimization candidates that would have taken a human team days to generate, even when the quality of any individual suggestion was uneven.
The honest caveat is that AutoResearch is most effective on tasks with fast, verifiable feedback loops. Training loss on a small model is easy to measure. Validating a drug candidate is not. The technique generalizes well within a subset of ML research; its applicability to broader scientific domains remains an open question.
Why 630 lines matters
The AutoResearch codebase is deliberately minimal. Karpathy has been explicit about the design decision: a small implementation is easier to understand, easier to modify, and harder to break in ways that are difficult to debug. The 630-line version is not a prototype for a larger system; it is itself the thesis, a claim about how little the core idea actually requires.
That minimalism has a side effect. Researchers who want to adapt AutoResearch to their own problems can read the entire codebase in an afternoon. Several dozen derivative projects appeared on GitHub within weeks of the initial release, targeting domains from physics simulation to compiler optimization.
The barrier to running it is access to a capable AI coding agent and some GPU time. Those are not trivial constraints, but they are lower than building the infrastructure from scratch.
What changes
The practical consequence of AutoResearch is not that human researchers are replaced. It is that the ratio of experiments to researcher-hours shifts substantially. An ML team that previously spent most of its time running and analyzing individual experiments can now run a large fraction of the iterative work unattended and shift human attention toward experiment design and result interpretation.
The less comfortable version of that framing: teams that adopt this approach will explore much larger search spaces than teams that do not. The productivity gap between well-resourced and poorly-resourced research groups may widen rather than close.
Karpathy has described his personal research practice as having largely delegated both coding and iteration to AI agents. AutoResearch is the public, open-source expression of that instinct applied to ML experimentation at scale.
---
FAQ
Q: What is Andrej Karpathy's AutoResearch?
A: AutoResearch is an open-source Python project that uses an AI coding agent to run machine learning experiments autonomously. The agent proposes modifications to a training pipeline, implements them, runs each experiment, reads the results, and iterates — without human involvement between cycles.
Q: How many experiments did AutoResearch run in the original test?
A: 700 experiments over approximately 48 hours, identifying 20 optimizations that reduced training time by 11%.
Q: Can AutoResearch be applied beyond machine learning?
A: The technique requires fast, verifiable feedback loops — conditions that apply readily to ML training metrics. Domains with longer validation cycles present higher barriers, though researchers are adapting the approach to areas including compiler optimization and physics simulation.
Q: What is "The Karpathy Loop"?
A: A name the AI community applied to the AutoResearch pattern: an iterative cycle where an AI agent proposes, implements, tests, and learns from ML experiments without human intervention between cycles.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.