Autoresearch, Agent Loops and the Future of Work
By The AI Daily Brief: Artificial Intelligence News
Summary
Key takeaways
- **Autoresearch Core Loop**: An AI agent iterates on train.py by editing the model architecture, hyperparameters, and optimizer; runs a 5-minute training job; evaluates on val BPB; commits improvements to a git branch or reverts; and repeats indefinitely. [02:02], [05:46]
- **Human Edits Prompt File**: Humans edit program.md, a markdown file with instructions on researcher behavior, experiments to try, and when to be bold versus conservative, while agents execute the research loop. [04:32], [06:24]
- **83 Experiments Improve Model**: In Andrej's session, 83 experiments ran with 15 improvements kept, driving val BPB from 0.9979 down to 0.9697. [05:55], [06:06]
- **Ralph Wiggum Loop Precursor**: Ralph is a software development loop that feeds a coding agent project specs, has it implement tasks, run tests, and commit if passing, restarting fresh agents and externalizing memory in git and files. [09:16], [10:16]
- **Agent Loops as Work Primitive**: Agentic loops are basic building blocks of work, like meetings or spreadsheets, cutting across roles; humans write strategy memos while agents execute experiments scored by clear metrics. [01:12], [07:36]
- **Eval Loop Readiness Map**: Processes succeed with agent loops if they have objective scores, fast cheap iterations, and bounded environments; the top quadrant includes code generation, game AI, ad optimization, and LLM training. [13:46], [14:40]
Topics Covered
- Agentic loops birth new work primitive
- Humans write memos, agents execute research
- Ralph loops externalize memory via git
- Loops thrive on fast cheap scorable iterations
- Future loops demand arena design skills
Full Transcript
Today we're discussing what Andrej Karpathy's weekend project, autoresearch, can tell us about the future of work. You might notice that we are doing an entire episode about this instead of our normal division into the headlines and the main episode. That's because I think this topic is actually even more significant than it seems on the surface. One would be tempted to think that all of us nerds were just getting overexcited because Andrej Karpathy, who is held in such esteem, released a new GitHub repository. And while that is certainly true, there is something bigger going on here.
You might remember, a couple of months ago, me talking about something called Ralph Wiggum. Ralph is, in simplest terms, a software development loop that keeps running, building software in an iterative and persistent way by looping the same instructions over and over and over again. It's named after the Simpsons character Ralph Wiggum for his lovable and indomitable persistence despite whatever's going on around him. Now, we'll talk more about Ralph in a little bit, but the key concept to take away is this idea of an iterative loop.
Karpathy's autoresearch is also, at core, about an iterative loop. And I think that, combined, what you have is arguably a new type of work primitive. Primitives are the basic building blocks of work that are so fundamental that they show up everywhere across roles and industries, and that people reach for automatically once they have them. New ones don't come around very often. And so this idea that agentic loops might be one is, I think, worthy of some serious scrutiny. But let's talk about what Andrej actually released first, and then we will come back to that.
On Saturday, Andrej, who was on the founding team at OpenAI, who was previously the director of AI at Tesla, who you might remember coining such terms as vibe coding last February, and who has now suggested we are in a different era of agentic engineering as of this February, tweeted: "I packaged up the autoresearch project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single GPU, one-file version of around 630 lines of code."
He continued: "Then the human iterates on the prompt.md. An AI agent iterates on the training code, train.py. The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement." In the image he shared alongside it, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds settings that achieve lower validation loss: the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. Part code, part sci-fi, and a pinch of psychosis. As a caption to the image, he wrote: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using soundwave interconnect in the ritual of a group meeting. That era is long gone.
Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 1,025th generation of the codebase. In any case, no one could tell if that's right or wrong, as the 'code' is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began."
So, let's talk about what autoresearch actually is, at least in the version that was released by Andrej. Autoresearch is a system for training a small language model. Basically, the kind of model that powers all of these AI tools, but much smaller; the type of model that could one day run on an edge device like a phone. The goal is to make a model as good as possible at understanding and generating text. Normally, or classically, a human researcher would sit there tweaking the training setup: adjusting the model's architecture, changing how fast it learns, experimenting with different optimization strategies. They'd run an experiment, check the results, decide what to try next, and repeat. That's basically the core loop of machine learning research, and it's bottlenecked by how fast a human can iterate.
Autoresearch instead hands that entire loop to an AI agent, and it does so in an intentionally simplified and tiny way in this repo. There are just three files that matter. The first is prepare.py, which is fixed infrastructure that doesn't change. It downloads the training data, trains a tokenizer, and handles evaluation. The second is train.py. This contains the entire GPT model definition, the optimizer, and the training loop. This is the single file the AI agent is allowed to edit. Everything in it is fair game: the model architecture, the hyperparameters, the batch size, the attention parameters, the learning rate schedule, literally everything. The third file is program.md, and this is the most conceptually important one, especially in the context of this idea of these loops being larger primitives. It's a markdown file, plain text written in English, that contains the instructions for the AI agent. It describes how the agent should behave as a researcher, what kinds of experiments to try, what to be cautious about, and when to be bold versus conservative. This is the file that the human in this equation edits.
So, the way this is going to work is: you point an AI agent like Claude or Codex or whatever at the repo and tell it to read program.md and start experimenting. The agent reads the instructions, looks at the current state of train.py, decides on a modification to try, makes the edit, and kicks off a training run. Every training run has a fixed 5-minute budget. When the run finishes, the system evaluates the model on a validation set and produces a single number. In this case, that's validation BPB, or val BPB, which stands for validation bits per byte. Lower is better. The agent then makes a decision. If the new val BPB is lower than the previous best, the change is kept: it gets committed to a git feature branch, it becomes the new baseline, and the agent builds on top of it for the next experiment. If the val BPB is the same or higher, the change is discarded: the agent reverts to the previous best version and tries something different. Then the loop repeats indefinitely. Because of that 5-minute constraint, you can run this for an hour and get 12 experiments, or run it overnight and get about 100. The session that Andrej shared showed 83 experiments, of which 15 were improvements that got kept, driving the val BPB from 0.9979 down to 0.9697.
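That keep-or-revert cycle is simple enough to sketch in a few lines of Python. To be clear, this is an illustrative skeleton, not the actual repo's code: `run_experiment` stands in for the agent editing train.py, running the 5-minute training job, and reading off val BPB, and the git commit and revert are reduced to bookkeeping.

```python
def autoresearch_loop(run_experiment, baseline_bpb, n_rounds):
    """Greedy keep-or-revert loop over training experiments.

    run_experiment(i) -> float: the val BPB after round i's proposed edit
    (in the real system: edit train.py, train ~5 minutes, then evaluate).
    Returns (best_bpb, kept_rounds)."""
    best = baseline_bpb
    kept = []
    for i in range(n_rounds):
        bpb = run_experiment(i)
        if bpb < best:        # lower bits-per-byte is better
            best = bpb        # "git commit": this edit is the new baseline
            kept.append(i)
        # else: "git revert" to the previous best and try something else
    return best, kept
```

With a fake scorer you can see the shape of a session: only the improving rounds survive as commits, exactly like the 15 kept improvements out of 83 experiments in Karpathy's run.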
So basically, instead of running the research, the researcher at this point is designing the arena that the research lives in, which is the program.md file. Andrej describes it as a super lightweight skill, and basically it's a research strategy document. Karpathy explicitly says you are not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md markdown file that provides context to the AI agents and sets up your autonomous research org. The human's job becomes: write a better memo. The agent's job: execute research within the frame the memo sets. The loop between them is mediated by a single unambiguous number, in the case of Andrej's experiment the val BPB, that tells you whether things are getting better or worse. And that is the whole system.
Almost immediately, people started squawking about this. Leor Alexander wrote: "You don't write the training code anymore. You write a prompt that tells an AI agent how to think about research. The agent edits the code, trains a small model for exactly 5 minutes, checks the score, keeps or discards the result, and loops all night. No human in the loop. That fixed 5-minute clock is the quiet genius. No matter what the agent changes, the network size, the learning rate, the entire architecture, every run gets compared on equal footing. This turns open-ended research into a game with a clear score."
Cosmic Labs co-founder Me McNelte writes: "Wild shift. Turning a single GPU into an autonomous experiment loop changes the pace of iteration. If the evaluation metric is well-designed, the system can explore hundreds of ideas far faster than manual tuning." Craig Huitt argued that the specific context of training LLMs isn't what matters. Instead, he called it the cleanest example of the agent loop that's about to eat everything: one, human writes a strategy doc; two, agent executes experiments autonomously; three, a clear metric decides what stays and what gets tossed; four, repeat 100x overnight. The person who figures out how to apply this pattern to business problems, not just ML research, is going to build something massive. The code is almost irrelevant.
The architecture and mindset is everything. Daniel Mistler called this the automation of the scientific method. And It's Me Chase also noticed that this would be valuable for things outside of ML research as well. He writes: "While this was made for self-improving LLMs, the framework could be applied to anything. One, AI agent reads context and previous results. Two, proposes targeted code edits. Three, runs a fast reproducible experiment. Four, gets an objective scalar score. Five, git commits only the winners or reverts. Six, repeats forever on a feature branch."
And of course, many made the connection to the Ralph Wiggum loop that was popularized a couple of months ago. Neweron writes: "Sounds like a hyper-mode Ralph Wiggum from a few months ago. Instead of looping until a task is done, you give the agent a benchmark on what to improve. The goal isn't completion but continuous improvement against a measurable target." Co-founders Nick called it the Ralph Wiggum loop for science: define what winning looks like.
Hand over the variables. Let the agent find what drives it. Y Combinator president Garry Tan made this connection as well in a blog post about autoresearch. Garry writes: "Autoresearch didn't emerge from nothing. The same pattern, put an AI in a loop with clear success metrics, was already working in software development by mid-2025. Geoffrey Huntley, a developer working from rural Australia, invented what he calls the Ralph Wiggum technique. Feed a prompt to a coding agent. Whatever it produces, feed back in, and loop until it works. The loop is the hero, not the model."
Now, expanding on the Wiggum loop just a little bit: basically, what you have is a script that runs an AI coding agent in a loop over time. Each iteration of the loop does the same thing. It feeds the agent a prompt that includes a project specification, tells the agent to read the current state of the codebase, pick a task to work on, implement it, run the tests, and commit it if everything passes. When the agent is done with its task, or when it runs out of context window, the loop terminates the agent process and spins up a brand new one: fresh context window, no memory of the previous session. The new agent reads the same spec, looks at the codebase, which now includes the previous agents' commits, figures out what's been done and what still needs doing, picks the next task, and goes.
Now, there are a couple of things that the Ralph loop was trying to solve for. In a traditional session, if you keep going long enough, the context window is going to fill up. The model starts losing track of earlier parts of the conversation, and the responses degrade. The Ralph loop's solution is to deliberately kill the agent and start fresh before that happens. Memory then doesn't live in the AI's context window. It lives in the files and in the code that's been written: the git commit history, a progress.txt file that each agent appends to, and a JSON-based product requirements document that tracks which tasks are done and which aren't. Every new agent instance bootstraps its understanding from these external artifacts, not from a conversation history. Each individual agent session might not be perfect, but the loop corrects for that over time, because state is externalized and the system is self-healing.
So part of what Ralph was trying to solve for was just the limits of the context window. But the other part is that people want agents that work while they sleep or while they're doing other things, and this is a way to solve for that. So with the connection to Ralph loops made, many people started exploring autoresearch in other contexts. Vernon Mather wrote: "I hooked this up to a peer-to-peer astrophysics researcher agent, which gossips and collaborates with other such agents, and your OpenClaw is set to: one, learn how to train an astrophysics model; two, train a new astrophysics model; three, use it to write papers; four, have peer agents based on frontier lab models critique it; five, surface breakthroughs, and then feed back in that loop."
Getting a little more practical, Vadim, the CEO of Vugola, writes: "I built a version of this for my whole company. The core problem with most agent setups: they output something and stop. The agent writes an email, sends an email, generates code, done. The next time it runs, it starts from zero. No memory of what worked, no memory of what failed. Pure amnesia. That's not automation. That's a script you babysit. The fix is one principle.
Close the loop. Every agent in my setup on OpenClaw reads a shared brain file before doing any work, then writes back to it after. I call it learnings.md. It's baked into every agent's system prompt: before starting work, read learnings.md; after completing work, append what you learned to learnings.md. That's the foundation. One file; all agents read it, all agents write to it. Now they're not isolated processes. They're a network that accumulates knowledge." So basically, Vadim is describing a loop for the entire agentic process of his company. In an article on X, he writes: "Most marketing teams run around 30 experiments a year.
The next generation will run 36,500-plus, easily." Things like new landing pages, new ad creative, maybe a subject line test. Except, what if you applied an experiment loop? Eric writes: "Modify a variable, deploy it, measure one metric, keep or discard, repeat forever. Cold email, ad creative, landing pages, job postings, YouTube thumbnails, discovery call scripts: they all follow the same loop." He also gave the example of cold outreach, which is their first test. The setup is 15 inboxes and around 300 emails per day, with the agent modifying one variable per experiment. It sends 100 emails, waits 72 hours, scores positive reply rate, keeps or discards, and repeats.
Roberto Nixon wrote about how the autoresearch model could be applied to advertising. One, you define success, purchases, app installs, whatever, and set a budget. Two, Meta, Google, or TikTok's infinite content machine generates thousands of ad variations: copy, format, imagery, etc. Three, it tests in real time against live audiences, keeps what works, kills what doesn't. Four, the agent loop runs continuously. A campaign moves from fixed asset to a living organism, ever evolving towards your stated goals. So, humans define goals and set guardrails, essentially a system prompt, in this case brand guidelines, and then press go.
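That "keeps what works, kills what doesn't" campaign is a generate-score-select loop rather than the single-track keep-or-revert loop from earlier. A toy sketch of that shape, with everything hypothetical: real variants would come from a creative model, and `score` would read live conversion metrics after a waiting period.

```python
def ad_campaign_loop(generate_variant, score, seed_ad, rounds, pool_size=8, keep=2):
    """Evolutionary keep-the-winners loop for ad creative.

    Each round generates variations of the current winners, scores the
    whole pool against the live metric, and keeps only the top `keep`
    performers as parents for the next round."""
    winners = [seed_ad]
    for _ in range(rounds):
        per_parent = max(1, pool_size // len(winners))
        pool = [generate_variant(w) for w in winners for _ in range(per_parent)]
        pool += winners                      # incumbents compete too
        pool.sort(key=score, reverse=True)   # higher score = more conversions
        winners = pool[:keep]                # kill everything else
    return winners
```

Sorting by the metric and truncating to the top performers is the entire "kill what doesn't work" step; everything else is generation.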
Everything else is automated. Now, apply this to any business function with a measurable outcome and a fast feedback loop. And so this brings up the question: does this type of agentic loop primitive work for every context, or is there some specific set of characteristics? I think you're going to see this loop applied to a huge range of activities, but where it's going to initially be most successful are areas where five things are true. First, there is a score, something that is scorable; in other words, the loop can tell better from worse without asking a human. The more subjective better or worse is, the harder this is going to be, although even that's not impossible; you just have to build some sort of objective scoring into the system. Second, iterations are fast and cheap; basically, bad attempts waste minutes, not months. Third, the environment is bounded, with the agent having a defined work and action space. Fourth, the cost of a bad iteration is low, i.e. you're not going to try this live with legal filings. And fifth, the agent needs to be able to leave traces.
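Those five criteria amount to a checklist you can score a process against. A deliberately crude sketch: the field names paraphrase the list above, and the flat 0-5 count is my simplification, not anything from the episode.

```python
from dataclasses import dataclass

@dataclass
class Process:
    objective_score: bool        # loop can tell better from worse without a human
    fast_cheap_iterations: bool  # bad attempts waste minutes, not months
    bounded_environment: bool    # defined work and action space
    low_failure_cost: bool       # a bad iteration can't do real damage
    leaves_traces: bool          # artifacts (commits, logs) the next run can read

def loop_readiness(p: Process) -> int:
    """Crude 0-5 readiness count over the five criteria above."""
    return sum([p.objective_score, p.fast_cheap_iterations,
                p.bounded_environment, p.low_failure_cost, p.leaves_traces])
```

Something like code generation would score 5 out of 5 here, while something like therapy would score near 0, which is roughly the spread the readiness map captures.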
So, with Claude, we designed an eval loop readiness map, which basically plots things on an x-axis of how automatable the evaluation is and a y-axis of iteration speed. The top area of the map is work processes that have seconds-long iteration speed with fully automated evaluation possible. On the other end of the spectrum is where evaluation is largely or entirely subjective and the iteration speed is months. What are some examples up in the top quadrant, where iteration speed is seconds and evaluation can be fully automated? Things like code generation. Some of the other ones that Claude came up with were game AI and NPC behavior, ad bid optimization, algorithmic trading, and then of course LLM training research, according to Andrej Karpathy. Moving down, where you start to have iteration speed that's a little bit slower and automation that's a little bit more partial, you have things like content moderation, A/B testing copy, supply chain routing, and so on and so forth. It goes down all the way to the other end of the spectrum. Something like political negotiation is subjective and takes months. Therapy and counseling: highly subjective, with very low iteration speed. And whether each of these individual placements is right or wrong, and I don't agree with where Claude put all of them, the point is this: it is my very strong instinct that every single work process whose success can be measured and scored in an objective way is going to have people experimenting with agentic loops around it.
Now, I think what makes this a primitive is that this is not just a new job, although I'm sure there will be specialists. This is something that people are going to do within their existing roles, in the same way that meetings or slide decks or email or spreadsheets are primitives that people use that cut across every function. What we're going to have in the future is things like this: a product manager writing a PRD, kicking off a Ralph loop before dinner, and reviewing the PR in the morning. A sales rep writing targeting criteria and tone guidelines, pointing a loop at 200 leads overnight, and reviewing the top 30. A financial analyst defining constraints, looping through portfolio allocation backtests, and reviewing the optimized output. A recruiter writing a scoring rubric, looping through 500 résumés, and reviewing flagged edge cases. A QA engineer writing acceptance criteria and then looping through test generation and execution. A lawyer writing a risk-flag checklist and looping through a stack of vendor contracts.
Now, interestingly, there is already very clearly a lot of work to productize this. Also on Saturday, March 7th, Claude Code creator Boris Cherny wrote: "Released today: /loop. /loop is a powerful new way to schedule recurring tasks for up to 3 days at a time. E.g., '/loop babysit all my PRs, autofix build issues, and when comments come in, use a worktree agent to fix them.' E.g., '/loop every morning, using the Slack MCP, give me a summary of top posts I was tagged in.'" Think about the heartbeat in OpenClaw. The heartbeat is effectively the core loop of any OpenClaw agent, where by default, every 30 minutes, the heartbeat fires, creating a moment for the agent to wake up, ask where things are, and continue on with its core mission.
And yet, even with all this change that I'm describing, this is almost certainly not the end state of the loops primitive. Andrej himself wrote about this on Sunday. The next step for autoresearch, he says, is that it has to be asynchronous and massively collaborative for agents. The goal is not to emulate a single PhD student; it's to emulate a research community of them. The current code synchronously grows a single thread of commits in a particular research direction, but the original repo is more of a seed from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. GitHub is almost, but not really, suited for this. It has a softly built-in assumption of one master branch which temporarily forks off into PRs just to merge back a bit later. "I'm not actually sure what this collaborative version should look like," he says, "but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention, and tenacity cease to be bottlenecks."
Other people picked up this theme. Blake Herren writes: "The missing layer is memory across the swarm. Right now, each agent runs in an isolated thread with no awareness of what other agents tried, what worked, what conflicted. Git tracks code changes, but not decisions, reasoning, or failed experiments. You need a semantic memory layer underneath the branches, so agent 47 knows agent 12 already tried that direction and it didn't converge." Kathy F writes: "The real unlock is when these agent researchers can share negative results efficiently. In academia, failed experiments go to the graveyard. In a collaborative agent network, every failure is a data point that prunes the search tree for everyone." Eugen Jin goes further, saying: "AGI is billions of AI agents doing autonomous research together. Figuring out the right abstraction for multi-agent collaboration is the key. GitHub is not good for agents." Dan Romero wonders if it's going to look closer to a social network than to a new version of GitHub. Moltbook, he writes, was too anthropomorphic, but an agent-native social network to collaborate on autoresearch is interesting.
As we round the corner here: already, we were living in a world where our comparative advantage as humans had been retreating to a higher level of abstraction. The new high-value skills around agent loops are things like arena design, i.e. writing the program.md file and creating the context in which the agent is operating, and evaluator construction, or building the score function, i.e. being able to tell the agent what good actually is and building a scoring system for it. And then there are other skills like loop operation and problem decomposition. But the point is that all of these things operate at a much higher level of abstraction than most of our work tasks today.
One interesting experiment to run this week: as you're working, find the things that you repeatedly do, or a part of doing them, where you know right now what better looks like. Ask if you could encapsulate that judgment clearly enough for an agent to use it as a score. If you can, you might be able to point a loop at that part of your job to work on your behalf overnight, and that likely gives you a preview of the next version of your job.
One of the great challenges right now, as someone who thinks about how to help individuals and companies adopt AI, is that every week the capability overhang gets bigger. In other words, the gap between meeting companies and people where they are and what I think they should actually be doing gets wider. At some point, it's so wide that it almost becomes malfeasance to meet them where they are. And yet, what other choice is there? The only other choice that I have found is to try to provide as many resources as I can for the people who are living at the other side of that gap and who are really pushing the boundaries. And if you think that you had an advantage when you were just vibe coding with Lovable or Claude Code, let me tell you: if you start to figure out how to implement agentic loops in your work, you are going to literally run circles, looping circles, around everyone else. My spidey sense says that what autoresearch represents is bigger than just a weekend project for one of AI's favorite people, and I'm excited to dig in further.
For now, that is going to do it for today's AI Daily Brief. Thanks for listening or watching, as always. And until next time, peace.