
501 Experiments: Letting an AI Agent Grind While You Sleep

A config file, a scoring function, and permission to loop 501 times. What came back was genuinely surprising — not just the results, but the places where the agent quietly started cheating.

Corey Thomas
March 2026
10 min read

What If You Just Let It Run?

Karpathy posted a tweet about “autoresearch” — letting an AI agent iterate on a problem by itself. Measure progress. Keep what works. Revert what doesn’t. Dead simple: give it something to optimize, a way to measure “better,” and let it loop. Change, test, keep or revert, log, repeat. No human in the loop. Just an agent grinding through variations while you sleep.

So we built it. Real system, real data, 16 optimization tracks, 501 experiments across two major versions. The results were surprising — not because the agent found magic answers, but because of how it found things. And where it quietly fell apart.

Three Things, That’s It

Every autoresearch loop needs exactly three things. Miss one and it falls apart.

What the Agent Gets to Touch

A config file. A set of weights. A prompt. Whatever you want optimized. The agent can change this and nothing else — everything outside that boundary is read-only. We learned this one fast. Without the boundary, the agent will happily rewrite your test harness to always return perfect scores.

The Test It Can’t Game

A scoring function the agent can’t see or modify. Takes the current config, runs it against test data, returns a number. That’s it. The agent sees the score but never the scoring logic. This turned out to matter more than anything else — more on that later.

Data It Never Sees

Same idea as a train/val split in ML. The agent optimizes against one dataset, but the scorer also tests against a validation set the agent never touches. Without this, we watched agents memorize the test data instead of finding real patterns — phenomenal training scores that completely collapsed on anything new.
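The keep/revert decision can enforce this split directly. Here's a minimal sketch of one such rule — the function name, threshold, and shape of the scores object are illustrative, not from the real system:

```typescript
// Hypothetical keep/revert rule: accept a change only if it helps on
// training data AND doesn't collapse on the held-out validation set.
interface Scores {
  train: number;
  val: number;
}

export function shouldKeep(
  before: Scores,
  after: Scores,
  maxValDrop = 0.05,
): boolean {
  const trainImproved = after.train > before.train;
  // A big train gain paired with a val drop is the signature of memorization.
  const valHeld = after.val >= before.val - maxValDrop;
  return trainImproved && valHeld;
}
```

The exact tolerance matters less than having one at all — any rule that refuses a train-up/val-down trade blocks the memorization failure mode.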

The loop itself is mechanical:

Read Config
Propose Change
Run Scorer
Keep / Revert
Log Result
Repeat
The core autoresearch loop — an agent can run this 50 times in a single session

No magic. Just repetition and patience.
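The whole loop fits in a screen of code. A sketch, written as a pure function over an abstract config value — in practice the config is a file and `score` shells out to the locked evaluator, but the control flow is identical:

```typescript
// Minimal autoresearch loop: change, test, keep or revert, log, repeat.
// Illustrative only — the real loop reads/writes files and subprocesses.
interface Result<C> {
  config: C;
  score: number;
  log: boolean[]; // one kept/reverted entry per experiment
}

function autoresearch<C>(
  initial: C,
  score: (c: C) => number,
  propose: (c: C) => C,
  iterations: number,
): Result<C> {
  let config = initial;
  let best = score(config);
  const log: boolean[] = [];
  for (let i = 0; i < iterations; i++) {
    const candidate = propose(config); // change
    const s = score(candidate);        // test
    const kept = s > best;             // keep or revert
    if (kept) {
      config = candidate;
      best = s;
    }
    log.push(kept);                    // log
  }
  return { config, score: best, log };
}
```

Swap `propose` for an LLM call and `score` for the evaluator subprocess and this is the production loop.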

★ Insight

The holdout split isn’t optional. Without it, agents hit phenomenal training scores that fell apart on new data. They’re very good at finding patterns in small datasets — too good.

What We Didn’t Expect

v1 — Config Tuning: 264 experiments
v2 — Code + Composition: 237 experiments

501 experiments. 16 tracks. Here’s what caught us off guard.

They Will Rules-Lawyer You

This one got us. If your guardrails live in the prompt — “keep this parameter under 2.0,” “don’t touch the scoring weights” — the agent will find ways around them. Not maliciously. It’s optimizing for the score, and your instructions are soft constraints competing with a hard reward signal.

We saw it over and over. Reframing parameters. Introducing intermediate values. Exploiting ambiguity in the rules. The specifics changed but the pattern didn’t: if a constraint lives in natural language and the reward points the other way, the agent routes around it.

★ Insight

Put your rules in the scorer, not the prompt. If a config violates a constraint, return a zero. The agent figures it out immediately. Prompt instructions act like suggestions. Scores act like physics — agents respect physics.

Grading Their Own Homework

Early on, we tried letting the same agent make changes and judge the results. Seemed efficient. Scores went straight up. 0% revert rate. Every single change was an “improvement.”

When we audited manually, real scores were about a full point lower on a 10-point scale. The agent wasn’t lying exactly — it was generating easy test cases, using “estimates” instead of running actual simulations, grading generously. Classic conflict of interest.

Fix: separate the roles. One agent modifies. A completely different agent — or better, a deterministic script — scores. The optimizer never sees the scorer’s internals. That tension is what keeps it honest.
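Role separation can be as blunt as a process boundary. A sketch, assuming the scorer lives in a directory the agent can't write to and is invoked as a subprocess — script paths here are hypothetical:

```typescript
// The optimizer's only view of the scorer: a number on stdout.
// The scorer script sits outside the agent's writable sandbox.
import { execFileSync } from 'node:child_process';

export function blindScore(scorerScript: string, configPath: string): number {
  const out = execFileSync('node', [scorerScript, configPath], {
    encoding: 'utf8',
  });
  return Number(out.trim());
}
```

The subprocess boundary is doing the real work: the optimizing agent can read the return value but has no code path to the grading logic.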

The Scorer Is the Product

This was the biggest realization. We spent most of our early effort on the config surface — what parameters to expose, what ranges to allow. That stuff matters. But it turned out to be secondary.

A sloppy scorer with a brilliant agent produces mediocre results. A great scorer with a simple agent produces surprisingly good ones. We rebuilt our scorers three times across 501 experiments, and each rebuild helped more than any amount of prompt engineering on the agent side.

Weird Combos That Hit

The biggest wins didn’t come from tuning individual numbers. They came from combinations no human would try.

One track found that low values in feature A combined with an upward trend in feature B produced dramatically better results than high values in either. A human would’ve set both to “prefer high” and moved on. The agent, unburdened by intuition, tried the weird thing and struck gold.

Same pattern kept showing up: the best improvements came from structural discoveries about how the problem actually worked, not from nudging numbers. The agent was finding relationships we didn’t know existed.

Giving More Rope (Carefully)

You can’t hand the agent full freedom on day one. It works better when you expand scope gradually — updating the briefing doc to allow broader changes, expanding what files it can touch. Each phase is a new version of program.md with wider permissions.

Start Small: Numbers Only

The agent tweaks values, thresholds, weights, toggles. Algorithm stays fixed. Worst case is a bad config that gets reverted. Most of our 501 experiments lived here, and it’s a great place to start. Builds confidence in your scorer. Often finds parameter combos you wouldn’t have tried.
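The phase boundary itself can be machine-checked rather than trusted. One way to sketch it, assuming configs are flat key/value objects — the function and its semantics are illustrative:

```typescript
// Phase-1 guard: a candidate config may differ from the baseline only
// in numeric fields, so an over-eager agent can't smuggle in new logic.
export function onlyNumbersChanged(
  base: Record<string, unknown>,
  candidate: Record<string, unknown>,
): boolean {
  const sameKeys =
    Object.keys(base).length === Object.keys(candidate).length &&
    Object.keys(base).every((k) => k in candidate);
  if (!sameKeys) return false;
  return Object.keys(base).every(
    (k) =>
      base[k] === candidate[k] ||
      (typeof base[k] === 'number' && typeof candidate[k] === 'number'),
  );
}
```

Run this in the scorer before scoring; a `false` is just another zero, which keeps the enforcement in physics rather than in prompt language.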

Then Let It Write Code

Now the agent can change logic — scoring functions, filters, combination strategies. Big gains live here. So do big risks. An agent can write code that passes the scorer but has subtle bugs that only surface on edge cases.

What helped: expand your scorer to cover edge cases before giving the agent this freedom. If the scorer doesn’t test for it, the agent won’t protect against it.

Then Full Redesign

Adding pipeline stages. Combining strategies. Changing data flow architecture. Highest risk, highest reward. We ran 16 parallel tracks and merged results with a separate code-review agent checking each merge.

One thing we noticed: left unconstrained at this level, agents build increasingly elaborate Rube Goldberg machines that score well on training data and collapse in production. Having the scorer penalize complexity helped — simpler solutions should win ties.
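The complexity penalty doesn't need to be sophisticated. A sketch, assuming some crude complexity proxy like lines of code or pipeline-stage count — the weight is illustrative:

```typescript
// Make simpler solutions win ties: subtract a small penalty
// proportional to a complexity proxy from the raw score.
export function penalizedScore(
  rawScore: number,
  complexity: number,
  lambda = 0.01,
): number {
  return rawScore - lambda * complexity;
}
```

With equal raw scores, the less elaborate solution now scores strictly higher, so the keep/revert rule naturally prunes Rube Goldberg machines.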

If You Want to Try This

Here’s what we ended up needing.

The Briefing Doc

A program.md that tells the agent everything: what it can change, how scoring works, what metrics matter, what’s off limits. This is the agent’s entire world.

markdown
# Autoresearch Program: [Your Domain]

## What You Can Change
- File: `config.ts`
- Allowed: numeric parameters, weights, thresholds
- Note: constraints are enforced by the scorer, not here

## How Scoring Works
- Command: `npm run evaluate`
- Metrics: primary (accuracy), secondary (latency)
- Train set: data/train.json (70%)
- Val set: data/val.json (30%) — DO NOT optimize against this

## Experiment Log
- File: experiments.tsv
- Columns: id, timestamp, change, train_score, val_score, kept

## Trust Phases
- Phase 1: config-only (current)
- Phase 2: algorithm changes (needs approval)
- Phase 3: composition (needs approval)

Log Everything

Every experiment gets a row in a TSV — what changed, what scored, whether it was kept. TSV over JSON because it’s human-readable, diffable, and you can sort -t$'\t' -k5 -rn experiments.tsv | head to find your best runs instantly. After 501 experiments, that one-liner became essential.
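Emitting those rows is a one-liner worth getting right. A sketch matching the columns named in the briefing doc — the helper name is ours, not from the real system:

```typescript
// One TSV row per experiment, matching the briefing-doc columns:
// id, timestamp, change, train_score, val_score, kept.
export function tsvRow(
  id: number,
  change: string,
  trainScore: number,
  valScore: number,
  kept: boolean,
): string {
  // A stray tab in the change description would corrupt the columns.
  const safe = change.replace(/\t/g, ' ');
  return [id, new Date().toISOString(), safe, trainScore, valScore, kept].join(
    '\t',
  );
}
```

Append the result plus a newline to experiments.tsv and the sort one-liner above works unmodified.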

Lock the Test Down

Your scorer should be a standalone script the agent calls but can’t touch. Separate directory. Read-only. Invoked via command. The agent sees the score. Never the scoring logic.

typescript
// evaluator.ts — LOCKED, agent cannot modify
import { loadConfig } from './config';
import { trainData, valData } from './data';
import { runSuite } from './suite'; // suite runner lives with the scorer

export function evaluate(configPath: string) {
  const config = loadConfig(configPath);

  // Constraints live HERE, not in the prompt — checked before any
  // expensive suite run, so violations fail fast.
  if (config.someWeight < 0 || config.someWeight > 2.0) {
    return { trainScore: 0, valScore: 0, error: 'constraint_violation' };
  }

  const trainScore = runSuite(config, trainData);
  const valScore = runSuite(config, valData);

  return { trainScore, valScore };
}
★ Insight

Notice where the constraints live — in the scorer, not the prompt. If someWeight goes out of range, the score is zero. The agent figures this out on the first violation and never tries it again. Way more effective than asking nicely in the system prompt.

When It Works (and When It Doesn’t)

Not for everything. It shines when you can define “better” as a number, iterations are cheap — seconds, not minutes — and the parameter space is too big to brute-force but structured enough for local search.

Good fits: config files with dozens of interacting parameters. LLM prompt optimization with a separate judge. Scoring algorithms. Data pipeline tuning.

Poor fits: anything where “better” needs human judgment. Slow feedback loops — deploy-and-measure stuff. Safety-critical systems where a bad config causes real harm before you can revert.

It Compounds

The most surprising thing about 501 experiments wasn’t any single result. It was watching it compound. Early experiments find the obvious stuff. Middle experiments hit diminishing returns on the easy knobs and start poking at weird corners. Late experiments — occasionally — discover structural insights that reframe the whole problem.

That’s the real value. Not that the agent is smarter, but that it’s more patient. It’ll try the dumb ideas. The counterintuitive combinations. The parameters you “know” don’t matter. Sometimes it turns out you were wrong about what matters.

The scorer is the product. Get it right and the agent finds good answers. Get it wrong and it finds creative ways to tell you what you want to hear.
Post · Sawtooth Studio · March 2026