Auto research loop

Generalized from Karpathy’s autoresearch. Same loop, any domain.

The Idea

An AI agent runs an infinite hill-climbing loop: modify → run → measure → keep or revert → repeat. No human in the loop. Wake up to a TSV of completed experiments.

This works for any project where you have:

A quantifiable metric (one or two scalars)
A fast feedback loop (seconds to minutes per run)
A sandboxed file to modify
An immutable eval harness

The core insight is that most optimization problems share the same structure: you tweak something, measure the outcome, and decide whether to keep the change. An agent can run this cycle around the clock without fatigue or distraction.

The Pattern

LOOP FOREVER:
  1. Read current state (code, config, prompt — whatever you're optimizing)
  2. Form a hypothesis ("what if I try X?")
  3. Edit the target file
  4. Git commit (audit trail)
  5. Run the experiment (fixed time/resource budget)
  6. Extract metrics from output
  7. If improved → keep (advance branch)
     If equal/worse → revert (git reset)
  8. Log to results.tsv
  9. NEVER STOP: keep looping until the budget runs out or a human interrupts
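
The loop above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Karpathy's implementation: `run_loop`, `propose_edit`, and `run_experiment` are hypothetical names, the agent's read/hypothesize/edit steps are collapsed into one callback, and the loop is bounded by `max_iters` so it can be exercised in a test. It also assumes results.tsv is left untracked, so that `git reset --hard` does not erase the log.

```python
import subprocess
from typing import Callable, Optional


def git(*args: str) -> str:
    """Run a git command in the current repository and return its stdout."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


def git_commit(message: str) -> str:
    """Commit tracked changes and return the short hash (step 4)."""
    git("commit", "-am", message)
    return git("rev-parse", "--short", "HEAD")


def git_revert() -> None:
    """Drop the most recent commit and its changes (the revert half of step 7)."""
    git("reset", "--hard", "HEAD~1")


def is_improvement(new: float, best: float, lower_is_better: bool = True) -> bool:
    """Strict comparison: a tie counts as 'not improved' and is reverted."""
    return new < best if lower_is_better else new > best


def run_loop(
    propose_edit: Callable[[], str],
    run_experiment: Callable[[], Optional[float]],
    best: float,
    max_iters: int,
    lower_is_better: bool = True,
    commit: Callable[[str], str] = git_commit,
    revert: Callable[[], None] = git_revert,
) -> tuple[float, list[tuple[str, Optional[float], str, str]]]:
    """Run max_iters iterations; return the best metric and the log rows."""
    rows = []
    for _ in range(max_iters):
        description = propose_edit()      # steps 1-3: the agent edits the target file
        commit_id = commit(description)   # step 4
        try:
            metric = run_experiment()     # steps 5-6
        except Exception:
            metric = None                 # any failure is treated as a crash
        if metric is not None and is_improvement(metric, best, lower_is_better):
            best, status = metric, "keep"
        else:
            status = "crash" if metric is None else "discard"
            revert()
        rows.append((commit_id, metric, status, description))  # step 8
    return best, rows
```

The `commit` and `revert` parameters default to git but can be swapped for fakes, which makes the keep/revert decision testable without a repository.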

Pattern Variants

Exploration vs. Exploitation: The default loop is pure hill-climbing (exploitation). To escape local optima, periodically accept a change even when it does not improve the metric:

VARIANT — EXPLORE/EXPLOIT:
  Every Nth iteration (e.g., N=5):
    Skip the "revert if worse" step
    Keep the change regardless (explore a new region)
    Mark as "explore" in the TSV
  Otherwise:
    Run the standard keep/revert logic (exploit)
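
One way to express this rule in code, as a small sketch: `decide` is a hypothetical helper, the iteration counter is assumed to start at 1, and the returned string is the status written to the TSV.

```python
def decide(iteration: int, improved: bool, explore_every: int = 5) -> str:
    """Return the TSV status for this iteration (iteration counts from 1)."""
    if iteration % explore_every == 0:
        return "explore"  # keep the change regardless of the metric
    return "keep" if improved else "discard"
```

After an explore step the branch tip may be worse than the best result so far, so it is worth deciding up front whether later comparisons use the tip's metric or the global best.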

Batch Mode: When individual runs are noisy (e.g., stochastic benchmarks), run the same config K times and compare the mean:

VARIANT — BATCHED:
  After editing the target file:
    Run the experiment K times (e.g., K=3)
    Compute mean and standard deviation of metrics
    Only keep if mean improves AND improvement > 1 stddev
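
A minimal sketch of the batched keep rule, assuming the K metric values have already been collected; `batched_keep` and `best_mean` are illustrative names, and the sample standard deviation is used, which needs K of at least 2.

```python
import statistics
from typing import Sequence


def batched_keep(samples: Sequence[float], best_mean: float, lower_is_better: bool = True) -> bool:
    """Keep only if the mean improves by more than one standard deviation."""
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples)  # sample stddev; requires at least 2 runs
    gain = (best_mean - mean) if lower_is_better else (mean - best_mean)
    return gain > 0 and gain > stddev
```

For example, three runs of 0.980, 0.981, and 0.979 against a best mean of 1.0 would be kept, while 0.90, 1.08, and 0.99 would not, because the spread swamps the gain.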

Multi-Stage Pipeline: When optimization has dependent stages (e.g., preprocess → train → evaluate):

VARIANT — PIPELINE:
  Stage 1: Optimize preprocessing config → metric: data_quality_score
  Stage 2: Freeze preprocessing, optimize training → metric: val_loss
  Stage 3: Freeze training, optimize inference → metric: latency_ms

Setup Contract

Every autoexp (auto-experiment) run needs four things defined upfront:

Component	What it is	Example
Target file	The ONE file the agent can modify	`train.py`, `prompt.txt`, `config.yaml`
Eval harness	Immutable script that produces the metric	`evaluate.py`, `run_bench.sh`
Metric(s)	1-2 scalar values, lower or higher = better	`val_bpb ↓`, `accuracy ↑`, `latency_ms ↓`
Budget	Time/cost cap per experiment	5 min wall clock, max 100 runs, $X total

Dual Metric Mode

For two metrics (e.g., accuracy ↑ and latency ↓), define a dominance rule:

Primary/secondary: Improve primary; secondary is a soft constraint (e.g., “accuracy must improve; latency shouldn’t 2x”)
Pareto: Keep if better on either metric without regressing on the other
Weighted: score = w1 * metric1 + w2 * metric2, collapsing both into a single scalar. Give a lower-is-better metric a negative weight so that a higher score is always better.
Threshold-gated: Metric B must stay within a fixed bound; only then is improvement in Metric A considered (e.g., “latency must be < 200ms; within that limit, maximize throughput”)
Lexicographic: Sort by Metric A first. Only break ties using Metric B (e.g., “minimize error rate first, then minimize model size among equally accurate models”)
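
Three of these rules can be written as small predicates. This is a sketch using the accuracy/latency pair from the example; `Result` and the function names are hypothetical, and the 2x latency limit comes from the "latency shouldn't 2x" example above.

```python
from typing import NamedTuple


class Result(NamedTuple):
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def pareto_keep(new: Result, best: Result) -> bool:
    """Keep if no metric regresses and at least one strictly improves."""
    no_regression = new.accuracy >= best.accuracy and new.latency_ms <= best.latency_ms
    some_gain = new.accuracy > best.accuracy or new.latency_ms < best.latency_ms
    return no_regression and some_gain


def primary_secondary_keep(new: Result, best: Result, max_latency_ratio: float = 2.0) -> bool:
    """Accuracy must improve; latency is a soft constraint relative to the best run."""
    return new.accuracy > best.accuracy and new.latency_ms < max_latency_ratio * best.latency_ms


def lexicographic_keep(new: Result, best: Result) -> bool:
    """Compare accuracy first; latency only breaks exact ties."""
    # Tuples compare element by element; latency is negated so larger is better.
    return (new.accuracy, -new.latency_ms) > (best.accuracy, -best.latency_ms)
```

Whichever rule is chosen, fixing it before the run starts keeps the keep/discard decisions comparable across the whole TSV.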

Triple+ Metric Mode

When you genuinely need 3+ metrics (e.g., accuracy, latency, memory usage), collapse them:

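# Option 1: Weighted composite
# Rescale each term to a comparable range first; raw 1 / latency_ms and
# 1 / memory_mb are tiny next to accuracy and would barely move the score.
score = 0.5 * accuracy + 0.3 * (1 / latency_ms) + 0.2 * (1 / memory_mb)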

# Option 2: Constraint + optimize
# Constraints: latency < 100ms AND memory < 512MB
# Optimize: accuracy ↑
# If constraints violated → discard immediately
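
Both options can be sketched as functions. The names and the reference values `ref_latency_ms` and `ref_memory_mb` are assumptions added here for illustration; dividing a reference by the measured value is one possible way to bring the terms onto a similar scale.

```python
from typing import Optional


def weighted_score(
    accuracy: float,
    latency_ms: float,
    memory_mb: float,
    ref_latency_ms: float = 100.0,  # hypothetical reference point for scaling
    ref_memory_mb: float = 512.0,   # hypothetical reference point for scaling
) -> float:
    """Option 1: weighted composite with each term scaled to roughly 0..1."""
    return (
        0.5 * accuracy
        + 0.3 * (ref_latency_ms / latency_ms)
        + 0.2 * (ref_memory_mb / memory_mb)
    )


def constrained_score(
    accuracy: float,
    latency_ms: float,
    memory_mb: float,
    max_latency_ms: float = 100.0,
    max_memory_mb: float = 512.0,
) -> Optional[float]:
    """Option 2: return accuracy if both constraints hold, else None (discard)."""
    if latency_ms >= max_latency_ms or memory_mb >= max_memory_mb:
        return None
    return accuracy
```

Returning `None` for a violated constraint lets the caller log the run as a discard without comparing it to the current best.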

Results Logging

Tab-separated. Simple. Diffable. No infrastructure.

commit	metric_1	metric_2	status	description
a1b2c3d	0.9979	44.0	keep	baseline
b2c3d4e	0.9932	44.2	keep	increased learning rate to 0.04
c3d4e5f	1.0050	44.0	discard	switched to GeLU activation
d4e5f6g	0.0000	0.0	crash	doubled model width (OOM)
e5f6g7h	0.9910	42.1	keep	added dropout 0.1 after attention layers
f6g7h8i	0.9915	43.8	discard	tried cosine annealing schedule
g7h8i9j	0.9880	40.5	keep	reduced embedding dim from 512 to 256
h8i9j0k	0.0000	0.0	timeout	batch size 2048 exceeded memory + time budget

Statuses: keep, discard, crash, timeout, explore
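
Appending a row needs nothing beyond the standard library. A minimal sketch, assuming the five-column layout shown above; `log_result` is a hypothetical helper name.

```python
import csv
from pathlib import Path

FIELDS = ["commit", "metric_1", "metric_2", "status", "description"]


def log_result(path: str, commit: str, metric_1: float, metric_2: float,
               status: str, description: str) -> None:
    """Append one experiment to the TSV, writing the header if the file is new."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow([commit, f"{metric_1:.4f}", f"{metric_2:.1f}", status, description])
```

Opening the file in append mode for each row means a crash in the agent loses at most the row in progress.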

Extended Logging

For richer analysis, optionally maintain a companion experiments.jsonl:

{"commit": "a1b2c3d", "timestamp": "2026-03-08T02:14:00Z", "metrics": {"val_loss": 0.9979, "throughput": 44.0}, "status": "keep", "description": "baseline", "diff_lines": 0, "wall_time_s": 120, "tokens_used": 0}
{"commit": "b2c3d4e", "timestamp": "2026-03-08T02:19:30Z", "metrics": {"val_loss": 0.9932, "throughput": 44.2}, "status": "keep", "description": "increased learning rate to 0.04", "diff_lines": 1, "wall_time_s": 118, "tokens_used": 1420}

This enables post-hoc analysis: which types of changes yield the biggest improvements? How does wall time correlate with metric gain?
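
As one example of such analysis, the sketch below counts statuses, totals wall time, and computes the gain of each kept experiment over the previous kept one. It assumes the record shape shown above and a lower-is-better metric; `summarize` is a hypothetical name.

```python
import json
from collections import Counter


def summarize(jsonl_text: str, metric: str = "val_loss") -> dict:
    """Summarize an experiments.jsonl payload passed in as text."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    status_counts = Counter(r["status"] for r in records)
    kept = [r for r in records if r["status"] == "keep"]
    # Gain of each kept run over the previous kept run (positive = lower loss).
    gains = [
        (cur["description"], prev["metrics"][metric] - cur["metrics"][metric])
        for prev, cur in zip(kept, kept[1:])
    ]
    return {
        "status_counts": dict(status_counts),
        "gains": gains,
        "total_wall_time_s": sum(r["wall_time_s"] for r in records),
    }
```

Sorting `gains` by the second element gives a quick ranking of which kinds of change paid off.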

Branch Isolation

Always experiment on a dedicated branch:

git checkout -b autoexp/<tag>  # e.g., autoexp/mar8-prompt-tuning

Main stays clean. The branch is your lab notebook. Every commit is a recoverable experiment.

Branch Strategy for Parallel Experiments

When running multiple autoexp agents simultaneously on different aspects:

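# One worktree per agent so parallel runs do not share a checkout (paths are examples)
git worktree add -b autoexp/mar8-learning-rate ../autoexp-lr      # Agent 1: tuning LR
git worktree add -b autoexp/mar8-architecture ../autoexp-arch     # Agent 2: model structure
git worktree add -b autoexp/mar8-regularization ../autoexp-reg    # Agent 3: dropout/weight decay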

After all agents finish, cherry-pick the best commits from each branch into a combined experiment branch and validate that the improvements compose.

Where This Applies

Domain	Target file	Metric	Eval harness
ML training	`train.py`	val_loss ↓	Fixed eval function
Prompt engineering	`prompt.txt`	eval_accuracy ↑	LLM judge or test suite
RAG pipelines	`config.yaml`	retrieval_precision ↑	Benchmark query set
Compiler/perf tuning	`flags.conf`	runtime_ms ↓	Benchmark binary
API optimization	`handler.py`	p99_latency ↓	Load test script
System prompts	`system.md`	task_score ↑	Eval suite with rubric
CSS/layout	`styles.css`	lighthouse_score ↑	Lighthouse CI
SQL queries	`query.sql`	exec_time_ms ↓	EXPLAIN ANALYZE wrapper
Infrastructure	`terraform.tf`	cost_per_hour ↓	`terraform plan` parser
Regex patterns	`patterns.yaml`	f1_score ↑	Labeled match/no-match dataset
Search ranking	`ranking.py`	ndcg@10 ↑	Relevance-judged query set
Image processing	`pipeline.py`	ssim_score ↑ / processing_time ↓	Reference image comparison
Audio processing	`denoise_config.yaml`	snr_db ↑	Test audio clips + measurement
Data pipeline ETL	`transform.py`	rows_per_second ↑ / error_rate ↓	Fixed input dataset
Caching strategy	`cache_config.yaml`	hit_rate ↑ / memory_mb ↓	Replay production access logs
Feature engineering	`features.py`	model_auc ↑	Fixed train/test split
Hyperparameter search	`hparams.json`	val_metric ↑	Training + eval script
Email/notification templates	`template.html`	render_time_ms ↓ / accessibility_score ↑	Rendering engine + axe-core
Compression settings	`compress.conf`	compression_ratio ↑ / decode_time_ms ↓	Benchmark file set
Networking config	`nginx.conf`	requests_per_sec ↑ / error_rate ↓	`wrk` or `ab` load test
Database indexing	`indexes.sql`	avg_query_time_ms ↓	Query benchmark suite
Serialization format	`schema.proto` / `schema.avsc`	serialize_time_μs ↓ / payload_bytes ↓	Round-trip benchmark
Log parsing rules	`grok_patterns.conf`	parse_accuracy ↑ / parse_rate_eps ↑	Labeled log samples
CI/CD pipeline	`.github/workflows/ci.yml`	pipeline_duration_s ↓	Trigger + measure workflow time
Docker image	`Dockerfile`	image_size_mb ↓ / build_time_s ↓	`docker build` + `docker images`
Kubernetes resources	`deployment.yaml`	pod_startup_s ↓ / resource_cost ↓	`kubectl apply` + monitoring
Game AI behavior	`ai_config.json`	win_rate ↑	Simulated matches vs. baseline
Recommendation engine	`rec_model.py`	precision@k ↑ / diversity ↑	Offline eval on held-out set
A/B test config	`experiment.json`	conversion_rate ↑	Simulated traffic replay
Spelling/grammar rules	`rules.yaml`	f1_score ↑	Annotated error corpus
Chatbot routing	`intents.yaml`	classification_accuracy ↑	Labeled utterance dataset
Batch job scheduling	`scheduler.conf`	total_makespan_s ↓	Simulated job queue
Memory allocator tuning	`malloc.conf`	alloc_throughput ↑ / fragmentation ↓	Allocation trace replay

Detailed Case Studies

Case 1: Prompt Engineering for Classification

Target: prompt.txt (system prompt for an LLM classifier)
Metric: accuracy ↑ on 200 labeled test examples
Eval: Send each test input to the LLM with the prompt, compare output to gold label
Budget: 50 experiments, $15 API cost cap

Typical agent moves:
  - Add few-shot examples
  - Reword instructions for clarity
  - Add chain-of-thought scaffolding
  - Constrain output format
  - Add edge case handling instructions

Case 2: SQL Query Optimization

Target: query.sql (a slow analytical query)
Metric: execution_time_ms ↓ (via EXPLAIN ANALYZE)
Eval: Run query 5 times on a staging database, take median
Budget: 30 experiments, 2 hours wall clock

Typical agent moves:
  - Rewrite subqueries as CTEs (or vice versa)
  - Change JOIN order
  - Add/remove index hints
  - Replace correlated subqueries with window functions
  - Materialize intermediate results
  - Adjust WHERE clause predicate ordering

Case 3: Docker Image Size Reduction

Target: Dockerfile
Metric: image_size_mb ↓ (secondary: build_time_s ↓)
Eval: docker build → docker images → extract size
Budget: 40 experiments, 3 hours

Typical agent moves:
  - Switch base image (ubuntu → alpine → distroless)
  - Merge RUN layers to reduce intermediate layers
  - Add .dockerignore entries
  - Multi-stage builds
  - Remove unnecessary packages
  - Order layers for better cache reuse

Case 4: Regex Pattern Matching

Target: patterns.yaml (list of regex rules for data extraction)
Metric: f1_score ↑ on labeled test corpus
Eval: Run each regex against test strings, compute precision/recall/F1
Budget: 60 experiments, 1 hour

Typical agent moves:
  - Broaden overly strict patterns
  - Add negative lookaheads to reduce false positives
  - Combine redundant patterns
  - Handle edge cases (unicode, whitespace variants)
  - Anchor patterns to reduce backtracking

Case 5: Caching Strategy Optimization

Target: cache_config.yaml (TTL, eviction policy, size limits)
Metric: cache_hit_rate ↑ (constraint: memory_mb < 512)
Eval: Replay 1 hour of production access logs against cache simulator
Budget: 80 experiments, 4 hours

Typical agent moves:
  - Adjust TTL per content type
  - Switch eviction policy (LRU → LFU → ARC)
  - Tune cache size partitions
  - Add prefetch rules for predictable access patterns
  - Adjust admission policy thresholds

Case 6: Feature Engineering for Tabular ML

Target: features.py (feature transformation pipeline)
Metric: model_auc ↑ on fixed validation set
Eval: Generate features → train lightweight model (e.g., XGBoost) → evaluate
Budget: 40 experiments, 6 hours

Typical agent moves:
  - Add interaction features (feature_A * feature_B)
  - Log-transform skewed distributions
  - Bin continuous variables
  - Add rolling window aggregates
  - Target encoding for high-cardinality categoricals
  - Remove noisy or redundant features

Case 7: Nginx Performance Tuning

Target: nginx.conf
Metric: requests_per_sec ↑ (constraint: error_rate < 0.1%)
Eval: Run wrk benchmark for 30 seconds, extract RPS and error count
Budget: 30 experiments, 2 hours

Typical agent moves:
  - Adjust worker_processes and worker_connections
  - Enable/tune gzip compression levels
  - Tune keepalive_timeout and keepalive_requests
  - Adjust proxy_buffer_size settings
  - Enable sendfile, tcp_nopush, tcp_nodelay
  - Tune upstream connection pooling

Case 8: Game AI Behavior Tuning

Target: ai_config.json (weights for AI decision-making)
Metric: win_rate ↑ against baseline AI (over 100 simulated matches)
Eval: Run match simulator, count wins/losses/draws
Budget: 100 experiments, 8 hours

Typical agent moves:
  - Adjust aggression/defense weight balance
  - Tune resource gathering priorities
  - Modify threat assessment thresholds
  - Change retreat health thresholds
  - Adjust exploration vs. known-path preferences

What Makes This Work

Single file constraint — prevents the agent from refactoring the universe
Fixed budget — no runaway experiments
Git commit per try — perfect audit trail, trivial revert
TSV logging — zero-infra results tracking
“NEVER STOP” — the agent is a tireless researcher, not an assistant waiting for permission
Deterministic eval — same input always produces the same metric (or use batching for stochastic evals)
Isolation — experiments happen on a branch, in a sandbox, away from production

What This Doesn’t Work For

Subjective quality (UI aesthetics, writing style, music composition) — no scalar metric
Slow feedback (deploy → wait for user traffic → measure) — loop stalls
Multi-file changes (architectural refactors) — too much surface area
Safety-critical systems (medical devices, flight control) — autonomous modification without human review is a bad idea
Irreversible side effects (sending emails, charging credit cards, modifying production databases) — agent must not trigger real-world actions
Problems requiring creativity over optimization (designing a new algorithm from scratch) — hill-climbing can’t escape the search space it’s given
Highly coupled systems (changing one config requires coordinated changes in 5 others) — single-file constraint breaks down
Non-deterministic environments (metrics swing ±20% between identical runs) — unless you use batched evaluation with statistical significance testing

Workarounds for Edge Cases

Problem	Workaround
Slow eval (>10 min)	Use a proxy metric (e.g., subset eval, smaller dataset)
Subjective quality	Create a rubric-based LLM judge that outputs a score
Multi-file needed	Bundle related configs into a single YAML/JSON file
Noisy metrics	Run each experiment K times, use mean ± stddev
Large search space	Seed the agent with known good directions in the prompt
Agent gets stuck in local optimum	Inject periodic random restarts (explore mode)
Expensive eval (API costs)	Use cheaper model for exploration, expensive for validation

Cost Awareness

Karpathy’s version costs GPU-hours. In an LLM-agent context, each iteration costs API tokens. Add a budget cap:

max_experiments: 50        # hard stop
max_cost_usd: 10.00        # estimated from token usage
max_wall_clock_hours: 8    # total run time
abort_on_consecutive_crashes: 3  # stop if 3 crashes in a row

The agent should log cumulative cost in the TSV or a separate budget.log.
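
The four caps can be enforced with a small tracker that the loop consults before each experiment. This is a sketch: the `Budget` class and its method names are hypothetical, the field defaults mirror the config above, and per-experiment cost is assumed to be supplied by the caller.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Budget:
    max_experiments: int = 50
    max_cost_usd: float = 10.00
    max_wall_clock_hours: float = 8.0
    abort_on_consecutive_crashes: int = 3
    experiments: int = 0
    cost_usd: float = 0.0
    consecutive_crashes: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, cost_usd: float, status: str) -> None:
        """Account for one finished experiment."""
        self.experiments += 1
        self.cost_usd += cost_usd
        # Any non-crash result resets the crash streak.
        self.consecutive_crashes = self.consecutive_crashes + 1 if status == "crash" else 0

    def stop_reason(self) -> Optional[str]:
        """Return the name of the first exhausted cap, or None to keep going."""
        elapsed_hours = (time.monotonic() - self.started_at) / 3600
        if self.experiments >= self.max_experiments:
            return "max_experiments"
        if self.cost_usd >= self.max_cost_usd:
            return "max_cost_usd"
        if elapsed_hours >= self.max_wall_clock_hours:
            return "max_wall_clock_hours"
        if self.consecutive_crashes >= self.abort_on_consecutive_crashes:
            return "abort_on_consecutive_crashes"
        return None
```

Checking `stop_reason()` at the top of each iteration gives the loop a single exit point, and the returned name can be written to budget.log.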

Cost Estimation by Domain

Domain	Typical cost per experiment	Experiments per hour	8-hour overnight cost
Prompt engineering (API)	$0.05–$0.50	10–30	$4–$120
ML training (GPU)	$0.10–$5.00	3–12	$2.40–$480
SQL optimization	~$0 (local DB)	20–60	~$0
Docker builds	~$0 (local)	5–15	~$0
Load testing	~$0 (local)	10–20	~$0
Infrastructure (cloud)	$0.01–$1.00	5–10	$0.40–$80
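
The overnight column is the product of the other two columns and the eight-hour window. A short sketch of that arithmetic, using the prompt-engineering row; `overnight_cost` is an illustrative name.

```python
def overnight_cost(cost_per_experiment: float, experiments_per_hour: float,
                   hours: float = 8.0) -> float:
    """Estimated spend for a run of the given length."""
    return cost_per_experiment * experiments_per_hour * hours


# Prompt engineering row: cheapest and slowest case vs. priciest and fastest case.
low = overnight_cost(0.05, 10)
high = overnight_cost(0.50, 30)
print(f"${low:.2f} to ${high:.2f}")
```

The wide range is a reminder to set `max_cost_usd` from the high end of the estimate rather than the typical case.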

Agent Instructions Template

Use this as the base prompt when spawning an autoexp agent:

You are an autonomous experimentation agent. Your task:

**Goal**: Optimize [METRIC] by modifying [TARGET_FILE].
**Eval command**: `[EVAL_COMMAND]`
**Metric extraction**: [HOW TO PARSE METRIC FROM EVAL OUTPUT]
**Direction**: [↑ higher is better / ↓ lower is better]
**Budget**: [MAX_EXPERIMENTS] experiments, [MAX_HOURS] hours, $[MAX_COST] cost
**Branch**: autoexp/[TAG]

Rules:
1. ONLY modify [TARGET_FILE]. Never touch the eval harness.
2. Git commit before every experiment run.
3. If the metric strictly improves on the best so far → keep. If equal or worse → revert.
4. Log every experiment to results.tsv (commit, metric, status, description).
5. If an experiment crashes, log it as "crash" and revert.
6. NEVER STOP until budget is exhausted or human interrupts.
7. Think carefully about each hypothesis. Prefer targeted changes over random edits.
8. After every 10 experiments, review the results.tsv to identify patterns.
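
The eval command and metric extraction slots could be filled along these lines. This is a sketch under assumptions: the harness is assumed to print a line such as `val_loss: 0.9932` to stdout, and `extract_metric` and `run_eval` are hypothetical helper names.

```python
import re
import subprocess
from typing import Optional

# Assumed output format: a line like "val_loss: 0.9932" somewhere in stdout.
METRIC_PATTERN = re.compile(
    r"^val_loss:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE
)


def extract_metric(output: str) -> Optional[float]:
    """Return the last reported metric value, or None if none was printed."""
    matches = METRIC_PATTERN.findall(output)
    return float(matches[-1]) if matches else None


def run_eval(command: list[str], timeout_s: float) -> tuple[str, Optional[float]]:
    """Run the eval command under a wall-clock cap; return (status, metric)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        return "crash", None
    metric = extract_metric(proc.stdout)
    # A clean exit with no metric line is still unusable, so treat it as a crash.
    return ("ok", metric) if metric is not None else ("crash", None)
```

Keeping the parsing outside the agent's reach, alongside the eval harness, prevents the agent from improving the score by changing how it is read.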

Advanced Techniques

Warm Starting

Instead of starting from scratch, seed the agent with knowledge:

Previous best experiments (from last night's run):
- Learning rate 0.03 gave val_loss 0.9910 ✓
- Dropout 0.1 helped, 0.3 hurt
- GeLU worse than ReLU for this architecture
- Batch sizes > 1024 cause OOM

Start from commit [BEST_COMMIT] and explore from there.

Checkpoint and Resume

For long-running experiments, design for interruption:

# Save state
cp results.tsv results.tsv.bak
git log --oneline autoexp/current > experiment_history.txt

# Resume later
git checkout autoexp/current
# Agent reads results.tsv to understand what's been tried
# Continues from where it left off
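
On resume, the agent's first step of reading results.tsv might look like the sketch below. It assumes the five-column layout from Results Logging and a lower-is-better `metric_1`; `load_state` is a hypothetical name.

```python
import csv
import io
from typing import Optional


def load_state(tsv_text: str, lower_is_better: bool = True) -> tuple[Optional[dict], list[str]]:
    """Return the best kept row and the descriptions of everything tried so far."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Only kept rows carry a trustworthy metric; crash rows hold placeholder values.
    kept = [r for r in rows if r["status"] == "keep"]
    pick = min if lower_is_better else max
    best = pick(kept, key=lambda r: float(r["metric_1"])) if kept else None
    tried = [r["description"] for r in rows]
    return best, tried
```

Filtering on status before comparing metrics matters because a crash row logged with 0.0000 would otherwise look like the best result for a lower-is-better metric.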

Meta-Optimization

Run autoexp on autoexp — optimize the agent’s own strategy:

Target: agent_strategy.yaml (parameters like explore_rate, batch_size, hypothesis_style)
Metric: best_metric_after_50_experiments ↑
Eval: Run a full 50-experiment autoexp loop, extract the best metric achieved

This is slow but powerful for tuning the exploration strategy itself.

The Philosophy

If each experiment takes ~5 minutes, you get ~12/hour. Over an 8-hour sleep, that’s ~100 experiments. You wake up to a results table your agent built while you dreamed.

The human’s job: define the metric, set the constraints, review the results. The agent’s job: relentlessly explore the space.

The power of this pattern isn’t any single experiment — it’s the compound effect of hundreds of small, methodical explorations. Most individual changes will be discards. But the few that stick accumulate into something no human would have the patience to find manually.

Inspired by @karpathy/autoresearch. Generalized for any quantifiable optimization problem.
