karpathy/autoresearch is the “AI agent runs LLM training experiments overnight” experiment. The original loop is sequential. One local GPU, one experiment at a time. This guide walks through the same workload on VESSL Cloud, where every experiment is a batch job and you can fan K of them out at once. The example is a vehicle for three VESSL Cloud patterns you can reuse for any agent-driven training loop:
- One Object storage volume as a shared, read-only cache so per-experiment startup pays no data-prep cost.
- A batch job per experiment instead of an interactive workspace, so submission is scriptable and the agent never holds a GPU lease.
- K-way fan-out to compress an N-experiment cycle into roughly N/K wall time.

Why experiments run faster on the cloud
karpathy’s loop has the agent edit code, train for ~5 minutes, check val_bpb, keep or revert, repeat. On a single local GPU the bottleneck is sequencing: one experiment at a time, no matter how cheap each one is.
On VESSL Cloud the same code runs as a batch job. Each job:
- Mounts a shared Object storage volume that holds the ~10 GB data cache.
- Boots in 3-4 minutes (image pull, dependency install, JIT compile).
- Trains for ~5 minutes and writes val_bpb to the log.
The arithmetic. On a single H100, sequential autoresearch lands roughly 12 experiments per hour. With K=4 fan-out on VESSL, the agent submits 4 jobs in parallel each round and waits for the slowest, landing about 28 experiments per hour in aggregate. The cap shifts from “GPU availability” to “how many parallel jobs you choose to allow.”
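A back-of-the-envelope version of that arithmetic, using the rough per-experiment timings above (the minute figures are this guide’s estimates, not measurements from your account):

```bash
# experiments/hour under the guide's rough timings
echo "local, sequential  : $(echo '60/5'     | bc) exp/hr"   # ~5 min per experiment
echo "cloud, sequential  : $(echo '60/8.5'   | bc) exp/hr"   # ~3.5 min startup + ~5 min training
echo "cloud, K=4 fan-out : $(echo '4*60/8.5' | bc) exp/hr"   # 4 jobs per round, rounds gated by the slowest job
```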
Prerequisites
- A VESSL Cloud account with credits (sign up)
- An organization with access to H100 SXM ×1 (or A100 SXM ×1; numbers below are for H100)
- vesslctl installed and authenticated (vesslctl auth status)
- Git, Python 3.11+, and a coding agent that can run shell commands (Claude Code, Codex, Cursor, and so on)
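A quick way to confirm the local half of these prerequisites, using only the commands named above:

```bash
# Sanity-check the local tooling before wiring anything up
vesslctl auth status     # should report an authenticated account and organization
git --version
python3 --version        # expect 3.11 or newer
```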
Set up the recipe
Create a shared Object storage volume for the data cache
Every experiment in the loop reads the same ~10 GB pretraining data. Putting it on an Object storage volume means you download it once and every job after that mounts the cache read-only. Find your object storage slug with vesslctl storage list. See Create a volume for the full creation flow.

Object storage is the right fit here precisely because reads are slow but cheap, and the data is read-only after the initial population. For a training-data hot path you would pick faster storage; for a cache that’s filled once and reused thousands of times, you want shareability.
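If you are not sure which slug to use, listing the storages from the CLI is enough (only the vesslctl storage list command mentioned above is assumed here):

```bash
# List the object storages available to your organization and note the slug
# the cache volume should live on
vesslctl storage list
```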
Run data prep once
Clone the cookbook locally and run the data-prep script. It downloads the FineWeb-Edu shard and writes it into the volume.
prep.sh submits a one-off batch job that mounts your cache volume and runs prepare.py. Once it finishes, every subsequent training job skips the download entirely.
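A minimal sketch of that one-time step; the repository URL follows the vessl-ai/vessl-cloud-cookbook path referenced under Next steps, and the script location inside the recipe is an assumption, so check the cookbook README for the exact invocation:

```bash
# One-time data prep: fill the cache volume so every later job skips the download
git clone https://github.com/vessl-ai/vessl-cloud-cookbook.git
cd vessl-cloud-cookbook/autoresearch
./prep.sh    # submits a one-off batch job that mounts the cache volume and runs prepare.py
```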
Submit a single experiment to verify the wiring
Before turning the agent loose, run one experiment by hand to confirm the recipe works in your account:
submit.sh pushes your branch, calls vesslctl job create with the cache volume mounted, polls until the job hits a terminal state, and pulls the logs back. If val_bpb shows up in run.log, the recipe is wired up correctly.
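A minimal sketch of that hand-run check; the script path and the absence of extra arguments are assumptions, so check the cookbook for the exact invocation:

```bash
# Submit one experiment, wait for it, and confirm val_bpb made it into the pulled logs
./submit.sh
grep val_bpb run.log    # a val_bpb line here means the recipe is wired up correctly
```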
Fan out to K parallel experiments
Once a single job works, the agent can submit K candidates at once. The cookbook ships two helpers:

| Script | What it does |
|---|---|
| batch-job/submit-async.sh | Submit one job and return the slug immediately. |
| batch-job/wait-jobs.sh slug1 slug2 ... | Poll N slugs until all are terminal, then print each job’s val_bpb and peak_vram_mb. |

A typical “Mode B” round submits four candidates with submit-async.sh and then blocks on wait-jobs.sh; a sketch follows below. The four jobs run on four independent GPUs simultaneously, and the round finishes when the slowest one does.
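A minimal sketch of one K=4 round built from those two helpers, assuming submit-async.sh prints the job slug on stdout and any per-candidate arguments are handled inside the script:

```bash
# Submit four candidates without blocking, then wait on all of them at once
slugs=()
for i in 1 2 3 4; do
  slugs+=("$(./batch-job/submit-async.sh)")   # one job per candidate; capture the returned slug
done
./batch-job/wait-jobs.sh "${slugs[@]}"        # blocks until all four are terminal, then prints val_bpb / peak_vram_mb
```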
What a research cycle looks like
A real 16-experiment, 4-round Mode B cycle (K=4) on H100 SXM ×1 (deneb-kr) is bundled with the recipe. The full data is in results.example.tsv; here is the summary:
| Round | Knob the agent tried | Best val_bpb | Verdict |
|---|---|---|---|
| 1 | EMBEDDING_LR, WEIGHT_DECAY, WINDOW_PATTERN | 1.0107 | None beat baseline |
| 2 | MATRIX_LR 0.04 → 0.05 | 1.0081 | First improvement |
| 3 | TOTAL_BATCH_SIZE 2^19 → 2^18 | 0.9986 | Smaller batch wins |
| 4 | On top of round 3, DEPTH 8 → 10 | 0.9856 | Beats karpathy’s 0.9979 reference |

Things worth knowing
- Per-experiment overhead. Each VESSL job pays ~3-4 minutes of startup (image pull, uv sync / torch reinstall, train.py compile) on top of the ~5-minute training budget. Sequential mode lands ~7 experiments/hour against ~12/hour on a dedicated local GPU; the win comes from K-way fan-out, not from per-experiment speed.
- Branch hygiene. The agent runs entirely on an autoresearch/<tag> branch and force-pushes to origin. Do not point two agents at the same tag; the second one will clobber the first’s commits. Use one tag per agent session (for example, opt-may7, arch-may7).
- Cost is unbounded by default. A runaway loop is real spend. Set a daily concurrency cap in the cloud console and check vesslctl billing show if you’re nervous.
- GPU choice. The cookbook defaults to H100 SXM ×1 (deneb-kr, $2.39/hr). For a cheaper run, set AUTORESEARCH_RESOURCE_SPEC to an A100 spec; numbers won’t be directly comparable to the karpathy reference because flash-attn falls back on non-Hopper GPUs.
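For example, switching the loop to a cheaper GPU and spot-checking spend might look like this; the spec string is a placeholder, not a value taken from the cookbook:

```bash
# Point subsequent submissions at an A100 spec (placeholder value) and check spend
export AUTORESEARCH_RESOURCE_SPEC="<your-a100-spec>"
./submit.sh              # the submit scripts read the resource spec from this variable
vesslctl billing show    # spot-check spend during long agent sessions
```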
Next steps
- Read the full recipe — Code, agent program, and analysis notebook live in vessl-ai/vessl-cloud-cookbook/autoresearch.
- Run multiple agents at once — Spawn N agent sessions on N different tags to research different directions in parallel. Combined with K-way fan-out per agent, you get N × K experiments per round.
- Adapt the pattern to your own loop — The submit/wait scripts are generic. Swap train.py for any single-GPU training entrypoint, point the cache volume at your data, and the same K-fan-out approach works for hyperparameter sweeps, neural architecture search (NAS), or any agent-driven experiment loop.
- Submit jobs from your CLI — See vesslctl job create for the full job submission flow.
