

karpathy/autoresearch is the “AI agent runs LLM training experiments overnight” experiment. The original loop is sequential. One local GPU, one experiment at a time. This guide walks through the same workload on VESSL Cloud, where every experiment is a batch job and you can fan K of them out at once. The example is a vehicle for three VESSL Cloud patterns you can reuse for any agent-driven training loop:
  • One Object storage volume as a shared, read-only cache so per-experiment startup pays no data-prep cost.
  • A batch job per experiment instead of an interactive workspace, so submission is scriptable and the agent never holds a GPU lease.
  • K-way fan-out to compress an N-experiment cycle into roughly N/K wall time.
The full code (agent program, helper scripts, results notebook) lives in the autoresearch cookbook recipe. This page focuses on the VESSL Cloud workflow.
Comparison diagram: karpathy's sequential loop versus VESSL Cloud K-parallel fan-out

Why experiments run faster on the cloud

karpathy’s loop has the agent edit code, train for ~5 minutes, check val_bpb, keep or revert, repeat. On a single local GPU the bottleneck is sequencing: one experiment at a time, no matter how cheap each one is. On VESSL Cloud the same code runs as a batch job. Each job:
  • Mounts a shared Object storage volume that holds the ~10 GB data cache.
  • Boots in 3-4 minutes (image pull, dependency install, JIT compile).
  • Trains for ~5 minutes and writes val_bpb to the log.
Because each job is independent, you can submit K of them at once. With K=4 on H100 SXM ×1, a 16-experiment cycle drops from ~2 hours of sequential compute to ~40 minutes of wall time. Same dollars, ~3× the throughput. The compute didn’t get faster; the scheduling got parallel.
The arithmetic. On a single dedicated local H100, sequential autoresearch lands roughly 12 experiments per hour. With K=4 fan-out on VESSL, the agent submits 4 jobs in parallel each round and waits for the slowest, landing about 28 experiments per hour aggregate. The cap shifts from “GPU availability” to “how many parallel jobs you choose to allow.”
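A back-of-envelope version of that claim, assuming the ~3-4 minute startup and ~5-minute training budget described above (the one-liner is illustrative, not part of the cookbook):
# One K-way round costs roughly startup + train minutes, since you wait
# on the slowest job; aggregate throughput is 60 / round_minutes * K.
awk 'BEGIN { round_min = 3.5 + 5; k = 4
             printf "~%d experiments/hour\n", 60 / round_min * k }'
# prints: ~28 experiments/hour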

Prerequisites

  • A VESSL Cloud account with credits (sign up)
  • An organization with access to H100 SXM ×1 (or A100 SXM ×1; numbers below are for H100)
  • vesslctl installed and authenticated (vesslctl auth status; see the quick check below)
  • Git, Python 3.11+, and a coding agent that can run shell commands (Claude Code, Codex, Cursor, and so on)
New to VESSL Cloud? Complete the Member quickstart first to set up your account, payment, and storage.
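A quick preflight using the two CLI commands this page relies on (both appear in later steps):
vesslctl auth status    # confirm the CLI is installed and authenticated
vesslctl storage list   # find the object storage slug used in step 1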

Set up the recipe

1. Create a shared Object storage volume for the data cache

Every experiment in the loop reads the same ~10 GB pretraining data. Putting it on an Object storage volume means you download it once and every job after that mounts the cache read-only.
vesslctl volume create \
  --name autoresearch-cache \
  --storage <your-object-storage-slug> \
  --teams <your-team>

vesslctl volume list
export AUTORESEARCH_CACHE_VOLUME=objvol-...
Find your object storage slug with vesslctl storage list. See Create a volume for the full creation flow.
Object storage is the right fit here because the data is read-only after the initial population and the access pattern tolerates slow, cheap reads. For a training-data hot path you would pick faster storage; for a cache that’s filled once and reused thousands of times, shareability is what matters.
2. Run data prep once

Clone the cookbook locally and run the data-prep script. It downloads the FineWeb-Edu shard and writes it into the volume.
git clone https://github.com/vessl-ai/vessl-cloud-cookbook.git
cd vessl-cloud-cookbook/autoresearch
bash batch-job/prep.sh
prep.sh submits a one-off batch job that mounts your cache volume and runs prepare.py. Once it finishes, every subsequent training job skips the download entirely.
You only need to re-run prep.sh if the data itself changes. The agent loop never touches the cache after this step.
3. Submit a single experiment to verify the wiring

Before turning the agent loose, run one experiment by hand to confirm the recipe works in your account:
git checkout -b autoresearch/sanity-check
bash batch-job/submit.sh > run.log 2>&1
grep "^val_bpb:\|^peak_vram_mb:" run.log
submit.sh pushes your branch, calls vesslctl job create with the cache volume mounted, polls until the job hits a terminal state, and pulls the logs back. If val_bpb shows up in run.log, the recipe is wired up correctly.
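To make that check scriptable rather than eyeballed, a small follow-up works; it assumes the val_bpb: <value> log-line format the grep above targets (the snippet is ours, not part of the recipe):
# Extract the metric; an empty result means the job never reached evaluation.
VAL_BPB=$(grep '^val_bpb:' run.log | awk '{print $2}')
if [ -n "$VAL_BPB" ]; then
  echo "wiring OK: val_bpb=$VAL_BPB"
else
  echo "no val_bpb in run.log; inspect the job logs" >&2
fi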
4. Fan out to K parallel experiments

Once a single job works, the agent can submit K candidates at once. The cookbook ships two helpers:
Script                                   What it does
batch-job/submit-async.sh                Submit one job and return the slug immediately.
batch-job/wait-jobs.sh slug1 slug2 ...   Poll N slugs until all are terminal, then print each job’s val_bpb and peak_vram_mb.
A typical “Mode B” round looks like:
SLUGS=$(for cfg in cfg1 cfg2 cfg3 cfg4; do
  git commit --allow-empty -m "try $cfg" >/dev/null   # keep git's output out of $SLUGS
  bash batch-job/submit-async.sh                      # prints the job slug
done)

bash batch-job/wait-jobs.sh $SLUGS
The four jobs run on four independent GPUs simultaneously. The round finishes when the slowest one does.
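Putting it together, a full Mode B cycle is just that round repeated. A minimal sketch with placeholder config names (the real agent edits code between rounds rather than looping over a fixed list):
for round in 1 2 3 4; do
  SLUGS=$(for cfg in cfg1 cfg2 cfg3 cfg4; do
    git commit --allow-empty -m "round $round: try $cfg" >/dev/null
    bash batch-job/submit-async.sh    # prints the job slug
  done)
  bash batch-job/wait-jobs.sh $SLUGS  # blocks until the slowest job is terminal
done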
Cap your concurrency in the VESSL Cloud console if you don’t want every parallel round to use the full available GPU pool. The submitter does not enforce a limit; the cloud schedules whatever you submit.

What a research cycle looks like

A real 16-experiment, 4-round Mode B cycle (K=4) on H100 SXM ×1 (deneb-kr) is bundled with the recipe. The full data is in results.example.tsv; here is the summary:
Round  Knob the agent tried                          Best val_bpb  Verdict
1      EMBEDDING_LR, WEIGHT_DECAY, WINDOW_PATTERN    1.0107        None beat baseline
2      MATRIX_LR 0.04 → 0.05                         1.0081        First improvement
3      TOTAL_BATCH_SIZE 2^19 → 2^18                  0.9986        Smaller batch wins
4      On top of round 3, DEPTH 8 → 10               0.9856        Beats karpathy’s 0.9979 reference
Total spend: ~$5.10 (16 experiments × ~$0.33 each at $2.39/hr H100). Wall time: ~40 minutes (4 rounds × ~10 min each, K=4 jobs per round). The same work would take ~2 hours of sequential compute on a single local H100.
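The spend figure reconciles if each experiment bills about 8 minutes of GPU time (our back-of-envelope, not a number from the recipe):
awk 'BEGIN { per_exp_min = 8; rate = 2.39
             printf "~$%.2f for 16 experiments\n", 16 * per_exp_min / 60 * rate }'
# prints: ~$5.10 for 16 experiments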
Progress chart: best val_bpb per round across 16 experiments, K=4 fan-out, versus baseline
The analysis notebook (analysis.ipynb) is in the cookbook recipe. Drop your own results.tsv into the notebook to regenerate the chart for your run.
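If you want to build that results.tsv as rounds finish, something like the following works; the column names here are hypothetical, so match them to results.example.tsv before feeding the notebook:
# Hypothetical columns; check results.example.tsv for the real layout.
printf 'round\tconfig\tval_bpb\tpeak_vram_mb\n' > results.tsv
printf '%s\t%s\t%s\t%s\n' "$round" "$cfg" "$VAL_BPB" "$PEAK_VRAM_MB" >> results.tsv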

Things worth knowing

  • Per-experiment overhead. Each VESSL job pays ~3-4 minutes of startup (image pull, uv sync / torch reinstall, train.py compile) on top of the ~5-minute training budget. Sequential mode lands ~7 experiments/hour against ~12/hour on a dedicated local GPU; the win comes from K-way fan-out, not from per-experiment speed.
  • Branch hygiene. The agent runs entirely on an autoresearch/<tag> branch and force-pushes to origin. Do not point two agents at the same tag; the second one will clobber the first’s commits. Use one tag per agent session (for example, opt-may7, arch-may7).
  • Cost is unbounded by default. A runaway loop is real spend. Set a daily concurrency cap in the cloud console and check vesslctl billing show if you’re nervous.
  • GPU choice. The cookbook defaults to H100 SXM ×1 (deneb-kr, $2.39/hr). For a cheaper run set AUTORESEARCH_RESOURCE_SPEC to an A100 spec, as sketched below; numbers won’t be directly comparable to the karpathy reference because flash-attn falls back on non-Hopper GPUs.
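A minimal sketch of that swap; the spec string is a placeholder, so list the real options for your organization first:
# AUTORESEARCH_RESOURCE_SPEC comes from the cookbook; the value below is
# a placeholder for whatever A100 spec your organization exposes.
export AUTORESEARCH_RESOURCE_SPEC=<your-a100-spec>
bash batch-job/submit.sh > run.log 2>&1   # same flow as the step 3 sanity check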

Next steps

  • Read the full recipe — Code, agent program, and analysis notebook live in vessl-ai/vessl-cloud-cookbook/autoresearch.
  • Run multiple agents at once — Spawn N agent sessions on N different tags to research different directions in parallel. Combined with K-way fan-out per agent, you get N × K experiments per round.
  • Adapt the pattern to your own loop — The submit/wait scripts are generic. Swap train.py for any single-GPU training entrypoint, point the cache volume at your data, and the same K-fan-out approach works for hyperparameter sweeps, neural architecture search (NAS), or any agent-driven experiment loop.
  • Submit jobs from your CLI — See vesslctl job create for the full job submission flow.