Ever feel like you spend more time tweaking tiny numbers in a config file than you do thinking about the big picture? You’re not alone. The grind of hyperparameter tuning is a rite of passage for anyone in machine learning. You change a learning rate, run a model for hours, check the results, and then… you do it all over again. And again.
It’s tedious, it’s time-consuming, and honestly, a lot of it feels like guesswork.
What if you could build a little assistant—an autonomous agent—that does all that boring work for you? An AI that explores different model configurations, runs the experiments, and keeps track of what works, all while you’re focused on more interesting problems (or just grabbing a coffee).
That’s not science fiction. It’s exactly what we’re going to build today. We’re going to take a brilliant concept from Andrej Karpathy, one of the sharpest minds in AI, and get it running in a place anyone can access for free: Google Colab.
We'll be using his AutoResearch framework to create a simple, automated loop that tinkers with a model's settings, measures its performance, and keeps whichever configuration has scored best so far. Ready to build your own AI research assistant? Let's get started.
Setting Up Our Workshop: Getting the Code and Tools
First things first, we need to set up our workspace. Think of this like clearing your workbench and laying out all your tools. We're going to do this right inside a Google Colab notebook, which means you don't need to own a fancy GPU or wrestle with a complicated local setup.
Our first move is to grab Karpathy's AutoResearch code directly from GitHub and install the Python libraries it needs to run.
import os, sys, subprocess, json, re, random, shutil, time
from pathlib import Path

# A little helper to install packages if they're missing
def pip_install(pkg):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# The libraries we'll need for this project
for pkg in ["numpy", "pandas", "pyarrow", "requests", "rustbpe", "tiktoken", "openai"]:
    try:
        __import__(pkg)
    except ImportError:
        pip_install(pkg)

import pandas as pd

# Clone the repo and move into it (skipped if a re-run of this cell already put us there)
if Path.cwd().name != "autoresearch":
    if not Path("autoresearch").exists():
        print("Cloning the autoresearch repo...")
        subprocess.run(["git", "clone", "https://github.com/karpathy/autoresearch.git"], check=True)
    os.chdir("autoresearch")

# This part is optional, but it sets up an OpenAI key if you have one
# It's not needed for the core loop, but good to have
OPENAI_API_KEY = None
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
    # Not in Colab, or no such secret stored: fall back to the environment
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("Setup complete! We're ready to go.")
See? We just pulled the entire project into our Colab environment and made sure all the dependencies are in place. Now we can start tinkering.
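If you'd rather see what's missing before installing anything, here is a minimal sketch of the same check using the standard library. The helper name `missing_packages` is made up for illustration, and like the loop above it assumes each package's pip name matches its import name, which is not true for every library.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "json" ships with Python; the second name is deliberately bogus
result = missing_packages(["json", "definitely_not_a_real_package_xyz"])
print(result)
```

Anything the function returns is a candidate for `pip_install`.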
Making It Colab-Friendly: A Few Quick Tweaks
The original code is designed for a more powerful machine than the one Colab gives us for free. If we tried to run it as-is, we'd likely run out of memory or time. It’s like trying to run a high-end video game on a ten-year-old laptop—it’s just not going to work.
So, we need to be clever. We're going to programmatically open up a few of the core files (prepare.py and train.py) and dial down some of the settings. We’ll reduce things like the model size and the amount of data it trains on so that each experiment can run quickly on our limited hardware.
This is the beauty of scripting—we don't have to manually edit these files. We can just tell our code what to change.
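To make that concrete, here is a minimal, standalone sketch of the same find-and-replace idea applied to an in-memory string instead of a real file. The helper name `patch_constant` and the sample text are made up for illustration. The one thing it adds beyond a bare `re.sub` is a check that the pattern matched, since `re.sub` quietly returns the text unchanged when it finds nothing.

```python
import re

def patch_constant(text, name, new_value):
    """Replace `NAME = <anything>` with `NAME = new_value`, or fail loudly.

    Hypothetical helper for illustration; it is not part of the autoresearch repo.
    """
    pattern = rf"^{re.escape(name)}\s*=.*$"
    # A function replacement inserts new_value literally (no backslash surprises)
    new_text, count = re.subn(
        pattern, lambda m: f"{name} = {new_value}", text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError(f"{name} not found, nothing was patched")
    return new_text

# Made-up stand-in for the contents of a config file
sample = "DEPTH = 8\nDEVICE_BATCH_SIZE = 64\n"
patched = patch_constant(sample, "DEPTH", "4")
print(patched)
```

If the upstream files ever rename a constant, a check like this turns a silent no-op into an obvious error.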
print("Adjusting config files for a Colab environment...")

# Define the paths to the files we're going to edit
prepare_path = Path("prepare.py")
train_path = Path("train.py")
program_path = Path("program.md")

# Read the original text of the files
prepare_text = prepare_path.read_text()
train_text = train_path.read_text()

# Let's make some changes to make things run faster and with less memory
# We're using regular expressions (re.sub) to find and replace the values
prepare_text = re.sub(r"MAX_SEQ_LEN = \d+", "MAX_SEQ_LEN = 512", prepare_text)
prepare_text = re.sub(r"TIME_BUDGET = \d+", "TIME_BUDGET = 120", prepare_text)  # Run for 2 mins
prepare_text = re.sub(r"EVAL_TOKENS = .*", "EVAL_TOKENS = 4 * 65536", prepare_text)
train_text = re.sub(r"DEPTH = \d+", "DEPTH = 4", train_text)  # Smaller model
train_text = re.sub(r"DEVICE_BATCH_SIZE = \d+", "DEVICE_BATCH_SIZE = 16", train_text)
train_text = re.sub(r"TOTAL_BATCH_SIZE = .*", "TOTAL_BATCH_SIZE = 2**17", train_text)
train_text = re.sub(r'WINDOW_PATTERN = "SSSL"', 'WINDOW_PATTERN = "L"', train_text)

# Write our changes back to the files
prepare_path.write_text(prepare_text)
train_path.write_text(train_text)

# We can also give our "researcher" a goal
program_path.write_text("""
Goal: Run autonomous research loop on Google Colab.
Rules: Only modify train.py hyperparameters.
Metric: Lower val_bpb is better.
""")

print("Config files patched. Preparing the dataset now...")

# Finally, run the preparation script to download and process the data
# check=True stops here if preparation fails, instead of printing a misleading success message
subprocess.run(["python", "prepare.py", "--num-shards", "4", "--download-workers", "2"], check=True)
print("Dataset is ready!")
Alright, we’ve successfully reconfigured the project. Now we have a lightweight version of the experiment that’s ready to run in Colab. Next up, we need a starting point.
Establishing Our Baseline: The First Run
You can't know if you're improving if you don't know where you started, right? That's why our next step is to run the training script once with the default (well, our newly modified default) settings. This gives us a "baseline" performance metric.
The key metric here is val_bpb, or validation bits-per-byte. Don't worry too much about the technical details. Roughly, it's how many bits the model needs, on average, to encode each byte of held-out text, so it tells you how well the model predicts data it hasn't trained on. The lower the val_bpb, the better the model.
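For the curious, bits-per-byte can be derived from the usual cross-entropy loss. The sketch below assumes the loss has been summed over tokens in nats (natural-log units) and that you know how many bytes of raw text those tokens cover. The function name and the numbers are illustrative, not taken from the repo.

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert a summed cross-entropy loss (in nats) into bits per byte."""
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Illustrative numbers: 1000 tokens at 2.772 nats each, covering 4000 bytes of text
example = bits_per_byte(total_loss_nats=1000 * 2.772, total_bytes=4000)
print(round(example, 3))
```

Dividing by bytes rather than tokens is what makes the number comparable across models that use different tokenizers.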
After the run is complete, we'll need a way to read the log file and pull out that all-important number, along with a few other useful stats like training time and memory usage.
print("Running the baseline experiment... this might take a couple of minutes.")

# This runs the training script and saves all its output to a file called 'baseline.log'
subprocess.run("python train.py > baseline.log 2>&1", shell=True)

# A helper function to parse the log file and extract the numbers we care about
def parse_run_log(log_path):
    text = Path(log_path).read_text(errors="ignore")

    def find(p):
        m = re.search(p, text, re.MULTILINE)
        return float(m.group(1)) if m else None

    return {
        "val_bpb": find(r"^val_bpb:\s*([0-9.]+)"),
        "training_seconds": find(r"^training_seconds:\s*([0-9.]+)"),
        "peak_vram_mb": find(r"^peak_vram_mb:\s*([0-9.]+)"),
        "num_steps": find(r"^num_steps:\s*([0-9.]+)"),
    }

# Let's parse our baseline log
baseline = parse_run_log("baseline.log")
print("Baseline run complete!")
print(baseline)

# Now, we'll create a little spreadsheet (a TSV file) to track all our experiments
results_path = Path("results.tsv")
rows = [{
    "commit": "baseline",
    "val_bpb": baseline["val_bpb"] or 0,  # 0 marks a run whose metric could not be parsed
    "memory_gb": round((baseline["peak_vram_mb"] or 0) / 1024, 1),
    "status": "keep",
    "description": "baseline",
}]

# Save it using the pandas library
pd.DataFrame(rows).to_csv(results_path, sep="\t", index=False)
print("Baseline results saved to results.tsv.")
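If you want to see what those regular expressions expect, here is a small standalone sketch that runs the same kind of pattern over a made-up log snippet. The sample text is an assumption inferred from the patterns above (one `name: number` pair per line). The real output of train.py may contain more lines or different formatting.

```python
import re

# Hypothetical log excerpt, shaped to match the patterns used by parse_run_log
sample_log = """step 10 | loss 3.21
val_bpb: 1.2345
training_seconds: 118.7
peak_vram_mb: 6144.0
"""

def find_metric(name, text):
    """Return the number after `name:` at the start of a line, or None."""
    m = re.search(rf"^{re.escape(name)}:\s*([0-9.]+)", text, re.MULTILINE)
    return float(m.group(1)) if m else None

print(find_metric("val_bpb", sample_log))
print(find_metric("num_steps", sample_log))  # not in the sample, so None
```

A metric that never shows up in the log comes back as `None`, which is why the results table falls back to 0 for missing values.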
Perfect. We now have our starting score. Every new experiment our agent runs will be compared against the best score so far, starting with this baseline. If it does better, we have a new champion!
Building the Brain: The Hyperparameter Engine
This is where the real magic begins. We’re going to build the core engine that allows our script to act like a researcher. This involves a few key parts:
- A "pantry" of options (SEARCH_SPACE): We'll define all the possible hyperparameter values we want our agent to try. Think of this as a chef's pantry full of different ingredients and spices.
- A "recipe writer" (sample_candidate): A function that randomly picks a few hyperparameters from our pantry and comes up with a new "recipe" or configuration to try.
- A "line cook" (apply_hparams): A function that takes the new recipe and actually modifies the train.py file, getting it ready for the next experiment.
Let's put the code together for this.
TRAIN_FILE = Path("train.py")
BACKUP_FILE = Path("train.base.py")

# Let's save a clean copy of our baseline training script
if not BACKUP_FILE.exists():
    shutil.copy2(TRAIN_FILE, BACKUP_FILE)

# These are the hyperparameters we're going to let our agent play with
HP_KEYS = [
    "WINDOW_PATTERN", "TOTAL_BATCH_SIZE", "EMBEDDING_LR", "UNEMBEDDING_LR",
    "MATRIX_LR", "SCALAR_LR", "WEIGHT_DECAY", "ADAM_BETAS", "WARMUP_RATIO",
    "WARMDOWN_RATIO", "FINAL_LR_FRAC", "DEPTH", "DEVICE_BATCH_SIZE"
]

# --- Helper Functions for Reading and Writing ---
def read_text(path):
    return Path(path).read_text()

def write_text(path, text):
    Path(path).write_text(text)

# Function to extract the current hyperparameter values from the text
def extract_hparams(text):
    vals = {}
    for k in HP_KEYS:
        m = re.search(rf"^{k}\s*=\s*(.+?)$", text, re.MULTILINE)
        if m:
            vals[k] = m.group(1).strip()
    return vals

# Function to change a hyperparameter value in the text
def set_hparam(text, key, value):
    return re.sub(rf"^{key}\s*=.*$", f"{key} = {value}", text, flags=re.MULTILINE)

# Get the hyperparams from our clean backup file (a snapshot of the starting values)
base_text = read_text(BACKUP_FILE)
base_hparams = extract_hparams(base_text)

# --- This is our "pantry" of options! ---
SEARCH_SPACE = {
    "WINDOW_PATTERN": ['"L"', '"SSSL"'],
    "TOTAL_BATCH_SIZE": ["2**16", "2**17", "2**18"],
    "EMBEDDING_LR": ["0.2", "0.4", "0.6"],
    "MATRIX_LR": ["0.01", "0.02", "0.04"],
    "SCALAR_LR": ["0.3", "0.5", "0.7"],
    "WEIGHT_DECAY": ["0.05", "0.1", "0.2"],
    "ADAM_BETAS": ["(0.8,0.95)", "(0.9,0.95)"],
    "WARMUP_RATIO": ["0.0", "0.05", "0.1"],
    "WARMDOWN_RATIO": ["0.3", "0.5", "0.7"],
    "FINAL_LR_FRAC": ["0.0", "0.05"],
    "DEPTH": ["3", "4", "5", "6"],
    "DEVICE_BATCH_SIZE": ["8", "12", "16", "24"]
}

# --- Our "Recipe Writer" ---
def sample_candidate():
    # Pick 2, 3, or 4 hyperparameters to change at random
    keys = random.sample(list(SEARCH_SPACE.keys()), random.choice([2, 3, 4]))
    # Re-read the backup each time: it holds the current champion, so a stale
    # snapshot taken at startup would overwrite any improvements we have kept
    cand = extract_hparams(read_text(BACKUP_FILE))
    changes = {}
    for k in keys:
        cand[k] = random.choice(SEARCH_SPACE[k])  # Pick a new value from our pantry
        changes[k] = cand[k]
    return cand, changes

# --- Our "Line Cook" ---
def apply_hparams(candidate):
    text = read_text(BACKUP_FILE)  # Always start from the backup (the current best)
    for k, v in candidate.items():
        text = set_hparam(text, k, v)
    write_text(TRAIN_FILE, text)  # Overwrite the main train file with the new recipe

# --- And a function to run the experiment and get the results ---
def run_experiment(tag):
    log = f"{tag}.log"
    subprocess.run(f"python train.py > {log} 2>&1", shell=True)
    metrics = parse_run_log(log)
    metrics["log"] = log
    return metrics
We now have all the machinery in place. We have a way to generate new ideas, a way to implement them, and a way to test them. It's time for the main event.
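Here is a self-contained toy version of the sample-and-apply step that you can run without the repo. The two-key search space, the sample text, and the fixed random seed are all stand-ins chosen for illustration.

```python
import random
import re

toy_source = "DEPTH = 4\nMATRIX_LR = 0.02\nWEIGHT_DECAY = 0.1\n"
toy_space = {"DEPTH": ["3", "5"], "MATRIX_LR": ["0.01", "0.04"]}

def toy_sample(rng):
    """Pick one key at random and a new value for it."""
    key = rng.choice(sorted(toy_space))
    return {key: rng.choice(toy_space[key])}

def toy_apply(text, changes):
    """Rewrite each `KEY = value` line named in changes; leave other lines alone."""
    for key, value in changes.items():
        text = re.sub(rf"^{key}\s*=.*$", f"{key} = {value}", text, flags=re.MULTILINE)
    return text

rng = random.Random(0)  # seeded so reruns pick the same candidate
changes = toy_sample(rng)
toy_result = toy_apply(toy_source, changes)
print(changes)
print(toy_result)
```

Only the sampled line changes. Every other setting stays exactly as it was, which is what lets each experiment be described by its small `changes` dictionary.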
Let It Run! The Autonomous Research Loop
This is the moment we’ve been building towards. We’re going to write a simple loop that tells our agent to run a few experiments, one after another.
For each experiment, the loop will:
- Generate a new set of hyperparameters.
- Apply them to the train.py script.
- Run the training process.
- Evaluate the result (val_bpb).
- Decide if it's a new "best" model. If it is, we save this new configuration as our new champion. If not, we discard the changes and go back to the previous best.
- Log everything in our results table.
This is where we get to sit back and watch our little AI researcher do its thing.
N_EXPERIMENTS = 3  # Let's run 3 new experiments for this demo

# Load our results table
df = pd.read_csv(results_path, sep="\t")

# Find our current best score (lower is better); 0 means "no score", so ignore it
best = df["val_bpb"].replace(0, 999).min()

for i in range(N_EXPERIMENTS):
    tag = f"exp_{i+1}"
    print(f"--- Starting Experiment {tag} ---")

    # 1. Generate a new idea
    candidate, changes = sample_candidate()

    # 2. Apply it to the code
    apply_hparams(candidate)

    # 3. Run the experiment
    metrics = run_experiment(tag)

    # 4. & 5. Evaluate and decide
    if metrics["val_bpb"] and metrics["val_bpb"] < best:
        status = "keep"
        best = metrics["val_bpb"]
        # This is the new champion! We update our backup file.
        shutil.copy2(TRAIN_FILE, BACKUP_FILE)
    else:
        status = "discard"
        # Not better. We revert back to our previous best.
        shutil.copy2(BACKUP_FILE, TRAIN_FILE)

    # 6. Log the results
    row = {
        "commit": tag,
        "val_bpb": metrics["val_bpb"] or 0,
        "memory_gb": round((metrics["peak_vram_mb"] or 0) / 1024, 1),
        "status": status,
        "description": str(changes),
    }
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    df.to_csv(results_path, sep="\t", index=False)

    print(f"Changes tried: {changes}")
    print(f"Metrics: {metrics}")
    print(f"Result: {status.upper()} this configuration.")
    print("-" * 20)

print("\n--- Final Results ---")
# Let's see the sorted leaderboard of our experiments
# Runs that produced no score were logged as 0, so leave them out of the ranking
print(df[df["val_bpb"] > 0].sort_values("val_bpb"))

# You can also download the best training file and the results log
try:
    from google.colab import files
    print("\nDownloading the best train.py and results.tsv...")
    files.download("train.py")
    files.download("results.tsv")
except ImportError:
    pass  # Not running in Colab, so there is nothing to download
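The keep-or-discard logic is easier to see with the training taken out. Below is a toy sketch of the same loop in which a made-up scoring function, `fake_val_bpb`, stands in for a real training run. The search space, the "ideal" settings baked into the scoring function, and the seed are all invented for illustration.

```python
import random

def fake_val_bpb(config):
    """Made-up stand-in for a training run: pretends depth 5 and lr 0.02 are ideal."""
    return 1.0 + 0.05 * abs(config["depth"] - 5) + 10 * abs(config["lr"] - 0.02)

space = {"depth": [3, 4, 5, 6], "lr": [0.01, 0.02, 0.04]}
rng = random.Random(42)

best_config = {"depth": 4, "lr": 0.04}  # plays the role of the baseline
best_score = fake_val_bpb(best_config)
history = [("baseline", best_score, "keep")]

for i in range(20):
    candidate = dict(best_config)            # start from the current champion
    key = rng.choice(sorted(space))
    candidate[key] = rng.choice(space[key])  # change one setting at random
    score = fake_val_bpb(candidate)
    if score < best_score:                   # lower is better
        best_config, best_score = candidate, score
        history.append((f"exp_{i+1}", score, "keep"))
    else:
        history.append((f"exp_{i+1}", score, "discard"))

print(best_config, round(best_score, 4))
```

Because a candidate is only kept when it beats the current best, the best score can never get worse from one experiment to the next.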
So, What Did We Just Do?
Take a step back and look at what we've accomplished. We didn't just run a Python script; we built a miniature, autonomous research system. It has a goal (find a lower val_bpb), a method for exploring (randomly changing hyperparameters), and a way to learn from its results (keeping only the best configurations).
This is a simple version, of course. More advanced systems might use smarter search algorithms instead of random guessing. But the core principle is the same: automating the tedious cycle of experimentation.
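As one small example of a "smarter" move, here is a sketch of a local step: instead of jumping to any value in the list, it nudges one setting to a neighboring value. This is an illustrative alternative, not something the AutoResearch repo or the loop above does, and the `neighbor` function and two-key space are made up for the example.

```python
import random

# Values are listed in increasing order so that "one step" is meaningful
ordered_space = {"DEPTH": ["3", "4", "5", "6"], "WEIGHT_DECAY": ["0.05", "0.1", "0.2"]}

def neighbor(config, rng):
    """Return a copy of config with one setting moved one step up or down its list."""
    key = rng.choice(sorted(ordered_space))
    options = ordered_space[key]
    idx = options.index(config[key])
    step = rng.choice([-1, 1])
    new_idx = min(max(idx + step, 0), len(options) - 1)  # clamp at the ends of the list
    new_config = dict(config)
    new_config[key] = options[new_idx]
    return new_config

rng = random.Random(1)
current = {"DEPTH": "4", "WEIGHT_DECAY": "0.1"}
proposal = neighbor(current, rng)
print(proposal)
```

Swapping a function like this in for the random sampler would turn the loop into a simple hill climber that explores close to the current champion.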
By building this in Google Colab, we’ve shown that you don’t need a massive data center to start exploring these powerful ideas. You can empower a machine to search for better models for you, freeing up your own brainpower for the creative, high-level thinking that machines can't do (yet!).
This is more than just a cool trick; it’s a glimpse into a more efficient and powerful way to do machine learning research. Now, imagine what you could do by expanding that search space or letting it run for a few hundred experiments instead of just three. The possibilities are pretty exciting.




