“I’ve Joined Anthropic”: What Karpathy’s Move Means For You

Karpathy just joined Anthropic to use Claude for pretraining research. Here's how to actually understand what that means - by training your own LLM.

Riley Brooks · 2026-05-19 · 7 min read · Beginner

By the end of this guide, you’ll have either trained your own tiny ChatGPT clone from scratch – or, more realistically, you’ll understand exactly what Andrej Karpathy will be doing at his new job. Same thing, different scale.

Here’s the news that broke on May 19, 2026: “I’ve joined Anthropic,” Karpathy posted on X. The internet did its usual thing. Every AI newsletter scrambled to publish the same recap: OpenAI co-founder, Tesla Autopilot, Eureka Labs, AI talent war, etc. You’ve probably read three of them already.

This isn’t another one. This is the tutorial version – what Karpathy will actually do at Anthropic, and how you can do a tiny version of it yourself this weekend.

What “I’ve joined Anthropic” actually means in practice

Strip away the celebrity-hire framing and the job is specific. Karpathy is working on pre-training under team lead Nick Joseph (TechCrunch, May 19). An Anthropic spokesperson confirmed he’ll start a team focused on using Claude to accelerate pre-training research.

Read that twice. Claude is going to help train the next Claude.

Pre-training is where models actually get built. Not fine-tuned, not prompted – built. The large-scale training runs that give Claude its core knowledge and capabilities happen here, and so does most of the compute bill. Everything Claude “knows” – language, code, basic reasoning – comes from this phase. Fine-tuning and RLHF polish the surface; pretraining builds the brain.

So when Karpathy says he wants to get “back to R&D,” he means: back to the part where you stare at loss curves at 3am wondering why the model just got dumber.

The shortcut: train your own mini-Claude with nanochat

Here’s the thing competitor articles miss. Karpathy already wrote the tutorial version of his new job. It’s called nanochat, and it does the entire pipeline he’ll now be working on at scale.

The whole thing – tokenization, pretraining, finetuning, evaluation, inference, web UI – in a single codebase. About 8,000 lines of code, ~330KB packaged (as of the current README), which fits inside a 100K-token context window. You can paste the entire repo into Claude and ask it to explain any function. That’s not a gimmick; that’s Karpathy’s design intent.

The economics are wild. Training GPT-2 cost approximately $43,000 in 2019. Nanochat’s speedrun script runs the same class of experiment – 8×H100, start to finish – in about 4 hours at roughly $3/GPU/hour. That’s ~$100 to do what OpenAI burned $43k on seven years ago.
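The arithmetic behind that comparison is short enough to check yourself. Here is a minimal sketch that uses only the figures quoted above; `run_cost_usd` is a hypothetical helper name, and a real invoice will vary with provider pricing and setup time.

```python
def run_cost_usd(num_gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of a rental that is billed per GPU, per hour."""
    return num_gpus * hours * usd_per_gpu_hour


# Figures quoted in the text: 8 GPUs, about 4 hours, roughly $3 per GPU-hour.
speedrun_cost = run_cost_usd(num_gpus=8, hours=4, usd_per_gpu_hour=3.0)

# The approximate 2019 GPT-2 training cost quoted in the text.
gpt2_cost_2019 = 43_000

print(f"speedrun: ${speedrun_cost:.0f}")
print(f"GPT-2 in 2019 cost about {gpt2_cost_2019 / speedrun_cost:.0f}x more")
```

The product comes out slightly under $100, which is where the rounded "~$100" figure comes from.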

The setup, walked backwards from the result

The endpoint is a webpage with a chat box. You type, your own model responds. Here’s how you get there:

Rent the box. An 8×H100 GPU node from Lambda, Together AI, Nebius, or similar. Hourly billing.
Clone and install. git clone https://github.com/karpathy/nanochat, then uv sync --extra gpu. nanochat uses the uv package manager.
Launch in a screen session. screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh – the screen part matters, more on that below.
Wait ~4 hours. Tokenizer trains, then base model pretrains on FineWeb, then mid-training, then SFT, then eval.
Activate venv and serve. source .venv/bin/activate && python -m scripts.chat_web, then open the URL.
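Since the node bills by the hour, it can be worth confirming that the command-line tools named in the steps above are installed before launching anything. The following is a small pre-flight sketch, not part of nanochat: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and the tool list is just the one implied by the steps (git, uv, screen).

```python
import shutil

# Tools the steps above invoke directly. This list is an assumption drawn
# from the walkthrough, not something nanochat itself checks.
REQUIRED_TOOLS = ("git", "uv", "screen")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Install these before starting the run:", ", ".join(missing))
else:
    print("All required tools found on PATH.")
```

The `which` parameter exists only so the check can be exercised without touching the real PATH.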

Cost: roughly $100 on current on-demand 8×H100 pricing. The result: a working chat interface backed by a model you trained yourself.

The pretraining stage, decoded

One number drives everything in nanochat. Inside speedrun.sh, the key call is:

torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- \
  --depth=20

That --depth flag? It auto-sets every other hyperparameter – model width, number of attention heads, learning rate schedule, training horizon, weight decay. All of it. The README describes this as the core design: one integer controls how big you want the model, and everything else is derived to be compute-optimal. You’re not tuning; you’re just asking for a smaller or larger brain.
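To make "one integer derives everything" concrete, here is a toy sketch of how a width and head count could follow from depth. The constants `ASPECT_RATIO` and `HEAD_DIM` are assumptions chosen for illustration, not values quoted from the README, and `derive_config` is a hypothetical name; the real rules live in nanochat's training script and may differ.

```python
from dataclasses import dataclass

# Assumed constants, for illustration only.
ASPECT_RATIO = 64  # assumed: units of model width added per layer of depth
HEAD_DIM = 128     # assumed: size of each attention head


@dataclass(frozen=True)
class DerivedConfig:
    depth: int
    model_dim: int
    num_heads: int
    block_params: int  # rough parameter count of the transformer blocks only


def derive_config(depth: int) -> DerivedConfig:
    """Derive width, head count, and a rough size estimate from depth alone."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    model_dim = depth * ASPECT_RATIO
    # Ceiling division, so there is always at least one head.
    num_heads = max(1, -(-model_dim // HEAD_DIM))
    # Common rule of thumb: about 12 * d^2 weights per layer
    # (4 * d^2 for attention, 8 * d^2 for the MLP), embeddings excluded.
    block_params = 12 * depth * model_dim**2
    return DerivedConfig(depth, model_dim, num_heads, block_params)


for depth in (20, 26, 32):
    cfg = derive_config(depth)
    print(f"d{depth}: width={cfg.model_dim}, heads={cfg.num_heads}, "
          f"block params ~{cfg.block_params / 1e6:.0f}M")
```

The estimate leaves out embedding tables, so it will undercount a model's full parameter total.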

Depth 20 is the cheap speedrun. The next step up, depth 26 (~$300, ~12 hours), slightly outperforms GPT-2’s benchmark score. Then there’s d32 – 32 layers, 1.9 billion parameters, trained on 38 billion tokens, total cost ~$800 over about 33 hours. That’s the model Karpathy hosts publicly at nanochat.karpathy.ai (as of May 2026).
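The tier numbers above can be cross-checked against each other. This sketch backs out the per-GPU hourly rate each tier implies, assuming all three run on the same 8-GPU node, and the tokens-per-parameter ratio implied by the d32 figures. `TIERS` and the function names are hypothetical; every input is a number quoted in this article.

```python
GPUS_PER_NODE = 8  # assumed: every tier runs on the same 8xH100 node

# Hours and approximate total cost per tier, as quoted in the text.
TIERS = {
    "d20": {"hours": 4, "cost_usd": 100},
    "d26": {"hours": 12, "cost_usd": 300},
    "d32": {"hours": 33, "cost_usd": 800},
}


def implied_gpu_hour_rate(cost_usd: float, hours: float,
                          gpus: int = GPUS_PER_NODE) -> float:
    """Per-GPU hourly price implied by a total cost and a run length."""
    return cost_usd / (hours * gpus)


def tokens_per_param(tokens: int, params: int) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


for name, tier in TIERS.items():
    rate = implied_gpu_hour_rate(tier["cost_usd"], tier["hours"])
    print(f"{name}: about ${rate:.2f} per GPU-hour")

# d32 figures from the text: 1.9 billion parameters, 38 billion tokens.
d32_ratio = tokens_per_param(tokens=38_000_000_000, params=1_900_000_000)
print(f"d32: {d32_ratio:.0f} training tokens per parameter")
```

All three tiers land close to the "roughly $3/GPU/hour" quoted earlier, so the cost figures are consistent with each other.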

Pro tip: Before you do anything, paste the whole repo into Claude using files-to-prompt or a similar tool. The packaged file is ~330KB – small enough to fit in a 100K-token context window. Ask Claude to explain any function. You’re now running Karpathy’s exact playbook in miniature: using an LLM to help you understand LLM training.
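If you want a rough feel for whether a packaged repo will fit in a context window, a bytes-to-tokens estimate is enough. The sketch below assumes about 4 bytes per token, a loose rule of thumb rather than a measured value for any particular tokenizer; source code often tokenizes less efficiently than prose. `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
BYTES_PER_TOKEN = 4.0  # assumed heuristic; real tokenizers vary, especially on code


def estimate_tokens(num_bytes: int,
                    bytes_per_token: float = BYTES_PER_TOKEN) -> int:
    """Rough token count for a text file of the given size in bytes."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return int(num_bytes / bytes_per_token)


def fits_in_context(num_bytes: int, context_tokens: int,
                    bytes_per_token: float = BYTES_PER_TOKEN) -> bool:
    """True if the estimated token count is within the context window."""
    return estimate_tokens(num_bytes, bytes_per_token) <= context_tokens


packaged_bytes = 330 * 1024  # the ~330KB figure quoted in the text
print("estimated tokens:", estimate_tokens(packaged_bytes))
print("fits in 100K:", fits_in_context(packaged_bytes, 100_000))
```

The answer is sensitive to the assumed ratio: at 3 bytes per token the same file would come out above 100K tokens, so treat the result as a sanity check, not a guarantee.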

The gotchas no one mentions

Three traps to know about before you spend money.

VRAM silently kills runs. The speedrun script assumes H100 80GB. Less than that – say, a 40GB A100 – and the default --device_batch_size=32 will OOM mid-training with no warning. The fix is in the README (as of current docs): reduce to 16, 8, 4, 2, or even 1. The script doesn’t tell you this. You find out two hours in when the job dies quietly.
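Why does shrinking the per-device batch size not change what the model learns from each step? A common approach, assumed here rather than quoted from the nanochat docs, is to keep the total batch fixed and make up the difference with gradient accumulation. The sketch below shows that bookkeeping; `TOTAL_BATCH_TOKENS` and `SEQ_LEN` are illustrative assumptions, and `grad_accum_steps` is a hypothetical helper.

```python
# Assumed values for illustration; check the training script for the real ones.
TOTAL_BATCH_TOKENS = 524_288  # assumed tokens per optimizer step
SEQ_LEN = 2_048               # assumed tokens per training sequence


def grad_accum_steps(device_batch_size: int, world_size: int,
                     total_batch_tokens: int = TOTAL_BATCH_TOKENS,
                     seq_len: int = SEQ_LEN) -> int:
    """Micro-steps needed per optimizer step to reach the total batch size."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of one micro-step")
    return total_batch_tokens // tokens_per_micro_step


# Halving the per-device batch doubles the accumulation steps on 8 GPUs.
for device_batch_size in (32, 16, 8, 4, 2, 1):
    steps = grad_accum_steps(device_batch_size, world_size=8)
    print(f"device_batch_size={device_batch_size:>2} -> {steps} accumulation step(s)")
```

Under these assumptions a smaller batch trades memory for time: each optimizer step needs more forward and backward passes, so the run is slower but covers the same data.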

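The single-GPU illusion. The repo runs fine on one GPU – just drop the torchrun prefix – and it produces essentially the same results. The catch is time: a single-GPU run takes about 8× longer, so the 4-hour speedrun becomes roughly 32 hours. At the same per-GPU hourly rate the total GPU-hours barely change, so what you really pay in is wall-clock time, and a run that lasts more than a day gives a dropped connection or a preempted instance far more chances to waste hours you’ve already paid for.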

The screen session saves your wallet. If your SSH connection drops mid-training and you didn’t use screen or tmux, the run dies. You keep paying for the GPU until you log back in. Spot instances bring the total cost down to around $15 – but spot instances also get preempted. Same problem, worse odds.

Honest expectations about what you’ll build

Your $100 model is not Claude. Not even in the same timezone.

Directly from the nanochat README: these micro models outperform 2019-era GPT-2 on benchmarks, but fall far short of modern LLMs. They hallucinate constantly – Karpathy’s own description is “a bit like children.” Naive, silly, confidently wrong. Kind of amusing, actually.

Think of it like building a paper airplane to understand aerodynamics. You will not fly to Tokyo. You will, however, finally get what “lift” means in a way no textbook explained.

That’s the whole point. The reason Karpathy joining Anthropic matters isn’t the tweet. It’s that the same person who built the educational ladder into LLMs – one of the most-watched LLM educators on YouTube – is now climbing it inside the lab that makes Claude. His specific assignment: use Claude to help design the next Claude’s training runs. AI companies are racing to automate parts of AI development, and because pretraining is where most of the compute bill lands, the economics of that bet matter. This is the most credible public attempt yet. If it works, the model you use next year was partially designed by the model you’re using today.

What to do this week

Pick one. If you have $100 and a free afternoon, rent an 8×H100 from a provider like Lambda or Together AI and run speedrun.sh. If you don’t, clone the repo locally, package it with files-to-prompt, and have Claude walk you through base_train.py line by line. Either path teaches you more about what “pretraining” means than any article about Karpathy’s job change – including this one.

Keep an eye on Anthropic’s announcements over the next 6-12 months. If the pretraining team actually ships, the next Claude release will be the first frontier model whose training was meaningfully accelerated by its predecessor.

FAQ

Is Eureka Labs dead now that Karpathy joined Anthropic?

Unclear, but probably not. His announcement said he’d resume education work “in time” – paused, not killed. He hasn’t shared updates on Eureka Labs since its 2024 launch, so don’t expect news soon.

Can I run nanochat on my MacBook?

Technically yes, practically no. Use --device_type=mps on Apple Silicon and drop the depth way down – try depth=4 or 6 just to see the code run. Expect output closer to word salad than conversation. It’s useful for stepping through the code in a debugger, tracing what happens during a forward pass, seeing the tokenizer in action. For an actual talking model? Rent the cloud node.
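If you are not sure which value to pass for --device_type on your machine, PyTorch can tell you. This is a small sketch, not part of nanochat: `pick_device_type` and `suggested_depth` are hypothetical helpers, and the depth suggestions simply restate the numbers in this article. It falls back to "cpu" when PyTorch is not installed.

```python
def pick_device_type() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing faster to offer
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not ship the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"


def suggested_depth(device_type: str) -> int:
    """Depth to start with: the speedrun default on CUDA, tiny elsewhere."""
    return 20 if device_type == "cuda" else 4


device_type = pick_device_type()
print(f"--device_type={device_type} --depth={suggested_depth(device_type)}")
```

On a laptop the point is to watch the code run end to end, so starting at the smallest depth and raising it later is the cheaper direction to err in.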

Why is this announcement a bigger deal than other AI hires?

Most senior AI hires are about headcount. This one has a specific technical thesis attached: that Claude can materially speed up the research that produces future versions of Claude. That’s a falsifiable bet. Either the next Claude ships faster and cheaper than its predecessors did, or it doesn’t. Karpathy’s resume – original OpenAI team, Tesla Autopilot, nanoGPT, nanochat – is uniquely matched to testing that thesis. That’s why it matters.