feb 2025 · andrej karpathy · 3.5-hour deep dive · distilled

LLMs,
end to end.

The full training stack, intuitively. From raw internet text to a model that reasons — with the quirks, failures, and mental models that make sense of why LLMs feel the way they do.

Source: Deep Dive into LLMs like ChatGPT · pairs with Karpathy's 2023 first-principles intro.

01 / three stages

Three stages. Three different teams.

Inside an AI lab, the model passes through three handoffs before it talks to you.

Karpathy: "different teams, different data, different cadence." Pre-training builds knowledge. SFT shapes format and helpfulness. RL shapes reasoning behavior. Most of what's new in 2024–2026 happens in stage 3.

first principle When a model surprises you (good or bad), ask which stage owns the surprise. Wrong fact? → pre-training. Wrong format / refusal style? → SFT. Wrong reasoning trajectory? → RL.

02 / pre-training data

The internet, but cleaned.

CommonCrawl in. ~15T tokens of high-quality text out. The pipeline does the work.

FineWeb (HuggingFace's filtering pipeline, built on Common Crawl) implements this in the open. Each layer drops 50–95% of what enters. The output: a few thousand books' worth of text, repeated nowhere, scored as useful.

interpretation "Trained on the internet" undersells how filtered the input is. By the time text enters pre-training, it's been classified, deduped, and weighted by domain — closer to a curated library than a crawl dump.

03 / tokenization

Models don't read letters. They read tokens.

Byte-pair encoding chunks text into ~50k–100k atoms. Most of the model's weirdness lives here.

Byte-pair encoding (BPE) merges the most common character pairs into tokens. Common words = 1 token. Rare words = many tokens. The model's whole input layer is integer IDs into a ~100k-token vocabulary — it has no native concept of "the letter R."

why models fail at math + spelling ↓ "Multiply 348 × 619" gets tokenized into 5 tokens that look nothing like the digits. The model has to reason about arithmetic over an opaque encoding. Tools (Python) bypass this entirely. Letter counting fails for the same reason.

04 / the transformer

The architecture is fully understood. The weights aren't.

A transformer is just a function: tokens in, probability distribution over the next token out. The 100B parameters inside are the mystery.

The whole loop: tokens → forward pass → probability over the next token → sample → append → repeat. That's it. "Generation" is iterating this. The architecture is a few hundred lines of math; the magic is in the 100B numbers we can't interpret.

interpretability The field of mechanistic interpretability tries to reverse-engineer what circuits inside the network do. Real progress, but most of the model is still a fog. Don't rely on a "rationale" the model offers about itself — it's a generated rationale, not a reading of its own weights.

05 / the base model

A document simulator, not an assistant.

After pre-training, the model can do one thing: continue any internet-shaped text. That's surprisingly powerful — and surprisingly limited.

The base model's only job is "predict the next token in the same distribution as training." It's already capable of in-context learning — pattern-matching from prompt examples — which is the conceptual seed for both few-shot prompting and the SFT trick.

first principle Every behavior an LLM shows you starts as next-token prediction. "Be helpful" is just "produce tokens that look like the helpful examples in the training distribution." Nothing about it is qualitatively different from autocomplete.

06 / hallucination

The model has no concept of "I don't know."

Sampling is probabilistic. Every output is a guess — including the confident ones.

Modern frontier models (GPT-5, Claude 4) hallucinate less because labs ran an explicit "uncertainty audit" — probing the base model for facts it's unsure about, then SFT-training the assistant to say "I don't know" on those probes. The mechanism isn't introspection; it's a learned reflex.

for your work A lower hallucination rate ≠ epistemic awareness. The model learned to refuse on a distribution similar to your prompt — for prompts outside that distribution, it'll happily fabricate again. Always pair high-stakes outputs with verification.

07 / sft

SFT: conversations as the recipe.

Same algorithm as pre-training. Different data. The data is the entire shape of the assistant.

SFT is "show, don't tell." Want the model to refuse harmful requests? Include refusals. Want it to use Markdown? Use Markdown in answers. Want shorter outputs? Make examples shorter. The labeler-guideline doc is the personality spec.

first principle The assistant you're talking to is, statistically, an imitation of the labelers Scale AI / Anthropic / OpenAI hired. The model's "values" are the average values expressed in the training conversations. Different labeler pool → different model.

08 / rl

RL: from tutor to coach.

SFT shows the model what right answers look like. RL teaches it to find them on its own.

The RLHF predecessor used human comparisons; the new RL wave uses verifiable rewards. For math, code, and reasoning problems where you can check the answer, the model can self-discover better reasoning paths than any human would have written down.

why this matters in 2026 Reasoning models (o1, o3, Claude extended thinking, DeepSeek R1, Gemini Deep Think) are all trained with RL on verifiable rewards. The breakthrough wasn't a new architecture — it was finding tasks where you can score model output without a human in the loop.

09 / reasoning

Long internal chains-of-thought.

RL on verifiable problems creates models that "think out loud" before answering. The thinking can be 10× longer than the answer.

Reasoning models spend most of their compute "thinking" — sampling 100s–1000s of internal tokens before producing the visible answer. The internal trace gets graded (during training) on whether the final answer is correct. Strategies that reliably land are reinforced.

interpretation The "chain-of-thought" you sometimes see in chat isn't introspection — it's a learned habit. Models trained with RL on math-and-code rewards generalize that habit to other domains. Sometimes it helps; sometimes it just adds latency.

10 / llm psychology

The weird, predictable ways models fail.

A working mental model of LLM "psychology" — six emergent quirks worth knowing.

Karpathy calls this "LLM psychology." Each quirk traces to a specific stage of the training pipeline — tokenizer choices, pre-training data gaps, SFT distribution, RL reward shaping. None of them are bugs to be patched at the prompt level.

first principle Treat LLMs like a brilliant, well-read collaborator with intermittent amnesia, dyslexia about characters, and an inability to know what they don't know. Build the harness around those constraints.

11 / takeaways

What changes about how you use them.

Eleven things to remember when working with frontier models in 2026.

stages Three stages, three different teams. Wrong fact = pre-training. Wrong format = SFT. Wrong reasoning = RL. The diagnosis points at the fix.
tokens Models read tokens, not letters. Spelling, character counting, exact arithmetic — all degraded by the input layer. Tools (Python, regex) are the workaround.
predict Every output is next-token prediction. "Helpful," "polite," "refuses harm" — all the same mechanism, just trained on different data.
in-context Few-shot in the prompt is real "training." No weight updates needed. Use it for adapting tone, output format, or chain-of-thought style without an API call to fine-tune.
hallucinate Hallucination is the default; refusal is the trained behavior. Lower hallucination rate ≠ epistemic awareness. Verify high-stakes outputs.
format Format is not truth. A perfectly cited paragraph with a real-looking DOI may still be made up. Don't conflate fluency with accuracy.
memory Context = RAM. Weights = frozen disk. Put fresh, private, or specific info in context; treat the weights as a stale general-knowledge disk.
rl Reasoning models think before they speak. Pay for them on hard problems with verifiable answers (math, code, scientific). Skip them for chat that just needs a fast reply.
jagged Capability is uneven. Don't extrapolate from "great at X" to "great at Y." Build domain-specific evals.
tools Tools beat scaling. Calculator for math, browser for facts, Python for transformations, RAG for private knowledge. Cheaper than waiting for a stronger model.
personality The "personality" is the labelers. The model imitates the average of the people who wrote its SFT data. Different lab → different averages → different model.