feb 2025 · andrej karpathy · 3.5-hour deep dive · distilled

LLMs,
end to end.

The full training stack, intuitively. From raw internet text to a model that reasons — with the quirks, failures, and mental models that make sense of why LLMs feel the way they do.

Source: Deep Dive into LLMs like ChatGPT · pairs with Karpathy's 2023 first-principles intro.

01 / three stages

Three stages. Three different teams.

Inside an AI lab, the model passes through three handoffs before it talks to you.

STAGE 1 · pre-training
  ~15T tokens of internet text · months · ~$millions · once a year
  → "document simulator"
STAGE 2 · supervised fine-tune
  ~millions of conversations · days · ~$thousands · weekly
  → "helpful assistant"
STAGE 3 · reinforcement learning
  verifiable problems · rewards · days–weeks · iterates
  → "reasoner"
Each stage = a different team. Each team produces an artifact for the next.
  → deployed model (what you call)

Karpathy: "different teams, different data, different cadence." Pre-training builds knowledge. SFT shapes format and helpfulness. RL shapes reasoning behavior. Most of what's new in 2024–2026 happens in stage 3.

first principle When a model surprises you (good or bad), ask which stage owns the surprise. Wrong fact? → pre-training. Wrong format / refusal style? → SFT. Wrong reasoning trajectory? → RL.

02 / pre-training data

The internet, but cleaned.

Common Crawl in. ~15T tokens of high-quality text out. The pipeline does the work.

Raw web crawl · ~petabytes
  → URL filter · remove adult, malware, spam domains
  → HTML extraction · strip menus, ads, boilerplate
  → language filter · keep target language(s) only
  → quality classifier · small model scores "is this useful?"
  → deduplication · MinHash · LSH
  → PII removal · final dedupe
  → ~15T tokens

FineWeb (Hugging Face's filtering pipeline, built on Common Crawl) implements this in the open. Each layer drops 50–95% of what enters. The output: roughly 15T tokens of text (about 44 TB on disk in FineWeb's case), largely deduplicated and scored as useful.

interpretation "Trained on the internet" undersells how filtered the input is. By the time text enters pre-training, it's been classified, deduped, and weighted by domain — closer to a curated library than a crawl dump.
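
The deduplication step named in the pipeline (MinHash) can be sketched in a few lines. This is a minimal illustration, not FineWeb's actual code: the shingle size, the number of hash functions, the salted BLAKE2b hashing, and the helper names `shingles`, `minhash_signature`, and `estimated_jaccard` are all assumptions chosen for readability.

```python
import hashlib


def shingles(text, k=3):
    """Split text into the set of overlapping k-word windows."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """One 64-bit hash function per seed (illustrative choice: salted BLAKE2b)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "transformers predict the next token from a learned probability distribution"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # near zero: unrelated
```

In a real pipeline, LSH would bucket the signatures so that only likely duplicates get compared; that part is left out here.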

03 / tokenization

Models don't read letters. They read tokens.

Byte-pair encoding chunks text into ~50k–100k atoms. Most of the model's weirdness lives here.

"hello world"
  [hello] id 15339 · [ world] id 1917
  2 tokens · case-sensitive · spaces are part of the token
"hello  world" (two spaces)
  [hello] · [ ] id 220 · [ world]
  3 tokens · whitespace shifts the entire encoding
"strawberry" → how many R's?
  [str] [aw] [berry]
  3 tokens. The model never sees the letters, only these chunks. Counting "r" requires reasoning over chunks → fails often.

Byte-pair encoding (BPE) starts from individual bytes and repeatedly merges the most frequent adjacent pair into a new token. Common words = 1 token. Rare words = many tokens. The model's whole input layer is integer IDs into a ~100k-token vocabulary; it has no native concept of "the letter R."

why models fail at math + spelling "Multiply 348 × 619" is split into a handful of tokens, and the numbers arrive as chunks whose boundaries ignore place value. The model has to do arithmetic over an opaque encoding. Tools (Python) bypass this entirely. Letter counting fails for the same reason.
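
The merge rule behind BPE can be shown on a toy corpus. This is a simplified sketch that works on characters of a small word list rather than on bytes of real text, and the names `most_frequent_pair`, `merge_pair`, and `train_bpe` are hypothetical helpers, not any tokenizer library's API.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the top one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Start from single characters and apply num_merges greedy merges."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


corpus = ["low", "low", "lower", "lowest", "newest", "widest"]
merges, tokenized = train_bpe(corpus, num_merges=2)
for word, tokens in zip(corpus, tokenized):
    print(word, tokens)
```

After two merges the frequent word "low" has collapsed into a single token while the rarer "widest" is still six separate symbols, which is the "common words = 1 token, rare words = many tokens" effect in miniature.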

04 / the transformer

The architecture is fully understood. The weights aren't.

A transformer is just a function: tokens in, probability distribution over the next token out. The 100B parameters inside are the mystery.

input tokens: The · cat · sat · on · the
  → transformer · ~100B parameters
    we know the architecture. we don't know what the weights mean.
  → next-token probability
    mat 0.97 · floor 0.02 · chair 0.005 · rug 0.003 · … ~100,000 more tokens, each with a tiny probability
"sample" = pick one according to the distribution → feed back as new input

The whole loop: tokens → forward pass → probability over the next token → sample → append → repeat. That's it. "Generation" is iterating this. The architecture is a few hundred lines of math; the magic is in the 100B numbers we can't interpret.
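
The loop described above can be sketched with a stand-in for the model. Everything here is a toy assumption: `TOY_MODEL` is a hand-written lookup table that only looks at the last token, where a real transformer would condition on the whole context, and the `<end>` marker is a hypothetical stop token.

```python
import random

# Hypothetical stand-in for a trained transformer: for each last token,
# a probability distribution over the next token.
TOY_MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    """'Forward pass': map the context to next-token probabilities."""
    return TOY_MODEL.get(tokens[-1], {"<end>": 1.0})


def sample(distribution, rng):
    """Pick one token at random, weighted by its probability."""
    candidates, probabilities = zip(*distribution.items())
    return rng.choices(candidates, weights=probabilities, k=1)[0]


def generate(prompt, max_new_tokens=20, seed=0):
    """tokens -> distribution -> sample -> append -> repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        distribution = next_token_distribution(tokens)
        token = sample(distribution, rng)
        if token == "<end>":
            break
        tokens.append(token)
    return tokens


print(" ".join(generate(["the"])))
```

Changing the seed changes the continuation, because each step draws from a distribution instead of always taking the most likely token.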

interpretability The field of mechanistic interpretability tries to reverse-engineer what circuits inside the network do. Real progress, but most of the model is still a fog. Don't rely on a "rationale" the model offers about itself — it's a generated rationale, not a reading of its own weights.

05 / the base model

A document simulator, not an assistant.

After pre-training, the model can do one thing: continue any internet-shaped text. That's surprisingly powerful — and surprisingly limited.

prompt: bare question
  What is the capital of France?
  → base model continues:
  What is the capital of Germany? What is the capital of Italy? What is the capital of Spain?
  It's continuing a quiz. Not answering.
prompt: few-shot pattern
  apple → 사과
  water → 물
  teacher → 선생님
  school → 학교
  book →
  → base model continues: 책 (the Korean word for "book")
  "In-context learning": pattern-match, no weight updates needed.
  This is why the assistant trick works: give it lots of "user/assistant" pairs.

The base model's only job is "predict the next token in the same distribution as training." It's already capable of in-context learning — pattern-matching from prompt examples — which is the conceptual seed for both few-shot prompting and the SFT trick.
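
A few-shot prompt is just text laid out so that the most natural continuation is the answer. The sketch below builds the English-to-Korean prompt from the example; `few_shot_prompt` is a hypothetical helper, and sending the prompt to a base model is left out because that depends on whichever API you use.

```python
def few_shot_prompt(examples, query, arrow=" → "):
    """Lay out solved examples, then leave the last line open for the model."""
    lines = [f"{source}{arrow}{target}" for source, target in examples]
    # The final line has a source but no target: the model fills it in.
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)


examples = [
    ("apple", "사과"),
    ("water", "물"),
    ("teacher", "선생님"),
    ("school", "학교"),
]
prompt = few_shot_prompt(examples, "book")
print(prompt)
```

The same layout idea carries over to the assistant trick: replace the word pairs with "user/assistant" turns and leave the last assistant turn open.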

first principle Every behavior an LLM shows you starts as next-token prediction. "Be helpful" is just "produce tokens that look like the helpful examples in the training distribution." Nothing about it is qualitatively different from autocomplete.

06 / hallucination

The model has no concept of "I don't know."

Sampling is probabilistic. Every output is a guess, including the confident ones.

"Who is Tom Cruise?"
  → accurate, real answer
"Who is Orson Kovats?" (a name the model has never seen)
  "Orson Kovats was an American novelist, born in Chicago in 1958, best known for his trilogy 'The Glass Compendium' (1991)..."
  → confidently fabricated. Each token's probability is highest for "novelist," "1958," "trilogy"...
THE MECHANICAL FIX
  Add explicit "I don't know" examples to SFT
  USER: Who is Orson Kovats?
  ASSISTANT: I don't have information about that person. This may be a fictional name.
  → probe the base model: which facts is it uncertain about?
  → for those, train the assistant to refuse, not invent.
Hallucination isn't a bug to suppress; it's the default. Refusal is a behavior that has to be explicitly trained in.

Modern frontier models (GPT-5, Claude 4) hallucinate less in part because labs run an explicit "uncertainty audit": probing the model for facts it's unsure about, then SFT-training the assistant to say "I don't know" on those probes. The mechanism isn't introspection; it's a learned reflex.

for your work A lower hallucination rate ≠ epistemic awareness. The model learned to refuse on a distribution similar to your prompt — for prompts outside that distribution, it'll happily fabricate again. Always pair high-stakes outputs with verification.
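
The probe-then-train idea can be sketched as a data-building step. This is an assumption-heavy illustration: `ask_model` is a hypothetical callable standing in for a real model, the substring match is a crude stand-in for a proper answer grader, and a reference answer of `None` marks a made-up entity for which any confident answer counts as wrong.

```python
REFUSAL = "I don't have information about that person."


def build_refusal_examples(ask_model, probes, samples=3):
    """Return SFT conversations that teach refusal where the model never gets it right.

    ask_model: hypothetical callable, question -> answer string.
    probes: list of (question, reference_answer or None) pairs.
    """
    sft_examples = []
    for question, reference in probes:
        answers = [ask_model(question) for _ in range(samples)]
        if reference is None:
            hits = 0  # nothing true to say, so every confident answer is a miss
        else:
            hits = sum(reference.lower() in answer.lower() for answer in answers)
        if hits == 0:
            sft_examples.append({"user": question, "assistant": REFUSAL})
    return sft_examples


def toy_model(question):
    """Toy stand-in: knows one fact, fabricates everything else."""
    known = {"What is Tom Cruise's profession?": "Tom Cruise is an American actor."}
    return known.get(question, "Orson Kovats was an American novelist.")


probes = [
    ("What is Tom Cruise's profession?", "actor"),
    ("Who is Orson Kovats?", None),
]
print(build_refusal_examples(toy_model, probes))
```

Only the question the toy model fabricates on ends up as a refusal example, which mirrors the point above: refusal is trained on specific probes, so it does not transfer automatically to questions outside that distribution.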

07 / sft

SFT: conversations as the recipe.

Same algorithm as pre-training. Different data. The data is the entire shape of the assistant.

SFT example · written by a labeler
  USER: Can you write a short intro about monopsony in economics?
  ASSISTANT: In labor economics, "monopsony" describes a market in which a single buyer (typically an employer) faces many sellers (workers). Like its mirror image, monopoly, it leads to inefficient outcomes...
Labeler follows a guideline doc that defines tone, length, refusal behavior, citation style, etc.
A modern lab uses LLMs to draft these and humans to edit/select. Increasingly machine-assisted, but the curation is still human.
multiply by ~1M examples → Every dimension of "assistant behavior" is determined by the distribution of these conversations.

SFT is "show, don't tell." Want the model to refuse harmful requests? Include refusals. Want it to use Markdown? Use Markdown in answers. Want shorter outputs? Make examples shorter. The labeler-guideline doc is the personality spec.
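
"Same algorithm, different data" means a conversation is flattened into one token stream and trained on with ordinary next-token prediction. The sketch below shows only the flattening. The delimiter strings `<|user|>`, `<|assistant|>`, and `<|end|>` are hypothetical placeholders; each real model defines its own special tokens.

```python
def render_conversation(turns):
    """Flatten (role, text) turns into one training string with role markers."""
    parts = []
    for role, text in turns:
        # Hypothetical delimiters; real tokenizers reserve their own special tokens.
        parts.append(f"<|{role}|>{text}<|end|>")
    return "".join(parts)


conversation = [
    ("user", "Can you write a short intro about monopsony in economics?"),
    ("assistant", 'In labor economics, "monopsony" describes a market with a single buyer.'),
]
print(render_conversation(conversation))
```

Once rendered this way, an SFT example looks to the training loop like any other document, which is why changing the examples changes the assistant.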

first principle The assistant you're talking to is, statistically, an imitation of the labelers Scale AI / Anthropic / OpenAI hired. The model's "values" are the average values expressed in the training conversations. Different labeler pool → different model.

08 / rl

RL: from tutor to coach.

SFT shows the model what right answers look like. RL teaches it to find them on its own.

SFT · imitation
  USER: solve x² + 5x + 6 = 0
  ASSISTANT: by factoring, (x+2)(x+3) = 0 → x = −2, x = −3
  Train: copy the human's reasoning.
  Constraint: humans bias the trace. A human's path may not be the LLM's best path.
RL · self-discovery
  USER: solve x² + 5x + 6 = 0
  model samples 50 attempts...
  attempt 1: completes the square ✓
  attempt 2: tries quadratic formula ✓
  attempt 3: factors directly ✓
  attempt 4: confused, gives up ✗
  attempt 5: arithmetic error ✗
  Train: reward whichever path lands the right answer.
  Model discovers reasoning strategies humans wouldn't write.
SFT is bounded by the human teacher. RL isn't.

The RLHF predecessor used human comparisons; the new RL wave uses verifiable rewards. For math, code, and reasoning problems where you can check the answer, the model can self-discover better reasoning paths than any human would have written down.
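
What makes a reward "verifiable" is that a program can check the answer. The sketch below samples attempts at x² + 5x + 6 = 0 and scores them automatically. It is a toy under stated assumptions: `STRATEGIES` is a hypothetical stand-in for the model's sampled solution paths, and the weight update a real RL algorithm would perform is replaced by simply reporting the average reward per strategy.

```python
import random


def verify(roots):
    """Verifiable reward: 1.0 only if the roots are exactly the solutions of x^2 + 5x + 6 = 0."""
    all_satisfy = all(x * x + 5 * x + 6 == 0 for x in roots)
    return 1.0 if all_satisfy and set(roots) == {-2, -3} else 0.0


# Hypothetical solution paths a model might sample, each returning proposed roots.
STRATEGIES = {
    "factor": lambda: [-2, -3],
    "quadratic formula": lambda: [(-5 + 1) // 2, (-5 - 1) // 2],  # sqrt(25 - 24) = 1
    "arithmetic slip": lambda: [2, 3],
    "gives up": lambda: [],
}


def collect_rewards(num_attempts=50, seed=0):
    """Sample attempts, score each with the verifier, and average per strategy."""
    rng = random.Random(seed)
    totals = {name: [0.0, 0] for name in STRATEGIES}
    for _ in range(num_attempts):
        name = rng.choice(list(STRATEGIES))
        totals[name][0] += verify(STRATEGIES[name]())
        totals[name][1] += 1
    return {
        name: (total / count if count else 0.0)
        for name, (total, count) in totals.items()
    }


for name, average in collect_rewards().items():
    print(f"{name}: average reward {average}")
```

No human judged any attempt; the verifier alone separates the paths worth reinforcing from the ones that are not.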

why this matters in 2026 Reasoning models (o1, o3, Claude extended thinking, DeepSeek R1, Gemini Deep Think) are all trained with RL on verifiable rewards. The breakthrough wasn't a new architecture — it was finding tasks where you can score model output without a human in the loop.

09 / reasoning

Long internal chains-of-thought.

RL on verifiable problems creates models that "think out loud" before answering. The thinking can be 10× longer than the answer.

USER: A train leaves NYC at 60 mph. Another leaves Boston at 80 mph in the opposite direction. They pass after 1.5 h. How far apart were the cities?
ASSISTANT (thinking, hidden):
  Let me think. They're moving toward each other, so closing speed is 60 + 80 = 140 mph.
  Wait, opposite direction means they're moving apart? No, "pass" means they meet. So they're moving toward each other.
  Closing speed 140 mph. Time to meet: 1.5 hours. Distance closed: 140 × 1.5 = 210 miles. That's the original distance between them.
  Let me double-check with another approach. Train A travels 60 × 1.5 = 90 miles. Train B travels 80 × 1.5 = 120 miles. Total: 90 + 120 = 210 miles. ✓
  OK, 210 miles.
ASSISTANT (visible): The cities were 210 miles apart.
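
The double-check in the trace is the kind of thing a tool can confirm exactly. A small sketch of both routes to the answer, with hypothetical function names:

```python
def distance_by_closing_speed(speed_a_mph, speed_b_mph, hours):
    """Trains approach each other, so their speeds add."""
    return (speed_a_mph + speed_b_mph) * hours


def distance_by_sum(speed_a_mph, speed_b_mph, hours):
    """Add up what each train covers on its own."""
    return speed_a_mph * hours + speed_b_mph * hours


first = distance_by_closing_speed(60, 80, 1.5)
second = distance_by_sum(60, 80, 1.5)
print(first, second)  # 210.0 210.0
```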

Reasoning models spend most of their compute "thinking" — sampling 100s–1000s of internal tokens before producing the visible answer. The internal trace gets graded (during training) on whether the final answer is correct. Strategies that reliably land are reinforced.

interpretation The "chain-of-thought" you sometimes see in chat isn't introspection — it's a learned habit. Models trained with RL on math-and-code rewards generalize that habit to other domains. Sometimes it helps; sometimes it just adds latency.

10 / llm psychology

The weird, predictable ways models fail.

A working mental model of LLM "psychology" — six emergent quirks worth knowing.

tokenizer blindness
  "How many R's in strawberry?" fails because the model sees [str][aw][berry], not letters.
  FIX: give it Python or a tool that operates on characters.
swiss-cheese knowledge
  Holes in random places. Confidence is no signal; the model can't see its holes.
  FIX: RAG for known-domain facts; probe with verifiable questions.
context as working memory
  Anything in the prompt is fast and malleable. Anything in the weights is slow and frozen.
  USE: put recent / private / specific info in context, not in weights.
format-confidence trap
  Perfect citations, ISBNs, DOIs, code that imports nonexistent libraries: all delivered with the same confidence.
  FIX: verify all citations; run code; grep for hallucinated symbols.
jagged frontier
  Solves PhD problems, fails 4th-grade ones. Capability is not uniform across domains.
  USE: don't extrapolate from "smart in X" to "smart in Y." Test each.
stochastic personality
  Same prompt, different runs, different opinions / refusals / depths.
  USE: temperature 0 for near-deterministic output; sample N for diversity.

Karpathy calls this "LLM psychology." Each quirk traces to a specific stage of the training pipeline — tokenizer choices, pre-training data gaps, SFT distribution, RL reward shaping. None of them are bugs to be patched at the prompt level.
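
The "stochastic personality" quirk comes straight from how sampling temperature reshapes the next-token distribution. A minimal sketch, assuming raw scores (`logits`) for three candidate tokens and treating temperature 0 as "always take the top score"; the function name is hypothetical and negative temperatures are not handled.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Return the index of the chosen token.

    temperature == 0 is treated as greedy decoding (argmax).
    Higher temperatures flatten the distribution and add variety.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5]
print([sample_with_temperature(logits, 0, rng) for _ in range(5)])    # always index 0
print([sample_with_temperature(logits, 1.0, rng) for _ in range(5)])  # varies with the seed
```

Hosted models can still show small run-to-run differences at temperature 0 for reasons outside the sampler, so treat it as near-deterministic instead of guaranteed.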

first principle Treat LLMs like a brilliant, well-read collaborator with intermittent amnesia, dyslexia about characters, and an inability to know what they don't know. Build the harness around those constraints.

11 / takeaways

What changes about how you use them.

Eleven things to remember when working with frontier models in 2026.

  1. stages Three stages, three different teams. Wrong fact = pre-training. Wrong format = SFT. Wrong reasoning = RL. The diagnosis points at the fix.
  2. tokens Models read tokens, not letters. Spelling, character counting, exact arithmetic — all degraded by the input layer. Tools (Python, regex) are the workaround.
  3. predict Every output is next-token prediction. "Helpful," "polite," "refuses harm" — all the same mechanism, just trained on different data.
  4. in-context Few-shot in the prompt is real "training." No weight updates needed. Use it for adapting tone, output format, or chain-of-thought style without an API call to fine-tune.
  5. hallucinate Hallucination is the default; refusal is the trained behavior. Lower hallucination rate ≠ epistemic awareness. Verify high-stakes outputs.
  6. format Format is not truth. A perfectly cited paragraph with a real-looking DOI may still be made up. Don't conflate fluency with accuracy.
  7. memory Context = RAM. Weights = frozen disk. Put fresh, private, or specific info in context; treat the weights as a stale general-knowledge disk.
  8. rl Reasoning models think before they speak. Pay for them on hard problems with verifiable answers (math, code, scientific). Skip them for chat that just needs a fast reply.
  9. jagged Capability is uneven. Don't extrapolate from "great at X" to "great at Y." Build domain-specific evals.
  10. tools Tools beat scaling. Calculator for math, browser for facts, Python for transformations, RAG for private knowledge. Cheaper than waiting for a stronger model.
  11. personality The "personality" is the labelers. The model imitates the average of the people who wrote its SFT data. Different lab → different averages → different model.
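
Takeaways 2 and 10 in practice: the two tasks that tokenization makes hard for a model are one-liners for a tool. A short sketch (the helper name `count_letter` is illustrative):

```python
def count_letter(word, letter):
    """Count occurrences of a letter, ignoring case, by looking at actual characters."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
print(348 * 619)                        # 215412
```

When a model has a code tool available, asking it to compute through the tool instead of "in its head" sidesteps the tokenizer entirely.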