may 2024 — apr 2026 · field guide for data scientists

The architecture of intelligence.

What's actually in frontier model training data — and what changes about your job because of it.

00 / primer

Two stages. Different objectives. Different data.

Pre-training builds the substrate. Post-training shapes the personality. The rest of this guide flips between them — labels mark which.

PRE-TRAINING · learn the language and the world
  scale: tens of trillions of tokens
  data: filtered web · licensed corpora · code · multimodal
  methods: classifiers · dedupe · domain upsampling · early fusion
  base model → now teach it how to behave

POST-TRAINING · shape behavior, alignment, reasoning
  scale: millions–billions of tokens · ~1,000× smaller
  data: expert-written reasoning + synthetic-graded outputs
  methods: SFT · RLHF · RLAIF · DPO · On-Policy Distillation
  aligned → what you use

Most of what you experience as "model behavior" was decided in the bottom lane (post-training).

Pre-training is the slow, expensive baseline — tens of trillions of tokens, weeks of gigawatt-scale compute. Post-training is comparatively cheap and iterates often. It's where the model learns to refuse, reason, format, and follow instructions. Same base + different post-training = different model "personalities."

why this matters: When two models behave differently on the same prompt, the cause is usually post-training, not pre-training. The base substrate is similar across the frontier; the alignment is each lab's signature.

01 / paradigm

From scale to curation.

Scraping the open web wholesale stopped working. The new race is what to throw away.

The curation funnel, top to bottom:
  1. raw web crawl · hundreds of trillions of pages
  2. deduplication · MinHash · LSH
  3. classifier filter · PII · toxicity · low-fidelity
  4. domain upsampling · code · math · reasoning
  5. + licensed · expert · synthetic
  6. core training mix
Llama 4: heavy SFT pruning, hard examples only

The numbers behind the funnel. Llama 3: ~15T tokens. Llama 4 Behemoth: 30T+ tokens (per-model token counts for Maverick and Scout are not separately disclosed at that scale). Meta has reported aggressive SFT pruning across the herd. The reporting is directional: the largest cuts were on Behemoth, but the exact percentages haven't been published in Meta's main blog post. DeepSeek V3 trained on 14.8T high-quality tokens; V4 more than doubled that to 32T, with even harsher dedupe.
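The deduplication step in the funnel can be illustrated with a minimal MinHash sketch. This is a toy under stated assumptions, not any lab's pipeline: the 3-word shingles, the 64 hash functions, the 0.7 threshold, and every function name here are illustrative choices, and the all-pairs comparison stands in for the LSH banding a production system would use.

```python
import hashlib
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles for a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash: the seed is prepended to the hashed bytes."""
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value under each of num_hashes seeded hashes."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(docs: dict[str, str], threshold: float = 0.7) -> list[tuple[str, str]]:
    """Compare all pairs (O(n^2)); real pipelines use LSH to skip most pairs."""
    sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sigs), 2)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    ]
```

Calling `near_duplicates` on a dict of named documents returns the pairs whose estimated similarity clears the threshold; one document from each pair would then be dropped before the classifier filter.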

first principle: More data isn't the goal; more signal density per token is. A token wasted on internet noise is a token a sparse expert can't learn rare reasoning from.

02 / data streams

Four streams flow into every modern model.

The mix is the moat.

  • filtered public web · pre · CommonCrawl · Reddit · books · code
  • licensed corpora · pre · publishers · news · academic · social graphs
  • human-expert generation · post · contractors writing reasoning traces · medical · legal
  • synthetic + self-generated · post · RLAIF · distilled traces · agent-in-sandbox · constitutional grading
All four flow into the model weights, frozen on training day.

The synthetic stream is now the fastest-growing. Anthropic's RLAIF: a Claude model (with the constitution as its system prompt) ranks Claude outputs against the written principles — same generation, different role. OpenAI: hires expert contractors specifically to demonstrate advanced math/code, then expands their traces synthetically. Meta: Behemoth (~2T params) generates distillation targets for Scout and Maverick.

for your work: If you fine-tune on outputs from a frontier API, you're not training on "ground truth"; you're training on a particular model's opinions, including its blind spots. Treat synthetic data as a calibration step, not a source of truth.

03 / peak data

Human text ran out. Models started feeding themselves.

The synthetic flywheel: faster than human writing, with one structural risk.

model v(n) → outputs + traces → curated + graded → model v(n+1) → repeat
why it works: flawless reasoning trajectories at infinite scale.
⚠ silent risk: biases of v(n) become priors of v(n+1), and amplify each generation.

This loop is now the dominant alignment workhorse. Anthropic uses it via the Constitution. OpenAI uses it for "safe-completions" to avoid over-refusing. Meta's Behemoth distills into Scout/Maverick. The 2026 Stanford AI Index calls this the response to "peak data" — the asymptote of high-quality human text.

interpretation: A model trained on its own outputs isn't getting smarter about the world; it's getting smarter about what models like itself produce. Distinguish those two things when evaluating outputs.
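The amplification risk in the flywheel can be made concrete with a toy recurrence. This is a deliberately simplified sketch, not a model of any lab's loop: it assumes curation favors whatever the current model already produces most often, modeled by raising each share to a hypothetical `sharpen` exponent and renormalizing.

```python
def next_generation(p: float, sharpen: float = 1.5) -> float:
    """Share of outputs carrying a given bias after one turn of the loop.

    Toy assumption: grading prefers the majority behavior of v(n), which is
    modeled as p**sharpen versus (1 - p)**sharpen, renormalized to sum to 1.
    """
    biased = p ** sharpen
    unbiased = (1.0 - p) ** sharpen
    return biased / (biased + unbiased)


def simulate(p0: float, generations: int, sharpen: float = 1.5) -> list[float]:
    """Return the bias share for v(0) through v(generations)."""
    shares = [p0]
    for _ in range(generations):
        shares.append(next_generation(shares[-1], sharpen))
    return shares


if __name__ == "__main__":
    for n, share in enumerate(simulate(0.55, 8)):
        print(f"v({n}): {share:.3f}")
```

With `sharpen` above 1, any starting share above one half drifts toward 1 across generations; with `sharpen` equal to 1 the share stays put. The point is only the direction of drift, not the specific numbers.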

04 / labs · pipelines

Each lab's pre vs post recipe.

What we know — and don't know — about how the four major frontier labs structure each stage.

Meta · Llama 4
  • pre-training: 22–40T tokens · 200 languages, 100+ of them above 1B tokens each · native early-fusion multimodal · trained on Meta's CDN-scale crawl + licensed data.
  • post-training: Lightweight SFT + online RL + DPO. Meta has reported aggressive SFT pruning across the herd (largest cuts on Behemoth); the exact percentages haven't appeared in the main public blog. Behemoth co-distilled into Scout and Maverick.

DeepSeek · V4
  • pre-training: 32T tokens (more than double V3's 14.8T) · aggressive MinHash dedupe · domain upsampling for code, math, reasoning · for scale, V3 ran on an H800 cluster for 2.788M GPU-hours.
  • post-training: Two-stage: independent domain-expert SFT + RL, then On-Policy Distillation (OPD) merges the experts into a unified model. Pairs with the new Engram conditional-memory architecture.

Anthropic · Claude 4
  • pre-training: Proprietary mix of filtered internet + licensed corpora. Cutoffs Jul 2025 (Haiku 4.5) and Jan 2026 (Opus 4.7). New tokenizer in 4.7.
  • post-training: Heavy synthetic generation + RLAIF: a Claude model (with the Constitution in its system prompt) ranks Claude outputs against the written principles. Enables ASL-3 deployment for Opus, ASL-2 for Sonnet.

OpenAI · GPT-5
  • pre-training: Token count not disclosed by OpenAI; analyst estimates put it in the ~100T range. Filtered web + heavily licensed publishers/academic + expert-contractor demonstrations. Native multimodal (text, image, audio) ingestion.
  • post-training: RL trains an internal router (fast vs reasoning model). Safe-completions data avoids over-refusal. Synthetic reasoning paths teach backtracking and strategy switching. METR verified no incentive to sandbag.

the pattern: All four labs converged on the same structure: massive pre-training corpus + post-training built on synthetic reasoning + RL grading. The differences are in the grading function (Anthropic's Constitution vs OpenAI's safe-completions) and the merge strategy (Meta's distillation vs DeepSeek's OPD).

05 / freshness

Most of these models stopped learning a year ago.

Knowledge cutoffs, plotted against today.

[Timeline figure. Axis ticks: Aug ’24 · Jan ’25 · Aug ’25 · Jan ’26 · today. Models plotted, in order of appearance: Llama 4 · GPT-5 (initial) · DeepSeek V3 · Claude Haiku 4.5 · GPT-5.4 · Mistral Large 3 · Claude Opus 4.7 · Grok 5 (live X firehose, no cutoff). Distance from "today" = how stale the model's worldview is.]

Llama 4 is closer to two years stale than one. Most flagship models froze their worldview between mid-2024 and early 2026. Grok is the outlier — xAI's bet on the X firehose pays off as a freshness hedge but introduces its own (severe) demographic priors.

for your work: Treat any time-sensitive query as suspect. Build a small eval that asks for known post-cutoff facts and count the fabrications. Use that as a freshness audit before trusting a model on news, prices, library versions, or recent research.
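A freshness audit of this kind could look like the sketch below. Everything named here is an assumption for illustration: `ask_model` is a hypothetical callable wrapping whatever model you test, the abstention phrases are a crude placeholder list, and substring matching is the simplest possible grader. Real probes would use facts you have verified yourself and dated after the cutoff.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FreshnessProbe:
    question: str
    accepted_answers: list[str]  # verified by you, dated after the model's cutoff


# Placeholder phrases that count as an honest "I can't know that".
ABSTAIN_MARKERS = ("i don't know", "not sure", "after my knowledge cutoff")


def classify(answer: str, probe: FreshnessProbe) -> str:
    """Label one answer as 'correct', 'abstained', or 'fabricated'."""
    text = answer.lower()
    if any(accepted.lower() in text for accepted in probe.accepted_answers):
        return "correct"
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "fabricated"  # a confident answer that matches nothing we verified


def freshness_audit(
    ask_model: Callable[[str], str], probes: list[FreshnessProbe]
) -> dict[str, int]:
    """Run every probe through the model and count each outcome."""
    counts = {"correct": 0, "abstained": 0, "fabricated": 0}
    for probe in probes:
        counts[classify(ask_model(probe.question), probe)] += 1
    return counts
```

The number to watch is `fabricated`: an abstention is a stale model behaving well, while a fabrication is a stale model that does not know it is stale.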

06 / architecture

Trillions of params, but only a slice fires per token.

Sparse Mixture-of-Experts is the dominant frontier architecture. It changes how you reason about evaluation.

DENSE TRANSFORMER · every neuron fires every token · active = total params · predictable, expensive
SPARSE MIXTURE OF EXPERTS · router activates a few experts per token
active / total params per token: 17B / 400B · 49B / 1.6T · 288B / ~2T

Llama 4 Maverick: 17B active of 400B total. DeepSeek V4-Pro: 49B of 1.6T. Behemoth: 288B of nearly 2T. Different prompts route to different experts — meaning latency, quality, and even hallucination patterns can shift across topics within the same model.
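To make "only a slice fires per token" concrete, here is a minimal sketch of top-k gating, a common way to describe MoE routing. It is an assumption that any specific model above routes exactly this way; the function names and the choice of k = 2 are illustrative. The `active_fraction` helper just divides the active by total figures quoted above.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over router scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per token, both given in billions."""
    return active_params_b / total_params_b


if __name__ == "__main__":
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
    for name, active, total in [
        ("Llama 4 Maverick", 17, 400),
        ("DeepSeek V4-Pro", 49, 1600),
        ("Behemoth (total is approximate)", 288, 2000),
    ]:
        print(f"{name}: {active_fraction(active, total):.1%} active per token")
```

Only the selected experts run for that token, so cost tracks active parameters while the experts that happen to be selected shape quality on a given input.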

interpretation: "Parameter count" is not the comparable metric anymore. Two MoEs of identical total size can behave very differently if their experts specialize differently. For your evals, sweep prompt domains: a model that's brilliant on code can route into a weak expert on legal text.
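A domain sweep can be as simple as grouping eval scores by prompt domain and flagging the laggards. The sketch below assumes you already have (domain, score) pairs from your own eval; the function names and the 0.10 tolerance are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sweep_by_domain(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average score per prompt domain from (domain, score) pairs."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for domain, score in results:
        buckets[domain].append(score)
    return {domain: mean(scores) for domain, scores in buckets.items()}


def weak_domains(per_domain: dict[str, float], tolerance: float = 0.10) -> list[str]:
    """Domains scoring more than `tolerance` below the best domain."""
    best = max(per_domain.values())
    return sorted(d for d, s in per_domain.items() if best - s > tolerance)
```

A single aggregate score would hide exactly the gap this surfaces: a model can average well while one domain sits far below the rest.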

07 / multimodal

Vision and text now share the same brain from birth.

Early fusion vs late fusion is the difference between a polyglot and a phrasebook.

LATE FUSION · separate encoders, joined late
  text → text enc · image → vision enc · audio → audio enc → merge
  three brains shake hands at the end
EARLY FUSION · tokens interleaved into one backbone
  txt img txt txt aud img txt txt → unified transformer
  cross-modal attention from layer one · one shared latent space

All three frontier families — Llama 4, GPT-5, and Gemini 3.1 — are early fusion. Llama 4 trained on up to 48 images per context window, blended with text from token zero. Gemini 3.1 Pro can ingest 3,000 images, 8.4 hours of audio, or one hour of video per prompt — a scale of cross-modal ingestion only possible with a unified backbone. Late fusion (the older "stitch a vision encoder onto a text model" pattern) is no longer competitive at the frontier.

for your work: If your task is genuinely cross-modal (annotate-this-chart, transcribe-and-summarize, video-Q&A), early-fusion models will reason about the relationship natively. Late-fusion models will paper over gaps with templated phrasing. Test cross-modal grounding explicitly.

08 / disclosure

The industry knows less about itself than it did last year.

Stanford's Foundation Model Transparency Index, 2024 → 2025.

[Chart: Foundation Model Transparency Index, points out of 100, 2024 → 2025]
  • IBM 95 ↑
  • AI21 ~75
  • Anthropic 70 ↑
  • avg ↓ 58 → 40
  • Meta 31 ↓ –29
  • Mistral 18 ↓ –37
  • xAI 14

Industry average dropped from 58 to 40. IBM is the lone outlier going up — they publish enough detail for a third party to replicate their training. Meta dropped 29 points, mostly because they didn't release a Llama 4 data report. Mistral dropped 37, abandoning the disclosures that defined Mixtral. Ten of thirteen tracked labs (including OpenAI, Google, Anthropic, Amazon, xAI) disclosed zero environmental impact data.

interpretation: "Open weights" ≠ open data. Llama 4 weights ship under Meta's community license. The ratios of code vs. text vs. video, the licensing partners, the synthetic mix: those are corporate secrets. Do not assume open weights means you can audit what's in the model.

09 / bias surfaces

Five doors. Watch all five.

Where bias enters the model — and which doors are addressable, which are structural.

Five doors into the model weights (frozen on training day):
  • crawl bias · which sites get scraped · addressable with audits / counterfactual sets
  • license bias · which publishers will sell · addressable with audits / counterfactual sets
  • annotator bias · who labels and reasons · addressable with pipeline design
  • synthetic-loop bias · model amplifies its own priors · addressable with pipeline design
  • cutoff bias · frozen worldview · structural, only RAG / fine-tune mitigates

Llama 4's 200-language coverage tackled crawl bias. Grok 5 introduced an extreme one: its X firehose carries political and demographic skews and leans hyper-Western. Every door has a different mitigation: counterfactual eval sets for crawl/license, pipeline audit for annotator/synthetic, RAG / tool-use for cutoff.

for your work: Build a bias audit set for YOUR domain. Five families of probes, one per door. If your application sits in healthcare, finance, or any high-stakes domain, demographic and temporal probes belong in CI, not just a one-time check.
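One way a demographic probe could run in CI is a counterfactual check: ask the same question with only one attribute changed and fail the build if the answers diverge. This is a minimal sketch under assumptions: `ask_model` is a hypothetical callable for your model, the template and group labels are placeholders, and exact-match comparison only suits constrained answers such as yes/no.

```python
from typing import Callable

# Hypothetical probe: only the {group} slot changes between variants.
TEMPLATE = (
    "A candidate from {group} has five years of Python experience. "
    "Should they advance to interview? Answer yes or no."
)
GROUPS = ["Region A", "Region B"]


def counterfactual_answers(
    ask_model: Callable[[str], str], template: str, groups: list[str]
) -> dict[str, str]:
    """Ask the same question once per group and normalize the answers."""
    return {g: ask_model(template.format(group=g)).strip().lower() for g in groups}


def is_consistent(answers: dict[str, str]) -> bool:
    """True when every counterfactual variant received the same answer."""
    return len(set(answers.values())) <= 1


def check_probe(
    ask_model: Callable[[str], str],
    template: str = TEMPLATE,
    groups: list[str] = GROUPS,
) -> None:
    """Raise AssertionError so a test runner marks the build as failed."""
    answers = counterfactual_answers(ask_model, template, groups)
    assert is_consistent(answers), f"answers diverge across groups: {answers}"
```

Wrapping `check_probe` in a test function for each probe family gives one failing test per door, which is easier to triage than a single aggregate bias score.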

10 / memory

The weights are a snapshot. Updating them costs a power plant.

Why continuous learning is structurally hard — and what an early fix looks like.

TODAY · MONOLITHIC WEIGHTS
  weights frozen at Jan 2026 · new facts bounce off · only retraining writes
DEEPSEEK V4 · ENGRAM CONDITIONAL MEMORY
  reasoning (MoE) · frozen · how to think
  hash table · O(1) key → value · facts · cheap to update (+fact, +fact)
knowing ≠ reasoning · update one without retraining the other

# why it's hard

  • loss spikes during training (DeepSeek used SwiGLU clamping to recover from these)
  • catastrophic forgetting on naive fine-tune
  • petabyte I/O at sub-ms latency to feed GPU clusters
  • Grok 4 training: ~150,000 tons of CO2eq for one snapshot (Epoch AI est.)

# what's emerging

  • DeepSeek V4 Engram: facts in O(1) hash, reasoning in MoE
  • Grok 5 reality engine: cross-references X firehose at query time
  • RAG & tool-use as the practical decoupling pattern
  • mid-training: extending context via specialized data, not retraining

first principle: Treat a model as a frozen reasoner plus a (mostly external) knowledge layer. Updates to the world go through retrieval and tools, not weight changes. The architectures that bake this split in (Engram, real-time grounding) are the early signal of where the field is going.
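The frozen-reasoner-plus-knowledge-layer split can be sketched at the application level with a plain key-value store. This is not a description of Engram or of any lab's architecture; `FactStore` and `grounded_prompt` are hypothetical names, and a Python dict stands in for the hash table, giving average O(1) updates and lookups.

```python
from typing import Optional


class FactStore:
    """External knowledge layer: updated without touching model weights."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def upsert(self, key: str, value: str) -> None:
        """Insert or overwrite a fact; keys are case-insensitive."""
        self._facts[key.lower()] = value

    def lookup(self, key: str) -> Optional[str]:
        """Return the stored fact, or None when the key is unknown."""
        return self._facts.get(key.lower())


def grounded_prompt(question: str, key: str, store: FactStore) -> str:
    """Build a prompt whose facts come from the store, not the weights."""
    fact = store.lookup(key)
    context = fact if fact is not None else "No stored fact; say so rather than guess."
    return f"Context: {context}\nQuestion: {question}"
```

When the world changes, only `upsert` runs; the model that receives the prompt stays frozen and is asked to reason over the supplied context.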

11 / cost

Two orders of magnitude separate the most efficient training run from the least.

Same generation. Same benchmark class. Wildly different footprints.

Estimated training emissions:
  • Grok 4 · ~150,000 t CO2eq
  • Llama 4 Behemoth · undisclosed · >100 MW draw
  • DeepSeek V3 · ~597 t (est.)
  • avg car lifetime · 63 t
~250× more emissions for the brute-force run · same generation of model

Algorithmic efficiency is now an environmental vector. DeepSeek V3 trained on 14.8T tokens for 2.788M H800 GPU-hours; the 597 t CO2eq number is a third-party estimate from those GPU-hours, not from the V3 paper. Grok 4's ~150,000 t (Epoch AI) used Colossus's 100k–200k H100 cluster — financial cost in the hundreds of millions, carbon cost ~2,400× a single car's lifetime.

for your work: Add cost-per-benchmark-point and CO2-per-benchmark-point to your eval rubric. A model that's 2 points stronger but 50× more expensive is rarely the right pick for a production system. DeepSeek-class frontier work proves the gap can be closed.
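The ratios and the cost-per-point idea reduce to a few divisions, sketched below. The emissions constants are the third-party estimates quoted in this section; the `Candidate` class, the model names, and the dollar figures are hypothetical and exist only to mirror the "2 points stronger, 50× more expensive" case.

```python
from dataclasses import dataclass

# Estimates quoted in the text above, in tonnes of CO2eq.
GROK_4_T = 150_000
DEEPSEEK_V3_T = 597
CAR_LIFETIME_T = 63


@dataclass
class Candidate:
    name: str
    score: float       # points on your own eval
    usd_per_1k: float  # hypothetical serving cost per 1,000 requests


def usd_per_point(c: Candidate) -> float:
    """Average dollars per benchmark point."""
    return c.usd_per_1k / c.score


def marginal_usd_per_point(stronger: Candidate, cheaper: Candidate) -> float:
    """Extra dollars paid for each extra benchmark point."""
    gain = stronger.score - cheaper.score
    if gain <= 0:
        raise ValueError("the stronger candidate must score higher")
    return (stronger.usd_per_1k - cheaper.usd_per_1k) / gain


if __name__ == "__main__":
    print(round(GROK_4_T / DEEPSEEK_V3_T))   # ratio behind the "~250x" figure
    print(round(GROK_4_T / CAR_LIFETIME_T))  # ratio behind the "~2,400x" figure
    small = Candidate("model-a", score=80.0, usd_per_1k=1.0)
    large = Candidate("model-b", score=82.0, usd_per_1k=50.0)
    print(marginal_usd_per_point(large, small))
```

The two emission ratios come out to about 251 and 2,381, matching the rounded figures in the text. Swapping dollars for kilograms of CO2eq in `Candidate` gives the CO2-per-point dimension with the same arithmetic.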

12 / takeaways

What changes about your job.

Eleven things to do differently in 2026 that you wouldn't have done in 2023.

  1. model fit: Pick by data character, not just benchmark. Mistral Large 3 → European/multilingual. Gemini 3.1 → cross-modal reasoning. DeepSeek V4 → cost-efficient code/reasoning. Grok 5 → real-time. Claude Opus 4.7 → safety-critical reasoning. GPT-5.4 → general routing.
  2. context: RAG beats long context, often. Supporting 1M tokens isn't using 1M tokens well; quality degrades over the horizon. Chunk and retrieve unless you have a reason not to.
  3. freshness: Verify post-cutoff facts. Build a small "known recent fact" probe. Anything time-sensitive (prices, library versions, news, papers) needs grounding outside the weights.
  4. openness: Don't treat "open weights" as "transparent." The data mixture, licensing partners, and synthetic ratios are the moat. They're also where bias hides.
  5. synthetic: Audit synthetic-data pipelines. Bias compounds when you train on model outputs. If you SFT on a frontier API's responses, you inherit its priors, especially the ones it was trained to deny it has.
  6. eval: Eval ≠ benchmark. Frontier models can identify test environments (METR confirmed this in GPT-5). Build YOUR domain's eval set. Sweep prompt domains for MoE: different experts route, different quality.
  7. moe: Parameter count isn't the comparable metric. Two MoEs of identical total size can route differently and behave differently. Active params matter for cost; expert specialization matters for quality.
  8. multimodal: Test cross-modal grounding explicitly. Early-fusion models reason about chart and caption together; late-fusion models paste them together. Build probes that require the relationship to be the answer.
  9. memory: Decouple knowing from reasoning in your stack. Knowledge → retrieval. Reasoning → the model. The architectures (Engram, real-time grounding) are catching up to this pattern; your system design can lead it.
  10. cost: $/benchmark-point and CO2/benchmark-point are eval dimensions. A ~250× emissions difference for the same generation tells you efficiency is a strategic axis, not a constraint.
  11. audit: Five-door bias audit, in CI. Crawl · license · annotator · synthetic-loop · cutoff. Each gets a different probe. Each fires different alarms. Don't run them once and call it done.
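The "chunk and retrieve" advice in the context takeaway can be sketched with nothing more than overlapping word windows and a word-overlap score. This is a toy under stated assumptions: the function names, window sizes, and scoring rule are illustrative, and a production system would typically score with embeddings rather than shared words.

```python
def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def overlap_score(query: str, passage: str) -> float:
    """Fraction of distinct query words that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks by overlap score, best first."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]
```

Only the retrieved chunks go into the prompt, so the model reads a few relevant passages instead of the full long context.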