may 2024 — apr 2026 · field guide for data scientists

The architecture
of intelligence.

What's actually in frontier model training data — and what changes about your job because of it.

00 / primer

Two stages. Different objectives. Different data.

Pre-training builds the substrate. Post-training shapes the personality. The rest of this guide flips between them — labels mark which.

Pre-training is the slow, expensive baseline — tens of trillions of tokens, weeks of gigawatt-scale compute. Post-training is comparatively cheap and iterates often. It's where the model learns to refuse, reason, format, and follow instructions. Same base + different post-training = different model "personalities."

why this matters When two models behave differently on the same prompt, the cause is usually post-training, not pre-training. The base substrate is similar across the frontier; the alignment is each lab's signature.

01 / paradigm

From scale to curation.

Scraping the open web wholesale stopped working. The new race is what to throw away.

The numbers behind the funnel. Llama 3: ~15T tokens. Llama 4 Behemoth: 30T+ tokens (Maverick/Scout per-model token counts not separately disclosed at that scale). Meta has reported aggressive SFT pruning across the herd, with the largest cuts on Behemoth, but treat that as directional: the exact percentages haven't been published in its main blog post. DeepSeek V3 trained on 14.8T heavily filtered tokens; V4 more than doubled that to 32T, with even harsher dedupe.

first principle More data isn't the goal — more signal density per token is. A token wasted on internet noise is a token a sparse expert can't learn rare reasoning from.
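
Deciding "what to throw away" usually starts with near-duplicate removal. The sketch below is a minimal MinHash filter in pure Python, not any lab's actual pipeline. The shingle size, signature length, and 0.7 threshold are illustrative assumptions, and the pairwise comparison is quadratic; production systems would typically add LSH banding.

```python
import hashlib
import re

NUM_HASHES = 64     # signature length (assumed; real pipelines tune this)
SHINGLE_SIZE = 3    # words per shingle (assumed)
THRESHOLD = 0.7     # estimated-Jaccard cutoff for "near duplicate" (assumed)


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Lowercased word k-grams; texts shorter than k words yield one shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family, built from salted BLAKE2b."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """Signature = minimum hash of the shingle set under each seeded hash."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedupe(docs: list[str], threshold: float = THRESHOLD) -> list[str]:
    """Keep a document only if it is not too similar to any document kept so far."""
    kept: list[str] = []
    signatures: list[tuple[int, ...]] = []
    for doc in docs:
        sig = minhash(doc)
        if all(similarity(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

Run it over a sample of your own fine-tuning corpus before training: the share of documents it drops is a rough first read on how much of the corpus is repeated signal.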

02 / data streams

Four streams flow into every modern model.

The mix is the moat.

The synthetic stream is now the fastest-growing. Anthropic's RLAIF: a Claude model (with the constitution as its system prompt) ranks Claude outputs against the written principles. Same model generation, different role. OpenAI hires expert contractors specifically to demonstrate advanced math and code, then expands their traces synthetically. Meta's Behemoth (~2T params) generates distillation targets for Scout and Maverick.

for your work If you fine-tune on outputs from a frontier API, you're not training on "ground truth" — you're training on a particular model's opinions, including its blind spots. Treat synthetic data as a calibration step, not a source of truth.

03 / peak data

Human text ran out. Models started feeding themselves.

The synthetic flywheel: faster than human writing, with one structural risk.

This loop is now the dominant alignment workhorse. Anthropic uses it via the Constitution. OpenAI uses it for "safe-completions" to avoid over-refusing. Meta's Behemoth distills into Scout/Maverick. The 2026 Stanford AI Index calls this the response to "peak data" — the asymptote of high-quality human text.

interpretation A model trained on its own generation isn't getting smarter about the world — it's getting smarter about what models like itself produce. Distinguish those two things when evaluating outputs.
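
The judge-ranks-outputs loop can be sketched as code that turns judge scores into preference pairs. This is a generic illustration under stated assumptions, not any lab's implementation: `judge` is a hypothetical callable standing in for a model that scores a response against written principles, and `toy_judge` and `min_margin` are invented for the example.

```python
from itertools import combinations
from typing import Callable

# Hypothetical judge signature: (prompt, response) -> score, higher is better.
Judge = Callable[[str, str], float]


def build_preference_pairs(
    prompt: str,
    candidates: list[str],
    judge: Judge,
    min_margin: float = 0.1,
) -> list[dict[str, str]]:
    """Score every candidate, then emit (chosen, rejected) pairs.

    Pairs the judge cannot separate by at least `min_margin` are skipped,
    since they would add noise rather than preference signal.
    """
    scored = [(judge(prompt, c), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if abs(score_a - score_b) < min_margin:
            continue
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge (assumption): rewards responses that cite a source."""
    return 1.0 if "source:" in response.lower() else 0.0
```

Every pair records what the judge prefers, not what is true. Whatever the judge systematically misses is absent from the training signal as well, which is the structural risk named above.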

04 / labs · pipelines

Each lab's pre vs post recipe.

What we know — and don't know — about how the four major frontier labs structure each stage.

Meta · Llama 4

pre-training 22–40T tokens · 200 languages, more than 100 of them with over 1B tokens each · native early-fusion multimodal · trained on Meta's CDN-scale crawl + licensed data.

post-training Lightweight SFT + online RL + DPO. Meta has reported aggressive SFT pruning across the herd (largest cuts on Behemoth); the exact percentages haven't appeared in the main public blog. Behemoth co-distilled into Scout and Maverick.

DeepSeek · V4

pre-training 32T tokens (up from V3's 14.8T) · aggressive MinHash dedupe · domain upsampling for code, math, reasoning · for scale, V3's run took 2.788M H800 GPU-hours.

post-training Two-stage: independent domain-expert SFT + RL, then On-Policy Distillation (OPD) merges experts into a unified model. Pairs with the new Engram conditional-memory architecture.

Anthropic · Claude 4

pre-training Proprietary mix of filtered internet + licensed corpora. Cutoffs Jul 2025 (Haiku 4.5) and Jan 2026 (Opus 4.7). New tokenizer in 4.7.

post-training Heavy synthetic generation + RLAIF — a Claude model (with the Constitution in its system prompt) ranks Claude outputs against the written principles. Enables ASL-3 deployment for Opus, ASL-2 for Sonnet.

OpenAI · GPT-5

pre-training Token count not disclosed by OpenAI; analyst estimates put it in the ~100T range. Filtered web + heavily licensed publishers/academic + expert-contractor demonstrations. Native multimodal (text, image, audio) ingestion.

post-training RL trains an internal router (fast vs reasoning model). Safe-completions data avoids over-refusal. Synthetic reasoning paths teach backtracking and strategy switching. METR-verified no incentive to sandbag.

the pattern All four labs converged on the same structure: massive pre-training corpus + post-training built on synthetic reasoning + RL grading. The differences are in the grading function (Anthropic's Constitution vs OpenAI's safe-completions) and the merge strategy (Meta's distillation vs DeepSeek's OPD).

05 / freshness

Most of these models stopped learning a year ago.

Knowledge cutoffs, plotted against today.

Llama 4 is closer to two years stale than one. Most flagship models froze their worldview between mid-2024 and early 2026. Grok is the outlier — xAI's bet on the X firehose pays off as a freshness hedge but introduces its own (severe) demographic priors.

for your work Treat any time-sensitive query as suspect. Build a small eval that asks for known post-cutoff facts — count fabrications. Use that as a freshness audit before trusting a model on news, prices, library versions, or recent research.
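
A freshness audit of this kind can be quite small. The sketch below assumes a hypothetical `model` callable (question in, answer text out) and a probe list you write yourself from facts you have verified; the abstention phrases and substring matching are crude illustrative choices, not a validated grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Probe:
    question: str
    accepted: tuple[str, ...]  # substrings that count as a correct answer


# Assumed phrases that signal the model declined to guess.
ABSTAIN_MARKERS = (
    "i don't know",
    "not sure",
    "knowledge cutoff",
    "cannot verify",
)


def classify(answer: str, probe: Probe) -> str:
    """Label one answer as correct, abstained, or fabricated."""
    text = answer.lower()
    if any(a.lower() in text for a in probe.accepted):
        return "correct"
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "fabricated"  # confident answer that matches nothing we accept


def freshness_audit(model: Callable[[str], str], probes: list[Probe]) -> dict[str, int]:
    """Run every probe through the model and count outcomes."""
    counts = {"correct": 0, "abstained": 0, "fabricated": 0}
    for probe in probes:
        counts[classify(model(probe.question), probe)] += 1
    return counts


def fabrication_rate(counts: dict[str, int]) -> float:
    """Share of probes answered confidently and wrongly."""
    total = sum(counts.values())
    return counts["fabricated"] / total if total else 0.0
```

Abstentions are counted separately on purpose: a model that says it cannot know is behaving well on a post-cutoff question, while a fabrication is the failure you are auditing for.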

06 / architecture

Trillions of params, but only a slice fires per token.

Sparse Mixture-of-Experts is the dominant frontier architecture. It changes how you reason about evaluation.

Llama 4 Maverick: 17B active of 400B total. DeepSeek V4-Pro: 49B of 1.6T. Behemoth: 288B of nearly 2T. Different prompts route to different experts — meaning latency, quality, and even hallucination patterns can shift across topics within the same model.
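
To make "only a slice fires per token" concrete, here is a minimal sketch of top-k gating plus the active-parameter fractions implied by the figures above. The router is the generic textbook form (softmax over the k largest gate logits), an assumption for illustration rather than the routing any of these models uses, and Behemoth's "nearly 2T" is rounded to 2,000B.

```python
import math


def top_k_route(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k experts with the highest gate logits.

    Returns (expert_index, weight) pairs; weights are a softmax over the
    selected logits only, so they sum to 1. Ties keep the lower index first.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# (active, total) parameters in billions, as quoted in the text above.
MODELS = {
    "Llama 4 Maverick": (17, 400),
    "DeepSeek V4-Pro": (49, 1600),
    "Behemoth": (288, 2000),  # "nearly 2T", rounded for illustration
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters that participate in a single token's forward pass."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (active, total) in MODELS.items():
        print(f"{name}: {active_fraction(active, total):.1%} of parameters active")
```

On these figures the active share works out to roughly 4%, 3%, and 14% respectively, so per-token compute tracks the active count far more closely than the headline total.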

interpretation "Parameter count" is not the comparable metric anymore. Two MoEs of identical total size can behave very differently if their experts specialize differently. For your evals, sweep prompt domains — a model that's brilliant on code can route into a weak expert on legal text.
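
A domain sweep can be as simple as bucketing scores by topic and flagging the stragglers. In this sketch the `(domain, score)` input format and the 0.15 tolerance are assumptions; plug in whatever grader and threshold your eval already uses.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable


def sweep_by_domain(results: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Average score per domain from (domain, score) records, score in [0, 1]."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for domain, score in results:
        buckets[domain].append(score)
    return {domain: mean(scores) for domain, scores in buckets.items()}


def weak_domains(per_domain: dict[str, float], tolerance: float = 0.15) -> list[str]:
    """Domains whose mean falls more than `tolerance` below the best domain."""
    best = max(per_domain.values())
    return sorted(d for d, s in per_domain.items() if best - s > tolerance)
```

A single aggregate score hides exactly the gap this surfaces: a model can average well while one domain lags far behind its best.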

07 / multimodal

Vision and text now share the same brain from birth.

Early fusion vs late fusion is the difference between a polyglot and a phrasebook.

All three frontier families — Llama 4, GPT-5, and Gemini 3.1 — are early fusion. Llama 4 trained on up to 48 images per context window, blended with text from token zero. Gemini 3.1 Pro can ingest 3,000 images, 8.4 hours of audio, or one hour of video per prompt — a scale of cross-modal ingestion only possible with a unified backbone. Late fusion (the older "stitch a vision encoder onto a text model" pattern) is no longer competitive at the frontier.

for your work If your task is genuinely cross-modal — annotate-this-chart, transcribe-and-summarize, video-Q&A — early-fusion models will reason about the relationship natively. Late-fusion will paper over gaps with templated phrasing. Test cross-modal grounding explicitly.

08 / disclosure

The industry knows less about itself than it did last year.

Stanford's Foundation Model Transparency Index, 2024 → 2025.

Industry average dropped from 58 to 40. IBM is the lone outlier going up — they publish enough detail for a third party to replicate their training. Meta dropped 29 points, mostly because they didn't release a Llama 4 data report. Mistral dropped 37, abandoning the disclosures that defined Mixtral. Ten of thirteen tracked labs (including OpenAI, Google, Anthropic, Amazon, xAI) disclosed zero environmental impact data.

interpretation "Open weights" ≠ open data. Llama 4 weights are downloadable under Meta's community license. The ratios of code vs. text vs. video, the licensing partners, the synthetic mix: those are corporate secrets. Do not assume open weights means you can audit what's in the model.

09 / bias surfaces

Five doors. Watch all five.

Where bias enters the model — and which doors are addressable, which are structural.

Llama 4's 200-language coverage tackled crawl bias. Grok 5 introduced an extreme bias of its own: its X firehose carries political, demographic, and heavily Western skews. Every door has a different mitigation: counterfactual eval sets for crawl/license, pipeline audit for annotator/synthetic, RAG / tool-use for cutoff.

for your work Build a bias audit set for YOUR domain. Five families of probes, one per door. If your application sits in healthcare, finance, or any high-stakes domain, demographic and temporal probes belong in CI, not just a one-time check.
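
One family of probes, the counterfactual kind, fits naturally into CI. The sketch below assumes a hypothetical `model_score` callable that maps a prompt to a number (an approval probability, say); the template, slot values, and 0.05 gap are placeholders to replace with your own.

```python
from typing import Callable


def counterfactual_prompts(template: str, slot: str, values: list[str]) -> dict[str, str]:
    """Fill one slot in the template with each value, keeping all else fixed."""
    return {v: template.replace("{" + slot + "}", v) for v in values}


def disparity(
    model_score: Callable[[str], float],
    template: str,
    slot: str,
    values: list[str],
) -> tuple[float, dict[str, float]]:
    """Largest score gap across variants, plus the per-variant scores."""
    prompts = counterfactual_prompts(template, slot, values)
    scores = {v: model_score(p) for v, p in prompts.items()}
    return max(scores.values()) - min(scores.values()), scores


def passes(
    model_score: Callable[[str], float],
    template: str,
    slot: str,
    values: list[str],
    max_gap: float = 0.05,
) -> bool:
    """CI-style check: fail when swapping the attribute moves the score too much."""
    gap, _ = disparity(model_score, template, slot, values)
    return gap <= max_gap


# Hypothetical probe: only the age changes between variants.
TEMPLATE = "Should a loan be approved for a {age}-year-old applicant with a 700 credit score?"
AGES = ["25", "45", "65"]
```

In a test suite this becomes one assertion per probe, for example `assert passes(my_model_score, TEMPLATE, "age", AGES)`, so a regression shows up as a failed build rather than an incident report.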

10 / memory

The weights are a snapshot. Updating them costs a power plant.

Why continuous learning is structurally hard — and what an early fix looks like.

# why it's hard
— loss spikes during training (DeepSeek used SwiGLU clamping to recover from these)
— catastrophic forgetting on naive fine-tune
— petabyte I/O at sub-ms latency to feed GPU clusters
— Grok 4 training: ~150,000 tons of CO2eq for one snapshot (Epoch AI est.)
# what's emerging
— DeepSeek V4 Engram: facts in O(1) hash, reasoning in MoE
— Grok 5 reality engine: cross-references X firehose at query time
— RAG & tool-use as the practical decoupling pattern
— mid-training: extending context via specialized data, not retraining

first principle Treat a model as a frozen reasoner plus a (mostly external) knowledge layer. Updates to the world go through retrieval and tools, not weight changes. The architectures that bake this split in (Engram, real-time grounding) are the early signal of where the field is going.
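
The split between a frozen reasoner and an external knowledge layer can be sketched in a few lines. This is a toy illustration of the pattern, not a description of Engram or any real-time grounding system: the dictionary lookup, keyword-overlap retrieval, and the `reasoner` callable are all assumptions made for the example.

```python
from typing import Callable


class KnowledgeLayer:
    """External, updatable store: exact-key facts first, keyword retrieval second."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}   # exact-match lookup, O(1) on average
        self.documents: list[str] = []    # free text for fallback retrieval

    def add_fact(self, key: str, value: str) -> None:
        self.facts[key.lower().strip()] = value

    def add_document(self, text: str) -> None:
        self.documents.append(text)

    def retrieve(self, query: str, top_n: int = 2) -> list[str]:
        key = query.lower().strip()
        if key in self.facts:
            return [self.facts[key]]
        query_words = set(key.split())
        scored = [(len(query_words & set(d.lower().split())), d) for d in self.documents]
        scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_n]]


def answer(query: str, knowledge: KnowledgeLayer, reasoner: Callable[[str], str]) -> str:
    """Fetch context from the knowledge layer, then hand it to the frozen model."""
    context = knowledge.retrieve(query)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return reasoner(prompt)
```

Updating the world here means calling `add_fact` or `add_document`; the reasoner never changes. That is the property worth preserving when you swap in a real vector store and a real model.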

11 / cost

Two orders of magnitude separate the most efficient training run from the least.

Same generation. Same benchmark class. Wildly different footprints.

Algorithmic efficiency is now an environmental vector. DeepSeek V3 trained on 14.8T tokens for 2.788M H800 GPU-hours; the 597 t CO2eq number is a third-party estimate from those GPU-hours, not from the V3 paper. Grok 4's ~150,000 t (Epoch AI estimate) came from a run on Colossus's 100k–200k H100 cluster: financial cost in the hundreds of millions, carbon cost ~2,400× a single car's lifetime emissions.

for your work Add cost-per-benchmark-point and CO2-per-benchmark-point to your eval rubric. A model that's 2 points stronger but 50× more expensive is rarely the right pick for a production system. DeepSeek-class frontier work proves the gap can be closed.
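
The rubric is simple arithmetic, sketched below. The two emissions figures are the ones quoted above; the `Candidate` fields and the example scores and costs used with it are hypothetical placeholders for your own measurements.

```python
from dataclasses import dataclass

# Estimates quoted above, in tonnes CO2eq.
DEEPSEEK_V3_T = 597
GROK_4_T = 150_000
EMISSIONS_RATIO = GROK_4_T / DEEPSEEK_V3_T  # about 251, the "~250x" gap


@dataclass
class Candidate:
    name: str
    score: float   # benchmark points on YOUR eval
    usd: float     # cost for your workload, any consistent unit
    co2_t: float   # tonnes CO2eq attributed to that workload


def rubric(c: Candidate) -> dict[str, float]:
    """Cost and emissions normalized by benchmark points."""
    return {"usd_per_point": c.usd / c.score, "co2_per_point": c.co2_t / c.score}


def marginal_usd_per_point(baseline: Candidate, challenger: Candidate) -> float:
    """What each extra point costs when moving from baseline to challenger."""
    gain = challenger.score - baseline.score
    if gain <= 0:
        return float("inf")  # paying more for no improvement
    return (challenger.usd - baseline.usd) / gain
```

For a challenger that is 2 points stronger at 50× the cost (say 82 points at 50 units against 80 points at 1 unit), each extra point costs 24.5 units, while the baseline delivers its points at 0.0125 units each. The marginal figure is usually the one to show a stakeholder.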

12 / takeaways

What changes about your job.

Eleven things to do differently in 2026 that you wouldn't have done in 2023.

model fit Pick by data character, not just benchmark. Mistral Large 3 → European/multilingual. Gemini 3.1 → cross-modal reasoning. DeepSeek V4 → cost-efficient code/reasoning. Grok 5 → real-time. Claude Opus 4.7 → safety-critical reasoning. GPT-5.4 → general routing.
context RAG beats long context, often. Supporting 1M tokens isn't using 1M tokens well — quality degrades over the horizon. Chunk and retrieve unless you have a reason not to.
freshness Verify post-cutoff facts. Build a small "known recent fact" probe. Anything time-sensitive — prices, library versions, news, papers — needs grounding outside the weights.
openness Don't treat "open weights" as "transparent." The data mixture, licensing partners, and synthetic ratios are the moat. They're also where bias hides.
synthetic Audit synthetic-data pipelines. Bias compounds when you train on model outputs. If you SFT on a frontier API's responses, you inherit its priors — especially the ones it was trained to deny it has.
eval Eval ≠ benchmark. Frontier models can identify test environments (METR confirmed this in GPT-5). Build YOUR domain's eval set. Sweep prompt domains for MoE models: different experts fire, so quality varies by topic.
moe Parameter count isn't the comparable metric. Two MoEs of identical total size can route differently and behave differently. Active params matter for cost; expert specialization matters for quality.
multimodal Test cross-modal grounding explicitly. Early-fusion models reason about chart-and-caption together; late-fusion models paste them together. Build probes that require the relationship to be the answer.
memory Decouple knowing from reasoning in your stack. Knowledge → retrieval. Reasoning → the model. The architectures (Engram, real-time grounding) are catching up to this pattern; your system design can lead it.
cost $/benchmark-point and CO2/benchmark-point are eval dimensions. A ~250× emissions difference for the same generation tells you efficiency is a strategic axis, not a constraint.
audit Five-door bias audit, in CI. Crawl · license · annotator · synthetic-loop · cutoff. Each gets a different probe. Each fires different alarms. Don't run them once and call it done.