Inference from Scratch

Why another "from scratch" series

Most "build X from scratch" content stops at one model. You watch someone reproduce DeepSeek, or GPT-2, or Llama — and you walk away knowing how that one model works. The next paper drops a new attention variant, a new routing scheme, a new decoding trick, and you're back at square one.

This series takes the opposite approach. Instead of one model, we cover the whole design space of modern LLM inference — every meaningful variant of attention, positional encoding, routing, sampling, quantization, and serving — and we build each one from scratch so you understand the trade-off it was invented to solve, not just the implementation.

By the end, when the next frontier model drops, you should be able to read the architecture section in the morning and have an informed opinion by lunch.

The teaching philosophy

Three rules guide every tutorial in this series.

1. Plain language first, math second

You can't intuit something you can only manipulate. Every concept starts with a sentence you could say to a colleague at a coffee machine — "GQA is MHA with fewer K/V heads, traded for a smaller KV cache" — before any equation shows up.

When the math arrives, we walk through it line by line, no skipped steps. No "it can be shown that." If a tensor changes shape, we say what shape, why, and what each axis means.
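
To make the coffee-machine sentence about GQA concrete, here is a minimal back-of-the-envelope sketch of KV cache size. The function name kv_cache_bytes and the configuration (32 layers, head dimension 128, 8,192 tokens, 2-byte elements) are illustrative assumptions, not the shape of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model: 32 layers, head_dim 128, 8192 cached tokens, fp16/bf16 elements.
cfg = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

Under these assumed numbers, one sequence costs 4 GiB of cache with MHA, 1 GiB with GQA, and 0.125 GiB with MQA. Only the number of K/V heads changed.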

2. Build it on one GPU before you scale it

A single-GPU implementation is the unit test for your mental model. If you can't write FlashAttention for one GPU on one node, you have no business reasoning about Ring Attention across a pod.

Every topic that can be implemented on a single device is implemented there first, usually in plain PyTorch and sometimes as a Triton kernel where it matters. You'll see the naive version, the bug, the fix, and the optimized version. In that order.

3. Then take it distributed

After the single-GPU version works, we rewrite it for the realistic case — tensor-parallel, sequence-parallel, expert-parallel, or all-to-all'd across the cluster. This is where most "from scratch" content drops off and where production actually lives.

You'll see the collectives (all_reduce, all_gather, reduce_scatter, all_to_all), where they sit in the forward pass, what they cost, and how to overlap them with compute. The goal is for the multi-GPU code to feel like a natural extension of the single-GPU code, not a different language.
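
As a rough mental model of the four collectives named above, the sketch below simulates them in a single process, with each rank represented by a plain Python list. The function names mirror the collectives, but this is not the torch.distributed API and nothing is actually communicated. It only shows which rank ends up holding which data, and it assumes every buffer length is divisible by the number of ranks.

```python
def all_reduce(shards):
    """Every rank ends up with the elementwise sum over all ranks."""
    total = [sum(vals) for vals in zip(*shards)]
    return [list(total) for _ in shards]


def reduce_scatter(shards):
    """Rank r ends up with only the r-th chunk of the elementwise sum."""
    world = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    chunk = len(total) // world  # assumes len(total) is divisible by world
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' buffers."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_to_all(shards):
    """Rank src sends its dst-th chunk to rank dst; chunks arrive in rank order."""
    world = len(shards)
    chunk = len(shards[0]) // world
    return [
        [x for src in range(world) for x in shards[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(world)
    ]


# Two simulated ranks, four elements each.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]

print(all_reduce(ranks))      # both ranks hold the full sum
print(reduce_scatter(ranks))  # each rank holds half of the sum
print(all_to_all(ranks))      # chunks are transposed across ranks
print(all_gather(reduce_scatter(ranks)) == all_reduce(ranks))
```

The last line checks that, in this toy model, all_reduce gives the same result as reduce_scatter followed by all_gather.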

The roadmap

Nine parts, ~50 tutorials, sequenced so each one earns the next. Topics get linked here as they ship.

Part 1 — Attention, the full family

The popular content covers GQA → MLA in the context of one model. We cover the whole genealogy and the trade-off each variant was invented to solve.

Self-attention from first principles — queries, keys, values, and the dot product
Multi-Head Attention (MHA) — why split, how to split, what each head learns
Multi-Query Attention (MQA) — one shared K/V, the memory-bandwidth motivation
Grouped-Query Attention (GQA) — the MHA↔MQA spectrum, Llama 2/3's choice
Multi-Head Latent Attention (MLA) — DeepSeek's low-rank KV compression
Sliding Window Attention — Mistral, Longformer, attention sinks
Cross-Attention — encoder-decoder, multimodal fusion
FlashAttention v1 → v2 → v3 — IO-aware attention, why softmax tiling matters
Linear & sub-quadratic attention — Performer, Linformer, what they trade away
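
As a taste of where Part 1 starts, here is a minimal sketch of single-head scaled dot-product attention in dependency-free Python. It assumes q, k, and v are lists of T vectors with no batch or head axis, and the function name attention is illustrative. It is meant to show the arithmetic, not to be fast.

```python
import math


def attention(q, k, v, causal=True):
    """q, k: lists of T vectors of size d. v: list of T vectors. Returns T outputs."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # With a causal mask, position i may only look at positions 0..i.
        last = i + 1 if causal else len(k)
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(last)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(
            [sum(w[j] / z * v[j][c] for j in range(last)) for c in range(len(v[0]))]
        )
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v, causal=True))
```

With the causal mask, the first output row is exactly v[0], because the first position can only attend to itself.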

Part 2 — Positional encodings

Why transformers don't know order, and the surprisingly long arc of how we taught them.

Why positions matter — the permutation-invariance problem
Absolute encodings — sinusoidal, learned, integer/binary
Relative position — Shaw et al., T5 bias
ALiBi — linear bias, length extrapolation for free
RoPE — rotary embeddings derived from scratch
Long-context RoPE scaling — Linear, NTK-aware, YaRN, LongRoPE
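
The property that makes RoPE attractive can be checked in a few lines: after rotation, the query-key dot product depends only on the distance between the two positions. The sketch below assumes the interleaved pairing (x0, x1), (x2, x3), ... and a base of 10000. Some implementations pair the two halves of the vector instead, and the function names here are illustrative.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of vec by the angle pos * theta_i."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # frequency of pair i/2, i.e. base^(-2*(i/2)/d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3, distance 2
far = dot(rope(q, 102), rope(k, 100))   # positions 102 and 100, distance 2
print(abs(near - far) < 1e-9)
```

Both pairs of positions are two tokens apart, so the two scores agree up to floating-point error even though the absolute positions differ.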

Part 3 — KV cache & memory

The KV cache is the central data structure of inference. Most production complexity exists to manage it.

Why the KV cache exists — the prefill vs decode asymmetry
KV cache memory layout and arithmetic intensity
PagedAttention — vLLM's virtual-memory analog for KV blocks
Prefix caching & cross-request KV reuse
KV cache quantization — FP8, INT8, INT4
StreamingLLM & attention sinks for effectively-infinite context

Part 4 — Sampling & decoding

The output side of the model. Where latency is won or lost.

Greedy, temperature, top-k, top-p, min-p — what each one actually does
Beam search and why LLMs largely abandoned it
Constrained decoding — grammars, JSON schema, regex-guided
Speculative decoding — draft + verify
Medusa — multi-head speculative
EAGLE & lookahead decoding
Multi-Token Prediction (MTP) — DeepSeek's training-time MTP, inference-time uses
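
For the first item in this part, here is a minimal sketch of what temperature, top-k, top-p, and min-p do to a next-token distribution. It works on plain Python lists and leaves out the random draw itself. The function names are illustrative, and real implementations differ in details such as tie handling and the order in which filters are applied.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution, above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def renormalize(probs, keep):
    """Zero out every token not in keep, then rescale the rest to sum to 1."""
    keep = set(keep)
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Keep the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return renormalize(probs, keep)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    keep = [i for i, pr in enumerate(probs) if pr >= threshold]
    return renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
print(min_p_filter(probs, 0.5))
```

On this toy distribution, top-k with k=2 and min-p with 0.5 both keep the first two tokens, while top-p with 0.9 keeps three.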

Part 5 — Mixture of Experts

The architectural shift behind the largest open models. Everything you need to build, balance, and serve MoE.

MoE from scratch — gating, top-k routing
Load balancing — auxiliary loss, expert utilization
Switch Transformer (top-1) vs GShard (top-2)
DeepSeek MoE — fine-grained experts + shared experts
Aux-loss-free balancing (DeepSeek-V3)
Expert parallelism for inference — routing, all-to-all, EP overlap
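
The core of top-k routing fits in a few lines. This is a minimal single-token sketch in which the experts are toy scalar functions, and it assumes the softmax is taken over the selected logits only; some models normalize over all experts before selecting. The names top_k_route and moe_forward are illustrative.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    order = sorted(
        range(len(router_logits)), key=lambda e: router_logits[e], reverse=True
    )
    top = order[:k]
    m = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - m) for e in top]
    z = sum(exps)
    return top, [x / z for x in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    ids, weights = top_k_route(router_logits, k)
    return sum(w * experts[e](x) for e, w in zip(ids, weights))


# Four toy experts that just scale their input.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 4.0 * x]
logits = [0.1, 2.0, -1.0, 2.0]

print(top_k_route(logits, k=2))
print(moe_forward(1.0, logits, experts, k=2))
```

Experts 1 and 3 tie for the highest logit here, so each gets weight 0.5 and the output is the average of their two results.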

Part 6 — Quantization for inference

Squeezing the model into the memory you actually have, without breaking it.

INT8 / INT4 basics — symmetric vs asymmetric, per-tensor vs per-channel
GPTQ — second-order quantization
AWQ — activation-aware quantization
SmoothQuant — migrating outliers from activations to weights
FP8 inference on Hopper/Blackwell
1-bit territory — BitNet b1.58 and the extreme low end
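
To preview the symmetric vs asymmetric distinction, here is a minimal per-tensor INT8 sketch in plain Python. It assumes round-to-nearest, a signed range of -127..127 for the symmetric case, an unsigned range of 0..255 for the asymmetric case, and an input that is not constant. The function names are illustrative.

```python
def quantize_symmetric(xs, bits=8):
    """Zero maps to zero; the scale is set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = max(abs(x) for x in xs) / qmax
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale


def dequantize_symmetric(q, scale):
    return [v * scale for v in q]


def quantize_asymmetric(xs, bits=8):
    """The scale covers [min, max]; a zero point shifts the integer grid."""
    qmin, qmax = 0, 2 ** bits - 1  # 0..255 for 8 bits
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


# A skewed, all-positive tensor: half of the symmetric range goes unused.
xs = [0.1, 0.5, 1.0, 2.0]

q_sym, s_sym = quantize_symmetric(xs)
q_asym, s_asym, zp = quantize_asymmetric(xs)

err_sym = max(abs(a - b) for a, b in zip(xs, dequantize_symmetric(q_sym, s_sym)))
err_asym = max(abs(a - b) for a, b in zip(xs, dequantize_asymmetric(q_asym, s_asym, zp)))
print(f"symmetric:  scale={s_sym:.5f}  max error={err_sym:.5f}")
print(f"asymmetric: scale={s_asym:.5f}  max error={err_asym:.5f}")
```

Because this input never goes negative, the asymmetric scheme spends all 256 levels on the range that is actually used and ends up with a finer step size.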

Part 7 — Serving systems

Where the engineering you've learned meets the requests you're actually getting.

Continuous batching — Orca, why static batching is dead
Chunked prefill — bounding TTFT under load
Prefill/decode disaggregation — different hardware for different bottlenecks
RadixAttention — SGLang's prefix tree
CUDA Graphs for decode — eliminating launch overhead
Tensor parallelism for inference (vs training) — why TP-for-inference is its own problem

Part 8 — Long context & test-time compute

The two frontiers pulling inference systems in opposite directions.

Ring Attention — sharding the sequence dimension across GPUs
Tree Attention for branched generation
Test-time compute scaling — best-of-N, majority voting, the cost curve
Reasoning models (o1, R1) — KV pressure when chain-of-thought runs 10k+ tokens

Part 9 — Block-level architecture choices

The small decisions everyone copies without explaining why.

RMSNorm vs LayerNorm — and why everyone moved
SwiGLU vs GELU — the FFN that ate the transformer
Pre-norm vs post-norm — training stability vs representation quality
Tokenization for inference — BPE, SentencePiece, tiktoken, byte-level
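
The difference between the two norms in the first item is small enough to show directly. This minimal sketch omits the learned gain and bias that real layers carry, and the eps value is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))  # centered: values on both sides of zero
print(rms_norm(x))    # rescaled only: all values stay positive
```

RMSNorm skips the mean computation, so on an input that is already zero-mean the two functions return the same values.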

How to follow along

You can read these in order, which is how they're sequenced, or jump to the part most relevant to what you're building. Each tutorial lists its prerequisites up front, so if you land on Part 5 without reading Part 1, you'll know what you're missing.

Code lives alongside the prose. Single-GPU versions run on any GPU with at least 16 GB of VRAM. Multi-GPU sections assume access to at least two consumer GPUs (for tensor parallelism) or a multi-node setup (for expert parallelism and Ring Attention). Instructions for renting time on either kind of setup are included where they apply.

If you find an error, an explanation that didn't land, or a topic that's missing, open an issue or reach out. This series is meant to be lived in, not just published.