Agents from Scratch

Why another agent series

The agent tutorial landscape splits into two unhelpful piles. One pile is prompt cookbooks — "here's how to phrase your prompt so the agent does the right thing." The other is framework demos — "here's how to call LangGraph / CrewAI / Swarm to make a thing." Neither teaches you how agents actually work.

This series takes a third path: build every meaningful agent primitive from scratch, in the smallest amount of code that captures it, then scale it across machines.

By the end you should be able to read a multi-agent system's source on a Friday and have an opinion on its failure modes by Monday — because you'll have built each piece yourself.

The teaching philosophy

The same three rules from the Inference from Scratch series apply here, adapted for agents.

1. Plain language first, formalism second

You can't reason about a planner you can only call. Every concept starts with a sentence you could say to a colleague at a coffee machine — "ReAct is a while loop that lets the model decide between thinking and acting at each step" — before any formalism shows up.

When the formalism arrives (state machines, protocols, schemas), we walk through it piece by piece, no skipped steps. If a message gets transformed before it's appended to context, we say exactly what changes and why.
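To make the coffee-machine sentence concrete, here is a minimal sketch of that loop. It is an illustration under assumptions, not the implementation the series ships: `scripted_model` is a hypothetical stand-in for a real LLM call, the `Action:` / `Observation:` / `Final:` line prefixes are an invented text protocol, and `TOOLS` is a made-up registry with one toy tool.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function from argument string to result string.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split())),
}


def scripted_model(context: list[str]) -> str:
    """Stand-in for an LLM call: act once, then answer from the last observation."""
    observations = [line for line in context if line.startswith("Observation:")]
    if not observations:
        return "Action: add 2 3"
    return "Final: " + observations[-1].removeprefix("Observation: ")


def react(question: str, model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Run a bounded think/act loop until the model emits a final answer."""
    context = [f"Question: {question}"]
    for _step in range(max_steps):
        step = model(context)
        context.append(step)  # every model output is appended to context unchanged
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        if step.startswith("Action:"):
            name, _sep, arg = step.removeprefix("Action:").strip().partition(" ")
            tool = TOOLS.get(name)
            result = tool(arg) if tool else f"unknown tool {name!r}"
            # The tool result is wrapped as an observation before it enters context.
            context.append(f"Observation: {result}")
        # Any other output is treated as a thought and simply stays in context.
    raise RuntimeError("no final answer within max_steps")


answer = react("What is 2 + 3?", scripted_model)
```

The `max_steps` bound is a design choice in this sketch: an unbounded `while True` is closer to the one-sentence description, but a cap makes the "model never finishes" failure mode explicit instead of silent.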

2. Build it on one machine before you scale it

A single-process implementation is the unit test for your mental model. If you can't write a ReAct loop in 100 lines of plain Python, you have no business reasoning about durable distributed agent scheduling.

Every primitive that can be implemented in a single process is built there first — no framework, no abstractions you didn't write yourself. You'll see the naive version, the bug, the fix, and the production version. In that order.

3. Then take it distributed

After the single-process version works, we rewrite it for the realistic case: multiple agent workers behind a queue, durable state in Redis or Postgres, tracing across processes, scheduling under cost and rate-limit constraints. This is where most agent content drops off and where production actually lives.

You'll see the message protocols, the failure modes (deadlock, divergence, role drift), and how the patterns scale — or don't.
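The shape of "workers behind a queue with shared state" can be sketched in one process before any real infrastructure shows up. This is a toy under stated assumptions: threads stand in for worker processes, `queue.Queue` stands in for a real message queue, a lock-guarded dict stands in for Redis or Postgres, and `run_workers` and `handle` are hypothetical names, not an API from the series.

```python
import queue
import threading
from typing import Callable


def run_workers(
    tasks: list[str],
    handle: Callable[[str], str],
    n_workers: int = 3,
) -> dict[str, str]:
    """Push tasks through a queue to a pool of workers and collect results."""
    jobs = queue.Queue()          # stand-in for a real message queue
    state: dict[str, str] = {}    # stand-in for durable state (Redis/Postgres)
    lock = threading.Lock()

    def worker() -> None:
        while True:
            task = jobs.get()
            if task is None:      # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                result = handle(task)
            except Exception as exc:  # record the failure instead of killing the worker
                result = f"failed: {exc}"
            with lock:
                state[task] = result
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        jobs.put(task)
    for _ in threads:
        jobs.put(None)            # one sentinel per worker
    jobs.join()                   # wait until every queued item is marked done
    for t in threads:
        t.join()
    return state


results = run_workers(["plan", "search", "summarize"], handle=str.upper)
```

Recording failures in the state store rather than letting a worker die is what keeps `jobs.join()` from hanging in this sketch; a real deployment would add retries and visibility timeouts on top.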

The roadmap

Ten parts, ~60 tutorials, sequenced so each one earns the next. Topics get linked here as they ship.

Part 1 — The agent loop

The foundation. Before frameworks, before tools — what is an agent?

What makes a system "agentic" — autonomy, tools, feedback
The minimal loop: perceive → reason → act → observe
ReAct from scratch in <200 lines, no framework
LLM-as-planner vs LLM-as-worker
State machines vs flexible loops — when each wins

Part 2 — Tool use

The thing that makes an agent useful. The thing that makes it dangerous.

Native function calling — what's actually in the API payload
JSON-schema decoding and grammar-constrained outputs
Parallel tool calls — async patterns, fan-out/fan-in
Tool selection at scale — when you have 100+ tools (retrieval, hierarchies)
Tool result feedback — error recovery, retries, idempotency
Designing tools well — schema, side-effect taxonomy, reversibility
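As a preview of the fan-out/fan-in topic in the list above, here is a small sketch using `asyncio.gather`. The tool names, delays, and the `call_tool` function are invented for illustration; a real tool call would hit a network API rather than sleep.

```python
import asyncio


async def call_tool(name: str, delay: float) -> str:
    """Hypothetical tool call: sleeps to simulate I/O, and one tool always fails."""
    await asyncio.sleep(delay)
    if name == "flaky":
        raise RuntimeError("tool failed")
    return f"{name}: ok"


async def fan_out_fan_in(calls: list[tuple[str, float]]) -> list[str]:
    # Fan out: start every call concurrently.
    # Fan in: gather returns results in the same order as the inputs.
    results = await asyncio.gather(
        *(call_tool(name, delay) for name, delay in calls),
        return_exceptions=True,  # one failure does not cancel the other calls
    )
    return [
        f"{name}: error ({r})" if isinstance(r, Exception) else r
        for (name, _delay), r in zip(calls, results)
    ]


outputs = asyncio.run(
    fan_out_fan_in([("search", 0.02), ("flaky", 0.01), ("calc", 0.0)])
)
```

With `return_exceptions=True`, a failed tool comes back as a value the loop can turn into an error observation for the model, which is the hook for the error-recovery topic in the same list.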

Part 3 — Memory

The agent's view of the world beyond the current context.

Why the context window isn't enough — the agent memory hierarchy
Conversation memory — buffer, summary, sliding window
Vector memory — embeddings as a retrieval index
Key-value & structured memory — when you don't need similarity
Episodic vs semantic memory
Memory compaction & eviction — the bounded-context problem

Part 4 — Retrieval-Augmented Generation

RAG deserves its own part. It's one of the most common production agent patterns, and it comes up in most interview loops. We build the whole stack, including the bits everyone hand-waves through.

Why RAG — the parametric vs non-parametric memory split
The core pipeline from scratch — load, chunk, embed, retrieve, generate
Chunking strategies — fixed, semantic, recursive, document-aware
Embedding models — what they encode, where they fail
Vector DBs from the inside — HNSW, IVF, pgvector
Hybrid search — BM25 + dense + reranking with cross-encoders
Query transformation — HyDE, multi-query, step-back prompting
Advanced retrieval — parent-child, contextual retrieval, ColBERT
Agentic RAG — the agent decides what to retrieve, when, and how
Self-RAG & Corrective RAG
Graph RAG — knowledge graph + vector hybrid
RAG evaluation — faithfulness, context precision/recall, RAGAs

Part 5 — Planning & reasoning

When one step isn't enough.

Single-step vs multi-step planning
Plan-and-execute
Tree of Thoughts
LATS — Language Agent Tree Search
Self-reflection & critic loops
Verifier agents — separating "do" from "check"

Part 6 — Context engineering

The 2025–26 hot topic. Treating the context window as a resource you have to budget.

Context windows as a resource — budget like RAM
Compaction strategies — summarization, eviction, layered context
Subagent isolation — when to spin up a fresh context
Cache-friendly prompts — structuring for prompt caching
Context pollution — drift, recency bias, instruction decay

Part 7 — Multi-agent systems

Multiple agents, multiple failure modes.

The orchestrator-worker pattern
Debate & adversarial agents
Role-based crews (CrewAI-style)
Handoff protocols — OpenAI Swarm patterns
A2A and the interop landscape
Communication failure modes — deadlock, divergence, role drift

Part 8 — Build it from scratch

The frameworks everyone uses, taken apart.

A minimal agent loop in 100 lines
LangGraph internals — rebuild the executor
AutoGen architecture
OpenAI Agents SDK / Swarm internals
Claude Agent SDK & Claude Code's loop
MCP from scratch — protocol, custom server, custom client

Part 9 — Vertical agents

Where the patterns meet real product surfaces.

Coding agents — Claude Code, Cursor, Aider patterns
Browser & computer-use agents — OSWorld, Playwright loops
Research agents — Deep Research patterns
Data agents — NL2SQL, autonomous analytics
Long-running & scheduled agents — cron, event-driven, durable execution

Part 10 — Production agents

This is the part where the systems work pays off. Distributed scheduling, observability, eval pipelines, cost control — agents as infrastructure, not as demos.

Trajectory eval vs outcome eval; LLM-as-judge
Benchmarks — SWE-bench, GAIA, WebArena, OSWorld
Tracing & replay — Langfuse / custom OTEL pipelines
Safety — prompt injection, sandboxing, HITL checkpoints
Cost & latency engineering — model routing, parallelism, caching
Distributed agent scheduling — Ray, queue-backed workers, statefulness
A/B testing agents in production

How this series fits the bigger picture

This is one of four breadth tracks I'm building, with ML infrastructure as the depth bar — a T-shaped portfolio.

Pre-Training from Scratch — planned
Post-Training from Scratch — planned
Inference from Scratch — shipping now
Agents from Scratch — this series

Every breadth track lands in the same place: a "Production X" part that uses the infra depth — distributed scheduling, observability, queue infrastructure, eval pipelines. That's the bridge back to the vertical bar of the T.

How to follow along

You can read these in order — they're sequenced for that — or jump to the part most relevant to what you're building. RAG (Part 4) and Production Agents (Part 10) are the parts most interview loops care about; if you're optimizing for that, start there and double back for foundations.

Code lives alongside the prose. Single-process versions run on anything with a CPU and an API key. Distributed sections assume access to a small cluster, a managed Redis/Postgres, and a vector DB — instructions for free-tier setups are included where they apply.

If you find an error, an explanation that didn't land, or a topic that's missing, open an issue or reach out. This series is meant to be lived in, not just published.