Levels 6–8 of Agentic Engineering
This phase is reference material for advanced orchestration patterns. You may not execute all of it immediately after Phase 9. Return to it as projects grow. The progression from Level 5 to Level 6 typically takes months of daily practice, not days.
This phase assumes you can: evaluate AI-generated code and identify common failure patterns (Ch 38), write effective CLAUDE.md files and manage context (Ch 40), and use Claude Code for single-task development with tests (Ch 39-42).
A single Claude Code session is powerful. But some work doesn't fit in one session:
- Context limits: One session can't hold an entire large codebase. Specialized agents focus on specific areas.
- Separation of concerns: The agent that writes code shouldn't grade its own work. Splitting implementation and review catches more bugs.
- Parallelism: Three independent features built simultaneously finish in the time of one.
- Specialization: An agent configured for security auditing produces better results than a general-purpose agent asked to check security as an afterthought.
Think of it like a team at a company: one person writes code, another reviews it, another writes tests, another updates documentation. More people means more throughput—but also more coordination overhead. More agents without constraints produces chaos, not productivity. That's why this phase focuses on harnesses and guardrails as much as on dispatching work.
Chapter 43 Level 6 — Harness Engineering & Automated Feedback Loops
At Levels 1-5, you're the quality gate—you review every change. That works for single features, but it doesn't scale. Level 6 builds automated quality gates so agents can catch their own mistakes, freeing you to focus on design decisions instead of line-by-line review.
At Level 5, you gave the agent capabilities. At Level 6, you give the agent guardrails that let it verify its own work. The shift: your job moves from reviewing code to designing the harness.
Backpressure
Backpressure is automated feedback that lets agents self-correct without human intervention. Instead of you catching errors in code review, the tools catch them immediately:
| Mechanism | What It Catches |
|---|---|
| TypeScript strict / mypy | Type errors |
| Linter (ESLint, Ruff) | Style violations |
| Test suite | Behavioral regressions |
| Pre-commit hooks | Format, lint, type-check before every commit |
| CI pipeline | Integration failures |
When the agent runs git commit and a pre-commit hook fails, the agent sees the error, fixes the code, and tries again. That's backpressure in action.
You ask an agent to add a feature. It writes code with a subtle type error. Without a type checker, the code looks fine. You deploy it. It crashes in production. With pre-commit hooks running the type checker, the agent catches and fixes the error before it ever reaches you. The fix costs seconds instead of hours.
Constraints > Instructions
Step-by-step instructions tell the model how to work. Constraints tell it what success looks like and let it figure out the how. Constraints scale better because the model adapts its approach to the problem.
Step-by-step instructions (brittle):
1. Add an email field.
2. Add RFC 5322 validation.
3. Update the migration.
4. Update the tests.
Constraint-based prompt (scales):
Requirements: RFC 5322 validation, unique constraint.
Acceptance: All tests pass, new tests for valid/invalid/duplicate/null.
Run: Work until npm test passes.
Security Boundaries
Agents, code, and secrets live in separate trust domains. Never give an agent direct access to production credentials. Use environment variables, secret managers, and least-privilege access.
Docs-as-Navigation
Keep CLAUDE.md to a ~100-line table of contents. Detailed docs live elsewhere (architecture.md, API docs, deployment guide); the model discovers them on demand via the progressive disclosure pattern from Chapter 22.
TaskForge Connection
You'll set up a pre-commit hook for TaskForge that runs Ruff (linter) and pytest. This lets Claude Code self-correct when adding features.
Case Study: The Docker Scaffold
Here's a real-world instruction scaffold that applies every principle from this chapter and from Chapter 21's scaffold section. The goal: tell Claude Code how to containerize any project for multi-agent development. This scaffold was refined over dozens of iterations where Claude got things wrong—each failure became a new constraint.
What Makes It Work
The scaffold is a CLAUDE.md file, roughly 300 lines. Here's why each section exists:
One-sentence goal
"One container. When you exec into it, you get Claude Code pointed at your repo. CMD is sleep infinity." Leaves zero room for interpretation. The model knows what "done" looks like.
Exact file contents, not descriptions
The Dockerfile, docker-compose service, and shell script are provided verbatim—not described. FROM debian:bookworm-slim, not "use a Debian-based image." curl -fsSL https://claude.ai/install.sh | bash, not "install Claude Code." Every line the model generates from scratch is a line that could vary between runs.
Explicit prohibitions from real failures
Each "NEVER" line exists because Claude did the wrong thing in practice:
- "NEVER use npm install—it causes auth/PATH corruption" — Claude's training data includes old npm-based tutorials
- "Do NOT add CLAUDE_MODEL to the environment block—it is not a real env var" — Claude hallucinated this env var repeatedly
- "NEVER overwrite .env—append only" — Claude destroyed existing configuration by rewriting the file
- "Do NOT add any environment: block to the claude service" — Claude added redundant env vars that conflicted with env_file
Failure mode table
An authentication reference table maps symptoms to causes to fixes. "Onboarding wizard appears" → ".claude.json missing" → "Dockerfile pre-creates it." This gives the model a debugging playbook, not just a build recipe.
Validation checklist
15 yes/no checks that verify correctness: "Is the base image debian:bookworm-slim?" "Was .env NOT overwritten?" "Does agent.sh use --dangerously-skip-permissions?" These are deterministic guardrails. The model can't rationalize its way past a checklist.
This scaffold works because it minimizes the decisions the model needs to make. A vague "set up Docker" gives Claude thousands of possible configurations. The scaffold narrows that to essentially one. This is the fundamental pattern: the tighter the constraints, the more deterministic the output. Every unspecified detail is a coin flip. Good scaffolds eliminate coin flips.
Notice the scaffold doesn't just instruct—it protects. Append-only rules for .env prevent data loss. Separate Dockerfile.claude prevents clobbering the project's existing Docker setup. Volume mounts for auth prevent re-authentication on every restart. The scaffold is defensive engineering: it assumes the model will try to do something destructive and makes it structurally impossible.
When you build your own scaffolds—for deployment, testing, code generation, or any repeatable task—follow this pattern: goal → exact content → prohibitions → failure modes → validation. The more specific you are, the more reliable the output becomes, across every run, regardless of the model's non-determinism.
Micro-Exercises
Create .pre-commit-config.yaml for TaskForge with a Ruff linter step. Run pre-commit install. Make a deliberate style violation and commit—watch it get caught.
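A minimal config for this exercise might look like the following sketch — the `rev` tag is illustrative, so pin whatever release of the Ruff pre-commit mirror is current:

```yaml
# .pre-commit-config.yaml — Ruff lint step for TaskForge
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9   # illustrative tag — check the repo for the current release
    hooks:
      - id: ruff
```

After `pre-commit install`, every `git commit` runs this hook and blocks the commit on lint failures — that is the backpressure the agent will feel.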
Write a constraint-based prompt (not step-by-step) for adding a feature to TaskForge. Include: what to build, acceptance criteria, and the command to verify.
Set up pre-commit (linter + pytest) for TaskForge. Then ask Claude Code:
Add a `priority` field to tasks (high/medium/low) with validation.
Work on it until pre-commit passes cleanly.
Fix failures yourself. Don't ask me unless genuinely stuck.
Watch the agent self-correct through backpressure.
Verification: The pre-commit hook passes. The feature works. You didn't intervene.
If this doesn't work: (1) Pre-commit not running → pre-commit install must be run inside the git repo. (2) Agent enters infinite loop → constraints might be contradictory. Simplify. (3) Agent asks for help immediately → your CLAUDE.md might lack necessary context.
Interactive Exercises
Code Validator
Write validate_code(code_str) that checks Python code for common issues. Return a list of issue strings. Check for: bare except: (should specify exception type), from X import *, functions longer than 30 lines, and TODO comments.
Iterate through lines. Check each line for patterns.
For bare except: look for lines matching except: (with colon, no exception type).
For function length: track when a def starts and count indented lines until the next def or unindented line.
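Following the hints above, here is one minimal sketch of `validate_code`. It checks the four cases described — real linters (Ruff, Pylint) cover these far more robustly, and the function-length tracking here is deliberately simple (it ignores nested functions):

```python
def validate_code(code_str):
    """Return a list of issue strings for common Python code smells."""
    issues = []
    lines = code_str.splitlines()
    func_start = None   # (name, start line index) of the function being tracked
    func_indent = 0

    def close_function(end_idx):
        # Report the function if it ran longer than 30 lines.
        if func_start is not None:
            name, start = func_start
            length = end_idx - start
            if length > 30:
                issues.append(f"function '{name}' is {length} lines (max 30)")

    for i, line in enumerate(lines):
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped.startswith("except:"):
            issues.append(f"line {i + 1}: bare except (specify an exception type)")
        if stripped.startswith("from ") and stripped.endswith("import *"):
            issues.append(f"line {i + 1}: wildcard import")
        if "TODO" in stripped:
            issues.append(f"line {i + 1}: TODO comment")
        if stripped.startswith("def "):
            # A def at the same or lower indent closes the previous function.
            if func_start is not None and indent <= func_indent:
                close_function(i)
                func_start = None
            if func_start is None:
                name = stripped[4:].split("(")[0]
                func_start = (name, i)
                func_indent = indent
        elif func_start is not None and stripped and indent <= func_indent:
            # Any unindented statement ends the current function body.
            close_function(i)
            func_start = None
    close_function(len(lines))
    return issues
```

For example, `validate_code("except:\n    pass")` reports a bare `except`, and a 36-line function triggers the length check.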
Knowledge Check
What is 'backpressure' in agentic engineering?
Harness Setup
Chapter 44 Level 7 — Background Agents
Level 7 is where AI stops being a tool you use and starts being a team you manage. Instead of working with one agent at a time, you dispatch specialists—like a project manager assigning tasks to team members with different expertise. Each agent works in its own space, and results come back to you for review.
At Level 6, the agent self-corrects. At Level 7, you dispatch multiple agents on independent tasks while your session stays lean. This is where the leverage multiplies.
Background Agent Orchestration
Your main session becomes a command center. Workers execute in isolated contexts (fresh context windows, separate worktrees). Stuck workers surface questions back to you.
Remember: each "background agent" is just a Claude Code session with a fresh context, a specific task, and access to your project's files and tools. There's no magic—it uses the same model, the same CLAUDE.md, the same test suite. The leverage comes from parallelism and specialization, not from any new capability.
Dispatch is one such orchestrator. Install it with npx skills add bassimeledath/dispatch -g. Each worker gets a fresh context window, and stuck workers surface their questions back to you.
/dispatch pre-launch sweep for TaskForge:
1) security audit the auth flow — use opus, worktree
2) write missing integration tests — use sonnet
3) update documentation — use haiku
Multi-Model Dispatch
Different models for different tasks. Each has strengths:
| Model | Best For |
|---|---|
| Opus | Architecture, security audits, complex reasoning |
| Sonnet | Implementation, feature building, test writing |
| Haiku | Formatting, documentation, simple transforms |
| Gemini | Research, large-context analysis |
| Codex | Code review, parallel generation |
Choosing the Right Model for the Task
Not every task needs the most powerful model. Matching model capability to task complexity saves cost and often improves speed without sacrificing quality.
| Task Type | Recommended Tier | Why | Example |
|---|---|---|---|
| Architecture decisions, complex refactors, ambiguous specs | Highest capability (e.g., Opus) | Requires deep reasoning, handling ambiguity, and maintaining coherence across many files | "Redesign TaskForge's storage layer from JSON files to SQLite, maintaining all existing tests" |
| Feature implementation, bug fixes, test writing | Mid-tier (e.g., Sonnet) | Well-defined tasks with clear acceptance criteria; strong capability at lower cost | "Add a --priority flag to the add command with values low/medium/high" |
| Documentation, formatting, boilerplate, simple edits | Fast/lightweight (e.g., Haiku) | Mechanical tasks where speed matters more than deep reasoning | "Add docstrings to all public functions in taskforge/api.py" |
Model names change. "Opus," "Sonnet," and "Haiku" are current as of March 2026. The principle is stable: match model capability to task complexity. Use the most capable model for ambiguous or high-stakes work; use faster models for well-defined, mechanical tasks. Check docs.anthropic.com/models for the current lineup.
When in doubt, use the mid-tier model. It handles 80% of development tasks well. Escalate to the highest tier only when the task is ambiguous, touches many files, or requires architectural judgment. Drop to the lightweight tier for bulk operations where you'd review the output regardless.
Implementer/Reviewer Separation
Never let the same model grade its own exam. If Sonnet writes the code, use Opus or a different model to review it. Biased self-evaluation is a known failure mode.
The Ralph Loop
The Ralph Loop is an autonomous looping agent pattern: the agent runs a task, checks the results against success criteria, and if the criteria aren't met, it iterates—fixing issues and re-checking—until the task succeeds or a maximum iteration limit is hit. Each iteration gets a fresh context window, which prevents context pollution from failed attempts. The loop is self-correcting: the agent doesn't retry blindly, it reads the failure output and adapts its approach. This makes it powerful for tasks with clear, testable success criteria—and dangerous for vague ones. The loop amplifies ambiguity: an under-specified PRD (Product Requirements Document) becomes a confidently wrong implementation repeated across 10 iterations.
Set up a simple Ralph Loop for TaskForge:
claude --background "Fix all failing tests in TaskForge.
Run: python3 -m pytest
If any tests fail, read the failure output, fix the code, and re-run.
Repeat until all tests pass.
Maximum 5 iterations. If still failing after 5, stop and report what's left."
To test this, intentionally break something first: introduce a bug in models.py (e.g., change a return value, break a validation check). Then launch the loop and watch the agent find and fix it.
Verification: The agent's final output shows all tests passing. Check git diff to confirm the fix is sensible—not a hack like deleting the failing test.
If this doesn't work: (1) Agent loops forever → the iteration limit is essential, never omit it. (2) Agent deletes tests instead of fixing code → add to the prompt: "Do NOT delete or skip tests. Fix the source code." (3) --background not available → use dispatch instead: /dispatch fix all failing tests, max 5 iterations — use sonnet.
A Ralph Loop without a maximum iteration count can run indefinitely, burning API credits and potentially making the codebase worse with each pass. Always specify a hard limit (3-5 iterations for most tasks). If the agent can't fix it in 5 tries, the problem needs human judgment, not more iterations.
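If `--background` isn't available in your setup, the same loop can be approximated with a plain shell script. This is a hedged sketch, assuming the `claude` CLI with the `-p` and `--dangerously-skip-permissions` flags described in Chapter 45; the prompt text is illustrative:

```shell
#!/usr/bin/env bash
# Ralph-style loop: run tests, let the agent fix failures, hard cap at 5.
MAX_ITERATIONS=5
for i in $(seq 1 "$MAX_ITERATIONS"); do
  if python3 -m pytest; then
    echo "All tests pass after $i iteration(s)."
    exit 0
  fi
  # Each invocation is a fresh session — no context pollution between tries.
  claude --dangerously-skip-permissions \
    -p "Tests are failing. Read the pytest output and fix the source code. Do NOT delete or skip tests."
done
echo "Still failing after $MAX_ITERATIONS iterations — needs human judgment."
exit 1
```

Note the hard limit is baked into the loop itself, not left to the prompt.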
CI-Triggered Agents
Agents that activate on repository events: PR review bots on pull request, docs updaters on merge, security scanners on dependency changes.
TaskForge Connection
You'll dispatch multiple independent improvements to TaskForge using different models. This is the first time TaskForge benefits from parallel AI development.
Micro-Exercises
Run npx skills add bassimeledath/dispatch -g. Then: /dispatch use sonnet to list all functions in the project without docstrings.
Dispatch two tasks simultaneously to different models. Compare output quality. Which model was better for which task?
Dispatch three independent TaskForge improvements:
/dispatch three TaskForge improvements:
1) add due dates with reminder logic — use sonnet
2) add tags with filtering — use sonnet
3) improve error handling across all functions — use haiku
Review all three outputs. Are there conflicts? Fix them.
Verification: All three features work. Tests pass. No conflicts in the merged result.
If this doesn't work: (1) Dispatch not found → ensure install completed, restart Claude Code. (2) Workers fail silently → check .dispatch/ for logs. (3) Merge conflicts → expected with parallel work. Resolve manually.
Interactive Exercises
Knowledge Check
What is a Ralph Loop?
Knowledge Check
Why should you set a maximum iteration count for autonomous agent loops?
Background Agent Usage
Chapter 45 Claude Code in Docker
Running Claude Code in Docker gives you isolated, reproducible AI coding environments. This is how you scale from one agent to many without polluting your local machine.
Why Run Claude Code in Docker?
Running Claude Code directly on your host machine works fine for single sessions. But as you scale to multiple agents, Docker solves problems that become unavoidable:
- Isolated environments per project/agent: Each agent gets its own filesystem, tools, and dependencies—no cross-contamination.
- Reproducible setups: Same tools, same config, every time. No drift between machines or sessions.
- Safe experimentation with --dangerously-skip-permissions: Container isolation makes this flag safe—the agent can't touch anything outside the container.
- Multiple parallel agents: Run several agents simultaneously on the same codebase using volume mounts.
- CI/CD integration: Trigger Claude Code agents from pipelines—automated code review, test generation, documentation updates.
Prerequisites
Before proceeding, ensure you have:
- Docker installed and running (Chapter 16)
- Claude Max or Pro subscription (or an Anthropic API key)
The Dockerfile.claude Pattern
Here's the production-grade Dockerfile for running Claude Code in a container. This is the result of the scaffold pattern from Chapter 43—every line exists because leaving it out caused a failure.
FROM debian:bookworm-slim
# System packages — minimal set for Claude Code to operate
RUN apt-get update && apt-get install -y --no-install-recommends \
git bash curl ca-certificates build-essential \
jq tree ripgrep python3 python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Non-root user
RUN groupadd -r claude && useradd -r -g claude -m -s /bin/bash claude
# Install Claude Code (NATIVE installer — never npm)
USER claude
WORKDIR /home/claude
RUN curl -fsSL https://claude.ai/install.sh | bash
ENV PATH="/home/claude/.local/bin:/home/claude/.claude/bin:${PATH}"
# Pre-configure auth so the onboarding wizard never appears
RUN mkdir -p /home/claude/.claude && \
echo '{"hasCompletedOnboarding":true,"installMethod":"native"}' \
> /home/claude/.claude/.claude.json && \
ln -sf /home/claude/.claude/.claude.json /home/claude/.claude.json && \
echo '{"permissions":{"allow":["*"],"deny":[]},"env":{"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC":"1"}}' \
> /home/claude/.claude/settings.json
WORKDIR /workspace
CMD ["sleep", "infinity"]
Why each decision matters:
| Decision | Why |
|---|---|
| debian:bookworm-slim | Must be glibc-based. Alpine/musl breaks the native Claude Code binary. |
| Native installer, not npm | The npm package uses different config paths and causes auth corruption. The native installer is Anthropic's official method. |
| Non-root claude user | Security best practice. Limits blast radius if something goes wrong. |
| Pre-created .claude.json | Without it, the onboarding wizard appears every time—blocking headless operation. The symlink covers both config paths different versions check. |
| settings.json with "allow":["*"] | Equivalent to --dangerously-skip-permissions baked into config. No prompts in headless mode. |
| CMD ["sleep", "infinity"] | Container stays alive. You docker exec into it for each session. Multiple execs = multiple Claude Code sessions, all sharing the workspace. |
This Dockerfile is the scaffold pattern in action. A vague "create a Dockerfile for Claude Code" instruction produces wildly different results: npm vs native, Alpine vs Debian, root vs non-root, ENTRYPOINT vs CMD. Each wrong choice causes a specific failure—auth loops, binary crashes, permission errors. The scaffold eliminates every coin flip. That's how you make non-deterministic output reliable.
Authentication Methods
Claude Code needs authentication. Two approaches, and they must never coexist in the same .env:
OAuth Token (Claude Max/Pro)
Generate a token on your host machine by running claude setup-token, then add the resulting token to your .env file:
CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-your-token-here
The token is passed into the container via env_file: .env in docker-compose.yml. It is never baked into the Docker image.
API Key
For per-token billing (without a subscription):
ANTHROPIC_API_KEY=sk-ant-api03-your-key-here
If both ANTHROPIC_API_KEY and CLAUDE_CODE_OAUTH_TOKEN exist, the API key takes precedence and you get billed per-token instead of using your subscription. Pick one.
Running a Single Claude Agent
The pattern: start the container once, then docker exec into it for each session. Each exec is an independent Claude Code session sharing the same workspace.
The -v $(pwd):/workspace volume mount maps your current directory into the container. Changes the agent makes inside the container appear on your host filesystem immediately.
Running with --dangerously-skip-permissions
The --dangerously-skip-permissions flag tells Claude Code to skip all permission prompts—file writes, command execution, network requests. The agent acts without asking.
On your host machine, this flag is genuinely dangerous—the agent could modify any file, run any command. Inside a Docker container, the risk is contained: the agent can only affect the container's filesystem and the mounted volume. This is why containers and --dangerously-skip-permissions are a natural pair: container isolation makes autonomous operation safe.
Run a non-interactive task with full autonomy:
The -p flag passes a prompt directly instead of opening an interactive session. Combined with --dangerously-skip-permissions, this lets the agent work completely autonomously: read files, edit code, run tests, and iterate until done.
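A one-shot invocation might look like this sketch — the container name `claude-agents` matches the compose default shown below, and the prompt text is illustrative:

```shell
# Run a fully autonomous, non-interactive task inside the container
docker exec claude-agents claude \
  --dangerously-skip-permissions \
  -p "Run the test suite. Fix any failures and re-run until all tests pass."
```

The command returns when the agent finishes; the changes appear in your mounted workspace.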
Docker Compose Setup
The compose file defines the claude service alongside any existing project services:
# docker-compose.yml
services:
claude:
build:
context: .
dockerfile: Dockerfile.claude
container_name: claude-${COMPOSE_PROJECT_NAME:-agents}
env_file: .env
volumes:
- .:/workspace
- claude-auth:/home/claude/.claude
working_dir: /workspace
stdin_open: true
tty: true
volumes:
claude-auth:
name: claude-auth-${COMPOSE_PROJECT_NAME:-default}
Key details: claude-auth volume persists authentication across container restarts. .:/workspace mounts your project. env_file: .env passes the OAuth token. The container name includes the project directory name so each repo gets a unique container.
When multiple agents share the same volume mount, they can write to the same files simultaneously. This can cause conflicts—one agent overwrites another's changes. Strategies: (1) assign agents to non-overlapping directories, (2) use git worktrees so each agent has its own working copy, (3) run agents sequentially in a pipeline instead of in parallel.
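Strategy (2) can be sketched with git's built-in worktree command — the directory and branch names here are illustrative:

```shell
# One working copy per agent, each on its own branch
git worktree add ../taskforge-feature-a -b feature-a
git worktree add ../taskforge-feature-b -b feature-b
# Mount each directory into its own container; merge the branches when done
git worktree list
```

Each agent then edits its own checkout, so parallel writes can't clobber each other; conflicts surface only at merge time, where git can manage them.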
The agent.sh Pattern
A thin convenience script that ensures the container is running, then execs in. The model is forced via --model flag (environment variables don't work for model selection).
#!/usr/bin/env bash
set -euo pipefail
PROJECT_NAME="${COMPOSE_PROJECT_NAME:-$(basename "$(pwd)")}"
CONTAINER="claude-${PROJECT_NAME}"
# Start container if not running
if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER}$"; then
docker compose up -d claude
sleep 2
fi
# Interactive or one-shot
if [ $# -gt 0 ]; then
docker exec -it "$CONTAINER" claude \
--dangerously-skip-permissions --model claude-opus-4-6 -p "$*"
else
docker exec -it "$CONTAINER" claude \
--dangerously-skip-permissions --model claude-opus-4-6
fi
Usage:
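Per the script above, both modes look like this (the one-shot prompt is illustrative):

```shell
./agent.sh                           # interactive Claude Code session in the container
./agent.sh "fix the failing tests"   # one-shot: runs the prompt, then exits
```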
To run multiple agents, open more terminal panes and run ./agent.sh in each one. Each exec is an independent session, all sharing the same workspace.
Practical Patterns
Sequential Pipeline
Agent 1 writes code, Agent 2 reviews it, Agent 3 writes tests. Each agent's output feeds into the next. This ensures review before testing and catches issues early.
Parallel Workers
Multiple agents work on independent features simultaneously. When all finish, merge the results. Best when features don't share files.
Watcher Agent
An agent that monitors test results (or CI output) and dispatches fix agents when tests fail. This creates a self-healing pipeline: break something, and the watcher assigns an agent to fix it.
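A minimal watcher can be sketched as a polling loop in shell — assuming the `claude` CLI from this chapter; a production version would react to CI webhooks rather than polling, and the prompt is illustrative:

```shell
#!/usr/bin/env bash
# Poll the test suite; dispatch a fix agent whenever it goes red.
while true; do
  if ! python3 -m pytest --quiet; then
    claude --dangerously-skip-permissions \
      -p "Tests just failed. Read the pytest output, fix the source code, and re-run until green. Do NOT delete tests."
  fi
  sleep 300   # poll every 5 minutes
done
```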
Resource Management
Docker lets you limit resources per container so one runaway agent doesn't consume your entire machine:
# docker-compose.yml
services:
agent-backend:
build:
context: .
dockerfile: Dockerfile.claude
command: ["--dangerously-skip-permissions", "-p", "implement API endpoints"]
volumes:
- .:/workspace
environment:
- ANTHROPIC_API_KEY
deploy:
resources:
limits:
memory: 2G
cpus: "1.0"
Clean up stopped containers regularly to reclaim disk space:
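For example, with Docker's built-in prune commands:

```shell
docker container prune -f   # remove all stopped containers
docker image prune -f       # remove dangling images
```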
Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Onboarding wizard appears | .claude.json missing or malformed | Dockerfile must pre-create it + symlink both paths |
| Token expires on restart | Auth files not persisted | claude-auth volume on /home/claude/.claude |
| API billing instead of subscription | Both ANTHROPIC_API_KEY and OAuth token in .env | Remove the API key; keep only the OAuth token |
| Binary crashes on startup | Alpine/musl base image | Use debian:bookworm-slim (glibc) |
| npm install breaks auth flow | npm version uses different config paths | Use native installer only |
TaskForge Connection
Run two Claude agents in Docker simultaneously—one to add a new feature to TaskForge, another to write tests for existing features. Then merge the results. This is the first time you'll see parallel AI development on your own project with full container isolation.
Micro-Exercises
Create Dockerfile.claude for TaskForge using the pattern above. Build with docker compose build claude. Verify the container starts: docker compose up -d claude && docker exec claude-taskforge-project echo "ok".
Run ./agent.sh "list all Python files in the project" and examine the output. Verify the agent found your project files inside /workspace.
Set up the full Docker scaffold for TaskForge: Dockerfile.claude, docker-compose.yml (with claude-auth volume), agent.sh, and .env with your OAuth token. Then open two terminal panes and run:
Both sessions share the same container and workspace.
Verification: Both agents complete their tasks. git diff shows modifications from both agents. The container stayed running throughout.
If this doesn't work: (1) Onboarding wizard appears → docker compose down -v && docker compose build --no-cache claude && docker compose up -d claude. (2) Token expired → run claude setup-token on host, update .env, restart container. (3) Agents conflict on the same file → expected; resolve manually or use git worktrees.
Interactive Exercises
Knowledge Check
Why is --dangerously-skip-permissions acceptable inside a Docker container?
Knowledge Check
What does -v $(pwd):/workspace do when running Claude Code in Docker?
Docker Agent
Chapter 46 Level 8 — Autonomous Agent Teams
Level 8 is the frontier—and it's important to understand both its power and its limits. Agent teams can tackle large projects where coordination between frontend, backend, and testing is essential. But more agents also means more coordination overhead, more potential for conflicting changes, and more need for robust CI. This chapter teaches you when teams are worth the complexity—and when simpler patterns are better.
Level 7 dispatches workers that report back to you. Level 8 is agents that communicate with each other—peer-to-peer coordination, not hub-and-spoke. This is experimental territory.
Peer-to-Peer Agent Coordination
Instead of all workers reporting to a single orchestrator, agents in a team share a mailbox and coordinate directly. A frontend agent can ask a backend agent about an API contract without routing through you.
Claude Code Agent Teams (experimental): export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1. Team lead + workers with shared mailbox.
What Pioneers Found
Anthropic (16 agents building a C compiler): needed CI to prevent regressions. Without automated tests, agents would break each other's work.
Cursor (hundreds of agents for codebase migration): without hierarchy, agents churned—making and reverting the same changes repeatedly.
Current Orchestrators
| Tool | Pattern |
|---|---|
| Dispatch | Local, hub-and-spoke |
| Gas Town | Structured workflows |
| Multiclaude | Simple parallel execution |
| Claude Flow | Complex multi-step workflows |
| Ramp Inspect | Cloud VMs for isolation |
"For day-to-day work, Level 7 is where the leverage is." Level 8 is for very large projects where the coordination cost is justified. Most TaskForge-sized projects never need it.
More agents = more throughput—three features built simultaneously instead of sequentially. But more agents = more coordination overhead—merge conflicts, inconsistent patterns, duplicated work. And more agents without constraints = chaos—agents that undo each other's changes, introduce conflicting patterns, or confidently build the wrong thing faster. Verification and CI become more important as concurrency increases, not less.
Decision Tree: Choosing the Right Pattern
More agents is not always better. Match the coordination pattern to the task:
When NOT to Use Multiple Agents
The decision tree above starts with "Simple, one-file" routing to a single session. But that category is larger than it looks. Most tasks fit a single agent. Here's the concrete decision framework:
Use a single well-configured agent when:
- The task fits in one context window (roughly <20 files touched)
- The feature touches fewer than 5 files
- There are no independent subtasks that could run in parallel
- You can describe the entire change in one prompt
Escalate to multi-agent when:
- Independent features can genuinely be parallelized (not just "it would be nice")
- Implementation and review should be separated (builder-validator pattern)
- The task exceeds one context window and has natural split points
- You need different tool configurations for different subtasks
TaskForge at its current size? Single agent. A full-stack app with separate frontend, backend, and infrastructure changes? Consider dispatch. If you find yourself writing more orchestration code than feature code, step back to a simpler pattern.
TaskForge Connection
Would agent teams be appropriate for TaskForge? No—it's too small. Subagents or Dispatch are sufficient. Knowing when not to use a pattern is as important as knowing how.
Micro-Exercises
Write a decision tree (on paper or in markdown) for when you'd use each agent pattern. Include at least 5 branches.
For your TaskForge project: would agent teams be appropriate? Why or why not? (Expected: no—it's too small.)
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."
—Antoine de Saint-Exupéry, Airman's Odyssey
Dispatch two agents to work on different TaskForge features simultaneously, then resolve the resulting merge conflict:
/dispatch two TaskForge features in parallel:
1) add an "archive_task" command that moves completed tasks
to an "archived" list — use sonnet, worktree
2) add a "task_stats" command that prints counts of
pending/completed tasks — use sonnet, worktreeBoth agents will modify shared files (main.py, models.py, tests). When both finish:
- Try to merge both branches into
main. Expect a merge conflict. - Open the conflicting file(s). Look for
<<<<<<<markers. - Resolve by keeping both features—don't discard either agent's work.
- Run
python3 -m pytestto verify both features work together.
Expected outcome: Both archive_task and task_stats work. Tests pass. Git log shows the merge commit.
Verification: git log --graph --oneline shows two branches merging. Both new commands work when tested manually. All tests pass.
If this doesn't work: (1) No merge conflict → agents may have touched different files. This is fine—it means the features were truly independent. (2) Tests fail after merge → both agents may have added conflicting test fixtures. Reconcile the test setup. (3) One agent's feature breaks the other → this is the coordination cost from the chapter introduction. Fix by reading both implementations and adjusting the integration points.
Interactive Exercises
Knowledge Check
When should you NOT use autonomous agent teams?
Dependency Planner
Write plan_execution(tasks) that takes a dict mapping task names to their dependencies (list of task names) and returns a list of sets, where each set contains tasks that can run in parallel. Tasks in later sets depend on tasks in earlier sets. Raise ValueError if there's a circular dependency.
Start by finding tasks with no dependencies (no unresolved deps). Those go in the first set.
After processing a set, remove those tasks from all dependency lists. Repeat until all tasks are planned.
If a round produces no new tasks but some remain, there's a cycle.
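The steps above can be sketched directly in Python. This is one of several valid implementations, following the hints exactly—find the ready tasks, plan them as a round, remove them from the remaining dependency lists, and repeat:

```python
def plan_execution(tasks: dict[str, list[str]]) -> list[set[str]]:
    """Group tasks into rounds; tasks within a round can run in parallel."""
    remaining = {name: set(deps) for name, deps in tasks.items()}
    plan = []
    while remaining:
        # Tasks with no unresolved dependencies can run in this round.
        ready = {name for name, deps in remaining.items() if not deps}
        if not ready:
            # Some tasks remain but none are ready: they depend on each other.
            raise ValueError("circular dependency detected")
        plan.append(ready)
        for name in ready:
            del remaining[name]
        # Mark this round's tasks as resolved for everything still waiting.
        for deps in remaining.values():
            deps -= ready
    return plan
```

For example, `plan_execution({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]})` returns `[{"a"}, {"b", "c"}, {"d"}]`: run `a` first, then `b` and `c` in parallel, then `d`.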
Multi-Agent Experience
Chapter 47 The Multiplayer Effect and What Comes Next
Individual skill has a ceiling. Teams have compounding leverage. This chapter connects your personal growth through the 8 Levels to the broader challenge of making your team effective—because the bottleneck is always the least-equipped member of the workflow.
Your individual level matters, but your team's level matters more: raising the team's floor is a different problem from raising your own ceiling.
The Multiplayer Bottleneck
"If you're Level 7 raising PRs while you sleep, but your reviewer is Level 2, your throughput is capped at Level 2."
Team Skills Registry
At Block (and other companies), shared skills get PRs, reviews, and versions—same as code. A team skills registry means everyone benefits from every improvement.
Self-Assessment
The full self-assessment quiz with detailed checklists and next-step guidance for each level is in Appendix C: Self-Assessment Quiz. Take it now—identify your current level and the specific item blocking you from the next one.
What Comes Next
The field is moving fast. Expect voice-to-voice coding, tighter CI/CD integration, cross-model coordination protocols, and the software development loop itself being reimagined. But the fundamentals from Phases 1-4 don't change. Code is still logic expressed in text. Tests still verify behavior. Architecture still matters.
TaskForge Connection
Look at how far TaskForge has come: from a 40-line script to a tested, structured, AI-configured, multi-agent-ready project. That progression mirrors the 8 Levels. Your next project starts at whatever level you've reached.
Micro-Exercises
Use the full checklist in Appendix C. Write down your level honestly. Identify the specific item that blocks you from the next level.
If you work on a team (even a team of 2): estimate each member's level. Find the bottleneck. What's one thing you could share (a skill, a CLAUDE.md template, a workflow) to raise the team's floor?
Try This Now
Take the self-assessment above. Identify your current level. Write a 5-sentence action plan for reaching the next level within 30 days. Be specific: what tool to install, what skill to create, what habit to build.
Verification: Your action plan has concrete dates and deliverables, not just intentions.
"The computer programmer is a creator of universes for which he alone is the lawgiver. No playwright, no combiner of things ever fashioned universes complete with their own laws."
—Joseph Weizenbaum, Computer Power and Human Reason
Interactive Exercises
Knowledge Check
At what levels do most professional developers operate?
Self Assessment & Action Plan
Phase 10 Gate Checkpoint & TaskForge Multi-Agent
Minimum Competency
Pre-commit hooks providing backpressure. 2+ background agent tasks dispatched and reviewed. Output review with error identification. Agent pattern decision criteria articulated.
Your Artifact
TaskForge with: git log showing agent-implemented features with backpressure verification. A written decision tree covering all 5 agent patterns.
Verification
Pre-commit passes. Agent output was reviewed (corrections documented). Decision tree covers all patterns with specific criteria.
If background agents produce output you cannot evaluate → return to Phase 9. Evaluation skill (Chapter 38) is the prerequisite for everything in Phase 10.
TaskForge Checkpoint
TaskForge now has multi-agent-implemented features, automated quality gates, and a decision framework for future agent coordination. The curriculum is complete.
What You Can Now Do
- Design automated feedback loops (harnesses) that let agents self-correct
- Dispatch background agents for parallel development
- Choose the right agent pattern for the right task
- Evaluate and merge multi-agent output
- Articulate when multi-agent coordination is worth the overhead—and when it isn't
The Full Arc
Look at how far you've come. In Phase 1, you couldn't read a line of code. Now you can orchestrate multiple AI agents working in parallel on a structured project with automated quality gates. Every phase was necessary: you can't supervise AI code you can't read (Phase 1), test code you can't write (Phase 2), manage projects without professional tools (Phase 3), deploy without infrastructure knowledge (Phase 4), or direct agents without context engineering skills (Phase 9). The phases aren't detours—they're the prerequisites that make competent AI supervision possible.