Phase 9 — AI-Assisted Development

Levels 1–5 of Agentic Engineering

From evaluating AI output to custom skills and MCPs. The progression from "I can judge AI code" to "the AI has capabilities I've given it."

Chapters 38–42 · Phase Gate + TaskForge
Before You Begin Phase 9

This phase assumes you have completed Phases 1–8. Specifically: you can read and write Python confidently (Phase 2), use Git for version control (Phase 3), build a working web application with Flask, SQLite, and Docker (Phase 4), and have a solid grasp of data structures (Phase 5), algorithms (Phase 6), engineering craft (Phase 7), and systems design (Phase 8). Phases 5–8 provide additional depth that strengthens your ability to evaluate AI-generated code.

Before You Begin: What an LLM Actually Is

You're about to start using AI coding tools. Before you do, understand what they are—and what they are not.

A large language model (LLM) predicts the most likely next words based on patterns learned from vast amounts of text. It does not "understand" your project the way a human engineer does. It has no memory between sessions. It cannot verify its own output. It generates text that looks correct based on patterns—and it's often genuinely useful—but it can also produce confident, plausible nonsense.

Think of it like a brilliant research assistant who has read millions of codebases but has never seen yours. They can write fluent code, but they might misunderstand your architecture, invent APIs that don't exist, or produce logic that passes a surface read but fails on edge cases.

Agents are not magical beings. They are workflows built around model calls: a prompt goes in, a response comes out, tools execute actions, and rules constrain behavior. Every "intelligent" thing an agent does traces back to context you provided, tools you configured, and constraints you set. The better your setup, the better the agent performs.

This means two things: (1) your judgment is the quality gate, and (2) everything you've learned in Phases 1-4—reading code, understanding data flow, writing tests—is what makes that judgment possible.

Chapter 38 Evaluating and Directing AI-Generated Code

Why This Matters Now

AI tools generate code fast. But speed without judgment is dangerous. This chapter teaches you to be the quality gate—the person who decides whether AI output is correct, complete, and safe before it enters your codebase.

This is the most important chapter in Phase 9. Without these skills, every subsequent chapter produces a reader who uses AI tools without the judgment to evaluate their output.

The Core Problem: Non-Determinism

Every programming concept you've learned so far is deterministic. Run 2 + 2 a million times, you get 4 every time. Call sorted([3,1,2]) and the result is always [1,2,3]. Functions have inputs and outputs. Tests pass or fail. This predictability is the entire foundation of software engineering.

Large language models (LLMs) are fundamentally non-deterministic. Ask the same question twice and you may get different answers. Ask it to write a function and the variable names, structure, edge case handling, and even the algorithm choice can vary between runs. This isn't a bug—it's how the technology works. LLMs sample from probability distributions over tokens; small floating-point differences cascade into different outputs.
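
To make the sampling claim concrete, here is a toy sketch (plain Python, not a real model) of the per-token sampling step: softmax over logits, then a random draw. The logits are invented for illustration.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=None):
    """Sample an index from softmax(logits / temperature).

    A toy model of the per-token sampling step inside an LLM:
    higher-probability tokens win more often, but not always.
    """
    rng = rng or random
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for numeric stability
    weights = [math.exp(x - peak) for x in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

# Identical input, fifty runs: the chosen token still varies.
logits = [2.0, 1.5, 0.5]
choices = {sample_token(logits, rng=random.Random(seed)) for seed in range(50)}
```

Run it and `choices` will almost always contain more than one index: same input, different outputs. That variability, compounded over thousands of tokens, is why two generations of the same function can differ.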

This Changes Everything

When you use a calculator, you trust the output. When you use an LLM, you verify the output. This is not a philosophical position—it's an engineering requirement. The entire discipline of AI-assisted development exists because LLM output cannot be trusted by default.

This creates a paradox: you want reliable software from an unreliable generator. The only way to make a non-deterministic system behave as if it were deterministic is to surround it with extremely robust guardrails:

  • Tests — the code either passes or it doesn't, regardless of how it was generated
  • Specifications — precise instructions reduce the space of possible outputs
  • Context documents (CLAUDE.md, rules files) — constrain the model's choices to match your project's patterns
  • Review pipelines — diffs, linting, type checks catch what the model missed
  • Deterministic validation — CI/CD, automated checks, acceptance criteria that don't depend on judgment
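
The first guardrail is worth seeing in miniature. A test is deterministic even when the code it checks was not deterministically produced — this sketch uses a hypothetical `slugify` function standing in for any AI-generated code:

```python
import re

def slugify(title):
    """Hypothetical function under test — imagine an AI wrote it."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify():
    # These assertions pass or fail identically no matter how slugify
    # was generated: the test is the deterministic boundary.
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Already--slugged  ") == "already-slugged"
    assert slugify("") == ""

test_slugify()
```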

Every technique in the levels ahead—context engineering, compounding rules, harness engineering, Docker isolation—exists to constrain non-deterministic output into deterministic boundaries. The guardrails aren't optional extras. They are the entire point. Without them, you're shipping lottery tickets.

The AI generates. You verify. The guardrails enforce. That's the contract. Every chapter from here on teaches you to build better guardrails.

Reading Diffs

A diff shows line-by-line changes: + = added, - = removed. When Claude Code makes changes, you review the diff before accepting. Check: does the change match what I asked? Any unrelated changes? Any wrong deletions?

Worked Example: Reviewing a Real Diff

Here's how an experienced developer reviews a diff. Suppose you asked Claude to "add error handling to the load_tasks function." The diff looks like this:

 def load_tasks(filepath):
-    with open(filepath) as f:
-        data = json.load(f)
-    return data
+    if not isinstance(filepath, str) or not filepath.strip():
+        raise ValueError("filepath must be a non-empty string")
+    try:
+        with open(filepath) as f:
+            data = json.load(f)
+    except:
+        return []
+    if not isinstance(data, list):
+        raise TypeError(f"Expected list, got {type(data).__name__}")
-    import os
     return data

Now the annotation—line by line:

  • Lines 1-2 (added): Good—added input validation for filepath. Prevents cryptic errors from None or empty strings reaching open().
  • Line 5 (except:): Problem—this is a bare except that catches all exceptions, including KeyboardInterrupt and SystemExit. It should be except (FileNotFoundError, json.JSONDecodeError): to catch only the expected failures.
  • Line 6 (return []): Question—silently returning an empty list hides errors. Is that intentional, or should it log a warning? Depends on the project's error philosophy.
  • Lines 8-9 (type check on data): Good—validates the JSON structure before returning. Catches corrupted files early.
  • Line 10 (import os removed): Question—why was this import removed? If nothing else uses os, this is a fine cleanup. But check the rest of the file first.

Verdict: Revise. The bare except is a real bug—ask Claude to catch specific exceptions. The rest is solid.
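
After the requested revision, the function might look like this — the specific-exception fix applied, with the silent empty-list fallback kept as an explicit, commented decision rather than an accident:

```python
import json

def load_tasks(filepath):
    """Load a list of tasks from a JSON file."""
    if not isinstance(filepath, str) or not filepath.strip():
        raise ValueError("filepath must be a non-empty string")
    try:
        with open(filepath) as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Deliberate choice: a missing or corrupt file means "no tasks".
        # If the project's error philosophy disagrees, log or re-raise here.
        return []
    if not isinstance(data, list):
        raise TypeError(f"Expected list, got {type(data).__name__}")
    return data
```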

Writing Implementation Specs

The difference between getting mediocre AI output and great AI output is the quality of your specification:

Vague Request
"Make a login page"
Implementation Spec
What: Login page with email/password
Inputs: email (string), password (string)
Outputs: JWT token on success, error message on failure
Constraints: Rate limit: 5 attempts/minute. No plaintext password storage.
Edge cases: Empty fields, SQL injection attempt, expired session
Acceptance criteria: All 5 test cases pass

Recognizing AI Failures

Common AI code-generation failure types and how to catch them
  • Hallucinated imports — looks like from nonexistent import magic; caught when pip install fails with a 404.
  • Stale patterns — deprecated API usage; caught by checking library docs for the current version.
  • Over-engineering — 200 lines for a 10-line problem; caught by asking "Could this be simpler?"
  • Happy path only — no error handling; caught by testing with bad input.
  • Confident wrongness — incorrect with full confidence; caught by verifying claims against docs.
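
The first failure type can be screened deterministically before anything is installed or run. A small sketch using only the standard library — it can't distinguish a hallucinated package from a merely uninstalled one, but either way the import deserves scrutiny:

```python
import importlib.util

def check_imports(module_names):
    """Map each top-level module name to whether it can be found locally."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}

result = check_imports(["json", "definitely_nonexistent_magic_pkg"])
```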
Receive AI Output → Read Diff → Check Spec → Edge Cases → Run Tests → Accept?
Every AI output goes through this pipeline. The pipeline catches what the model missed.

TaskForge Connection

In the exercises below, you'll use TaskForge as the subject for AI code generation. You have enough context about the project to judge whether AI output is correct.

Micro-Exercises

1: Catch a Hallucination

Ask an AI: "What Python package is best for parsing XQVZ format files?" If it confidently recommends a package, search PyPI for it. It almost certainly doesn't exist—the format is made up.

2: Evaluate AI Output

Ask an AI to write is_valid_email. Check: does it handle empty strings? Multiple @? Spaces? If any are missing, the AI failed—and you caught it.
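
A tiny harness makes that check repeatable. The `is_valid_email` below is a stand-in implementation — swap in the AI's version and see which edge cases it drops:

```python
def is_valid_email(s):
    """Stand-in validator — replace this with the AI-generated version."""
    if s.count("@") != 1 or " " in s:
        return False
    local, domain = s.split("@")
    return bool(local) and "." in domain

def audit_email_validator(fn):
    """Run the exercise's edge cases; return the inputs the validator gets wrong."""
    cases = {
        "": False,                 # empty string
        "user@example.com": True,  # the happy path
        "a@@b.com": False,         # multiple @
        "a b@c.com": False,        # spaces
    }
    return [inp for inp, expected in cases.items() if fn(inp) != expected]

failures = audit_email_validator(is_valid_email)
```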

Try This Now

Get an AI to generate any function. Apply the evaluation pipeline: (1) Read the output. (2) Check against what you asked. (3) Identify one correct thing. (4) Identify one wrong/missing thing. (5) Verdict: accept, revise, or reject. Write your evaluation in 3-5 sentences.

Verification: Your evaluation includes a specific correct element AND a specific issue.

If this doesn't work: If you can't find any issue, ask for something harder—a function with 5+ edge cases.

You just evaluated AI-generated code and found a real issue. That's not nitpicking—it's the skill that separates someone who uses AI from someone who supervises it.

Interactive Exercises

Find the AI's Bugs

This function was generated by an AI. It looks correct at first glance but has subtle bugs. Find and fix them so all tests pass.

Look at the indentation of `return result` — is it inside the loop?

The function returns after processing only the first item.

Fix: unindent `return result` so it's outside the for loop.

Spot the Over-Engineering

An AI was asked: "Write a function that checks if a string is a palindrome." It produced 25 lines. Write a simpler version called is_palindrome_simple(s) that does the same thing correctly in 3 lines or fewer (not counting the def line). Then write evaluate() returning a dict with "problem" (string describing what's wrong with the AI version) and "lines_saved" (integer).

A palindrome reads the same forwards and backwards. Python can reverse a string with s[::-1].

The entire function can be: clean the string, lowercase it, compare to its reverse.

The AI imported re, functools, used recursion with memoization, and added type checking—none of which were needed for a simple palindrome check.
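
One possible simple version, consistent with the hints above (the `lines_saved` figure assumes the 25-line AI version described in the prompt):

```python
def is_palindrome_simple(s):
    """True if s reads the same both ways, ignoring case and punctuation."""
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]

def evaluate():
    return {
        "problem": "Regex, memoized recursion, and type checks where a "
                   "single reversed-string comparison suffices",
        "lines_saved": 22,  # 25 AI lines vs. 3 here (hypothetical count)
    }
```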

Find the Missing Edge Cases

An AI wrote parse_config(text) to parse KEY=VALUE config files. It works for valid input but crashes on 3 common edge cases. Write test_edge_cases() that calls parse_config with 3 different bad inputs and catches the failures. Then fix the function so all edge cases return sensible results instead of crashing.

Try parse_config("")—it tries to split "" on = and gets [""], not ["key", "value"].

Skip empty lines and lines starting with # before splitting on =.

Use line.split("=", 1) to split on only the first =, so values containing = aren't broken.
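
Putting all three hints together, a hardened version might look like this (the original AI function isn't shown here, so this is a reconstruction of the fixed behavior):

```python
def parse_config(text):
    """Parse KEY=VALUE config text into a dict, tolerating common bad input."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Edge cases: empty input, blank lines, comments, lines with no '='
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the FIRST '=' only
        config[key.strip()] = value.strip()
    return config
```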

Knowledge Check

What should you do FIRST when you receive AI-generated code?

Knowledge Check

What is a 'hallucination' in the context of AI-generated code?

Chapter 39 Levels 1 & 2 — Tab Complete and the Agent IDE

Why This Matters Now

This chapter is where you start using AI coding tools hands-on. Levels 1 and 2 are your entry points—from simple autocomplete to full conversational coding. Every higher level builds on these interactions.

Level 1: Tab Complete

Concept: Inline code suggestions as you type. Press Tab to accept.

Current Tools (March 2026)

GitHub Copilot (VS Code extension), Codeium (free), Supermaven (fastest).

Level 2: Agent IDE

Concept: Chat connected to your codebase. Describe in natural language, AI produces multi-file edits.

Current Tools (March 2026)

Cursor (cursor.com), Windsurf (windsurf.com), VS Code + Copilot Chat, Claude Code (terminal: npm install -g @anthropic-ai/claude-code, requires Node.js v18+).

Plan Mode

Describe → AI plans → you review → approve → execute → review diff → test. Boris Cherny (Claude Code creator) starts 80% of tasks in plan mode. With each model generation, one-shot success climbs. Plan mode is the right entry point.

L1 Tab → L2 Agent IDE → L3 Context → L4 Compound → L5 MCP → L6 Harness → L7 Background → L8 Teams (you are here: L1–L2)
Each level unlocks the next. You can't skip ahead—Levels 3-5 are prerequisites for 6-8.

TaskForge Connection

You'll use Claude Code on TaskForge for the first time in this chapter's exercise.

Micro-Exercises

1: Install Claude Code
npm install -g @anthropic-ai/claude-code
claude --version

Expected output:

$ npm install -g @anthropic-ai/claude-code

added 1 package in 8s

Verify the installation:

$ claude --version
claude v1.x.x

If npm is not found, return to Phase 1 Chapter 3 and install Node.js. If you get a permissions error, try sudo npm install -g @anthropic-ai/claude-code (macOS/Linux) or run your terminal as Administrator (Windows).

2: First Conversation

Navigate to TaskForge and run claude. Ask: "Explain the architecture of this project in one paragraph." Is it accurate?

"The most important property of a program is whether it accomplishes the intention of its user."
—C.A.R. Hoare, Turing Award lecture, 1980
Try This Now

In your TaskForge project, ask Claude Code: "Add a delete_task(task_id) function that removes a task by ID. Plan first, then implement after I approve." Review the plan. Approve. Check git diff. Run tests.

Verification: The diff shows a new function. Existing tests pass. The new function handles a non-existent ID gracefully.

If this doesn't work: (1) claude: command not found → check node --version. (2) Auth error → follow the auth flow. (3) Unwanted changes → git checkout . reverts.

You just used an AI agent to plan and implement a feature, then verified the result with git diff and tests. That's the Level 2 workflow—and you already have the judgment to evaluate the output.

Interactive Exercises

Knowledge Check

What is the key difference between Level 1 (Tab Complete) and Level 2 (Agent IDE)?

Knowledge Check

Why should you review `git diff` after an agent makes changes?

First Agent Interaction

Chapter 40 Level 3 — Context Engineering

Why This Matters Now

The difference between getting mediocre AI output and exceptional output is almost never the model—it's the context. What files does the model see? What instructions guide it? What constraints prevent bad output? Context engineering is the highest-leverage skill in AI-assisted development.

Context engineering is controlling what the model sees so every token does useful work. "Every token needs to fight for its place in the prompt."

The Context Window

Think of the context window as the model's working memory budget. Everything the model can consider at once—your prompt, the system instructions, file contents, conversation history—must fit in this budget. When it fills up, older information gets pushed out. Managing this budget is the core skill of context engineering.
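
A back-of-envelope sketch of budget thinking. The 4-characters-per-token ratio and the 200k window are rough assumptions for illustration, not real tokenizer or model figures:

```python
def rough_tokens(text):
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_budget(sections, window=200_000, response_reserve=0.25):
    """True if the context sections fit while leaving room for the response."""
    used = sum(rough_tokens(s) for s in sections)
    return used <= window * (1 - response_reserve)
```

The point isn't precision — it's the habit of asking, before pasting a file into the conversation, whether it earns its share of the budget.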

The Four Surfaces

1
CLAUDE.md

Persistent instructions read at every session start. Under 200 lines, specific, actionable.

2
Tool Descriptions

Model reads these to decide which tools to call. Clear descriptions = better tool use.

3
Conversation Management

/compact at ~50% context. Subagents for exploration. /clear for task switches.

4
Selective Context

Include relevant files only, not the entire codebase. Less noise = better output.

CLAUDE.md Example for TaskForge

# TaskForge
Command-line task manager with JSON persistence and Flask API.

## Commands
- `python3 -m pytest` — run tests
- `python3 -m flask run` — start API server
- `python3 src/taskforge/main.py` — run CLI

## Architecture
- src/taskforge/ — core logic (main.py, models.py, api.py)
- tests/ — pytest test files
- data/ — JSON storage (gitignored)

## Code Style
- Type hints on all function signatures
- Docstrings on all public functions
- No bare try/except — always catch specific exceptions

## Gotchas
- Task IDs are auto-incrementing integers, NOT UUIDs
- JSON file is the single source of truth
- Flask API mirrors CLI functionality exactly

Before/After: Context Quality

The same model with different context produces dramatically different results:

Vague Prompt
"Add search to TaskForge"
Constrained Prompt
"Add search_tasks(keyword: str) -> list[dict] to TaskForge that matches against task descriptions, case-insensitive. Return empty list for no matches. Add 3 tests: match found, no match, empty keyword."

Why it's better: Types specified. Behavior defined. Edge cases listed. The model has less room to guess wrong.
Noisy Context
Paste the entire 2,000-line codebase into the prompt, plus unrelated README text, plus old changelogs.
Selective Context
Include only models.py (where the function goes) and test_models.py (where the tests go). Let the model discover other files on demand.

Why it's better: Less noise means the model spends its context window on relevant code, not irrelevant text.
Weak CLAUDE.md
"This is a Python project. Please write good code. Follow best practices."
Actionable CLAUDE.md
"Tests use pytest. Type hints on all signatures. No bare except. Task IDs are integers, not UUIDs. JSON file is single source of truth."

Why it's better: Every line is a specific constraint the model can follow. No guessing required.
System/CLAUDE.md + critical info → Conversation History → Tool Schemas → Active Files → Response Space (leave 20–30%)
The context window has a budget. Place critical info at the START and END. Leave 20-30% for the response.

Building Instruction Scaffolds

A CLAUDE.md for a small project is straightforward. But what if your goal is bigger—like telling Claude Code how to set up an entire environment from scratch? This is where you build an instruction scaffold: a document so precise that a non-deterministic model produces the same correct output nearly every time.

Remember the core problem from Chapter 38: LLMs are non-deterministic. Ask Claude to "set up Docker for this project" five times and you'll get five different Dockerfiles, different compose configs, different auth approaches. Some will work. Some won't. The variation is the enemy.

The solution is a scaffold that eliminates ambiguity at every decision point:

1
State the goal in one sentence

Not "set up Docker" but "One container. When you exec into it, you get Claude Code pointed at your repo. CMD is sleep infinity. That's it."

2
Specify exact versions and commands

Not "install Claude Code" but curl -fsSL https://claude.ai/install.sh | bash. Not "use Debian" but FROM debian:bookworm-slim. Every unspecified choice is a coin flip the model will get wrong half the time.

3
Explicitly forbid wrong approaches

"NEVER use npm install—it causes auth/PATH corruption." "Do NOT add CLAUDE_MODEL to the environment block—it is not a real env var." Negative constraints are as important as positive ones, because the model's training data contains every wrong approach too.

4
Provide the exact file content

Don't describe what the Dockerfile should contain—write the Dockerfile. The model should copy, not improvise. Every line it generates from scratch is a line that could vary between runs.

5
Add a validation checklist

A list of yes/no checks that verify the output is correct. "Is the base image debian:bookworm-slim?" "Is CMD sleep infinity?" "Does .env still contain all pre-existing lines?" These are your deterministic guardrails around non-deterministic output.

6
Protect existing files

"NEVER overwrite .env—append only." "Do NOT touch the project's existing Dockerfile." Destruction is irreversible. The scaffold must make it harder to break things than to do the right thing.
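
Step 5's checklist can itself be a deterministic script rather than a manual pass. A sketch that hard-codes the example constraints from this section (base image, CMD, forbidden npm install, .env preservation):

```python
def validate_scaffold(dockerfile, env_before, env_after):
    """Run yes/no checks over generated files; return a list of failures."""
    failures = []
    if "FROM debian:bookworm-slim" not in dockerfile:
        failures.append("Base image is not debian:bookworm-slim")
    if "sleep infinity" not in dockerfile:
        failures.append("CMD is not sleep infinity")
    if "npm install" in dockerfile:
        failures.append("Forbidden approach used: npm install")
    kept = set(env_after.splitlines())
    lost = [line for line in env_before.splitlines() if line and line not in kept]
    if lost:
        failures.append(f".env lost pre-existing lines: {lost}")
    return failures
```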

Why This Level of Precision Matters

A vague instruction like "set up Docker for Claude Code" gives the model thousands of possible interpretations. A precise scaffold with exact file contents, forbidden approaches, and validation checklists narrows it to essentially one. You're not writing instructions for a human colleague who can ask clarifying questions—you're writing constraints for a system that will confidently do the wrong thing if you leave room for interpretation. The tighter the guardrails, the more deterministic the output.

In the later chapter on Harness Engineering (Level 6), you'll see a complete scaffold that applies every one of these principles to set up a Docker environment for Claude Code. It's over 300 lines—and every line exists because leaving it out caused a failure in practice.

Micro-Exercises

1: Audit Your CLAUDE.md

Run /init in TaskForge. Read the generated CLAUDE.md. Count lines Claude could figure out from the code. Delete those.

2: Add a Gotcha

Add one "Gotcha" about a mistake Claude made. Start a new session. See if Claude avoids it.

See the Difference: With and Without Context

This exercise makes the impact of context engineering visceral. You'll run the exact same prompt twice and compare the results.

  1. Without context: Rename or delete your CLAUDE.md (or create an empty one). Start a fresh Claude Code session in TaskForge. Run this prompt: Add input validation to the add_task function in TaskForge. Save the output somewhere.
  2. With context: Restore your CLAUDE.md (the one from the example above, or your own). Start a new session with /clear. Run the exact same prompt: Add input validation to the add_task function in TaskForge.
  3. Compare: Look at both outputs side by side. Check: Did the second version follow your code style? Did it use the right test framework? Did it respect project-specific constraints (e.g., integer IDs, not UUIDs)?

Verification: The "with CLAUDE.md" output matches your project conventions in at least 3 specific ways the "without" version did not.

If this doesn't work: If both outputs look the same, your CLAUDE.md may be too generic. Add more project-specific constraints and retry.

"A change in perspective is worth 80 IQ points."
—Alan Kay
Try This Now

Write a CLAUDE.md for TaskForge (under 50 lines). Start a Claude Code session. Ask: "Add a search_tasks(keyword) function." Did Claude follow your code style rules?

Verification: Generated code has type hints and docstrings (per your rules).

If this doesn't work: (1) Claude ignores rules → your CLAUDE.md is too long or vague. Shorten. (2) Wrong location → CLAUDE.md must be in the project root.

You just shaped AI output by controlling context—not by writing better code, but by writing better instructions. That's context engineering, and it's the highest-leverage skill in AI-assisted development.

Interactive Exercises

CLAUDE.md Linter

Write a function `lint_claude_md(text)` that checks a CLAUDE.md file for quality. Return a list of issue strings. Flag: files over 50 lines, lines containing 'be careful' or 'write good code' (vague), and missing a test command (no line containing 'pytest' or 'test').

Count lines with `text.split('\n')` — flag if len > 50.

Check for vague phrases with `if 'be careful' in text.lower()`.

Check for test command: `if 'pytest' not in text.lower() and 'test' not in text.lower()`.
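
A solution sketch that follows the hints directly:

```python
def lint_claude_md(text):
    """Check a CLAUDE.md file for quality; return a list of issue strings."""
    issues = []
    lines = text.split("\n")
    if len(lines) > 50:
        issues.append(f"Too long: {len(lines)} lines (keep it under 50)")
    lower = text.lower()
    for vague in ("be careful", "write good code"):
        if vague in lower:
            issues.append(f"Vague instruction: '{vague}'")
    if "pytest" not in lower and "test" not in lower:
        issues.append("No test command listed")
    return issues
```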

Knowledge Check

Which CLAUDE.md instruction is better?

Context Setup

Chapter 41 Level 4 — Compounding Engineering

Why This Matters Now

Without compounding, you solve the same problems in every session. Claude makes the same mistakes, you correct the same issues, and progress doesn't stick. Compounding engineering turns every fix into a permanent improvement—the difference between running in place and building momentum.

What This Chapter Covers

This chapter covers two key topics: compounding engineering (making AI more effective over time by codifying lessons into persistent rules and context) and the spec-driven development workflow (the professional approach to AI-assisted software development, where you brainstorm, write a spec, create a prompt plan, then execute step by step).

Compounding engineering improves every session after the current one. The loop: Plan → Delegate → Assess → Codify. The Codify step is what makes it compound.

Plan → Delegate → Assess → Codify
The codify step is what separates compounding engineers from everyone else. Without it, you solve the same problems repeatedly.

Where to Codify

Where to codify recurring agent mistakes
  • Re-introduces removed dependency → CLAUDE.md: "Do NOT use [X]."
  • Ignores naming convention → .claude/rules/naming.md
  • Wrong test style → CLAUDE.md: "Tests use pytest, not unittest."
  • Complex module needs special handling → subdirectory CLAUDE.md
  • Recurring workflow → .claude/skills/[name]/SKILL.md

Anti-Pattern: Over-Codification

LLMs follow ~150-200 instructions reliably. Beyond that, compliance degrades. Keep CLAUDE.md as a concise table of contents. Detailed docs go elsewhere.

TaskForge Connection

You'll create a docs/architecture.md for TaskForge and link to it from CLAUDE.md. This is compounding: future Claude sessions will find the architecture doc and produce better code.

Micro-Exercises

1: Write a Rule

Think of the last time Claude did something you didn't want. Write a one-line CLAUDE.md rule to prevent it.

2: Create Architecture Docs

Create docs/architecture.md for TaskForge: system overview, data flow, key decisions. Point to it from CLAUDE.md.

"Civilization advances by extending the number of important operations which we can perform without thinking about them."
—Alfred North Whitehead, An Introduction to Mathematics
Try This Now

Ask Claude to implement a feature on TaskForge. Note one thing it does wrong. Write the correction as a CLAUDE.md rule or .claude/rules/*.md file. Start a NEW session and ask for a similar feature. Does Claude avoid the mistake?

Verification: The second session's output doesn't repeat the first session's mistake.

If this doesn't work: (1) Same mistake → rule is too vague. Be specific. (2) Rule not loaded → verify CLAUDE.md is in project root.

You just made a permanent improvement to every future AI session. One rule, written once, prevents the same mistake forever. That's compounding—the difference between running in place and building momentum.

Interactive Exercises

Knowledge Check

What makes engineering 'compound' across sessions?

Compounding Loop

The Spec-Driven Workflow

The single most effective workflow for AI-assisted development. Instead of jumping straight to code, you go: brainstorm → spec → plan → execute.

Idea → Brainstorm → spec.md → prompt_plan.md → Execute step by step
The spec-driven workflow: never start with code. Start with understanding, write it down, break it into steps, then execute.

Step 1: Brainstorm

Start with a conversation, not code. Ask the AI to ask you questions:

I want to build [feature]. Before we write any code, ask me clarifying questions
one at a time to understand: what problem this solves, who uses it, what the core
features are, what it should NOT do, and what edge cases matter.

The AI asks clarifying questions. You answer. Output: a shared understanding of the problem.

Step 2: Write a Spec (spec.md)

Turn the brainstorm into a written specification:

  • What the system does (requirements)
  • What it does NOT do (explicit scope boundaries)
  • Data model (what gets stored)
  • API or interface design (inputs, outputs, errors)
  • Edge cases and error handling
  • Acceptance criteria (how you know it's done)

Step 3: Generate a Plan (prompt_plan.md)

Break the spec into ordered implementation steps. Each step is a prompt you'll give the AI, small enough to review in under 5 minutes.

Step 4: Execute Step by Step

Give each prompt from the plan one at a time. Review the output. Run tests. Fix issues before moving on. Never skip review. Never batch multiple steps.

Why this works: The spec is your source of truth, not the AI's memory. The plan creates checkpoints. If something goes wrong, you know exactly where. And it compounds — your spec quality improves with every project.

Further Reading: The Spec-Driven Approach

This workflow is adapted from practitioners who build production software with AI daily.

Exercise: Analyze a Bad Spec

Given a badly-written spec, identify the problems. Return a dict with the missing sections.

Check if keywords like "requirements", "data model", "edge case", "acceptance" appear in the text (case-insensitive).

Look for vague phrases like "should work well" or "make it nice" — these are problems because they're not testable.

Check for each section keyword. Add problems like "No explicit requirements list", "No data model defined", "Contains vague requirement: 'should work well'".
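
A solution sketch matching the hints — the function name `analyze_spec`, the exact section keywords, and the problem strings are illustrative choices, not a fixed rubric:

```python
def analyze_spec(spec_text):
    """Flag missing sections and untestable language in a spec."""
    lower = spec_text.lower()
    problems = []
    required = [
        ("requirements", "No explicit requirements list"),
        ("data model", "No data model defined"),
        ("edge case", "No edge cases listed"),
        ("acceptance", "No acceptance criteria"),
    ]
    for keyword, problem in required:
        if keyword not in lower:
            problems.append(problem)
    for vague in ("should work well", "make it nice"):
        if vague in lower:
            problems.append(f"Contains vague requirement: '{vague}'")
    return {"problems": problems}
```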

Chapter 42 Level 5 — MCP, Skills, and Capabilities

Why This Matters Now

Levels 1-4 use the tools as they come out of the box. Level 5 is where you start extending them. MCPs connect Claude Code to databases, browsers, and external services. Skills give it reusable workflows. Subagents give it specialized roles. You're moving from user to architect.

MCP (Model Context Protocol)

Concept: Standardized connectors letting Claude Code interact with external tools and services.

Current Setup (March 2026)
// .mcp.json (project root)
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": { "DATABASE_URL": "postgresql://..." }
    }
  }
}

Custom Skills

Concept: Reusable instruction sets loaded on demand. A skill defines a workflow, tool restrictions, and model choice.

# .claude/skills/pr-review/SKILL.md
---
name: pr-review
description: Reviews PRs for bugs, security, performance, style.
tools: Read, Glob, Grep, Bash
model: opus
---
You are a senior code reviewer. Check:
## Security — hardcoded secrets, injection, XSS
## Performance — N+1 queries, missing indexes
## Output — BLOCKING, SUGGESTION, or APPROVED

Custom Subagents

Concept: Specialized roles with restricted tools. Each subagent only has access to the tools it needs.

# .claude/agents/test-writer.md
---
name: test-writer
description: Writes tests. Can read source and write test files only.
tools: Read, Glob, Write
model: sonnet
---
Write comprehensive tests. Cover: happy path, edge cases, error cases.

PR Review Skill → Security / Performance / Style → Integration
A single skill can spawn specialized subagents, each checking a different dimension. Results converge into a unified review.

Hooks

Automated scripts at lifecycle points:

// .claude/settings.json
{ "hooks": { "preCommit": [{ "command": "npm run lint && npm test" }] } }

TaskForge Connection

You'll create a skill for TaskForge that adds a new feature with tests, and a subagent that reviews the result.

Micro-Exercises

1: Create a Skill

Create .claude/skills/explain/SKILL.md with a prompt to explain any file. Test with /explain.

2: Create a Subagent

Create .claude/agents/security-check.md with tools restricted to Read and Grep.

"The best way to predict the future is to invent it."
—Alan Kay
Try This Now

Create a skill for TaskForge that adds a new feature with tests. Create a subagent that reviews the result. Run both.

Verification: Skill loads without error. Subagent loads. Both produce output.

If this doesn't work: (1) Skill not recognized → file must be .claude/skills/[name]/SKILL.md. (2) Subagent not loading → restart session. (3) Frontmatter error → check YAML indentation.

You just extended an AI agent with custom capabilities. The agent can now do things it couldn't do out of the box—because you gave it skills. You've moved from user to architect.

Interactive Exercises

Knowledge Check

What is an MCP server?

Knowledge Check

What is the difference between a skill and a subagent?

Custom Tooling

Phase 9 Gate Checkpoint & TaskForge AI Integration

Minimum Competency

Write a CLAUDE.md under 100 lines. Create a skill. Create a subagent. Use Claude Code to implement a feature with test verification. Articulate why a model output is wrong.

Your Artifact

TaskForge with: CLAUDE.md, one custom skill in .claude/skills/, one subagent in .claude/agents/, git log showing an AI-implemented feature with tests.

Verification

Skill and subagent load. CLAUDE.md has no vague instructions. Git log shows test-verified AI feature.

Failure Signal

If your CLAUDE.md is 300 lines of vague instructions, or you can't explain why Claude's output was wrong → return to Chapters 38-40.

TaskForge Checkpoint

TaskForge now has AI-readable configuration, custom tooling, and at least one AI-implemented feature. Ready for multi-agent work.

What You Can Now Do

  • Evaluate AI-generated code: read diffs, catch hallucinations, identify edge case failures
  • Use Claude Code for feature development with Plan Mode
  • Write effective CLAUDE.md files and manage context deliberately
  • Codify lessons so each session improves the next
  • Create custom skills, subagents, and MCP connections
Bridge to Phase 10

So far, you've worked with one AI agent at a time. But some tasks are too large or too varied for a single session. What if you could dispatch multiple agents—one writing code, one reviewing it, one updating docs—working in parallel? Phase 10 introduces multi-agent patterns. The catch: more agents means more coordination overhead. Without the harness skills from this phase, more agents just means more mess. That's why Levels 1-5 come first.