Phase 5 — Data Structures

Data Structures & Complexity

How data lives in memory, how to measure performance, and which structure to pick for which problem. Every engineering decision starts here.

Chapters 19–23 · Phase Gate + TaskForge
Before You Begin Phase 5

This phase assumes you can: write Python functions with tests (Ch 05, 09), use lists, dicts, and sets for everyday tasks (Ch 06), and build a Flask API with a database (Ch 14-15). If any of these feel shaky, revisit the relevant chapter first.

Chapter 19 Complexity Analysis — Measuring What Matters

Why This Matters Now

You need to predict how code performs before running it. When AI generates a nested loop over 10 million records, can you tell it will take hours? When a colleague writes a function that scans a list inside another loop, can you spot that it scales terribly? Big-O notation gives you that vocabulary. It is the language engineers use to reason about performance, and it is the foundation every data structure and algorithm discussion builds on.

What Is Big-O Notation?

Big-O notation describes how the number of operations a function performs grows as the input size n grows. It answers one question: "If I double the input, how much longer does this take?"

We count operations, not seconds, because seconds depend on your CPU, your operating system, what else is running, and the phase of the moon. Operations are deterministic. If a function does n comparisons for an input of size n, it will always do 2n comparisons for an input of size 2n—regardless of hardware.

Big-O has two simplification rules:

  • Drop constants. O(3n) becomes O(n). O(100) becomes O(1). Constants don't change the growth shape.
  • Drop lower-order terms. O(n² + n) becomes O(n²). When n is large, the n² term dominates so completely that the n term is irrelevant.

A concrete comparison makes this tangible. Suppose you have a list of tasks and you want to check whether a specific title exists:

# O(n) — scan the list one element at a time
def find_in_list(tasks, title):
    for task in tasks:
        if task["title"] == title:
            return True
    return False

# O(1) — look up in a dict (hash table) directly
task_index = {"Buy groceries": 1, "Write tests": 2, "Deploy API": 3}
def find_in_dict(task_index, title):
    return title in task_index

With 10 tasks, both finish instantly. With 10 million tasks, the list scan might take seconds while the dict lookup still finishes in microseconds. Big-O told you this would happen before you ran a single benchmark.

Common Complexity Classes

There are six complexity classes you will encounter constantly. Each one has a distinct growth shape and a real-world example you already know.

Common Big-O complexity classes and their growth behavior
Class       Name           Example                                   Doubling input → time impact
O(1)        Constant       Dict lookup, array index access           No change
O(log n)    Logarithmic    Binary search in a sorted list            +1 step
O(n)        Linear         Scanning a list, summing elements         2x time
O(n log n)  Linearithmic   Efficient sorting (merge sort, Timsort)   ~2x time
O(n²)       Quadratic      Nested loops comparing every pair         4x time
O(2ⁿ)       Exponential    Generating all subsets of a set           Squared time

Here is a Python example for each class so you can see the pattern in code:

# O(1) — Constant: dictionary lookup
def get_task(task_dict, task_id):
    return task_dict.get(task_id)  # one operation regardless of dict size

# O(log n) — Logarithmic: binary search
def binary_search(sorted_list, target):
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # halves the search space each step

# O(n) — Linear: scan every element
def find_max(numbers):
    best = numbers[0]
    for num in numbers:
        if num > best:
            best = num
    return best  # touches each element exactly once

# O(n log n) — Linearithmic: Python's built-in sort
def sort_tasks(tasks):
    return sorted(tasks, key=lambda t: t["priority"])
    # Timsort: divides and conquers, n log n comparisons

# O(n^2) — Quadratic: check all pairs for duplicates
def has_duplicate_naive(items):
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False  # compares every pair: n*(n-1)/2 operations

# O(2^n) — Exponential: generate all subsets
def all_subsets(items):
    if not items:
        return [[]]
    first = items[0]
    rest_subsets = all_subsets(items[1:])
    return rest_subsets + [[first] + s for s in rest_subsets]
    # each element doubles the number of subsets

Analyzing Your Own Code

You don't need to memorize rules. You need to recognize two patterns:

Rule 1: Sequential blocks → add (take the max). If your function first scans a list O(n), then sorts it O(n log n), the total is O(n) + O(n log n) = O(n log n). You keep the dominant term.

def process_tasks(tasks):
    # Step 1: filter — O(n)
    active = [t for t in tasks if t["status"] == "active"]
    # Step 2: sort — O(n log n)
    active.sort(key=lambda t: t["priority"])
    # Step 3: return first 10 — O(1)
    return active[:10]
    # Total: O(n) + O(n log n) + O(1) = O(n log n)

Rule 2: Nested loops → multiply. A loop inside a loop means the inner loop runs once for every iteration of the outer loop. If both iterate over n items, you get O(n) × O(n) = O(n²).

def find_common_tasks(list_a, list_b):
    common = []
    for a in list_a:          # O(n)
        for b in list_b:      #   O(m) for each a
            if a["title"] == b["title"]:
                common.append(a)
    return common
    # Total: O(n * m). If both lists are size n, this is O(n^2)

Let's walk through a more subtle example:

def group_by_status(tasks):
    groups = {}
    for task in tasks:           # O(n)
        status = task["status"]
        if status not in groups:
            groups[status] = []  # O(1) dict operations
        groups[status].append(task)  # O(1) amortized
    return groups
    # Total: O(n) — one pass, all dict operations are O(1)

And a tricky one where the complexity is hidden:

def remove_duplicates_naive(items):
    unique = []
    for item in items:           # O(n)
        if item not in unique:   # O(k) where k = len(unique)
            unique.append(item)
    return unique
    # Worst case: all unique, so k grows to n
    # Total: O(n^2) — the "in" check on a list is O(n)!

def remove_duplicates_fast(items):
    seen = set()
    unique = []
    for item in items:           # O(n)
        if item not in seen:     # O(1) set lookup
            seen.add(item)       # O(1)
            unique.append(item)
    return unique
    # Total: O(n) — set lookup is O(1)

The difference between these two functions is the difference between O(n²) and O(n). For 10,000 items, that is 100,000,000 vs 10,000 operations. The second function isn't just "a little faster"—it is 10,000 times faster.

Space Complexity

Time complexity measures operations. Space complexity measures how much additional memory your algorithm uses beyond the input itself.

# O(1) space — modifies in place
def reverse_in_place(items):
    left, right = 0, len(items) - 1
    while left < right:
        items[left], items[right] = items[right], items[left]
        left += 1
        right -= 1
    # Uses only two integer variables — constant extra space

# O(n) space — creates a new list
def reverse_copy(items):
    return items[::-1]
    # Creates an entirely new list of the same size

Space complexity matters when your data is large. Reversing a 10-element list? Space doesn't matter. Reversing a 10-million-element list? Creating a copy doubles your memory usage. In-place algorithms trade readability for memory efficiency.

Amortized Analysis

Python's list.append() is advertised as O(1), but occasionally the underlying array runs out of space and must be resized—copying every element to a new, larger array. That single resize is O(n). How can append be O(1)?

Amortized analysis averages the cost over a sequence of operations. When a dynamic array doubles its capacity (say from 8 to 16 slots), the next 8 appends are "free"—no resize needed. The expensive O(n) copy gets spread across the cheap O(1) appends that follow it. On average, each append costs O(1).

Think of it like paying rent. You pay a large amount once a month, but amortized over 30 days, it's a fixed daily cost. One expensive day doesn't make your average expense high.

# Demonstrating dynamic array growth
import sys

items = []
prev_size = 0
for i in range(65):
    items.append(i)
    curr_size = sys.getsizeof(items)
    if curr_size != prev_size:
        print(f"Length {len(items):3d} -> Memory {curr_size:5d} bytes")
        prev_size = curr_size
# You'll see memory jumps at irregular intervals as the array resizes

When Big-O Lies

Big-O tells you about growth rate, not actual speed. For small inputs, constant factors dominate. A few important exceptions to keep in mind:

  • Small n: For fewer than ~50 elements, an O(n²) insertion sort often beats an O(n log n) merge sort because insertion sort has much lower overhead (fewer function calls, better cache usage). Python's Timsort actually uses insertion sort for small runs inside its merge sort.
  • Cache locality: An array stores elements contiguously in memory. When the CPU loads one element, it loads nearby elements into cache for free. A linked list scatters nodes across memory, causing cache misses—each access potentially waits for main memory. An O(n) array scan can be 10–100x faster than an O(n) linked list scan because of cache effects.
  • Constant factors: An O(n) algorithm with a constant factor of 1000 is slower than an O(n log n) algorithm with a constant factor of 1 for any n below about 10,000. Big-O hides these constants by design.

The lesson: Big-O is a thinking tool, not the final answer. For large-scale decisions (O(n) vs O(n²) for millions of records), trust the theory. For micro-optimization (choosing between two O(n) algorithms), measure with a profiler.

Engineering Angle

In the real world, profiling with timeit and cProfile often contradicts theoretical analysis. A dict comprehension and a for-loop have the same Big-O, but the comprehension can be 20% faster due to fewer bytecode instructions. Know when to trust theory (choosing between O(n) and O(n²) for large data) and when to measure (choosing between two O(n) approaches with different constants). Senior engineers do both: Big-O for design decisions, profiling for optimization.
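As a quick sketch of what that measurement looks like, the following compares a for-loop and a dict comprehension with timeit. Both are O(n); the timings (which vary by machine) expose only the constant-factor difference that Big-O hides:

```python
import timeit

# Both builds are O(n); only the constant factors differ.
setup = "data = list(range(10_000))"

loop_version = """
result = {}
for x in data:
    result[x] = x * 2
"""

comp_version = "result = {x: x * 2 for x in data}"

# Run each statement 200 times and report total seconds (machine-dependent)
print("for-loop:      ", timeit.timeit(loop_version, setup=setup, number=200))
print("comprehension: ", timeit.timeit(comp_version, setup=setup, number=200))
```

This is the measurement half of the workflow: theory picked the O(n) approach, timeit picks between its implementations.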

TaskForge Connection

Analyze TaskForge's current operations. Adding a task to the end of a list is O(1) amortized. Searching for a task by title requires scanning every task: O(n). Sorting tasks by priority uses Python's Timsort: O(n log n). As your task list grows from 10 to 10,000 to 10 million, search becomes the bottleneck. In Chapter 21, you will build a hash-based index that makes search O(1)—and Big-O is how you know it is worth doing.

[Figure: Complexity Growth Rates — operations versus input size n for O(1), O(log n), O(n), O(n log n), O(n²), and O(2ⁿ)]
How different complexity classes grow as input size increases. O(1) and O(log n) remain manageable at any scale. O(n²) and O(2ⁿ) become unusable quickly. The right algorithm choice can mean the difference between milliseconds and hours.

Micro-Exercises

1: Classify These Functions

Determine the Big-O time complexity of each function:

# Function A
def func_a(items):
    return items[0] if items else None

# Function B
def func_b(items):
    total = 0
    for x in items:
        total += x
    return total

# Function C
def func_c(items):
    for x in items:
        for y in items:
            print(x, y)

# Function D
def func_d(items):
    items.sort()
    return items[0]

Answers: A is O(1)—single index access. B is O(n)—one pass. C is O(n²)—nested loops. D is O(n log n)—the sort dominates.

2: Spot the Hidden Quadratic

This function looks linear but is actually O(n²). Can you see why?

def build_string(n):
    result = ""
    for i in range(n):
        result += str(i)  # string concatenation creates a new string each time
    return result

Answer: Strings are immutable in Python. Each += creates a new string and copies all previous characters. The total copies are 1 + 2 + 3 + ... + n = O(n²). Fix: use "".join(str(i) for i in range(n)) for O(n).

Big-O is not about math. It is about predicting the future—how your code will behave when the data gets 10x, 100x, or 1000x larger than your test cases.

Interactive Exercises

Design Challenge: "Predict and Measure"

Three functions solve the same problem: finding if a list contains duplicates. Predict their Big-O, then time them to verify. Fill in the predict dictionary and the measure() function.

The nested function has two loops over n elements. What do nested loops multiply to?

The sort function calls sorted() which is O(n log n), then scans once which is O(n). Which dominates?

Building a set from n elements is O(n). Comparing two lengths is O(1). Total: O(n).

Knowledge Check

What is the time complexity of searching for a value in an unsorted list of n elements?

Chapter 20 Linear Structures — Arrays, Linked Lists, Stacks & Queues

Why This Matters Now

Every program stores sequences of data. Understanding how they are stored—contiguous vs scattered, indexed vs linked—determines whether your code runs in milliseconds or minutes. This chapter covers the four fundamental linear structures and teaches you when to use each one.

Arrays in Memory

An array stores elements in contiguous (side-by-side) memory locations. If element 0 starts at memory address 1000 and each element takes 8 bytes, then element 5 lives at address 1000 + (5 × 8) = 1040. This address arithmetic is why array index access is O(1)—the CPU can jump directly to any element without scanning.

Python's list is a dynamic array. Under the hood, it stores an array of pointers (references) to Python objects. The array itself is contiguous, even though the objects it points to may be scattered in memory.

Contiguous storage gives arrays an enormous advantage: cache locality. Modern CPUs don't fetch one byte at a time—they load entire cache lines (typically 64 bytes) from memory. When you access element 0 of an array, the CPU loads elements 0 through 7 (or more) into cache for free. Iterating through an array is fast because the next element is almost always already in cache.
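You can observe the address arithmetic directly with the standard-library array module, which stores raw numbers contiguously the way a C array does (a small sketch; the actual addresses printed will differ on your machine):

```python
import array

# 'q' = signed 8-byte integers, stored contiguously in one memory block
nums = array.array('q', range(10))
base_addr, count = nums.buffer_info()  # (start address, number of elements)

# Element i lives at base + i * itemsize — pure arithmetic, no scanning, O(1)
for i in (0, 5, 9):
    print(f"element {i} at address {base_addr + i * nums.itemsize}")
print(f"itemsize = {nums.itemsize} bytes")
```

A Python list adds one indirection (the contiguous block holds pointers to objects), but the index-to-address calculation is the same.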

Dynamic Arrays

Static arrays have a fixed size set at creation. Dynamic arrays (like Python's list) grow automatically. The strategy is geometric growth: when the array is full, allocate a new array that is roughly double the size, copy everything over, and free the old array.

This makes append() O(1) amortized (as we analyzed in Ch 19). But operations at the beginning of the array are expensive:

Time complexity of Python list operations
Operation                     Time            Why
list[i] (access by index)     O(1)            Address arithmetic
list.append(x)                O(1) amortized  Adds to the end; occasional resize
list.insert(0, x)             O(n)            Every element shifts right by one
list.pop() (last element)     O(1)            Removes from the end
list.pop(0) (first element)   O(n)            Every element shifts left by one
x in list                     O(n)            Must scan the entire list

The critical insight: inserting or removing from the front of a Python list is O(n), not O(1). Every element after the insertion point must shift. This is one of the most common performance traps in Python code.

Linked Lists

A linked list stores elements as individual node objects, each containing a value and a reference (pointer) to the next node. Nodes can be scattered anywhere in memory—they are connected by pointers, not by physical position.

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedList:
    def __init__(self):
        self.head = None
        self.size = 0

    def prepend(self, value):
        """Add to the front — O(1)."""
        self.head = Node(value, self.head)
        self.size += 1

    def get(self, index):
        """Access by index — O(n). Must walk the chain."""
        current = self.head
        for _ in range(index):
            if current is None:
                raise IndexError("Index out of range")
            current = current.next
        return current.value

    def __len__(self):
        return self.size

    def __iter__(self):
        current = self.head
        while current:
            yield current.value
            current = current.next

In a singly linked list, each node points to the next. In a doubly linked list, each node also points to the previous, enabling traversal in both directions and O(1) removal from either end (if you have a reference to the node).
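A minimal sketch makes the O(1) removal concrete. Given a reference to a node in a doubly linked list, you can splice it out by rewiring its two neighbors (the class and helper names here are illustrative, not from the chapter's LinkedList):

```python
class DNode:
    """Node in a doubly linked list: knows both of its neighbors."""
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def link(a, b):
    """Connect two nodes so that a comes immediately before b."""
    a.next, b.prev = b, a

def unlink(node):
    """Remove a node from its list in O(1) — no traversal needed."""
    if node.prev:
        node.prev.next = node.next
    if node.next:
        node.next.prev = node.prev
    node.prev = node.next = None
```

With only next pointers (a singly linked list), removing a node would require an O(n) walk from the head to find its predecessor first.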

Operation complexity: arrays vs. linked lists
Operation           Array (Python list)   Linked List
Access by index     O(1)                  O(n)
Insert at front     O(n)                  O(1)
Insert at end       O(1) amortized        O(1) with tail pointer
Delete from front   O(n)                  O(1)
Search              O(n)                  O(n)
Memory per element  8 bytes (pointer)     ~56 bytes (object + pointer)

When Linked Lists Win (And When They Don't)

Textbooks love linked lists for their O(1) insert/delete at the head. Reality is more nuanced. Python's list (a dynamic array) beats linked lists in almost every practical scenario because:

  • Cache misses destroy performance. Array elements are contiguous; the CPU prefetches them. Linked list nodes are scattered; every node access is a potential cache miss.
  • Memory overhead. Each linked list node is a Python object with header, value, and pointer(s). A list of 1 million integers uses roughly 8 MB as an array vs 56 MB as a linked list.
  • Python overhead. Creating millions of Node objects puts pressure on Python's garbage collector.

Linked lists shine in two specific scenarios: (1) when you need O(1) insert/delete at both ends without index access (use collections.deque, which is a doubly-linked list of fixed-size blocks), and (2) when you need to splice sublists or reorder nodes without copying data (rare in Python, common in systems programming).

Stacks (LIFO)

A stack is a Last-In, First-Out (LIFO) structure. You push items on top and pop them off the top. Think of a stack of plates: you add to the top and take from the top.

# Python list is a perfect stack
stack = []
stack.append("open file")    # push
stack.append("edit line 5")  # push
stack.append("delete word")  # push

last_action = stack.pop()    # pop -> "delete word"
print(last_action)           # most recent action

Use cases for stacks:

  • Undo systems—push each action, pop to undo. This is exactly what you'll build for TaskForge.
  • The call stack—when Python calls a function, it pushes a frame onto the call stack. When the function returns, it pops the frame. This is why infinite recursion causes a "stack overflow."
  • Parentheses matching—push open brackets, pop when you see a closing bracket. If the stack is empty when you pop, or non-empty at the end, the brackets are unbalanced.
  • Expression evaluation—calculators use stacks to evaluate 3 + 4 * 2 with correct precedence.

def is_balanced(expression):
    """Check if parentheses are balanced — classic stack problem."""
    stack = []
    pairs = {"(": ")", "[": "]", "{": "}"}
    for char in expression:
        if char in pairs:
            stack.append(char)
        elif char in pairs.values():
            if not stack:
                return False
            if pairs[stack.pop()] != char:
                return False
    return len(stack) == 0

print(is_balanced("({[]})"))   # True
print(is_balanced("({[})"))    # False
print(is_balanced("((()"))     # False

Queues and Deques (FIFO)

A queue is a First-In, First-Out (FIFO) structure. Items enter at the back and leave from the front, like a line at a ticket counter.

You might be tempted to use a Python list as a queue, but there is a critical performance problem:

# BAD: Using list as a queue — pop(0) is O(n)!
queue = [1, 2, 3, 4, 5]
queue.append(6)       # enqueue at back — O(1), fine
front = queue.pop(0)  # dequeue from front — O(n), every element shifts!

# GOOD: Use collections.deque — popleft() is O(1)
from collections import deque
queue = deque([1, 2, 3, 4, 5])
queue.append(6)          # enqueue at back — O(1)
front = queue.popleft()  # dequeue from front — O(1)!

collections.deque (double-ended queue) supports O(1) operations at both ends. It is implemented as a doubly-linked list of fixed-size blocks, combining the cache-friendly behavior of arrays with the flexibility of linked lists at the ends.

Use cases for queues:

  • Task scheduling—process tasks in the order they arrive. Print queues, message queues, job queues.
  • Breadth-first search—explore neighbors layer by layer (you'll use this with graphs in Ch 23).
  • Rate limiting—track timestamps of recent requests in a sliding window.
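The rate-limiting idea above can be sketched with a deque holding recent timestamps (the class name and API here are hypothetical, not a standard library):

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `window` seconds."""
    def __init__(self, max_calls, window):
        self.max_calls = max_calls
        self.window = window
        self.timestamps = deque()  # timestamps of allowed calls, oldest first

    def allow(self, now):
        # Evict timestamps that have aged out of the window — O(1) per eviction
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_calls:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(max_calls=2, window=10)
print(limiter.allow(0))   # True
print(limiter.allow(1))   # True
print(limiter.allow(2))   # False — two calls already in the window
print(limiter.allow(11))  # True — the calls at t=0 and t=1 have expired
```

Old timestamps leave from the front and new ones enter at the back, which is exactly the access pattern deque makes O(1) at both ends.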
[Figure: Memory Layout — Array vs Linked List. Array elements "A"–"E" sit contiguously, so one 64-byte CPU cache line loads them all at once; linked list nodes are scattered, so a cache line covers only one node.]
Arrays store elements side by side in memory, so a single CPU cache line loads multiple elements at once. Linked list nodes are scattered, causing a cache miss on nearly every access. This is why arrays are faster in practice even when linked lists have better theoretical complexity for certain operations.

Engineering Angle

Python's collections.deque is the right choice when you need a queue. Using list.pop(0) is O(n)—every element shifts. deque.popleft() is O(1). This is the kind of performance trap that AI-generated code often falls into. When you see queue = [] followed by queue.pop(0) in AI output, you know to replace it with deque. One data structure swap, no algorithm change, and performance goes from O(n²) to O(n) for a series of dequeue operations.

TaskForge Connection

Implement an undo stack for TaskForge. Every operation—add task, delete task, update task—pushes a record onto the stack that stores enough information to reverse it. Calling undo() pops the last record and reverses the operation. This is the same pattern used by text editors, drawing tools, and database transactions.

Micro-Exercises

1: Stack-Based Bracket Checker

Test the is_balanced() function from the stacks section with these inputs. Predict the output before running:

print(is_balanced(""))         # ?
print(is_balanced("()"))       # ?
print(is_balanced("([{}])"))   # ?
print(is_balanced("(]"))       # ?
print(is_balanced("((())"))    # ?

Answers: True, True, True, False, False. An empty string has balanced brackets (nothing to mismatch).

2: Deque vs List Performance

Time this comparison and observe the dramatic difference:

import time
from collections import deque

n = 100000

# List: pop from front
data = list(range(n))
start = time.time()
while data:
    data.pop(0)
list_time = time.time() - start

# Deque: pop from front
data = deque(range(n))
start = time.time()
while data:
    data.popleft()
deque_time = time.time() - start

print(f"list.pop(0):    {list_time:.4f}s")
print(f"deque.popleft():{deque_time:.4f}s")
print(f"Deque is {list_time/deque_time:.0f}x faster")

The right data structure can turn an O(n²) disaster into an O(n) solution. The wrong one hides performance bugs that only surface when your data grows.

Interactive Exercises

Design Challenge: "TaskForge Undo Stack"

Build an UndoStack class that records operations and reverses them. Each operation is a dict with "action" (add/delete), "task" (task dict). The stack manages a task list. add_task(task) adds and records. undo() reverses the last operation. get_tasks() returns current tasks.

In add_task, append the task to self.tasks, then push {"action": "add", "task": task} onto self.history.

In undo, pop from self.history. If the action was "add", remove that task from self.tasks. If "delete", re-add it.

In delete_task, find and remove the task, then push {"action": "delete", "task": removed_task} onto history.

Knowledge Check

You need a data structure where you frequently add to the back and remove from the front. Which is the best choice?

Chapter 21 Hash Tables, Sets & Maps

Why This Matters Now

The dict is Python's most powerful data structure. Understanding how it works internally helps you know when it is O(1), when it degrades, and why certain things can't be dict keys. Every time you write if x in my_dict and it returns instantly, a hash table is doing the work. Every time you get a TypeError: unhashable type: 'list', it is because hash tables require immutable keys. This chapter explains why.

How Hashing Works

A hash function converts a key (any immutable value) into an integer called a hash code. That integer, taken modulo the table size, determines which bucket the key-value pair lives in.

# Python's built-in hash function
print(hash("hello"))     # e.g., 2534023372954268
print(hash("hello") % 8) # e.g., 4 — bucket 4 in an 8-bucket table
print(hash(42))          # 42 — integers hash to themselves
print(hash((1, 2, 3)))   # tuples are hashable because they're immutable

The process for looking up my_dict["hello"]:

  1. Compute hash("hello") → some large integer
  2. Compute hash_code % table_size → bucket index (e.g., 4)
  3. Go directly to bucket 4 and find the value

No scanning, no comparisons with other keys. That is why dict lookup is O(1).

Collision Resolution

Two different keys can hash to the same bucket. This is called a collision. If hash("hello") % 8 == 4 and hash("world") % 8 == 4, both keys map to bucket 4. Hash tables must handle this.

There are two main strategies:

  • Chaining: Each bucket holds a linked list (or small array) of entries. When a collision occurs, the new entry is appended to the chain. Lookup scans the chain—O(k) where k is the chain length. If the hash function distributes evenly, chains stay short.
  • Open addressing: When a collision occurs, probe the next empty slot using a deterministic sequence (linear probing, quadratic probing, or double hashing). CPython uses a variant of open addressing with a pseudo-random probe sequence.

Both strategies have trade-offs: chaining wastes memory on node pointers, open addressing suffers from clustering (occupied slots clump together, making probe sequences longer). In practice, both work well when the table is not too full.
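Chaining is simple enough to sketch in a few lines. This toy class is purely illustrative (it is not how CPython's dict works — CPython uses open addressing — and it omits resizing):

```python
class ChainedHashTable:
    """Minimal hash table using chaining: each bucket is a list of (key, value) pairs."""
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash code modulo table size picks the bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (or collision): extend the chain

    def get(self, key):
        for k, v in self._bucket(key):   # walk the chain — O(k) for chain length k
            if k == key:
                return v
        raise KeyError(key)
```

With a good hash function and a low load factor, chains stay short, so the O(k) walk is effectively O(1).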

Load Factor and Rehashing

The load factor is the ratio of entries to buckets: n / table_size. As the load factor increases, collisions become more frequent and performance degrades toward O(n).

CPython keeps the load factor below roughly 2/3 (0.67). When inserting a new entry would exceed this threshold, the table rehashes: allocates a new, larger table (typically double the size) and reinserts every existing entry. This is an O(n) operation, but it happens infrequently—amortized O(1) per insert, the same pattern as dynamic array resizing.

# You can observe rehashing indirectly via memory usage
import sys

d = {}
prev = sys.getsizeof(d)
for i in range(100):
    d[i] = i
    curr = sys.getsizeof(d)
    if curr != prev:
        print(f"At {i} entries: dict size jumped from {prev} to {curr} bytes")
        prev = curr
# You'll see jumps at approximately 5, 10, 21, 42, 85 entries

Python Dict Internals

Since CPython 3.6, dicts use a compact layout with two separate arrays:

  • A sparse hash table (mostly empty) that stores indices into the dense array
  • A dense entries array that stores keys and values in insertion order

This is why Python dicts preserve insertion order since 3.7—the dense array records entries in the order they were added. It also explains why dicts use more memory than you might expect: the sparse hash table has empty slots to keep the load factor low.

Key facts about Python dicts:

  • Average lookup, insert, and delete: O(1)
  • Worst-case (all keys collide): O(n)—but this essentially never happens with Python's hash function
  • Iteration is O(n) and follows insertion order
  • Memory usage is roughly 50–70 bytes per entry (much more than a list)

Sets

A set is a hash table without values—it stores only keys. This gives you O(1) membership testing, which is the single most common reason to use a set over a list.

# O(n) — membership test on a list
def has_overdue_list(task_ids, overdue_ids_list):
    for tid in task_ids:
        if tid in overdue_ids_list:  # O(n) each time
            return True
    return False
# Total: O(n * m) where n = task_ids, m = overdue list

# O(n) — membership test on a set
def has_overdue_set(task_ids, overdue_ids_set):
    for tid in task_ids:
        if tid in overdue_ids_set:  # O(1) each time
            return True
    return False
# Total: O(n) where n = task_ids

Sets also support efficient mathematical operations:

team_a = {"Alice", "Bob", "Carol"}
team_b = {"Bob", "Dave", "Carol"}

print(team_a & team_b)   # Intersection: {"Bob", "Carol"}
print(team_a | team_b)   # Union: {"Alice", "Bob", "Carol", "Dave"}
print(team_a - team_b)   # Difference: {"Alice"}
print(team_a ^ team_b)   # Symmetric diff: {"Alice", "Dave"}

When to use a set vs a list: If you need to check "is X in my collection?" more than occasionally, convert to a set. For 100 elements, the difference is negligible. For 1 million elements, x in my_list takes a millisecond while x in my_set takes a microsecond—a 1000x difference.
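You can verify that claim yourself with timeit. This sketch times membership tests against the worst case for a list (the target is the last element); the exact numbers depend on your machine:

```python
import timeit

n = 100_000
# Worst case for the list: the target sits at the very end
setup = f"data = list(range({n})); data_set = set(data); target = {n} - 1"

list_time = timeit.timeit("target in data", setup=setup, number=100)
set_time = timeit.timeit("target in data_set", setup=setup, number=100)

print(f"list membership: {list_time:.5f}s")
print(f"set membership:  {set_time:.6f}s")
```

The one-time O(n) cost of building the set pays for itself after a handful of lookups.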

When Hashing Breaks

Hash tables have failure modes you need to know about:

  • Mutable keys are forbidden. If you could modify a list after using it as a dict key, its hash would change and the dict could no longer find it. Python prevents this: TypeError: unhashable type: 'list'. Convert to a tuple first.
  • Hash collision attacks. An attacker who knows your hash function can craft inputs that all map to the same bucket, degrading O(1) to O(n). Python mitigates this with hash randomization (different seed per process).
  • Pathological distributions. If many keys hash to the same bucket despite a good hash function, performance degrades. This is extremely rare with Python's hash, but worth knowing about for debugging.

# This fails: lists are mutable, so unhashable
try:
    d = {[1, 2, 3]: "value"}
except TypeError as e:
    print(e)  # unhashable type: 'list'

# This works: tuples are immutable, so hashable
d = {(1, 2, 3): "value"}
print(d[(1, 2, 3)])  # "value"
[Figure: Hash Table with Chaining — hash(key) % 8 maps 5 entries ("age", "name", "city", "email", "role") into 8 buckets; bucket 3 holds two colliding keys chained together; load factor = 0.625, below the 0.67 threshold.]
A hash table with 8 buckets holding 5 key-value pairs. Bucket 3 has a collision: two keys hashed to the same index, and chaining links them into a short chain. Lookup checks the bucket, then walks the chain if needed. With a low load factor, chains stay short and lookups remain O(1).

Engineering Angle

Choosing between dict, set, and list for lookups is a daily engineering decision. if x in my_list is O(n). if x in my_set is O(1). For 100 elements it doesn't matter. For 1 million, it is the difference between 1 second and 1 microsecond. When you see AI-generated code that stores unique IDs in a list and checks membership in a loop, you know to swap it for a set. This is the kind of optimization that scales.

TaskForge Connection

Build a tag-based index for tasks. A dict maps tag strings to lists of task IDs: {"urgent": [1, 3, 7], "backend": [2, 3]}. When you need all tasks tagged "urgent," the dict gives you the answer in O(1) instead of scanning every task. This is the same pattern databases use for indexes—you are building one by hand.

Micro-Exercises

1: Set vs List Membership

Convert this O(n²) function to O(n) using a set:

# O(n^2) — checking membership in a list inside a loop
def find_common(list_a, list_b):
    common = []
    for item in list_a:
        if item in list_b:  # O(n) for each check
            common.append(item)
    return common

# Your O(n) version:
def find_common_fast(list_a, list_b):
    set_b = set(list_b)       # O(m) one-time conversion
    common = []
    for item in list_a:
        if item in set_b:     # O(1) for each check
            common.append(item)
    return common
    # Total: O(n + m) instead of O(n * m)
The dict is not just a data structure—it is a design pattern. Anytime you need fast lookup by key, you need a hash table. Once you see this pattern, you see it everywhere.

Interactive Exercises

Design Challenge: "TaskForge Tag Index"

Build a TagIndex class that maintains a dict mapping tags to sets of task IDs. Implement add_task(task_id, tags), remove_task(task_id, tags), and search(tag) that returns a set of task IDs.

In add_task, loop through tags. For each tag, if it's not in self.index, create an empty set. Then add the task_id to the set.

Use self.index.setdefault(tag, set()).add(task_id) for a concise one-liner per tag.

In search, return self.index.get(tag, set()) to handle missing tags gracefully.
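Putting those hints together, one possible solution sketch:

```python
class TagIndex:
    """One way to solve the challenge, following the hints above."""

    def __init__(self):
        self.index = {}  # tag -> set of task IDs

    def add_task(self, task_id, tags):
        for tag in tags:
            # Create the set on first use, then add the ID.
            self.index.setdefault(tag, set()).add(task_id)

    def remove_task(self, task_id, tags):
        for tag in tags:
            if tag in self.index:
                self.index[tag].discard(task_id)
                if not self.index[tag]:  # drop empty buckets
                    del self.index[tag]

    def search(self, tag):
        # .get with a default handles missing tags gracefully.
        return self.index.get(tag, set())

idx = TagIndex()
idx.add_task(1, ["urgent", "backend"])
idx.add_task(2, ["urgent"])
print(idx.search("urgent"))   # {1, 2}
idx.remove_task(1, ["urgent", "backend"])
print(idx.search("backend"))  # set()
```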

Knowledge Check

Why can't you use a list as a dictionary key in Python?

Chapter 22 Trees

Why This Matters Now

Trees are everywhere you don't see them. Your file system is a tree. Database indexes are trees. The HTML DOM is a tree. JSON is a tree. When you call os.walk() to traverse directories, you are traversing a tree. Understanding trees means understanding how all these systems perform and why database queries are fast even on tables with millions of rows.

Binary Tree Basics

A binary tree is a hierarchy where each node has at most two children, conventionally called left and right. The topmost node is the root. Nodes with no children are leaves. The depth of a node is its distance from the root. The height of the tree is the depth of its deepest leaf.

class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

# Build a small tree:
#        8
#       / \
#      3   10
#     / \    \
#    1   6    14
root = TreeNode(8,
    TreeNode(3, TreeNode(1), TreeNode(6)),
    TreeNode(10, None, TreeNode(14))
)

Tree Traversals

There are four standard ways to visit every node in a tree. Each traversal order serves a different purpose:

def inorder(node):
    """Left -> Root -> Right. For BSTs, visits nodes in sorted order."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

def preorder(node):
    """Root -> Left -> Right. Copies the tree structure."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)

def postorder(node):
    """Left -> Right -> Root. Safe for deletion (children before parent)."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]

def levelorder(node):
    """Layer by layer (BFS). Uses a queue."""
    from collections import deque
    if node is None:
        return []
    result, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        result.append(current.value)
        if current.left:
            queue.append(current.left)
        if current.right:
            queue.append(current.right)
    return result

# Using our tree from above:
print("Inorder:   ", inorder(root))    # [1, 3, 6, 8, 10, 14] — sorted!
print("Preorder:  ", preorder(root))   # [8, 3, 1, 6, 10, 14]
print("Postorder: ", postorder(root))  # [1, 6, 3, 14, 10, 8]
print("Levelorder:", levelorder(root)) # [8, 3, 10, 1, 6, 14]

Binary Search Trees

A Binary Search Tree (BST) enforces an ordering rule: for every node, all values in the left subtree are smaller and all values in the right subtree are larger. This rule makes search, insert, and delete all O(h) where h is the height of the tree.

def bst_search(node, target):
    """Search a BST — O(h) where h = height."""
    if node is None:
        return False
    if target == node.value:
        return True
    elif target < node.value:
        return bst_search(node.left, target)   # go left
    else:
        return bst_search(node.right, target)  # go right

def bst_insert(node, value):
    """Insert into a BST — O(h)."""
    if node is None:
        return TreeNode(value)
    if value < node.value:
        node.left = bst_insert(node.left, value)
    elif value > node.value:
        node.right = bst_insert(node.right, value)
    # if value == node.value, duplicate — skip or handle as you wish
    return node

# Build a BST from scratch
tree = None
for val in [8, 3, 10, 1, 6, 14]:
    tree = bst_insert(tree, val)

print(bst_search(tree, 6))   # True
print(bst_search(tree, 7))   # False
print(inorder(tree))          # [1, 3, 6, 8, 10, 14] — always sorted

Why Balance Matters

The performance of a BST depends entirely on its height. A balanced BST with n nodes has height log n, giving O(log n) operations. But if you insert sorted data into a BST, every node goes to the right, creating a degenerate tree that is effectively a linked list with height n.

# Balanced insertion: [8, 3, 10, 1, 6, 14]
# Creates a bushy tree with 3 levels:
#        8
#       / \
#      3   10
#     / \    \
#    1   6    14
# Search: O(log n), at most 3 comparisons since log2(6) ≈ 2.6

# Sorted insertion: [1, 3, 6, 8, 10, 14]
# Creates a degenerate tree with 6 levels:
#    1
#     \
#      3
#       \
#        6
#         \
#          8
#           \
#            10
#              \
#               14
# Search: O(n), up to 6 comparisons — it's a linked list!
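You can verify these shapes by measuring the height each insertion order produces. A self-contained sketch (it repeats the TreeNode and bst_insert definitions from earlier so it runs on its own):

```python
class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def bst_insert(node, value):
    if node is None:
        return TreeNode(value)
    if value < node.value:
        node.left = bst_insert(node.left, value)
    elif value > node.value:
        node.right = bst_insert(node.right, value)
    return node

def height(node):
    """Number of levels from this node down (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

balanced = None
for v in [8, 3, 10, 1, 6, 14]:       # mixed order
    balanced = bst_insert(balanced, v)

degenerate = None
for v in [1, 3, 6, 8, 10, 14]:       # sorted order
    degenerate = bst_insert(degenerate, v)

print(height(balanced))    # 3 — O(log n) search
print(height(degenerate))  # 6 — O(n) search
```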

Self-balancing trees (AVL trees, red-black trees) automatically restructure after inserts and deletes to guarantee O(log n) height. You will almost never implement these yourself—they exist inside libraries and databases. But you need to know why they exist: to prevent the O(n) degenerate case. C++'s std::map and Java's TreeMap use red-black trees. Database indexes use B-trees (below).

Heaps and Priority Queues

A min-heap is a complete binary tree where every parent is less than or equal to its children. The minimum value is always at the root. Heaps are stored as arrays using index arithmetic:

  • Children of node i: indices 2i+1 and 2i+2
  • Parent of node i: index (i-1)//2
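A quick sanity check of that index arithmetic, verifying the parent-child ordering on a heapified list:

```python
import heapq

heap = [9, 4, 7, 1, 8, 3]
heapq.heapify(heap)  # rearranges in place; heap[0] is now the minimum

# Heap property: every parent is <= both of its children.
n = len(heap)
for i in range(n):
    left, right = 2 * i + 1, 2 * i + 2
    if left < n:
        assert heap[i] <= heap[left]
    if right < n:
        assert heap[i] <= heap[right]
    # And each child's parent index recovers i: (left - 1) // 2 == i
    if left < n:
        assert (left - 1) // 2 == i

print(heap[0])  # 1 — the minimum sits at index 0
```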

Python's heapq module provides a min-heap backed by a regular list:

import heapq

# Build a heap from a list
tasks = [(3, "Low priority task"),
         (1, "URGENT task"),
         (2, "Medium task")]
heapq.heapify(tasks)  # O(n) — rearranges in place

# Pop the smallest (highest priority)
print(heapq.heappop(tasks))  # (1, "URGENT task")
print(heapq.heappop(tasks))  # (2, "Medium task")
print(heapq.heappop(tasks))  # (3, "Low priority task")

# Push a new item
heapq.heappush(tasks, (0, "Critical fix"))

# Get top-K without sorting the whole list
data = [42, 17, 93, 5, 68, 31, 84, 2]
print(heapq.nsmallest(3, data))  # [2, 5, 17] — O(n log k)
print(heapq.nlargest(3, data))   # [93, 84, 68]
Heap operation time complexities:

  • heapify(list): O(n)
  • heappush(heap, item): O(log n)
  • heappop(heap): O(log n)
  • nsmallest(k, iterable): O(n log k)
  • Access the minimum: O(1) (it is heap[0])

The key insight: if you only need the minimum (or maximum) element repeatedly, a heap is better than sorting. Sorting gives you all elements in order (O(n log n)). A heap gives you just the min in O(1) and lets you remove it in O(log n). For "give me the top 10 most important tasks out of 10,000," a heap is the right tool.
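One common application of this idea is keeping a running top-K as items stream past, using a min-heap capped at size k. A sketch (heapq.heappushpop does the push and the pop in a single O(log k) step):

```python
import heapq

def top_k(stream, k):
    """Keep the k largest items seen so far using a size-k min-heap."""
    heap = []
    for item in stream:
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            # Push the new item and pop the current smallest in one step.
            # If item is smaller than heap[0], it comes right back out.
            heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)

print(top_k([42, 17, 93, 5, 68, 31, 84, 2], 3))  # [93, 84, 68]
```

This is O(n log k) total, and it works on streams too large to hold in memory, because only k items are ever stored.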

Tries

A trie (pronounced "try") is a tree where each node represents a character. A path from the root to a node spells a prefix. Tries power autocomplete features: to find all words starting with "pro", walk down the p-r-o path and return everything below.

The critical property: lookup time is O(m) where m is the key length, regardless of how many keys are stored. A trie with 1 million words finds "programming" in exactly 11 steps.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True

t = Trie()
for word in ["task", "tag", "test", "team"]:
    t.insert(word)
print(t.search("task"))       # True
print(t.search("tas"))        # False (prefix, not complete word)
print(t.starts_with("ta"))    # True
print(t.starts_with("te"))    # True
print(t.starts_with("tx"))    # False

B-Trees (Conceptual)

Databases don't use BSTs. They use B-trees—wide, shallow trees designed for disk-based storage.

The problem with BSTs on disk: every comparison visits one node, and reading a node from disk takes ~10 milliseconds (mechanical disk) or ~0.1 ms (SSD). A balanced BST with 1 million nodes has height 20, requiring 20 disk reads. At 0.1 ms each, that is 2 ms per query—tolerable but not great.

A B-tree solves this by making nodes wide: each node holds hundreds of keys and has hundreds of children. A B-tree with branching factor 100 and 3 levels can hold 100³ = 1 million keys with only 3 disk reads per search. That is why SELECT * FROM tasks WHERE id = 42 returns instantly on a table with millions of rows.
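The arithmetic behind those numbers can be checked with a short sketch (a simplified model that assumes every node is completely full):

```python
def levels_needed(num_keys, branching_factor):
    """Levels (= disk reads) needed for the tree to reach num_keys keys.

    Simplified model: assumes every node is completely full."""
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= branching_factor
        levels += 1
    return levels

print(levels_needed(1_000_000, 100))  # 3 — B-tree with branching factor 100
print(levels_needed(1_000_000, 2))    # 20 — a binary tree needs ~20 reads
```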

You don't implement B-trees. You use them through databases (SQLite, PostgreSQL). But knowing they exist explains why database indexes are fast and why CREATE INDEX speeds up queries—it builds a B-tree over the indexed column.

[Figure: BST shape depends on insertion order. Random order [4,2,6,1,3,5,7] produces a tree with 3 levels and O(log n) search; sorted order [1,2,3,4,5,6,7] produces a chain with 7 levels and O(n) search]
The same 7 numbers produce very different trees depending on insertion order. Random order gives a balanced tree with O(log n) search. Sorted order gives a degenerate tree—effectively a linked list—with O(n) search. Self-balancing trees prevent the degenerate case.
Engineering Angle

You almost never implement a tree yourself. But understanding them is essential: database indexes are B-trees, file systems are tree-structured, and heapq is the right tool for "give me the top 10 most important tasks." When AI generates code that sorts a list just to find the max, you know a single max() call is O(n). Better still, if you need the max repeatedly as items change, a heap gives O(log n) per update instead of O(n log n) per re-sort. Knowing the structure means knowing the right tool.

TaskForge Connection

Implement priority-based task scheduling using heapq. Tasks with lower priority numbers get processed first (priority 1 = most urgent). Support add_task(priority, name) and next_task() that returns the highest-priority task. This replaces the naive approach of sorting the entire task list every time you want the next task.

Micro-Exercises

1: Traversal Practice

Given this tree, predict the output of each traversal before running the code:

#       10
#      /  \
#     5    15
#    / \     \
#   2   7    20

root = TreeNode(10,
    TreeNode(5, TreeNode(2), TreeNode(7)),
    TreeNode(15, None, TreeNode(20))
)

print("Inorder:   ", inorder(root))    # ?
print("Preorder:  ", preorder(root))   # ?
print("Postorder: ", postorder(root))  # ?

Answers: Inorder: [2, 5, 7, 10, 15, 20]. Preorder: [10, 5, 2, 7, 15, 20]. Postorder: [2, 7, 5, 20, 15, 10].

2: Heap for Top-K

Use heapq.nlargest to find the 3 tasks with the highest priority (largest number):

import heapq
tasks = [(3, "Low"), (10, "Critical"), (1, "Trivial"),
         (7, "High"), (5, "Medium"), (9, "Urgent")]
top_3 = heapq.nlargest(3, tasks)
print(top_3)  # [(10, 'Critical'), (9, 'Urgent'), (7, 'High')]
Trees let you organize data hierarchically and search it efficiently. The difference between a balanced tree and a degenerate one is the difference between O(log n) and O(n)—the same difference that makes database indexes fast.

Interactive Exercises

Design Challenge: "TaskForge Priority Scheduler"

Build a TaskScheduler class using heapq. add_task(priority, name) adds a task. next_task() returns and removes the highest-priority (lowest number) task as a (priority, name) tuple, or None if empty. peek() returns the next task without removing it.

Use heapq.heappush(self.heap, (priority, self.counter, name)). The counter acts as a tiebreaker for equal priorities (FIFO order).

In next_task, use heapq.heappop(self.heap) which returns (priority, counter, name). Return (priority, name) to the caller.

In peek, access self.heap[0] without popping. Return (self.heap[0][0], self.heap[0][2]).
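Assembled from those hints, one possible solution sketch (using itertools.count for the tiebreaker counter):

```python
import heapq
import itertools

class TaskScheduler:
    """One way to solve the challenge, following the hints above."""

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tiebreaker: FIFO within a priority

    def add_task(self, priority, name):
        heapq.heappush(self.heap, (priority, next(self.counter), name))

    def next_task(self):
        if not self.heap:
            return None
        priority, _, name = heapq.heappop(self.heap)
        return (priority, name)

    def peek(self):
        if not self.heap:
            return None
        return (self.heap[0][0], self.heap[0][2])

s = TaskScheduler()
s.add_task(3, "Write docs")
s.add_task(1, "Fix prod bug")
s.add_task(2, "Review PR")
print(s.peek())       # (1, 'Fix prod bug') — not removed
print(s.next_task())  # (1, 'Fix prod bug')
print(s.next_task())  # (2, 'Review PR')
```

Without the counter, two tasks with equal priority would make heapq compare the name strings, breaking FIFO order.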

Knowledge Check

A BST has 1 million nodes and is perfectly balanced. How many comparisons does a search take at most?

Chapter 23 Graphs

Why This Matters Now

Not all relationships are hierarchical. Social networks, flight routes, task dependencies, web page links—these are all graphs. When you need to model "A depends on B" or "Alice knows Bob" or "there's a road from X to Y," a tree won't cut it because trees can't represent cycles, multiple paths, or peer-to-peer relationships. Graphs are the most general-purpose data structure for modeling connections.

Graph Vocabulary

A graph is a set of nodes (also called vertices) connected by edges. That is the entire definition. Everything else is a specialization:

  • Directed vs undirected: In a directed graph, edge A→B is one-way (A depends on B, but B does not depend on A). In an undirected graph, edges go both ways (Alice knows Bob means Bob knows Alice).
  • Weighted vs unweighted: Weighted edges carry a value (distance, cost, time). Unweighted edges represent only connection.
  • Path: A sequence of edges connecting two nodes. "Can I get from A to D?" is a question about whether a path exists.
  • Cycle: A path that starts and ends at the same node. A graph with no cycles is acyclic.
  • Connected component: A group of nodes where every node can reach every other node. A graph may have multiple disconnected components.
  • Degree: The number of edges connected to a node. In directed graphs, in-degree counts incoming edges and out-degree counts outgoing edges.
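Degree, for example, falls straight out of an adjacency-list representation (a sketch with a small made-up graph; the adjacency-list format itself is covered next):

```python
# Directed graph as a dict of neighbor lists (made up for this sketch).
adj = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C"],
}

def out_degree(adj, node):
    # Outgoing edges are just the node's neighbor list: O(1) to fetch.
    return len(adj[node])

def in_degree(adj, node):
    # Incoming edges require scanning every list: O(V + E).
    return sum(node in neighbors for neighbors in adj.values())

print(out_degree(adj, "A"))  # 2
print(in_degree(adj, "C"))   # 3 — edges from A, B, and D
```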

Representations

There are two standard ways to store a graph in code. Both represent the same information but have different performance characteristics.

Adjacency List

A dict mapping each node to a list (or set) of its neighbors. This is the most common representation and the one you should default to.

# Directed graph: A->B, A->C, B->C, C->D
adj_list = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
    "D": []
}

# Checking neighbors of A: O(1) dict lookup
print(adj_list["A"])  # ["B", "C"]

# Checking if edge A->B exists: O(degree of A)
print("B" in adj_list["A"])  # True

Adjacency Matrix

A 2D grid where matrix[i][j] = 1 (or a weight) if there is an edge from node i to node j, and 0 otherwise.

# Same graph: A=0, B=1, C=2, D=3
#       A  B  C  D
adj_matrix = [
    [0, 1, 1, 0],  # A -> B, A -> C
    [0, 0, 1, 0],  # B -> C
    [0, 0, 0, 1],  # C -> D
    [0, 0, 0, 0],  # D -> (none)
]

# Checking if edge A->B exists: O(1)
print(adj_matrix[0][1])  # 1 — edge exists
Adjacency list vs. adjacency matrix operation costs:

  • Space: O(V + E) for the list, O(V²) for the matrix
  • Check if an edge exists: O(degree) for the list, O(1) for the matrix
  • Find all neighbors: O(degree) for the list, O(V) for the matrix
  • Add an edge: O(1) for both
  • Best for: sparse graphs (few edges) for the list, dense graphs (many edges) for the matrix

Most real-world graphs are sparse: a social network with 1 billion users doesn't have 1 billion friends per person. Adjacency lists are the default. Use a matrix only when you need O(1) edge lookup and the graph is dense.

Building a Graph Class

class Graph:
    def __init__(self, directed=False):
        self.adj = {}
        self.directed = directed

    def add_node(self, node):
        if node not in self.adj:
            self.adj[node] = set()

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)
        if not self.directed:
            self.adj[v].add(u)

    def neighbors(self, node):
        return self.adj.get(node, set())

    def has_edge(self, u, v):
        return v in self.adj.get(u, set())

    def nodes(self):
        return list(self.adj.keys())

    def __repr__(self):
        lines = []
        for node, neighbors in self.adj.items():
            for n in neighbors:
                lines.append(f"  {node} -> {n}")
        return "Graph:\n" + "\n".join(lines)

# Example: social network (undirected)
social = Graph(directed=False)
social.add_edge("Alice", "Bob")
social.add_edge("Bob", "Carol")
social.add_edge("Alice", "Carol")
print(social.neighbors("Alice"))  # {"Bob", "Carol"}
print(social.has_edge("Bob", "Alice"))  # True (undirected)

# Example: task dependencies (directed)
deps = Graph(directed=True)
deps.add_edge("Write code", "Run tests")
deps.add_edge("Run tests", "Deploy")
deps.add_edge("Write code", "Code review")
print(deps.neighbors("Write code"))  # {"Run tests", "Code review"}
print(deps.has_edge("Deploy", "Write code"))  # False (directed)

Modeling Real Problems as Graphs

The power of graphs is that they model a huge variety of real-world problems with the same code:

Real-world problems modeled as graphs:

  • Task dependencies: nodes are tasks, edges mean "A must finish before B" (directed, unweighted)
  • Social network: nodes are people, edges are friendships (undirected, unweighted)
  • Flight routes: nodes are cities, edges are flights with a distance or cost (directed or undirected, weighted)
  • Web links: nodes are pages, edges are hyperlinks (directed, unweighted)
  • Package dependencies: nodes are packages, edges mean "A requires B" (directed, unweighted, a DAG)

Directed Acyclic Graphs (DAGs)

A DAG is a directed graph with no cycles. DAGs are critical for dependency modeling: if task A depends on B and B depends on C, that is fine. But if C also depends on A, you have a cycle—no valid execution order exists.

# Valid DAG (no cycles):
#   Write code -> Run tests -> Deploy
#   Write code -> Code review -> Deploy

# Invalid (cycle):
#   A -> B -> C -> A
# No task can go first because each depends on another

Detecting cycles is therefore a critical validation step: before scheduling tasks based on dependencies, verify that the dependency graph is a DAG. If it contains a cycle, the user has an impossible set of requirements.

Cycle Detection with DFS

You can detect cycles in a directed graph using depth-first search (DFS) with a recursion stack. The idea: as you explore deeper via DFS, keep track of all nodes currently on the exploration path. If you reach a node that is already on the current path, you have found a cycle.

def has_cycle(graph):
    """Detect cycles in a directed graph using DFS.

    Uses three states per node:
    - unvisited: not yet explored
    - in_stack: currently on the DFS path (ancestors of current node)
    - done: fully explored, no cycles through this node
    """
    visited = set()
    in_stack = set()

    def dfs(node):
        visited.add(node)
        in_stack.add(node)

        for neighbor in graph.neighbors(node):
            if neighbor not in visited:
                if dfs(neighbor):
                    return True
            elif neighbor in in_stack:
                return True  # found a cycle!

        in_stack.remove(node)
        return False

    for node in graph.nodes():
        if node not in visited:
            if dfs(node):
                return True
    return False

# Test: valid DAG
dag = Graph(directed=True)
dag.add_edge("A", "B")
dag.add_edge("B", "C")
dag.add_edge("A", "C")
print(has_cycle(dag))  # False

# Test: graph with cycle
cyclic = Graph(directed=True)
cyclic.add_edge("A", "B")
cyclic.add_edge("B", "C")
cyclic.add_edge("C", "A")  # creates cycle
print(has_cycle(cyclic))  # True
[Figure: two representations of the same directed graph (5 nodes A-E, 6 edges). Adjacency list: A: [B, E, C], B: [C], C: [D], D: [E], E: []. Adjacency matrix: 5x5 grid with a 1 for each edge and 0 elsewhere]
The same directed graph shown as an adjacency list (left) and adjacency matrix (right). The adjacency list uses space proportional to the number of edges. The matrix uses V² space regardless of how many edges exist. For this sparse graph (6 edges, 25 matrix cells), the list is more space-efficient.
Engineering Angle

Graphs show up everywhere in professional engineering. pip resolves package dependencies using a DAG and topological sort. Google Maps finds routes using weighted graphs and shortest path algorithms (Dijkstra's). LinkedIn's "people you may know" uses BFS on the social graph. Docker's layer system forms a DAG. Even Git's commit history is a DAG where each commit points to its parent(s). You will implement some of these algorithms in Phase 6.

TaskForge Connection

Model task dependencies as a DAG. If Task B depends on Task A, add a directed edge A → B (meaning A must be completed before B). Before scheduling, validate the DAG with has_cycle()—a cycle means the dependencies are impossible. In Phase 6, you will learn topological sort to find a valid execution order for all tasks.

Micro-Exercises

1: Build a Graph

Model this scenario as a graph and answer the questions:

"Task A depends on nothing. Task B depends on A. Task C depends on A. Task D depends on both B and C."

g = Graph(directed=True)
g.add_edge("A", "B")   # A must finish before B
g.add_edge("A", "C")   # A must finish before C
g.add_edge("B", "D")   # B must finish before D
g.add_edge("C", "D")   # C must finish before D
print(has_cycle(g))     # False — valid DAG
print(g.neighbors("A")) # {"B", "C"}
# Valid execution orders: A, B, C, D  or  A, C, B, D
2: Detect an Invalid Dependency

What happens when you add an edge from D back to A?

g.add_edge("D", "A")    # D depends on A, but A depends on B/C which depend on D!
print(has_cycle(g))      # True — impossible dependency cycle
Graphs are the most flexible data structure you will learn. Any problem that involves relationships, connections, or dependencies can be modeled as a graph and solved with graph algorithms.

Interactive Exercises

Design Challenge: "TaskForge Dependency Graph"

Build a TaskGraph class that manages tasks as nodes and dependencies as directed edges. Implement add_task(name), add_dependency(task, depends_on), and has_cycle() that returns True if the dependency graph contains a cycle.

Store the graph as self.adj[node] = set() where the set contains nodes that this node has edges to. For add_dependency("B", "A"), add an edge A -> B (A must finish before B).

For cycle detection, implement DFS with two sets: visited (all nodes seen) and in_stack (nodes on the current DFS path). If you reach a node in in_stack, you found a cycle.

Remember to iterate over all nodes in the outer loop, not just one starting node. The graph might have disconnected components.
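Assembled from those hints, one possible solution sketch:

```python
class TaskGraph:
    """One way to solve the challenge, following the hints above."""

    def __init__(self):
        self.adj = {}  # node -> set of nodes it has edges to

    def add_task(self, name):
        self.adj.setdefault(name, set())

    def add_dependency(self, task, depends_on):
        # "task depends on depends_on" becomes edge depends_on -> task.
        self.add_task(task)
        self.add_task(depends_on)
        self.adj[depends_on].add(task)

    def has_cycle(self):
        visited, in_stack = set(), set()

        def dfs(node):
            visited.add(node)
            in_stack.add(node)
            for neighbor in self.adj[node]:
                if neighbor not in visited:
                    if dfs(neighbor):
                        return True
                elif neighbor in in_stack:
                    return True  # neighbor is on the current path: cycle
            in_stack.remove(node)
            return False

        # Check every component, not just one starting node.
        return any(dfs(n) for n in self.adj if n not in visited)

g = TaskGraph()
g.add_dependency("B", "A")   # A must finish before B
g.add_dependency("C", "B")   # B must finish before C
print(g.has_cycle())         # False — a valid DAG
g.add_dependency("A", "C")   # C -> A closes the loop A -> B -> C -> A
print(g.has_cycle())         # True — impossible dependencies
```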

Knowledge Check

You are modeling flight routes between cities with distances. Which graph representation is most appropriate?

Phase 5 Gate Checkpoint & TaskForge Data Structures

Minimum Competency

Analyze the time and space complexity of any Python function. Choose the right data structure for a given problem. Understand how arrays, linked lists, hash tables, trees, and graphs work internally. Implement undo systems, tag indexes, priority queues, and dependency graphs.

Your Artifact

Extend TaskForge with a dependency system: model tasks as a DAG (Ch 23), schedule by priority using a heap (Ch 22), and support undo for all operations (Ch 20). Your system should detect circular dependencies, process tasks in priority order within dependency constraints, and let users undo their last 10 operations.

Verification

Adding a circular dependency raises an error or returns a warning. next_task() returns the highest-priority task whose dependencies are all complete. undo() reverses the last operation and restores previous state. The tag index (Ch 21) returns matching tasks in O(1) instead of scanning all tasks.

Failure Signal

If you cannot explain why list.pop(0) is O(n) and deque.popleft() is O(1), or why x in my_list is O(n) and x in my_set is O(1), return to Chapters 19–21. These performance distinctions are the foundation of every algorithm discussion in Phase 6.

What You Can Now Do

  • Analyze time and space complexity using Big-O notation
  • Choose between arrays, linked lists, stacks, queues, and deques
  • Use dicts and sets for O(1) lookups and build custom indexes
  • Use heaps for priority-based scheduling
  • Model relationships and dependencies as graphs
  • Detect cycles in directed graphs to validate dependencies
Bridge to Phase 6

You now know the structures. Phase 6 teaches you the algorithms that operate on them—searching, sorting, dynamic programming, and graph traversal. These are the tools that transform data structures from storage containers into problem-solving machines. You will implement binary search on sorted arrays, BFS and DFS on graphs, and dynamic programming on sequences. Every algorithm in Phase 6 builds directly on the data structures you learned here.