Data Structures & Complexity
How data lives in memory, how to measure performance, and which structure to pick for which problem. Every engineering decision starts here.
This phase assumes you can: write Python functions with tests (Ch 05, 09), use lists, dicts, and sets for everyday tasks (Ch 06), and build a Flask API with a database (Ch 14-15). If any of these feel shaky, revisit the relevant chapter first.
Chapter 19 Complexity Analysis — Measuring What Matters
You need to predict how code performs before running it. When AI generates a nested loop over 10 million records, can you tell it will take hours? When a colleague writes a function that scans a list inside another loop, can you spot that it scales terribly? Big-O notation gives you that vocabulary. It is the language engineers use to reason about performance, and it is the foundation every data structure and algorithm discussion builds on.
What Is Big-O Notation?
Big-O notation describes how the number of operations a function performs grows as the input size n grows. It answers one question: "If I double the input, how much longer does this take?"
We count operations, not seconds, because seconds depend on your CPU, your operating system, what else is running, and the phase of the moon. Operations are deterministic. If a function does n comparisons for an input of size n, it will always do 2n comparisons for an input of size 2n—regardless of hardware.
Big-O has two simplification rules:
- Drop constants. O(3n) becomes O(n). O(100) becomes O(1). Constants don't change the growth shape.
- Drop lower-order terms. O(n² + n) becomes O(n²). When n is large, the n² term dominates so completely that the n term is irrelevant.
A concrete comparison makes this tangible. Suppose you have a list of tasks and you want to check whether a specific title exists:
# O(n) — scan the list one element at a time
def find_in_list(tasks, title):
    for task in tasks:
        if task["title"] == title:
            return True
    return False
# O(1) — look up in a dict (hash table) directly
task_index = {"Buy groceries": 1, "Write tests": 2, "Deploy API": 3}
def find_in_dict(task_index, title):
    return title in task_index
With 10 tasks, both finish instantly. With 10 million tasks, the list scan might take seconds while the dict lookup still finishes in microseconds. Big-O told you this would happen before you ran a single benchmark.
Common Complexity Classes
There are six complexity classes you will encounter constantly. Each one has a distinct growth shape and a real-world example you already know.
| Class | Name | Example | Doubles Input → Time Impact |
|---|---|---|---|
| O(1) | Constant | Dict lookup, array index access | No change |
| O(log n) | Logarithmic | Binary search in a sorted list | +1 step |
| O(n) | Linear | Scanning a list, summing elements | 2x time |
| O(n log n) | Linearithmic | Efficient sorting (merge sort, Timsort) | ~2x time |
| O(n²) | Quadratic | Nested loops comparing every pair | 4x time |
| O(2ⁿ) | Exponential | Generating all subsets of a set | Squared time |
Here is a Python example for each class so you can see the pattern in code:
# O(1) — Constant: dictionary lookup
def get_task(task_dict, task_id):
    return task_dict.get(task_id)  # one operation regardless of dict size
# O(log n) — Logarithmic: binary search
def binary_search(sorted_list, target):
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1  # halves the search space each step
# O(n) — Linear: scan every element
def find_max(numbers):
    best = numbers[0]
    for num in numbers:
        if num > best:
            best = num
    return best  # touches each element exactly once
# O(n log n) — Linearithmic: Python's built-in sort
def sort_tasks(tasks):
    return sorted(tasks, key=lambda t: t["priority"])
# Timsort: divides and conquers, n log n comparisons
# O(n^2) — Quadratic: check all pairs for duplicates
def has_duplicate_naive(items):
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False  # compares every pair: n*(n-1)/2 operations
# O(2^n) — Exponential: generate all subsets
def all_subsets(items):
    if not items:
        return [[]]
    first = items[0]
    rest_subsets = all_subsets(items[1:])
    return rest_subsets + [[first] + s for s in rest_subsets]
# each element doubles the number of subsets
Analyzing Your Own Code
You don't need to memorize rules. You need to recognize two patterns:
Rule 1: Sequential blocks → add, then keep the dominant term. If your function first scans a list O(n), then sorts it O(n log n), the total is O(n) + O(n log n) = O(n log n). The lower-order term drops out, exactly as in the simplification rules above.
def process_tasks(tasks):
    # Step 1: filter — O(n)
    active = [t for t in tasks if t["status"] == "active"]
    # Step 2: sort — O(n log n)
    active.sort(key=lambda t: t["priority"])
    # Step 3: return first 10 — O(1)
    return active[:10]
# Total: O(n) + O(n log n) + O(1) = O(n log n)
Rule 2: Nested loops → multiply. A loop inside a loop means the inner loop runs once for every iteration of the outer loop. If both iterate over n items, you get O(n) × O(n) = O(n²).
def find_common_tasks(list_a, list_b):
    common = []
    for a in list_a:          # O(n)
        for b in list_b:      # O(m) for each a
            if a["title"] == b["title"]:
                common.append(a)
    return common
# Total: O(n * m). If both lists are size n, this is O(n^2)
Let's walk through a more subtle example:
def group_by_status(tasks):
    groups = {}
    for task in tasks:               # O(n)
        status = task["status"]
        if status not in groups:
            groups[status] = []      # O(1) dict operations
        groups[status].append(task)  # O(1) amortized
    return groups
# Total: O(n) — one pass, all dict operations are O(1)
And a tricky one where the complexity is hidden:
def remove_duplicates_naive(items):
    unique = []
    for item in items:          # O(n)
        if item not in unique:  # O(k) where k = len(unique)
            unique.append(item)
    return unique
# Worst case: all unique, so k grows to n
# Total: O(n^2) — the "in" check on a list is O(n)!

def remove_duplicates_fast(items):
    seen = set()
    unique = []
    for item in items:          # O(n)
        if item not in seen:    # O(1) set lookup
            seen.add(item)      # O(1)
            unique.append(item)
    return unique
# Total: O(n) — set lookup is O(1)
The difference between these two functions is the difference between O(n²) and O(n). For 10,000 items, that is 100,000,000 vs 10,000 operations. The second function isn't just "a little faster"—it is 10,000 times faster.
Space Complexity
Time complexity measures operations. Space complexity measures how much additional memory your algorithm uses beyond the input itself.
# O(1) space — modifies in place
def reverse_in_place(items):
    left, right = 0, len(items) - 1
    while left < right:
        items[left], items[right] = items[right], items[left]
        left += 1
        right -= 1
# Uses only two integer variables — constant extra space

# O(n) space — creates a new list
def reverse_copy(items):
    return items[::-1]
# Creates an entirely new list of the same size
Space complexity matters when your data is large. Reversing a 10-element list? Space doesn't matter. Reversing a 10-million-element list? Creating a copy doubles your memory usage. In-place algorithms trade readability for memory efficiency.
Amortized Analysis
Python's list.append() is advertised as O(1), but occasionally the underlying array runs out of space and must be resized—copying every element to a new, larger array. That single resize is O(n). How can append be O(1)?
Amortized analysis averages the cost over a sequence of operations. When a dynamic array doubles its capacity (say from 8 to 16 slots), the next 8 appends are "free"—no resize needed. The expensive O(n) copy gets spread across the cheap O(1) appends that follow it. On average, each append costs O(1).
Think of it like paying rent. You pay a large amount once a month, but amortized over 30 days, it's a fixed daily cost. One expensive day doesn't make your average expense high.
# Demonstrating dynamic array growth
import sys
items = []
prev_size = 0
for i in range(65):
    items.append(i)
    curr_size = sys.getsizeof(items)
    if curr_size != prev_size:
        print(f"Length {len(items):3d} -> Memory {curr_size:5d} bytes")
        prev_size = curr_size
# You'll see memory jumps at irregular intervals as the array resizes
When Big-O Lies
Big-O tells you about growth rate, not actual speed. For small inputs, constant factors dominate. A few important exceptions to keep in mind:
- Small n: For fewer than ~50 elements, an O(n²) insertion sort often beats an O(n log n) merge sort because insertion sort has much lower overhead (fewer function calls, better cache usage). Python's Timsort actually uses insertion sort for small runs inside its merge sort.
- Cache locality: An array stores elements contiguously in memory. When the CPU loads one element, it loads nearby elements into cache for free. A linked list scatters nodes across memory, causing cache misses—each access potentially waits for main memory. An O(n) array scan can be 10–100x faster than an O(n) linked list scan because of cache effects.
- Constant factors: An O(n) algorithm with a constant factor of 1000 is slower than an O(n log n) algorithm with a constant factor of 1 for any n below about 10,000. Big-O hides these constants by design.
The lesson: Big-O is a thinking tool, not the final answer. For large-scale decisions (O(n) vs O(n²) for millions of records), trust the theory. For micro-optimization (choosing between two O(n) algorithms), measure with a profiler.
In the real world, profiling with timeit and cProfile often contradicts theoretical analysis. A dict comprehension and a for-loop have the same Big-O, but the comprehension can be 20% faster due to fewer bytecode instructions. Know when to trust theory (choosing between O(n) and O(n²) for large data) and when to measure (choosing between two O(n) approaches with different constants). Senior engineers do both: Big-O for design decisions, profiling for optimization.
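As a sketch of that kind of micro-measurement, here are two O(n) ways to build the same list, differing only in constant factors (a list comprehension is used here for simplicity; function names are illustrative):

```python
import timeit

def squares_loop(n):
    # Explicit for-loop — O(n)
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_comprehension(n):
    # Same O(n), fewer bytecode instructions per element
    return [i * i for i in range(n)]

loop_time = timeit.timeit(lambda: squares_loop(10_000), number=200)
comp_time = timeit.timeit(lambda: squares_comprehension(10_000), number=200)
print(f"for-loop:      {loop_time:.3f}s")
print(f"comprehension: {comp_time:.3f}s")
# Same Big-O; the comprehension usually wins on constant factors
```

Big-O would call these identical; only a measurement reveals the difference.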
Analyze TaskForge's current operations. Adding a task to the end of a list is O(1) amortized. Searching for a task by title requires scanning every task: O(n). Sorting tasks by priority uses Python's Timsort: O(n log n). As your task list grows from 10 to 10,000 to 10 million, search becomes the bottleneck. In Chapter 21, you will build a hash-based index that makes search O(1)—and Big-O is how you know it is worth doing.
Micro-Exercises
Determine the Big-O time complexity of each function:
# Function A
def func_a(items):
    return items[0] if items else None

# Function B
def func_b(items):
    total = 0
    for x in items:
        total += x
    return total

# Function C
def func_c(items):
    for x in items:
        for y in items:
            print(x, y)

# Function D
def func_d(items):
    items.sort()
    return items[0]
Answers: A is O(1)—single index access. B is O(n)—one pass. C is O(n²)—nested loops. D is O(n log n)—the sort dominates.
This function looks linear but is actually O(n²). Can you see why?
def build_string(n):
    result = ""
    for i in range(n):
        result += str(i)  # string concatenation creates a new string each time
    return result
Answer: Strings are immutable in Python. Each += creates a new string and copies all previous characters. The total copies are 1 + 2 + 3 + ... + n = O(n²). Fix: use "".join(str(i) for i in range(n)) for O(n).
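A quick sketch to confirm the two approaches produce identical output at very different cost (function names are illustrative):

```python
def build_string_slow(n):
    # O(n^2): each += copies every character accumulated so far
    result = ""
    for i in range(n):
        result += str(i)
    return result

def build_string_fast(n):
    # O(n): each character is copied a constant number of times
    return "".join(str(i) for i in range(n))

# Same output, different growth rate
assert build_string_slow(100) == build_string_fast(100)
```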
Interactive Exercises
Design Challenge: "Predict and Measure"
Three functions solve the same problem: finding if a list contains duplicates. Predict their Big-O, then time them to verify. Fill in the predict dictionary and the measure() function.
The nested function has two loops over n elements. What do nested loops multiply to?
The sort function calls sorted() which is O(n log n), then scans once which is O(n). Which dominates?
Building a set from n elements is O(n). Comparing two lengths is O(1). Total: O(n).
Knowledge Check
What is the time complexity of searching for a value in an unsorted list of n elements?
Chapter 20 Linear Structures — Arrays, Linked Lists, Stacks & Queues
Every program stores sequences of data. Understanding how they are stored—contiguous vs scattered, indexed vs linked—determines whether your code runs in milliseconds or minutes. This chapter covers the four fundamental linear structures and teaches you when to use each one.
Arrays in Memory
An array stores elements in contiguous (side-by-side) memory locations. If element 0 starts at memory address 1000 and each element takes 8 bytes, then element 5 lives at address 1000 + (5 × 8) = 1040. This address arithmetic is why array index access is O(1)—the CPU can jump directly to any element without scanning.
Python's list is a dynamic array. Under the hood, it stores an array of pointers (references) to Python objects. The array itself is contiguous, even though the objects it points to may be scattered in memory.
Contiguous storage gives arrays an enormous advantage: cache locality. Modern CPUs don't fetch one byte at a time—they load entire cache lines (typically 64 bytes) from memory. When you access element 0 of an array, the CPU loads elements 0 through 7 (or more) into cache for free. Iterating through an array is fast because the next element is almost always already in cache.
Dynamic Arrays
Static arrays have a fixed size set at creation. Dynamic arrays (like Python's list) grow automatically. The strategy is geometric growth: when the array is full, allocate a new array that is roughly double the size, copy everything over, and free the old array.
This makes append() O(1) amortized (as we analyzed in Ch 19). But operations at the beginning of the array are expensive:
| Operation | Time | Why |
|---|---|---|
| `list[i]` (access by index) | O(1) | Address arithmetic |
| `list.append(x)` | O(1) amortized | Adds to the end; occasional resize |
| `list.insert(0, x)` | O(n) | Every element shifts right by one |
| `list.pop()` (last element) | O(1) | Removes from the end |
| `list.pop(0)` (first element) | O(n) | Every element shifts left by one |
| `x in list` | O(n) | Must scan the entire list |
The critical insight: inserting or removing from the front of a Python list is O(n), not O(1). Every element after the insertion point must shift. This is one of the most common performance traps in Python code.
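You can observe the trap directly with a rough timing sketch (absolute numbers depend on your machine):

```python
import time

n = 20_000

# Front insertion: each insert(0, x) shifts every existing element right
front = []
start = time.perf_counter()
for i in range(n):
    front.insert(0, i)
front_time = time.perf_counter() - start

# Back insertion: amortized O(1) per append
back = []
start = time.perf_counter()
for i in range(n):
    back.append(i)
back_time = time.perf_counter() - start

print(f"insert(0, x): {front_time:.4f}s")
print(f"append(x):    {back_time:.4f}s")
```

Even at 20,000 elements the front-insertion loop is dramatically slower, because it performs roughly n²/2 element shifts in total.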
Linked Lists
A linked list stores elements as individual node objects, each containing a value and a reference (pointer) to the next node. Nodes can be scattered anywhere in memory—they are connected by pointers, not by physical position.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedList:
    def __init__(self):
        self.head = None
        self.size = 0

    def prepend(self, value):
        """Add to the front — O(1)."""
        self.head = Node(value, self.head)
        self.size += 1

    def get(self, index):
        """Access by index — O(n). Must walk the chain."""
        current = self.head
        for _ in range(index):
            if current is None:
                raise IndexError("Index out of range")
            current = current.next
        if current is None:
            raise IndexError("Index out of range")
        return current.value

    def __len__(self):
        return self.size

    def __iter__(self):
        current = self.head
        while current:
            yield current.value
            current = current.next
In a singly linked list, each node points to the next. In a doubly linked list, each node also points to the previous, enabling traversal in both directions and O(1) removal from either end (if you have a reference to the node).
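The O(1) removal-by-reference is the doubly linked list's real superpower. A minimal sketch (the sentinel-node design here is one implementation choice, not the only one):

```python
class DNode:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

class DoublyLinkedList:
    def __init__(self):
        # Sentinel head/tail nodes remove edge cases at the ends
        self.head = DNode(None)
        self.tail = DNode(None)
        self.head.next = self.tail
        self.tail.prev = self.head

    def append(self, value):
        """Insert before the tail sentinel — O(1). Returns the new node."""
        node = DNode(value)
        last = self.tail.prev
        last.next = node
        node.prev = last
        node.next = self.tail
        self.tail.prev = node
        return node  # hold on to this for O(1) removal later

    def remove(self, node):
        """Unlink a node given a direct reference — O(1), no traversal."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def to_list(self):
        out, current = [], self.head.next
        while current is not self.tail:
            out.append(current.value)
            current = current.next
        return out
```

In practice collections.deque covers the common cases; reach for something like this only when you need to unlink arbitrary nodes in O(1).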
| Operation | Array (Python list) | Linked List |
|---|---|---|
| Access by index | O(1) | O(n) |
| Insert at front | O(n) | O(1) |
| Insert at end | O(1) amortized | O(1) with tail pointer |
| Delete from front | O(n) | O(1) |
| Search | O(n) | O(n) |
| Memory per element | 8 bytes (pointer) | ~56 bytes (object + pointer) |
When Linked Lists Win (And When They Don't)
Textbooks love linked lists for their O(1) insert/delete at the head. Reality is more nuanced. Python's list (a dynamic array) beats linked lists in almost every practical scenario because:
- Cache misses destroy performance. Array elements are contiguous; the CPU prefetches them. Linked list nodes are scattered; every node access is a potential cache miss.
- Memory overhead. Each linked list node is a Python object with header, value, and pointer(s). A list of 1 million integers uses roughly 8 MB as an array vs 56 MB as a linked list.
- Python overhead. Creating millions of `Node` objects puts pressure on Python's garbage collector.
Linked lists shine in two specific scenarios: (1) when you need O(1) insert/delete at both ends without index access (use collections.deque, which is a doubly-linked list of fixed-size blocks), and (2) when you need to splice sublists or reorder nodes without copying data (rare in Python, common in systems programming).
Stacks (LIFO)
A stack is a Last-In, First-Out (LIFO) structure. You push items on top and pop them off the top. Think of a stack of plates: you add to the top and take from the top.
# Python list is a perfect stack
stack = []
stack.append("open file") # push
stack.append("edit line 5") # push
stack.append("delete word") # push
last_action = stack.pop() # pop -> "delete word"
print(last_action) # most recent action
Use cases for stacks:
- Undo systems—push each action, pop to undo. This is exactly what you'll build for TaskForge.
- The call stack—when Python calls a function, it pushes a frame onto the call stack. When the function returns, it pops the frame. This is why infinite recursion causes a "stack overflow."
- Parentheses matching—push open brackets, pop when you see a closing bracket. If the stack is empty when you pop, or non-empty at the end, the brackets are unbalanced.
- Expression evaluation—calculators use stacks to evaluate `3 + 4 * 2` with correct precedence.
def is_balanced(expression):
    """Check if parentheses are balanced — classic stack problem."""
    stack = []
    pairs = {"(": ")", "[": "]", "{": "}"}
    for char in expression:
        if char in pairs:
            stack.append(char)
        elif char in pairs.values():
            if not stack:
                return False
            if pairs[stack.pop()] != char:
                return False
    return len(stack) == 0
print(is_balanced("({[]})")) # True
print(is_balanced("({[})")) # False
print(is_balanced("((()")) # False
Queues and Deques (FIFO)
A queue is a First-In, First-Out (FIFO) structure. Items enter at the back and leave from the front, like a line at a ticket counter.
You might be tempted to use a Python list as a queue, but there is a critical performance problem:
# BAD: Using list as a queue — pop(0) is O(n)!
queue = [1, 2, 3, 4, 5]
queue.append(6) # enqueue at back — O(1), fine
front = queue.pop(0) # dequeue from front — O(n), every element shifts!
# GOOD: Use collections.deque — popleft() is O(1)
from collections import deque
queue = deque([1, 2, 3, 4, 5])
queue.append(6) # enqueue at back — O(1)
front = queue.popleft() # dequeue from front — O(1)!
collections.deque (double-ended queue) supports O(1) operations at both ends. It is implemented as a doubly-linked list of fixed-size blocks, combining the cache-friendly behavior of arrays within each block with the flexibility of linked lists at the ends.
Use cases for queues:
- Task scheduling—process tasks in the order they arrive. Print queues, message queues, job queues.
- Breadth-first search—explore neighbors layer by layer (you'll use this with graphs in Ch 23).
- Rate limiting—track timestamps of recent requests in a sliding window.
Python's collections.deque is the right choice when you need a queue. Using list.pop(0) is O(n)—every element shifts. deque.popleft() is O(1). This is the kind of performance trap that AI-generated code often falls into. When you see queue = [] followed by queue.pop(0) in AI output, you know to replace it with deque. One data structure swap, no algorithm change, and performance goes from O(n²) to O(n) for a series of dequeue operations.
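The rate-limiting use case mentioned above fits a deque naturally. A minimal sliding-window sketch (the `RateLimiter` class and its API are illustrative, not a standard library feature):

```python
from collections import deque
import time

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds (sliding window)."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.timestamps = deque()  # oldest request time sits at the left

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that fell out of the window — O(1) per popleft
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = RateLimiter(limit=2, window=10.0)
print(limiter.allow(now=0.0))   # True
print(limiter.allow(now=1.0))   # True
print(limiter.allow(now=2.0))   # False — two requests already in the last 10s
print(limiter.allow(now=11.0))  # True — the earlier requests have expired
```

Both ends of the deque are exercised here: new timestamps append at the back, expired ones pop off the front, each in O(1).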
Implement an undo stack for TaskForge. Every operation—add task, delete task, update task—pushes a record onto the stack that stores enough information to reverse it. Calling undo() pops the last record and reverses the operation. This is the same pattern used by text editors, drawing tools, and database transactions.
Micro-Exercises
Test the is_balanced() function from the stacks section with these inputs. Predict the output before running:
print(is_balanced("")) # ?
print(is_balanced("()")) # ?
print(is_balanced("([{}])")) # ?
print(is_balanced("(]")) # ?
print(is_balanced("((())")) # ?
Answers: True, True, True, False, False. An empty string has balanced brackets (nothing to mismatch).
Time this comparison and observe the dramatic difference:
import time
from collections import deque
n = 100000
# List: pop from front
data = list(range(n))
start = time.time()
while data:
    data.pop(0)
list_time = time.time() - start

# Deque: pop from front
data = deque(range(n))
start = time.time()
while data:
    data.popleft()
deque_time = time.time() - start

print(f"list.pop(0):     {list_time:.4f}s")
print(f"deque.popleft(): {deque_time:.4f}s")
print(f"Deque is {list_time/deque_time:.0f}x faster")
Interactive Exercises
Design Challenge: "TaskForge Undo Stack"
Build an UndoStack class that records operations and reverses them. Each operation is a dict with "action" (add/delete), "task" (task dict). The stack manages a task list. add_task(task) adds and records. undo() reverses the last operation. get_tasks() returns current tasks.
In add_task, append the task to self.tasks, then push {"action": "add", "task": task} onto self.history.
In undo, pop from self.history. If the action was "add", remove that task from self.tasks. If "delete", re-add it.
In delete_task, find and remove the task, then push {"action": "delete", "task": removed_task} onto history.
Knowledge Check
You need a data structure where you frequently add to the back and remove from the front. Which is the best choice?
Chapter 21 Hash Tables, Sets & Maps
The dict is Python's most powerful data structure. Understanding how it works internally helps you know when it is O(1), when it degrades, and why certain things can't be dict keys. Every time you write if x in my_dict and it returns instantly, a hash table is doing the work. Every time you get a TypeError: unhashable type: 'list', it is because hash tables require immutable keys. This chapter explains why.
How Hashing Works
A hash function converts a key (any immutable value) into an integer called a hash code. That integer, taken modulo the table size, determines which bucket the key-value pair lives in.
# Python's built-in hash function
print(hash("hello")) # e.g., 2534023372954268
print(hash("hello") % 8) # e.g., 4 — bucket 4 in an 8-bucket table
print(hash(42)) # 42 — integers hash to themselves
print(hash((1, 2, 3))) # tuples are hashable because they're immutable
The process for looking up my_dict["hello"]:
1. Compute `hash("hello")` → some large integer
2. Compute `hash_code % table_size` → bucket index (e.g., 4)
3. Go directly to bucket 4 and find the value
No scanning, no comparisons with other keys. That is why dict lookup is O(1).
Collision Resolution
Two different keys can hash to the same bucket. This is called a collision. If hash("hello") % 8 == 4 and hash("world") % 8 == 4, both keys map to bucket 4. Hash tables must handle this.
There are two main strategies:
- Chaining: Each bucket holds a linked list (or small array) of entries. When a collision occurs, the new entry is appended to the chain. Lookup scans the chain—O(k) where k is the chain length. If the hash function distributes evenly, chains stay short.
- Open addressing: When a collision occurs, probe the next empty slot using a deterministic sequence (linear probing, quadratic probing, or double hashing). CPython uses a variant of open addressing with a pseudo-random probe sequence.
Both strategies have trade-offs: chaining wastes memory on node pointers, open addressing suffers from clustering (occupied slots clump together, making probe sequences longer). In practice, both work well when the table is not too full.
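To make chaining concrete, here is a toy hash table sketch. This is not how CPython's dict works (CPython uses open addressing, as noted above); it is a minimal illustration of the chaining strategy:

```python
class ChainedHashTable:
    """Toy hash table using chaining: each bucket is a list of (key, value)."""
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash code modulo table size picks the bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))       # collision or new key: extend chain

    def get(self, key):
        for k, v in self._bucket(key):    # scan the chain — O(chain length)
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(num_buckets=2)   # tiny table to force collisions
for word in ["apple", "banana", "cherry"]:
    table.put(word, len(word))
print(table.get("banana"))  # 6
```

With only two buckets, several keys necessarily share a chain, yet lookups still work; they just degrade toward O(n) as chains grow, which is exactly why real tables rehash.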
Load Factor and Rehashing
The load factor is the ratio of entries to buckets: n / table_size. As the load factor increases, collisions become more frequent and performance degrades toward O(n).
CPython keeps the load factor below roughly 2/3 (0.67). When inserting a new entry would exceed this threshold, the table rehashes: allocates a new, larger table (typically double the size) and reinserts every existing entry. This is an O(n) operation, but it happens infrequently—amortized O(1) per insert, the same pattern as dynamic array resizing.
# You can observe rehashing indirectly via memory usage
import sys
d = {}
prev = sys.getsizeof(d)
for i in range(100):
    d[i] = i
    curr = sys.getsizeof(d)
    if curr != prev:
        print(f"At {i} entries: dict size jumped from {prev} to {curr} bytes")
        prev = curr
# You'll see jumps at approximately 5, 10, 21, 42, 85 entries
Python Dict Internals
Since CPython 3.6, dicts use a compact layout with two separate arrays:
- A sparse hash table (mostly empty) that stores indices into the dense array
- A dense entries array that stores keys and values in insertion order
This is why Python dicts preserve insertion order since 3.7—the dense array records entries in the order they were added. It also explains why dicts use more memory than you might expect: the sparse hash table has empty slots to keep the load factor low.
Key facts about Python dicts:
- Average lookup, insert, and delete: O(1)
- Worst-case (all keys collide): O(n)—but this essentially never happens with Python's hash function
- Iteration is O(n) and follows insertion order
- Memory usage is roughly 50–70 bytes per entry (much more than a list)
Sets
A set is a hash table without values—it stores only keys. This gives you O(1) membership testing, which is the single most common reason to use a set over a list.
# O(n) — membership test on a list
def has_overdue_list(task_ids, overdue_ids_list):
    for tid in task_ids:
        if tid in overdue_ids_list:  # O(n) each time
            return True
    return False
# Total: O(n * m) where n = task_ids, m = overdue list
# O(n) — membership test on a set
def has_overdue_set(task_ids, overdue_ids_set):
    for tid in task_ids:
        if tid in overdue_ids_set:  # O(1) each time
            return True
    return False
# Total: O(n) where n = task_ids
Sets also support efficient mathematical operations:
team_a = {"Alice", "Bob", "Carol"}
team_b = {"Bob", "Dave", "Carol"}
print(team_a & team_b) # Intersection: {"Bob", "Carol"}
print(team_a | team_b) # Union: {"Alice", "Bob", "Carol", "Dave"}
print(team_a - team_b) # Difference: {"Alice"}
print(team_a ^ team_b) # Symmetric diff: {"Alice", "Dave"}
When to use a set vs a list: If you need to check "is X in my collection?" more than occasionally, convert to a set. For 100 elements, the difference is negligible. For 1 million elements, x in my_list takes a millisecond while x in my_set takes a microsecond—a 1000x difference.
When Hashing Breaks
Hash tables have failure modes you need to know about:
- Mutable keys are forbidden. If you could modify a list after using it as a dict key, its hash would change and the dict could no longer find it. Python prevents this: `TypeError: unhashable type: 'list'`. Convert to a tuple first.
- Hash collision attacks. An attacker who knows your hash function can craft inputs that all map to the same bucket, degrading O(1) to O(n). Python mitigates this with hash randomization (different seed per process).
- Pathological distributions. If many keys hash to the same bucket despite a good hash function, performance degrades. This is extremely rare with Python's hash, but worth knowing about for debugging.
# This fails: lists are mutable, so unhashable
try:
    d = {[1, 2, 3]: "value"}
except TypeError as e:
    print(e)  # unhashable type: 'list'
# This works: tuples are immutable, so hashable
d = {(1, 2, 3): "value"}
print(d[(1, 2, 3)]) # "value"
Choosing between dict, set, and list for lookups is a daily engineering decision. if x in my_list is O(n). if x in my_set is O(1). For 100 elements it doesn't matter. For 1 million, it is the difference between milliseconds and microseconds. When you see AI-generated code that stores unique IDs in a list and checks membership in a loop, you know to swap it for a set. This is the kind of optimization that scales.
Build a tag-based index for tasks. A dict maps tag strings to lists of task IDs: {"urgent": [1, 3, 7], "backend": [2, 3]}. When you need all tasks tagged "urgent," the dict gives you the answer in O(1) instead of scanning every task. This is the same pattern databases use for indexes—you are building one by hand.
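A minimal sketch of that pattern, using sets of IDs for O(1) membership checks (the task shape and field names here are illustrative):

```python
from collections import defaultdict

def build_tag_index(tasks):
    """Map each tag to the set of task IDs carrying it — one O(n) pass."""
    index = defaultdict(set)
    for task in tasks:
        for tag in task["tags"]:
            index[tag].add(task["id"])
    return index

tasks = [
    {"id": 1, "tags": ["urgent"]},
    {"id": 2, "tags": ["backend"]},
    {"id": 3, "tags": ["urgent", "backend"]},
]
index = build_tag_index(tasks)
print(index["urgent"])  # {1, 3} — answered without scanning every task
```

You pay O(n) once to build the index, then every tag query is a single hash lookup.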
Micro-Exercises
Convert this O(n²) function to O(n) using a set:
# O(n^2) — checking membership in a list inside a loop
def find_common(list_a, list_b):
    common = []
    for item in list_a:
        if item in list_b:  # O(n) for each check
            common.append(item)
    return common
# Your O(n) version:
def find_common_fast(list_a, list_b):
    set_b = set(list_b)  # O(m) one-time conversion
    common = []
    for item in list_a:
        if item in set_b:  # O(1) for each check
            common.append(item)
    return common
# Total: O(n + m) instead of O(n * m)
Interactive Exercises
Design Challenge: "TaskForge Tag Index"
Build a TagIndex class that maintains a dict mapping tags to sets of task IDs. Implement add_task(task_id, tags), remove_task(task_id, tags), and search(tag) that returns a set of task IDs.
In add_task, loop through tags. For each tag, if it's not in self.index, create an empty set. Then add the task_id to the set.
Use self.index.setdefault(tag, set()).add(task_id) for a concise one-liner per tag.
In search, return self.index.get(tag, set()) to handle missing tags gracefully.
Knowledge Check
Why can't you use a list as a dictionary key in Python?
Chapter 22 Trees
Trees are everywhere you don't see them. Your file system is a tree. Database indexes are trees. The HTML DOM is a tree. JSON is a tree. When you call os.walk() to traverse directories, you are traversing a tree. Understanding trees means understanding how all these systems perform and why database queries are fast even on tables with millions of rows.
Binary Tree Basics
A binary tree is a hierarchy where each node has at most two children, conventionally called left and right. The topmost node is the root. Nodes with no children are leaves. The depth of a node is its distance from the root. The height of the tree is the depth of its deepest leaf.
class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

# Build a small tree:
#         8
#        / \
#       3   10
#      / \    \
#     1   6    14
root = TreeNode(8,
                TreeNode(3, TreeNode(1), TreeNode(6)),
                TreeNode(10, None, TreeNode(14)))
Tree Traversals
There are four standard ways to visit every node in a tree. Each traversal order serves a different purpose:
def inorder(node):
    """Left -> Root -> Right. For BSTs, visits nodes in sorted order."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

def preorder(node):
    """Root -> Left -> Right. Copies the tree structure."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)

def postorder(node):
    """Left -> Right -> Root. Safe for deletion (children before parent)."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]

def levelorder(node):
    """Layer by layer (BFS). Uses a queue."""
    from collections import deque
    if node is None:
        return []
    result, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        result.append(current.value)
        if current.left:
            queue.append(current.left)
        if current.right:
            queue.append(current.right)
    return result

# Using our tree from above:
print("Inorder:   ", inorder(root))     # [1, 3, 6, 8, 10, 14] — sorted!
print("Preorder:  ", preorder(root))    # [8, 3, 1, 6, 10, 14]
print("Postorder: ", postorder(root))   # [1, 6, 3, 14, 10, 8]
print("Levelorder:", levelorder(root))  # [8, 3, 10, 1, 6, 14]
Binary Search Trees
A Binary Search Tree (BST) enforces an ordering rule: for every node, all values in the left subtree are smaller and all values in the right subtree are larger. This rule makes search, insert, and delete all O(h) where h is the height of the tree.
def bst_search(node, target):
    """Search a BST — O(h) where h = height."""
    if node is None:
        return False
    if target == node.value:
        return True
    elif target < node.value:
        return bst_search(node.left, target)   # go left
    else:
        return bst_search(node.right, target)  # go right

def bst_insert(node, value):
    """Insert into a BST — O(h)."""
    if node is None:
        return TreeNode(value)
    if value < node.value:
        node.left = bst_insert(node.left, value)
    elif value > node.value:
        node.right = bst_insert(node.right, value)
    # if value == node.value, duplicate — skip or handle as you wish
    return node
# Build a BST from scratch
tree = None
for val in [8, 3, 10, 1, 6, 14]:
    tree = bst_insert(tree, val)
print(bst_search(tree, 6)) # True
print(bst_search(tree, 7)) # False
print(inorder(tree)) # [1, 3, 6, 8, 10, 14] — always sorted
Why Balance Matters
The performance of a BST depends entirely on its height. A balanced BST with n nodes has height log n, giving O(log n) operations. But if you insert sorted data into a BST, every node goes to the right, creating a degenerate tree that is effectively a linked list with height n.
# Balanced insertion: [8, 3, 10, 1, 6, 14]
# Creates a nice tree with 3 levels (height 2):
#       8
#      / \
#     3   10
#    / \    \
#   1   6    14
# Search: at most 3 comparisons, one per level: O(log n)
# Sorted insertion: [1, 3, 6, 8, 10, 14]
# Creates a degenerate tree with 6 levels (height 5):
#   1
#    \
#     3
#      \
#       6
#        \
#         8
#          \
#           10
#            \
#             14
# Search: up to 6 comparisons, one per node: O(n). It's a linked list!
Self-balancing trees (AVL trees, red-black trees) automatically restructure after inserts and deletes to guarantee O(log n) height. You will almost never implement these yourself—they exist inside libraries and databases. But you need to know why they exist: to prevent the O(n) degenerate case. C++'s std::map and Java's TreeMap use red-black trees. Database indexes use B-trees (below).
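You can watch the degenerate case happen by measuring height after inserting in different orders. A standalone sketch repeating the TreeNode and bst_insert definitions from above (the height and build helpers are illustrative):

```python
class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def bst_insert(node, value):
    if node is None:
        return TreeNode(value)
    if value < node.value:
        node.left = bst_insert(node.left, value)
    elif value > node.value:
        node.right = bst_insert(node.right, value)
    return node

def height(node):
    """Depth of the deepest leaf (empty tree: -1)."""
    return -1 if node is None else 1 + max(height(node.left), height(node.right))

def build(values):
    tree = None
    for v in values:
        tree = bst_insert(tree, v)
    return tree

print(height(build([8, 3, 10, 1, 6, 14])))  # 2: mixed order stays shallow
print(height(build(range(1, 101))))         # 99: sorted input degenerates to a chain
```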
Heaps and Priority Queues
A min-heap is a complete binary tree where every parent is less than or equal to its children. The minimum value is always at the root. Heaps are stored as arrays using index arithmetic:
- Children of node i: indices 2i+1 and 2i+2
- Parent of node i: index (i-1)//2
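That index arithmetic is also how you check whether a list satisfies the heap invariant: every parent at index i must be <= its children at 2i+1 and 2i+2. A sketch (the is_min_heap helper is illustrative):

```python
import heapq

def is_min_heap(a):
    """Verify the min-heap property using the array index arithmetic above."""
    n = len(a)
    for i in range(n):
        left, right = 2 * i + 1, 2 * i + 2
        if left < n and a[i] > a[left]:
            return False
        if right < n and a[i] > a[right]:
            return False
    return True

data = [42, 17, 93, 5, 68, 31]
heapq.heapify(data)
print(is_min_heap(data))       # True: heapify rearranged it into heap order
print(is_min_heap([3, 1, 2]))  # False: parent 3 > child 1
```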
Python's heapq module provides a min-heap backed by a regular list:
import heapq
# Build a heap from a list
tasks = [(3, "Low priority task"),
         (1, "URGENT task"),
         (2, "Medium task")]
heapq.heapify(tasks) # O(n) — rearranges in place
# Pop the smallest (highest priority)
print(heapq.heappop(tasks)) # (1, "URGENT task")
print(heapq.heappop(tasks)) # (2, "Medium task")
print(heapq.heappop(tasks)) # (3, "Low priority task")
# Push a new item
heapq.heappush(tasks, (0, "Critical fix"))
# Get top-K without sorting the whole list
data = [42, 17, 93, 5, 68, 31, 84, 2]
print(heapq.nsmallest(3, data)) # [2, 5, 17] — O(n log k)
print(heapq.nlargest(3, data)) # [93, 84, 68]
| Heap Operation | Time Complexity |
|---|---|
| heapify(list) | O(n) |
| heappush(heap, item) | O(log n) |
| heappop(heap) | O(log n) |
| nsmallest(k, iterable) | O(n log k) |
| Access minimum | O(1) — it is heap[0] |
The key insight: if you only need the minimum (or maximum) element repeatedly, a heap is better than sorting. Sorting gives you all elements in order (O(n log n)). A heap gives you just the min in O(1) and lets you remove it in O(log n). For "give me the top 10 most important tasks out of 10,000," a heap is the right tool.
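The "top 10 out of 10,000" pattern can run in a single pass with a bounded heap: keep a min-heap of at most k items, and for each new item push and pop in one step with heappushpop. This is essentially what heapq.nlargest does internally; a sketch (the top_k name is illustrative):

```python
import heapq

def top_k(iterable, k):
    """Largest k items in O(n log k) time and O(k) space."""
    heap = []
    for item in iterable:
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            # Push item, then pop the smallest; the heap keeps the k largest seen
            heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)

print(top_k([42, 17, 93, 5, 68, 31, 84, 2], 3))  # [93, 84, 68]
```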
Tries
A trie (pronounced "try") is a tree where each node represents a character. A path from the root to a node spells a prefix. Tries power autocomplete features: to find all words starting with "pro", walk down the p-r-o path and return everything below.
The critical property: lookup time is O(m) where m is the key length, regardless of how many keys are stored. A trie with 1 million words finds "programming" in exactly 11 steps.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True
t = Trie()
for word in ["task", "tag", "test", "team"]:
    t.insert(word)
print(t.search("task")) # True
print(t.search("tas")) # False (prefix, not complete word)
print(t.starts_with("ta")) # True
print(t.starts_with("te")) # True
print(t.starts_with("tx")) # False
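The autocomplete behavior described earlier, "walk down the prefix path and return everything below," takes only a few more lines. A standalone sketch with a hypothetical autocomplete method added to the same Trie (collecting words via DFS under the prefix node):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.is_end = True

    def autocomplete(self, prefix):
        """All stored words starting with prefix: walk the path, then DFS below."""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        results = []

        def collect(node, word):
            if node.is_end:
                results.append(word)
            for char, child in node.children.items():
                collect(child, word + char)

        collect(node, prefix)
        return results

t = Trie()
for word in ["task", "tag", "test", "team"]:
    t.insert(word)
print(sorted(t.autocomplete("ta")))  # ['tag', 'task']
print(sorted(t.autocomplete("te")))  # ['team', 'test']
print(t.autocomplete("tx"))          # []
```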
B-Trees (Conceptual)
Databases don't use BSTs. They use B-trees—wide, shallow trees designed for disk-based storage.
The problem with BSTs on disk: each node is one comparison, and reading a node from disk takes ~10 milliseconds (mechanical disk) or ~0.1 ms (SSD). A balanced BST with 1 million nodes has height 20, requiring 20 disk reads. At 0.1 ms each, that is 2 ms per query—tolerable but not great.
A B-tree solves this by making nodes wide: each node holds hundreds of keys and has hundreds of children. A B-tree with branching factor 100 and 3 levels can hold 100³ = 1 million keys with only 3 disk reads per search. That is why SELECT * FROM tasks WHERE id = 42 returns instantly on a table with millions of rows.
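The branching-factor arithmetic generalizes: keep multiplying capacity by the branching factor until it covers the key count, and the number of multiplications is the number of levels (roughly, disk reads). A back-of-envelope sketch (the levels_needed helper is illustrative):

```python
def levels_needed(n_keys, branching_factor):
    """How many tree levels until capacity covers n_keys."""
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching_factor
        levels += 1
    return levels

print(levels_needed(1_000_000, 2))    # 20: binary tree, 20 disk reads
print(levels_needed(1_000_000, 100))  # 3: B-tree with branching factor 100
```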
You don't implement B-trees. You use them through databases (SQLite, PostgreSQL). But knowing they exist explains why database indexes are fast and why CREATE INDEX speeds up queries—it builds a B-tree over the indexed column.
You almost never implement a tree yourself. But understanding them is essential: database indexes are B-trees, file systems are tree-structured, and heapq is the right tool for "give me the top 10 most important tasks." When AI generates code that sorts a list just to find the max, you know a single max() call does it in O(n). Better still, if you need the max repeatedly as items change, a heap gives O(log n) per update instead of re-sorting at O(n log n). Knowing the structure means knowing the right tool.
Implement priority-based task scheduling using heapq. Tasks with lower priority numbers get processed first (priority 1 = most urgent). Support add_task(priority, name) and next_task() that returns the highest-priority task. This replaces the naive approach of sorting the entire task list every time you want the next task.
Micro-Exercises
Given this tree, predict the output of each traversal before running the code:
#       10
#      /  \
#     5    15
#    / \     \
#   2   7     20
root = TreeNode(10,
    TreeNode(5, TreeNode(2), TreeNode(7)),
    TreeNode(15, None, TreeNode(20))
)
print("Inorder: ", inorder(root)) # ?
print("Preorder: ", preorder(root)) # ?
print("Postorder: ", postorder(root)) # ?
Answers: Inorder: [2, 5, 7, 10, 15, 20]. Preorder: [10, 5, 2, 7, 15, 20]. Postorder: [2, 7, 5, 20, 15, 10].
Use heapq.nlargest to find the 3 tasks with the highest priority (largest number):
import heapq
tasks = [(3, "Low"), (10, "Critical"), (1, "Trivial"),
         (7, "High"), (5, "Medium"), (9, "Urgent")]
top_3 = heapq.nlargest(3, tasks)
print(top_3) # [(10, 'Critical'), (9, 'Urgent'), (7, 'High')]
Interactive Exercises
Design Challenge: "TaskForge Priority Scheduler"
Build a TaskScheduler class using heapq. add_task(priority, name) adds a task. next_task() returns and removes the highest-priority (lowest number) task as a (priority, name) tuple, or None if empty. peek() returns the next task without removing it.
Use heapq.heappush(self.heap, (priority, self.counter, name)). The counter acts as a tiebreaker for equal priorities (FIFO order).
In next_task, use heapq.heappop(self.heap) which returns (priority, counter, name). Return (priority, name) to the caller.
In peek, access self.heap[0] without popping. Return (self.heap[0][0], self.heap[0][2]).
Knowledge Check
A BST has 1 million nodes and is perfectly balanced. How many comparisons does a search take at most?
Chapter 23 Graphs
Not all relationships are hierarchical. Social networks, flight routes, task dependencies, web page links—these are all graphs. When you need to model "A depends on B" or "Alice knows Bob" or "there's a road from X to Y," a tree won't cut it because trees can't represent cycles, multiple paths, or peer-to-peer relationships. Graphs are the most general-purpose data structure for modeling connections.
Graph Vocabulary
A graph is a set of nodes (also called vertices) connected by edges. That is the entire definition. Everything else is a specialization:
- Directed vs undirected: In a directed graph, edge A→B is one-way (A depends on B, but B does not depend on A). In an undirected graph, edges go both ways (Alice knows Bob means Bob knows Alice).
- Weighted vs unweighted: Weighted edges carry a value (distance, cost, time). Unweighted edges represent only connection.
- Path: A sequence of edges connecting two nodes. "Can I get from A to D?" is a question about whether a path exists.
- Cycle: A path that starts and ends at the same node. A graph with no cycles is acyclic.
- Connected component: A group of nodes where every node can reach every other node. A graph may have multiple disconnected components.
- Degree: The number of edges connected to a node. In directed graphs, in-degree counts incoming edges and out-degree counts outgoing edges.
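The degree definitions map directly onto code. Given a directed graph stored as a plain dict of neighbor lists, out-degree is the length of a node's list and in-degree counts how many lists mention it (helper names are illustrative):

```python
# Directed graph: A->B, A->C, B->C, C->D, as a dict of neighbor lists
graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}

def out_degree(g, node):
    """Edges leaving node."""
    return len(g[node])

def in_degree(g, node):
    """Edges arriving at node: scan every neighbor list."""
    return sum(node in neighbors for neighbors in g.values())

print(out_degree(graph, "A"))  # 2: edges A->B and A->C
print(in_degree(graph, "C"))   # 2: edges A->C and B->C
print(in_degree(graph, "A"))   # 0: nothing points at A
```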
Representations
There are two standard ways to store a graph in code. Both represent the same information but have different performance characteristics.
Adjacency List
A dict mapping each node to a list (or set) of its neighbors. This is the most common representation and the one you should default to.
# Directed graph: A->B, A->C, B->C, C->D
adj_list = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
    "D": []
}
# Checking neighbors of A: O(1) dict lookup
print(adj_list["A"]) # ["B", "C"]
# Checking if edge A->B exists: O(degree of A)
print("B" in adj_list["A"]) # True
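That O(degree) membership test disappears if you store neighbors in a set instead of a list: edge lookup becomes an O(1) hash check while space stays O(V + E). A sketch of the same graph with sets:

```python
# Same directed graph, but neighbors stored in sets
adj = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"D"},
    "D": set(),
}

print("B" in adj["A"])  # True: O(1) set membership instead of a list scan
print("D" in adj["A"])  # False
```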
Adjacency Matrix
A 2D grid where matrix[i][j] = 1 (or a weight) if there is an edge from node i to node j, and 0 otherwise.
# Same graph: A=0, B=1, C=2, D=3
# A B C D
adj_matrix = [
    [0, 1, 1, 0],  # A -> B, A -> C
    [0, 0, 1, 0],  # B -> C
    [0, 0, 0, 1],  # C -> D
    [0, 0, 0, 0],  # D -> (none)
]
# Checking if edge A->B exists: O(1)
print(adj_matrix[0][1]) # 1 — edge exists
| Operation | Adjacency List | Adjacency Matrix |
|---|---|---|
| Space | O(V + E) | O(V²) |
| Check if edge exists | O(degree) | O(1) |
| Find all neighbors | O(degree) | O(V) |
| Add edge | O(1) | O(1) |
| Best for | Sparse graphs (few edges) | Dense graphs (many edges) |
Most real-world graphs are sparse: a social network with 1 billion users doesn't have 1 billion friends per person. Adjacency lists are the default. Use a matrix only when you need O(1) edge lookup and the graph is dense.
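The two representations carry the same information, so converting between them is mechanical: fix a node ordering, then fill a V x V grid. A sketch building the matrix from the adjacency list above (sorting node names for the ordering is an assumption of this sketch):

```python
adj_list = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}

# Fix a node ordering, then fill a V x V grid of 0/1
nodes = sorted(adj_list)                           # ['A', 'B', 'C', 'D']
index = {node: i for i, node in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for node, neighbors in adj_list.items():
    for neighbor in neighbors:
        matrix[index[node]][index[neighbor]] = 1

for row in matrix:
    print(row)
# [0, 1, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
# [0, 0, 0, 0]
```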
Building a Graph Class
class Graph:
    def __init__(self, directed=False):
        self.adj = {}
        self.directed = directed

    def add_node(self, node):
        if node not in self.adj:
            self.adj[node] = set()

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)
        if not self.directed:
            self.adj[v].add(u)

    def neighbors(self, node):
        return self.adj.get(node, set())

    def has_edge(self, u, v):
        return v in self.adj.get(u, set())

    def nodes(self):
        return list(self.adj.keys())

    def __repr__(self):
        lines = []
        for node, neighbors in self.adj.items():
            for n in neighbors:
                lines.append(f"  {node} -> {n}")
        return "Graph:\n" + "\n".join(lines)
# Example: social network (undirected)
social = Graph(directed=False)
social.add_edge("Alice", "Bob")
social.add_edge("Bob", "Carol")
social.add_edge("Alice", "Carol")
print(social.neighbors("Alice")) # {"Bob", "Carol"}
print(social.has_edge("Bob", "Alice")) # True (undirected)
# Example: task dependencies (directed)
deps = Graph(directed=True)
deps.add_edge("Write code", "Run tests")
deps.add_edge("Run tests", "Deploy")
deps.add_edge("Write code", "Code review")
print(deps.neighbors("Write code")) # {"Run tests", "Code review"}
print(deps.has_edge("Deploy", "Write code")) # False (directed)
Modeling Real Problems as Graphs
The power of graphs is that they model a huge variety of real-world problems with the same code:
| Problem | Nodes | Edges | Type |
|---|---|---|---|
| Task dependencies | Tasks | A must finish before B | Directed, unweighted |
| Social network | People | Friendship | Undirected, unweighted |
| Flight routes | Cities | Flight with distance/cost | Directed or undirected, weighted |
| Web links | Pages | Hyperlinks | Directed, unweighted |
| Package dependencies | Packages | A requires B | Directed, unweighted (DAG) |
Directed Acyclic Graphs (DAGs)
A DAG is a directed graph with no cycles. DAGs are critical for dependency modeling: if task A depends on B and B depends on C, that is fine. But if C also depends on A, you have a cycle—no valid execution order exists.
# Valid DAG (no cycles):
# Write code -> Run tests -> Deploy
# Write code -> Code review -> Deploy
# Invalid (cycle):
# A -> B -> C -> A
# No task can go first because each depends on another
Detecting cycles is therefore a critical validation step: before scheduling tasks based on dependencies, verify that the dependency graph is a DAG. If it contains a cycle, the user has an impossible set of requirements.
Cycle Detection with DFS
You can detect cycles in a directed graph using depth-first search (DFS) with a recursion stack. The idea: as you explore deeper via DFS, keep track of all nodes currently on the exploration path. If you reach a node that is already on the current path, you have found a cycle.
def has_cycle(graph):
    """Detect cycles in a directed graph using DFS.

    Uses three states per node:
    - unvisited: not yet explored
    - in_stack: currently on the DFS path (ancestors of current node)
    - done: fully explored, no cycles through this node
    """
    visited = set()
    in_stack = set()

    def dfs(node):
        visited.add(node)
        in_stack.add(node)
        for neighbor in graph.neighbors(node):
            if neighbor not in visited:
                if dfs(neighbor):
                    return True
            elif neighbor in in_stack:
                return True  # found a cycle!
        in_stack.remove(node)
        return False

    for node in graph.nodes():
        if node not in visited:
            if dfs(node):
                return True
    return False
# Test: valid DAG
dag = Graph(directed=True)
dag.add_edge("A", "B")
dag.add_edge("B", "C")
dag.add_edge("A", "C")
print(has_cycle(dag)) # False
# Test: graph with cycle
cyclic = Graph(directed=True)
cyclic.add_edge("A", "B")
cyclic.add_edge("B", "C")
cyclic.add_edge("C", "A") # creates cycle
print(has_cycle(cyclic)) # True
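One caveat with the recursive version: Python's default recursion limit (roughly 1000 frames) makes it crash on very deep graphs. An iterative variant with an explicit stack avoids that. A sketch, taking a plain dict of neighbor lists rather than the Graph class above (the three color states mirror unvisited / in_stack / done):

```python
def has_cycle_iterative(adj):
    """Cycle detection without recursion. adj: dict node -> iterable of neighbors.
    Colors: 0 = unvisited, 1 = on current DFS path, 2 = fully explored."""
    color = {node: 0 for node in adj}
    for start in adj:
        if color[start]:
            continue
        color[start] = 1
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbors = stack[-1]
            for neighbor in neighbors:
                if color[neighbor] == 1:
                    return True  # back edge to the current path: cycle
                if color[neighbor] == 0:
                    color[neighbor] = 1
                    stack.append((neighbor, iter(adj[neighbor])))
                    break  # descend; resume this node's neighbors later
            else:
                color[node] = 2  # all neighbors done; leave the path
                stack.pop()
    return False

# A chain deep enough to blow the recursion limit in the recursive version
deep = {i: [i + 1] for i in range(5000)}
deep[5000] = []
print(has_cycle_iterative(deep))                      # False
print(has_cycle_iterative({"A": ["B"], "B": ["A"]}))  # True
```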
Graphs show up everywhere in professional engineering. pip resolves package dependencies using a DAG and topological sort. Google Maps finds routes using weighted graphs and shortest path algorithms (Dijkstra's). LinkedIn's "people you may know" uses BFS on the social graph. Docker's layer system forms a DAG. Even Git's commit history is a DAG where each commit points to its parent(s). You will implement some of these algorithms in Phase 6.
Model task dependencies as a DAG. If Task B depends on Task A, add a directed edge A → B (meaning A must be completed before B). Before scheduling, validate the DAG with has_cycle()—a cycle means the dependencies are impossible. In Phase 6, you will learn topological sort to find a valid execution order for all tasks.
Micro-Exercises
Model this scenario as a graph and answer the questions:
"Task A depends on nothing. Task B depends on A. Task C depends on A. Task D depends on both B and C."
g = Graph(directed=True)
g.add_edge("A", "B") # A must finish before B
g.add_edge("A", "C") # A must finish before C
g.add_edge("B", "D") # B must finish before D
g.add_edge("C", "D") # C must finish before D
print(has_cycle(g)) # False — valid DAG
print(g.neighbors("A")) # {"B", "C"}
# Valid execution orders: A, B, C, D or A, C, B, D
What happens when you add an edge from D back to A?
g.add_edge("D", "A") # D depends on A, but A depends on B/C which depend on D!
print(has_cycle(g)) # True — impossible dependency cycle
Interactive Exercises
Design Challenge: "TaskForge Dependency Graph"
Build a TaskGraph class that manages tasks as nodes and dependencies as directed edges. Implement add_task(name), add_dependency(task, depends_on), and has_cycle() that returns True if the dependency graph contains a cycle.
Store the graph as self.adj[node] = set() where the set contains nodes that this node has edges to. For add_dependency("B", "A"), add an edge A -> B (A must finish before B).
For cycle detection, implement DFS with two sets: visited (all nodes seen) and in_stack (nodes on the current DFS path). If you reach a node in in_stack, you found a cycle.
Remember to iterate over all nodes in the outer loop, not just one starting node. The graph might have disconnected components.
Knowledge Check
You are modeling flight routes between cities with distances. Which graph representation is most appropriate?
Phase 5 Gate Checkpoint & TaskForge Data Structures
Minimum Competency
Analyze the time and space complexity of any Python function. Choose the right data structure for a given problem. Understand how arrays, linked lists, hash tables, trees, and graphs work internally. Implement undo systems, tag indexes, priority queues, and dependency graphs.
Your Artifact
Extend TaskForge with a dependency system: model tasks as a DAG (Ch 23), schedule by priority using a heap (Ch 22), and support undo for all operations (Ch 20). Your system should detect circular dependencies, process tasks in priority order within dependency constraints, and let users undo their last 10 operations.
Verification
Adding a circular dependency raises an error or returns a warning. next_task() returns the highest-priority task whose dependencies are all complete. undo() reverses the last operation and restores previous state. The tag index (Ch 21) returns matching tasks in O(1) instead of scanning all tasks.
If you cannot explain why list.pop(0) is O(n) and deque.popleft() is O(1), or why x in my_list is O(n) and x in my_set is O(1), return to Chapters 19–21. These performance distinctions are the foundation of every algorithm discussion in Phase 6.
What You Can Now Do
- Analyze time and space complexity using Big-O notation
- Choose between arrays, linked lists, stacks, queues, and deques
- Use dicts and sets for O(1) lookups and build custom indexes
- Use heaps for priority-based scheduling
- Model relationships and dependencies as graphs
- Detect cycles in directed graphs to validate dependencies
You now know the structures. Phase 6 teaches you the algorithms that operate on them—searching, sorting, dynamic programming, and graph traversal. These are the tools that transform data structures from storage containers into problem-solving machines. You will implement binary search on sorted arrays, BFS and DFS on graphs, and dynamic programming on sequences. Every algorithm in Phase 6 builds directly on the data structures you learned here.