Midterm Review & Cheatsheet for CS 188 Fall 2024 | Introduction to Artificial Intelligence at UC Berkeley
Author: SimonXie2004.github.io
Resources
Lec1: Introduction
N/A
Lec2: Search
Reflex Agents V.S. Planning Agents:
- Reflex Agents: Consider how the world IS
- Planning Agents: Consider how the world WOULD BE
Properties of Agents
- Completeness: Guaranteed to find a solution if one exists.
- Optimality: Guaranteed to find the least cost path.
Definition of Search Problem:
- State Space
- Successor Function
- Start State
- Goal Test
Definition of State Space: World State & Search State
State Space Graph: Nodes = states, Arcs = successors (action results)
Tree Search
Main Idea: Expand out potential nodes; maintain a fringe of partial plans under consideration; try to expand as few nodes as possible.
Key notions: Expansion, Expansion Strategy, Fringe
Common tree search patterns. Suppose b = branching factor, m = maximum depth, s = depth of the shallowest solution; for Uniform Cost Search, suppose the optimal solution costs \(C^*\) and the minimum arc cost is \(\epsilon\). Number of nodes in the search tree: \(\sum_{i=0}^{m}b^i = O(b^m)\).

| Strategy | Fringe | Time | Memory | Complete? | Optimal? |
| --- | --- | --- | --- | --- | --- |
| DFS (expand deepest node first) | LIFO stack | \(O(b^m)\) | \(O(bm)\) | True (if no cycles) | False |
| BFS (expand shallowest node first) | FIFO queue | \(O(b^s)\) | \(O(b^s)\) | True | True (if all costs are 1) |
| UCS (expand cheapest node first) | Priority queue (priority = cumulative cost) | \(O(b^{C^*/\epsilon})\) | \(O(b^{C^*/\epsilon})\) | True | True |

Special Idea: Iterative Deepening: run DFS(depth_limit=1), DFS(depth_limit=2), ...
Example Problem: Pancake flipping; Cost: Number of pancakes flipped
Graph Search
- Idea: never expand a state twice
- Method: record the set of expanded states, where each element = (state, cost). When a node is popped from the queue: if its state has not been visited, expand it. If it has been visited, compare costs: if the new cost is lower, expand it again; otherwise skip it.
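A minimal best-first graph-search sketch of this idea in Python (`get_successors`, `is_goal`, and `priority_fn` are assumed placeholders, not the course's code):

```python
import heapq
from itertools import count

def graph_search(start, is_goal, get_successors, priority_fn):
    """Generic best-first graph search (a sketch, not the project code).
    get_successors(state) -> iterable of (next_state, step_cost);
    priority_fn(g, state) -> number used to order the fringe."""
    tie = count()                               # tie-breaker so states are never compared
    fringe = [(priority_fn(0, start), next(tie), 0, start, [start])]
    expanded = {}                               # state -> cheapest g at which it was expanded
    while fringe:
        _, _, g, state, path = heapq.heappop(fringe)
        if is_goal(state):
            return path, g
        if state in expanded and expanded[state] <= g:
            continue                            # already expanded at no worse cost: skip
        expanded[state] = g
        for nxt, step_cost in get_successors(state):
            g2 = g + step_cost
            heapq.heappush(fringe, (priority_fn(g2, nxt), next(tie), g2, nxt, path + [nxt]))
    return None, float("inf")
```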
Lec3: Informed Search
Definition of heuristic: function that estimates how close a state is to a goal; Problem specific!
Example heuristics: (Relaxed-problem heuristic)
- Pancake flipping: heuristic = the number of the largest pancake that is still out of place
- Dot-Eating Pacman: heuristic = the sum of all edge weights in an MST over the dots and the current position
- Classic 8 Puzzle: heuristic = number of misplaced tiles
- Easy 8 Puzzle (tiles may move onto occupied squares, i.e. pile up in intermediate states): heuristic = total Manhattan distance
Remark: Can't use actual cost as heuristic, since you have to solve that first!
Comparison of algorithms:
- Greedy Search: expand the node that seems closest to the goal; orders by forward cost h(n); suboptimal
- UCS: expand the node closest to the start state; orders by backward cost g(n); optimal, but explores in all directions
- A* Search: orders by sum f(n) = g(n) + h(n)
A* Search
When to stop: only when we dequeue a goal (not when we enqueue it)
Admissible (optimistic) heuristic: \(\forall n, 0 \le h(n) \le h^*(n)\). A* Tree Search is optimal if heuristic is admissible. Proof: Step 1, Step 2.
Consistent heuristic: \(\forall A, B,\ h(A) - h(B) \le \text{cost}(A, B)\) (where B is a successor of A). A* Graph Search is optimal if the heuristic is consistent. Proof: Sketch, Step 1, Step 2
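Using the `graph_search` sketch above, the three strategies differ only in the priority function; `h` here is an assumed problem-specific heuristic, shown as a placeholder:

```python
h = lambda state: 0                               # placeholder heuristic; replace per problem

ucs_priority    = lambda g, state: g              # UCS: backward cost g(n)
greedy_priority = lambda g, state: h(state)       # Greedy: forward cost h(n)
astar_priority  = lambda g, state: g + h(state)   # A*: f(n) = g(n) + h(n)
```

Passing any of these as `priority_fn` recovers the corresponding strategy; A* is optimal when `h` is admissible (tree search) or consistent (graph search).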
Semi-Lattice of Heuristics
- Dominance: define \(h_a \ge h_c\) if \(\forall n, h_a(n) \ge h_c(n)\)
- Heuristics form a semi-lattice because the max of two heuristics is itself a heuristic: \(\forall h_a, h_b \in H,\ \max(h_a, h_b) \in H\) (and it dominates both)
- Bottom of lattice is zero-heuristic. Top of lattice is exact-heuristic
Lec 4-5: Constraint Satisfaction Problems
Definition of CSP Problems: (A special subset of search problems)
- State: Variables {Xi}, with values from domain D
- Goal Test: set of constraints specifying allowable combinations of values
Example of CSP Problems:
N-Queens
Formulation 1: Variables: Xij, Domains: {0, 1}, Constraints: \(\forall i, j, k, (X_{ij}, X_{ik}) \neq (1, 1), \cdots\) and \(\sum_{i, j}X_{ij} = N\)
Formulation 2: Variables Qk, Domains: {1, ..., N}, Constraints: \(\forall (i, j), \text{non-threatening}(Q_i, Q_j)\)
Cryptarithmetic
Constraint Graph:
- Circle nodes = Variables; Rectangular nodes = Constraints.
- If there is a relation between some variables, they are connected to a constraint node.
Simple Backtracking Search
- One variable at a time
- Check constraints as you go (i.e. only consider values that do not conflict with previous assignments)
Simple Backtracking Algorithm
= DFS + variable-ordering + fail-on-violation
Filtering & Arc Consistency
Definition: Arc \(X \rightarrow Y\) is consistent iff \(\forall x \in \text{Domain}(X), \exists y \in \text{Domain}(Y)\) that could be assigned without violating a constraint. (Enforcing the arc prunes the domain of the tail X using the head Y.)
Filtering: Forward Checking: Enforcing consistency of arcs pointing to each new assignment
Filtering: Constraint Propagation: If X loses a Value, neighbors of X need to be rechecked.
Usage: run arc consistency as a preprocessor or after each assignment
Algorithm with Runtime \(O(n^2d^3)\):
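A sketch in the AC-3 style, assuming `domains` maps each variable to a set of values, `neighbors` maps each variable to its adjacent variables in the constraint graph, and `satisfies(x, X, y, Y)` checks the binary constraint between X and Y (all names are illustrative):

```python
from collections import deque

def ac3(domains, neighbors, satisfies):
    """Enforce arc consistency; returns False if some domain becomes empty."""
    queue = deque((X, Y) for X in domains for Y in neighbors[X])
    while queue:
        X, Y = queue.popleft()
        removed = {x for x in domains[X]
                   if not any(satisfies(x, X, y, Y) for y in domains[Y])}
        if removed:
            domains[X] -= removed
            if not domains[X]:
                return False                     # a domain was wiped out: inconsistent
            for Z in neighbors[X]:
                if Z != Y:
                    queue.append((Z, X))         # X lost values: recheck arcs into X
    return True
```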
Advanced Definition: K-Consistency
K-Consistency: For each k nodes, any consistent assignment to k-1 nodes can be extended to kth node.
Example of being NOT 3-consistent:
Strong K-Consistency: also k-1, k-2, ..., 1-Consistent; Can be solved immediately without searching
Problem with arc consistency: it only enforces 2-consistency
Advanced Arc-Consistency: Ordering
Variable Ordering: MRV (Minimum Remaining Value): Choose the variable with fewest legal left values in domain
Value Ordering: LCV (Least Constraining Value): Choose the value that rules out fewest values in remaining variables.
(May require re-running filtering.)
Advanced Arc-Consistency: Observing Problem Structure
Suppose graph of n variables can be broken into subproblems with c variables: Can solve in \(O(\frac{n}{c}) \cdot O(d^c)\)
Suppose graph is a tree: Can solve in \(O(nd^2)\). Method as follows
- Remove backward: for i = n : 2, apply RemoveInconsistent(Parent(Xi), Xi)
- Assign forward: for i = 1 : n, assign Xi consistently with Parent(Xi)
- Remark: after the backward pass, all root-to-leaf arcs are consistent, so the forward assignment never backtracks.
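A sketch of this tree-CSP procedure, assuming `order` is a topological ordering of the variables rooted at `order[0]`, `parent[Xi]` gives each variable's parent, and `satisfies(P, pv, Xi, xv)` checks the parent-child constraint (all names are illustrative):

```python
def solve_tree_csp(order, parent, domains, satisfies):
    """Tree-structured CSP in O(n d^2): backward removal, then forward assignment."""
    # Backward pass: for i = n..2, enforce consistency of arc Parent(Xi) -> Xi
    for Xi in reversed(order[1:]):
        P = parent[Xi]
        domains[P] = {pv for pv in domains[P]
                      if any(satisfies(P, pv, Xi, xv) for xv in domains[Xi])}
        if not domains[P]:
            return None                          # some domain emptied: no solution
    # Forward pass: assign each variable consistently with its already-assigned parent
    assignment = {order[0]: next(iter(domains[order[0]]))}
    for Xi in order[1:]:
        pv = assignment[parent[Xi]]
        assignment[Xi] = next(xv for xv in domains[Xi] if satisfies(parent[Xi], pv, Xi, xv))
    return assignment
```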
Advanced Arc-Consistency: Improving Problem Structure
Idea: Initiate a variable and prune its neighbors' domains.
Method: instantiate a set of vars such that remaining constraint graph is a tree (cutset conditioning)
Runtime: \(O(d^c \cdot (n-c)d^2)\) to solve the CSP: instantiate the cutset (of size c) in all \(d^c\) ways, and for each instantiation solve the remaining tree-structured CSP in \((n-c)d^2\).
Iterative Methods: No Fringe!
Local Search
Algorithm: while not solved, randomly select any conflicted variable and reassign it by the min-conflicts heuristic (a sketch follows below).
Performance: can solve n-queens in almost constant time with high probability for arbitrary n; only a narrow range of problems (measured by the constraint-to-variable ratio) is hard.
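A min-conflicts sketch; the `num_conflicts(var, val, assignment)` helper (counting violated constraints if `var = val`) and the `max_steps` cap are assumptions for illustration:

```python
import random

def min_conflicts(variables, domains, num_conflicts, assignment, max_steps=100_000):
    """Iterative improvement: repeatedly reassign a conflicted variable
    to the value that minimizes the number of violated constraints."""
    for _ in range(max_steps):
        conflicted = [v for v in variables if num_conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment                    # no conflicts left: solved
        var = random.choice(conflicted)
        assignment[var] = min(domains[var],
                              key=lambda val: num_conflicts(var, val, assignment))
    return None                                  # give up after max_steps
```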
Hill-climbing
Simulated Annealing
Remark: Stationary distribution: \(p(x) \propto e^{\frac{E(x)}{kT}}\)
Genetic Algorithms
Lec 6: Simple Game Trees (Minimax)
Zero-Sum Games V.S. General Games: Opposite utilities v.s. Independent utilities
- Examples of Zero-Sum Games: Tic-tac-toe, chess, checkers, ...
Value of State: Best achievable outcome (utility) from that state.
- For MAX nodes: \(V(s) = \max_{s' \in \text{children}(s)} V(s')\); for MIN nodes: \(V(s) = \min_{s' \in \text{children}(s)} V(s')\)
Search Strategy: Minimax
Minimax properties:
- Optimal against a perfect opponent; not necessarily optimal against a sub-optimal opponent.
- Time: \(O(b^m)\), Space: \(O(bm)\)
Alpha-Beta Pruning
Algorithm:
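A minimal recursive sketch of minimax with alpha-beta pruning (the `is_terminal`, `successors`, and `utility` helpers are assumed placeholders, not the course's code):

```python
def alpha_beta(state, is_terminal, successors, utility, maximizing,
               alpha=float("-inf"), beta=float("inf")):
    """Minimax value of `state` with alpha-beta pruning."""
    if is_terminal(state):
        return utility(state)
    if maximizing:                               # MAX node: raise alpha, prune when v >= beta
        v = float("-inf")
        for child in successors(state):
            v = max(v, alpha_beta(child, is_terminal, successors, utility, False, alpha, beta))
            if v >= beta:
                return v                         # MIN above will never allow this branch
            alpha = max(alpha, v)
        return v
    else:                                        # MIN node: lower beta, prune when v <= alpha
        v = float("inf")
        for child in successors(state):
            v = min(v, alpha_beta(child, is_terminal, successors, utility, True, alpha, beta))
            if v <= alpha:
                return v                         # MAX above will never allow this branch
            beta = min(beta, v)
        return v
```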
Properties:
- Meaning of Alpha: the best (highest) value the MAX player can guarantee so far on the path to the root
- Meaning of Beta: the best (lowest) value the MIN player can guarantee so far on the path to the root
- Pruning has no effect on the minimax value of the root; values of intermediate nodes might be wrong
- With perfect ordering, time complexity drops to \(O(b^{m/2})\)
Depth-Limited Minimax: replace terminal utilities with an evaluation function for non-terminal positions
- Evaluation Functions: weighted sum of features observed
Iterative Deepening: run minimax with depth_limit = 1, 2, 3, ... until timeout
Lec 7: More Game Trees (Expectimax, Utilities, Multiplayer)
Expectimax Algorithm: like minimax, but chance nodes replace MIN nodes and take the expected (probability-weighted average) value of their children.
Assumptions V.S. Reality: Rational & Irrational Agents
Axioms of Rationality
MEU Principle
Given any preferences satisfying these constraints, there exists a real-valued function U such that: \(U(A) \ge U(B) \Leftrightarrow A \succeq B\), and \(U([p_1, S_1; \dots; p_n, S_n]) = \sum_i p_i U(S_i)\)
Risk-averse v.s. Risk-prone
Def. lottery \(L = [p, X;\ 1-p, Y]\), where \(U(L) = pU(X) + (1-p)U(Y)\) and \(U(EMV(L)) = U(pX + (1-p)Y)\)
- If \(U(L) < U(EMV(L))\), the agent is risk-averse; this happens when U is concave, e.g. \(y = \log_2 x\)
- If \(U(L) > U(EMV(L))\), the agent is risk-prone; this happens when U is convex, e.g. \(y = x^2\)
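Worked example (made-up numbers): take \(L = [0.5, \$100;\ 0.5, \$400]\) and \(U(x) = \log_2 x\). Then \(U(L) = 0.5\log_2 100 + 0.5\log_2 400 \approx 7.64\), while \(EMV(L) = \$250\) and \(U(EMV(L)) = \log_2 250 \approx 7.97\). Since \(U(L) < U(EMV(L))\), the log-utility agent is risk-averse: it prefers a sure \$250 over the lottery.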
Lec 8-9: Markov Decision Processes
MDP World: Noisy movement, maze-like problem, receives rewards.
- "Markov": Successor only depends on current state (not the history)
MDP World Definition:
- States, Actions
- Transition Function \(T(s, a, s')\) or \(Pr(s' | s, a)\), Reward Function \(R(s, a, s')\)
- Start State, (possibly) Terminal States
MDP Target: optimal policy \(\pi^*: S \rightarrow A\)
Discounting: earlier rewards are better! No infinite rewards!
- \(U([r_0, r_1, \cdots]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\text{max}} / (1-\gamma)\)
MDP Search Trees:
Value of State: expected utility starting in s and acting optimally.
\(V^*(s) = \max_a Q^*(s, a)\)
Value of Q-State: expected utility starting out having taken action a from state s and (thereafter) acting optimally.
\(Q^*(s, a) = \sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V^*(s')]\)
Optimal Policy: \(\pi^*(s)\)
Solving MDP Equations: Value Iteration
- Bellman Equation: \(V^*(s) = \max_a \sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V^*(s')]\)
- Value Calculation: \(V_{k+1}(s) \leftarrow \max_a \sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V_k(s')]\)
- Policy Extraction: \(\pi^*(s) = \arg \max_{a} \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^*(s')]\)
- Complexity (of each iteration): \(O(S^2A)\)
- Must converge to optimal values. Policy may converge much earlier.
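A compact value-iteration sketch of the updates above; the tabular layout (`T[s][a]` as a list of `(prob, next_state, reward)` triples, `actions(s)` returning a non-empty list) is an assumption for illustration:

```python
def value_iteration(states, actions, T, gamma=0.9, iters=100):
    """V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                    for a in actions(s))
             for s in states}
    # Policy extraction: pi*(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V*(s')]
    pi = {s: max(actions(s),
                 key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
          for s in states}
    return V, pi
```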
Solving MDP Equations: Q-Value Iteration
- Bellman Equation: \(Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^*(s', a')]\)
- Policy Extraction: \(\pi^*(s) = \arg \max_{a} Q^*(s, a)\)
MDP Policy Evaluation: evaluating V for a fixed policy \(\pi\)
- Idea 1: remove the max'es from Bellman, iterating \(V_{k+1}(s) \leftarrow \sum_{s'}T(s, \pi(s), s')[R(s, \pi(s), s') + \gamma V_k(s')]\)
- Idea 2: without the max'es, the Bellman updates form a linear system; solve it exactly with a linear system solver (see the sketch below).
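A sketch of Idea 2 with NumPy, assuming matrix/vector layouts \(T^\pi[s, s'] = P(s' \mid s, \pi(s))\) and \(R^\pi[s]\) = expected immediate reward under \(\pi\):

```python
import numpy as np

def evaluate_policy(T_pi, R_pi, gamma=0.9):
    """Solve V = R_pi + gamma * T_pi @ V, i.e. (I - gamma T_pi) V = R_pi."""
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
```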
Solving MDP Equations: Policy Iteration
Idea: interleave policy evaluation and policy improvement; often converges much faster than value iteration.
Algorithm:
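A hedged policy-iteration sketch (same assumed `T[s][a]` / `actions(s)` layout as the value-iteration sketch above; here policy evaluation uses truncated iteration rather than an exact solve):

```python
def policy_iteration(states, actions, T, gamma=0.9, eval_iters=50):
    """Alternate policy evaluation (fixed pi) and greedy policy improvement."""
    pi = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: Bellman updates without the max, under fixed pi
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][pi[s]])
                 for s in states}
        # Policy improvement: one-step lookahead, greedy with respect to V
        new_pi = {s: max(actions(s),
                         key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
                  for s in states}
        if new_pi == pi:
            return pi, V                             # policy stable: done
        pi = new_pi
```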
Lec 10-11: Reinforcement Learning
Intuition: Suppose we know nothing about the world. Don't know T or R.
Passive RL I: Model-Based RL
- Count outcomes s' for each (s, a) to estimate \(\hat{T}(s, a, s')\); record rewards to estimate \(\hat{R}(s, a, s')\)
- Solve the estimated MDP with any iteration method
- Run the policy; if not satisfied, collect more data and go to step 1
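A sketch of the counting/estimation step, assuming `episodes` is a flat list of observed `(s, a, s', r)` transitions:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T_hat(s,a,s') by normalized counts and R_hat(s,a,s') by sample averages."""
    counts  = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = defaultdict(list)                       # (s, a, s') -> [r, ...]
    for s, a, s2, r in episodes:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)].append(r)
    T_hat = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    R_hat = {sas: sum(rs) / len(rs) for sas, rs in rewards.items()}
    return T_hat, R_hat
```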
Passive RL II: Model-Free RL (Direct Evaluation, Sample-Based Bellman Updates)
Intuition: direct evaluation from samples; improve our estimate of V by averaging the sample returns.
Input: fixed policy \(\pi(s)\)
Act according to \(\pi\). Each time we visit a state, write down what the sum of discounted rewards turned out to be.
Averaging these samples gives an estimate of \(V(s)\)
Problem: it wastes information about state connections; each state is learned separately, so it takes a long time.
Passive RL II: Model-Free RL (Temporal Difference Learning)
Intuition: learn from everywhere / every experience.
Recent samples are more important (exponential moving average): \(\hat{x}_n = (1-\alpha) \hat{x}_{n-1} + \alpha x_n\)
Update: \(V^\pi(s) \leftarrow (1-\alpha) V^\pi(s) + \alpha\,[R(s, \pi(s), s') + \gamma V^\pi(s')]\)
With a decreasing learning rate \(\alpha\), the value estimates converge
Problem: Can't do policy extraction, can't calculate \(Q(s, a)\) without T or R
Idea: learn Q-values directly! Make action selection model-free too!
Passive RL III: Model-Free RL (Q-Learning + Temporal Difference Learning)
- Intuition: Learn as you go.
- Update:
- Receive a sample (s, a, s', r)
- Let sample = \(R(s, a, s') + \gamma \max_{a'} Q(s', a')\)
- Incorporate the new estimate into a running average: \(Q(s, a) \leftarrow (1 - \alpha) Q(s,a) + \alpha \cdot \text{sample}\)
- Another representation: \(Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \text{difference}\), where difference = sample − current estimate \(Q(s, a)\)
- This is off-policy learning!
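A tabular Q-learning sketch of this update; the `defaultdict` Q-table and the `legal_actions(s)` helper are assumptions for illustration:

```python
from collections import defaultdict

Q = defaultdict(float)                                # (state, action) -> value estimate

def q_update(s, a, s2, r, legal_actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s', a')]."""
    sample = r + gamma * max((Q[(s2, a2)] for a2 in legal_actions(s2)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```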
Active RL: How to act to collect data
- Exploration schemes: eps-greedy (a code sketch follows this list)
- With probability \(\epsilon\), act randomly from all options
- With probability \(1 - \epsilon\), act on current policy
- Exploration functions: use an optimistic utility instead of the real utility
- Def. optimistic utility \(f(u, n) = u + k / n\), suppose u = value estimate, n = visit count
- Modified Q-Update: \(Q(s, a) \leftarrow_\alpha R(s, a, s') + \gamma \max_{a'} f(Q(s', a'), N(s', a'))\)
- Measuring total mistake cost (regret): the sum of the differences between your (expected) rewards and the optimal (expected) rewards.
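An epsilon-greedy action-selection sketch (reusing the assumed tabular `Q` and `legal_actions` from the Q-learning sketch above):

```python
import random

def epsilon_greedy(s, legal_actions, Q, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on Q."""
    acts = list(legal_actions(s))
    if random.random() < epsilon:
        return random.choice(acts)                   # explore
    return max(acts, key=lambda a: Q[(s, a)])        # exploit current estimates
```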
Scaling up RL: Approximate Q Learning
State space too large & sparse? Use linear functions to approximately learn \(Q(s,a)\) or \(V(s)\)
Definition: \(Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + ...\)
Q-learning with linear Q-functions:
Transition := (s, a, r, s')
Difference := \([r + \gamma \max_{a'}Q(s', a')] - Q(s, a)\)
Approx. weight update: \(w_i \leftarrow w_i + \alpha \cdot \text{Difference} \cdot f_i(s, a)\)
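A sketch of this linear approximate Q-learning update, assuming `features(s, a)` returns a dict mapping feature names to values (all names are illustrative):

```python
def approx_q(weights, features, s, a):
    """Q(s,a) = sum_i w_i * f_i(s,a)."""
    return sum(weights.get(f, 0.0) * v for f, v in features(s, a).items())

def approx_q_update(weights, features, s, a, r, s2, legal_actions, alpha=0.01, gamma=0.9):
    """w_i <- w_i + alpha * difference * f_i(s,a)."""
    best_next = max((approx_q(weights, features, s2, a2) for a2 in legal_actions(s2)),
                    default=0.0)
    difference = (r + gamma * best_next) - approx_q(weights, features, s, a)
    for f, v in features(s, a).items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v
```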
Lec 12: Probability
N/A
Lec 13: Bayes Nets
N/A