Summary of "Lec 20 - Cauchy Sequence and Green's Equation"
Main ideas & lessons
-
Recap of MDP setup and Bellman optimality
-
The instructor previously defined a Markov Decision Process (MDP) using the tuple [ (S, A, P, R, \gamma). ] (They also describe it earlier informally as “MDPs are defined by (S, A)…”.)
-
The Bellman equation for the value function (v) is referenced, along with an optimality equation.
- A key assumption about the optimality equation is that the optimal policy yields the same “max” term, i.e., the maximization over actions behaves consistently under the formulation.
-
-
Deterministic optimal policy assumption
- The course assumes the optimal policy can be taken as deterministic.
- Notation/representation:
- Instead of writing (\pi(a \mid s)), the instructor uses (\pi(s)) (an abuse of notation).
- Here, (\pi(s)) is treated as the single action chosen in state (s) for a deterministic policy.
- More generally: a policy is stochastic if it corresponds to a distribution over actions; a deterministic policy assigns probability (1) to one action and (0) to all others.
- Key claim:
- Even if there may exist multiple stochastic optimal policies, the instructor states the assumption that if an MDP has an optimal policy, then at least one optimal deterministic policy exists.
-
Plan for next step: existence/uniqueness and convergence of solutions
- The instructor focuses on proving existence and uniqueness of solutions to the value-function equations, using light functional analysis / linear algebra.
- The discussion targets finite MDPs.
Detailed methodology / instruction-like content
A) Restrict attention to finite MDPs
Assume:
- State space (S) is finite
- Action space (A) is finite
- Expected returns are bounded
Then:
- The value function can be viewed as a vector in an (|S|)-dimensional vector space.
B) Vector-space and norm basics (for the later proof)
- Define a norm (|x|) on vectors (conceptually, a notion of “size”).
- Required norm properties:
- Nonnegativity / definiteness: (|x| \ge 0), and (|x|=0) iff (x=0)
- Homogeneity: (|\alpha x| = |\alpha| |x|)
- Triangle inequality: (|x+y| \le |x| + |y|)
- Mentioned norm:
- The max norm is the norm intended for the value-function space.
C) Define Cauchy sequences and completeness (for convergence arguments later)
-
A Cauchy sequence ({x_k}) in a normed vector space satisfies:
- For every (\varepsilon > 0), there exists (N) such that for all (m,n \ge N), [ |x_m - x_n| < \varepsilon. ]
-
Completeness:
- A normed vector space is complete if every Cauchy sequence converges to a point in the space.
- Intuition/example style:
- If a space is not complete, a Cauchy sequence might converge to a limit outside the space.
- Example idea: a sequence like (x_n = 1/n) might converge, while a deliberately defined space could exclude the limit point (so convergence fails to land in the domain).
D) Translate Bellman equations into linear algebra objects
Under a fixed policy (\pi), define:
-
Reward vector (R^\pi)
- (R^\pi(s)) is the expected immediate (one-step) reward from state (s) when following (\pi).
-
Transition matrix (P^\pi)
- (P^\pi(s,j)) is the probability of moving from state (s) to state (j) in one step under policy (\pi).
Deterministic simplification:
- If (\pi) is deterministic, the “sum over actions” disappears.
- Then (P^\pi(s,j)) corresponds to transitions induced by the single action (\pi(s)).
E) Properties of (P^\pi) as a stochastic matrix
- (P^\pi) is a stochastic matrix:
- entries are nonnegative
- each row sums to (1) (each row is a discrete probability distribution)
- Discount factor assumption:
- (\gamma \in [0,1)) (i.e., (0 \le \gamma < 1)); the later reasoning relies on (\gamma < 1).
F) Derive the “Green’s equation” (linear system) for (v_\pi)
Interpret the Bellman equation as a one-step decision with a terminal cost interpretation:
- Take one action, then “stop,” and receive:
- the one-step reward, plus
- a terminal payoff equal to the value at the next state.
- Under this view, (v_\pi) behaves like the terminal payoff/value function.
Core linear equation: [ v_\pi = R^\pi + \gamma P^\pi v_\pi ]
Rearrange into a linear system: [ (I - \gamma P^\pi)v_\pi = R^\pi ]
Solve by inversion: [ v_\pi = (I - \gamma P^\pi)^{-1} R^\pi ]
G) Existence and uniqueness via invertibility
Claim: (I - \gamma P^\pi) is invertible, so (v_\pi) has a unique solution.
Reasoning steps (as presented):
- For a stochastic matrix (P^\pi), the largest eigenvalue is (1).
- Then the largest eigenvalue of (\gamma P^\pi) is (\gamma).
- Since (\gamma < 1), (I - \gamma P^\pi) has no zero eigenvalue.
- If a matrix has no zero eigenvalue, it is invertible, yielding uniqueness.
Alternative justification mentioned:
- A matrix is not invertible iff its determinant is 0.
- The determinant is linked to eigenvalues (product of eigenvalues), so if there is no zero eigenvalue, the determinant is not zero.
H) Convergence viewpoint (iterative method idea)
Another way to argue uniqueness/existence:
- Consider an iteration:
- Start with an arbitrary guess (v_0)
- Produce (v_1, v_2, \dots) using the Bellman/linear update
- The instructor claims:
- the iteration will converge to (v_\pi),
- and (v_\pi) is unique.
- The iteration/convergence approach is said to be useful later when handling the optimality equation.
I) Course pacing note
- The instructor says they will likely continue in a second class to complete the intended details.
Speakers / sources featured
Speaker
- Unspecified instructor/lecturer (the only human voice in the subtitles)
Sources mentioned
- Wikipedia (for examples such as non-convergent Cauchy sequences)
- Stack Exchange / Stack Exchange-like resources (suggested for eigenvalues/invertibility relationships)
Category
Educational
Share this summary
Is the summary off?
If you think the summary is inaccurate, you can reprocess it with the latest model.