Time-domain game interpretation of optimal control
1. The “Game” Setup
We define the system, the players, and the rules.
System Dynamics
\[x_{k+1} = A x_k + B_1 w_k + B_2 u_k.\]We use a standard LQR-like output for simplicity so that \(\|z_k\|_2^2 = x_k^T Q x_k + u_k^T R u_k\):
\[z_k = C_1 x_k + D_{12} u_k = \begin{bmatrix} Q^{1/2} x_k \\ R^{1/2} u_k \end{bmatrix}\]Players
- Controller (\(u_k\)): tries to minimize the cost.
- Disturbance (\(w_k\)): tries to maximize the cost.
The Game’s Objective
The \(H_\infty\) goal is to ensure the energy gain is bounded: \(\sum_{k=0}^{\infty} \|z_k\|_2^2 < \gamma^2 \sum_{k=0}^{\infty} \|w_k\|_2^2\). Define:
\[J = \sum_{k=0}^{\infty} \left( \|z_k\|_2^2 - \gamma^2 \|w_k\|_2^2 \right)\]This can be rewritten as a zero-sum game:
- The controller \(u_k\) tries to minimize the score.
- The disturbance \(w_k\) tries to maximize it.
2. The “Scorecard” (Dynamic Programming)
We apply dynamic programming to find the “value” of the game.
Value Function
Define the cost-to-go:
\[V(x_k) = x_k^T P x_k\]Bellman Equation
\[V(x_k) = \min_{u_k} \max_{w_k} \left\{ \|z_k\|_2^2 - \gamma^2 \|w_k\|_2^2 + V(x_{k+1}) \right\}\]Substitute the known quantities: Why the game cost is formulated as a min-max rather than a max-min program?
When analyzing the discrete-time or continuous-time \(H_\infty\) control problem, the dynamic programming approach leads to a min–max optimization: the controller minimizes cost while the disturbance maximizes it.
A natural question is: what happens if we flip the order to max–min?
1. Meaning of Changing the Order
-
Standard formulation (min–max):
The controller picks \(u_k\) to minimize the worst-case effect of \(w_k\). This corresponds to the full information control:
\[V(x_k) = \min_{u_k} \max_{w_k} \left\{ \ell(x_k,u_k,w_k) + V(x_{k+1}) \right\}.\] -
Flipped formulation (max–min):
The controller must decide first (announce \(u_k\) or even a strategy), and the disturbance reacts after observing it. This means the controller don’t know about the disturbance at current step:
\[V(x_k) = \max_{w_k} \min_{u_k} \left\{ \ell(x_k,u_k,w_k) + V(x_{k+1}) \right\}.\]
If the cost function is convex in \(u\) and concave in \(w\), then the two formulations are equivalent due to the minimax theorem (saddle point exists). Otherwise, the order matters — typically, the controller is worse off when forced to move first (a Stackelberg-type disadvantage).
2. Quadratic Stage Cost
For the standard (H_\infty) setting, define
\[L(u,w) = x^\top Q x + u^\top R u - \gamma^2 w^\top w + (A x + B_1 w + B_2 u)^\top P (A x + B_1 w + B_2 u).\]Here (P) is the value function matrix at the next step in the dynamic programming recursion.
3. Convex–Concave (Isaacs) Condition
Compute the Hessians with respect to \(u\) and \(w\):
\[\frac{\partial^2 L}{\partial u^2} = 2(R + B_2^\top P B_2), \qquad \frac{\partial^2 L}{\partial w^2} = -2(\gamma^2 I - B_1^\top P B_1).\]Thus:
- \(L\) is convex in \(u\) if \(R + B_2^\top P B_2 \succ 0\).
- \(L\) is concave in \(w\) if \(\gamma^2 I - B_1^\top P B_1 \succ 0\).
This last inequality is the Isaacs condition (or the saddle-point condition). When it holds, we can swap the order of optimization:
\[\min_u \max_w L(u,w) = \max_w \min_u L(u,w),\]and the \(H_\infty\) Riccati equation derived via dynamic programming remains valid.
4. Intuitive Interpretation
-
If the Isaacs condition holds, the disturbance’s energy is bounded by \(\gamma\), ensuring that the problem has a well-defined saddle point.
The order of min–max does not matter. -
If the condition fails, \(L\) is no longer concave in \(w\); the disturbance can exploit the controller’s pre-commitment, and
the max–min value (controller as leader) becomes larger — meaning worse performance for the controller.
This is analogous to the Stackelberg vs. Nash distinction in game theory.
5. Practical Check
After solving the Riccati or LMI for a given \(\gamma\), test:
\[M = \gamma^2 I - B_1^\top P B_1.\]If \(M \succ 0\), then the Isaacs condition holds and
\(\min\max = \max\min\).
If \(M\) has nonpositive eigenvalues, the saddle-point equivalence fails.
3. Solving the Game (finding the moves)
We denote the expression inside as \(L\):
\[L = x_k^T Q x_k + u_k^T R u_k - \gamma^2 w_k^T w_k + (A x_k + B_1 w_k + B_2 u_k)^T P (A x_k + B_1 w_k + B_2 u_k)\]Step 1: The Disturbance’s Move (maximize \(L\))
\[\frac{\partial L}{\partial w_k} = -2\gamma^2 w_k + 2 B_1^T P (A x_k + B_1 w_k + B_2 u_k) = 0\]Solve for \(w_k^*\): \(w_k^* = (\gamma^2 I - B_1^T P B_1)^{-1} B_1^T P (A x_k + B_2 u_k)\)
Key insight:
This is the worst-case disturbance, dependent on both \(x_k\) and \(u_k\).
Step 2: The Controller’s Move (minimize \(L\))
\(\frac{\partial L}{\partial u_k} = 2R u_k + 2 B_2^T P (A x_k + B_1 w_k + B_2 u_k) = 0\) Solve for \(u_k^*\): \(u_k^* = -(R + B_2^T P B_2)^{-1} B_2^T P (A x_k + B_1 w_k)\)
Problem: \(u_k^*\) depends on \(w_k\), which is not known in real time.
Solution: Assume the controller uses state feedback: \(u_k = -K x_k\) and the disturbance knows this strategy. (This is equivalent to the max-min formulation.)
- Note: we still don’t know how this assumption should be changed when \(u_k\) must subject to some strict causality on \(x_k\).
4. The Result: The \(H_\infty\) Riccati Equation
We plug the worst-case disturbance \(w_k^*\) into \(L\), then find the \(u_k\) minimizing the result.
After algebraic simplification (as in Doyle et al.), the Bellman equation yields the Discrete-Time Algebraic Riccati Equation (DARE) for \(H_\infty\) control:
where \(R_{H_\infty} = \begin{bmatrix} -\gamma^2 I & 0 \\ 0 & R \end{bmatrix}.\)
5. LQR vs. \(H_\infty\) (a simple comparison)
| Control Type | Riccati Equation | Interpretation |
|---|---|---|
| LQR (H₂) | \(P = Q + A^T P A - A^T P B_2 (R + B_2^T P B_2)^{-1} B_2^T P A\) | Controller minimizes its own cost |
| \(H_\infty\) | \(P = Q + A^T P A - A^T P B_{\text{all}} R_{\text{all}}^{-1} B_{\text{all}}^T P A\) | Controller competes with disturbance (B_1 w) |
Here, \(B_{\text{all}}\) and \(R_{\text{all}}\) combine the control and disturbance effects:
\[B_{\text{all}} = [B_1 \; B_2], \quad R_{\text{all}} = \begin{bmatrix} -\gamma^2 I & 0 \\ 0 & R \end{bmatrix}\]If there is \(Y\) and \(X\succ 0\) such that the following LMI holds:
\[\begin{bmatrix} X & (A X + B_2 Y)^{\top} & B_1^{\top} & (C_1 X + D_{12} Y)^{\top} \\ A X + B_2 Y & X & 0 & 0 \\ B_1 & 0 & \gamma^2 I & 0 \\ C_1 X + D_{12} Y & 0 & 0 & I \end{bmatrix} \succ 0,\]then the following state-feedback control policy \(K = Y X^{-1}\) ensures that \(J < 0\) for all \(w \in \ell_2\), i.e. \(\|T_{zw}\|_\infty< \gamma\).
Interpretation
- The \(-\gamma^2 I\) term acts as a negative cost, representing the disturbance player’s gain.
- The controller must design feedback strong enough to counteract it, ensuring that the closed-loop system remains stable.
- The \(H_\infty\) Riccati equation is just the LQR Riccati equation modified to include the negative cost of the maximizing disturbance player, scaled by \(\frac{1}{\gamma^2}\).