Optimality Proof of the Gittins Index via the Prevailing Charge Argument of Weber (1992)
Following the discrete-time Bandit Process definition, consider a total of $n$ such discrete-time Bandit Processes, with states $x_1(t), \ldots, x_n(t)$. Exactly one arm may be pulled at each time $t$: taking action $a_j$, $j \in [n]$, corresponds to pulling arm $j$ and generates reward $R_j(x_j(t))$; the pulled arm's state evolves, while the states of all other arms remain frozen, i.e. $x_i(t+1) = x_i(t)$ for $i \neq a(t)$. The aim is to maximize the expected total discounted reward $\mathbb{E}\left[\sum_{t=0}^\infty \beta^t R_{a(t)}(x_{a(t)}(t))\right]$ with discount factor $\beta < 1$, given the initial state $\vec x(0)$. The Gittins Index Policy pulls the arm $j$ with the highest Gittins Index $\nu_j(x_j(t))$: $$ a(t) = \arg\max_j \nu_j(x_j(t)), $$ and this policy is known to be optimal.
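As a concrete illustration, the Gittins Index of a finite-state arm can be computed via the restart-in-state formulation of Katehakis and Veinott: solve an auxiliary MDP in which, at every step, one may either continue the arm from its current state or restart it from the reference state, and read off $\nu(i) = (1-\beta)V^{(i)}(i)$. The sketch below is not from the text; the function names, the value-iteration tolerance, and the $(1-\beta)$ calibration of the index are assumptions of this example.

```python
import numpy as np

def gittins_index(P, R, beta, state, tol=1e-9):
    """Gittins Index of `state` for a single arm with transition
    matrix P (n x n) and reward vector R (length n), via the
    restart-in-state MDP: at each step, either continue from the
    current state or restart from `state`.  (Illustrative sketch.)"""
    V = np.zeros(len(R))
    while True:
        cont = R + beta * (P @ V)                   # keep playing the arm
        restart = R[state] + beta * (P[state] @ V)  # jump back to `state`
        V_new = np.maximum(cont, restart)
        if np.max(np.abs(V_new - V)) < tol:
            return (1 - beta) * V_new[state]        # index calibration
        V = V_new

def gittins_policy(arms, states, beta):
    """Pull the arm with the largest index at its current state.
    `arms` is a list of (P, R) pairs, `states` the current states."""
    indices = [gittins_index(P, R, beta, s) for (P, R), s in zip(arms, states)]
    return int(np.argmax(indices))
```

For instance, an arm that always stays in a state paying reward $r$ has index exactly $r$, and an arm paying $1$ once before absorbing into a zero-reward state has index $1$ in its initial state, since the optimal stopping time in the index definition stops after one pull.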