Gittins Index Closed-form Calculation
Definition. Consider a discrete-time bandit process with a single arm (dropping the arm index $i$ for convenience): Markov state transitions, a binary action, and a reward $r(x(t))$ associated with the positive action (a 'pull') at each time $t = 1, 2, \ldots$ The state $x(t)$ does not change while the arm is idle. The goal is to maximize the discounted reward criterion
$$ \text{Reward} := \mathbb E\Big[\sum_t \beta^t r(x(t))\Big], $$
where $x(\cdot)$ denotes the state process. The Gittins Index $v(x)$ is defined for each state $x$ as
$$ v(x) := \sup_{\tau > 0} \frac{\mathbb E\big[\sum_{t=1}^{\tau} \beta^t r(x(t)) \mid x(0) = x\big]}{\mathbb E\big[\sum_{t=1}^{\tau} \beta^t \mid x(0) = x\big]}. $$
Note that $\tau$ is a past-measurable stopping time, so the expectation is taken with respect to the randomness of the trajectory $x(1), x(2), \ldots$, on which $\tau$ is allowed to depend.
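
Although the definition involves a supremum over all past-measurable stopping times, for a finite-state arm the index can be computed exactly. Below is a minimal sketch (not part of the original text) based on the restart-in-state characterization of Katehakis and Veinott (1987), solved by value iteration; the names `gittins_indices`, `P`, and `r` are illustrative choices, not anything fixed by the text above.

```python
import numpy as np

def gittins_indices(P, r, beta, tol=1e-10, max_iter=100_000):
    """Gittins index of every state of a finite-state arm.

    Restart-in-state characterization (Katehakis & Veinott, 1987):
    the index of state i equals (1 - beta) times the optimal value at i
    of an auxiliary MDP in which, at every step, one may either continue
    from the current state or act as if restarted from state i.

    P    : (n, n) row-stochastic transition matrix
    r    : (n,)  reward vector (reward earned when pulling in each state)
    beta : discount factor in (0, 1)
    """
    n = len(r)
    idx = np.empty(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(max_iter):
            cont = r + beta * (P @ V)            # keep playing from the current state
            restart = r[i] + beta * (P[i] @ V)   # behave as if back in state i
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        idx[i] = (1 - beta) * V[i]
    return idx

# Two-state example: state 0 pays 1 per pull, state 1 pays nothing.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 0.0])
print(gittins_indices(P, r, beta=0.9))
```

One caveat on conventions: the sketch sums over $t = 0, \ldots, \tau - 1$, with the reward collected before each transition. Under the convention above (first reward $r(x(1))$, collected after one transition), the common factor $\beta$ cancels in the ratio, and by the tower property the same routine applies with the one-step-ahead reward `P @ r` substituted for `r`.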