Consider an RMAB instance with $N$ arms, where each arm $i \in [N]$ has a finite state space $\mathbb S_i$ and can receive an action $y_i^t \in {0, 1}$ (representing not pulling or pulling the arm, respectively) at each time step $t$. The state of arm $i$ at time $t$ is denoted by $s_i^t$. Depending on the action taken, a reward $r_i(s_i^t, y_i^t)$ is accrued. As a decision maker, our objective is to maximize the averaged total reward over an infinite time horizon, under a constraint that only $B$ arms can be pulled at any time step. This setup can be formalized as follows: $$ \max_{\pi::\mathbb S \to Y} \lim_{T \to \infty} \frac{1}{T} \mathbb{E} [\sum_{t = 1}^T \sum_{i\in [N]} r_i(s_i^t, y_i^t)]\ \text{s.t. } \sum_{i\in [N]} y_i^t \le B, \forall t $$ Consider a relaxed version where the ‘hard’ budget constraint is averaged over time: $$ \max_{\pi::\mathbb S \to Y} \lim_{T \to \infty} \frac{1}{T} \mathbb{E} [\sum_{t = 1}^T \sum_{i\in [N]} r_i(s_i^t, y_i^t)]\ \text{s.t. } \frac1T\mathbb E[\sum_{i\in [N]} y_i^t ]\le B $$ Let $\lambda$ be the Lagrangian multiplier for the relaxed budget constraint $\frac1T\mathbb E[\sum_{i\in [N]} y_i^t ]\le B$. The duality transforms into the following $$ \min_{\lambda} \max_{\pi::\mathbb S \to Y} \lim_{T \to \infty} \frac{1}{T} \left( \mathbb{E} \left[\sum_{t=1}^T \sum_{i \in [N]} r_i(s_i^t, y_i^t)\right] - \lambda \left(\mathbb{E} \left[\sum_{t=1}^T \sum_{i \in [N]} y_i^t\right] - TB\right) \right) $$ Notice that we can decouple the above optimization problem for every arm, by leaving out the budget constraint on the outside: $$ \begin{align} & \lim_{T \to \infty} \frac{1}{T} \left( \mathbb{E} \left[\sum_{t=1}^T \sum_{i \in [N]} r_i(s_i^t, y_i^t)\right] - \lambda \left(\mathbb{E} \left[\sum_{t=1}^T \sum_{i \in [N]} y_i^t\right] - TB\right) \right)\cr = & \lim_{T \to \infty} \frac{1}{T}\left(\sum_{i\in [N]} \left( \mathbb{E} \left[\sum_{t=1}^T r_i(s_i^t, y_i^t)\right] - \lambda \mathbb{E} \left[\sum_{t=1}^T y_i^t\right]\right) - \lambda TB\right) \end{align} $$ So, for each $i$, consider the inner maximization. $$ \max_{\pi::\mathbb S_i \to Y_i} \left(\lim_{T \to \infty} \frac{1}{T} \left( \mathbb{E} \left[\sum_{t=1}^T r_i(s_i^t, y_i^t)\right] - \lambda \left(\mathbb{E} \left[\sum_{t=1}^T y_i^t\right] - TB_i\right) \right)\right) $$ Notice that, the last term $\lambda TB$ can be dropped for each individual $i$ because it’s irrelevant with the inner maximization. Let: $$ \phi_i(\lambda) :=\max_{\pi::\mathbb S_i \to Y_i}\left(\lim_{T \to \infty} \frac{1}{T} \left( \mathbb{E} \left[\sum_{t=1}^T r_i(s_i^t, y_i^t)\right] - \lambda \mathbb{E} \left[\sum_{t=1}^T y_i^t\right] \right)\right) $$ The Lagrangian multiplier here now has an economic intepretation – the price charged for pulling the arm. Consider the optimal stationay policy $\pi_i^*$ for the above optimization problem. Define an arm $i$’s indexability:
...