These are my reading notes on “Strategic Experimentation with Exponential Bandits” by Godfrey Keller, Sven Rady and Martin Cripps, published in Econometrica in 2005. I think it is a really cool theoretical economics paper: the math is solid, elegant and inspiring.
For more details, please refer to the original paper. The following are notes on the model and results, for anyone who wants to quickly grasp the intuition.
model
The baseline is a continuous-time two-armed bandit:
One arm S is ‘safe’ and yields a known deterministic flow payoff whenever it is played; the other arm R is ‘risky’ and can be either ‘bad’ or ‘good’. If it is bad, then it always yields 0. If it is good, then it yields 0 until a ‘breakthrough’ occurs. This happens once the total time that the arm has been used reaches some random threshold that is exponentially distributed; once it happens, the arm yields a known positive (fixed) flow payoff forever.
The paper assumes that if the risky arm R is discovered to be good, then R is strictly better than S. This gives agents a motive to experiment with the risky action, in the hope of discovering that R is indeed good.
Formally, for continuous time $t\in [0, \infty)$ and discount rate $r > 0$, let the flow payoff of the safe arm S be $s$ and the flow payoff of a good risky arm R be $g$ (with $0 < s < g$). Importantly, the total time for which a good risky arm must be used before it generates the payoff $g$ is exponentially distributed with parameter $\lambda > 0$ (i.e. with mean $\frac{1}{\lambda}$).
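A useful consequence of this setup (my own addition, derived directly from the model above via Bayes’ rule, not quoted from the paper): since a bad arm never produces a breakthrough while a good one does so at rate $\lambda$, an agent who has played R for a total time $t$ without seeing a breakthrough holds the posterior belief $$ p_t = \frac{p_0 e^{-\lambda t}}{p_0 e^{-\lambda t} + 1 - p_0}, $$ which drifts down as long as experimentation goes unrewarded and jumps to $1$ at the first breakthrough.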
Each agent starts with a prior belief $p_0$ that the risky arm is good. At every time point $t$, let $k_t$ indicate the agent’s choice between S ($k_t = 0$) and R ($k_t = 1$). Her flow payoff at this time is $$ (1 - k_t)s + k_t \pi_t, \text{ where } \pi_t\in \lbrace 0, g\rbrace. $$ The agent’s overall objective is to choose a strategy $\lbrace k_t \rbrace_{t\ge 0}$ that maximizes $$ \mathbb E\left[\int_0^\infty re^{-rt}\left[(1 - k_t)s + k_t \pi_t\right]\, \text{d}t\mid p_0\right]. $$ As the paper notes, the strategy is of course subject to the constraint that the action taken at any time $t$ be measurable w.r.t. the information available at that time.
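To make the objective concrete, here is a minimal Monte Carlo sketch (my own illustration, not code from the paper) that estimates the expected discounted payoff of the naive strategy of always playing R and compares it with the closed form $p_0\, g\, \lambda/(\lambda + r)$ implied by the setup above. The parameter values are arbitrary and only need to satisfy $0 < s < g$.

```python
import numpy as np

# A minimal Monte Carlo sketch of the single-agent objective under the naive
# strategy "always play R".  Parameter names follow the notes above
# (r, s, g, lambda, p0); the numerical values are arbitrary illustrations.
rng = np.random.default_rng(0)
r, s, g, lam, p0 = 0.1, 0.5, 1.0, 0.3, 0.6   # requires 0 < s < g

def value_always_risky(n_sims: int = 200_000) -> float:
    """Estimate E[ ∫ r e^{-rt} π_t dt | p0 ] when k_t ≡ 1 (always play R).

    If R is good, the breakthrough time τ is Exp(lam) and the flow g from τ on
    is worth g·e^{-rτ} under the r·e^{-rt} normalization; if R is bad, the
    payoff is 0.
    """
    good = rng.random(n_sims) < p0                 # is the risky arm good?
    tau = rng.exponential(1.0 / lam, size=n_sims)  # breakthrough times, mean 1/lam
    return float(np.where(good, g * np.exp(-r * tau), 0.0).mean())

print(value_always_risky())         # Monte Carlo estimate
print(p0 * g * lam / (lam + r))     # closed form: p0·g·λ/(λ+r) = 0.45 here
# Under the same normalization, playing the safe arm forever is worth exactly
# s = 0.5, so with these numbers a fully committed experimenter would prefer S.
```

The closed form comes from $\mathbb E\left[e^{-r\tau}\right] = \frac{\lambda}{\lambda + r}$ for $\tau\sim\text{Exp}(\lambda)$, combined with the prior probability $p_0$ that the arm is good at all.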
when there is more than one agent
Players share the same prior belief and the same discount rate, each faces a replica of the two-armed bandit above, and all information is public. The risky arms are of the same (unknown) type; conditional on being good, their time thresholds are i.i.d.
To be continued…