MAB UCB1 No-Regret Proof
This is beautiful, beautiful, beautiful!
Evoking the rush of water, the stroke of oars and the motion of the ocean, the Barcarolle was a folk song sung by Venetian gondoliers (the word comes from “barca”, meaning “boat”). Characterised by a rocking rhythm, suggestive of the movement of the gondola, a Barcarolle is usually of moderate tempo, scored in compound time (often 6/8, 9/8 or 12/8). The genre has been used by many composers to great expressive effect....
Multi-armed bandits with switching costs are a special case of the restless-bandit model. Setup: consider the infinite-horizon, discounted MAB problem with finite state space $\mathcal S$, binary action set $\{0,1\}$ per arm ($1$ = pull, $0$ = idle), discount factor $0\le\beta<1$, arms that evolve only when pulled (i.e. “static” when $a_i=0$), and per-pull reward $r_i(s)$. We now add two costs for each arm $i$: switch-in cost $c_i$: paid (once) whenever we switch to arm $i$,...
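To make the objective explicit (my notation, not necessarily the post's), the discounted reward with a switch-in cost can be written as
$$
\max_{\{a(t)\}}\ \mathbb E\Big[\sum_{t=0}^{\infty}\beta^t\Big(r_{a(t)}\big(s_{a(t)}(t)\big)-c_{a(t)}\,\mathbf 1\{a(t)\neq a(t-1)\}\Big)\Big],
$$
where $a(t)$ is the arm pulled at time $t$ and the indicator charges $c_i$ exactly once each time the pulled arm switches to $i$.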
Were you listening to anything the last time you read/wrote a paper? Sounds of science: how music at work can fine-tune your research Nature | https://www.nature.com/articles/d41586-023-00984-4 Researchers describe how listening to music at work can boost (or hamper) productivity, and share the tunes that keep them focused. TLDR: music cheers you up—almost like a mental massage, a dopamine booster. So it makes tedious, repetitive work less unenjoyable. But music also takes up the brain’s processing power, especially for people with musical training....
“The literature on the RMABP, whether on its theoretical, algorithmic, or application aspects, is currently vast to the point where it is virtually infeasible for researchers to keep up to date with the latest advances in the field.” True. Markovian Restless Bandits and Index Policies: A Review, José Niño-Mora | Mathematics, 2023. The review is organized as follows. Section 2 surveys the antecedents to the RMABP, in particular, the classic MABP and the Gittins index policy....
Someone wrote about Swan Lake music: Once ballet music leaves the stage and enters the concert hall or recording, it becomes a kind of group-form symphony. Yet, it lacks the weight of a true symphony or concerto, as the dance drama itself is inherently “light.” Composers have never used ballet to convey grand themes—after all, a fictional prince can leap and twirl, but imagine Peter the Great or Napoleon doing the same; the image borders on the absurd....
We’re back! Last time, we explored unconstrained first-order methods (FOMs), where gradient descent works well and its time-traveling cousins (momentum and acceleration) helped even more. Now we add constraints: here’s how FOMs extend to constrained problems, especially equality-constrained ones. We’ll walk through two major methods: the Augmented Lagrangian Method with Multipliers (ALMM) and its smoother, modular evolution—ADMM. For constrained problems like this: $$ \min_x \; f(x) \quad \text{s.t.} \quad h(x) = 0,\; x \in X $$...
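As a concrete illustration (not from this post; the lasso objective, variable names, penalty $\rho$, and toy data below are my own assumptions), here is a minimal ADMM sketch for $\min_{x,z}\ \tfrac12\|Ax-b\|_2^2+\lambda\|z\|_1$ subject to the equality constraint $x - z = 0$:

```python
import numpy as np

def admm_lasso(A, b, lam=0.5, rho=1.0, iters=200):
    """Minimal ADMM sketch for: min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u is the scaled multiplier
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))     # factor once, reuse every iteration
    for _ in range(iters):
        # x-update: minimize the augmented Lagrangian in x (a ridge-type linear solve)
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: proximal step on the l1 term (soft-thresholding)
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # multiplier update: ascend on the constraint residual x - z
        u = u + x - z
    return x, z

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat, _ = admm_lasso(A, b)
```

The $u$-step is exactly the multiplier update of the augmented Lagrangian method; splitting the variables into $x$ and $z$ and updating them one at a time is what turns ALMM into ADMM.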
A lot of seemingly non-convex optimization problems are in fact convex. For example $$ \begin{align*} \min_{a_i, r_i}&\; \frac{a_i}{r_i}\tag{1}\\ \text{s.t.}&\; a_i, r_i \ge 0 \end{align*} $$ can actually be massaged into a convex optimization. Let $x_i\ge \frac{a_i}{r_i}$; optimization $(1)$ is then equivalent to $$ \begin{align*} \min_{a_i, r_i, x_i}&\; x_i\tag{2}\\ \text{s.t.}&\; x_i r_i \ge a_i\\ & a_i, r_i \ge 0 \end{align*} $$ which in turn is equivalent to $$ \begin{align*} \min_{a_i, r_i, x_i}&\; x_i\\ \text{s.t.}&\; \begin{bmatrix} x_i & \sqrt{a_i}\\ \sqrt{a_i}& r_i \end{bmatrix}\succeq0\\ & a_i, r_i \ge 0 \end{align*} $$ So now it is a semidefinite program (SDP), and it is convex....
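For completeness (this step is implicit above), the last equivalence is the standard Schur-complement characterization of a $2\times 2$ PSD matrix:
$$
\begin{bmatrix} x_i & \sqrt{a_i}\\ \sqrt{a_i} & r_i \end{bmatrix}\succeq 0
\quad\Longleftrightarrow\quad
x_i \ge 0,\; r_i \ge 0,\; x_i r_i \ge a_i,
$$
so for $r_i>0$ the PSD constraint is exactly the epigraph condition $x_i \ge a_i/r_i$, and the nonlinear ratio objective has been traded for a linear objective over a convex (SDP-representable) feasible set.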
Building on the discrete-time Bandit Process definition, consider a total of $n$ such processes. The states of the bandits are $x_1(t), \ldots, x_n(t)$. One arm may be pulled at each time $t$. Taking action $a_j$, $j\in [n]$, corresponds to pulling arm $j$ and generates reward $R_j(x_j(t))$. The aim is to maximize the total discounted ($\beta < 1$) reward, given the initial state $\vec x(0)$. The Gittins Index Policy pulls the arm $j$ with the highest Gittins index $\nu_j(x_j(t))$: $$ a(t) = \arg\max_j \nu_j(x_j(t)), $$ and is known to be optimal....
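A minimal sketch of the policy step (the names are mine, and the per-arm index functions are assumed to be given, e.g. by a routine like the one sketched after the next excerpt):

```python
import numpy as np

def gittins_policy_action(states, index_fns):
    """Gittins index policy: pull the arm whose current state has the largest index.

    states:    current per-arm states x_1(t), ..., x_n(t)
    index_fns: per-arm index functions nu_j(.), assumed precomputed
    """
    indices = [nu_j(x_j) for nu_j, x_j in zip(index_fns, states)]
    return int(np.argmax(indices))   # a(t) = argmax_j nu_j(x_j(t))
```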
Definition: Consider a discrete-time Bandit Process with one arm (dropping the arm index $i$ for convenience): Markov state transitions, binary actions, and a reward associated with the positive action (aka ‘pull’), denoted $r(x(t))$ at each time point $t = 1, \ldots$ The state $x(t)$ doesn’t change if the arm is idle. The goal is to maximize the discounted reward criterion: $$ \text{Reward}:=\mathbb E\big[\sum_t \beta^t r(x(t))\big]. $$ (Here $x(\cdot)$, or simply $x$, denotes the state.) The Gittins index $v(x)$ is calculated for each state $x$ as $$ v(x) :=\sup_{\tau >0}\frac{\mathbb E[\sum_{t = 1}^\tau \beta^t r(x(t))\mid x(0) = x]}{\mathbb E[\sum_{t = 1}^\tau \beta^t\mid x(0) = x]} $$ Note $\tau$ is a past-measurable stopping time—so expectation is taken w....
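For a finite-state arm this supremum can be evaluated by calibration: $v(x)$ is the largest per-pull charge $\lambda$ for which pulling at least once still has nonnegative optimal-stopping value, and that stopping value is decreasing in $\lambda$, so bisection finds the root. Below is a minimal sketch along those lines (the transition matrix `P`, reward vector `r`, and toy numbers are hypothetical, and I use the usual $t=0$-based indexing of the ratio, a harmless shift of the convention above):

```python
import numpy as np

def gittins_index(P, r, beta, x, tol=1e-8):
    """Gittins index of state x for a finite-state arm, via bisection on the charge lam.

    P:    (S, S) transition matrix while the arm is pulled
    r:    (S,)  per-pull rewards
    beta: discount factor in [0, 1)
    Fact used: v(x) >= lam  iff  sup over stopping times tau >= 1 of
    E[ sum_{t < tau} beta^t (r(x_t) - lam) | x_0 = x ] >= 0.
    """
    S = len(r)

    def stop_value(lam):
        # Value iteration for Q(s) = r(s) - lam + beta * E[max(0, Q(s'))]:
        # the value of pulling once and then stopping optimally, under charge lam.
        Q = np.zeros(S)
        for _ in range(10000):
            Q_new = r - lam + beta * (P @ np.maximum(Q, 0.0))
            if np.max(np.abs(Q_new - Q)) < tol:
                return Q_new[x]
            Q = Q_new
        return Q[x]

    lo, hi = float(r.min()), float(r.max())   # the index is a weighted average of rewards
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stop_value(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# toy 3-state arm
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.0, 0.3, 0.7]])
r = np.array([1.0, 0.2, 0.0])
print([round(gittins_index(P, r, beta=0.9, x=s), 4) for s in range(3)])
```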