At the Conference on Frontiers in Machine Learning and Economics (2025), Professor Tamara Broderick from MIT gave a brilliant keynote based on the paper An Automatic Finite-Sample Robustness Metric. The metric measures how sensitive conclusions are to the removal of just a small fraction of the data.
Motivation
Even dropping a vanishingly small proportion of observations can matter.
We should worry if removing a fraction $\alpha \in (0,1)$ of the data:
- flips the sign of an estimated effect,
- alters statistical significance, or
- reverses substantive conclusions.
Formally, define the Maximum Influence Perturbation (MIP): the largest change in an analysis outcome when at most an $\alpha$ fraction of the data is removed.
Methodology Setup
The challenge is clear: brute-force exploration of all candidate subsets is combinatorially infeasible, so the paper turns to an approximation.
We frame estimation as minimizing empirical loss with optional penalty:
$$ \hat{\theta}(w) = \arg\min_\theta \sum_{n=1}^N w_n f(\theta, d_n) + R(\theta), $$
where $w \in \{0, 1\}^N$ encodes which data points are included.
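As a concrete instance (an illustrative choice; the framework covers any smooth loss), ordinary least squares with data $d_n = (x_n, y_n)$ corresponds to $f(\theta, d_n) = (y_n - x_n^\top \theta)^2$ and $R(\theta) = 0$, giving the closed form
$$ \hat{\theta}(w) = (X^\top W X)^{-1} X^\top W y, \qquad W = \operatorname{diag}(w). $$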
Define a quantity of interest, e.g. a coefficient $\phi(\hat{\theta}(w))$.
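With this notation, the Maximum Influence Perturbation from the motivation can be written (one way to formalize it) as
$$ \mathrm{MIP}(\alpha) \;=\; \max_{\substack{w \in \{0,1\}^N \\ \sum_{n=1}^N (1 - w_n) \le \alpha N}} \left| \phi(\hat{\theta}(w)) - \phi(\hat{\theta}(1_N)) \right|. $$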
Apply a first-order Taylor expansion around the full-data solution $w = 1_N$:
$$ \phi(w) \approx \phi(1_N) + \sum_{n=1}^N (w_n - 1)\,\psi_n, $$
with influence score
$$ \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{w = 1_N}. $$
This yields a linear influence approximation.
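In particular, dropping a set $S$ of observations means setting $w_n = 0$ for $n \in S$ (and $1$ elsewhere), so the approximation predicts a change of
$$ \phi(w_S) - \phi(1_N) \approx -\sum_{n \in S} \psi_n. $$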
Computation reduces to evaluating $\psi_n$, which is automatable with autodiff and implicit function theorems.
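Here is a minimal sketch of that computation for a smooth per-observation loss, using JAX for the derivatives. The function names and the squared-error loss are illustrative assumptions, not the authors' implementation, and the penalty $R(\theta)$ is omitted for simplicity.

```python
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    # Per-observation squared-error loss; any smooth M-estimation loss works here.
    return (y - x @ theta) ** 2

def influence_scores(theta_hat, X, Y, phi):
    """Approximate d phi(theta_hat(w)) / d w_n at w = 1_N for every n.

    By the implicit function theorem, d theta_hat / d w_n = -H^{-1} g_n,
    where H is the Hessian of the total loss and g_n the gradient of
    observation n's loss, both evaluated at the full-data optimum.
    """
    total_loss = lambda th: jnp.sum(jax.vmap(loss, in_axes=(None, 0, 0))(th, X, Y))
    H = jax.hessian(total_loss)(theta_hat)                               # (p, p)
    G = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(theta_hat, X, Y)  # (N, p)
    dtheta_dw = -jnp.linalg.solve(H, G.T)                                # (p, N)
    dphi_dtheta = jax.grad(phi)(theta_hat)                               # (p,)
    return dphi_dtheta @ dtheta_dw                                       # (N,) = psi_n
```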
A greedy-style selection over these influence scores provides a practical approximation to the Most Influential Set.
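A minimal sketch of that selection, assuming the $\psi_n$ from the block above (the function name is hypothetical): removing point $n$ shifts $\phi$ by roughly $-\psi_n$, so to push $\phi$ upward we drop the points with the most negative scores.

```python
import numpy as np

def approx_most_influential_set(psi, alpha, direction=+1):
    """Pick at most alpha*N points whose removal pushes phi furthest in the
    chosen direction (+1 to increase phi, -1 to decrease it)."""
    psi = np.asarray(psi)
    budget = int(np.floor(alpha * len(psi)))
    order = np.argsort(direction * psi)          # most helpful removals first
    candidates = order[:budget]
    # Keep only points whose removal actually moves phi in the desired direction.
    drop = candidates[direction * psi[candidates] < 0]
    approx_change = -psi[drop].sum()             # predicted shift in phi
    return drop, approx_change
```

One would then refit the model without the selected points to confirm that the predicted reversal survives an exact re-analysis, as in the microcredit example below.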
Example: Microcredit RCTs
Case study: Angelucci et al. (2015), 16,560 households in randomized controlled trials on microcredit.
Original regression result:
$$ \hat{\theta}_1 = -4.55 \ \text{USD PPP / 2 weeks}, \quad \text{std. err.} = 5.88. $$
Robustness analysis:
- Dropping one household flips the sign (negative → positive).
- Dropping 15 households yields:
$$ \hat{\theta}_1 = 7.03, \quad \text{std. err.} = 2.55. $$
Even in large, carefully designed experiments, results can be extremely fragile: here, fewer than 0.1% of the 16,560 households drive the conclusion.
I think the paper's contribution lies in the methodology: linear influence approximations make finite-sample robustness measurable and practical. The takeaway: if conclusions hinge on a handful of data points, they should not be treated as robust.