At the Conference on Frontiers in Machine Learning and Economics (2025), Professor Tamara Broderick from MIT gave a brilliant keynote based on the paper An Automatic Finite-Sample Robustness Metric. The metric measures how sensitive conclusions are to the removal of just a small fraction of the data.
Motivation
Even dropping a vanishingly small proportion of observations can matter.
We should worry if removing a fraction $\alpha \in (0,1)$ of the data:
- flips the sign of an estimated effect,
- alters statistical significance, or
- reverses substantive conclusions.
Formally, define the Maximum Influence Perturbation (MIP): the largest change in an analysis outcome when at most an $\alpha$ fraction of the data is removed.
Methodology Setup
The challenge is clear: brute-force exploration of all candidate subsets is combinatorially infeasible, so the paper turns to an approximation.
We frame estimation as minimizing empirical loss with optional penalty:
$$ \hat{\theta}(w) = \arg\min_\theta \sum_{n=1}^N w_n f(\theta, d_n) + R(\theta), $$
where $w \in \{0, 1\}^N$ encodes which data points are included.
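As a concrete instance (an illustrative choice; the framework covers any smooth loss), ordinary least squares with data $d_n = (x_n, y_n)$ corresponds to $f(\theta, d_n) = (y_n - x_n^\top \theta)^2$ and $R(\theta) = 0$, giving the closed form
$$ \hat{\theta}(w) = (X^\top W X)^{-1} X^\top W y, \qquad W = \operatorname{diag}(w). $$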
Define a quantity of interest, e.g. a coefficient $\phi(\hat{\theta}(w))$.
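With this notation, the Maximum Influence Perturbation from the motivation can be written (one way to formalize it) as
$$ \mathrm{MIP}(\alpha) \;=\; \max_{\substack{w \in \{0,1\}^N \\ \sum_{n=1}^N (1 - w_n) \le \alpha N}} \left| \phi(\hat{\theta}(w)) - \phi(\hat{\theta}(1_N)) \right|. $$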
Apply a first-order Taylor expansion around the full-data solution $w = 1_N$:
$$ \phi(w) \approx \phi(1_N) + \sum_{n=1}^N (w_n - 1)\,\psi_n, $$
with influence score
$$ \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{w = 1_N}. $$
This yields a linear influence approximation.
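In particular, dropping a set $S$ of observations means setting $w_n = 0$ for $n \in S$ (and $1$ elsewhere), so the approximation predicts a change of
$$ \phi(w_S) - \phi(1_N) \approx -\sum_{n \in S} \psi_n. $$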
Computation reduces to evaluating $\psi_n$, which is automatable with autodiff and implicit function theorems.
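Here is a minimal sketch of that computation for a smooth per-observation loss, using JAX for the derivatives. The function names and the squared-error loss are illustrative assumptions, not the authors' implementation, and the penalty $R(\theta)$ is omitted for simplicity.

```python
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    # Per-observation squared-error loss; any smooth M-estimation loss works here.
    return (y - x @ theta) ** 2

def influence_scores(theta_hat, X, Y, phi):
    """Approximate d phi(theta_hat(w)) / d w_n at w = 1_N for every n.

    By the implicit function theorem, d theta_hat / d w_n = -H^{-1} g_n,
    where H is the Hessian of the total loss and g_n the gradient of
    observation n's loss, both evaluated at the full-data optimum.
    """
    total_loss = lambda th: jnp.sum(jax.vmap(loss, in_axes=(None, 0, 0))(th, X, Y))
    H = jax.hessian(total_loss)(theta_hat)                               # (p, p)
    G = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(theta_hat, X, Y)  # (N, p)
    dtheta_dw = -jnp.linalg.solve(H, G.T)                                # (p, N)
    dphi_dtheta = jax.grad(phi)(theta_hat)                               # (p,)
    return dphi_dtheta @ dtheta_dw                                       # (N,) = psi_n
```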
A greedy-style selection over these influence scores provides a practical approximation to the Most Influential Set.
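A minimal sketch of that selection, assuming the $\psi_n$ from the block above (the function name is hypothetical): removing point $n$ shifts $\phi$ by roughly $-\psi_n$, so to push $\phi$ upward we drop the points with the most negative scores.

```python
import numpy as np

def approx_most_influential_set(psi, alpha, direction=+1):
    """Pick at most alpha*N points whose removal pushes phi furthest in the
    chosen direction (+1 to increase phi, -1 to decrease it)."""
    psi = np.asarray(psi)
    budget = int(np.floor(alpha * len(psi)))
    order = np.argsort(direction * psi)          # most helpful removals first
    candidates = order[:budget]
    # Keep only points whose removal actually moves phi in the desired direction.
    drop = candidates[direction * psi[candidates] < 0]
    approx_change = -psi[drop].sum()             # predicted shift in phi
    return drop, approx_change
```

One would then refit the model without the selected points to confirm that the predicted reversal survives an exact re-analysis, as in the microcredit example below.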
Example: Microcredit RCTs
Case study: Angelucci et al. (2015), 16,560 households in randomized controlled trials on microcredit.
Original regression result:
$$ \hat{\theta}_1 = -4.55 \ \text{USD PPP / 2 weeks}, \quad \text{std. err.} = 5.88. $$
Robustness analysis:
- Dropping one household flips the sign (negative → positive).
- Dropping 15 households yields:
$$ \hat{\theta}_1 = 7.03, \quad \text{std. err.} = 2.55. $$
Even in large, carefully designed experiments, results can be extremely fragile: here, fewer than 0.1% of the 16,560 households drive the conclusion.
I think the paper's contribution lies in the methodology: linear influence approximations make finite-sample robustness measurable and practical. The takeaway: if conclusions hinge on a handful of data points, they should not be treated as robust.