At the Conference on Frontiers in Machine Learning and Economics (2025), Professor Tamara Broderick from MIT gave a brilliant keynote, based on the paper An Automatic Finite-Sample Robustness Metric. It’s a metric that measures how sensitive are conclusions to the removal of just a small fraction of data?


Motivation

Even dropping a vanishingly small proportion of observations can matter.
We should worry if removing a fraction $\alpha \in (0,1)$ of the data:

  • flips the sign of an estimated effect,
  • alters statistical significance, or
  • reverses substantive conclusions.

Formally, define the Maximum Influence Perturbation (MIP): the largest change in an analysis outcome when at most $\alpha$ fraction of the data is removed.


Methodology Setup

The challenge is clear: brute-force exploration of all subsets feels NP-hard like obvious. So, approximation algorithms:

  1. We frame estimation as minimizing empirical loss with optional penalty:

    $$ \hat{\theta}(w) = \arg\min_\theta \sum_{n=1}^N w_n f(\theta, d_n) + R(\theta), $$

    where $w \in {0,1}^N$ encodes which data points are included.

  2. Define a quantity of interest, e.g. a coefficient $\phi(\hat{\theta}(w))$.

  3. Apply a first-order Taylor expansion around the full-data solution $w = 1_N$:

    $$ \phi(w) \approx \phi(1) + \sum_{n=1}^N (w_n - 1),\psi_n, $$

    with influence score

    $$ \psi_n := \frac{\partial \phi(w)}{\partial w_n}\Bigg|_{w=1}. $$

This yields a linear influence approximation.
Computation reduces to evaluating $\psi_n$, which is automatable with autodiff and implicit function theorems.

A greedy-style selection over these influence scores provides a practical approximation to the Most Influential Set.


Example: Microcredit RCTs

Case study: Angelucci et al. (2015), 16,560 households in randomized controlled trials on microcredit.

  • Original regression result:
    $$ \hat{\theta}_1 = -4.55 \ \text{USD PPP/2 weeks}, \quad \text{std. err.} = 5.88. $$

  • Robustness analysis:

    • Dropping one household flips the sign (negative → positive).
    • Dropping 15 households yields:
      $$ \hat{\theta}_1 = 7.03, \quad \text{std. err.} = 2.55. $$

Even in large, carefully designed experiments, results can be extremely fragile.


I think the paper’s contribution lies in methodology: using linear influence approximations to make finite-sample robustness measurable and practical. In other words, if the conclusions hinge on a handful of data points, results are not necessarily robust.