# Difference-in-Differences Causal Inference: The Naloxone Controversy

Jeremy Albright


The worlds of machine learning and statistics, while sharing some common mathematical vocabulary, are ultimately engaged in two different enterprises. The former seeks to leverage data to yield accurate predictions about previously unseen cases, but it is less concerned about understanding the causal process by which these outcomes are generated. Statistics, as it has come to be practiced in the 21st century, is instead interested in pinning down the causal story. Whereas prediction requires turning a lot of knobs simultaneously, statistics identifies causality by finding ways to turn one knob at a time, essentially comparing two worlds identical in every way except for the value of a single variable. Note that you are wrong if you think this can simply be done by adding controls to a regression model.

The last twenty years in statistical research have introduced and refined new methods, most of which are essential for understanding causality but completely irrelevant for prediction, and therefore are not at all part of the machine learning toolkit. The distinction between the two fields is important to keep in mind because critiquing statistical papers without knowing what, say, a regression discontinuity design is can lead to broad attacks that do not do justice to the work that has gone into a research paper.

A recent manuscript from Jennifer Doleac and Anita Mukherjee makes the argument that easy access to Naloxone, a drug that can reverse opioid overdoses and save lives, has the unintended consequence of increasing opioid-related crime and ER visits without doing much to reduce opioid-related mortality. The authors suggest two avenues by which this may occur: 1) saving lives increases the number of opioid abusers who would otherwise have died, and 2) reducing the most severe consequences of opioid abuse increases risk-taking behavior. They employ a fixed effects difference-in-differences design to isolate the causal effect of Naloxone policy, and their findings, at least for some regions in the United States, support their theory of selection effects and/or moral hazard.

The article has generated a lot of discussion online, often vituperative and on the basis of a poor understanding of the econometric techniques the authors employ. There are certainly reasons to be skeptical of the findings, but those concerns need to be raised within the context of the causal inference. No single research paper should be taken as the final statement in a controversy. Issues related to measurement error, selection bias, data sources, and confounders will lead to different results in distinct research projects, and conclusions should be drawn through the triangulation of findings from distinct studies. Peer review is meant to identify flaws in individual projects so that subsequent research can be improved. Peer review is useless, however, when criticisms are based on knee-jerk reactions and/or a desire to be contrarian for its own sake.

This blog post does not seek to say that the authors are correct or incorrect but rather to explain what the statistics being used were meant to do. Once the methodology is understood, one is better positioned to critique any shortcomings in the manner that thoughtful research deserves.

## Causal Inference

21st century statistics has evolved far beyond the era when it was assumed that a regression model with a bunch of controls could provide reliable hypothesis tests (admittedly, not all journals and researchers have yet gotten this message). Instead, the literature that has come to be lumped together under the descriptor “causal inference” carefully outlines the mathematical underpinnings and assumptions required to state that an intervention is actually causing an outcome to change.

In the context of the Naloxone paper, one goal is to determine if policies facilitating access to the drug lead to reduced mortality. The problem with asserting causality is the same in any study: we never observe the counterfactual. In other words, for states that adopted Naloxone policies in a given year, we never see what the mortality or crime rates would have been had the states not adopted the new rules. The following table illustrates (with fake mortality numbers):

| State        | Year | Adopted | Deaths without Policy | Deaths with Policy | Difference |
|--------------|------|---------|-----------------------|--------------------|------------|
| Minnesota    | 2015 | Yes     | NA                    | 6000               | NA         |
| North Dakota | 2015 | Yes     | NA                    | 3000               | NA         |
| South Dakota | 2015 | No      | 5000                  | NA                 | NA         |
| Iowa         | 2015 | No      | 2000                  | NA                 | NA         |

In an ideal experiment we could observe the alternate realities in which everything was the same except that Minnesota and North Dakota did not adopt, and South Dakota and Iowa did adopt. With these counterfactual observations we could calculate the average treatment effect as the mean of the Difference column. We can’t observe any of the differences, however, so we need to figure out what we can know.

It can be shown with just a little bit of expectation algebra that we can recover the average treatment effect as the difference between the mean of the observed “Deaths with Policy” values and the mean of the observed “Deaths without Policy” values, provided the treatment is randomized. This is why randomized controlled trials are the gold standard for research. Given randomization (and, okay, a couple of other assumptions: Google “SUTVA”), we can get an unbiased estimate of the average treatment effect without having to observe the counterfactuals.
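To make this concrete, here is a minimal simulation in Python (the numbers are invented, not the paper’s data) where, unlike in reality, we get to see both potential outcomes for every unit. The true average treatment effect is the mean of the individual differences, and with a randomized treatment the simple difference in observed group means recovers it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Both potential outcomes for every unit (observable only in a simulation).
y0 = rng.normal(5000, 1000, n)          # deaths without the policy
y1 = y0 - 500 + rng.normal(0, 200, n)   # deaths with the policy

true_ate = (y1 - y0).mean()             # mean of the "Difference" column

# Randomly assign treatment; we then observe only one outcome per unit.
treated = rng.random(n) < 0.5
y_obs = np.where(treated, y1, y0)

# The difference in observed group means recovers the ATE (up to sampling
# noise) because randomization balances the groups in expectation.
estimate = y_obs[treated].mean() - y_obs[~treated].mean()
print(f"true ATE: {true_ate:.1f}  difference in means: {estimate:.1f}")
```

Without randomization, the treated and control groups would differ systematically, and the difference in means would mix the treatment effect with those pre-existing differences.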

The problem facing most social science research is that we work with observational data, meaning that the treatments we study cannot be randomized. Certainly adoption of Naloxone policies is not random. This introduces the problem of confounders, which are factors that simultaneously affect the treatment and the outcome.

## Fixed Effects Models

Take a model of mortality in city $$c$$ at time $$t$$ as a function of policy.

$Y_{ct} = \beta_0 + \beta_1 \textrm{Policy}_{ct} + u_{ct}$

This model will produce a biased estimate of the policy effect due to confounders. There are two kinds of confounders:

1. Those that only vary between cities.
2. Those that vary within cities.

An example of the former would be culture, which is pretty constant within a city over time but tends to differ between cities (compare Fargo to Detroit, for example). An example of the latter would be public policies other than Naloxone policies that change within a city over the course of the study period.
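The bias from the first kind of confounder is easy to demonstrate with simulated data (the data-generating process below is invented purely for illustration): when a time-invariant city trait raises both the outcome and the probability of adopting the policy, the pooled regression of the outcome on policy goes badly wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cities, n_years = 200, 10

# Time-invariant city trait (think "culture"): it raises the outcome and
# also makes the city more likely to adopt the policy in any given year.
alpha = rng.normal(0, 2, n_cities)
adopt_prob = 1 / (1 + np.exp(-alpha))
policy = (rng.random((n_cities, n_years)) < adopt_prob[:, None]).astype(float)

true_beta = -1.0  # the policy truly lowers the outcome
y = 10 + true_beta * policy + alpha[:, None] + rng.normal(0, 1, (n_cities, n_years))

# Pooled OLS of y on policy, ignoring the city effect entirely.
beta_pooled = np.polyfit(policy.ravel(), y.ravel(), 1)[0]
print(f"true effect: {true_beta}, pooled OLS estimate: {beta_pooled:.2f}")
```

The pooled estimate is pulled away from the true effect because, by construction, the cities that adopt are also the cities with higher baseline outcomes.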

The benefit of the fixed effects model is that we can rid ourselves of all city-level confounders. We do this by expanding the error term to separate the time-invariant error (the part that only varies between cities) from the remaining error.

$Y_{ct} = \beta_0 + \beta_1 \textrm{Policy}_{ct} +\alpha_c + u_{ct}$

Setting that aside for a moment, we can also consider a model that is only interested in between-city differences and not in how things change within a single city over time. We do this by calculating the mean of each variable across time separately for each city. We then have only as many observations as cities, because we have averaged over the time points within each city. This “between” model would be:

$\overline{Y}_{c} = \beta_0 + \beta_1 \overline{\textrm{Policy}}_{c} +\alpha_c + \overline{u}_{c}$
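As a sketch, the between regression is just a groupby-and-average followed by OLS on the collapsed data. The tiny panel below is entirely hypothetical:

```python
import numpy as np
import pandas as pd

# A hypothetical three-city, three-year panel (invented numbers).
df = pd.DataFrame({
    "city":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "policy": [0, 0, 1, 0, 1, 1, 1, 1, 1],
    "deaths": [10, 9, 8, 6, 5, 4, 3, 3, 2],
})

# Collapse to one observation per city by averaging over time...
city_means = df.groupby("city")[["policy", "deaths"]].mean()

# ...then run OLS on the collapsed data: the "between" regression.
beta_between = np.polyfit(city_means["policy"], city_means["deaths"], 1)[0]
print(city_means)
print(f"between estimate: {beta_between:.2f}")
```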

The fixed effects estimator, on the other hand, is only interested in change within cities. In effect, we remove time-invariant differences between cities and consider only the remaining variability. We do this by subtracting out the between-city differences from our first model with the two error terms.

$\left(Y_{ct}- \overline{Y}_{c}\right) = \beta_1 \left(\textrm{Policy}_{ct} - \overline{\textrm{Policy}}_{c} \right) + \left(u_{ct} - \overline{u}_{c}\right)$

Because all of the $$\alpha_c$$ are time invariant, their values are equal to their means and thus drop out of the model. By doing this operation, we have removed the between-city differences in our variables. Fitting OLS to the above model is equivalent to running a regression using only the variability within a city. In other words, we have removed all confounding due to city differences, including those we cannot even measure or observe.

It turns out that the above equation is algebraically equivalent to fitting a model with dummy variables for city, though taking the steps to derive the within-estimator makes it more explicit that we’re subtracting out city-level confounders. We can write the dummy variable model as follows:

$Y_{ct} = \beta_1 \textrm{Policy}_{ct} +\alpha_c + u_{ct}$

where $$\alpha_c$$ is the city-specific effect, and the intercept has been removed to avoid perfect multicollinearity. The within-estimator is also preferable if we don’t want to print out coefficients for a ton of dummy variables. Note that standard errors will also be the same whether one estimates the within model or the dummy variable model, as the same number of degrees of freedom are taken up to estimate city means.
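Here is a small simulated check of that equivalence (again with an invented data-generating process): demeaning within cities and then regressing yields the same policy coefficient as including a full set of city dummies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_cities, n_years = 50, 8

# Simulated panel with a city-specific effect alpha.
alpha = rng.normal(0, 2, n_cities)
policy = (rng.random((n_cities, n_years)) < 0.5).astype(float)
y = -1.0 * policy + alpha[:, None] + rng.normal(0, 1, (n_cities, n_years))

df = pd.DataFrame({
    "city": np.repeat(np.arange(n_cities), n_years),
    "policy": policy.ravel(),
    "y": y.ravel(),
})

# Within estimator: subtract city means from y and policy, then OLS
# through the origin on the demeaned data.
demeaned = df[["policy", "y"]] - df.groupby("city")[["policy", "y"]].transform("mean")
beta_within = (demeaned["policy"] @ demeaned["y"]) / (demeaned["policy"] @ demeaned["policy"])

# Dummy-variable (LSDV) estimator: a dummy for every city, no intercept.
X = np.column_stack([df["policy"], pd.get_dummies(df["city"]).to_numpy(float)])
beta_lsdv = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][0]

print(f"within: {beta_within:.6f}  dummies: {beta_lsdv:.6f}")
```

The two printed estimates agree up to floating point error.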

We can also take this a step further and add a fixed effect for time:

$Y_{ct} = \beta_1 \textrm{Policy}_{ct} +\alpha_c + \delta_t + u_{ct}$

Now we have a model that removes both time-invariant confounding and city-invariant confounding. The latter would occur if, for example, macro-economic conditions at the US level were causing similar trends in all of the cities simultaneously. All that is left to control for are confounders that vary over time differently across cities.
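A minimal sketch of the two-way model, once more with simulated data: the city dummies absorb the time-invariant confounders, while the year dummies absorb shocks that hit every city in a given year.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_cities, n_years = 50, 8

alpha = rng.normal(0, 2, n_cities)   # time-invariant city effects
delta = rng.normal(0, 1, n_years)    # shocks common to all cities each year
policy = (rng.random((n_cities, n_years)) < 0.5).astype(float)
y = -1.0 * policy + alpha[:, None] + delta[None, :] + rng.normal(0, 1, (n_cities, n_years))

city = np.repeat(np.arange(n_cities), n_years)
year = np.tile(np.arange(n_years), n_cities)

# City dummies plus year dummies (dropping one year to avoid perfect
# collinearity, since the city dummies already span the intercept).
X = np.column_stack([
    policy.ravel(),
    pd.get_dummies(city).to_numpy(float),
    pd.get_dummies(year).to_numpy(float)[:, 1:],
])
beta_twoway = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]
print(f"two-way FE estimate (true effect is -1.0): {beta_twoway:.3f}")
```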

## The Counterfactual

The study is a variation on what is called a difference-in-differences (DiD) design. The classical DiD setup has two time points and two groups, a treated group and a control group. The expectation is that the change from $$t_1$$ to $$t_2$$ will differ by treatment, as in the following: