The worlds of machine learning and statistics, while sharing some common mathematical vocabulary, are ultimately engaged in two different enterprises. The former seeks to leverage data to yield accurate predictions about previously unseen cases, but it is less concerned about understanding the causal process by which these outcomes are generated. Statistics, as it has come to be practiced in the 21st century, is instead interested in pinning down the causal story. Whereas prediction requires turning a lot of knobs simultaneously, statistics identifies causality by finding ways to turn one knob at a time, essentially comparing two worlds identical in every way except for the value of a single variable. Note that you are wrong if you think this can simply be done by adding controls to a regression model.

The last twenty years in statistical research have introduced and refined new methods, most of which are essential for understanding causality but completely irrelevant for prediction, and therefore are not at all part of the machine learning toolkit. The distinction between the two fields is important to keep in mind because critiquing statistical papers without knowing what, say, a regression discontinuity design is can lead to broad attacks that do not do justice to the work that has gone into a research paper.

A recent manuscript from Jennifer Doleac and Anita Mukherjee makes the argument that easy access to Naloxone, a drug that can reverse opioid overdoses and save lives, has the unintended consequence of increasing opioid-related crime and ER visits without doing much to reduce opioid-related mortality. The authors suggest two avenues by which this may occur: 1) saving lives increases the number of opioid abusers who would otherwise have died, and 2) reducing the most severe consequences of risk-taking behavior increases that behavior. They employ a fixed effects difference-in-differences design to isolate the causal effect of Naloxone policy, and their findings, at least for some regions in the United States, support their theory of selection effects and/or moral hazard.

The article has generated a lot of discussion online, often vituperative and on the basis of a poor understanding of the econometric techniques the authors employ. There are certainly reasons to be skeptical of the findings, but those concerns need to be raised within the context of the causal inference. No single research paper should be taken as the final statement in a controversy. Issues related to measurement error, selection bias, data sources, and confounders will lead to different results in distinct research projects, and conclusions should be drawn through the triangulation of findings from distinct studies. Peer review is meant to identify flaws in individual projects so that subsequent research can be improved. Peer review is useless, however, when criticisms are based on knee-jerk reactions and/or a desire to be contrarian for its own sake.

This blog post does not seek to say that the authors are correct or incorrect but rather to explain what the statistics being used were meant to do. Once the methodology is understood, one is better positioned to critique any shortcomings in the manner that thoughtful research deserves.

## Causal Inference

21st century statistics has evolved far beyond the era when it was assumed that a regression model with a bunch of controls could provide reliable hypothesis tests (admittedly, not all journals and researchers have yet gotten this message). Instead, the literature that has come to be lumped together under the descriptor “causal inference” carefully outlines the mathematical underpinnings and assumptions required to state that an intervention is actually causing an outcome to change.

In the context of the Naloxone paper, one goal is to determine if policies facilitating access to the drug lead to reduced mortality. The problem with asserting causality is the same in any study: we never observe the counterfactual. In other words, for states that adopted Naloxone policies in a given year, we never see what the mortality or crime rates would have been had the states not adopted the new rules. The following table illustrates (with fake mortality numbers):

| State | Year | Adopted | Deaths without Policy | Death with Policy | Difference |
|---|---|---|---|---|---|
| Minnesota | 2015 | Yes | NA | 6000 | NA |
| North Dakota | 2015 | Yes | NA | 3000 | NA |
| South Dakota | 2015 | No | 5000 | NA | NA |
| Iowa | 2015 | No | 2000 | NA | NA |

In an ideal experiment we could observe the alternate realities in which everything was the same *except* that Minnesota and North Dakota did not adopt, and South Dakota and Iowa did adopt. With these counterfactual observations we could calculate the average treatment effect as the mean of the `Difference` column. We can’t observe any of the differences, however, so we need to figure out what we can know.

It can be shown with just a little bit of expectation algebra that we can recover the average treatment effect as the difference between the mean of the observed “Death with Policy” values and the mean of the observed “Deaths without Policy” values, *provided the treatment is randomized*. This is why randomized controlled trials are the gold standard for research. Given randomization (and, okay, a couple of other assumptions: Google “SUTVA”), we can get an unbiased estimate of the average treatment effect *without having to observe the counterfactuals*.
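A small simulation can make the randomization claim concrete. The sketch below is not from the paper: the unit count, the death counts, and the assumed effect of -500 are all invented for illustration. Because it is a simulation, we get to see both potential outcomes for every unit, compute the true average treatment effect, and then check that the difference in observed means under a coin-flip assignment lands close to it.

```python
import random
from statistics import mean

rng = random.Random(0)
n_units = 20000
true_effect = -500.0  # assumed for illustration: the policy prevents 500 deaths per unit

# Both potential outcomes for every unit (only possible in a simulation).
deaths_without = [rng.gauss(4000, 1000) for _ in range(n_units)]
deaths_with = [d + true_effect for d in deaths_without]
true_ate = mean(w - wo for w, wo in zip(deaths_with, deaths_without))

# Randomize adoption with a fair coin, then keep only what we would observe.
adopted = [rng.random() < 0.5 for _ in range(n_units)]
observed_with = [w for w, a in zip(deaths_with, adopted) if a]
observed_without = [wo for wo, a in zip(deaths_without, adopted) if not a]

# Difference in observed means: no counterfactuals needed.
estimated_ate = mean(observed_with) - mean(observed_without)
print(f"true ATE: {true_ate:.1f}, difference in observed means: {estimated_ate:.1f}")
```

The two numbers will not match exactly in any one sample, but the gap is sampling noise rather than bias.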

The problem facing most social science research is that we work with observational data, meaning that the treatments we study cannot be randomized. Certainly adoption of Naloxone policies is not random. This introduces the problem of *confounders*, which are factors that simultaneously affect the treatment and the outcome.

The problem is that not all of the covariance between Naloxone and Mortality is causal, because some confounders are causing the two to move together.
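To see how badly a confounder can distort a naive comparison, here is a minimal sketch with made-up numbers. The confounder, which I am calling `severity` (how bad the local opioid problem is), is hypothetical: it raises mortality and also makes adoption more likely. The true policy effect is set to -500, yet the raw difference in means between adopters and non-adopters comes out positive.

```python
import random
from statistics import mean

rng = random.Random(1)
n_units = 20000
true_effect = -500.0  # assumed for illustration

rows = []  # (adopted, deaths)
for _ in range(n_units):
    severity = rng.gauss(0, 1)  # hypothetical confounder
    # Places with a worse problem are more likely to adopt the policy.
    adopted = rng.random() < (0.8 if severity > 0 else 0.2)
    deaths = 4000 + 1000 * severity + (true_effect if adopted else 0.0) + rng.gauss(0, 200)
    rows.append((adopted, deaths))

# Naive comparison of adopters and non-adopters, ignoring the confounder.
naive = mean(d for a, d in rows if a) - mean(d for a, d in rows if not a)
print(f"true effect: {true_effect:.0f}, naive difference in means: {naive:.0f}")
```

The policy saves lives in this fake world, but adopters look worse because they were worse off to begin with.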

What your grandparents did to deal with confounders was to “control” for them by fitting a regression model that included measures of the confounders. This approach, however, is known to be a poor tool for isolating the causal effect. There are a few reasons for this.

- The causal effect is not defined over all values of the confounders (e.g. perhaps no poor municipalities ever adopted policies), yet regression will weight observations with these values as much as those for which the causal effect is defined. This is the motivation for propensity score methods.
- Not all confounders can be measured or even observed.

When we have panel data, however, we can rid ourselves of having to worry about many (though not all) potential confounders, including ones that cannot be measured. This is done through a fixed effects design.

## Fixed Effects Models

Take a model of mortality in city \(c\) at time \(t\) as a function of policy.

\[ Y_{ct} = \beta_0 + \beta_1 \textrm{Policy}_{ct} + u_{ct} \]

This model will produce a biased estimate of the policy effect due to confounders. There are two kinds of confounders:

- Those that only vary between cities.
- Those that vary within cities.

An example of the former would be culture, which is pretty constant within a city over time but tends to be different between cities (compare Fargo to Detroit, for example). An example of the latter would be public policies other than Naloxone policies that change during the study period.

The benefit of the fixed effects model is that we can rid ourselves of all city-level confounders. Do this by expanding the error term to separate time-invariant (i.e. only changing between cities) error and remaining error.

\[ Y_{ct} = \beta_0 + \beta_1 \textrm{Policy}_{ct} +\alpha_c + u_{ct} \]

Setting that aside for a moment, we can also consider a model that is only interested in between-city differences, but not really interested in how things are changing within a single city over time. We do this by calculating the mean of each variable across time separately for each city. In this case, we only have as many observations as cities, because we’ve taken the average of each time point within a city. This “between” model would be:

\[ \overline{Y}_{c} = \beta_0 + \beta_1 \overline{\textrm{Policy}}_{c} +\alpha_c + \overline{u}_{c} \]

The fixed effects estimator, on the other hand, is only interested in change *within* cities. In effect, we remove time-invariant differences between cities and consider only the remaining variability. We do this by subtracting out the between-city differences from our first model with the two error terms.

\[ \left(Y_{ct}- \overline{Y}_{c}\right) = \beta_1 \left(\textrm{Policy}_{ct} - \overline{\textrm{Policy}}_{c} \right) + \left(u_{ct} - \overline{u}_{c}\right) \]

Because all of the \(\alpha_c\) are time invariant, their values are equal to their means and thus drop out of the model. By doing this operation, we have removed the between-city differences in our variables. Fitting OLS to the above model is equivalent to running a regression using only the variability within a city. In other words, we have removed all confounding due to city differences, including those we cannot even measure or observe.
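The demeaning step is easy to try on simulated data. In this sketch (all numbers invented, none from the paper) each city has an unobserved effect `alpha`, and cities with a high `alpha` adopt earlier, so `alpha` confounds the policy. Pooled OLS is badly biased, while the within-estimator, which regresses city-demeaned outcomes on city-demeaned policy, gets close to the assumed effect of -2.

```python
import random
from statistics import mean

rng = random.Random(2)
n_cities, n_years = 200, 10
true_beta = -2.0  # assumed policy effect

by_city = {}  # city -> list of (policy, outcome)
for c in range(n_cities):
    alpha = rng.gauss(0, 5)            # unobserved, time-invariant city effect
    start = 3 if alpha > 0 else 8      # high-alpha cities adopt earlier (confounding)
    for t in range(n_years):
        policy = 1.0 if t >= start else 0.0
        y = 10 + true_beta * policy + alpha + rng.gauss(0, 1)
        by_city.setdefault(c, []).append((policy, y))

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Pooled OLS: ignores the city effects entirely.
all_obs = [obs for city_obs in by_city.values() for obs in city_obs]
pooled = slope([p for p, _ in all_obs], [y for _, y in all_obs])

# Within-estimator: subtract each city's own means, then regress.
x_within, y_within = [], []
for city_obs in by_city.values():
    p_bar = mean(p for p, _ in city_obs)
    y_bar = mean(y for _, y in city_obs)
    for p, y in city_obs:
        x_within.append(p - p_bar)
        y_within.append(y - y_bar)
within = sum(a * b for a, b in zip(x_within, y_within)) / sum(a * a for a in x_within)

print(f"pooled OLS: {pooled:.2f}, within-estimator: {within:.2f}, truth: {true_beta}")
```

Note that the code never uses `alpha` when estimating; the demeaning removes it without our having to measure it.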

It turns out that the above equation is algebraically equivalent to fitting a model with dummy variables for city, though taking the steps to derive the *within-estimator* makes it more explicit that we’re subtracting out city-level confounders. We can write the dummy variable model as above:

\[ Y_{ct} = \beta_1 \textrm{Policy}_{ct} +\alpha_c + u_{ct} \]

where \(\alpha_c\) is the city-specific effect, and the intercept has been removed to avoid perfect multicollinearity. The within-estimator is also preferable if we don’t want to print out coefficients for a ton of dummy variables. Note that standard errors will also be the same whether one estimates the within model or the dummy variable model, as the same number of degrees of freedom are taken up to estimate city means.
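The equivalence of the two approaches can be checked by hand on a toy panel. The nine observations below are made up, and the tiny `solve` helper is my own stand-in for a linear algebra library so that the example needs nothing beyond the standard library. The dummy variable model is fit through the normal equations; the within-estimator is computed by demeaning.

```python
from statistics import mean

# Hypothetical toy panel: (city, year, policy, outcome)
panel = [
    ("A", 1, 0, 10.0), ("A", 2, 0, 11.0), ("A", 3, 1, 9.0),
    ("B", 1, 0, 20.0), ("B", 2, 1, 17.5), ("B", 3, 1, 18.0),
    ("C", 1, 0, 5.0), ("C", 2, 0, 5.5), ("C", 3, 0, 4.5),
]
cities = sorted({row[0] for row in panel})

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Dummy variable model: policy plus one dummy per city, no intercept.
X = [[float(p)] + [1.0 if c == name else 0.0 for name in cities] for c, _, p, _ in panel]
y = [row[3] for row in panel]
k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
beta_dummy = solve(xtx, xty)[0]  # first coefficient is the policy effect

# Within-estimator: demean policy and outcome by city, regress without intercept.
num = den = 0.0
for name in cities:
    obs = [(p, out) for c, _, p, out in panel if c == name]
    p_bar, y_bar = mean(p for p, _ in obs), mean(out for _, out in obs)
    num += sum((p - p_bar) * (out - y_bar) for p, out in obs)
    den += sum((p - p_bar) ** 2 for p, _ in obs)
beta_within = num / den

print(beta_dummy, beta_within)
```

City C never changes policy, so it contributes nothing to the estimate in either version; only cities that switch carry information about the policy effect.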

We can also take this a step further and add a fixed effect for time:

\[ Y_{ct} = \beta_1 \textrm{Policy}_{ct} +\alpha_c + \delta_t + u_{ct} \]

Now we have a model that removes time-invariant confounding and city-invariant confounding. The latter would occur if, for example, macro-economic conditions at the US-level were causing similar trends in all of the cities simultaneously. All that’s left to control for is confounders that vary within-city differently over time.
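Here is a sketch of why the time fixed effect matters, again on invented data. Every city shares an upward national trend, and because adoption only ever switches on, the policy variable is correlated with that trend inside each city. City fixed effects alone mistake the trend for a policy effect. For a balanced panel like this one, the two-way model can be fit by subtracting city means and year means and adding back the grand mean; that shortcut is an assumption of this example and does not carry over to unbalanced panels.

```python
import random
from statistics import mean

rng = random.Random(3)
n_cities, n_years = 300, 10
true_beta = -2.0  # assumed policy effect

policy, outcome = [], []  # one row per city, one column per year
for _ in range(n_cities):
    alpha = rng.gauss(0, 5)                  # city effect
    start = rng.choice([3, 6, n_years])      # n_years means "never adopts in sample"
    p_row = [1.0 if t >= start else 0.0 for t in range(n_years)]
    # 1.5 * t is a national trend shared by all cities.
    y_row = [true_beta * p + alpha + 1.5 * t + rng.gauss(0, 1) for t, p in enumerate(p_row)]
    policy.append(p_row)
    outcome.append(y_row)

def demean(m, remove_year_means):
    """Subtract city means; optionally also subtract year means (balanced panel)."""
    n_c, n_t = len(m), len(m[0])
    city = [mean(row) for row in m]
    year = [mean(m[c][t] for c in range(n_c)) for t in range(n_t)]
    grand = mean(city)
    return [
        m[c][t] - city[c] - ((year[t] - grand) if remove_year_means else 0.0)
        for c in range(n_c) for t in range(n_t)
    ]

def slope(x, y):
    """OLS slope without an intercept (inputs are already demeaned)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

city_only = slope(demean(policy, False), demean(outcome, False))
two_way = slope(demean(policy, True), demean(outcome, True))
print(f"city FE only: {city_only:.2f}, city + year FE: {two_way:.2f}, truth: {true_beta}")
```

The simulation assumes the same policy effect in every city and year, which is the friendly case for this estimator.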

## The Counterfactual

The study is a variation on what is called a difference-in-differences (DiD) design. The classical DiD setup has two time points and two groups, a treated group and a control group. The expectation is that the change from \(t_1\) to \(t_2\) will differ by treatment, as in the following:

The counterfactual for the Naloxone group is the amount of crime that would have been observed had no laws been passed. An essential assumption of the DiD design is that the trend in the counterfactual case would have been equal to the untreated trend. This is known as the “parallel trends” assumption.

The treatment effect can be determined as the difference between the red and green dots at time two.
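The arithmetic of the classical two-group, two-period setup fits in a few lines. The group means below are made up purely for illustration. Under parallel trends, the treated group's counterfactual at time two is its own starting point plus the change seen in the control group, and the treatment effect is the gap between what was observed and that counterfactual.

```python
# Hypothetical group means of the outcome (e.g. a crime rate); invented numbers.
means = {
    ("control", 1): 50.0, ("control", 2): 55.0,
    ("treated", 1): 60.0, ("treated", 2): 72.0,
}

control_change = means[("control", 2)] - means[("control", 1)]
treated_change = means[("treated", 2)] - means[("treated", 1)]

# Parallel trends: absent treatment, the treated group would have changed
# by the same amount as the control group.
counterfactual_t2 = means[("treated", 1)] + control_change

# Two equivalent ways of writing the DiD estimate.
did_from_counterfactual = means[("treated", 2)] - counterfactual_t2
did_from_changes = treated_change - control_change

print(counterfactual_t2, did_from_counterfactual, did_from_changes)
```

Everything here hinges on the counterfactual line being right, which is exactly what the parallel trends assumption asserts and what the data cannot confirm directly.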

The Doleac and Mukherjee paper is a variation on the DiD design that exploits the differential timing of Naloxone laws to determine the treatment effect, and consequently the counterfactual assumption of parallel trends is necessary. Had a city not adopted Naloxone laws in a given year, the counterfactual trend is assumed to be the same as for cities that had not yet adopted laws.

The case for the assumption being met is bolstered somewhat by the inclusion of several time points prior to adoption in most states, which makes it possible to see longer term changes in outcomes. In addition, the city fixed effects remove city-level confounders, and the time fixed effects remove macro trends shared by all cities. The authors also present some graphs attempting to demonstrate that there were no pre-existing trends. An example is the following:

(Note: since this post was originally written, the authors have updated the figures in the most recent version of their manuscript. I am keeping the prior versions, as they have been the source of much of the criticism that has occurred on Twitter).

Why is the y-axis labeled “Residuals”? The authors fit their fixed effects model with all of their covariates but excluded the policy variable, the key treatment. If the model is sufficiently accounting for overall trends, then what’s left to explain - the residuals - should not be a function of time. The scale of the y-axis appears to be based on standardized residuals, a different scale than the original measurement.

Some of their figures more convincingly rule out pre-treatment trends than others. For example, the fixed effects and controls did not seem to eliminate time effects in the Pacific/West region, though the authors do not draw strong conclusions about this region.

Having done away with city and (macro) time confounding, the treatment effect is estimated as the mean after policies minus the mean prior to policies. The loess lines in the figures distract from this but were likely included in order to show overall trends. The actual treatment effect is a difference in means test, after adjusting with fixed effects and some additional non-Naloxone policy controls. It is therefore a mistake to judge the quality of the paper on the graphs alone.

It’s also worth noting that the statistical models from which the estimated treatment effects came were population weighted, whereas the loess curves are presumably not weighted. This can lead to confusion when comparing the graphs with the treatment effect estimates reported in the paper’s tables.

While the lack of pre-treatment trends overall supports the parallel trends assumption, it is still not a direct test of the (impossible to prove) counterfactual. The authors therefore present a series of robustness tests. A common approach in DiD studies is to carry out placebo tests, which attempt to find similar changes in outcomes that should not have been affected by treatment. If the same trends are found, then the model does not adequately address the problem of parallel trends. The placebos the authors provide are:

- Suicide, as an indicator of changing economic despair.
- Deaths by heart disease, as an indicator of overall changing health.
- Deaths due to motor vehicle accidents, as an indicator of overall trends in risk taking.

The authors take the lack of significant treatment effects for these outcomes to indicate that their results are not picking up other trends, especially ones tied to the economy and public health. They also test a series of additional models changing the control variables and including different terms to account for possible nonlinear trends. They conclude that the results are overall robust.
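The logic of a placebo test can be sketched on simulated data. This is not the authors' code or data: the outcomes, the trends, and the effect size of -2 are invented, and the two-way fixed effects model is fit with the balanced-panel demeaning shortcut rather than whatever software the authors used. The main outcome responds to the policy; the placebo outcome has its own city effects and national trend but no policy effect. Running the same model on both should give an estimate near the assumed effect for the first and near zero for the second.

```python
import random
from statistics import mean

rng = random.Random(4)
n_cities, n_years = 300, 10
true_beta = -2.0  # assumed effect on the main outcome; the placebo effect is zero

policy, main, placebo = [], [], []  # one row per city, one column per year
for _ in range(n_cities):
    start = rng.choice([3, 6, n_years])  # n_years means "never adopts in sample"
    a_main, a_placebo = rng.gauss(0, 5), rng.gauss(0, 5)  # city effects
    p_row = [1.0 if t >= start else 0.0 for t in range(n_years)]
    policy.append(p_row)
    main.append([true_beta * p + a_main + 1.5 * t + rng.gauss(0, 1) for t, p in enumerate(p_row)])
    placebo.append([a_placebo + 0.8 * t + rng.gauss(0, 1) for t in range(n_years)])

def two_way_demean(m):
    """Remove city means and year means (valid for a balanced panel)."""
    n_c, n_t = len(m), len(m[0])
    city = [mean(row) for row in m]
    year = [mean(m[c][t] for c in range(n_c)) for t in range(n_t)]
    grand = mean(city)
    return [m[c][t] - city[c] - year[t] + grand for c in range(n_c) for t in range(n_t)]

def fe_estimate(y):
    """Two-way fixed effects coefficient on policy for outcome y."""
    x_dd, y_dd = two_way_demean(policy), two_way_demean(y)
    return sum(a * b for a, b in zip(x_dd, y_dd)) / sum(a * a for a in x_dd)

main_effect = fe_estimate(main)
placebo_effect = fe_estimate(placebo)
print(f"main outcome: {main_effect:.2f}, placebo outcome: {placebo_effect:.2f}")
```

A placebo estimate far from zero would be a warning that the model is picking up something other than the policy.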

## Critiquing the Article

If one wishes to get on the interwebs and publicly attack the paper, one should understand what was done.

- The authors wanted to identify the treatment effect of Naloxone laws, but these laws have not been randomly assigned. This raises the problem of confounders.
- The authors deal with time-invariant confounders and macro trends, including those that cannot be observed, through the use of fixed effects. They also control for other time-varying policies.
- The staggered implementation of laws allows for a DiD approach, where the treatment effect is determined by the adjusted difference in means before and after implementation.

The proper way to approach the paper is to ask the following questions:

- Have the key variables been measured accurately?
- Are there additional within-city confounders that may explain the differences but are not controlled for?
- Is the parallel trends assumption valid?

These key questions are raised by Richard Frank, Keith Humphreys, and Harold Pollack in a summary published on the Health Affairs blog, https://www.healthaffairs.org/do/10.1377/hblog20180316.599095/full/. They make important points that get to the heart of the limitations of the study which are, *inter alia*,

- Measurement: Doleac and Mukherjee’s treatment is an amalgam of very different policies all lumped together. There is no reason to expect each policy type has a similar effect to the others.
- Confounders: The study’s time period overlaps with the expansion of Medicaid in many states, which meant more people were more likely to go to emergency rooms.
- Timing: The authors look for an effect immediately after a law goes into effect, whereas there is generally a lag before a policy has an effect. Exacerbating this is that the authors’ data end right when the most states were implementing policies.

The first of these issues cannot be fixed by any type of identification strategy. The second demonstrates the difficulty of controlling for all confounders, even when using fixed effects and DiD designs. The third can only be addressed with the passage of time.

## The Upshot

The Doleac and Mukherjee paper is a serious piece of research that employs appropriate and standard econometric techniques to adjust as much as possible for confounding. For this reason, the issues that they raise should be taken seriously. At the same time, no one paper will definitively answer important policy-relevant questions in a world where policy interventions are necessarily non-random. Their paper should only be read within the larger context of research on Naloxone laws, much of which disagrees with the findings. This is the best way for adults to approach *any* area of scientific inquiry.