Rare events are often of interest in statistics and machine learning. Mortality caused by a prescription drug may be uncommon but of great concern to patients, providers, and manufacturers. Predictive models in finance may be focused on forecasting when equities move substantially, something quite rare relative to the more quotidian shifts in prices. Logistic-type models (logit models in econometrics, neural nets with sigmoidal activation functions) will tend to underestimate the probability of these events occurring. After all, if an event occurs 1% of the time, a model that says no cases will ever experience the event will demonstrate 99% accuracy.

This problem was addressed several years ago in the statistical literature by Gary King and Langche Zeng (2001). In machine learning, the problem is typically addressed by down sampling the non-events to even out the distribution of the outcome. However, as King and Zeng show, this approach is akin to a case-control design in epidemiology for which standard logistic-based classifiers are well-known to be biased. In addition, even if optimizing a bias-corrected likelihood function, the predicted probabilities may still be too small and require a further adjustment based on the sample-to-sample variability in parameter estimates.

We were interested in better understanding this method for its own sake, given that we are frequently asked to predict relatively rare adverse events following, say, surgery. We were also interested in what this approach can suggest for machine learning, especially when down-sampling of nonevents is used. Indeed, King and Zeng (along with Nathaniel Beck) were simultaneously exploring Bayesian neural networks in the context of rare events. However, the logit model adjustments that King and Zeng suggest do not generalize to non-logistic functions, and their own neural net models simply used ensembling and Bayesian shrinkage to avoid overfitting rather than making a generalizable adjustment to the loss function.

At the same time, the authors do note that, even with a bias-corrected logit estimator, predicted probabilities calculated the usual way (using the inverse logit link function) remain too small. We imagine this is not a logit-specific result. The King and Zeng solution is again logit specific, or at least requires a covariance matrix for the model parameters. But it suggests that future research into predictive models of rare events should consider either inflating the probabilities or moving down the threshold on the probability scale for declaring an event to be predicted. The latter can be accomplished, for example, using a careful ROC curve analysis that is specifically calibrated to the trade-off between false positives and false negatives.

Finally, note that the King and Zeng method is not the only statistical approach to adjusting for rare events. Firth’s (1993) penalized likelihood, easily implemented using the `brglm`

package for R, introduces a penalization parameter to the usual likelihood function. Nonetheless, the additional adjustment King and Zeng suggest for predicted probabilities is intriguing and may be considered as complementary to the Firth method. That is, the model can be fit using the Firth method (rather than King and Zeng’s suggested estimator), but predicted probabilities can be given a post hoc adjustment based on King and Zeng’s formula to improve accuracy of forecasting new events.

The remainder of this blog post describes the King and Zeng modified estimators within the context of traditional logistic regression modeling.

### The King and Zeng Estimators

King and Zeng’s article initially focuses on data gathering and notes that, for rare events, substantial cost savings can be had by undersampling the non-events. So long as a weighted version of the logistic estimator is used (with weights based on the proportion of events in the sample and in the population), unbiased parameter estimates can still be had. Even if undersampling of non-events is not used, however, there are consequences to proceeding simply with the usual logit model. Specifically, the following two points are made:

- Parameters for logistic regression are well known to be biased in small samples, but the same bias can exist in large samples if the event is rare.
- Even a bias-corrected estimator for the model parameters does not necessarily lead to optimal predicted probabilities.

If we can determine the amount of bias in a parameter estimate, we can simply subtract it out.

\[\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} - bias(\hat{\boldsymbol{\beta}})\] Based on McCullagh and Nelder’s (1989) foundational work on generalized linear models, the bias for any GLM is:

\[\text{bias}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\xi\] where \(\xi\) is a function of the first and second derivatives of the inverse link function and \(\mathbf{W}\) is, in the logit case, a function of the observed rate of events. The innovation of the King and Zeng estimator given undersampling is to respecify the usual logistic likelihood to be weighted based on the known proportion of events in the population and the proportion of events in the sample. Hence, the estimator adjusts for non-random case selection introduced by the case-control sampling, if used. In the absence of down-sampling non-events, all cases are weighted equally (\(w_i = 1\)), and the bias can be calculated without the weights.

King and Zeng demonstrate how the bias operates on the intercept term in a simple model with one predictor. The bias can be shown (their Appendix D) to be

\[E(\hat{\beta}_0 - \beta_0) \approx \frac{\bar{\pi} - 0.5}{n\bar{\pi}(1 - \bar{\pi})}\]

where \(\bar{\pi}\) is the proportion of events in the data. For rare events, \(\bar{\pi}\) will be less than .5, making the bias negative. The denominator shows that bias is driven by two factors, sample size and the proportion of events. As \(n\) increases, or as \(\bar{\pi}\) increases towards .5 (an even distribution of events and non-events), the bias gets smaller because the denominator gets bigger. Predicted probabilities calculated from the biased estimate will then tend to be too small, as moving the intercept up increases predicted probabilities.

However, even this adjustment may still yield probability estimates that are too low. This becomes clear when one recalls that logistic regression is a linear model of a hypothetical unobserved latent variable, \(y^*\) (see a prior blog post), which takes on a value of zero below a threshold \(\tau\) and one above it. Given a model-based distribution of possible values on the \(y^*\) scale for case *i*, the predicted probability is the area under the curve above the threshold.

Assuming a logistic distribution, this leads to the usual estimator for the predicted probabilities, \(\frac{1}{1 + e^{-\mathbf{x}_i\boldsymbol{\beta}}}\). However, the \(\boldsymbol{\beta}\) are estimated with some amount of uncertainty, whereas the usual calculation assumes that the \(\boldsymbol{\beta}\) are known perfectly. An alternative estimator proposed by King and Zeng averages over the uncertainty in \(\boldsymbol{\beta}\), which turns out to be equivalent to widening the prediction interval on \(y^*\).

To illustrate, the model makes a prediction for case \(i\) on the scale for \(y^*\). The probability is the area under the distribution that exceeds the threshold.

By averaging over the uncertainty in \(\boldsymbol{\beta}\), the distribution becomes wider, and a larger amount of the area under the curve exceeds the threshold.

In the first figure, the area under the curve above the threshold (i.e. the probability of an event) was 0.023. In the second figure, the variance increased, and the area under the curve above the threshold (or probability of an event) grew to 0.091. In other words, upping the variance for \(y^*\) increased the predicted probability that an event occurs.

Probabilities estimated in the context of rare events tend to be low, and it may even be the case that no probabilities exceed .5. By making an adjustment to the predicted probability estimates, in this case one that is based on the covariance matrix of the estimated parameters, we end up (correctly) with more predicted cases. King and Zeng’s adjusted probability estimator is

\[ p(Y_i = 1) \approx \tilde{\pi}_i + C_i\]

where

\[C_i = (0.5 - \tilde{\pi_i})\tilde{\pi_i}(1 - \tilde{\pi_i})\mathbf{X}V(\tilde{\boldsymbol{\beta}})\mathbf{X}^T \]

The tilde ~ is used to represent the fact that the probabilities come from the bias-adjusted logit model described above.

### A Demonstration in R

Say we want to predict an event that occurs 3% of the time in the population. How well do we do with out-of-sample predictions using the regular logit model versus the bias-corrected version?

To answer this question, we’ll generate fake data from a simple model. First, generate a random independent variable \(x_1 \sim \mathcal{N}(0, 1)\). \(Y^*\), the latent continuous value of the observed outcome, will be a linear function of this variable plus some noise drawn from a standard normal distribution, \(Y^* = \beta_0 + \beta_1 x_1 + \mathcal{N}(0, 1)\) (yes, the data generating process is more probit than logit, but variances of 1 are easier to work with than \(\frac{\pi^2}{3}\)). Finally, \(y_i\) is a discrete representation of \(Y_i^*\) such that, if \(Y_i^* >= 0\) then \(y_i = 1\). Otherwise \(Y_i = 0\).

The following code generates the data.

```
set.seed(12345)
signal_to_noise <- .2
n <- 750
prop_ones <- .03
beta_1 <- sqrt(signal_to_noise)
dat <- data_frame(x1 = rnorm(n),
intercept = qnorm(prop_ones, sd = sqrt(beta_1 ^ 2 + 1)),
y_signal = beta_1 * x1,
y_star = intercept + y_signal + rnorm(n),
y = y_star >= 0,
true_prob = pnorm(y_signal + intercept))
```

`signal_to_noise`

, `n`

, and `prop_ones`

are inputs we could theoretically vary.

`n`

is population size, here set to 750. We’ll assess the out-of-sample accuracy by randomly setting aside 1/3 of the cases as a test set.`signal_to_noise`

is the ratio of variation explainable by the predictor compared to the noise term, here set to .2.

`prop_ones`

is the proportion of events in the data, here set as 3%.

```
dat %>%
count(y) %>%
mutate(prop = round(n/sum(n),2))
```

```
## # A tibble: 2 x 3
## y n prop
## <lgl> <int> <dbl>
## 1 FALSE 726 0.97
## 2 TRUE 24 0.03
```

Next, split the data into 66% for model fitting and 34% for out of sample prediction, then fit the traditional logistic GLM.

```
train <- dat %>%
sample_frac(.66)
test <- dat %>%
anti_join(train, by = c("x1", "y_star", "y", "true_prob"))
X_train <- train %>%
select(x1)
X_test <- test %>%
select(x1)
fit <- glm(as.factor(y) ~ x1, data = train, family = "binomial")
```

Extract coefficients and the coefficient covariance matrix.

```
coefs <- fit %>%
coef() %>%
as.numeric()
V <- vcov(fit)
```

Although the `predict.glm`

method will return predicted probabilities, we’ll create our own function that can be fed the bias-corrected estimates.

```
logisticPred <- function(X, coef) {
X %>%
na.omit() %>%
mutate(int = 1) %>%
select(int, everything()) %>%
as.matrix(.) %*% coef %>%
as.vector() %>%
(function(x) 1 / (1 + exp(-x)))
}
```

Using the training and test data:

```
pred <- X_train %>%
logisticPred(coefs)
test_pred <- X_test %>%
logisticPred(coefs)
```

The following figure shows the distribution of probabilities calculated for the test set based on the `glm`

fit to the training data. The `x`

s along the horizontal axis represent cases where an event actually occurred.

None of the probabilities come close to the .5 threshold, so no events are predicted in the test data.

What happens if we make the recommended adjustments to the parameter estimates? The formulas from the King and Zeng paper require linear algebra, so convert the data frame to a matrix. `bias`

is a column in the design matrix representing the intercept.

```
X_matrix <- X_train %>%
mutate(bias = 1) %>%
select(bias, everything()) %>%
as.matrix()
X_test_matrix <- X_test %>%
mutate(bias = 1) %>%
select(bias, everything()) %>%
as.matrix()
```

Use formulas to calculate the amount of bias in the coefficients.

```
W <- diag(pred * (1 - pred))
Q <- X_matrix %*% solve(t(X_matrix) %*% W %*% X_matrix) %*% t(X_matrix)
e <- 0.5 * diag(Q) * (2 * pred - 1)
bias <- (solve(t(X_matrix) %*% W %*% X_matrix) %*% t(X_matrix) %*% W %*% e) %>%
as.numeric()
```

Now subtract the bias and get the corrected variance.

```
unbiased_coefs <- coefs - bias
updated_var <- (nrow(X_train) / (nrow(X_train) + ncol(X_train) + 1)) ^ 2 * V
```

To check our algebra, compare the results to what the `Zelig`

package produces:

```
re_model <- zelig(y ~ x1, data = train, model = "relogit", cite = FALSE)
unbiased_coefs
```

`## [1] -3.909300 1.170486`

`coef(re_model)`

```
## (Intercept) x1
## -3.909300 1.170486
```

`sqrt(diag(updated_var))`

```
## (Intercept) x1
## 0.3788302 0.2743757
```

`sqrt(diag(as.matrix(vcov(re_model)[[1]])))`

```
## (Intercept) x1
## 0.3803608 0.2754843
```

The parameter estimates match, but the standard errors are different. Apparently, `zelig`

reports bias corrected parameters but reports SEs from the regular GLM (!), as seen in the output from the next two code chunks.

```
zelig(y ~ x1, data = train, model = "relogit", cite = FALSE) %>%
summary
```

```
## Model:
##
## Call:
## z5$zelig(formula = y ~ x1, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1729 -0.2897 -0.1987 -0.1343 3.0611
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9093 0.3804 -10.278 < 2e-16
## x1 1.1705 0.2755 4.249 2.15e-05
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.65 on 494 degrees of freedom
## Residual deviance: 132.70 on 493 degrees of freedom
## AIC: 136.7
##
## Number of Fisher Scoring iterations: 7
##
## Next step: Use 'setx' method
```

```
glm(y ~ x1, data = train, family = "binomial") %>%
summary
```

```
##
## Call:
## glm(formula = y ~ x1, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1770 -0.2815 -0.1915 -0.1283 3.0905
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9838 0.3804 -10.474 < 2e-16 ***
## x1 1.1958 0.2755 4.341 1.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.65 on 494 degrees of freedom
## Residual deviance: 132.70 on 493 degrees of freedom
## AIC: 136.7
##
## Number of Fisher Scoring iterations: 7
```

Now we’ll make predictions on the test set *without* the additional adjustment for predicted probabilities but *with* the bias-corrected coefficients.

```
unbiased_coef_pred <- X_test %>%
logisticPred(unbiased_coefs)
```

How many predicted events do we have?

We haven’t done much better. What if we now add the correction to the estimate of the predicted probability? To facilitate calculations, let \(\zeta = (.5 - \tilde{\pi}_i)\tilde{\pi}_i(1 - \tilde{\pi}_i)\) and \(\eta = \mathbf{x}V(\tilde{\boldsymbol{\beta}})\mathbf{x}^T\).

```
new_X_test <- test %>%
as_data_frame %>%
mutate(p = unbiased_coef_pred) %>%
mutate(zeta = (.5-p)*p*(1-p),
eta = (1*updated_var[1,1] + x1*updated_var[1,2])*1 +
(1*updated_var[2,1] + x1*updated_var[2,2])*x1) %>%
mutate(updated_p = p + zeta*eta) %>%
mutate(event = updated_p > .5)
corrected_pred <- unlist(new_X_test$updated_p)
```

And the distribution of probabilities.

In this fake data set, we still failed to predict a single case, though we’ve slightly pushed up the probabilities for events.

King and Zeng provide results from a variety of Monte Carlo simulations that vary the proportion of events, the sample size, and the value of \(\beta_0\) and \(\beta_1\). The results find that very rare events (e.g. 0.15% of cases) are hard to predict even with the corrected method, whereas there is not much difference when the signal-to-noise ratio is very high. Somewhat less rare events predicted from noisier models seem to benefit the most from the adjustment.

It is well known that the logistic model does an inferior job at predicting out-of-sample events relative to machine learning methods like random forests and GBMs, and we will continue to use these when the our projects emphasize accurate forecasts. On the other hand, combining the bias-corrected estimator with propensity-score matched or weighted samples is an option worth considering in statistical projects.