# Bias Adjustment for Rare Events Logistic Regression in R

Jeremy Albright


Rare events are often of interest in statistics and machine learning. Mortality caused by a prescription drug may be uncommon but of great concern to patients, providers, and manufacturers. Predictive models in finance may be focused on forecasting when equities move substantially, something quite rare relative to the more quotidian shifts in prices. Logistic-type models (logit models in econometrics, neural nets with sigmoidal activation functions) will tend to underestimate the probability of these events occurring. After all, if an event occurs 1% of the time, a model that says no cases will ever experience the event will demonstrate 99% accuracy.

This problem was addressed several years ago in the statistical literature by Gary King and Langche Zeng (2001). In machine learning, the problem is typically addressed by down-sampling the non-events to balance the distribution of the outcome. However, as King and Zeng show, this approach is akin to a case-control design in epidemiology, for which standard logistic-based classifiers are well known to be biased. In addition, even when optimizing a bias-corrected likelihood function, the predicted probabilities may still be too small and require a further adjustment based on the sample-to-sample variability in the parameter estimates.

We were interested in better understanding this method for its own sake, given that we are frequently asked to predict relatively rare adverse events following, say, surgery. We were also interested in what this approach can suggest for machine learning, especially when down-sampling of non-events is used. Indeed, King and Zeng (along with Nathaniel Beck) were simultaneously exploring Bayesian neural networks in the context of rare events. However, the logit model adjustments that King and Zeng suggest do not generalize to non-logistic functions, and their own neural net models simply used ensembling and Bayesian shrinkage to avoid overfitting rather than making a generalizable adjustment to the loss function.

At the same time, the authors do note that, even with a bias-corrected logit estimator, predicted probabilities calculated the usual way (using the inverse logit link function) remain too small. We imagine this is not a logit-specific result. The King and Zeng solution is again logit specific, or at least requires a covariance matrix for the model parameters. But it suggests that future research into predictive models of rare events should consider either inflating the probabilities or lowering the threshold on the probability scale at which an event is declared predicted. The latter can be accomplished, for example, using a careful ROC curve analysis that is specifically calibrated to the trade-off between false positives and false negatives.

Finally, note that the King and Zeng method is not the only statistical approach to adjusting for rare events. Firth’s (1993) penalized likelihood, easily implemented using the brglm package for R, introduces a penalization parameter to the usual likelihood function. Nonetheless, the additional adjustment King and Zeng suggest for predicted probabilities is intriguing and may be considered as complementary to the Firth method. That is, the model can be fit using the Firth method (rather than King and Zeng’s suggested estimator), but predicted probabilities can be given a post hoc adjustment based on King and Zeng’s formula to improve accuracy of forecasting new events.

The remainder of this blog post describes the King and Zeng modified estimators within the context of traditional logistic regression modeling.

## The King and Zeng Estimators

King and Zeng’s article initially focuses on data gathering and notes that, for rare events, substantial cost savings can be realized by undersampling the non-events. So long as a weighted version of the logistic estimator is used (with weights based on the proportion of events in the sample and in the population), unbiased parameter estimates can still be obtained. Even if undersampling of non-events is not used, however, there are consequences to proceeding simply with the usual logit model. Specifically, King and Zeng make the following two points:

• Parameters for logistic regression are well known to be biased in small samples, but the same bias can exist in large samples if the event is rare.
• Even a bias-corrected estimator for the model parameters does not necessarily lead to optimal predicted probabilities.
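The prior-correction idea for undersampled data described above can be sketched in base R. This is an illustrative simulation with our own variable names, not code from King and Zeng: only the intercept needs adjusting, by subtracting $$\ln\left[\frac{1-\tau}{\tau}\,\frac{\bar{y}}{1-\bar{y}}\right]$$, where $$\tau$$ is the population event proportion and $$\bar{y}$$ the sample proportion.

```r
# Illustrative sketch: prior correction after undersampling non-events.
set.seed(42)
n <- 100000
x <- rnorm(n)
true_b0 <- -4; true_b1 <- 1
y <- rbinom(n, 1, plogis(true_b0 + true_b1 * x))
tau <- mean(y)                      # population event rate (rare)

# Case-control style sample: keep all events, an equal number of non-events
events <- which(y == 1)
nonev  <- sample(which(y == 0), length(events))
d      <- data.frame(y = y[c(events, nonev)], x = x[c(events, nonev)])

fit  <- glm(y ~ x, family = binomial, data = d)
ybar <- mean(d$y)                   # sample event rate (0.5 by design)

# Prior correction: only the intercept is affected by the sampling design
b0_raw  <- coef(fit)[1]
b0_corr <- b0_raw - log(((1 - tau) / tau) * (ybar / (1 - ybar)))
```

The uncorrected intercept is badly biased by the balanced sampling; the corrected one lands near the population value, while the slope needs no adjustment.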

If we can determine the amount of bias in a parameter estimate, we can simply subtract it out.

$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} - \text{bias}(\hat{\boldsymbol{\beta}})$$

Based on McCullagh and Nelder’s (1989) foundational work on generalized linear models, the bias for any GLM is:

$$\text{bias}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\xi$$

where $$\xi$$ is a function of the first and second derivatives of the inverse link function and $$\mathbf{W}$$ is, in the logit case, a function of the observed rate of events. The innovation of the King and Zeng estimator given undersampling is to respecify the usual logistic likelihood to be weighted based on the known proportion of events in the population and the proportion of events in the sample. Hence, the estimator adjusts for non-random case selection introduced by the case-control sampling, if used. In the absence of down-sampling non-events, all cases are weighted equally ($$w_i = 1$$), and the bias can be calculated without the weights.
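In the unweighted case ($$w_i = 1$$), the bias term can be computed directly from a fitted `glm` object. The sketch below follows the formula above, taking $$\xi_i = Q_{ii}(\hat{\pi}_i - 0.5)$$ with $$\mathbf{Q} = \mathbf{X}(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}$$ and $$\mathbf{W} = \text{diag}\{\hat{\pi}_i(1-\hat{\pi}_i)\}$$; the function name is our own.

```r
# Sketch: McCullagh-Nelder bias for an unweighted logit fit (w_i = 1).
logit_bias <- function(fit) {
  X  <- model.matrix(fit)
  pi <- fitted(fit)
  W  <- diag(pi * (1 - pi))                # logit-case weight matrix
  XWX_inv <- solve(t(X) %*% W %*% X)
  Q  <- X %*% XWX_inv %*% t(X)
  xi <- diag(Q) * (pi - 0.5)               # xi in the unweighted case
  drop(XWX_inv %*% t(X) %*% W %*% xi)      # (X'WX)^{-1} X'W xi
}

# Illustrative use on simulated rare-event data
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 + x))
fit <- glm(y ~ x, family = binomial)
beta_tilde <- coef(fit) - logit_bias(fit)  # bias-corrected estimates
```

Subtracting the bias from `coef(fit)` gives the corrected estimates $$\tilde{\boldsymbol{\beta}}$$.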

King and Zeng demonstrate how the bias operates on the intercept term in a simple model with one predictor. The bias can be shown (their Appendix D) to be

$$E(\hat{\beta}_0 - \beta_0) \approx \frac{\bar{\pi} - 0.5}{n\bar{\pi}(1 - \bar{\pi})}$$

where $$\bar{\pi}$$ is the proportion of events in the data. For rare events, $$\bar{\pi}$$ will be less than .5, making the bias negative. The formula shows that the bias is driven by two factors: sample size and the proportion of events. As $$n$$ increases, the denominator grows and the bias shrinks; as $$\bar{\pi}$$ moves towards .5 (an even distribution of events and non-events), the numerator approaches zero and the bias again shrinks. Because the bias in the intercept is negative, subtracting it out moves the intercept up, which increases the predicted probabilities; probabilities calculated from the uncorrected estimate will therefore tend to be too small.
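The approximation is easy to evaluate numerically. A quick base R check (function name is our own) shows both effects:

```r
# Approximate intercept bias for a given sample size and event rate
intercept_bias <- function(n, pibar) {
  (pibar - 0.5) / (n * pibar * (1 - pibar))
}

intercept_bias(1000, 0.01)    # rare events, modest n: noticeably negative
intercept_bias(1000, 0.50)    # balanced outcome: exactly zero
intercept_bias(100000, 0.01)  # same rarity, much larger n: near zero
```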

However, even this adjustment may still yield probability estimates that are too low. This becomes clear when one recalls that logistic regression is a linear model of a hypothetical unobserved latent variable, $$y^*$$ (see a prior blog post); the observed outcome takes on a value of zero when $$y^*$$ falls below a threshold $$\tau$$ and one when it falls above. Given a model-based distribution of possible values on the $$y^*$$ scale for case $$i$$, the predicted probability is the area under the curve above the threshold.

Assuming a logistic distribution, this leads to the usual estimator for the predicted probabilities, $$\frac{1}{1 + e^{-\mathbf{x}_i\boldsymbol{\beta}}}$$. However, the $$\boldsymbol{\beta}$$ are estimated with some amount of uncertainty, whereas the usual calculation assumes that the $$\boldsymbol{\beta}$$ are known perfectly. An alternative estimator proposed by King and Zeng averages over the uncertainty in $$\boldsymbol{\beta}$$, which turns out to be equivalent to widening the prediction interval on $$y^*$$.
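The resulting adjustment adds to each fitted probability a correction term that grows with the estimated uncertainty in $$\boldsymbol{\beta}$$. A hedged sketch, using the form $$C_i = (0.5 - \tilde{\pi}_i)\,\tilde{\pi}_i(1-\tilde{\pi}_i)\,\mathbf{x}_i V(\boldsymbol{\beta})\mathbf{x}_i^{T}$$ with the model's estimated covariance matrix (the function name is our own):

```r
# Sketch of the probability adjustment: add a correction C_i that
# scales with x_i' V x_i, the parameter uncertainty at case i.
adjust_probs <- function(fit) {
  X <- model.matrix(fit)
  p <- fitted(fit)
  V <- vcov(fit)
  Ci <- (0.5 - p) * p * (1 - p) * rowSums((X %*% V) * X)  # x_i' V x_i per row
  p + Ci
}

# Illustrative use on simulated rare-event data
set.seed(456)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 + x))
fit   <- glm(y ~ x, family = binomial)
p_adj <- adjust_probs(fit)
```

Note that for cases with fitted probability below .5 the correction is positive, so the adjustment inflates exactly the small probabilities that concern us here.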

To illustrate, the model makes a prediction for case $$i$$ on the scale for $$y^*$$. The probability is the area under the distribution that exceeds the threshold.
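This area interpretation can be verified numerically: integrating the logistic density centered at the linear predictor over the region above the threshold recovers the usual inverse-logit probability. The values below are illustrative choices, not from the source.

```r
# The predicted probability as an area on the y* scale:
# mass of a logistic density centered at xb lying above threshold tau.
xb  <- -2.5    # linear predictor for some hypothetical case i
tau <- 0       # threshold on the latent scale
area <- integrate(dlogis, lower = tau, upper = Inf, location = xb)$value

all.equal(area, plogis(xb - tau))  # agrees with the inverse logit
```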