Delta Method Standard Errors

Jeremy Albright

Posted on
Logistic Predicted Probabilities rstats

Logistic regression produces result that are typically interpreted in one of two ways:

  1. Predicted probabilities
  2. Odds ratios

Odds are the ratio of the probability that something happens to the probabilty it doesn’t happen.

\[ \Omega(X) = \frac{p(y=1|X)}{1-p(y=1|X)} \] An odds ratio is the ratio of two odds, each calculated at a different score for \(X\).

There are strengths and weaknesses to either choice.

  1. Predictored probabilities are intuitive, but require assuming a value for every covariate.
  2. Odds ratios do not require specifying values for other covariates, but ratios of ratios are not always intuitive.

Illustration using 2016 ANES: vote for Trump or Clinton. See prior blog post for details.

trump_model <- glm(vote ~ gender + educ + age, data = anes, 
                   family = binomial(link = "logit"))

estimates <- round(coef(trump_model), 3)

results_tab <- tidy(trump_model) %>%
  mutate_if(is.numeric, funs(round(.,3))) %>%
  var_mapping(term) %>%
  dplyr::rename(Estimate = term, Beta = estimate, SE = std.error, z = statistic, p = p.value)


kable(results_tab, align = c("l",rep("c",4)))
Estimate Beta SE z p
Constant -1.051 0.255 -4.127 0.000
Female -0.374 0.085 -4.384 0.000
Completed HS 0.655 0.231 2.839 0.005
College < 4 Years 0.696 0.218 3.195 0.001
College 4 Year Degree 0.411 0.222 1.853 0.064
Advanced Degree -0.424 0.229 -1.850 0.064
Age 0.015 0.002 6.026 0.000

Interpreting as odds ratios:

  • The odds of voting for Trump are \(100\times[\mbox{exp}(-.374) - 1]\) = 31% lower for women compared to men.
  • The odds of voting for Trump are \(100\times[\mbox{exp}(.655) - 1]\) = 93% higher for those with only a high school diploma compared to those without a high school diploma.
  • The odds of voting for Trump are \(100\times [\mbox{exp}(.696) - 1]\) = 101% higher for those with with some college (but no 4-year degree) compared to those without a high school diploma.
  • Each increase in age of one year leads to a \(100\times [\mbox{exp}(.015) - 1]\) = 2% increase in the odds of voting for Trump.

Interpreting as predicted probabilities:

predict_data <- expand.grid(educ    = levels(anes$educ),
                            gender  = levels(anes$gender),
                            age     = seq(20,90, by = 10))

pred_probs <- predict_data %>%
  mutate(prob_trump = predict(trump_model, newdata = predict_data, 
                              type = "response"))

ggplot(pred_probs, aes(x = age, y = prob_trump, group = educ, color = educ)) + geom_line() +
  facet_grid(~ gender) + labs(x = "Age (Years)", y = "Probability of Voting for Trump") + 
  scale_color_discrete(name="Highest\nEducation")

The problem:

These are sample estimates. How can we assess levels of uncertainty?

For sample estimates, we would like a standard error (SE) or a 95% confidence interval (the former usually used to create the latter).

Start with odds ratios, which at first seems easiest. How can we get a 95% CI around the odds ratio?

  1. The odds ratio is just \(\mbox{exp}(\beta_k)\).
  2. Software gives us a 95% confidence interval around \(\beta_k\).
  3. Create the 95% confidence interval aroune \(\beta_k\) as \(\beta_k \pm 1.96 \times \mbox{SE}_{\beta_k}\).
  4. Exponentiate the lower limit and the upper limit of 95% CI around \(\beta_k\).

This is what is typically done in R with exp(confint(model_object)).

Note that, because exp() is a nonlinear transformation, the resulting confidence intervals will be asymmetric. To aid illustration, add a random noise variable (whose standard error will be large).

set.seed(1)

anes <- anes %>%
  mutate(noise = rnorm(nrow(anes), 0, .05))

Fit the model.

trump_model <- glm(vote ~ gender + educ + age + noise, data = anes, 
                   family = binomial(link = "logit"))

Exponentiate the 95% confidence interval around \(\beta_k\).

confint(trump_model) %>%
  exp
## Waiting for profiling to be done...
##                               2.5 %    97.5 %
## (Intercept)               0.2110361 0.5737587
## genderFemale              0.5820401 0.8131175
## educCompleted HS          1.2277695 3.0374878
## educCollege < 4 Years     1.3127215 3.0913909
## educCollege 4 Year Degree 0.9793854 2.3409015
## educAdvanced Degree       0.4181680 1.0294183
## age                       1.0102303 1.0201813
## noise                     0.2170954 5.4118869

A graph is even better:

cis <- exp(confint(trump_model)) %>%
  as.data.frame %>%
  rownames_to_column("Variable") %>%
  var_mapping(Variable)
## Waiting for profiling to be done...
graph_data <- tidy(trump_model) %>%
  mutate_if(is.numeric, funs(round(.,3))) %>%
  dplyr::rename(Variable = term, Beta = estimate, SE = std.error, z = statistic, p = p.value) %>%
  var_mapping(Variable) %>%
  mutate(OR = exp(Beta)) %>%
  left_join(cis) %>%
  filter(Variable != "Constant")
## Joining, by = "Variable"
ggplot(data = graph_data, aes(x = Variable, y = `OR`, ymin = `2.5 %`, ymax = `97.5 %`)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") + ylab("Odds Ratio with 95% CI") 

Delta Method Standard Errors for Odds Ratios

Alternatively, we can use the SE for the odds ratio to determine a normal (and symmetric) approximation for the 95% CI. But what is the SE for the odds ratio?

An approach known as the delta method is used frequently to come up with standard errors for nonlinear transformations of model parameters.

It is based on computing the variance for a Taylor series linearization of the function.

A Taylor Series rewrites a function at a given location \(a\) as a (possibly infinite) sum of the function’s derivatives.

\[ f(x) = f(x) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \ldots \]

A Taylor series approximation chops off all but the first one or two derivatives. A linear approximation to \(f(x)\) at \(a\) is thus

\[ f(x) = f(x) + f'(a)(x - a) \]

Take a linear transformation of a random variable \(x\).

\[ f(x) = a + bx \]

The variance of \(f(x)\) is known to be

\[ \mbox{Var}(a + bx) = b^2\mbox{Var}(x) \] or

\[ \mbox{Var}(a + bx) = b\mbox{Var}(x)b \] or

\[ \mbox{Var}(a + bx) = f'(x)\mbox{Var}(x)f'(x) \]

Generalizing to any (univariate) differentiable function of a random varaible \(x\) with mean \(\mu\), we can approximate a function of \(x\) at \(\mu\) with

\[ f(x) \approx f(\mu) + f'(x)(x - \mu) \]

meaning that the variance of the function is

\[ \mbox{Var}\left(f(x)\right) = f'(\mu)\mbox{Var}(x)f'(\mu) \]

This generalizes to functions of multiple variables. Simply replace the derivate with the gradient vector and the variance with the variance-covariance matrix.

\[ \mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \]

Going back to logistic regression, our random variables are the sample estimates \(\widehat{\beta}_k\), and our function is \(f(\beta_k) = e^{\beta_k}\). The maximum likelihood estimates are the values for the vector \(\boldsymbol \mu\). The covariance matrix is the covariance matrix of the estimates. Both are easily recovered in R from a glm object.

coef(trump_model)
vcov(trump_model)

The function in which we are interested is

\[ f(\beta_k) = e^{\beta_k} \] The delta-method variance is

\[ f(\beta_k) = f'(\beta_k)\mbox{Var}(\widehat{\beta_k})f'(\beta_k) \]

This turns out to be a pretty simple problem, given that

\[ \frac{d e^x}{dx} = e^x \]

What would the variance be for the odds ratio on the noise term?

  • The estimate \(\beta_{\mbox{noise}}\) was 0.0804698.
  • The variance was 0.6725694

\[ \mbox{Var}\left(\mbox{exp}\left(\beta_{\mbox{noise}}\right)\right) = e^{\beta}\mbox{Var}\left(\hat{\beta}\right)e^{\beta} = 0.79 \]

A normal-approximated 95% confidence interval is found as

\[ 95\% \mbox{ CI} = \beta_k \pm 1.96 \times \mbox{SE}_{\beta_k} \]

where \(\mbox{SE}\) is the square root of the variance.

A function to return odds ratios and confidence intervals based on normal approximations.

OR_SE <- function(glm_object, variable){
  
  d <- exp(coef(glm_object)[variable])
  v <- vcov(glm_object)[variable, variable]
  
  output_df <- as_tibble(list(Estimate = variable,
                              Beta     = coef(glm_object)[variable],
                              OR       = d,
                              `OR SE`  = sqrt(d*v*d))) %>%
    mutate(`Lower CI` = OR - 1.96*`OR SE`,
           `Upper CI` = OR + 1.96*`OR SE`)

  output_df
  
}

Map over all estimates and reduce to tibble.

or_tibble <- map(names(coef(trump_model)), function(i) OR_SE(trump_model, i)) %>%
  reduce(bind_rows)
kable(or_tibble %>% var_mapping(Estimate), align = c("l", rep("c", 5)))
Estimate Beta OR OR SE Lower CI Upper CI
Constant -1.0513062 0.3494810 0.0890344 0.1749736 0.5239883
Female -0.3738035 0.6881121 0.0586800 0.5730993 0.8031249
Completed HS 0.6543854 1.9239596 0.4436975 1.0543126 2.7936067
College < 4 Years 0.6962532 2.0062217 0.4372836 1.1491458 2.8632976
College 4 Year Degree 0.4109402 1.5082351 0.3344887 0.8526372 2.1638330
Advanced Degree -0.4244709 0.6541158 0.1500084 0.3600993 0.9481323
Age 0.0150621 1.0151761 0.0025378 1.0102021 1.0201501
0.0804698 1.0837961 0.8888248 -0.6583004 2.8258927
cis <- exp(confint(trump_model)) %>%
  as.data.frame %>%
  rownames_to_column("Variable") %>%
  var_mapping(Variable)
## Waiting for profiling to be done...
graph_data_2 <- tidy(trump_model) %>%
  dplyr::rename(Variable = term, Beta = estimate, SE = std.error, z = statistic, p = p.value) %>%
  mutate_if(is.numeric, funs(round(.,3))) %>%
  var_mapping(Variable) %>%
  mutate(OR = exp(Beta)) %>%
  left_join(cis) %>%
  mutate(Source = "exp(confint(.))") %>%
  dplyr::select(Source, Variable, OR, `Lower CI` = `2.5 %`, `Upper CI` = `97.5 %`) %>%
  bind_rows(or_tibble %>% 
              mutate(Source = "Delta Method") %>%
              dplyr::select(Source, Variable = Estimate, OR, `Lower CI`, `Upper CI`) %>%
              var_mapping(Variable)) %>%
  mutate(Source = factor(Source, levels = c("exp(confint(.))", "Delta Method"))) %>%
  filter(Variable != "Constant")
## Joining, by = "Variable"
ggplot(data = graph_data_2, aes(x = Variable, y = OR, ymin = `Lower CI`, ymax = `Upper CI`)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") + ylab("Odds Ratio with 95% CI") + 
  facet_grid(~ Source)

Which is the correct method?

  • Delta method provides a standard error for the odds ratio, which can be used to create a normal-approximated (i.e. symmetric) confidence interval.
  • But delta method confidence intervals can also extend into negative territory.

What does Stata do?

  • Stata reports standard errors for odds ratios determined by the delta method.
  • But its 95% confidence intervals around the odds ratios are based on \(\mbox{exp}(\beta \pm 1.96*\mbox{SE}_{\beta})\).

That is, the standard error is the delta method, but the confidence intervals are equal to Rs exp(confint(model_object))!

Let’s export our data to Stata and take a look.

haven::write_dta(anes, "anes-with-noise.dta")
. logistic vote i.gender i.educ age noise

Logistic regression                             Number of obs     =      2,368
                                                LR chi2(7)        =     147.10
                                                Prob > chi2       =     0.0000
Log likelihood = -1565.3592                     Pseudo R2         =     0.0449

----------------------------------------------------------------------------------------
                  vote | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
                gender |
               Female  |   .6881121     .05868    -4.38   0.000      .582199    .8132929
                       |
                  educ |
         Completed HS  |    1.92396   .4436975     2.84   0.005     1.224319    3.023412
    College < 4 Years  |   2.006222   .4372836     3.19   0.001     1.308723     3.07546
College 4 Year Degree  |   1.508235   .3344887     1.85   0.064     .9765486    2.329401
      Advanced Degree  |   .6541158   .1500084    -1.85   0.064     .4173001    1.025323
                       |
                   age |   1.015176   .0025378     6.03   0.000     1.010214    1.020162
                 noise |   1.083796   .8888249     0.10   0.922     .2172073    5.407803
                 _cons |    .349481   .0890344    -4.13   0.000     .2121143    .5758072
----------------------------------------------------------------------------------------

Comparing the standard errors from Stata to our delta method SEs, we find the two match.

But the delta method CIs do not match. In fact, Stata’s confidence intervals are close to R’s exp(confint()) results.

Understanding Stata’s output for logit models with results reported as odds ratios:

  • The CIs are exponentiated \(\widehat{\beta_k} \pm 1.96 \times \mbox{SE}_{\widehat{\beta_k}}\).
  • The p-values reported for the odds ratio come from \(z = \frac{\widehat{\beta_k}}{\mbox{SE}_{\widehat{\beta_k}}}\). To see this, compare the p-values with and without the or option to the .logit command.

If the sampling distribution is asymmetric, then the confidence interval should be asymmetric. We can use the nonparametric bootstrap to get a sense of the shape of the sampling distribution. The R code applied to the ANES data is:

or_boot_func <- function(form, data, indices) {
  
  data <- data[indices,]
  ors  <- exp(coef(glm(form, data = data, family = "binomial")))
  
  ors
}  
  
or_boot_values <- boot(data = anes, statistic = or_boot_func,
                       R = 1000,
                       form = vote ~ gender + educ + age + noise) 

boot_ci <- apply(or_boot_values$t, 2, function(i) {
  
  as_tibble(list(Estimate = mean(i),
                 Lower    = quantile(i, .025),
                 Upper    = quantile(i, .975)))
}) %>%
  reduce(bind_rows)
boot_ci %>%
  mutate(Variable = c("Constant", graph_data$Variable)) %>%
  filter(Variable != "Constant") %>%
  ggplot(aes(x = Variable, y = `Estimate`, ymin = `Lower`, ymax = `Upper`)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") + ylab("Bootrapped Mean Odds Ratio and 2.5-97.5 Pecentiles") 

Delta Method Standard Errors for Predicted Probabilities

It makes most sense to stick with exp(confint(glm_object)) for confidence intervals around ORs, since this is a trivial task. But what about other quantities that are functions of model parameters, such as predicted probabilities?

Predicted probabilities are obtained from the results of a logit model by:

\[ \begin{array} \mbox{Pr}(y=1|x) &= F(x\beta) \\ & = \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)} \end{array} \]

What is the prediction for a 55-year-old male who finished high school but did not go to college (with average noise)? First, get \(x\beta\).

estimates <- coef(trump_model)

xb        <- estimates[1] + estimates[2]*0 + estimates[3]*1 + estimates[4]*0 + estimates[5]*0 + 
             estimates[6]*0 + estimates[7]*55 + estimates[8]*0
(xb <- as.numeric(round(xb, 3)))
## [1] 0.431

\[ \begin{array} \mbox{Pr}(y=1|x) &= \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)} \\ & = \frac{\mbox{exp}(0.431)}{1 + \mbox{exp}(0.431)} \\ & = 0.606 \end{array} \]

Equivalently.

round(plogis(xb),3)
## [1] 0.606

The predicted probabilty is a function of all parameter estimates, so we need to use the matrix version of the Taylor series approximation to get our SE.

\[ \mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \]

First, determine the gradient vector for the \(\beta_k\).

We’ve used \(F()\) to represent the standard logistic cdf. Let \(f()\) be the standard logistic pdf.

By the chain rule, and by the fact that the first derivative of a cdf is its pdf,

\[ \begin{array} \frac{\partial F(x'\beta)}{\partial \beta} &= \frac{\partial F((x' \beta))}{\partial x' \beta} \\ &= f(x'\beta)x \end{array} \]

Define the \(x\) vector for our subject of interest along with the beta vector from the model results.

x    <- c(1,0,1,0,0,0,55,0)
beta <- coef(trump_model)

cbind(beta,x)
##                                  beta  x
## (Intercept)               -1.05130623  1
## genderFemale              -0.37380350  0
## educCompleted HS           0.65438538  1
## educCollege < 4 Years      0.69625321  0
## educCollege 4 Year Degree  0.41094016  0
## educAdvanced Degree       -0.42447089  0
## age                        0.01506207 55
## noise                      0.08046982  0

And \(x'\beta\) is

xb <- t(x)%*%beta

The variance is thus:

\[ \begin{array} \mbox{Var}(p(\mbox{Vote} = \mbox{Trump})) &= \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \\ &= f(x'\beta)x'\mbox{Cov}(\widehat{\beta})xf(x'\beta) \end{array} \]

Calculating “by hand” in R:

dlogis(xb)%*%t(x)%*%vcov(trump_model)%*%x%*%dlogis(xb) %>%
  sqrt
##            [,1]
## [1,] 0.02747443

The deltamethod function in the msm package will also do this for you. The arguments are

  1. The function expressed as a formula
  2. A vector of means, here \(\mu = \beta\)
  3. The covariance matrix
deltamethod(
  ~ exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0)/
          (1 + exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0)), 
  coef(trump_model), vcov(trump_model))
## [1] 0.02747443

This is the standard error for our predicted probability.

One catch: because the derivatives are found symbolically, you are limited to what stats::deriv knows. For example, the following would be simpler to code.

deltamethod(~ plogis(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0), 
            coef(trump_model), vcov(trump_model))

But R throws an error, Function plogis is not in the derivatives table.

By the way, this is where Stata truly shines and why I still turn to it despite protestations that “R can do everything.” Here are all of the predicted probabilities, with standard errors, for all combinations of education and gender, for 55 year-olds.

. margins educ, over(gender) at(age = 55 noise = 0)


---------------------------------------------------------------------------------------
                      |            Delta-method
                      |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
          gender#educ |
  Male#HS Incomplete  |   .4445065   .0514349     8.64   0.000     .3436959     .545317
    Male#HS Complete  |   .6062301   .0274744    22.07   0.000     .5523812     .660079
   Male#Some College  |   .6161789   .0205271    30.02   0.000     .5759466    .6564112
          Male#BA/BS  |   .5468739   .0231694    23.60   0.000     .5014627    .5922851
         Male#Higher  |    .343584   .0251349    13.67   0.000     .2943204    .3928475
Female#HS Incomplete  |      .3551   .0482039     7.37   0.000     .2606221    .4495779
  Female#HS Complete  |   .5144184   .0278629    18.46   0.000     .4598081    .5690286
 Female#Some College  |   .5248688   .0201584    26.04   0.000      .485359    .5643786
        Female#BA/BS  |   .4536941   .0231244    19.62   0.000     .4083711    .4990171
       Female#Higher  |   .2648002   .0215839    12.27   0.000     .2224966    .3071038
---------------------------------------------------------------------------------------

Type one more word, and this becomes a pretty graph.

. marginsplot

Again, the delta method gives us an approximation that may not be accurate.

Predicted probabilities are bounded by zero and one, yet a delta method CI may extend below or above these limits.

Alternatives:

  1. Boostrapping
  2. Simulations by repeated draws from \(\boldsymbol \beta \sim \mathcal{N}\left(\widehat{\boldsymbol \beta}, \mbox{Cov}(\widehat{\boldsymbol \beta})\right)\).

Both are computationally intensive and may be less attractive for biger data sets than the delta method.