The forcats Package and How to Use It

Intro

This post covers the forcats package and how to use it to work with factor (categorical) variables. Although decent documentation exists, it can be too terse for someone encountering the package’s functions for the first time. This post is meant to provide a fuller narrative description and examples of how to use the package. Amelia McNamara’s outstanding presentation from the 2019 RStudio conference is also worth watching to understand the background, motivation, and usefulness of forcats. The aim here is an easily Googleable summary of the essential information.

Factors are an important variable type in R because they determine the order in which categorical variables are plotted and how contrasts (such as dummy variables) are coded when fitting statistical models. Base R functions for working with factors exist but are error-prone, even for experienced users. To demonstrate, we’ll use items from the General Social Survey (GSS), a commonly used source of data in social science that is full of categorical variables. In fact, the forcats installation includes a selection of GSS variables from the 2000-2014 waves, which we will use in this blog post.
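One well-known example of the kind of base R pitfall meant here is converting a factor that holds numbers back to numeric. The short sketch below uses a made-up years vector (not part of the GSS data) and only base R:

```r
# Hypothetical vector of survey years stored as a factor
years <- factor(c("2014", "2000", "2014", "2006"))

# as.numeric() returns the internal integer codes, not the years
codes <- as.numeric(years)
codes

# Converting to character first recovers the original values
values <- as.numeric(as.character(years))
values
```

The first call returns the positions of the levels rather than the years themselves, and R issues no warning, so the mistake is easy to miss.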

First, let’s load the packages we’ll use. The gss_cat dataset loads along with forcats. Our interest will be in the relig variable, which contains respondents’ religious affiliations. The relig variable is already coded in the gss_cat data as a factor variable. However, a common situation researchers encounter is that they first receive a version of the data file in which the categories are coded as integers, and labels must be applied manually. To mimic this typical workflow, we’ll immediately convert the factor version to numeric and begin with the numeric version.

library(haven)
library(tidyverse)
library(forcats)
library(knitr)

gss_cat2 <- gss_cat %>%
  mutate(
    relig = as.numeric(relig)) %>%
  filter(year == 2014)

Useful Functions

The functions discussed in this blog post will be:

  • as_factor: changes a variable from some type to a factor
  • fct_recode: changes the coding of values in a factor
  • fct_count: provides a descriptive table for factors
  • fct_lump: lumps multiple smaller levels into a single other category
  • fct_collapse: allows the user to collapse factor levels into defined groups
  • fct_inorder: reorders factors by appearance
  • fct_infreq: reorders factors by frequency
  • fct_relevel: reorders factors by hand

as_factor

If we were to plot the distribution of religion, it would be unclear which category is which.

gss_cat2 %>% 
  ggplot(aes(x=relig)) + 
  geom_bar(color = 'black', fill = 'firebrick')  +
  xlab("Religion") +
  ylab("Count")

The category that each integer refers to is as follows:

  1. No answer
  2. Don’t know
  3. Inter-nondenominational
  4. Native american
  5. Christian
  6. Orthodox-christian
  7. Moslem/islam
  8. Other eastern
  9. Hinduism
  10. Buddhism
  11. Other
  12. None
  13. Jewish
  14. Catholic
  15. Protestant
  16. Not applicable

However, this is not intuitive, and someone just looking at the data would have no way to know what each integer stands for.

First, use as_factor from forcats to convert the numeric relig variable to a factor.

gss_cat2 <- gss_cat2 %>%
  mutate(
    relig = as_factor(relig)) 
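As a side note on level order, here is a minimal sketch of how as_factor differs from base R's factor(). The vectors are made up for illustration; the behavior shown (levels in order of first appearance for character input, numeric order for numeric input) follows the forcats documentation.

```r
library(forcats)

# Made-up character vector for illustration
x <- c("Protestant", "Catholic", "None", "Catholic")

# Base R sorts the levels alphabetically
base_levels <- levels(factor(x))
base_levels

# as_factor keeps the order in which values first appear
forcats_levels <- levels(as_factor(x))
forcats_levels

# For numeric input, as_factor orders the levels numerically
numeric_levels <- levels(as_factor(c(15, 3, 12)))
numeric_levels
```

This is why the integer codes in our data come out ordered from 1 upward after the conversion.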

However, the levels of relig are still just integers, with no descriptive labels.

gss_cat2 %>%
   count(relig)
## # A tibble: 15 x 2
##    relig     n
##    <fct> <int>
##  1 1        15
##  2 2         3
##  3 3         4
##  4 4         2
##  5 5       134
##  6 6         9
##  7 7         9
##  8 8         3
##  9 9        13
## 10 10       26
## 11 11       27
## 12 12      522
## 13 13       40
## 14 14      606
## 15 15     1125

That’s where fct_recode comes in.

fct_recode

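We will use fct_recode from forcats to replace each integer level with a descriptive label. The new label goes on the left and the existing level on the right. Setting a level to NULL removes it, and any observations in that level become NA. Because no 2014 respondent falls in category 16, fct_recode will warn about an unknown level for that line; the warning is harmless here.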

gss_cat2 <- gss_cat2 %>% mutate(
  relig = fct_recode(relig,
                     NULL = "1",
                     NULL = "2",
                     "Inter-Nondenom" = "3",
                     "Native American" = "4",
                     "Christian" = "5",
                     "Orthodox-Christian" = "6",
                     "Moslem/Islam" = "7",
                     "Other Eastern" = "8",
                     "Hinduism" = "9",
                     "Buddhism" = "10",
                     "Other" = "11",
                     "None" = "12",
                     "Jewish" = "13",
                     "Catholic" = "14",
                     "Protestant" = "15",
                     NULL = "16"))
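To see what recoding to NULL does in isolation, here is a small sketch on a made-up factor. The codes and labels are hypothetical and are not the GSS coding.

```r
library(forcats)

# Made-up integer codes stored as a factor
codes <- factor(c(1, 2, 3, 2, 1))

# New label on the left, existing level on the right;
# NULL drops the level entirely
labelled <- fct_recode(codes,
                       NULL = "1",
                       "Catholic" = "2",
                       "Protestant" = "3")

levels(labelled)        # the "1" level is gone
sum(is.na(labelled))    # observations that were "1" are now NA
```

The level is dropped and the observations that carried it become NA. Back to the GSS data, the recoded variable now plots with readable labels: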

gss_cat2 %>% 
  ggplot(aes(x=relig)) + 
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("Religion") +
  ylab("Count") 

Much better. We can get a table with counts and proportions using the fct_count function.

fct_count

Let’s take a look at religion:

gss_cat2 %>%
  pull(relig) %>%
  fct_count(prop = TRUE) %>%
  kable()
|f                  |    n|         p|
|:------------------|----:|---------:|
|Inter-Nondenom     |    4| 0.0015760|
|Native American    |    2| 0.0007880|
|Christian          |  134| 0.0527975|
|Orthodox-Christian |    9| 0.0035461|
|Moslem/Islam       |    9| 0.0035461|
|Other Eastern      |    3| 0.0011820|
|Hinduism           |   13| 0.0051221|
|Buddhism           |   26| 0.0102443|
|Other              |   27| 0.0106383|
|None               |  522| 0.2056738|
|Jewish             |   40| 0.0157604|
|Catholic           |  606| 0.2387707|
|Protestant         | 1125| 0.4432624|
|NA                 |   18| 0.0070922|

Note that the prop = TRUE argument provides the p column in the table, which is the proportion of cases in each respective category.
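fct_count also has a sort argument. A minimal sketch on a made-up factor, assuming we want the most common level listed first:

```r
library(forcats)

# Made-up responses for illustration
resp <- factor(c("None", "Catholic", "None", "Protestant", "None", "Catholic"))

# sort = TRUE orders the rows by descending count;
# prop = TRUE adds the proportion column p
counts <- fct_count(resp, sort = TRUE, prop = TRUE)
counts
```

Sorting the table this way is handy for spotting small categories that might be worth combining.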

fct_lump

The relig variable has a lot of small categories, which can make analysis difficult. Let’s combine some of these into one category using fct_lump.

gss_cat2 <- gss_cat2 %>% mutate(
  relig_lump = fct_lump(relig, n = 5)
)

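The n argument specifies how many of the most frequent categories to keep. All the rest are combined into a single “Other” category; because relig already has a level with that name, the lumped categories are merged into it.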

gss_cat2 %>% 
  ggplot(aes(x=relig_lump)) + 
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("Religion") +
  ylab("Count")
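If the default "Other" label is not what you want, the other_level argument changes it. The sketch below uses a made-up factor (the pets data is purely illustrative) to show the idea:

```r
library(forcats)

# Made-up data: 5 dogs, 3 cats, 2 fish, 1 bird
pets <- factor(rep(c("dog", "cat", "fish", "bird"), times = c(5, 3, 2, 1)))

# Keep the 2 most frequent levels and give the lumped level a custom name
lumped <- fct_lump(pets, n = 2, other_level = "Other pets")

fct_count(lumped)
```

In forcats 0.5.0 and later, fct_lump_n is the more specific function for lumping by a fixed number of levels, and it accepts the same n and other_level arguments.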

fct_collapse

fct_collapse allows you to collapse factor levels into defined groups. This is very similar to lumping categories, but rather than putting all groups into “Other”, you can create your own groups. Any levels not specified will remain as is.

gss_cat2 <- gss_cat2 %>%
  mutate(
    # NA is not a factor level, so it cannot be collapsed here;
    # missing values simply stay NA
    relig_collapse = fct_collapse(relig,
                         Other = c("Other", 
                                   "Native American",
                                   # must match the label set in fct_recode
                                   "Inter-Nondenom"),
                         Eastern = c("Buddhism", 
                                     "Hinduism", 
                                     "Other Eastern", 
                                     "Moslem/Islam"),
                         Christian = c("Christian", 
                                       "Orthodox-Christian")
))

gss_cat2 %>% 
  ggplot(aes(x=relig_collapse)) + 
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("Religion") +
  ylab("Count")
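Missing values need a separate step, because NA is not a level and fct_collapse only works on existing levels. One possible approach is sketched below on a made-up factor. It assumes forcats 1.0.0 or later, where fct_na_value_to_level is available; older versions offer fct_explicit_na for the same purpose.

```r
library(forcats)

# Made-up data with one missing value
x <- factor(c("Buddhism", "Hinduism", NA, "Catholic", "Hinduism"))

# Collapse two named levels into one group; "Catholic" is left as is
x_collapsed <- fct_collapse(x, Eastern = c("Buddhism", "Hinduism"))
levels(x_collapsed)

# Turn NA values into an explicit level (assumes forcats >= 1.0.0)
x_labelled <- fct_na_value_to_level(x_collapsed, level = "Missing")
fct_count(x_labelled)
```

With an explicit level, the missing responses show up as their own labelled bar in a plot rather than as NA.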

Reorder Levels

Another useful set of functions are those used to reorder levels. This can be important for setting the reference level used in modeling functions that create dummy variables for factors, or for making graphs ordered in the desired manner. forcats has some great functions for this.

We use fct_inorder to order the levels by their first appearance in the data:

gss_cat2 %>%
  mutate(
    relig_inorder = fct_inorder(relig_lump)) %>%
  ggplot(aes(x=relig_inorder)) +
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Religion") +
  ylab("Count")

We use fct_infreq to reorder factors by frequency:

gss_cat2 %>%
  mutate(
    relig_infreq = fct_infreq(relig_lump)) %>%
  ggplot(aes(x=relig_infreq)) +
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Religion") +
  ylab("Count")
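The difference between the two orderings is easiest to see on a tiny example. The factor below is made up for illustration:

```r
library(forcats)

# Made-up factor; base R stores the levels alphabetically: a, b, c
f <- factor(c("b", "a", "c", "b", "c", "c"))

# Order of first appearance in the data
inorder_levels <- levels(fct_inorder(f))
inorder_levels

# Most frequent level first
infreq_levels <- levels(fct_infreq(f))
infreq_levels
```

Only the level order changes in both cases; the underlying values are untouched.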

If none of these functions give the desired order, you can use fct_relevel to reorder factors by hand.

gss_cat2 %>%
  mutate(
    relig_relevel = fct_relevel(relig_lump, c("Catholic", "Jewish"))) %>%
  ggplot(aes(x=relig_relevel)) +
  geom_bar(color = 'black', fill = 'firebrick') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Religion") +
  ylab("Count")

Note that listing just two categories moves them to the front while the others remain in the original order.
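fct_relevel also takes an after argument that controls where the listed levels land. A minimal sketch on a made-up factor:

```r
library(forcats)

# Made-up factor with levels a, b, c
f <- factor(c("a", "b", "c"))

# Default: the listed level moves to the front
front_levels <- levels(fct_relevel(f, "c"))
front_levels

# after = Inf moves the listed level to the end instead
end_levels <- levels(fct_relevel(f, "a", after = Inf))
end_levels
```

Moving a level to the front is the usual way to set the reference category for a model, since modeling functions typically treat the first level as the baseline.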

Hopefully this gave you a good starting point for using the forcats package. It is a very powerful tool for working with factor variables.