Understanding Factors

In this lesson, you will learn how to represent categorical data in R, learn the difference between ordered and unordered factors, and become aware of some of the problems encountered when using factors.

What are factors?

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, with a label associated with each unique integer. While factors look (and often behave) like character vectors, they are integers under the hood, so you need to be careful when treating them like strings.
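The integer storage can be seen directly. The following is a minimal sketch; the vector years is made up for illustration and is not part of the lesson data.

```r
# Hypothetical data: years recorded as text, then turned into a factor
years <- factor(c("2010", "2005", "2010"))

# typeof() reveals the integer storage behind the labels
print(typeof(years))      # "integer"

# Converting directly returns the internal codes, not the years
codes <- as.integer(years)
print(codes)              # 2 1 2

# Going through character first recovers the original values
values <- as.integer(as.character(years))
print(values)             # 2010 2005 2010
```

This is one of the more common surprises with factors: a factor whose labels look like numbers does not convert to those numbers unless it is converted to character first.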

The factor() command

The factor() command is used to create and modify factors in R. Run the following code to generate a short factor: sex <- factor(c('male', 'female', 'female', 'male')).

sex <- factor(c('male', 'female', 'female', 'male'))

We used the combine function (c()) to join four words into a vector. The factor() command then turned that vector into a factor.

Once created, a factor can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order. For instance, the factor sex has 2 levels.

R will assign 1 to the level female and 2 to the level male (because ‘f’ comes before ‘m’, even though the first element in this vector is male). Check this by using the function levels().

levels(sex)

## [1] "female" "male"

Check the number of levels using nlevels().

nlevels(sex)

## [1] 2
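The integer codes behind sex, and the restriction to pre-defined levels, can be checked as follows. This is a sketch only; the name sex_copy and the value "other" are made up for illustration.

```r
sex <- factor(c("male", "female", "female", "male"))

# Integer codes: female = 1, male = 2
print(as.integer(sex))       # 2 1 1 2

# Assigning a value that is not an existing level produces NA
# (R also issues a warning about an invalid factor level)
sex_copy <- sex
sex_copy[1] <- "other"
print(is.na(sex_copy[1]))    # TRUE
print(levels(sex_copy))      # "female" "male"
```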

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels.

Check the factor food loaded with the lesson. Type food.

food

## [1] low    high   medium high   low    medium high  
## Levels: high low medium

You can see a factor with 7 elements. Check the levels of this factor.

levels(food)

## [1] "high"   "low"    "medium"

The first level here is ‘high’, then ‘low’, following the alphabetical rule above. To put these levels into a sensible order (low < medium < high), we need to pass two more arguments to the factor() function.

Ordered factors

The function factor() takes two more arguments. The levels = argument tells R the levels of the factor. The ordered = argument tells R whether the levels are ordered (= TRUE) or not (= FALSE). The order follows the order in which the levels are written in the levels = argument.

Modify the food variable to make it an ordered factor, with levels ordered low < medium < high.

food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)

In R’s memory, these levels are represented by numbers (1, 2, 3). Factors are better than simple integer labels because they are self-describing: ‘low’, ‘medium’, and ‘high’ are more descriptive than ‘1’, ‘2’, ‘3’. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels.
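Once the levels are ordered, comparisons between them become possible. The sketch below rebuilds food from the values printed earlier in the lesson, so that it runs on its own.

```r
# Rebuild the food factor (values copied from the printed output above)
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)

print(food)                  # levels now print as low < medium < high
print(as.integer(food))      # 1 3 2 3 1 2 3

# Comparisons work because the levels are ordered
print(food[1] < food[2])     # TRUE: low < high
print(food >= "medium")      # FALSE TRUE TRUE TRUE FALSE TRUE TRUE
print(max(food))             # high
```

With an unordered factor, the same comparisons return NA with a warning, because R has no ranking of the levels to work from.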

You have a vector representing levels of exercise undertaken by 5 subjects: ‘l’, ‘n’, ‘n’, ‘i’, ‘l’, where n = none, l = light, and i = intense. What is the best way to represent this in R?

exercise <- c('l', 'n', 'n', 'i', 'l')
exercise <- factor(c('l', 'n', 'n', 'i', 'l'), ordered = TRUE)
exercise <- factor(c('l', 'n', 'n', 'i', 'l'), levels = c('n', 'l', 'i'), ordered = FALSE)
exercise <- factor(c('l', 'n', 'n', 'i', 'l'), levels = c('n', 'l', 'i'), ordered = TRUE)

exercise <- factor(c('l', 'n', 'n', 'i', 'l'), levels = c('n', 'l', 'i'), ordered = TRUE)

Working with factors

The function table() tabulates observations, and its output can be used to create bar plots quickly. Run table() on the data Group included with this lesson.

table(Group)

## Group
##    Control Treatment1 Treatment2 
##         29         35         35

We can use the function barplot() to easily display this table. Embed the previous code within a call to barplot().

barplot(table(Group))

Use the factor() command to modify the factor Group so that the control group is plotted last. Do this within the call to barplot.

barplot( table(factor(Group, levels = c('Treatment1', 'Treatment2', 'Control'), ordered = TRUE)) )
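The effect of reordering can be checked without plotting, since table() lists counts in level order. This sketch uses a stand-in for Group built from the counts shown above; Group_reordered is a name introduced here for illustration. It leaves out ordered = TRUE, on the assumption that only the display order matters here.

```r
# Stand-in for the lesson's Group data, built from the tabulated counts
Group <- factor(rep(c("Control", "Treatment1", "Treatment2"),
                    times = c(29, 35, 35)))

# Reorder the levels without changing any of the observations
Group_reordered <- factor(Group,
                          levels = c("Treatment1", "Treatment2", "Control"))

print(table(Group))            # Control listed first
print(table(Group_reordered))  # Control listed last, same counts
```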

Removing Levels from a Factor

Look at the factor Gender. Some of the Gender values in our dataset have been coded incorrectly. Let’s remove some levels.

First tabulate the factor, using table().

table(Gender)

## Gender
##  f  F  m  M 
## 35  4 46 15

You can see that there are four levels here, when there should be only two. Values should have been recorded as lowercase ‘m’ and ‘f’. We can easily correct this.

We can use indexing (see Subsetting Vectors lesson, and Indexing lesson in Unit 2) to select the wrongly-coded elements, and change them. Type: Gender[Gender == 'M'] <- 'm'. This code looks for all the elements ‘M’ and changes them to ‘m’.

Gender[Gender == 'M'] <- 'm'

Let’s plot some of the results of the experiment. Plot a boxplot of the variable BloodPressure as a function of the variable Gender that we have just been working on. In this case, you need to construct a formula: put the continuous variable on the left, separated from the categorical variable by a tilde (~).

boxplot(BloodPressure ~ Gender)

What is wrong with this figure? It still shows the level M, even though we recoded all the elements. This situation highlights the fact that R stores the data as integers and keeps the level labels separately. R still has the level M in its memory.
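The leftover level can be seen without a plot. The sketch below uses a small made-up stand-in for Gender, not the lesson's actual data.

```r
# Small made-up stand-in for the lesson's Gender data
Gender <- factor(c("f", "F", "m", "M", "m", "f"))

Gender[Gender == "M"] <- "m"

# No element is "M" any more, but the level is still registered
print(any(Gender == "M"))   # FALSE
print(levels(Gender))       # still includes "M"
print(table(Gender))        # M has a count of 0
```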

We can remove unused levels with the function droplevels(). Try that on the Gender variable.

Gender <- droplevels(Gender)

Now, make another boxplot.

boxplot(BloodPressure ~ Gender)

Hurrah! We successfully removed the unused level M.
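As an aside, the recoding and the removal of the level can also be done in one step by renaming the level itself. This is an alternative to the approach used in this lesson, sketched on made-up data.

```r
# Small made-up stand-in for the lesson's Gender data
Gender <- factor(c("f", "F", "m", "M", "m", "f"))

# Renaming a level to an existing label merges the two levels,
# so no unused level is left behind
levels(Gender)[levels(Gender) == "M"] <- "m"

print(levels(Gender))   # "M" is gone
print(table(Gender))
```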

Nice work. Now you have some experience working with factors.

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.

That would be lovely