In this lesson, you will learn how to represent categorical data in R, learn the difference between ordered and unordered factors, and become aware of some of the problems encountered when using factors.
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are actually stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
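As a minimal sketch of what "integers under the hood" means in practice, the example below uses a made-up vector called sizes (an assumption for illustration, not part of the lesson data). It shows the integer codes behind a factor and why calling as.numeric() directly on a number-like factor returns the codes rather than the original values.

```r
# Hypothetical example data (not part of the lesson): sizes recorded as text
sizes <- factor(c("10", "5", "20", "5"))

# The labels R stores; they are sorted as text, not as numbers
size_levels <- levels(sizes)

# The integer codes that actually make up the factor
size_codes <- as.integer(sizes)

# Pitfall: as.numeric() on a factor returns the codes, not the values
wrong_values <- as.numeric(sizes)

# Converting to character first recovers the original numbers
right_values <- as.numeric(as.character(sizes))

print(size_levels)
print(size_codes)
print(wrong_values)
print(right_values)
```

Going through as.character() first is the usual safeguard whenever a factor's labels are meant to be read as numbers.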
The factor() function is used to create and modify factors in R. Run the following code to generate a short factor:
sex <- factor(c('male', 'female', 'female', 'male'))
We used the concatenate function, c(), to stick four words together. These form the factor.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, the factor we just created has 2 levels.
R will assign 1 to the level female and 2 to the level male (because ‘f’ comes before ‘m’, even though the first element in this vector is male). Check this by using the function levels() on sex.
## [1] "female" "male"
Check the number of levels using nlevels().
## [1] 2
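The alphabetical rule can be checked directly by looking at the integer codes. The sketch below rebuilds the sex factor from above and then, as an illustrative variation that the lesson itself does not run, passes an explicit levels argument (the name sex_male_first is hypothetical) to show how the default order can be overridden.

```r
# Same factor as in the lesson
sex <- factor(c('male', 'female', 'female', 'male'))

# Levels are sorted alphabetically, regardless of the order of appearance
print(levels(sex))
print(nlevels(sex))

# 'female' is level 1 and 'male' is level 2
sex_codes <- as.integer(sex)
print(sex_codes)

# Illustrative variation: list the levels explicitly to put 'male' first
sex_male_first <- factor(c('male', 'female', 'female', 'male'),
                         levels = c('male', 'female'))
male_first_codes <- as.integer(sex_male_first)
print(male_first_codes)
```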
Sometimes the order of the levels does not matter. Other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels.
Check the factor food loaded with the lesson. Type food.
## [1] low    high   medium high   low    medium high
## Levels: high low medium
You can see a factor with 7 elements. Check the levels of this factor.
## [1] "high"   "low"    "medium"
The first level here is ‘high’, then ‘low’, following the alphabetical rule above. In order to put these levels into a sensible order (low < medium < high), we need to pass two more arguments to factor(). The levels = argument tells R the different levels in the factor. The ordered = argument tells R whether these levels are ordered (= TRUE) or not (= FALSE). The order will follow the order in which the levels are written in the levels = argument.
Modify the food variable to make it an ordered factor, with levels ordered low < medium < high.
food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
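A short sketch of what the ordering makes possible. Because the lesson's food data is not available outside the lesson, the vector here is rebuilt from the printed output shown earlier, which is an assumption about its contents.

```r
# Rebuilt from the printed output earlier in the lesson (an assumption)
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)

# Levels now follow the order given in the levels argument
print(levels(food))

# Comparisons are only meaningful for ordered factors
first_less_than_second <- food[1] < food[2]   # compares "low" with "high"
print(first_less_than_second)

# The highest level present in the data
highest <- max(food)
print(highest)

# Keep only the elements that are at least "medium"
at_least_medium <- food[food >= "medium"]
print(at_least_medium)
```

With an unordered factor, the same comparisons return NA with a warning, which is why ordered = TRUE matters here.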
In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than simple integer labels because factors are self-describing: ‘low’, ‘medium’, and ‘high’ are more descriptive than ‘1’, ‘2’, ‘3’. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels.
You have a vector representing levels of exercise undertaken by 5 subjects: ‘l’, ‘n’, ‘n’, ‘i’, ‘l’, where n = none, l = light, and i = intense. What is the best way to represent this in R?
exercise <- factor(c('l', 'n', 'n', 'i', 'l'), levels = c('n', 'l', 'i'), ordered = TRUE)
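As an optional variation on the answer above, factor() also accepts a labels argument, which replaces the one-letter codes with descriptive names. The sketch below is one possible way to do this, not the answer the lesson expects; the name exercise_labelled is hypothetical.

```r
# Same data and level order as the answer above, with descriptive labels
exercise_labelled <- factor(c('l', 'n', 'n', 'i', 'l'),
                            levels = c('n', 'l', 'i'),
                            labels = c('none', 'light', 'intense'),
                            ordered = TRUE)
print(exercise_labelled)

# Counts per level, reported in the order none < light < intense
exercise_counts <- table(exercise_labelled)
print(exercise_counts)

# Which subjects did more than no exercise?
did_some <- exercise_labelled > 'none'
print(did_some)
```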
table() tabulates observations and can be used to create bar plots quickly. Run table() on the data Group included with this lesson.
## Group
##    Control Treatment1 Treatment2
##         29         35         35
We can use the function barplot() to easily display this table. Embed the previous code within a call to barplot().
Use the factor() command to modify the factor Group so that the control group is plotted last. Do this within the call to barplot().
barplot(table(factor(Group, levels = c('Treatment1', 'Treatment2', 'Control'), ordered = TRUE)))
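The effect of reordering can be inspected without drawing anything, because barplot() draws the bars in the order of the table it is given. The sketch below rebuilds a stand-in for Group from the counts printed above (an assumption, since the real data ships with the lesson). It leaves out ordered = TRUE, because the order of the levels alone determines the order of the table; the name Group_reordered is hypothetical.

```r
# Stand-in for the lesson's Group data, rebuilt from the printed counts
Group <- factor(rep(c("Control", "Treatment1", "Treatment2"),
                    times = c(29, 35, 35)))

# Default order: alphabetical, so Control comes first
counts_default <- table(Group)
print(counts_default)

# Reordered levels: Control moves to the end
Group_reordered <- factor(Group,
                          levels = c("Treatment1", "Treatment2", "Control"))
counts_reordered <- table(Group_reordered)
print(counts_reordered)
```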
Look at the factor Gender. Some of the Gender values in our dataset have been coded incorrectly. Let’s remove some levels.
First tabulate the factor, using table().
## Gender
##  f  F  m  M
## 35  4 46 15
You can see that there are four levels here, when there should be only two. Values should have been recorded as lowercase ‘m’ & ‘f’. We can easily correct this.
We can use indexing (see Subsetting Vectors lesson, and Indexing lesson in Unit 2) to select the wrongly-coded elements and change them. The following code looks for all the elements ‘M’ and changes them to ‘m’:
Gender[Gender == 'M'] <- 'm'
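This replacement works because ‘m’ is already one of the levels. The sketch below, on a small made-up vector called gender_demo (not the lesson's dataset), shows what happens when the new value is not an existing level, and one possible way around it.

```r
# Hypothetical miniature of the Gender data (not the lesson's dataset)
gender_demo <- factor(c("m", "f", "M", "f"))

# Works: "m" is already one of the levels
gender_demo[gender_demo == "M"] <- "m"
print(as.character(gender_demo))

# Pitfall: "male" is not an existing level, so R warns and stores NA
broken <- gender_demo
broken[1] <- "male"
print(is.na(broken[1]))

# One way around it: add the new level before assigning the value
fixed <- gender_demo
levels(fixed) <- c(levels(fixed), "male")
fixed[1] <- "male"
print(as.character(fixed[1]))
```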
Let’s plot some of the results of the experiment. Make a boxplot of the variable BloodPressure as a function of the variable Gender that we have just been working on. In this case, you need to construct a formula: put the continuous variable on the left, separated from the categorical variable by a tilde (~).
boxplot(BloodPressure ~ Gender)
What is wrong with this figure? It still shows the level M, even though we recoded all the ‘M’ elements. This situation highlights the fact that R keeps the data as integers and the level labels separately. R still has the level M in its memory.
We can remove unused levels with the function droplevels(). Try that on the Gender variable.
Gender <- droplevels(Gender)
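The recode-then-drop sequence can be traced on a small made-up vector (an assumption for illustration, not the lesson's data). The last step goes beyond what the lesson does: it renames the level ‘F’ to ‘f’, relying on the fact that R merges levels that end up with the same name.

```r
# Hypothetical miniature of the Gender data
gender_demo <- factor(c("f", "F", "m", "M", "m"))

# Recode the elements; the level "M" is now unused but still present
gender_demo[gender_demo == "M"] <- "m"
m_still_listed <- "M" %in% levels(gender_demo)
print(m_still_listed)

# droplevels() removes levels that no element uses
gender_clean <- droplevels(gender_demo)
print(levels(gender_clean))

# Not part of the lesson: renaming "F" to "f" merges the two levels
gender_merged <- gender_clean
levels(gender_merged)[levels(gender_merged) == "F"] <- "f"
print(table(gender_merged))
```

Renaming a level changes every element that uses it in one step, so no droplevels() call is needed afterwards for that level.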
Now, make another boxplot.
boxplot(BloodPressure ~ Gender)
Hurrah! We successfully removed the unused level M.
Nice work. Now you have some experience working with factors.