Missing Values

‘The best solution to handle missing data is to have none’ -R.A. Fisher

Missing values play an important role in statistics and data analysis. They often should not be ignored; instead, they should be studied carefully to see whether there is an underlying pattern or cause for their missingness.

Different software, people, and disciplines have different traditions for how missing values are represented and how they are handled during statistical analysis. Commonly used values include -999, NA, NAN, and 0, and sometimes the data are just left blank!

How you choose to represent missing data can have profound consequences for your results and conclusions.

However, for now, we will concern ourselves only with the practical aspects. In R, NA is used to represent any value that is ‘not available’ or ‘missing’ (in the statistical sense). In this lesson, we’ll explore missing values further.

Any operation involving NA generally yields NA as the result. To illustrate, create a vector c(44, NA, 5, NA) and assign it to a variable x.

x <- c(44, NA, 5, NA)

Now, multiply x by 3.

x * 3
## [1] 132  NA  15  NA

Notice that the elements of the resulting vector that correspond with the NA values in x are also NA.
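
The same propagation applies to other arithmetic and to summary functions. Here is a small sketch that reuses the values of x from above. The na.rm argument in the last call is not part of this lesson and is shown only as an aside.

```r
x <- c(44, NA, 5, NA)

# Arithmetic with NA gives NA in the matching positions
x + 1
## [1] 45 NA  6 NA

# A summary over a vector that contains NA is itself NA
sum(x)
## [1] NA

# Many summary functions accept na.rm = TRUE to drop NAs first
sum(x, na.rm = TRUE)
## [1] 49
```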

To make things a little more interesting, let’s create a vector containing 100 draws from a standard normal distribution with y <- rnorm(100). The function rnorm() generates random numbers from a normal distribution.

y <- rnorm(100)
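
Because rnorm() draws random numbers, your values (and the outputs that follow) will differ from those shown here. If you want repeatable draws, one option is set.seed(). The sketch below uses the hypothetical names a and b and an arbitrary seed of 42 so that it does not touch y.

```r
# set.seed() fixes the starting point of the random number generator,
# so the same seed yields the same draws. The seed 42 is arbitrary.
set.seed(42)
a <- rnorm(5)

set.seed(42)
b <- rnorm(5)

identical(a, b)
## [1] TRUE
```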

Next, convert all the negative numbers to NA. Remember that we can use logical indexing to do this. Do not put NA in quotes; otherwise R will treat it as a text string and not as a missing value.

y[y < 0] <- NA
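
To see what the logical index is doing, and why the quotes matter, here is a minimal sketch on a short vector. The names v and w and their values are made up for illustration.

```r
v <- c(-1.5, 2, -0.3, 4)   # hypothetical values

# v < 0 is a logical vector; TRUE marks the elements to replace
v < 0
## [1]  TRUE FALSE  TRUE FALSE

v[v < 0] <- NA
v
## [1] NA  2 NA  4

# Assigning the quoted string "NA" coerces the whole vector to character,
# and is.na() no longer sees anything as missing
w <- c(-1.5, 2, -0.3, 4)
w[w < 0] <- "NA"
class(w)
## [1] "character"
is.na(w)
## [1] FALSE FALSE FALSE FALSE
```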

Let’s first ask the question of where our NAs are located in our data. The is.na() function tells us whether each element of a vector is NA. Call is.na() on y and assign the result to ‘my_na’.

my_na <- is.na(y)

Now, print my_na to see what you came up with.

my_na
##   [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [12] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
##  [23]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [34]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
##  [45]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [56] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [67]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [78]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
##  [89]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [100]  TRUE

Everywhere you see a TRUE, you know the corresponding element of y is NA. Likewise, everywhere you see a FALSE, you know the corresponding element of y is one of our positive random draws from the standard normal distribution.

In our previous discussion of logical operators, we introduced the == operator as a method of testing for equality between two objects. So, you might think the expression y == NA yields the same results as is.na(). Give it a try.

y == NA
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [70] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [93] NA NA NA NA NA NA NA NA

The reason you got a vector of all NAs is that NA is not really a value, but just a placeholder for a quantity that is not available. Therefore the logical expression is incomplete and R has no choice but to return a vector of the same length as y that contains all NAs.

Don’t worry if that’s a little confusing. The key takeaway is to be cautious when using logical expressions anytime NAs might creep in, since a single NA value can derail the entire thing.
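
A short sketch of this behavior on a small vector may help. The name z and its values are hypothetical, and which() is shown only as one convenient way to get positions.

```r
# Comparing anything with NA gives NA, even NA with itself
NA == NA
## [1] NA
5 == NA
## [1] NA

z <- c(3, NA, 7)   # hypothetical values

# A single NA carries through an ordinary comparison
z > 5
## [1] FALSE    NA  TRUE

# is.na() is the reliable test for missingness
is.na(z)
## [1] FALSE  TRUE FALSE

# which() converts the logical vector into positions
which(is.na(z))
## [1] 2
```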

So, back to the task at hand. Now that we have a vector, my_na, that has a TRUE for every NA and FALSE for every numeric value, we can compute the total number of NAs in our data.

The trick is to recognize that underneath the surface, R represents TRUE as the number 1 and FALSE as the number 0. Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs.
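
The coercion described above can be checked directly on a small logical vector. The name flags and its values are hypothetical. The mean() call is an extra: averaging 1s and 0s gives the proportion of TRUEs.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical values

as.numeric(flags)   # TRUE becomes 1, FALSE becomes 0
## [1] 1 0 1 1

sum(flags)          # number of TRUEs
## [1] 3

mean(flags)         # proportion of TRUEs
## [1] 0.75
```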

Let’s give that a try here. Call the sum() function on my_na to count the total number of TRUEs in my_na, and thus the total number of NAs in y. Don’t assign the result to a new variable.

sum(my_na)
## [1] 55

Pretty cool, huh? Finally, let’s take a look at the data to convince ourselves that everything ‘adds up’. Print y to the console.

y

Now that we’ve got NAs down pat, let’s look at a second type of missing value – NaN, which stands for ‘not a number’. To generate NaN, try dividing (using a forward slash) 0 by 0 now.

0/0
## [1] NaN

Let’s do one more, just for fun. In R, Inf stands for infinity. What happens if you subtract Inf from Inf?

Inf - Inf
## [1] NaN
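
NA and NaN are related but not interchangeable. As a brief sketch using the base functions is.na() and is.nan(): is.na() treats NaN as missing, while is.nan() is TRUE only for NaN.

```r
is.nan(0 / 0)
## [1] TRUE

is.na(0 / 0)    # NaN also counts as missing for is.na()
## [1] TRUE

is.nan(NA)      # but NA is not NaN
## [1] FALSE
```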
