Matrices and Data Frames

In this lesson, we’ll cover matrices and data frames. Both represent ‘rectangular’ data types, meaning that they are used to store tabular data, with rows and columns. You’ll also learn some more tools for looking at your data objects in R.

The main difference between matrices and data frames, as you’ll see, is that matrices can only contain a single class of data, while data frames can consist of many different classes of data.

Let’s create a vector containing the numbers 1 through 20 using the : operator. Store the result in a variable called my_vector.

my_vector <- 1:20

View the contents of the vector you just created.

my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

The dim() function tells us the ‘dimensions’ of an object. What happens if we do dim(my_vector)? Give it a try.

dim(my_vector)
## NULL

Clearly, that’s not very helpful! Since my_vector is a vector, it doesn’t have a dim attribute (so it’s just NULL), but we can find its length using the length() function. Try that now.

length(my_vector)
## [1] 20

Ah! That’s what we wanted. But, what happens if we give my_vector a dim attribute? Let’s give it a try. Type dim(my_vector) <- c(4, 5).

dim(my_vector) <- c(4, 5)

It’s okay if that last command seemed a little strange to you. It should! The dim() function allows you to get OR set the dim attribute for an R object. In this case, we assigned the value c(4, 5) to the dim attribute of my_vector.
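As a small sketch of this get-or-set behaviour (the variables v and w are made up for illustration and are separate from my_vector), the replacement form of dim() should have the same effect as setting the "dim" attribute directly with attr():

```r
# Hypothetical vector, separate from my_vector in the lesson
v <- 1:6

# Set the dim attribute with the replacement form of dim()
dim(v) <- c(2, 3)

# Read it back two ways: with dim() and with attr()
dim(v)
attr(v, "dim")

# Do the same thing on a second copy using attr() directly
w <- 1:6
attr(w, "dim") <- c(2, 3)

# Both objects now hold the same values and the same dim attribute
identical(v, w)
```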

Use dim(my_vector) to confirm that we’ve set the dim attribute correctly.

dim(my_vector)
## [1] 4 5

Another way to see this is by calling the attributes() function on my_vector. Try it now.

attributes(my_vector)
## $dim
## [1] 4 5

Just like in math class, when dealing with a 2-dimensional object (think rectangular table), the first number is the number of rows and the second is the number of columns. Therefore, we just gave my_vector 4 rows and 5 columns.

But, wait! That doesn’t sound like a vector any more. Well, it’s not. Now it’s a matrix. View the contents of my_vector now to see what it looks like.

my_vector
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

Now, let’s confirm it’s actually a matrix by using another useful function, class(). Type class(my_vector) to see what I mean.

class(my_vector)
## [1] "matrix"

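Sure enough, my_vector is now a matrix. (In R 4.0.0 and later, class() reports "matrix" "array" here, because every matrix is also an array.) We should store it in a new variable that helps us remember what it is. Store the value of my_vector in a new variable called my_matrix.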

my_matrix <- my_vector

The example that we’ve used so far was meant to illustrate the point that a matrix is simply an atomic vector with a dimension attribute. A more direct method of creating the same matrix uses the matrix() function.
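To see that idea from the other direction, here is a minimal sketch using a made-up object m: taking the dim attribute away again should leave us with the plain vector we started from.

```r
# Hypothetical object for illustration
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)        # TRUE while the dim attribute is present

# Removing the dim attribute turns m back into an ordinary vector
dim(m) <- NULL
is.matrix(m)        # FALSE
identical(m, 1:6)   # TRUE
```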

Bring up the help file for the matrix() function now using the ? function.

?matrix

Now, look at the documentation for the matrix function and see if you can figure out how to create a matrix containing the same numbers (1-20) and dimensions (4 rows, 5 columns) by calling the matrix() function. Store the result in a variable called my_matrix2.

my_matrix2 <- matrix(1:20, nrow=4, ncol=5)
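By default, matrix() fills the matrix one column at a time, which matches what happened when we set the dim attribute by hand. The sketch below (by_col and by_row are names made up for this example) shows the byrow argument from the help file, which fills row by row instead:

```r
# Default: values are filled down the columns
by_col <- matrix(1:20, nrow = 4, ncol = 5)

# byrow = TRUE: values are filled across the rows
by_row <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)

by_col[1, ]   # first row: 1 5 9 13 17
by_row[1, ]   # first row: 1 2 3 4 5
```

Both versions have the same dimensions; only the placement of the values differs.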

In addition to dim(), we can also check the specific number of rows or columns in our 2-dimensional object with either nrow() or ncol(). Check the number of rows of my_matrix2 now to make sure it’s 4.

nrow(my_matrix2)
## [1] 4

Once you start working with large datasets, nrow() becomes a very useful function for identifying how many observations you have. Now check the number of columns with ncol().

ncol(my_matrix2)
## [1] 5
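One caveat worth a quick sketch: like dim(), nrow() and ncol() rely on the dim attribute, so they return NULL for a plain vector. Base R also provides NROW() and NCOL(), which treat a vector as a single column. The object plain below is made up for illustration.

```r
# A plain vector has no dim attribute
plain <- 1:20

nrow(plain)   # NULL
ncol(plain)   # NULL

# The upper-case versions treat a vector as a one-column matrix
NROW(plain)   # 20
NCOL(plain)   # 1
```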

Finally, let’s confirm that my_matrix and my_matrix2 are actually identical. The identical() function will tell us if its first two arguments are the same. Try it out.

identical(my_matrix, my_matrix2)
## [1] TRUE
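Keep in mind that identical() is strict: it compares types and attributes as well as values. A small sketch with made-up vectors shows the difference between identical() and the more forgiving all.equal():

```r
# 1:3 is stored as integer, c(1, 2, 3) as double
ints <- 1:3
dbls <- c(1, 2, 3)

identical(ints, dbls)          # FALSE, because the storage types differ
isTRUE(all.equal(ints, dbls))  # TRUE, because the values are equal

# my_matrix and my_matrix2 were both built from 1:20,
# so they share the same type as well as the same values
```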

Now, imagine that the numbers in our table represent some measurements from a clinical experiment, where each row represents one patient and each column represents one variable for which measurements were taken.

We may want to label the rows, so that we know which numbers belong to each patient in the experiment. One way to do this is to add a column to the matrix, which contains the names of all four people.

Let’s start by creating a character vector containing the names of our patients – Bill, Gina, Kelly, and Sean. Remember that double quotes tell R that something is a character string. Store the result in a variable called patients.

patients <- c("Bill", "Gina", "Kelly", "Sean")

Now we’ll use the cbind() function to ‘combine columns’. Don’t worry about storing the result in a new variable. Just call cbind() with two arguments – the patients vector and my_matrix.

cbind(patients, my_matrix)
##      patients                       
## [1,] "Bill"   "1" "5" "9"  "13" "17"
## [2,] "Gina"   "2" "6" "10" "14" "18"
## [3,] "Kelly"  "3" "7" "11" "15" "19"
## [4,] "Sean"   "4" "8" "12" "16" "20"

Something is fishy about our result! It appears that combining the character vector with our matrix of numbers caused everything to be enclosed in double quotes. This means we’re left with a matrix of character strings, which is no good.

If you remember back to the beginning of this lesson, I told you that matrices can only contain ONE class of data. Therefore, when we tried to combine a character vector with a numeric matrix, R was forced to ‘coerce’ the numbers to characters, hence the double quotes.

This is called ‘implicit coercion’, because we didn’t ask for it. It just happened. But why didn’t R just convert the names of our patients to numbers? I’ll let you ponder that question on your own.
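If you’d like a hint for that question, here is one way to explore it. The objects below are made up for illustration. Any number can be written as a character string, but most strings have no sensible numeric value:

```r
# Mixing numbers and strings in one vector coerces everything to character
mixed <- c(1, "Bill")
class(mixed)        # "character"

# The same thing happens when cbind() builds a matrix
combined <- cbind(c("Bill", "Gina"), matrix(1:4, nrow = 2))
typeof(combined)    # "character"

# Going the other way only works for strings that look like numbers
as.numeric("13")                        # 13
suppressWarnings(as.numeric("Bill"))    # NA
```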

So, we’re still left with the question of how to include the names of our patients in the table without destroying the integrity of our numeric data. Try the following – my_data <- data.frame(patients, my_matrix)

my_data <- data.frame(patients, my_matrix)

Now view the contents of my_data to see what we’ve come up with.

my_data
##   patients X1 X2 X3 X4 X5
## 1     Bill  1  5  9 13 17
## 2     Gina  2  6 10 14 18
## 3    Kelly  3  7 11 15 19
## 4     Sean  4  8 12 16 20

It looks like the data.frame() function allowed us to store our character vector of names right alongside our matrix of numbers. That’s exactly what we were hoping for!

Behind the scenes, the data.frame() function takes any number of arguments and returns a single object of class data.frame that is composed of the original objects.

Let’s confirm this by calling the class() function on our newly created data frame.

class(my_data)
## [1] "data.frame"
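As a further sketch, you can also name the columns right away by passing named arguments to data.frame(). The ages below are invented for the example. Setting stringsAsFactors = FALSE explicitly keeps the names as character strings in every version of R (before R 4.0.0, the default was to convert strings to factors).

```r
# Made-up values, for illustration only
df <- data.frame(
  patient = c("Bill", "Gina"),
  age = c(41L, 35L),
  stringsAsFactors = FALSE
)

# Each column keeps its own class, unlike the columns of a matrix
sapply(df, class)
```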

If we don’t want to create a brand new data frame, we can also convert our original matrix into one to accomplish the same thing. Create a new object called my_data2 using the function as.data.frame() on my_matrix.

my_data2 <- as.data.frame(my_matrix)

Now try to cbind() patients with my_data2. You don’t need to assign the result to a new object for now.

cbind(my_data2, patients)
##   V1 V2 V3 V4 V5 patients
## 1  1  5  9 13 17     Bill
## 2  2  6 10 14 18     Gina
## 3  3  7 11 15 19    Kelly
## 4  4  8 12 16 20     Sean

It worked! We converted our matrix to a data frame, which let us include a different data type in our object. Note that when we converted our matrix, R also automatically assigned row and column names to our new data frame.

It’s also possible to assign names to the individual rows and columns of a data frame, which presents another possible way of determining which row of values in our table belongs to each patient.

However, since we’ve already solved that problem, let’s solve a different problem by assigning names to the columns of our data frame so that we know what type of measurement each column represents.

Since we have six columns (including patient names), we’ll need to first create a vector containing one element for each column. Create a character vector called cnames that contains the following values (in order) – ‘patient’, ‘age’, ‘weight’, ‘bp’, ‘rating’, ‘test’.

cnames <- c("patient", "age", "weight", "bp", "rating", "test")

Now, use the colnames() function to set the column names for our original data frame, my_data. This is similar to the way we used the dim() function earlier in this lesson.

colnames(my_data) <- cnames

Let’s see if that got the job done. Print the contents of my_data.

my_data
##   patient age weight bp rating test
## 1    Bill   1      5  9     13   17
## 2    Gina   2      6 10     14   18
## 3   Kelly   3      7 11     15   19
## 4    Sean   4      8 12     16   20
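The other approach mentioned earlier, naming the rows, can be sketched in the same way with rownames(). The data frame and its values below are made up for illustration.

```r
# Made-up measurements for two patients
df <- data.frame(age = c(41, 35), weight = c(80, 62))

# Label each row with the patient's name
rownames(df) <- c("Bill", "Gina")

# Rows (and single cells) can now be looked up by name
df["Gina", ]
df["Gina", "weight"]   # 62
```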

Now that you’ve made your own data frame, we are going to load one in for you so we can see what a real data frame might look like and explore it a bit.
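If you are following along outside of this lesson, the UScereal data set ships with the MASS package, so a sketch like the following should make it available (this assumes MASS is installed, which is normally the case for a standard R installation):

```r
# Load the UScereal data frame from the MASS package
data(UScereal, package = "MASS")

# Rows and columns of the data frame
dim(UScereal)
```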

Go ahead and look at a new object, UScereal, now.

UScereal
##                                       mfr  calories    protein       fat
## 100% Bran                               N 212.12121 12.1212121 3.0303030
## All-Bran                                K 212.12121 12.1212121 3.0303030
## All-Bran with Extra Fiber               K 100.00000  8.0000000 0.0000000
## Apple Cinnamon Cheerios                 G 146.66667  2.6666667 2.6666667
## Apple Jacks                             K 110.00000  2.0000000 0.0000000
## Basic 4                                 G 173.33333  4.0000000 2.6666667
## Bran Chex                               R 134.32836  2.9850746 1.4925373
## Bran Flakes                             P 134.32836  4.4776119 0.0000000
## Cap'n'Crunch                            Q 160.00000  1.3333333 2.6666667
## Cheerios                                G  88.00000  4.8000000 1.6000000
## Cinnamon Toast Crunch                   G 160.00000  1.3333333 4.0000000
## Clusters                                G 220.00000  6.0000000 4.0000000
## Cocoa Puffs                             G 110.00000  1.0000000 1.0000000
## Corn Chex                               R 110.00000  2.0000000 0.0000000
## Corn Flakes                             K 100.00000  2.0000000 0.0000000
## Corn Pops                               K 110.00000  1.0000000 0.0000000
## Count Chocula                           G 110.00000  1.0000000 1.0000000
## Cracklin' Oat Bran                      K 220.00000  6.0000000 6.0000000
## Crispix                                 K 110.00000  2.0000000 0.0000000
## Crispy Wheat & Raisins                  G 133.33333  2.6666667 1.3333333
## Double Chex                             R 133.33333  2.6666667 0.0000000
## Froot Loops                             K 110.00000  2.0000000 1.0000000
## Frosted Flakes                          K 146.66667  1.3333333 0.0000000
## Frosted Mini-Wheats                     K 125.00000  3.7500000 0.0000000
## Fruit & Fibre: Dates Walnuts and Oats   P 179.10448  4.4776119 2.9850746
## Fruitful Bran                           K 179.10448  4.4776119 0.0000000
## Fruity Pebbles                          P 146.66667  1.3333333 1.3333333
## Golden Crisp                            P 113.63636  2.2727273 0.0000000
## Golden Grahams                          G 146.66667  1.3333333 1.3333333
## Grape Nuts Flakes                       P 113.63636  3.4090909 1.1363636
## Grape-Nuts                              P 440.00000 12.0000000 0.0000000
## Great Grains Pecan                      P 363.63636  9.0909091 9.0909091
## Honey Graham Ohs                        Q 120.00000  1.0000000 2.0000000
## Honey Nut Cheerios                      G 146.66667  4.0000000 1.3333333
## Honey-comb                              P  82.70677  0.7518797 0.0000000
## Just Right Fruit & Nut                  K 186.66667  4.0000000 1.3333333
## Kix                                     G  73.33333  1.3333333 0.6666667
## Life                                    Q 149.25373  5.9701493 2.9850746
## Lucky Charms                            G 110.00000  2.0000000 1.0000000
## Mueslix Crispy Blend                    K 238.80597  4.4776119 2.9850746
## Multi-Grain Cheerios                    G 100.00000  2.0000000 1.0000000
## Nut&Honey Crunch                        K 179.10448  2.9850746 1.4925373
## Nutri-Grain Almond-Raisin               K 208.95522  4.4776119 2.9850746
## Oatmeal Raisin Crisp                    G 260.00000  6.0000000 4.0000000
## Post Nat. Raisin Bran                   P 179.10448  4.4776119 1.4925373
## Product 19                              K 100.00000  3.0000000 0.0000000
## Puffed Rice                             Q  50.00000  1.0000000 0.0000000
## Quaker Oat Squares                      Q 200.00000  8.0000000 2.0000000
## Raisin Bran                             K 160.00000  4.0000000 1.3333333
## Raisin Nut Bran                         G 200.00000  6.0000000 4.0000000
## Raisin Squares                          K 180.00000  4.0000000 0.0000000
## Rice Chex                               R  97.34513  0.8849558 0.0000000
## Rice Krispies                           K 110.00000  2.0000000 0.0000000
## Shredded Wheat 'n'Bran                  N 134.32836  4.4776119 0.0000000
## Shredded Wheat spoon size               N 134.32836  4.4776119 0.0000000
## Smacks                                  K 146.66667  2.6666667 1.3333333
## Special K                               K 110.00000  6.0000000 0.0000000
## Total Corn Flakes                       G 110.00000  2.0000000 1.0000000
## Total Raisin Bran                       G 140.00000  3.0000000 1.0000000
## Total Whole Grain                       G 100.00000  3.0000000 1.0000000
## Triples                                 G 146.66667  2.6666667 1.3333333
## Trix                                    G 110.00000  1.0000000 1.0000000
## Wheat Chex                              R 149.25373  4.4776119 1.4925373
## Wheaties                                G 100.00000  3.0000000 1.0000000
## Wheaties Honey Gold                     G 146.66667  2.6666667 1.3333333
##                                          sodium     fibre    carbo
## 100% Bran                             393.93939 30.303030 15.15152
## All-Bran                              787.87879 27.272727 21.21212
## All-Bran with Extra Fiber             280.00000 28.000000 16.00000
## Apple Cinnamon Cheerios               240.00000  2.000000 14.00000
## Apple Jacks                           125.00000  1.000000 11.00000
## Basic 4                               280.00000  2.666667 24.00000
## Bran Chex                             298.50746  5.970149 22.38806
## Bran Flakes                           313.43284  7.462687 19.40299
## Cap'n'Crunch                          293.33333  0.000000 16.00000
## Cheerios                              232.00000  1.600000 13.60000
## Cinnamon Toast Crunch                 280.00000  0.000000 17.33333
## Clusters                              280.00000  4.000000 26.00000
## Cocoa Puffs                           180.00000  0.000000 12.00000
## Corn Chex                             280.00000  0.000000 22.00000
## Corn Flakes                           290.00000  1.000000 21.00000
## Corn Pops                              90.00000  1.000000 13.00000
## Count Chocula                         180.00000  0.000000 12.00000
## Cracklin' Oat Bran                    280.00000  8.000000 20.00000
## Crispix                               220.00000  1.000000 21.00000
## Crispy Wheat & Raisins                186.66667  2.666667 14.66667
## Double Chex                           253.33333  1.333333 24.00000
## Froot Loops                           125.00000  1.000000 11.00000
## Frosted Flakes                        266.66667  1.333333 18.66667
## Frosted Mini-Wheats                     0.00000  3.750000 17.50000
## Fruit & Fibre: Dates Walnuts and Oats 238.80597  7.462687 17.91045
## Fruitful Bran                         358.20896  7.462687 20.89552
## Fruity Pebbles                        180.00000  0.000000 17.33333
## Golden Crisp                           51.13636  0.000000 12.50000
## Golden Grahams                        373.33333  0.000000 20.00000
## Grape Nuts Flakes                     159.09091  3.409091 17.04545
## Grape-Nuts                            680.00000 12.000000 68.00000
## Great Grains Pecan                    227.27273  9.090909 39.39394
## Honey Graham Ohs                      220.00000  1.000000 12.00000
## Honey Nut Cheerios                    333.33333  2.000000 15.33333
## Honey-comb                            135.33835  0.000000 10.52632
## Just Right Fruit & Nut                226.66667  2.666667 26.66667
## Kix                                   173.33333  0.000000 14.00000
## Life                                  223.88060  2.985075 17.91045
## Lucky Charms                          180.00000  0.000000 12.00000
## Mueslix Crispy Blend                  223.88060  4.477612 25.37313
## Multi-Grain Cheerios                  220.00000  2.000000 15.00000
## Nut&Honey Crunch                      283.58209  0.000000 22.38806
## Nutri-Grain Almond-Raisin             328.35821  4.477612 31.34328
## Oatmeal Raisin Crisp                  340.00000  3.000000 27.00000
## Post Nat. Raisin Bran                 298.50746  8.955224 16.41791
## Product 19                            320.00000  1.000000 20.00000
## Puffed Rice                             0.00000  0.000000 13.00000
## Quaker Oat Squares                    270.00000  4.000000 28.00000
## Raisin Bran                           280.00000  6.666667 18.66667
## Raisin Nut Bran                       280.00000  5.000000 21.00000
## Raisin Squares                          0.00000  4.000000 30.00000
## Rice Chex                             212.38938  0.000000 20.35398
## Rice Krispies                         290.00000  0.000000 22.00000
## Shredded Wheat 'n'Bran                  0.00000  5.970149 28.35821
## Shredded Wheat spoon size               0.00000  4.477612 29.85075
## Smacks                                 93.33333  1.333333 12.00000
## Special K                             230.00000  1.000000 16.00000
## Total Corn Flakes                     200.00000  0.000000 21.00000
## Total Raisin Bran                     190.00000  4.000000 15.00000
## Total Whole Grain                     200.00000  3.000000 16.00000
## Triples                               333.33333  0.000000 28.00000
## Trix                                  140.00000  0.000000 13.00000
## Wheat Chex                            343.28358  4.477612 25.37313
## Wheaties                              200.00000  3.000000 17.00000
## Wheaties Honey Gold                   266.66667  1.333333 21.33333
##                                          sugars shelf potassium vitamins
## 100% Bran                             18.181818     3 848.48485 enriched
## All-Bran                              15.151515     3 969.69697 enriched
## All-Bran with Extra Fiber              0.000000     3 660.00000 enriched
## Apple Cinnamon Cheerios               13.333333     1  93.33333 enriched
## Apple Jacks                           14.000000     2  30.00000 enriched
## Basic 4                               10.666667     3 133.33333 enriched
## Bran Chex                              8.955224     1 186.56716 enriched
## Bran Flakes                            7.462687     3 283.58209 enriched
## Cap'n'Crunch                          16.000000     2  46.66667 enriched
## Cheerios                               0.800000     1  84.00000 enriched
## Cinnamon Toast Crunch                 12.000000     2  60.00000 enriched
## Clusters                              14.000000     3 210.00000 enriched
## Cocoa Puffs                           13.000000     2  55.00000 enriched
## Corn Chex                              3.000000     1  25.00000 enriched
## Corn Flakes                            2.000000     1  35.00000 enriched
## Corn Pops                             12.000000     2  20.00000 enriched
## Count Chocula                         13.000000     2  65.00000 enriched
## Cracklin' Oat Bran                    14.000000     3 320.00000 enriched
## Crispix                                3.000000     3  30.00000 enriched
## Crispy Wheat & Raisins                13.333333     3 160.00000 enriched
## Double Chex                            6.666667     3 106.66667 enriched
## Froot Loops                           13.000000     2  30.00000 enriched
## Frosted Flakes                        14.666667     1  33.33333 enriched
## Frosted Mini-Wheats                    8.750000     2 125.00000 enriched
## Fruit & Fibre: Dates Walnuts and Oats 14.925373     3 298.50746 enriched
## Fruitful Bran                         17.910448     3 283.58209 enriched
## Fruity Pebbles                        16.000000     2  33.33333 enriched
## Golden Crisp                          17.045455     1  45.45455 enriched
## Golden Grahams                        12.000000     2  60.00000 enriched
## Grape Nuts Flakes                      5.681818     3  96.59091 enriched
## Grape-Nuts                            12.000000     3 360.00000 enriched
## Great Grains Pecan                    12.121212     3 303.03030 enriched
## Honey Graham Ohs                      11.000000     2  45.00000 enriched
## Honey Nut Cheerios                    13.333333     1 120.00000 enriched
## Honey-comb                             8.270677     1  26.31579 enriched
## Just Right Fruit & Nut                12.000000     3 126.66667     100%
## Kix                                    2.000000     2  26.66667 enriched
## Life                                   8.955224     2 141.79104 enriched
## Lucky Charms                          12.000000     2  55.00000 enriched
## Mueslix Crispy Blend                  19.402985     3 238.80597 enriched
## Multi-Grain Cheerios                   6.000000     1  90.00000 enriched
## Nut&Honey Crunch                      13.432836     2  59.70149 enriched
## Nutri-Grain Almond-Raisin             10.447761     3 194.02985 enriched
## Oatmeal Raisin Crisp                  20.000000     3 240.00000 enriched
## Post Nat. Raisin Bran                 20.895522     3 388.05970 enriched
## Product 19                             3.000000     3  45.00000     100%
## Puffed Rice                            0.000000     3  15.00000     none
## Quaker Oat Squares                    12.000000     3 220.00000 enriched
## Raisin Bran                           16.000000     2 320.00000 enriched
## Raisin Nut Bran                       16.000000     3 280.00000 enriched
## Raisin Squares                        12.000000     3 220.00000 enriched
## Rice Chex                              1.769912     1  26.54867 enriched
## Rice Krispies                          3.000000     1  35.00000 enriched
## Shredded Wheat 'n'Bran                 0.000000     1 208.95522     none
## Shredded Wheat spoon size              0.000000     1 179.10448     none
## Smacks                                20.000000     2  53.33333 enriched
## Special K                              3.000000     1  55.00000 enriched
## Total Corn Flakes                      3.000000     3  35.00000     100%
## Total Raisin Bran                     14.000000     3 230.00000     100%
## Total Whole Grain                      3.000000     3 110.00000     100%
## Triples                                4.000000     3  80.00000 enriched
## Trix                                  12.000000     2  25.00000 enriched
## Wheat Chex                             4.477612     1 171.64179 enriched
## Wheaties                               3.000000     1 110.00000 enriched
## Wheaties Honey Gold                   10.666667     1  80.00000 enriched

You can see that it’s much larger than anything we’ve worked with in R so far. If your screen isn’t wide enough, the columns may have wrapped around and started new lines. No worries - they’re still where they should be at the end of your data frame; it’s just a quirk of how data is displayed in R when your screen isn’t large enough.

Now let’s get a feel for the data. One of the quickest ways to check a data frame is with the head() function. By default, this function displays only the first 6 rows of data, giving you a much more manageable chunk of data to digest. Go ahead and check the first 6 lines of UScereal.

head(UScereal)
##                           mfr calories   protein      fat   sodium
## 100% Bran                   N 212.1212 12.121212 3.030303 393.9394
## All-Bran                    K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000
## Apple Jacks                 K 110.0000  2.000000 0.000000 125.0000
## Basic 4                     G 173.3333  4.000000 2.666667 280.0000
##                               fibre    carbo   sugars shelf potassium
## 100% Bran                 30.303030 15.15152 18.18182     3 848.48485
## All-Bran                  27.272727 21.21212 15.15151     3 969.69697
## All-Bran with Extra Fiber 28.000000 16.00000  0.00000     3 660.00000
## Apple Cinnamon Cheerios    2.000000 14.00000 13.33333     1  93.33333
## Apple Jacks                1.000000 11.00000 14.00000     2  30.00000
## Basic 4                    2.666667 24.00000 10.66667     3 133.33333
##                           vitamins
## 100% Bran                 enriched
## All-Bran                  enriched
## All-Bran with Extra Fiber enriched
## Apple Cinnamon Cheerios   enriched
## Apple Jacks               enriched
## Basic 4                   enriched

Much nicer. We can see our row names, each representing a different cereal brand, as well as 11 columns, each representing a variable describing the cereal. Most of them are nutritional measurements, such as calories, protein, and fat.

Note that you can display different numbers of rows with head() using a second argument. You can query the help file to find out more.
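As a quick sketch of that second argument to head(), here it is applied to a small made-up data frame. The argument is called n, and a negative value means "everything except the last n rows".

```r
# Made-up data frame with 10 rows
df <- data.frame(id = 1:10, value = (1:10)^2)

# Show only the first 3 rows
first_three <- head(df, n = 3)
first_three

# Show everything except the last 2 rows
all_but_two <- head(df, n = -2)
nrow(all_but_two)   # 8
```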

head() also has a mirror function in tail(). As you might expect, it gives you the last 6 rows of data. Go ahead and find the last 6 rows of UScereal now.

tail(UScereal)
##                     mfr calories  protein      fat   sodium    fibre
## Total Whole Grain     G 100.0000 3.000000 1.000000 200.0000 3.000000
## Triples               G 146.6667 2.666667 1.333333 333.3333 0.000000
## Trix                  G 110.0000 1.000000 1.000000 140.0000 0.000000
## Wheat Chex            R 149.2537 4.477612 1.492537 343.2836 4.477612
## Wheaties              G 100.0000 3.000000 1.000000 200.0000 3.000000
## Wheaties Honey Gold   G 146.6667 2.666667 1.333333 266.6667 1.333333
##                        carbo    sugars shelf potassium vitamins
## Total Whole Grain   16.00000  3.000000     3  110.0000     100%
## Triples             28.00000  4.000000     3   80.0000 enriched
## Trix                13.00000 12.000000     2   25.0000 enriched
## Wheat Chex          25.37313  4.477612     1  171.6418 enriched
## Wheaties            17.00000  3.000000     1  110.0000 enriched
## Wheaties Honey Gold 21.33333 10.666667     1   80.0000 enriched

Note that we have mixed variables - some columns are numeric, while others are characters or factors. Again, this is a feature of data frames. Let’s find out exactly what type of data is present in each column. If you remember back a few lessons, we used the structure function, str(), to determine details of a vector. It will tell us even more about our data frame. Go ahead and look at the structure of UScereal now.

str(UScereal)
## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...

Let’s unpack this output. str() is first telling us that our object is a data frame with 65 observations and 11 variables. This is the same as saying our data frame has 65 rows and 11 columns.

Next we have a short summary of each column in the data frame. $col_name refers to the column names while the entry after the : tells us the data type. We can see that most of our columns are numeric, but we have two factors as well. It also tells us how many levels there are to our factors (return to the Understanding Factors lesson if you need a refresher). Finally, str() gives us the first handful of entries in that column.
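If you only want the data type of each column, without the rest of the str() output, one possible sketch is to apply class() to every column with sapply(). The small data frame below is made up to mimic a few UScereal columns.

```r
# Made-up data frame with a factor, a numeric, and an integer column
df <- data.frame(
  mfr = factor(c("G", "K", "G")),
  calories = c(110, 100, 146.7),
  shelf = c(1L, 3L, 2L)
)

# One class per column
col_classes <- sapply(df, class)
col_classes

# Number of levels per column (0 for columns that are not factors)
col_levels <- sapply(df, nlevels)
col_levels
```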

We can also use the summary() function on data frames. Go ahead and do so on UScereal.

summary(UScereal)
##  mfr       calories        protein             fat            sodium     
##  G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0  
##  K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0  
##  N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0  
##  P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8  
##  Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0  
##  R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9  
##      fibre            carbo           sugars          shelf      
##  Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000  
##  Median : 2.000   Median :18.67   Median :12.00   Median :2.000  
##  Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169  
##  3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000  
##  Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000  
##    potassium          vitamins 
##  Min.   : 15.00   100%    : 5  
##  1st Qu.: 45.00   enriched:57  
##  Median : 96.59   none    : 3  
##  Mean   :159.12                
##  3rd Qu.:220.00                
##  Max.   :969.70

summary() treats each column as a separate vector, and gives us a breakdown of each. Note the differences between numeric columns and factors. With numeric columns, we get basic summary statistics; with factors, we see how many times each level shows up.

Tabulation of factors is particularly important, and we’ll need to do this frequently once we start analyzing categorical data. We’ll end this lesson by introducing you to an important function and an important operator.

The table() function will count the occurrences of each level of a factor in a vector. But how do we pull out just one column from our data frame? The easiest way is the $ operator. If we call our data frame and follow its name with $ and the column we want, we’ll just get that column as a vector. Go ahead and enter UScereal$vitamins now.

UScereal$vitamins
##  [1] enriched enriched enriched enriched enriched enriched enriched
##  [8] enriched enriched enriched enriched enriched enriched enriched
## [15] enriched enriched enriched enriched enriched enriched enriched
## [22] enriched enriched enriched enriched enriched enriched enriched
## [29] enriched enriched enriched enriched enriched enriched enriched
## [36] 100%     enriched enriched enriched enriched enriched enriched
## [43] enriched enriched enriched 100%     none     enriched enriched
## [50] enriched enriched enriched enriched none     none     enriched
## [57] enriched 100%     100%     100%     enriched enriched enriched
## [64] enriched enriched
## Levels: 100% enriched none

You’ve isolated just that column of data. Now enter table(UScereal$vitamins).

table(UScereal$vitamins)
## 
##     100% enriched     none 
##        5       57        3

You’ve now tabulated the factor levels in the vitamins column. Well done! We’ll be using both table() and $ a lot in the near future.
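To round this off, here is a short sketch on a made-up data frame. It shows that double square brackets with the column name are an alternative to $, and that prop.table() turns the counts from table() into proportions.

```r
# Made-up data frame with a single factor column
df <- data.frame(
  vitamins = factor(c("enriched", "enriched", "none", "100%", "enriched"))
)

# Two equivalent ways to pull out one column
col1 <- df$vitamins
col2 <- df[["vitamins"]]
identical(col1, col2)   # TRUE

# Count each level, then look up a single count by name
counts <- table(col1)
counts[["enriched"]]    # 3

# Convert the counts to proportions that sum to 1
prop.table(counts)
```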

In this lesson, you learned the basics of working with two very important and common data structures – matrices and data frames. There’s much more to learn and we’ll be covering more advanced topics, particularly with respect to data frames, in future lessons.
