Matrices and Data Frames

In this lesson, we’ll cover matrices and data frames. Both represent ‘rectangular’ data types, meaning that they are used to store tabular data, with rows and columns. You’ll also learn some more tools for looking at your data objects in R.

The main difference between matrices and data frames, as you’ll see, is that matrices can only contain a single class of data, while data frames can consist of many different classes of data.

Let’s create a vector containing the numbers 1 through 20 using the : operator. Store the result in a variable called my_vector.

my_vector <- 1:20

View the contents of the vector you just created.

my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

The dim() function tells us the ‘dimensions’ of an object. What happens if we do dim(my_vector)? Give it a try.

dim(my_vector)
## NULL

Clearly, that’s not very helpful! Since my_vector is a vector, it doesn’t have a dim attribute (so it’s just NULL), but we can find its length using the length() function. Try that now.

length(my_vector)
## [1] 20

Ah! That’s what we wanted. But, what happens if we give my_vector a dim attribute? Let’s give it a try. Type dim(my_vector) <- c(4, 5).

dim(my_vector) <- c(4, 5)

It’s okay if that last command seemed a little strange to you. It should! The dim() function allows you to get OR set the dim attribute for an R object. In this case, we assigned the value c(4, 5) to the dim attribute of my_vector.
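To make the get-or-set behavior concrete, here is a minimal sketch on a separate, shorter vector. The name demo_vec is hypothetical and is not part of the lesson; it is used only so that my_vector stays untouched.

```r
# demo_vec is a made-up name used only for this illustration
demo_vec <- 1:6

# Getting: a plain vector has no dim attribute yet
dim_before <- dim(demo_vec)
is.null(dim_before)        # TRUE

# Setting: ask for 2 rows and 3 columns (2 * 3 must equal the length, 6)
dim(demo_vec) <- c(2, 3)

# Getting again: the attribute now exists
dim_after <- dim(demo_vec)
dim_after                  # 2 3
```

The same function name appears on both sides: on its own it reads the attribute, and on the left of <- it writes it.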

Use dim(my_vector) to confirm that we’ve set the dim attribute correctly.

dim(my_vector)
## [1] 4 5

Another way to see this is by calling the attributes() function on my_vector. Try it now.

attributes(my_vector)
## $dim
## [1] 4 5

Just like in math class, when dealing with a 2-dimensional object (think rectangular table), the first number is the number of rows and the second is the number of columns. Therefore, we just gave my_vector 4 rows and 5 columns.

But, wait! That doesn’t sound like a vector any more. Well, it’s not. Now it’s a matrix. View the contents of my_vector now to see what it looks like.

my_vector
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

Now, let’s confirm it’s actually a matrix by using another useful function, class(). Type class(my_vector) to see what I mean.

class(my_vector)
## [1] "matrix"

Sure enough, my_vector is now a matrix. (In R 4.0.0 and later, the output also lists "array", because every matrix is a 2-dimensional array.) We should store it in a new variable that helps us remember what it is. Store the value of my_vector in a new variable called my_matrix.

my_matrix <- my_vector

The example that we’ve used so far was meant to illustrate the point that a matrix is simply an atomic vector with a dimension attribute. A more direct method of creating the same matrix uses the matrix() function.

Bring up the help file for the matrix() function now using the ? function.

?matrix

Now, look at the documentation for the matrix function and see if you can figure out how to create a matrix containing the same numbers (1-20) and dimensions (4 rows, 5 columns) by calling the matrix() function. Store the result in a variable called my_matrix2.

my_matrix2 <- matrix(1:20, nrow=4, ncol=5)

In addition to dim(), we can also check the specific number of rows or columns in our 2-dimensional object with either nrow() or ncol(). Check the number of rows of my_matrix2 now to make sure it’s 4.

nrow(my_matrix2)
## [1] 4

Once you start working with large datasets, nrow() becomes a very useful function for identifying how many observations you have. Now check the number of columns with ncol().

ncol(my_matrix2)
## [1] 5

Finally, let’s confirm that my_matrix and my_matrix2 are actually identical.
The identical() function will tell us if its first two arguments are the same. Try it out.

identical(my_matrix, my_matrix2)
## [1] TRUE

Now, imagine that the numbers in our table represent some measurements from a clinical experiment, where each row represents one patient and each column represents one variable for which measurements were taken.

We may want to label the rows, so that we know which numbers belong to each patient in the experiment. One way to do this is to add a column to the matrix, which contains the names of all four people.

Let’s start by creating a character vector containing the names of our patients – Bill, Gina, Kelly, and Sean. Remember that double quotes tell R that something is a character string. Store the result in a variable called patients.

patients <- c("Bill", "Gina", "Kelly", "Sean")

Now we’ll use the cbind() function to ‘combine columns’. Don’t worry about storing the result in a new variable. Just call cbind() with two arguments – the patients vector and my_matrix.

cbind(patients, my_matrix)
##      patients
## [1,] "Bill"   "1" "5" "9"  "13" "17"
## [2,] "Gina"   "2" "6" "10" "14" "18"
## [3,] "Kelly"  "3" "7" "11" "15" "19"
## [4,] "Sean"   "4" "8" "12" "16" "20"

Something is fishy about our result! It appears that combining the character vector with our matrix of numbers caused everything to be enclosed in double quotes. This means we’re left with a matrix of character strings, which is no good.

If you remember back to the beginning of this lesson, I told you that matrices can only contain ONE class of data. Therefore, when we tried to combine a character vector with a numeric matrix, R was forced to ‘coerce’ the numbers to characters, hence the double quotes.

This is called ‘implicit coercion’, because we didn’t ask for it. It just happened. But why didn’t R just convert the names of our patients to numbers? I’ll let you ponder that question on your own.
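If you would like a hint for that question, the following sketch shows coercion in both directions on small throwaway values. The names mixed and as_number are hypothetical and exist only for this illustration.

```r
# Mixing numbers and strings in one atomic vector coerces everything to character
mixed <- c(1, "Bill", 20)
class(mixed)                 # "character"

# Every number has a text representation...
as.character(17)             # "17"

# ...but a name has no numeric meaning, so explicit conversion gives NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
as_number <- suppressWarnings(as.numeric("Bill"))
is.na(as_number)             # TRUE
```

Converting numbers to text keeps all the information, while converting names to numbers would throw it away.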
So, we’re still left with the question of how to include the names of our patients in the table without destroying the integrity of our numeric data. Try the following – my_data <- data.frame(patients, my_matrix)

my_data <- data.frame(patients, my_matrix)

Now view the contents of my_data to see what we’ve come up with.

my_data
##   patients X1 X2 X3 X4 X5
## 1     Bill  1  5  9 13 17
## 2     Gina  2  6 10 14 18
## 3    Kelly  3  7 11 15 19
## 4     Sean  4  8 12 16 20

It looks like the data.frame() function allowed us to store our character vector of names right alongside our matrix of numbers. That’s exactly what we were hoping for!

Behind the scenes, the data.frame() function takes any number of arguments and returns a single object of class data.frame that is composed of the original objects. Let’s confirm this by calling the class() function on our newly created data frame.

class(my_data)
## [1] "data.frame"

If we don’t want to create a brand new data frame, we can also convert our original matrix into one to accomplish the same thing. Create a new object called my_data2 using the function as.data.frame() on my_matrix.

my_data2 <- as.data.frame(my_matrix)

Now try to cbind() patients with my_data2. You don’t need to assign the result to a new object for now.

cbind(my_data2, patients)
##   V1 V2 V3 V4 V5 patients
## 1  1  5  9 13 17     Bill
## 2  2  6 10 14 18     Gina
## 3  3  7 11 15 19    Kelly
## 4  4  8 12 16 20     Sean

It worked! We converted our matrix to a data frame, which let us include a different data type in our object. Note that when we converted our matrix, R also automatically assigned row and column names to our new data frame.

It’s also possible to assign names to the individual rows and columns of a data frame, which presents another possible way of determining which row of values in our table belongs to each patient.
However, since we’ve already solved that problem, let’s solve a different problem by assigning names to the columns of our data frame so that we know what type of measurement each column represents.

Since we have six columns (including patient names), we’ll need to first create a vector containing one element for each column. Create a character vector called cnames that contains the following values (in order) – ‘patient’, ‘age’, ‘weight’, ‘bp’, ‘rating’, ‘test’.

cnames <- c("patient", "age", "weight", "bp", "rating", "test")

Now, use the colnames() function to set the colnames attribute for our original data frame, my_data. This is similar to the way we used the dim() function earlier in this lesson.

colnames(my_data) <- cnames

Let’s see if that got the job done. Print the contents of my_data.

my_data
##   patient age weight bp rating test
## 1    Bill   1      5  9     13   17
## 2    Gina   2      6 10     14   18
## 3   Kelly   3      7 11     15   19
## 4    Sean   4      8 12     16   20

Now that you’ve made your own data frame, we are going to load one in for you so we can see what a real data frame might look like and explore it a bit. Go ahead and look at a new object, UScereal, now.

UScereal
## mfr calories protein fat
## 100% Bran N 212.12121 12.1212121 3.0303030
## All-Bran K 212.12121 12.1212121 3.0303030
## All-Bran with Extra Fiber K 100.00000 8.0000000 0.0000000
## Apple Cinnamon Cheerios G 146.66667 2.6666667 2.6666667
## Apple Jacks K 110.00000 2.0000000 0.0000000
## Basic 4 G 173.33333 4.0000000 2.6666667
## Bran Chex R 134.32836 2.9850746 1.4925373
## Bran Flakes P 134.32836 4.4776119 0.0000000
## Cap'n'Crunch Q 160.00000 1.3333333 2.6666667
## Cheerios G 88.00000 4.8000000 1.6000000
## Cinnamon Toast Crunch G 160.00000 1.3333333 4.0000000
## Clusters G 220.00000 6.0000000 4.0000000
## Cocoa Puffs G 110.00000 1.0000000 1.0000000
## Corn Chex R 110.00000 2.0000000 0.0000000
## Corn Flakes K 100.00000 2.0000000 0.0000000
## Corn Pops K 110.00000 1.0000000 0.0000000
## Count Chocula G 110.00000 1.0000000 1.0000000
## Cracklin' Oat Bran K 220.00000 6.0000000 6.0000000
## Crispix K 110.00000 2.0000000 0.0000000
## Crispy Wheat & Raisins G 133.33333 2.6666667 1.3333333
## Double Chex R 133.33333 2.6666667 0.0000000
## Froot Loops K 110.00000 2.0000000 1.0000000
## Frosted Flakes K 146.66667 1.3333333 0.0000000
## Frosted Mini-Wheats K 125.00000 3.7500000 0.0000000
## Fruit & Fibre: Dates Walnuts and Oats P 179.10448 4.4776119 2.9850746
## Fruitful Bran K 179.10448 4.4776119 0.0000000
## Fruity Pebbles P 146.66667 1.3333333 1.3333333
## Golden Crisp P 113.63636 2.2727273 0.0000000
## Golden Grahams G 146.66667 1.3333333 1.3333333
## Grape Nuts Flakes P 113.63636 3.4090909 1.1363636
## Grape-Nuts P 440.00000 12.0000000 0.0000000
## Great Grains Pecan P 363.63636 9.0909091 9.0909091
## Honey Graham Ohs Q 120.00000 1.0000000 2.0000000
## Honey Nut Cheerios G 146.66667 4.0000000 1.3333333
## Honey-comb P 82.70677 0.7518797 0.0000000
## Just Right Fruit & Nut K 186.66667 4.0000000 1.3333333
## Kix G 73.33333 1.3333333 0.6666667
## Life Q 149.25373 5.9701493 2.9850746
## Lucky Charms G 110.00000 2.0000000 1.0000000
## Mueslix Crispy Blend K 238.80597 4.4776119 2.9850746
## Multi-Grain Cheerios G 100.00000 2.0000000 1.0000000
## Nut&Honey Crunch K 179.10448 2.9850746 1.4925373
## Nutri-Grain Almond-Raisin K 208.95522 4.4776119 2.9850746
## Oatmeal Raisin Crisp G 260.00000 6.0000000 4.0000000
## Post Nat. Raisin Bran P 179.10448 4.4776119 1.4925373
## Product 19 K 100.00000 3.0000000 0.0000000
## Puffed Rice Q 50.00000 1.0000000 0.0000000
## Quaker Oat Squares Q 200.00000 8.0000000 2.0000000
## Raisin Bran K 160.00000 4.0000000 1.3333333
## Raisin Nut Bran G 200.00000 6.0000000 4.0000000
## Raisin Squares K 180.00000 4.0000000 0.0000000
## Rice Chex R 97.34513 0.8849558 0.0000000
## Rice Krispies K 110.00000 2.0000000 0.0000000
## Shredded Wheat 'n'Bran N 134.32836 4.4776119 0.0000000
## Shredded Wheat spoon size N 134.32836 4.4776119 0.0000000
## Smacks K 146.66667 2.6666667 1.3333333
## Special K K 110.00000 6.0000000 0.0000000
## Total Corn Flakes G 110.00000 2.0000000 1.0000000
## Total Raisin Bran G 140.00000 3.0000000 1.0000000
## Total Whole Grain G 100.00000 3.0000000 1.0000000
## Triples G 146.66667 2.6666667 1.3333333
## Trix G 110.00000 1.0000000 1.0000000
## Wheat Chex R 149.25373 4.4776119 1.4925373
## Wheaties G 100.00000 3.0000000 1.0000000
## Wheaties Honey Gold G 146.66667 2.6666667 1.3333333
## sodium fibre carbo
## 100% Bran 393.93939 30.303030 15.15152
## All-Bran 787.87879 27.272727 21.21212
## All-Bran with Extra Fiber 280.00000 28.000000 16.00000
## Apple Cinnamon Cheerios 240.00000 2.000000 14.00000
## Apple Jacks 125.00000 1.000000 11.00000
## Basic 4 280.00000 2.666667 24.00000
## Bran Chex 298.50746 5.970149 22.38806
## Bran Flakes 313.43284 7.462687 19.40299
## Cap'n'Crunch 293.33333 0.000000 16.00000
## Cheerios 232.00000 1.600000 13.60000
## Cinnamon Toast Crunch 280.00000 0.000000 17.33333
## Clusters 280.00000 4.000000 26.00000
## Cocoa Puffs 180.00000 0.000000 12.00000
## Corn Chex 280.00000 0.000000 22.00000
## Corn Flakes 290.00000 1.000000 21.00000
## Corn Pops 90.00000 1.000000 13.00000
## Count Chocula 180.00000 0.000000 12.00000
## Cracklin' Oat Bran 280.00000 8.000000 20.00000
## Crispix 220.00000 1.000000 21.00000
## Crispy Wheat & Raisins 186.66667 2.666667 14.66667
## Double Chex 253.33333 1.333333 24.00000
## Froot Loops 125.00000 1.000000 11.00000
## Frosted Flakes 266.66667 1.333333 18.66667
## Frosted Mini-Wheats 0.00000 3.750000 17.50000
## Fruit & Fibre: Dates Walnuts and Oats 238.80597 7.462687 17.91045
## Fruitful Bran 358.20896 7.462687 20.89552
## Fruity Pebbles 180.00000 0.000000 17.33333
## Golden Crisp 51.13636 0.000000 12.50000
## Golden Grahams 373.33333 0.000000 20.00000
## Grape Nuts Flakes 159.09091 3.409091 17.04545
## Grape-Nuts 680.00000 12.000000 68.00000
## Great Grains Pecan 227.27273 9.090909 39.39394
## Honey Graham Ohs 220.00000 1.000000 12.00000
## Honey Nut Cheerios 333.33333 2.000000 15.33333
## Honey-comb 135.33835 0.000000 10.52632
## Just Right Fruit & Nut 226.66667 2.666667 26.66667
## Kix 173.33333 0.000000 14.00000
## Life 223.88060 2.985075 17.91045
## Lucky Charms 180.00000 0.000000 12.00000
## Mueslix Crispy Blend 223.88060 4.477612 25.37313
## Multi-Grain Cheerios 220.00000 2.000000 15.00000
## Nut&Honey Crunch 283.58209 0.000000 22.38806
## Nutri-Grain Almond-Raisin 328.35821 4.477612 31.34328
## Oatmeal Raisin Crisp 340.00000 3.000000 27.00000
## Post Nat. Raisin Bran 298.50746 8.955224 16.41791
## Product 19 320.00000 1.000000 20.00000
## Puffed Rice 0.00000 0.000000 13.00000
## Quaker Oat Squares 270.00000 4.000000 28.00000
## Raisin Bran 280.00000 6.666667 18.66667
## Raisin Nut Bran 280.00000 5.000000 21.00000
## Raisin Squares 0.00000 4.000000 30.00000
## Rice Chex 212.38938 0.000000 20.35398
## Rice Krispies 290.00000 0.000000 22.00000
## Shredded Wheat 'n'Bran 0.00000 5.970149 28.35821
## Shredded Wheat spoon size 0.00000 4.477612 29.85075
## Smacks 93.33333 1.333333 12.00000
## Special K 230.00000 1.000000 16.00000
## Total Corn Flakes 200.00000 0.000000 21.00000
## Total Raisin Bran 190.00000 4.000000 15.00000
## Total Whole Grain 200.00000 3.000000 16.00000
## Triples 333.33333 0.000000 28.00000
## Trix 140.00000 0.000000 13.00000
## Wheat Chex 343.28358 4.477612 25.37313
## Wheaties 200.00000 3.000000 17.00000
## Wheaties Honey Gold 266.66667 1.333333 21.33333
## sugars shelf potassium vitamins
## 100% Bran 18.181818 3 848.48485 enriched
## All-Bran 15.151515 3 969.69697 enriched
## All-Bran with Extra Fiber 0.000000 3 660.00000 enriched
## Apple Cinnamon Cheerios 13.333333 1 93.33333 enriched
## Apple Jacks 14.000000 2 30.00000 enriched
## Basic 4 10.666667 3 133.33333 enriched
## Bran Chex 8.955224 1 186.56716 enriched
## Bran Flakes 7.462687 3 283.58209 enriched
## Cap'n'Crunch 16.000000 2 46.66667 enriched
## Cheerios 0.800000 1 84.00000 enriched
## Cinnamon Toast Crunch 12.000000 2 60.00000 enriched
## Clusters 14.000000 3 210.00000 enriched
## Cocoa Puffs 13.000000 2 55.00000 enriched
## Corn Chex 3.000000 1 25.00000 enriched
## Corn Flakes 2.000000 1 35.00000 enriched
## Corn Pops 12.000000 2 20.00000 enriched
## Count Chocula 13.000000 2 65.00000 enriched
## Cracklin' Oat Bran 14.000000 3 320.00000 enriched
## Crispix 3.000000 3 30.00000 enriched
## Crispy Wheat & Raisins 13.333333 3 160.00000 enriched
## Double Chex 6.666667 3 106.66667 enriched
## Froot Loops 13.000000 2 30.00000 enriched
## Frosted Flakes 14.666667 1 33.33333 enriched
## Frosted Mini-Wheats 8.750000 2 125.00000 enriched
## Fruit & Fibre: Dates Walnuts and Oats 14.925373 3 298.50746 enriched
## Fruitful Bran 17.910448 3 283.58209 enriched
## Fruity Pebbles 16.000000 2 33.33333 enriched
## Golden Crisp 17.045455 1 45.45455 enriched
## Golden Grahams 12.000000 2 60.00000 enriched
## Grape Nuts Flakes 5.681818 3 96.59091 enriched
## Grape-Nuts 12.000000 3 360.00000 enriched
## Great Grains Pecan 12.121212 3 303.03030 enriched
## Honey Graham Ohs 11.000000 2 45.00000 enriched
## Honey Nut Cheerios 13.333333 1 120.00000 enriched
## Honey-comb 8.270677 1 26.31579 enriched
## Just Right Fruit & Nut 12.000000 3 126.66667 100%
## Kix 2.000000 2 26.66667 enriched
## Life 8.955224 2 141.79104 enriched
## Lucky Charms 12.000000 2 55.00000 enriched
## Mueslix Crispy Blend 19.402985 3 238.80597 enriched
## Multi-Grain Cheerios 6.000000 1 90.00000 enriched
## Nut&Honey Crunch 13.432836 2 59.70149 enriched
## Nutri-Grain Almond-Raisin 10.447761 3 194.02985 enriched
## Oatmeal Raisin Crisp 20.000000 3 240.00000 enriched
## Post Nat. Raisin Bran 20.895522 3 388.05970 enriched
## Product 19 3.000000 3 45.00000 100%
## Puffed Rice 0.000000 3 15.00000 none
## Quaker Oat Squares 12.000000 3 220.00000 enriched
## Raisin Bran 16.000000 2 320.00000 enriched
## Raisin Nut Bran 16.000000 3 280.00000 enriched
## Raisin Squares 12.000000 3 220.00000 enriched
## Rice Chex 1.769912 1 26.54867 enriched
## Rice Krispies 3.000000 1 35.00000 enriched
## Shredded Wheat 'n'Bran 0.000000 1 208.95522 none
## Shredded Wheat spoon size 0.000000 1 179.10448 none
## Smacks 20.000000 2 53.33333 enriched
## Special K 3.000000 1 55.00000 enriched
## Total Corn Flakes 3.000000 3 35.00000 100%
## Total Raisin Bran 14.000000 3 230.00000 100%
## Total Whole Grain 3.000000 3 110.00000 100%
## Triples 4.000000 3 80.00000 enriched
## Trix 12.000000 2 25.00000 enriched
## Wheat Chex 4.477612 1 171.64179 enriched
## Wheaties 3.000000 1 110.00000 enriched
## Wheaties Honey Gold 10.666667 1 80.00000 enriched

You can see that it’s much larger than anything we’ve worked with in R so far. If your screen isn’t wide enough, the columns may have wrapped around and started new lines. No worries - they’re still where they should be at the end of your data frame; it’s just a quirk of how data is displayed in R when your screen isn’t large enough.

Now let’s get a feel for the data. One of the quickest ways to check a data frame is with the head() function. By default, this function displays only the first 6 rows of data, giving you a much more manageable chunk of data to digest.
Go ahead and check the first 6 lines of UScereal.

head(UScereal)
##                           mfr calories   protein      fat   sodium
## 100% Bran                   N 212.1212 12.121212 3.030303 393.9394
## All-Bran                    K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000
## Apple Jacks                 K 110.0000  2.000000 0.000000 125.0000
## Basic 4                     G 173.3333  4.000000 2.666667 280.0000
##                               fibre    carbo   sugars shelf potassium
## 100% Bran                 30.303030 15.15152 18.18182     3 848.48485
## All-Bran                  27.272727 21.21212 15.15151     3 969.69697
## All-Bran with Extra Fiber 28.000000 16.00000  0.00000     3 660.00000
## Apple Cinnamon Cheerios    2.000000 14.00000 13.33333     1  93.33333
## Apple Jacks                1.000000 11.00000 14.00000     2  30.00000
## Basic 4                    2.666667 24.00000 10.66667     3 133.33333
##                           vitamins
## 100% Bran                 enriched
## All-Bran                  enriched
## All-Bran with Extra Fiber enriched
## Apple Cinnamon Cheerios   enriched
## Apple Jacks               enriched
## Basic 4                   enriched

Note that you can display different numbers of rows with head() using a second argument, n. You can query the help file to find out more.

Much nicer. We can see our row names, each representing a different cereal brand, as well as 11 columns, each representing a variable of the cereal. In this case, most of them are nutritional measurements.

head() also has a mirror function in tail(). As you might expect, it gives you the last 6 rows of data. Go ahead and find the last 6 rows of UScereal now.

tail(UScereal)
##                     mfr calories  protein      fat   sodium    fibre
## Total Whole Grain     G 100.0000 3.000000 1.000000 200.0000 3.000000
## Triples               G 146.6667 2.666667 1.333333 333.3333 0.000000
## Trix                  G 110.0000 1.000000 1.000000 140.0000 0.000000
## Wheat Chex            R 149.2537 4.477612 1.492537 343.2836 4.477612
## Wheaties              G 100.0000 3.000000 1.000000 200.0000 3.000000
## Wheaties Honey Gold   G 146.6667 2.666667 1.333333 266.6667 1.333333
##                        carbo    sugars shelf potassium vitamins
## Total Whole Grain   16.00000  3.000000     3  110.0000     100%
## Triples             28.00000  4.000000     3   80.0000 enriched
## Trix                13.00000 12.000000     2   25.0000 enriched
## Wheat Chex          25.37313  4.477612     1  171.6418 enriched
## Wheaties            17.00000  3.000000     1  110.0000 enriched
## Wheaties Honey Gold 21.33333 10.666667     1   80.0000 enriched

Note that we have mixed variables - some columns are numeric, while others are characters or factors. Again, this is a feature of data frames. Let’s find out exactly what type of data is present in each column.

If you remember back a few lessons, we used the structure function, str(), to determine details of a vector. It will tell us even more about our data frame. Go ahead and look at the structure of UScereal now.

str(UScereal)
## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...

Let’s unpack this output. str() is first telling us that our object is a data frame with 65 observations and 11 variables. This is the same as saying our data frame has 65 rows and 11 columns.
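The row and column counts on the first line of str() can also be retrieved directly with dim(), nrow(), and ncol(), which work on data frames just as they do on matrices. Here is a minimal sketch on a small made-up data frame; toy_df and its values are hypothetical and stand in for any data frame, including UScereal.

```r
# toy_df is a made-up data frame used only for this illustration
toy_df <- data.frame(
  patient = c("Bill", "Gina"),
  age     = c(34, 41),
  weight  = c(80.5, 62.0)
)

# Rows come first, columns second, as with a matrix
toy_dims <- dim(toy_df)
toy_dims          # 2 3

nrow(toy_df)      # 2 observations
ncol(toy_df)      # 3 variables
```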

Next we have a short summary of each column in the data frame. $col_name refers to the column names while the entry after the : tells us the data type. We can see that most of our columns are numeric, but we have a few factors as well. It also tells us how many levels there are to our factors (return to the Understanding Factors lesson if you need a refresh). Finally, str() gives us the first handful of entries in that column.

We can also use the summary() function on data frames. Go ahead and do so on UScereal.

summary(UScereal)
##  mfr       calories        protein             fat            sodium
##  G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0
##  K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0
##  N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0
##  P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8
##  Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0
##  R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9
##      fibre            carbo           sugars          shelf
##  Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000
##  1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000
##  Median : 2.000   Median :18.67   Median :12.00   Median :2.000
##  Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169
##  3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000
##  Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000
##    potassium          vitamins
##  Min.   : 15.00   100%    : 5
##  1st Qu.: 45.00   enriched:57
##  Median : 96.59   none    : 3
##  Mean   :159.12
##  3rd Qu.:220.00
##  Max.   :969.70

summary() treats each column as a separate vector, and gives us a breakdown of each. Note the differences between numeric columns and factors. With numeric columns, we get basic summary statistics; with factors, we see how many times each level shows up. Tabulation of factors is particularly important, and we’ll need to do this frequently once we start analyzing categorical data.

We’ll end this lesson by introducing you to an important function and an important operator. The table() function will count the occurrences of each factor level in a vector.
But how do we pull out just one column from our data frame? The easiest way is the $ operator. If we call our data frame and follow its name with $ and the column we want, we’ll just get that column as a vector. Go ahead and enter UScereal$vitamins now.

UScereal$vitamins
##  [1] enriched enriched enriched enriched enriched enriched enriched
##  [8] enriched enriched enriched enriched enriched enriched enriched
## [15] enriched enriched enriched enriched enriched enriched enriched
## [22] enriched enriched enriched enriched enriched enriched enriched
## [29] enriched enriched enriched enriched enriched enriched enriched
## [36] 100%     enriched enriched enriched enriched enriched enriched
## [43] enriched enriched enriched 100%     none     enriched enriched
## [50] enriched enriched enriched enriched none     none     enriched
## [57] enriched 100%     100%     100%     enriched enriched enriched
## [64] enriched enriched
## Levels: 100% enriched none

You’ve isolated just that column of data. Now enter table(UScereal$vitamins).

table(UScereal$vitamins)
##
##     100% enriched     none
##        5       57        3

You’ve now tabulated your factors in the vitamins column. Well done! We’ll be using both table() and $ a lot in the near future.
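The same two tools work on any data frame with a factor column. As a minimal sketch, assuming a tiny made-up data frame (toy_cereal, its brands, and its values are hypothetical and not taken from UScereal):

```r
# toy_cereal is a made-up data frame used only for this illustration
toy_cereal <- data.frame(
  brand    = c("A", "B", "C", "D"),
  vitamins = factor(c("enriched", "none", "enriched", "100%"))
)

# $ pulls one column out as a vector (here, a factor)
vit <- toy_cereal$vitamins
levels(vit)                        # "100%" "enriched" "none"

# table() counts how often each level occurs
counts <- table(vit)
as.integer(counts["enriched"])     # 2
```

The counts always add up to the number of rows, which makes table() a quick sanity check on a categorical column.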

In this lesson, you learned the basics of working with two very important and common data structures – matrices and data frames. There’s much more to learn and we’ll be covering more advanced topics, particularly with respect to data frames, in future lessons.

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.
