Unit 2: Working with Data: Labs

These labs test and build on the material presented in the SWIRL lessons.

Scroll down or click here to to check how to submit them.

Due Dates

Lab 1 and 2: 2018-09-21 23:59

Lab 3 and recap: 2018-09-28 23:59

(no best practice lab)


Lab 1

1. Matrix operations I

a. Create a vector from 1 to 20 using whatever method you like.

a. Convert this vector into a 4 column matrix.

b. Sum the rows of the matrix, with rowSums().

c. Sum the columns of the matrix, with colSums().

d. Give the matrix column names (whatever you want).

2. Create vectors from shoe size, number of siblings, and pineapple yes/no

Table 1. Ten random responses from the class survey.

Height Eye Color Shoe Size # Siblings Home Pop. HP Books HP House Ideal Temp Fav Color Tea or Coffee Pineapple on Pizza? Roll your Tongue? Tree or Pollen Allergy
175 Brown 10 2 4621 7 Hufflepuff 21 Blue Coffee Yes Yes Both
177.8 Blue 10 2 200000 7 Ravenclaw 21 Blue Coffee Yes Yes Neither
167.64 Hazel 9.5 3 4988 0 Hufflepuff 15.56 Blue Coffee Disgusting No Both
183 Hazel 11.5 2 62243 5 Ravenclaw 21 Blue Coffee Yes Yes Both
163 Hazel 9 1 3300 7 Hufflepuff 18.9 Green Coffee Disgusting Yes Both
175.26 Blue 11 2 21845 7 Gryffindor 26 Red Tea Yes Yes Neither
165 Brown 8 0 9000000 7 Hufflepuff 20 Purple Coffee Yes Yes Neither
162.56 Brown 7 2 27865 7 Gryffindor 23 Yellow Coffee Disgusting Yes Neither
167 Brown 7.5 0 80000 7 Ravenclaw 23 Blue Coffee Disgusting Yes Both
167 Brown 8.5 5 3500 0 Hufflepuff 27 Blue Coffee Disgusting Yes Neither

a. Using these vectors, create a matrix. Ensure that all vectors are numeric or integers (you may need to re-code them).

b. Convert this matrix to a data frame (you may need to google or find external help).

b. Display the first 7 rows of the data frame. Check ?head if needed.

c. Display the last 4 rows of the data frame. Check ?tail if needed.

d. Add eye color as another column to the same data frame (do not create a new data frame).

e. Look at the structure and summary of the data frame. Which columns are which data type?

3. Create a data frame from these data

Table 2. Apple production in selected countries in 2016 (Source: FAO).

Country Harvested area (ha) Apple production (tonnes)
China 2383815 44447793
India 314000 2872000
Iran 238638 2799197
Russia 214270 1843544
Poland 177203 3604271
Turkey 173394 2925828
United States 130552 4649323
Uzbekistan 101726 1120209
Pakistan 91928 590039
Ukraine 91600 1099240
Italy 56164 2455616
France 49618 1819762
Chile 36063 1759421
Ukraine 91600 1099240
Brazil 33981 1049251
Germany 31334 1032913
United Kingdom 16512 481100

a. Display the head, tail, and summary of data.

b. What are the dimensions of the data frame?

c. What is the mean harvested area across all countries?

d. What is the minimum apple production across all countries?

e. Plot the values of apple production on harvested area. Add labels to the x- and y-axes, and a title to the plot.


Lab 2

Read in the following data frame (copy and paste into your R console). This data details CO2 emissions from various sources in 2015. (Source)

CO2_2015 <- data.frame(
   Country  = c('World', 'China', 'United States', 'European Union', 'India', 'Russia', 'Japan', 'Germany', 'International Shipping', 'Iran', 'South Korea', 'Canada', 'Saudi Arabia', 'Indonesia', 'International Aviation'), 
    Total_kt = c(36061710, 10641789, 5172336, 3469671, 2454968, 1760895, 1252890, 777905, 642024, 633750, 617285, 555401, 505565, 502961, 502936),
    Percent_World_CO2  = c(100.00, 29.51, 14.34, 9.62, 6.81, 4.88, 3.47, 2.16, 1.78, 1.76, 1.71, 1.54, 1.40, 1.39, 1.39),
    Per_capita_t = c(NA , 7.7, 16.1, 6.9, 1.9, 12.3, 9.9, 9.6, NA , 8, 12.3, 15.5, 16, 2, NA),
    Kg_per_USD1000_GDP_2014 = c(490.8, 1235, 324.2, 184.7, 1051.5, 999.4, 205.2, 197.4, NA, 1344.4, 475.7, 301, 921.9, 492.7, NA)
)

Subsetting

1. Load in data set and subset the following, each from CO2_2015.

a. Subset the Percent of world emissions. Use $, and assign to a new variable.

b. Subset the Per_capita_t column, using [‘name’].

c. Subset the Country column. What is the data structure?

Subsetting with [ ] positionally (one dimension)

2. Use the Country vector you just pulled out:

a. Display just the first element of the vector

b. Display just the last element of the vector

c. Display the 2, 5, and 9th element of the vector

d. Display the first 5 elements of the vector

e. Remove the first 4 elements of this vector

Subsetting with [ ] logically (one dimension)

3. Using the Percent_World_CO2 column:

a. Display all values of the vector less than 10

b. Display all values of the vector more than 2

c. Display all values of the vector more than 2 and less than 10

d. Display all values of the vector less than 2 or more than 10

Subsetting in two dimensions

Read in this data frame (copy and paste into your R console). The data are part of a survey of the state of global happiness, conducted each year by the UN. Each variable measured reveals a populated-weighted average score on a scale running from 0 to 10 that is tracked over time and compared against other countries. (Source)

Happiness <- data.frame(
    OverallRank = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    Country = c('Finland', 'Norway', 'Denmark', 'Iceland', ' Switzerland', 'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia', 'Israel', 'Austria', 'Costa Rica', 'Ireland', 'Germany', 'Belgium', 'Luxembourg', 'United States', 'United Kingdom', 'United Arab Emirates'), 
    Score = c(7.632, 7.594, 7.555, 7.495, 7.487, 7.441, 7.328, 7.324, 7.314, 7.272, 7.19, 7.139, 7.072, 6.977, 6.965, 6.927, 6.91, 6.886, 6.814, 6.774),
   GDP_per_capita = c(1.305, 1.456, 1.351, 1.343, 1.42, 1.361, 1.33, 1.268, 1.355, 1.34, 1.244, 1.341, 1.01, 1.448, 1.34, 1.324, 1.576, 1.398, 1.301, 2.096),
    Social_support = c(1.592, 1.582, 1.59, 1.644, 1.549, 1.488, 1.532, 1.601, 1.501, 1.573, 1.433, 1.504, 1.459, 1.583, 1.474, 1.483, 1.52, 1.471, 1.559, 0.776), 
    Healthy_life_expectancy = c(0.874, 0.861, 0.868, 0.914, 0.927, 0.878, 0.896, 0.876, 0.913, 0.91, 0.888, 0.891, 0.817, 0.876, 0.861, 0.894, 0.896, 0.819, 0.883, 0.67),
    Freedom_life_choices = c(0.681, 0.686, 0.683, 0.677, 0.66, 0.638, 0.653, 0.669, 0.659, 0.647, 0.464, 0.617, 0.632, 0.614, 0.586, 0.583, 0.632, 0.547, 0.533, 0.284),
    Generosity = c(0.192, 0.286, 0.284, 0.353, 0.256, 0.333, 0.321, 0.365, 0.285, 0.361, 0.262, 0.242, 0.143, 0.307, 0.273, 0.188, 0.196, 0.291, 0.354, 0.186),
    Perceptions_corruption = c(0.393, 0.34, 0.408, 0.138, 0.357, 0.295, 0.291, 0.389, 0.383, 0.302, 0.082, 0.224, 0.101, 0.306, 0.28, 0.24, 0.321, 0.133, 0.272, NA)
)

4. Use this data frame:

a. Display the first element of the first column.

b. Display the entire first row.

c. Display the entire 2nd column without its name.

d. Display the 2-5th rows and the 3-4th columns.

e. Display $Score column values that are more than 7.

f. Display $Score column values for where column $Generosity is greater than 0.2.

g. Make a boxplot of ‘Perception of corruption’ for which GDP is greater than 1.3. Make the boxes green.

h. Remove the 3rd row of data.

i. Plot a histogram of Social support. Change the number of breaks in the histogram to 10.


Lab 3

The data for this lab are available on the Data page.

Please upload any cleaned datafiles with your R code to Canvas.

1. Read in the following clean data sets.

a. Read in CO2_2015.txt (tab-delim). Display the first 10 rows.

b. Read in happiness.csv (comma-delim). What is the mean GDP?

c. Read in apples.txt. How many rows and columns does this data have?

2. Clean and read in the following data sets

a. Michigan tree species.

b. Harry Potter movie budgets.

c. Galapagos mammal incidence.

3. Clean and read in the birdflu.xls spreadsheet.

Remember that R can only have 1 row as the header.

a. Use the names() and str() functions in R to view the data.

b. What is the total number of bird flu cases in 2003 and in 2005?

c. Which country has had the most cases?

d. Which country has had the least bird flu deaths?

e. What is the total number of bird flu cases per country?

f. What is the total number of cases per year?


Lab 4: Unit 2 Recap

Data you will need is available from the Data page.

Please upload any cleaned datafiles with your R code to Canvas.

1. Working with lists

a. Create a matrix of your choice.

b. Create a data frame of your choice.

c. Create a vector of your choice.

d. Put all three objects in a list.

e. Subset the data frame from the list by its name.

f. Subset a column in the data frame in the list.

g. Subset the vector by its position in the list.

2. Use the birdflu.xls spreadsheet again

a. Edit the birdflu spreadsheet from a wide format to a long format.

Check the best practice on Principles of data files (which I’m sure you have read!) to see the difference.

b. Read in the edited birdflu data.

c. How many deaths were there in 2007?

d. How many cases were there in Thailand in all years?

e. How many cases were there in Indonesia in 2008?

3. New Haven road race data

Download the New Haven road race data from 2015.

Check here if you are having trouble.

a. Clean and read in the data.

b. How many males and females ran in the race?

c. How many males under 19 years old ran in the race?

c. What was the fastest race time?

d. What was the mean race time for runners from New Haven?

f. Make a boxplot of pace as a function of gender (male vs female).

g. How fast did your instructor run?


How to submit your labs

How to write up your lab answers

You will need to write R code to answer each of the questions.

Please format your answers as follows:

It should look something like this:

# LAB: Unit 1. Lab 1
# Your Name #  Put your name here

# 1. Add 7 and 3,456.
7 + 3456

# 2. Assign this value to an object
x <- 7 + 3456

How to submit your lab answers on Canvas

You are permitted to submit your answers as many times as you like within each Unit.

Answers will be graded two or three times a week and re-opened if you submit early.

Each lab will close at its respective deadline (see Canvas).

Final grades for each lab will be computed and entered into the Canvas gradebook at the end of each Unit.