Subsetting Dataframes

Now that we’re finally thinking in 2 dimensions, we’re going to learn one of the most important skills in the course, subsetting data frames.

Lets start with a quick refresher on subsets. Make a vector called my_seq and give it the values from 0 to 20 in increments of 2.

my_seq <- seq(0,20, 2)

Now display the first 5 elements of the vector. As a reminder, you’ll need square brackets [] along with a vector of the values 1 through 5.

my_seq[1:5]
## [1] 0 2 4 6 8

Display only those values greater than 10. You’ll need to logically subset here. Remember that we can make a logical vector with the ‘<’ or ‘>’ operator, and use this logical vector as our subset condition in the [].

my_seq[my_seq > 10]
## [1] 12 14 16 18 20

Now that you’re refreshed on subsets, we’re going to apply this logic in 2 dimensions.

We’ve loaded a data frame for you, called forest_area. It has the area of every country, the percentage of that area covered in forest, and their respective ranks in forest cover. Look at the first 6 rows now using head().

head(forest_area)
##   rank       country land.area.km per.forest
## 1    1        Russia      8149300      49.40
## 2    2        Canada      4916438      49.24
## 3    3        Brazil      4776980      56.10
## 4    4 United States      3100950      33.84
## 5    5         China      2083210      21.83
## 6    6     Australia      1470832      19.90

You can see we have 4 columns, most of which are self-explanatory. Per.forest stands for percentage forest. Lets now look at the structure of our data frame. Use the str() function.

str(forest_area)
## 'data.frame':    192 obs. of  4 variables:
##  $rank : num 1 2 3 4 5 6 7 8 9 10 ... ##$ country     : Factor w/ 191 levels "Afghanistan",..: 139 31 24 182 36 9 44 7 75 74 ...
##  $land.area.km: num 8149300 4916438 4776980 3100950 2083210 ... ##$ per.forest  : num  49.4 49.2 56.1 33.8 21.8 ...

Everything looks good - country should be a factor, and is, and everything else is either numeric or an integer. Lets use this data frame to practice subsetting!

With vectors, we only needed a single value or vector in our square brackets [] because vectors are 1-dimensional. With 2-dimensional objects, we’ll need 2 values, the first telling R which row or rows we want, and the second telling R which column or columns. We use a comma to separate them, so it’ll take the form [row,col].

What do you think forest_area[1,2] will give you?

1. The entire data frame
2. The first column of data
3. The value in the first row and second column
4. The value in the first column and second row

The value in the first row and second column

Subset now by entering forest_area[1,2].

forest_area[1,2]
## [1] Russia
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Its basically like telling R which coordinate you want in a x,y grid. As in 1-dimension, we can also use vectors as our subset rather than a single value. Use vectors to subset forest_area for the 1st, 3rd, and 5th row, and the 2nd and 4th column.

forest_area[c(1,3,5),c(2,4)]
##   country per.forest
## 1  Russia      49.40
## 3  Brazil      56.10
## 5   China      21.83

Often, we just want the entire row or the entire column. If you want all of either rows or columns, simply leave that subset blank. Lets get the entire first row by entering forest_area[1,] now. This tells R we want the first row, and all columns.

forest_area[1,]
##   rank country land.area.km per.forest
## 1    1  Russia      8149300       49.4

Lets do something similar, but this time subset all rows of just the 2nd column.

forest_area[,2]
##   [1] Russia                           Canada
##   [3] Brazil                           United States
##   [5] China                            Australia
##   [7] Democratic Republic of the Congo Argentina
##   [9] Indonesia                        India
##  [11] Mexico                           Peru
##  [13] Sudan                            Mozambique
##  [15] Colombia                         Angola
##  [17] Bolivia                          Venezuela
##  [19] Myanmar                          Zambia
##  [21] Tanzania                         Sudan
##  [23] Papua New Guinea                 Sweden
##  [25] Zimbabwe                         Japan
##  [27] France                           Finland
##  [29] Central African Republic         Gabon
##  [31] Republic of the Congo            Turkey
##  [33] Cameroon                         Malaysia
##  [35] Paraguay                         Spain
##  [37] Laos                             Chile
##  [39] Guyana                           South Sudan
##  [41] Suriname                         Thailand
##  [45] Vietnam                          Norway
##  [51] Niger                            Morocco
##  [53] Nigeria                          Iran
##  [55] Guatemala                        Italy
##  [57] Ukraine                          Mongolia
##  [59] Cambodia                         Poland
##  [61] South Africa                     Senegal
##  [63] New Zealand                      Belarus
##  [65] Namibia                          North Korea
##  [67] Philippines                      Somalia
##  [69] Romania                          Burkina Faso
##  [71] Honduras                         Guinea
##  [73] South Korea                      Benin
##  [75] Kazakhstan                       Nicaragua
##  [77] Uganda                           Panama
##  [79] Pakistan                         Algeria
##  [81] Turkmenistan                     Austria
##  [83] Ethiopia                         Greece
##  [85] Nepal                            Bulgaria
##  [87] Malawi                           Portugal
##  [89] Liberia                          Ivory Coast
##  [91] Latvia                           United Kingdom
##  [93] Georgia                          Sierra Leone
##  [95] Serbia                           Cuba
##  [97] Saudi Arabia                     Czech Republic
##  [99] Bosnia and Herzegovina           Croatia
## [101] Bhutan                           Ghana
## [103] Estonia                          Taiwan
## [105] Lithuania                        Hungary
## [107] Guinea-Bissau                    Tunisia
## [109] Costa Rica                       Slovakia
## [111] Uzbekistan                       Sri Lanka
## [113] Belize                           Equatorial Guinea
## [115] Eritrea                          Dominican Republic
## [117] Kenya                            Oman
## [119] Slovenia                          Switzerland
## [121] Azerbaijan                       Macedonia
## [125] Kyrgyzstan                       Iraq
## [127] Togo                             Timor-Leste
## [129] Albania                          Ireland
## [131] Uruguay                          Belgium
## [133] Montenegro                       Denmark
## [135] Bahamas                          Gambia
## [137] Syria                            Yemen
## [139] Vanuatu                          Tajikistan
## [141] Netherlands                      Jamaica
## [143] Moldova                          Rwanda
## [145] United Arab Emirates             El Salvador
## [147] Armenia                          Brunei
## [149] Mauritania                       Trinidad and Tobago
## [151] Libya                            Iceland
## [153] Cyprus                           Afghanistan
## [155] Israel                           Lebanon
## [157] Burundi                          Samoa
## [159] Jordan                           Luxembourg
## [161] Cape Verde                       Egypt
## [163] Micronesia                       Mauritius
## [165] Dominica                         Haiti
## [167] Seychelles                       Palau
## [169] Sao Tome and Principe            Swaziland
## [171] Saint Lucia                      Andorra
## [173] Saint Vincent and the Grenadines Antigua and Barbuda
## [175] Palestine                        Lesotho
## [177] Liechtenstein                    Djibouti
## [179] Kuwait                           Comoros
## [181] Saint Kitts and Nevis            Grenada
## [183] Tonga                            Tuvalu
## [187] Singapore                        Maldives
## [189] Bahrain                          Malta
## [191] San Marino                       Qatar
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Well done. You now know how to subset by position in 2-dimensional objects. Subsetting an entire column is particularly useful, because each column represents a variable of our data. You can use positional subsets to do this, but there’s an easier more flexible way.

If we want to target a single column of our data (for example, percent forest area in our data frame), the easiest way is the $ operator. If we use it following the name of our data frame, we can tell R exactly which column of data we’re interested in working with, and it will return that column as if it were its own vector. What do you think forest_area$per.forest will give you?

1. The entire data frame
2. Country names
3. Percent forest cover
4. The first row of data

Percent forest cover

We could print that column of data, but it would just be a long vector that’s hard to interpret. So lets visualize it instead! Make a histogram of the per.forest column in your forest_area data frame.

hist(forest_area$per.forest) Now find the median land area in our data frame. Just like mean(), the function you want is median(). median(forest_area$land.area.km)
## [1] 27065

The $ operator is a much better way to subset a column of data than positional subsets, because we don’t have to memorize what position our column of interest is in (is it the 2nd column? The 4th?). This might seem trivial with 4 columns, but when you have 10, 20, or maybe 100 variables, the ability to refer to the column by name is a life saver. Now we’re really going to spice things up - logical subsets! Earlier in the lesson we subset a vector based on a logical condition. We can also subset our entire data frame using the values of a single column. What if we were interested in rows of data (observations) for only those countries with more than 80% forest cover?. We’ll need a logical subset of course! To show you how this works, first enter forest_area$per.forest > 80 now.

forest_area$per.forest > 80 ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE ## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE ## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE ## [166] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE ## [188] FALSE FALSE FALSE FALSE FALSE This logical vector can be passed as our row subset, such that R will only keep those rows which are TRUE in the vector. Lets subset our entire data frame with this vector by passing the logical statement we just entered as our row subset, and leaving the column subset blank (because we want to see all variables). Try it now. forest_area[forest_area$per.forest > 80,]
##     rank    country land.area.km per.forest
## 30    30      Gabon       227517      85.00
## 41    41   Suriname       147760      90.20
## 163  164 Micronesia          630      89.74
## 167  168 Seychelles          407      89.45
## 168  169      Palau          400      87.15
## 184  185     Tuvalu           23      89.12

Great work. Now subset the data frame for ranks above or equal to 20. To add an equals to to an inequality, just enter it as >= or <=.

forest_area[forest_area$rank >= 20,] ## rank country land.area.km per.forest ## 20 20 Zambia 376309 50.00 ## 21 21 Tanzania 352570 37.22 ## 22 22 Sudan 327909 17.62 ## 23 23 Papua New Guinea 294370 63.60 ## 24 24 Sweden 280730 68.95 ## 25 25 Zimbabwe 259267 66.35 ## 26 26 Japan 253203 67.00 ## 27 27 France 246640 36.76 ## 28 28 Finland 233320 72.00 ## 29 29 Central African Republic 227550 36.53 ## 30 30 Gabon 227517 85.00 ## 31 31 Republic of the Congo 224710 65.70 ## 32 32 Turkey 216781 27.60 ## 33 33 Cameroon 212450 44.68 ## 34 34 Malaysia 195200 59.50 ## 35 35 Paraguay 184750 45.42 ## 36 36 Spain 184180 36.70 ## 37 37 Laos 170000 71.60 ## 38 38 Chile 158781 21.00 ## 39 39 Guyana 151040 70.26 ## 40 40 South Sudan 148196 23.00 ## 41 41 Suriname 147760 90.20 ## 42 42 Thailand 147620 29.00 ## 43 43 Madagascar 128380 21.87 ## 44 44 Mali 125720 10.14 ## 45 45 Vietnam 123000 37.14 ## 46 46 Norway 121120 37.42 ## 47 47 Botswana 119430 20.53 ## 48 48 Chad 119210 9.28 ## 49 49 Ecuador 115510 42.00 ## 50 50 Germany 114190 32.00 ## 51 51 Niger 112000 8.00 ## 52 52 Morocco 111325 15.66 ## 53 53 Nigeria 110890 12.00 ## 54 54 Iran 110750 6.72 ## 55 55 Guatemala 108894 51.00 ## 56 56 Italy 106736 35.00 ## 57 57 Ukraine 105000 17.00 ## 58 58 Mongolia 102520 6.55 ## 59 59 Cambodia 93350 51.56 ## 60 60 Poland 90000 28.80 ## 61 61 South Africa 89170 7.31 ## 62 62 Senegal 86730 44.09 ## 63 63 New Zealand 85424 31.87 ## 64 64 Belarus 80334 38.60 ## 65 65 Namibia 76610 9.29 ## 66 66 North Korea 76240 73.00 ## 67 67 Philippines 71620 23.87 ## 68 68 Somalia 71310 11.18 ## 69 69 Romania 69610 29.02 ## 70 70 Burkina Faso 67940 24.78 ## 71 71 Honduras 67254 60.00 ## 72 72 Guinea 67240 27.35 ## 73 73 South Korea 63346 63.20 ## 74 74 Benin 61860 54.93 ## 75 75 Kazakhstan 54498 2.00 ## 76 76 Nicaragua 51890 39.80 ## 77 77 Uganda 50000 20.74 ## 78 78 Panama 42940 56.93 ## 79 79 Pakistan 42240 5.31 ## 80 80 Algeria 42000 1.70 ## 81 81 Turkmenistan 41270 8.46 ## 82 82 Austria 39600 47.20 ## 83 83 Ethiopia 39313 3.56 ## 84 84 Greece 37520 28.43 ## 85 85 Nepal 36360 24.70 ## 86 86 Bulgaria 36250 32.69 ## 87 87 Malawi 33176 28.00 ## 88 88 Portugal 31820 34.80 ## 89 89 Liberia 31540 28.32 ## 90 90 Ivory Coast 30000 9.30 ## 91 91 Latvia 28807 60.00 ## 92 92 United Kingdom 28650 11.76 ## 93 93 Georgia 28577 41.00 ## 94 94 Sierra Leone 27540 38.39 ## 95 95 Serbia 27200 31.13 ## 96 96 Cuba 27130 24.47 ## 97 97 Saudi Arabia 27000 1.26 ## 98 98 Czech Republic 26000 34.00 ## 99 99 Bosnia and Herzegovina 25599 53.00 ## 100 100 Croatia 24901 44.00 ## 101 101 Bhutan 24764 64.50 ## 102 102 Ghana 24600 10.20 ## 103 103 Estonia 23066 61.00 ## 104 105 Taiwan 21588 60.00 ## 105 106 Lithuania 21223 33.00 ## 106 107 Hungary 20990 23.00 ## 107 108 Guinea-Bissau 20720 57.36 ## 108 109 Tunisia 20615 12.60 ## 109 110 Costa Rica 20440 40.00 ## 110 111 Slovakia 20006 40.80 ## 111 112 Uzbekistan 19690 4.40 ## 112 113 Sri Lanka 19330 29.46 ## 113 114 Belize 16530 71.98 ## 114 115 Equatorial Guinea 16320 58.18 ## 115 116 Eritrea 15540 13.21 ## 116 117 Dominican Republic 13384 27.50 ## 117 118 Kenya 13200 2.27 ## 118 119 Oman 13050 4.22 ## 119 120 Slovenia 12574 60.00 ## 120 121 Switzerland 12540 31.80 ## 121 122 Azerbaijan 11400 13.17 ## 122 123 Macedonia 10285 40.00 ## 123 124 Fiji 10000 54.72 ## 124 125 Bangladesh 8710 6.05 ## 125 126 Kyrgyzstan 8690 4.35 ## 126 127 Iraq 8220 1.88 ## 127 128 Togo 8000 14.00 ## 128 129 Timor-Leste 7980 53.65 ## 129 130 Albania 7716 26.84 ## 130 131 Ireland 7540 11.07 ## 131 132 Uruguay 7000 4.00 ## 132 133 Belgium 6834 22.38 ## 133 134 Montenegro 6252 45.26 ## 134 135 Denmark 6120 14.20 ## 135 136 Bahamas 5150 51.45 ## 136 137 Gambia 4857 43.00 ## 137 138 Syria 4610 2.49 ## 138 139 Yemen 4490 0.85 ## 139 140 Vanuatu 4470 36.67 ## 140 141 Tajikistan 4100 2.87 ## 141 142 Netherlands 3650 8.79 ## 142 143 Jamaica 3308 30.10 ## 143 144 Moldova 3290 9.72 ## 144 145 Rwanda 3161 12.00 ## 145 146 United Arab Emirates 3120 3.73 ## 146 147 El Salvador 2980 14.16 ## 147 148 Armenia 2974 10.00 ## 148 149 Brunei 2780 48.22 ## 149 150 Mauritania 2670 0.26 ## 150 151 Trinidad and Tobago 2260 44.07 ## 151 152 Libya 2170 0.12 ## 152 153 Iceland 1930 2.00 ## 153 154 Cyprus 1740 18.81 ## 154 155 Afghanistan 1631 0.25 ## 155 156 Israel 1600 7.00 ## 156 157 Lebanon 1383 13.30 ## 157 158 Burundi 1270 4.50 ## 158 159 Samoa 1050 37.00 ## 159 160 Jordan 893 1.00 ## 160 161 Luxembourg 870 33.64 ## 161 162 Cape Verde 840 20.83 ## 162 163 Egypt 670 0.07 ## 163 164 Micronesia 630 89.74 ## 164 165 Mauritius 612 30.00 ## 165 166 Dominica 488 65.00 ## 166 167 Haiti 8300 30.00 ## 167 168 Seychelles 407 89.45 ## 168 169 Palau 400 87.15 ## 169 170 Sao Tome and Principe 270 28.01 ## 170 171 Swaziland 174 1.00 ## 171 172 Saint Lucia 170 27.60 ## 172 173 Andorra 160 34.19 ## 173 174 Saint Vincent and the Grenadines 127 32.56 ## 174 175 Antigua and Barbuda 90 21.00 ## 175 176 Palestine 90 1.45 ## 176 177 Lesotho 80 0.26 ## 177 178 Liechtenstein 70 43.75 ## 178 179 Djibouti 60 0.26 ## 179 180 Kuwait 60 0.34 ## 180 181 Comoros 50 2.24 ## 181 182 Saint Kitts and Nevis 50 19.16 ## 182 183 Grenada 40 11.63 ## 183 184 Tonga 40 5.35 ## 184 185 Tuvalu 23 89.12 ## 185 186 Barbados 20 4.65 ## 186 187 Kiribati 20 2.47 ## 187 188 Singapore 20 3.00 ## 188 189 Maldives 10 3.56 ## 189 190 Bahrain 5 0.67 ## 190 191 Malta 3 0.95 ## 191 192 San Marino 1 1.64 ## 192 193 Qatar 0 0.00 We can even use logical subsets to zero in on specific observations using other functions. How might we find the country with the maximum percentage forest cover? Use the max() function to find the maximum value in the per.forest column in our data frame. max(forest_area$per.forest)
## [1] 90.2

We can use this function to create a logical statement to check where along our column of values the percentage forest cover is equal to the maximum. When finding an equivalence (as opposed to an inequality), the operator is a double equals, ==. Enter forest_area$per.forest == max(forest_area$per.forest) now.

forest_area$per.forest == max(forest_area$per.forest)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE

This gives us a logical vector in which the single TRUE corresponds to the maximum percentage forest cover. Now use this logical statement to subset forest_area for just the maximum, remembering to select all columns.

forest_area[forest_area$per.forest == max(forest_area$per.forest),]
##    rank  country land.area.km per.forest
## 41   41 Suriname       147760       90.2

This is a very powerful form of subsetting, letting us select observations in our data frame for desired ranges of values or specific conditions like maximum or minimum. Essentially any logical condition you can imagine.

Lets do one last subset for this lesson. This time, we want only the country name that has the minimum percentage forest cover.

Since we only want the country name, we’re only subsetting the country column of our data frame. You can access this with the $ operator as usual. Because we’re only subsetting a single column, we only need a single subset in our brackets [] (i.e. we dont need a row and column). Try selecting the country name that has the minimum percentage forest cover now. forest_area$country[forest_area$per.forest == min(forest_area$per.forest)]
## [1] Qatar
## 191 Levels: Afghanistan Albania Algeria Andorra ...  Switzerland

Excellent work. You’ve now been introduced to subsets in 2 dimensions. It may seem tricky or confusing now, but with practice (and you will practice plenty), you’ll be subseting your data frames like a pro in no time.

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.

1. As you wish

As you wish