Summarizing data

Now that you can create vectors of data, we will learn how to explore and summarize them. In this lesson you will use functions that provide key information, statistics, or visualizations to help you better understand your data.

Let's start with a vector we can practice on. We have provided you with a vector of average annual precipitation (in inches) for some US cities, called precip. Enter precip to look at it now. Note that each element is named.

precip

##              Mobile              Juneau             Phoenix 
##                67.0                54.7                 7.0 
##         Little Rock         Los Angeles          Sacramento 
##                48.5                14.0                17.2 
##       San Francisco              Denver            Hartford 
##                20.7                13.0                43.4 
##          Wilmington          Washington        Jacksonville 
##                40.2                38.9                54.5 
##               Miami             Atlanta            Honolulu 
##                59.8                48.3                22.9 
##               Boise             Chicago              Peoria 
##                11.5                34.4                35.1 
##        Indianapolis          Des Moines             Wichita 
##                38.7                30.8                30.6 
##          Louisville         New Orleans            Portland 
##                43.1                56.8                40.8 
##           Baltimore              Boston             Detroit 
##                41.8                42.5                31.0 
##    Sault Ste. Marie              Duluth Minneapolis/St Paul 
##                31.7                30.2                25.9 
##             Jackson         Kansas City            St Louis 
##                49.2                37.0                35.9 
##         Great Falls               Omaha                Reno 
##                15.0                30.2                 7.2 
##             Concord       Atlantic City         Albuquerque 
##                36.2                45.5                 7.8 
##              Albany             Buffalo            New York 
##                33.4                36.1                40.2 
##           Charlotte             Raleigh             Bismark 
##                42.7                42.5                16.2 
##          Cincinnati           Cleveland            Columbus 
##                39.0                35.0                37.0 
##       Oklahoma City            Portland        Philadelphia 
##                31.4                37.6                39.9 
##           Pittsburg          Providence            Columbia 
##                36.2                42.8                46.4 
##         Sioux Falls             Memphis           Nashville 
##                24.7                49.1                46.0 
##              Dallas             El Paso             Houston 
##                35.9                 7.8                48.2 
##      Salt Lake City          Burlington             Norfolk 
##                15.2                32.5                44.7 
##            Richmond      Seattle Tacoma             Spokane 
##                42.6                38.8                17.4 
##          Charleston           Milwaukee            Cheyenne 
##                40.8                29.1                14.6 
##            San Juan 
##                59.2

Confirm that this is truly a vector (remember the is.object_type() family of functions, such as is.vector()).

is.vector(precip)

## [1] TRUE

Rainfall in inches is foolish; the rest of the scientific community uses millimeters, so we will too! Create a new object called ‘precip_mm’ by multiplying your precip vector by 25.4 (there are 25.4 millimeters in an inch).

precip_mm <- precip * 25.4
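
As a quick check on the conversion, here is a minimal sketch. It assumes precip is R's built-in precip dataset, which matches the values printed above. It shows that the multiplication is applied to every element and that the city names carry over.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

# Arithmetic on a vector is applied element by element
precip["Mobile"]      # value in inches
precip_mm["Mobile"]   # same city, now in millimeters (67.0 * 25.4)

# The names attribute is preserved by the multiplication
identical(names(precip), names(precip_mm))
```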

Nice work! When working with data objects in R, it is important to know their details, so let's look at them. One of the most important functions for data analysis in R is str(), which stands for structure. It will tell us about our data object. Go ahead and look at the structure of precip_mm.

str(precip_mm)

##  Named num [1:70] 1702 1389 178 1232 356 ...
##  - attr(*, "names")= chr [1:70] "Mobile" "Juneau" "Phoenix" "Little Rock" ...

In this case, str() is telling us that our vector is numeric (num), has 70 elements in it ([1:70], meaning 70 observations), and shows us the first handful of elements in the vector. It is also telling us that each element has a name associated with it, and that those names are of the character data type (Named num and attr(*, 'names')= chr).
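
If you ever want just one of the pieces that str() reports, each has its own function. The sketch below assumes precip is R's built-in precip dataset and pulls them out one at a time.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

class(precip_mm)        # the data type that str() abbreviates as "num"
length(precip_mm)       # the number of elements, i.e. the [1:70] part
head(names(precip_mm))  # the first few element names
```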

You will use str() a lot when you start working with more complex data objects such as lists, data frames, and the output of regression models.

But let's get to some actual statistics. A statistic we often want to know about our samples is the average, or mean. R has a built-in function: mean(). Go ahead and use mean() to find the average precipitation across all cities in the vector.

mean(precip_mm)

## [1] 886.0971

Fascinating. In addition to the mean, we typically want to know the standard deviation. In R, this function is shortened to sd(). Find the standard deviation of precipitation.

sd(precip_mm)

## [1] 348.1489
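
For reference, sd() computes the sample standard deviation, which divides by n - 1 rather than n. Below is a minimal sketch of the same calculation by hand. The variable names are illustrative, and precip is assumed to be R's built-in dataset.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

n <- length(precip_mm)
deviations <- precip_mm - mean(precip_mm)       # distance of each city from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation

all.equal(manual_sd, sd(precip_mm))  # should agree with the built-in function
```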

It's great to know exactly what our mean and standard deviation are, but what does this look like? When getting a feel for our data, we almost always want to have a sense of its distribution. The way in which our measurements are distributed is a fundamental property of any sample we might have, and many statistical tests assume data that resemble a normal distribution (i.e., a bell curve).

The quickest way to assess the distribution of our data is the histogram. In R, the function is called hist(). Go ahead and enter hist(precip_mm).

hist(precip_mm)

This is one way in which we can visualize the mean and variance in the data. In this case, our data generally resemble a normal distribution (with a slight left skew), which is great!
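
To make the mean easier to spot on the plot, you could overlay it yourself. The sketch below is one possible way to do so, assuming precip is R's built-in dataset. The title and axis label are our own choices, not part of the lesson.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

# hist() invisibly returns an object holding the bin breaks and counts
h <- hist(precip_mm,
          main = "Average annual precipitation",
          xlab = "Precipitation (mm)")

# Mark the mean (dashed) and median (dotted) on the existing plot
abline(v = mean(precip_mm), lty = 2)
abline(v = median(precip_mm), lty = 3)

# A mean below the median is consistent with a left skew
mean(precip_mm) < median(precip_mm)
```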

There are other ways in which we can summarize our data as well. Conveniently, there is a function called summary(), which will give us a numeric breakdown of our vector. Go ahead and summarize our precipitation vector.

summary(precip_mm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   177.8   746.1   929.6   886.1  1086.0  1702.0

summary() returns statistics of central tendency (mean and median), spread (1st and 3rd quartiles), and range (min and max). summary() also works with other R objects, such as statistical models, where it summarizes important information like model coefficients and significance. We will return to this in the future.
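
The quartiles and range can also be requested on their own. The sketch below, which assumes precip is R's built-in dataset, uses quantile(), IQR(), and range() for this.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

# Quartiles on their own; the 50% quantile is the median
q <- quantile(precip_mm, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the distance between the 1st and 3rd quartiles
IQR(precip_mm)

# range() returns the minimum and maximum together
range(precip_mm)
```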

Sometimes we just want specific values - not the entire summary. The functions max(), min(), median(), and one we already learned (mean()) do this. Use max() to find the maximum precipitation in our vector.

max(precip_mm)

## [1] 1701.8

Now find the minimum.

min(precip_mm)

## [1] 177.8
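
Because our vector is named, we can also ask which city each extreme belongs to. This is a small sketch using which.max(), which.min(), and sort(), again assuming precip is R's built-in dataset.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

# which.max() / which.min() return the position (with its name) of the extreme value
names(which.max(precip_mm))  # "Mobile"
names(which.min(precip_mm))  # "Phoenix"

# The five driest cities: sort() orders from smallest to largest and keeps names
head(sort(precip_mm), 5)
```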

Now let's visualize this summary information. The function hist() displayed all the data in ‘bins’ (small groups). A boxplot (or box-and-whisker plot) displays summary information based on the quartiles. R has this as the function boxplot(). Enter boxplot(precip_mm) now.

boxplot(precip_mm)

We now see much of the summary information as a graph. The middle line represents the median (not the mean!). The median is actually the second quartile. The extents of the box are the 1st and 3rd quartiles. This is consistent across all boxplots.

However, what the whiskers indicate can vary across software. In R, by default, each whisker extends to the most extreme value that lies within 1.5 x the interquartile range of the box edge, so the whiskers display the highest and lowest values excluding outliers. Values beyond the whiskers are outliers and are indicated by open circles … Now you know more about boxplots than most folks.
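
If you are curious about the numbers behind the plot, boxplot.stats() returns them without drawing anything. The sketch below assumes precip is R's built-in dataset and also rebuilds the outlier rule by hand. Note that boxplot.stats() works from "hinges", which are close to, but not always identical to, the quartiles reported by summary().

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4

# coef = 1.5 is the default multiplier for the whisker rule
stats <- boxplot.stats(precip_mm, coef = 1.5)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values drawn as open circles

# Rebuild the rule by hand: anything beyond 1.5 x the box height is an outlier
hinges <- stats$stats[c(2, 4)]
fence_low  <- hinges[1] - 1.5 * diff(hinges)
fence_high <- hinges[2] + 1.5 * diff(hinges)
manual_out <- precip_mm[precip_mm < fence_low | precip_mm > fence_high]
manual_out
```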

We can look at multiple sets of data with a boxplot as well. Let's create two new objects. The first will be the first 10 elements of our precip_mm vector. Subset the vector for elements 1 to 10 and call it precip1.

precip1 <- precip_mm[1:10]

Now create a vector of elements 11-20 of precip_mm, and call it precip2.

precip2 <- precip_mm[11:20]

Now we can make our boxplot. Use the boxplot() function, but give it two arguments this time: precip1 and precip2, separated by a comma.

boxplot(precip1, precip2)

We now have a boxplot for each subset of your vector! These types of visualizations are extremely useful for comparing differences between groups at a glance.
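
By default the two boxes are labeled only 1 and 2. A minimal sketch of a labeled version follows, assuming precip is R's built-in dataset. The group labels and axis title are our own illustrative choices.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
precip_mm <- precip * 25.4
precip1 <- precip_mm[1:10]
precip2 <- precip_mm[11:20]

# names= labels each box; ylab= labels the vertical axis
b <- boxplot(precip1, precip2,
             names = c("Cities 1-10", "Cities 11-20"),
             ylab = "Precipitation (mm)")

# boxplot() invisibly returns the statistics it plotted, one column per group
b$stats
```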

We should note that these functions only work in this way for numeric data. We will cover how to deal with character data when we start working with factors/categorical data.
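
To see what happens with non-numeric input, here is a small sketch that uses the city names as a character vector. It assumes precip is R's built-in dataset. The warning is suppressed only to keep the output tidy.

```r
# Assumption: precip is the built-in dataset shipped with R (datasets::precip)
city_names <- names(precip)

is.numeric(city_names)    # FALSE: this is character data
is.character(city_names)  # TRUE

# mean() cannot average text; it returns NA and issues a warning
result <- suppressWarnings(mean(city_names))
result
```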

You now know how to briefly summarize, explore, and visualize data in R! Great job! We will expand on these skills in later units. Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.
