Testing Assumptions and Exploring Data

Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work. [Wright, D.B. 2003. Making friends with your data: Improving how statistics are conducted and reported. British Journal of Educational Psychology, 73, 123-136.]

Most statistical tests and techniques have underlying assumptions that are often violated (i.e., the data or model do not meet the assumptions). Some of these violations have little impact on the results or conclusions.

Others can increase type I (false positive) or type II (false negative) errors, potentially leading to wrong conclusions and erroneous recommendations. Most of these violations can be avoided by exploring your data thoroughly before analysis.

In this lesson, we will examine graphical and statistical methods for data exploration and testing assumptions. We follow the recommendations of Zuur et al., 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1, 3–14.

All models and statistical tests have assumptions. Different tests can deal with violation of some assumptions better than others. For example, heterogeneity (differences in variation) may be problematic in linear regression, or a single outlier may exert a huge influence on estimates of the mean.

These assumptions fall into three broad categories:

(1) Distributional assumptions concern the probability distributions of the observations or their associated random errors (e.g., normal versus binomial).

(2) Structural assumptions concern the form of the functional relationship between the response variable and the predictor variables (e.g., linear regression assumes that the relationship is linear).

(3) Cross-variation assumptions concern the joint probability distribution of the observations and/or the errors (e.g., most tests assume that the observations are independent).

Before we proceed, there are two other important issues to be clear about.

First, data exploration is a very different, and separate, process from hypothesis testing (or Bayesian, likelihood, or information theoretic approaches). Models and hypotheses should be based on your understanding of your study system, not on ‘data dredging’ looking for patterns.

Second, here we emphasize the visual inspection of the data and model, rather than an over-reliance on statistical tests of the assumptions. There are statistical tests for normality and homogeneity, but the statistical literature warns against them. See the Best Practice and Resources pages for more information.

We will explore several data sets now. First, look at the head of the sparrow dataset loaded with this lesson. The dataset is called ‘sparrows’.

head(sparrows)
##   Species    Sex Wing Tarsus Head Culmen Nalospi Weight Observer Age
## 1    SSTS   Male 58.0   21.7 32.7   13.9    10.2   20.3        2   0
## 2    SSTS Female 56.5   21.1 31.4   12.2    10.1   17.4        2   0
## 3    SSTS   Male 59.0   21.0 33.3   13.8    10.0   21.0        2   0
## 4    SSTS   Male 59.0   21.3 32.5   13.2     9.9   21.0        2   0
## 5    SSTS   Male 57.0   21.0 32.5   13.8     9.9   19.8        2   0
## 6    SSTS Female 57.0   20.7 32.5   13.3     9.9   17.5        2   0

You can see 10 columns with morphometric information on 979 individual sparrows from females and males of two species.

Are there outliers in Y and X?

An outlier is an observation or data point that is ‘abnormally’ high or low compared to the other data. It may indicate real variation or experimental, observation, and/or data entry errors. As such, the definition of what constitutes an outlier is subjective.

Different techniques respond to and treat outliers differently. For some analyses, outliers make no difference; for others they may bias the results and conclusions.

The classic graph for looking at outliers is the boxplot, which we have seen before. Make a simple boxplot of sparrow wing length, ‘sparrows$Wing’.

boxplot(sparrows$Wing)

The bold line in the middle of the box indicates the median; the lower end of the box is the 25% quartile (the lower hinge) and the upper end is the 75% quartile (the upper hinge). The lines (or whiskers) extend to the most extreme data points within 1.5 times the interquartile range (the upper hinge minus the lower hinge) of the box. (Note that the interval defined by the whiskers is not a confidence interval.) Points beyond the whiskers are (often wrongly) considered to be outliers.

In this case, the boxplot suggests that there may be at least 6 outliers.
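If you want the flagged points as numbers rather than a picture, base R’s boxplot.stats() returns the same five statistics the boxplot draws, plus the points beyond the whiskers. A minimal sketch on a synthetic vector (standing in for sparrows$Wing, which is only loaded within this lesson):

```r
# Synthetic stand-in for a wing-length column; the value 70 is
# artificially extreme so that something gets flagged
set.seed(1)
x <- c(rnorm(100, mean = 58, sd = 1.5), 70)

s <- boxplot.stats(x)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers, flagged by the 1.5 * IQR rule
```

The $out component contains exactly the points the boxplot would draw beyond the whiskers, so you can inspect them directly instead of reading them off the plot.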

Boxplots provide a summary of the data. As we will discuss in Unit 5, a better approach would be to present the raw data as well. We can use the basic plot() function to do this. Use plot() to display the wing measurements.

plot(sparrows$Wing)

As usual, running plot() on a numeric vector creates a plot showing each data point as an open circle, displayed in the order they occur in the vector (indicated by the x-axis, labelled Index).

This plot is a version of the ‘Cleveland dotplot’: the row number (i.e., index) of each observation is plotted against the observation’s value.

R has the function dotchart() to do this better than our simple use of plot(). Run dotchart() on the same sparrow wing data.

dotchart(sparrows$Wing)

Here the function has rotated the plot, compared to before. The data are organised by index on the y-axis and the values are indicated on the x-axis. We also see horizontal grey lines that make it easier to compare the horizontal position of each point.

This dotplot suggests that there may in fact be fewer than 6 outliers. A nice feature of this function is that we can condition the continuous variable on other factors in the data, using the argument ‘group =’.

Create a dotchart of the sparrow wing data, grouped by species.

dotchart(sparrows$Wing, group = sparrows$Species)

Now we can see that most of the ‘outliers’ in fact belong to the species ‘SESP’ and fall well within the expected range for that species. Only one value from the ‘SSTS’ species appears abnormally high.

What you then do with any outliers depends on the situation. If there are no data entry errors, you could check for other explanatory factors, transform the data, or drop the observation, depending on the sensitivity of the analysis. Be careful how you treat outliers: removing data points simply because they don’t fit your expectations is an ethical matter and requires strong justification.
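Before deciding, it helps to know exactly which rows are suspect. A hypothetical sketch using the boxplot’s own upper fence (the toy vector here is an assumption, not the lesson’s data):

```r
# Toy wing-length values; 68 looks extreme relative to the rest
x <- c(58, 57.5, 59, 56.5, 68, 57)

# The boxplot's upper fence: 75% quartile + 1.5 * interquartile range
cutoff <- quantile(x, 0.75) + 1.5 * IQR(x)

which(x > cutoff)  # index of the suspect observation, here the 5th
```

Flagging by index lets you go back to the original records and check for entry errors before any value is altered or removed.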

Do we have homogeneity of variance?

Homogeneity of variance is an important assumption in analysis of variance (ANOVA), other regression-related models, and some multivariate techniques. Violating the assumption of constant variance across groups distorts the estimated standard errors of the coefficients, and with them any confidence intervals and p-values.

In ANOVA, we can check the variance by making conditional boxplots for each group. Make a boxplot of $Wing conditional on $Species and $Sex. We can do this by writing a formula: sparrows$Wing ~ sparrows$Sex + sparrows$Species. The tilde (~) means ‘as a function of’. Create a plot of boxplots using this formula.

boxplot(sparrows$Wing ~ sparrows$Sex + sparrows$Species)

To analyse these data with an ANOVA, the variation among observations should be similar across the sexes, and likewise across the species. In this case, there seems to be less variation in females of SSTS than in females of SESP. Larger differences would be more worrying.
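To put a number on the visual impression (without resorting to a formal test), group-wise standard deviations can be computed with tapply(). A sketch on the built-in iris data, since the sparrows data are only available inside this lesson:

```r
# Standard deviation of sepal width within each iris species;
# roughly similar values support the constant-variance assumption
tapply(iris$Sepal.Width, iris$Species, sd)
```

This complements, rather than replaces, the conditional boxplots: the plot shows the shape of the variation, the numbers make the comparison concrete.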

To verify that variances are homogeneous in regression-type models with continuous predictors, we should use the residuals (i.e., the differences between the observed values and the estimated values) of the model. We have loaded a model of sparrow wing length on weight + species. Type ‘m’ to look at the model coefficients.

m
## 
## Call:
## lm(formula = sparrows$Wing ~ sparrows$Weight + sparrows$Species)
## 
## Coefficients:
##          (Intercept)       sparrows$Weight  sparrows$SpeciesSSTS  
##              43.6014                0.7476               -0.9747

Now type ‘str(m)’ to look at the model structure and see how we can extract the residuals.

str(m)
## List of 13
##  $ coefficients : Named num [1:3] 43.601 0.748 -0.975
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "sparrows$Weight" "sparrows$SpeciesSSTS"
##  $ residuals    : Named num [1:979] 0.197 0.865 0.673 0.673 -0.43 ...
##   ..- attr(*, "names")= chr [1:979] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:979] -1810.567 47.154 7.991 0.662 -0.448 ...
##   ..- attr(*, "names")= chr [1:979] "(Intercept)" "sparrows$Weight" "sparrows$SpeciesSSTS" "" ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:979] 57.8 55.6 58.3 58.3 57.4 ...
##   ..- attr(*, "names")= chr [1:979] "1" "2" "3" "4" ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:979, 1:3] -31.289 0.032 0.032 0.032 0.032 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:979] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:3] "(Intercept)" "sparrows$Weight" "sparrows$SpeciesSSTS"
##   .. ..- attr(*, "assign")= int [1:3] 0 1 2
##   .. ..- attr(*, "contrasts")=List of 1
##   .. .. ..$ sparrows$Species: chr "contr.treatment"
##   ..$ qraux: num [1:3] 1.03 1.05 1.02
##   ..$ pivot: int [1:3] 1 2 3
##   ..$ tol  : num 1e-07
##   ..$ rank : int 3
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 976
##  $ contrasts    :List of 1
##   ..$ sparrows$Species: chr "contr.treatment"
##  $ xlevels      :List of 1
##   ..$ sparrows$Species: chr [1:2] "SESP" "SSTS"
##  $ call         : language lm(formula = sparrows$Wing ~ sparrows$Weight + sparrows$Species)
##  $ terms        :Classes 'terms', 'formula'  language sparrows$Wing ~ sparrows$Weight + sparrows$Species
##   .. ..- attr(*, "variables")= language list(sparrows$Wing, sparrows$Weight, sparrows$Species)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "sparrows$Wing" "sparrows$Weight" "sparrows$Species"
##   .. .. .. ..$ : chr [1:2] "sparrows$Weight" "sparrows$Species"
##   .. ..- attr(*, "term.labels")= chr [1:2] "sparrows$Weight" "sparrows$Species"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(sparrows$Wing, sparrows$Weight, sparrows$Species)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
##   .. .. ..- attr(*, "names")= chr [1:3] "sparrows$Wing" "sparrows$Weight" "sparrows$Species"
##  $ model        :'data.frame':   979 obs. of  3 variables:
##   ..$ sparrows$Wing   : num [1:979] 58 56.5 59 59 57 57 57 57 53.5 56.5 ...
##   ..$ sparrows$Weight : num [1:979] 20.3 17.4 21 21 19.8 17.5 19.6 21.2 18.5 20.5 ...
##   ..$ sparrows$Species: Factor w/ 2 levels "SESP","SSTS": 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language sparrows$Wing ~ sparrows$Weight + sparrows$Species
##   .. .. ..- attr(*, "variables")= language list(sparrows$Wing, sparrows$Weight, sparrows$Species)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "sparrows$Wing" "sparrows$Weight" "sparrows$Species"
##   .. .. .. .. ..$ : chr [1:2] "sparrows$Weight" "sparrows$Species"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "sparrows$Weight" "sparrows$Species"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(sparrows$Wing, sparrows$Weight, sparrows$Species)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "sparrows$Wing" "sparrows$Weight" "sparrows$Species"
##  - attr(*, "class")= chr "lm"

There is a lot in there! But, scroll up to the top and you will see $ residuals : Named num [1:979] 0.197 .... Extract (subset) this part of m as a vector. Remember that m, the model output, is a list.

m$residuals
##             1             2             3             4             5 
##  0.1965088231  0.8646123991  0.6731734771  0.6731734771 -0.4296802155 
##             6             7             8             9            10 
##  1.2898502068 -0.2801558310 -1.4763509074 -2.9577717160 -1.4530155615 
## [ output truncated: m$residuals contains one value per observation, 979 in total ]
##           736           737           738           739           740 
## -0.1539667924 -1.4530155615  0.0188929381  0.9955575922 -0.2053936387 
##           741           742           743           744           745 
##  0.2431795149  0.5469844385  0.7993625158 -1.2006374842  2.2712710153 
##           746           747           748           749           750 
## -1.0792046001 -1.5792046001 -0.9530155615  0.1684173227 -1.2053936387 
##           751           752           753           754           755 
## -3.8315826773 -0.7053936387  0.0469844385 -0.2053936387 -1.4858632164 
##           756           757           758           759           760 
## -0.4811070619  0.2946063613 -1.3596741777 -5.5744484456  0.5236490926 
##           761           762           763           764           765 
## -2.0744484456  1.7898502068 -2.3549180233  0.0936551304  1.3927038995 
##           766           767           768           769           770 
## -0.9530155615 -0.0744484456 -2.7287289847 -1.2239728302 -1.9249240611 
##           771           772           773           774           775 
## -1.4482594070  1.5703197845  0.6731734771  1.7993625158 -0.2801558310 
##           776           777           778           779           780 
## -0.3268265229  0.1731734771  2.3179417072  0.9207953999 -1.7101497932 
##           781           782           783           784           785 
##  0.0469844385  0.4207953999  2.0188929381  0.6684173227  0.5703197845 
##           786           787           788           789           790 
##  0.7198441690  3.1731734771 -0.7287289847  0.0469844385 -1.5839607546 
##           791           792           793           794           795 
## -0.4530155615 -0.4229281886 -1.1424586109 -1.6229281886  0.0537364655 
##           796           797           798           799           800 
## -1.0491172272  0.8761205805  2.0723156570 -2.9929342264 -2.3014953044 
##           801           802           803           804           805 
##  1.6099194663  2.5023096192 -1.9276843430 -1.2453123036 -0.1753062658 
##           806           807           808           809           810 
##  1.3294498887  3.0256449651 -0.8014953044 -1.6286355739  0.7032608500 
##           811           812           813           814           815 
## -2.2919829955  2.8294498887  3.6051633118 -4.7919829955  0.8294498887 
##           816           817           818           819           820 
##  0.8342060431  0.0818279659 -1.6191232650  1.8294498887  0.0537364655 
##           821           822           823           824           825 
##  0.6731734771 -0.1306314464 -0.7053936387  2.1498381312  1.2712710153 
##           826           827           828           829           830 
## -0.0744484456  0.7993625158  0.1217466308  0.4207953999 -0.5606254086 
##           831           832           833           834           835 
## -1.2006374842  0.4207953999 -0.1773021383  0.1122343218  2.1965088231 
##           836           837           838           839           840 
## -0.2053936387  1.5236490926 -1.2101497932  0.5703197845  1.6965088231 
##           841           842           843           844           845 
##  1.0188929381 -1.8782533692 -0.0044424078  0.1965088231 -0.7101497932 
##           846           847           848           849           850 
## -1.1072961005  2.2712710153  1.5469844385 -2.4811070619 -0.4625278704 
##           851           852           853           854           855 
##  3.0188929381  1.1217466308  0.9441307458  0.3226978617 -0.8268265229 
##           856           857           858           859           860 
##  3.1684173227 -0.4577717160  1.0469844385  3.5703197845 -2.0277777538 
##           861           862           863           864           865 
## -3.1820582928  0.1498381312  0.4207953999  2.0188929381  0.5703197845 
##           866           867           868           869           870 
##  3.5703197845  4.6051633118  1.5537364655  0.8994559265  1.0023096192 
##           871           872           873           874           875 
## -0.6025399460 -1.3268265229 -2.6353876009  0.6498381312  0.1265027853 
##           876           877           878           879           880 
##  0.2712710153  0.8741247080  1.6965088231  1.7198441690 -3.0511130997 
##           881           882           883           884           885 
##  0.9722222462  0.5236490926 -0.4391925245  1.0888989759 -3.6911251753 
##           886           887           888           889           890 
## -1.8830095237  0.7665148609  1.9722222462 -2.8220703684 -2.8268265229 
##           891           892           893           894           895 
## -3.4530155615 -2.1820582928  0.1450819767 -0.0792046001  0.7712710153 
##           896           897           898           899           900 
## -0.0744484456  0.9160392454  2.0188929381 -0.5372900627  1.0469844385 
##           901           902           903           904           905 
## -0.9996862533 -1.0558692542  1.1731734771 -1.1025399460 -2.1025399460 
##           906           907           908           909           910 
## -2.3130034859  1.6684173227 -1.9530155615 -1.9063448696  1.4955575922 
##           911           912           913           914           915 
##  0.6217466308  0.6965088231  0.9722222462  0.7946063613  1.2479356694 
##           916           917           918           919           920 
##  0.5003137467  0.4488869003 -0.6492106379  1.6450819767  0.7760271698 
##           921           922           923           924           925 
##  2.4722222462 -2.8549180233 -0.0044424078  1.9207953999 -0.8268265229 
##           926           927           928           929           930 
## -0.9811070619 -1.0091985623  0.1217466308  0.4160392454 -1.6306314464 
##           931           932           933           934           935 
##  1.2479356694 -0.6492106379  0.3460332076  0.9207953999 -1.5044424078 
##           936           937           938           939           940 
##  0.1684173227 -0.8315826773 -2.2053936387  1.0236490926  0.2198441690 
##           941           942           943           944           945 
##  0.6731734771  1.0188929381  0.4722222462 -2.4858632164  0.4488869003 
##           946           947           948           949           950 
## -1.7568204851  1.7246003235 -3.7006374842  0.0188929381 -1.3549180233 
##           951           952           953           954           955 
## -0.7568204851  1.4722222462 -0.0372900627  1.4955575922 -2.4344363700 
##           956           957           958           959           960 
## -1.3549180233  0.7198441690  0.2898502068  0.0469844385 -1.5044424078 
##           961           962           963           964           965 
##  0.5236490926  1.6450819767  0.1636611682  1.1684173227 -4.2520643306 
##           966           967           968           969           970 
##  1.0469844385  0.4908014377 -3.2753996765  1.5023096192  1.3808767350 
##           971           972           973           974           975 
##  0.5818279659 -1.3014953044 -2.2172208032 -2.2033977662 -3.2453123036 
##           976           977           978           979 
## -2.5910317646  1.2032608500 -0.8014953044 -2.6424586109

Now that you can extract the residuals, we can use a similar method to extract the fitted (i.e., predicted) values. Use these two vectors in a call to plot() to plot the residuals as a function of the fitted values of the model m, to check for heterogeneity of variance. Use the ~ notation in your plot formula, as we did for the boxplot.

plot(m$residuals ~ m$fitted.values)

The residuals look OK: there is no great widening or narrowing of the distribution as we move along the x-axis (there is one large residual near the top, though). When assessing equal variance, we want to see a relatively homogeneous cloud of points, with no structure. The most common structure is variance that is greater at one end of your data’s range than at the other, giving the residuals a conical shape.

For any categorical predictors in the model, we would make boxplots of the residuals conditional on these factors, as above.

For those who really want it, you can run Bartlett’s test for equality of variances with the function `bartlett.test()`.
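As a sketch (the grouping variable Sex is an assumption here; substitute whichever factor from your data you conditioned the boxplots on), the call would look like:

```r
# Bartlett's test: do the residuals have equal variance across groups?
# 'Sex' is an assumed factor, not confirmed to be in the sparrows data
bartlett.test(m$residuals ~ sparrows$Sex)
```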

To resolve inhomogeneous variance, you can either transform the response variable or use an approach that does not assume homogeneity of variance (e.g., generalised least squares).
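A minimal sketch of both options (the model formula and variable names are illustrative, not taken from this lesson):

```r
# Option 1: transform the response to stabilise the variance
m_log <- lm(log(Weight) ~ Wingcrd, data = sparrows)

# Option 2: generalised least squares, letting the variance
# change with a covariate (requires the nlme package)
library(nlme)
m_gls <- gls(Weight ~ Wingcrd, data = sparrows,
             weights = varFixed(~ Wingcrd))
```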

Are the data normally distributed?

Many statistical techniques, including linear regression and t-tests, assume that the data are normally distributed. Violations of normality create problems for determining whether model coefficients are significantly different from zero and for calculating confidence intervals. Normality is not required for estimating the values of the coefficients themselves (although outliers may do …).

We can examine the normality of data by plotting a histogram. Plot a basic histogram of sparrow weight.

hist(sparrows$Weight)

The distribution is slightly skewed, with more lower values than higher values. For a test such as a t-test, all we can do to assess normality is plot the raw data. For a regression, we should again be checking the residuals. Plot a histogram of the residuals from the model m.

hist(m$residuals)

These look better! Less skewed.

Another graphical tool for examining normality is the normal probability plot or normal quantile plot of the residuals. We use the function qqnorm() to generate this plot. The residuals should fall more or less along a straight line. Make a qq plot of the residuals from the model m.

qqnorm(m$residuals)

It can be helpful to see a reference diagonal. Add the line by typing qqline(m$residuals).

qqnorm(m$residuals)
qqline(m$residuals)

A normal distribution is indicated by the points lying close to the diagonal reference line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals are excessively skewed (i.e., they are not symmetrically distributed, with too many large errors in one direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis (i.e., there are either too many or too few large errors in both directions).

If you really need a p-value for your decision, you can use a Shapiro-Wilk test: `shapiro.test()`. Run this test on the residuals of our model.

shapiro.test(m$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  m$residuals
## W = 0.98375, p-value = 5.661e-09

The p-value in this case indicates that our distribution is significantly different from a normal distribution. Realistically, this will nearly always be the case, making the test somewhat unhelpful. Examining model fit through residuals will give you a better idea as to whether the distribution of your data is a problem.

For further reading on the issue of statistical tests of normality, see Zuur et al. 2010 and Läärä 2009. As you may realise, real data with perfectly Normal errors are extremely rare.

Are there lots of zeros in the data?

When working with count data (e.g., number of bugs on a leaf), it is common to have zeros. In some cases, one can have a lot of zeros. Such ‘zero-inflated’ data are problematic to analyze and require either a two-step approach (first modelling the zeros versus non-zeros with a binomial generalized linear model, then modelling the non-zeros, typically with a Poisson GLM) or a zero-inflated GLM, which essentially performs these two steps at once. Much of this will make more sense when we start modeling associations in Unit 4.
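A quick first check is simply the proportion of zeros in the response. A sketch with a hypothetical count vector (the sparrows data have no count variable, so `counts` here is made up):

```r
# Sketch: how common are zeros in a count response?
# 'counts' is a hypothetical example vector, not part of the sparrows data
counts <- c(0, 0, 3, 0, 1, 5, 0, 2)
mean(counts == 0)   # proportion of zeros
hist(counts)        # a tall spike at zero suggests zero inflation
```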

Is there collinearity among the covariates?

Collinearity is correlation among covariates (the predictor variables in your model). For example, weight and height are often tightly correlated, as are levels of different soil nutrients. If collinear variables are all included in the model, it is hard to determine which of them is actually driving the response.

We can check for collinearity quickly with the plot() function again. Running plot() on a dataframe will generate a matrix of small plots, with each variable plotted against each other variable. The pairs() function does a similar thing. Run plot() on the sparrow dataset.

plot(sparrows)

You may need to expand the plotting window to interpret the output. Collinearity shows itself as an obvious relationship between two predictors. In this data set, nearly all of the anatomical variables are collinear, which makes sense. Because of this, you should be cautious about using more than one of these variables as predictors in the same model, as they represent essentially the same thing (overall body size).
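To put numbers on what the plot matrix shows, you can compute the pairwise correlations among the numeric columns (a sketch; the exact column names depend on your version of the sparrows data, so the numeric columns are selected programmatically):

```r
# Pairwise correlations among numeric variables only
# (sapply(..., is.numeric) drops any factor columns first)
round(cor(sparrows[sapply(sparrows, is.numeric)]), 2)
```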

Are observations of the response variable independent?

A critical assumption of most statistical techniques is that observations are independent of one another. This assumption means that the value of any one data point is not influenced by the values of other data points, after the effects of the predictor variables have been taken into account.

In practice, this means that for data we know are likely to be highly autocorrelated, such as time series, repeated measures, or data with strong spatial structure (e.g., tree growth or survival within a plot), we may or may not need to account for this structure in the model.

Autocorrelation refers to the tendency for observations close together in space or time to be similar (i.e., to be correlated with one another).

It may be that the predictors already in the model account for the autocorrelation, in which case we do not need an explicit spatial or temporal model.

As with normality and heterogeneity of variance, we can check the residuals of the model for evidence of autocorrelation in space or time (or phylogeny, or …).
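For time-ordered data, one standard check is the autocorrelation function of the residuals (a sketch; this assumes the rows of your data are in time order, which the sparrows data may not be):

```r
# Autocorrelation function of the residuals: spikes outside the
# dashed confidence bands at lags > 0 suggest temporal dependence
acf(m$residuals)
```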

Great, now you have some idea of how to explore data and test the assumptions of various statistical approaches. See the links on the Resources page for more background and more advanced techniques.

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.
