Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work. [Wright, D.B. 2003. Making friends with your data: Improving how statistics are conducted and reported. British Journal of Educational Psychology, 73, 123-136.]
Most statistical tests and techniques have underlying assumptions that are often violated (i.e., the data or model do not meet the assumptions). Some of these violations have little impact on the results or conclusions.
Others can increase type I (false positive) or type II (false negative) errors, potentially leading to wrong conclusions and erroneous recommendations. Most statistical violations can be avoided by exploring your data better before analysis.
In this lesson, we will examine graphical and statistical methods for data exploration and testing assumptions. We follow the recommendations of Zuur et al. (2010). [Zuur et al. 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1, 3–14.]
All models and statistical tests have assumptions. Different tests can deal with violation of some assumptions better than others. For example, heterogeneity (differences in variation) may be problematic in linear regression, or a single outlier may exert a huge influence on estimates of the mean.
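As a small illustration of how one outlier can dominate the mean, the sketch below compares the mean and median of a short vector before and after a single data entry error. The values and the names `wing` and `wing_typo` are made up for this example; they are not the lesson's dataset.

```r
# Illustrative wing lengths (hypothetical values, not the lesson's data)
wing <- c(56.5, 57.0, 57.0, 58.0, 59.0, 59.0)

# The same data with one invented data entry error: 58.0 typed as 580
wing_typo <- c(wing, 580)

# The mean is pulled far away from the bulk of the data by the single bad value
mean(wing)
mean(wing_typo)

# The median barely moves
median(wing)
median(wing_typo)
```

The mean more than doubles while the median shifts by half a unit, which is why it pays to look for extreme values before estimating anything.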
These assumptions fall into three broad categories: (1) distributional assumptions, which concern the probability distributions of the observations or their associated random errors (e.g., normal versus binomial); (2) structural assumptions, which concern the form of the functional relationship between the response variable and the predictor variables (e.g., linear regression assumes that the relationship is linear); and (3) cross-variation assumptions, which concern the joint probability distribution of the observations and/or the errors (e.g., most tests assume that the observations are independent).
Before we proceed, there are two other important issues to be clear about.
First, data exploration is a very different, and separate, process from hypothesis testing (or Bayesian, likelihood, or information theoretic approaches). Models and hypotheses should be based on your understanding of your study system, not on ‘data dredging’ looking for patterns.
Second, here we emphasize the visual inspection of the data and model, rather than an over-reliance on statistical tests of the assumptions. There are statistical tests for normality and homogeneity, but the statistical literature warns against them. See the Best Practice and Resources pages for more information.
We will explore several data sets now. First, use head() to look at the first few rows of the sparrow dataset loaded with this lesson. The dataset is called ‘sparrows’.
##   Species    Sex Wing Tarsus Head Culmen Nalospi Weight Observer Age
## 1    SSTS   Male 58.0   21.7 32.7   13.9    10.2   20.3        2   0
## 2    SSTS Female 56.5   21.1 31.4   12.2    10.1   17.4        2   0
## 3    SSTS   Male 59.0   21.0 33.3   13.8    10.0   21.0        2   0
## 4    SSTS   Male 59.0   21.3 32.5   13.2     9.9   21.0        2   0
## 5    SSTS   Male 57.0   21.0 32.5   13.8     9.9   19.8        2   0
## 6    SSTS Female 57.0   20.7 32.5   13.3     9.9   17.5        2   0
The dataset has 10 columns and contains morphometric information on 979 individual sparrows, both females and males, of two species.
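In the lesson itself, ‘sparrows’ is already loaded. So that the commands can be tried anywhere, the sketch below rebuilds only the six rows printed above as a stand-in data frame. The name `sparrows_demo` is hypothetical, and the stand-in is not a substitute for the full dataset.

```r
# Stand-in built from the six rows shown in the lesson output (not the full data)
sparrows_demo <- data.frame(
  Species  = rep("SSTS", 6),
  Sex      = c("Male", "Female", "Male", "Male", "Male", "Female"),
  Wing     = c(58.0, 56.5, 59.0, 59.0, 57.0, 57.0),
  Tarsus   = c(21.7, 21.1, 21.0, 21.3, 21.0, 20.7),
  Head     = c(32.7, 31.4, 33.3, 32.5, 32.5, 32.5),
  Culmen   = c(13.9, 12.2, 13.8, 13.2, 13.8, 13.3),
  Nalospi  = c(10.2, 10.1, 10.0, 9.9, 9.9, 9.9),
  Weight   = c(20.3, 17.4, 21.0, 21.0, 19.8, 17.5),
  Observer = rep(2, 6),
  Age      = rep(0, 6)
)

head(sparrows_demo)        # first rows (here, all six)
dim(sparrows_demo)         # number of rows and columns
str(sparrows_demo)         # type of each column
table(sparrows_demo$Sex)   # counts per level of a categorical column
```

With the real data, the same calls on `sparrows` give the number of individuals and show which columns are measurements and which are grouping variables.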
An outlier is an observation or data point that is ‘abnormally’ high or low compared to the other data. It may indicate real variation or experimental, observation, and/or data entry errors. As such, the definition of what constitutes an outlier is subjective.
Different techniques respond to and treat outliers differently. For some analyses, outliers make no difference; for others they may bias the results and conclusions.
The classic graph for looking at outliers is the boxplot, which we have seen before. Make a simple boxplot of sparrow wing length, ‘sparrows$Wing’.
The bold line in the middle of the box indicates the median, the lower end of the box is the 25% quartile, and the upper end of the box is the 75% quartile. The two ends of the box are called the ‘hinges’, and the difference between them (the 75% quartile minus the 25% quartile) is the ‘spread’, or interquartile range. The lines (or whiskers) extend from each hinge to the most extreme data point that lies within 1.5 times the spread of that hinge. (Note that the interval defined by these lines is not a confidence interval.) Points beyond these lines are (often wrongly) considered to be outliers.
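The 1.5-times rule can be checked by hand. The sketch below uses a short hypothetical vector (the value 68 is invented to give one extreme point) and compares a manual calculation with base R's boxplot.stats(). The names `spread`, `lower_fence`, `upper_fence` and `beyond` are illustrative only.

```r
# Hypothetical wing lengths; 68 is an invented extreme value
wing <- c(56.5, 57, 57, 58, 59, 59, 68)

# Box edges and median
q <- quantile(wing, c(0.25, 0.5, 0.75))

# Height of the box: 75% quartile minus 25% quartile
spread <- unname(q[3] - q[1])

# Limits beyond which points are drawn individually
lower_fence <- unname(q[1]) - 1.5 * spread
upper_fence <- unname(q[3]) + 1.5 * spread
beyond <- wing[wing < lower_fence | wing > upper_fence]
beyond

# boxplot.stats() does the same bookkeeping. It uses hinges from fivenum(),
# which can differ slightly from quantile() when the sample size is even.
bs <- boxplot.stats(wing, coef = 1.5)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers

# The corresponding plot
boxplot(wing, ylab = "Wing length")
```

Note that the upper whisker stops at the largest observation inside the limit, not at the limit itself.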
In this case, the boxplot suggests that there may be at least 6 outliers.
Boxplots provide a summary of the data. As we will discuss in Unit 5, a better approach would be to present the raw data as well. We can use the basic plot() function to do this. Use plot() to display the wing measurements.
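In the lesson, the call is simply plot(sparrows$Wing). For a version that runs without the dataset, the sketch below simulates a stand-in vector. The simulated values, the seed, and the added extreme value of 68 are all assumptions made for illustration, not properties of the real data.

```r
set.seed(1)

# Hypothetical stand-in for sparrows$Wing: simulated values plus one invented extreme
wing <- c(rnorm(100, mean = 58, sd = 1.5), 68)

# Given a single numeric vector, plot() draws each value against its index,
# i.e. its position in the data
plot(wing, xlab = "Index (order of the data)", ylab = "Wing length")

# Position in the data of the most extreme value, for checking against the records
which.max(wing)
```

Because every observation is drawn, an isolated point stands out against the main cloud, and its index tells you which row to inspect.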