Is There An Association Between Two (or More) Variables?

So far we have looked at solely categorical data (Testing Ratios) and continuous data taken from members of different groups or categories (Testing Populations).

Here, we will look at methods to test whether two or more samples of continuous data are positively or negatively associated. We will also consider having multiple predictor variables.

Type of data

Response/Dependent: Integer or continuous

Predictor/Independent: Continuous (or Categorical)

Choosing a test

(0. Response and predictor data are categorical: See Testing Ratios, chisq.test().)

  1. Do the two continuous variables come from the same distribution? ks.test().

  2. We assume no causal relationship between the continuous ‘predictor’ and the continuous ‘response’: correlation, cor(), cor.test().

  3. We are testing if the continuous/categorical predictor variables are causing some of the variation in the continuous response data data from more than two groups and/or two predictors: linear regression, lm().


If we have continuous data on both sides, we will most likely make some kind of scatter plot.

x <- rnorm(100)
y <- rnorm(100)
plot(y ~ x)

1. Do the two continuous variables come from the same distribution?

We can use a Kolmogorov-Smirnov Test to test whether two samples come from the same distribution.



Generate 50 values drawn randomly from two different distributions: x is from a normal distribution and y from a uniform.

Thus, we know they are different …

x <- rnorm(n = 50)
y <- runif(n = 50)

We can check this with histograms.

par(mfrow = c(1,2))
hist(x, breaks = 6)
hist(y, breaks = 6)

The ks.test() function takes x = and y = .

ks.test(x, y)
##  Two-sample Kolmogorov-Smirnov test
## data:  x and y
## D = 0.6, p-value = 1.062e-08
## alternative hypothesis: two-sided

The p-value less than 0.001 suggests that x and y are indeed drawn from different distributions (as we shushpected … ).

2. Correlation

Whether we run correlation or a regression depends on how we think the two variables are related. If we think that there is no causal relationship between the two, then we would test for a correlation. If we think that there is a causal relationship between the two, then we would run a regressin.

As they saying goes ‘correlation does not imply causation’.



Assumptions (Pearson)


The correlation coefficient can fall between -1 and 1.

An example

Let’s look at the small sparrow dataset, and the relationship between the various measurements.

BirdData <- data.frame(
            Tarsus  = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
            Head    = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
            Weight  = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
            Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
            Species = c('A', 'A', 'A', 'A', 'A',  'B', 'B', 'B')

We can use the pairs() function to plot all the variables against one another.


We have no a priori reason to believe that any of these measurements of sparrows drives variation in the others. However, we can test whether they are correlated. For example, do sparrows with larger heads also have larger wings?


The function cor() returns the correlation coefficient of two variables. It requires an x and a y. Pearson’s product moment correlation coefficient is the parametric version used for normal data; Kendall’s tau or Spearman’s rho for non-parametric data.

cor(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')
## [1] 0.6409484


cor.test(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')
##  Pearson's product-moment correlation
## data:  BirdData$Tarsus and BirdData$Wingcrd
## t = 2.0454, df = 6, p-value = 0.0868
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1162134  0.9269541
## sample estimates:
##       cor 
## 0.6409484

Interpreting the output

The p-value of the test is 0.0868, not less than the usual alpha of 0.05; so we cannot reject the null hypothesis that Tarsus and Wing are not significantly correlated.

More details

3. Linear Regression

If we think that one variable is driving variation in the other, we should use regression rather than correlation.



Models are specified symbolically. A typical model has the form response ~ terms where:

The response variable is fitted as a function of the predictor variable. The tilde symbol ( ~ ) signifies as a function of, and separates the response and predictor variables. Adding further predictor variables to the right hand side is possible (see Formulae).

For now, we will carry out a simple regression of a single predictor and response. This model estimates values for the three elements of the equation:

y ~ beta0 + beta1 * x + sigma

In other words:

y ~ intercept + slope * x + sd of the error

lm() provides estimates of the intercept, slope and sd.

We will illustrate the use of lm() using the sparrow data (even though we know there is no causality here).

First, we plot the data.

plot(BirdData$Tarsus ~ BirdData$Wingcrd)

Fig. Sparrow Wingcrd as a function of Tarsus length

Second, we run the model. We can specify the formula, with or without the data = argument.

m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)

# or
m <- lm(Tarsus ~ Wingcrd, data = BirdData)

The output from lm()

The output of lm() is a list.

We can use str() to examine the elements of the output, as well as pull out various parts directly, or use other functions to do so.

For example, summary() pulls out various elements and formats them nicely to display the whole model table.

## Call:
## lm(formula = Tarsus ~ Wingcrd, data = BirdData)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2229 -0.1594  0.1844  0.4492  0.5770 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   8.1210     6.2706   1.295   0.2429  
## Wingcrd       0.2328     0.1138   2.045   0.0868 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.6705 on 6 degrees of freedom
## Multiple R-squared:  0.4108, Adjusted R-squared:  0.3126 
## F-statistic: 4.184 on 1 and 6 DF,  p-value: 0.0868

More details here

The coefficients

Accessing the model output

We can access the model coefficients with the function coef()

(Intercept)     Wingcrd 
  8.1209721   0.2327633

or directly with m$coefficients

(Intercept)     Wingcrd 
  8.1209721   0.2327633

We can access the residuals with resid()

##            1            2            3            4            5 
##  0.445994599 -1.222952295  0.226192619 -0.622952295  0.458955896 
##            6            7            8 
## -0.004860486  0.142574257  0.577047705

or m$residuals

           1            2            3            4            5            6 
 0.445994599 -1.222952295  0.226192619 -0.622952295  0.458955896 -0.004860486 
           7            8 
 0.142574257  0.577047705 

Adding a fitted line to a plot

We can use abline() to add the fitted line of the model to a plot of the raw data.

plot(BirdData$Tarsus ~ BirdData$Wingcrd)
abline(lm(BirdData$Tarsus ~ BirdData$Wingcrd))

# or
m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)

Fig. Sparrow Wingcrd as a function of Tarsus length, with fitted line

lm() with a categorical predictor

You can also include categorical predictors in linear models.

If we include only one predictor, and that predictor is categorical, we could think of this as an ANOVA (remember that ANOVA and regression are essentially the same kind of thing in many cases).

Let’s model Tarsus as a function of Species.

A good plot of this model would be a boxplot.

plot(Tarsus ~ Species, data = BirdData)

Now we can run the model. Notice that we can use lm() to run this model (whereas in Unit 3 we used aov()).

m <- lm(Tarsus ~ Species, data = BirdData)

And look at the summary.

lm(formula = Tarsus ~ Species, data = BirdData)

   Min     1Q Median     3Q    Max 
 -1.08  -0.51   0.02   0.30   1.52 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.7800     0.3763  55.222 2.37e-09 ***
SpeciesB      0.4200     0.6145   0.683     0.52    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8414 on 6 degrees of freedom
Multiple R-squared:  0.07224,	Adjusted R-squared:  -0.08239 
F-statistic: 0.4672 on 1 and 6 DF,  p-value: 0.5198

Interpreting the output

We have the same parts to the model summary as above, but the interpretation of the coefficient estimates is slightly different.