Is There An Association Between Two (or More) Variables?

So far we have looked at solely categorical data (Testing Ratios) and continuous data taken from members of different groups or categories (Testing Populations).

Here, we will look at methods to test whether two or more samples of continuous data are positively or negatively associated. We will also consider having multiple predictor variables.

Type of data

Response/Dependent: Integer or continuous

Predictor/Independent: Continuous (or Categorical)

Choosing a test

(0. Response and predictor data are categorical: See Testing Ratios, chisq.test().)

Do the two continuous variables come from the same distribution? ks.test().
We assume no causal relationship between the continuous ‘predictor’ and the continuous ‘response’: correlation, cor(), cor.test().
We are testing if the continuous/categorical predictor variables are causing some of the variation in the continuous response data data from more than two groups and/or two predictors: linear regression, lm().

Figures

If we have continuous data on both sides, we will most likely make some kind of scatter plot.

# Generate 2 vectors of 100 random values from a Normal distribution
x <- rnorm(100)
y <- rnorm(100)
plot(y ~ x)

1. Do the two continuous variables come from the same distribution?

We can use a Kolmogorov-Smirnov Test to test whether two samples come from the same distribution.

Functions

ks.test(x = , y = ) performs a two-sample test of the null hypothesis that x and y were drawn from the same continuous distribution.
- x = and y = are vectors of numeric data.

Example

Generate 50 values drawn randomly from two different distributions: x is from a normal distribution and y from a uniform.

Thus, we know they are different …

x <- rnorm(n = 50)

# Here, y comes from a uniform distribution, where each value the same probability of being sampled.
y <- runif(n = 50)

We can check this with histograms.

par(mfrow = c(1,2))
hist(x, breaks = 6)
hist(y, breaks = 6)

The ks.test() function takes x = and y = .

ks.test(x, y)

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.6, p-value = 1.062e-08
## alternative hypothesis: two-sided

The p-value less than 0.001 suggests that x and y are indeed drawn from different distributions (as we shushpected … ).

2. Correlation

Whether we run correlation or a regression depends on how we think the two variables are related. If we think that there is no causal relationship between the two, then we would test for a correlation. If we think that there is a causal relationship between the two, then we would run a regressin.

As the saying goes “correlation does not imply causation”.

Functions

cor() returns the correlation coefficient of two variables. Note: You can also run cor() on a matrix or dataframe to test all possible correlations at once.
cor.test() tests for association—correlation—between paired samples. It returns the correlation coefficient and the p-value of the correlation.

Arguments

x = a numeric vector the same length as y,
y = a numeric vector the same length as x,
method = Pearson’s product moment correlation coefficient (method = 'pearson') is the parametric version used for normal data (and the default). Kendall’s tau (method = 'kendall') or Spearman’s rho (method = 'spearman') are used for non-parametric data. The three methods each estimate the association between paired samples and compute a test of the value being zero (indicating no association).

Assumptions (Pearson)

normality (Shapiro-Wilks test, qqplots)

Interpretation

The correlation coefficient can fall between -1 and 1.

-1: (negative 1) means a strong negative correlation (x goes up and y goes down),
0: means no association between x and y,
1: means a strong positive correlation (x goes up and y goes up).

An example

Let’s look at the small sparrow dataset and the relationships between the various measurements.

BirdData <- data.frame(
            Tarsus  = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
            Head    = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, 31.6),
            Weight  = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7),
            Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
            Species = c('A', 'A', 'A', 'A', 'A',  'B', 'B', 'B')
            )

We can use the pairs() function to plot all the variables against one another.

pairs(BirdData)

We have no a priori reason to believe that any of these measurements of sparrows drives variation in the others. However, we can test whether they are correlated. For example, do sparrows with larger heads also have larger wings?

cor()

The function cor() returns the correlation coefficient of two variables. It requires an x = and a y = . Pearson’s product moment correlation coefficient is the parametric version used for normal data; Kendall’s tau or Spearman’s rho for non-parametric data.

cor(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')

## [1] 0.6409484

cor.test()

The function cor.test() tests whether the correlation coefficient is significantly different from 0.

cor.test(x = BirdData$Tarsus, y = BirdData$Wingcrd, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  BirdData$Tarsus and BirdData$Wingcrd
## t = 2.0454, df = 6, p-value = 0.0868
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1162134  0.9269541
## sample estimates:
##       cor 
## 0.6409484

Interpreting the output

t = 2.0454 is the t-test statistic value,
df = 6 is the degrees of freedom,
p-value = 0.0868 is the significance level of the t-test,
95 percent confidence interval: -0.1162134 0.9269541 is the confidence interval of the correlation coefficient at 95%,
sample estimates cor 0.6409484 is the correlation coefficient.

The p-value of the test is 0.0868, not less than the usual alpha of 0.05; so we cannot reject the null hypothesis that Tarsus and Wing are not significantly correlated.

More details

3. Linear Regression

If we think that one variable is driving variation in the other (i.e., we have predictor and response variables), then we should use regression rather than correlation.

Function

lm()

Arguments

formula = a symbolic (pseudocode) formula,
data = (optional) a data object, usually a data.frame.

Models are specified symbolically. A typical model has the form response ~ terms where:

response is the (numeric) response/dependent vector and
terms is a series of terms which specifies a linear predictor (or independent variable) for the response.

The response variable is fitted as a function of the predictor variable. The tilde symbol ( ~ ) signifies as a function of, and separates the response and predictor variables. Adding further predictor variables to the right hand side is possible (see Formulae).

For now, we will carry out a simple regression of a single predictor and response. This model estimates values for the three elements of the equation:

y ~ beta0 + beta1 * x + sigma

In other words:

y ~ intercept + slope * x + sd of the error

lm() provides estimates of the intercept, slope and sd.

We will illustrate the use of lm() using the sparrow data (even though we know there is no causality here).

First, we plot the data.

plot(BirdData$Tarsus ~ BirdData$Wingcrd)

Fig. Sparrow Wingcrd as a function of Tarsus length

Second, we run the model. We can specify the formula, with or without the data = argument.

m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)

# or
m <- lm(Tarsus ~ Wingcrd, data = BirdData)

The output from lm()

The output of lm() is a list.

We can use str() to examine the elements of the output, as well as pull out various parts directly, or use other functions to do so.

For example, summary() pulls out various elements and formats them nicely to display the whole model table.

summary(m)

## 
## Call:
## lm(formula = Tarsus ~ Wingcrd, data = BirdData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2229 -0.1594  0.1844  0.4492  0.5770 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   8.1210     6.2706   1.295   0.2429  
## Wingcrd       0.2328     0.1138   2.045   0.0868 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6705 on 6 degrees of freedom
## Multiple R-squared:  0.4108, Adjusted R-squared:  0.3126 
## F-statistic: 4.184 on 1 and 6 DF,  p-value: 0.0868

Call: The model formula,
Residuals: The quartiles of the residuals from the model,
Coefficients: For continuous predictors The Estimate and Standard Error for the intercept and slope. It also includes the t-values and associated p-values testing if the estimates are significantly different from 0.
R-squared The proportion of the variation in the response variable that is explained by the predictor,
F statistic From the ANOVA table of the model.

More details here

The coefficients

Intercept: Where the line of the linear model crosses the y-axis,
Slope: For continuous predictors, the estimate is the slope, i.e., the change in the response for a 1-unit increase in the predictor,
p-value: The probability of obtaining a result equal to or more extreme than what was actually observed, if the null hypothesis is true. The null hypothesis is rejected if the p-value is less than the alpha (usually 0.05). More.

Accessing the model output

We can access the model coefficients with the function coef()

coef(m)

(Intercept)     Wingcrd 
  8.1209721   0.2327633

or directly with m$coefficients

m$coefficients

(Intercept)     Wingcrd 
  8.1209721   0.2327633

We can access the residuals with resid()

resid(m)

##            1            2            3            4            5 
##  0.445994599 -1.222952295  0.226192619 -0.622952295  0.458955896 
##            6            7            8 
## -0.004860486  0.142574257  0.577047705

or m$residuals

m$residuals

           1            2            3            4            5            6 
 0.445994599 -1.222952295  0.226192619 -0.622952295  0.458955896 -0.004860486 
           7            8 
 0.142574257  0.577047705 

Adding a fitted line to a plot

We can use abline() to add the fitted line of the model to a plot of the raw data.

plot(BirdData$Tarsus ~ BirdData$Wingcrd)
abline(lm(BirdData$Tarsus ~ BirdData$Wingcrd))

# or
m <- lm(BirdData$Tarsus ~ BirdData$Wingcrd)
abline(m)

Fig. Sparrow Wingcrd as a function of Tarsus length, with fitted line

lm() with a categorical predictor

You can also include categorical predictors in linear models.

If we include only one predictor, and that predictor is categorical, we could think of this as an ANOVA (remember that ANOVA and regression are essentially the same kind of thing in many cases).

Let’s model Tarsus as a function of Species.

A good plot of this model would be a boxplot.

plot(Tarsus ~ Species, data = BirdData)

Now we can run the model. Notice that we can use lm() to run this model (whereas in Unit 3 we used aov()).

m <- lm(Tarsus ~ Species, data = BirdData)

And look at the summary.

summary(m)

Call:
lm(formula = Tarsus ~ Species, data = BirdData)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.08  -0.51   0.02   0.30   1.52 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.7800     0.3763  55.222 2.37e-09 ***
SpeciesB      0.4200     0.6145   0.683     0.52    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8414 on 6 degrees of freedom
Multiple R-squared:  0.07224,	Adjusted R-squared:  -0.08239 
F-statistic: 0.4672 on 1 and 6 DF,  p-value: 0.5198

Interpreting the output

We have the same parts to the model summary as above, but the interpretation of the coefficient estimates is slightly different.

Call: The model formula,
Residuals: The quartiles of the residuals from the model,
Coefficients: For categorical predictors the default is to display the mean of the base level (in this case Species A), and then the difference between the mean of each level and the base level. So, here we have the difference between the mean Tarsus size of Species B from Species A. It also includes the t-values and associated p-values testing if the estimates of this difference are significantly different from 0.
R-squared The proportion of the variation in the response variable that is explained by the predictor,
F statistic From the ANOVA table of the model.