Many statistical problems involve binary response variables. For example, we could classify individuals as alive/dead, healthy/unwell, employed/unemployed, left/right, right/wrong, … A regression of binary data is possible if at least one of the predictors is continuous (otherwise you would use some kind of categorical test, such as a Chi-squared test).

The response variable contains only 0s and 1s (e.g., dead = 0, alive = 1) in a single vector. R treats such binary data as if each row came from a binomial trial with sample size 1. If the probability that an individual is dead is p, then the probability of obtaining y (where y is either dead or alive, 0 or 1) is given by an abbreviated form of the binomial distribution with n = 1, known as the Bernoulli distribution.
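
To make this concrete, here is a quick sketch (the value of p is chosen purely for illustration) showing that the Bernoulli distribution is just the binomial with size = 1:

p <- 0.3                       # probability an individual is alive
dbinom(1, size = 1, prob = p)  # P(y = 1) = p = 0.3
dbinom(0, size = 1, prob = p)  # P(y = 0) = 1 - p = 0.7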

Binomial data can be modelled either at the individual (binary response) level or at the group (proportion) level. If you have unique values of one or more explanatory variables for each and every individual case, then a model with a binary response variable will work. If not, you should aggregate the data until you do have such explanatory variables (e.g., at the level of the household, town, state, population, …).
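
For reference, the group-level analysis in R takes a two-column matrix of successes and failures as the response. A minimal sketch, assuming hypothetical data aggregated by town (the names town, alive, dead, and pollution are made up):

town <- data.frame(alive = c(12, 7, 30), dead = c(8, 13, 5),
                   pollution = c(2.1, 5.4, 0.8))
glm(cbind(alive, dead) ~ pollution, family = binomial, data = town)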

A. Individual-level: binary response (0/1) = logistic regression

At the individual level we can model each individual in terms of dead or alive (male/female, tall/short, green/red, …), using logistic regression (or logit model). In the logit model, the log odds (logarithm of the odds) of the outcome is modeled as a linear combination of the predictor variables. If p is a probability, then p/(1 − p) is the corresponding odds; the logit of the probability is the logarithm of the odds.
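
Base R provides the logit and its inverse directly, as qlogis() and plogis(). For example:

p <- 0.8
p / (1 - p)        # the odds: 4
log(p / (1 - p))   # the logit (log odds): 1.386
qlogis(p)          # the same logit, computed directly
plogis(qlogis(p))  # back-transform: recovers p = 0.8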

To model a binary response variable, we first need (or need to create) a single vector of 0s and 1s. We then model this vector with glm() and family = 'binomial'. We can compare models using a chi-squared test on the change in deviance. Overdispersion is not an issue with a binary response variable.
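
As a minimal sketch with made-up data (status, x, y, and m are hypothetical names), creating the 0/1 vector from a two-level factor and fitting the model might look like this:

status <- factor(c("dead", "alive", "dead", "alive", "alive"))
y <- as.numeric(status == "alive")   # alive = 1, dead = 0
x <- c(1.2, 3.4, 2.8, 0.9, 4.1)      # a continuous predictor
m <- glm(y ~ x, family = binomial)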

Let’s look at the incidence (presence/absence) of a breeding bird on various islands. Examine the top 6 rows of the isolation dataset.

head(isolation)
##   incidence  area distance
## 1         1 7.928    3.317
## 2         0 1.925    7.554
## 3         1 2.045    5.883
## 4         0 4.781    5.932
## 5         0 1.536    5.308
## 6         1 7.369    4.934

The data show the $incidence of the bird (present = 1, absent = 0) on islands of different sizes ($area in km²) and at different distances ($distance in km) from the mainland.

We can easily imagine how we might be more likely to find a specific bird on larger islands that are closer to the mainland, and less likely to find that bird on smaller islands further away from the mainland.

Now let’s plot the data. First, plot incidence as a function of distance.

plot(incidence ~ distance, data = isolation)

Because the response variable can take only two values, 0 or 1, the points fall in two horizontal rows across the range of distances (x-axis). It looks like there are more 1s than 0s at lower values of distance and fewer 1s than 0s at higher values of distance.

Now make a similar plot where area is our predictor.

plot(incidence ~ area, data = isolation)

You can see that area has the opposite effect: as area increases, we see fewer 0s and more 1s. Let’s test these relationships using a generalized linear model.

We can begin examining these patterns in more detail by fitting a model of incidence as a function of area and distance. We will first consider the additive model (i.e., the effects of area and distance are independent, with no interaction).

To do this, we’ll need to run a more flexible form of a linear model, called a generalized linear model. Simply put, this lets us fit a linear model with error terms from other, non-normal distributions. In this case, we assume our binary response follows a Bernoulli distribution (a binomial with n = 1).

Use glm(), which stands for generalized linear model, and call the model m0. Supply a formula, set the data = argument, and set family = binomial.

m0 <- glm(incidence ~ area + distance, family = binomial, data = isolation)

What are we actually stating in this additive model? We are saying that the effect of area on incidence does not depend on distance (and vice versa). On the logit scale, if we plotted incidence against area at several fixed distances, the fitted lines would all have the same slope, but different intercepts.
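
We can check this numerically with predict() on the logit scale (a quick sketch; the areas 2 and 3 and the distances 4 and 8 are arbitrary values):

# Change in log odds when area goes from 2 to 3, at two different distances:
diff(predict(m0, newdata = data.frame(area = c(2, 3), distance = 4), type = "link"))
diff(predict(m0, newdata = data.frame(area = c(2, 3), distance = 8), type = "link"))
# Both equal the area coefficient (0.58): the slope does not depend on distance.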

Ok, what if we think that there is an interaction between area and distance, i.e., that the slope for each variable changes with the value of the other, not just the intercept? Run the model of an interaction between area and distance, called m1.

m1 <- glm(incidence ~ area * distance, family = binomial, data = isolation)

We can compare these two models with an analysis of deviance. The function anova() computes analysis of variance (or deviance) tables for one or more fitted model objects. Run this function on m0 and m1. In the case of binary data, we need to do a Chi-squared test (test = 'Chi').

anova(m0, m1, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: incidence ~ area + distance
## Model 2: incidence ~ area * distance
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        47     28.402                     
## 2        46     28.252  1  0.15043   0.6981

The simpler model (m0) is not significantly worse (p = 0.6981), so we can go with that one.

Fitting and interpreting the model

Let’s look at the summary of model m0.

summary(m0)
## 
## Call:
## glm(formula = incidence ~ area + distance, family = binomial, 
##     data = isolation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   6.6417     2.9218   2.273  0.02302 * 
## area          0.5807     0.2478   2.344  0.01909 * 
## distance     -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6

As usual, we have the formula, the deviance residuals (a measure of model fit), then the coefficient estimates, their standard errors, the z-statistics (sometimes called Wald z-statistics), and p-values. What exactly do the coefficients represent?

The coefficients presented in the model summary are on the logit (log odds) scale. We can then transform these logits to odds ratios if we want.

Logit/log odds

The logistic regression coefficients give the change in the log odds (i.e., the logarithm of the odds, called the logit).

For continuous predictors, the coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable. For example, for every 1 km² increase in area, the log odds of the bird being present (versus absent) increases by 0.58, and for every 1 km increase in distance from the mainland, the log odds of the bird being present decreases by 1.37 (i.e., −1.37).

For categorical predictors, the coefficient estimate describes the change in the log odds for each level compared to the base level.

Confidence intervals around the coefficient estimates can be calculated with the function confint(). Calculate the CIs for m0.

confint(m0)
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept)  1.8390115 13.735103
## area         0.1636582  1.176611
## distance    -2.5953657 -0.639140

Logits (log odds) are just one way of describing the model. We can easily convert the logits either to odds ratios or predicted probabilities.

Odds ratios

Odds ratios are a relative measure of effect, which allows the comparison of two groups. They are a popular way of presenting results in medicine, especially of controlled trials. Here, the odds of one group (usually the control) are divided by the odds of another (the treatment) to give a ratio. If the odds of both groups are the same, the ratio will be 1, suggesting there is no difference. If the OR is > 1, the odds of the outcome are higher in the control group; if the OR is < 1, the odds of the outcome are higher in the treatment group.

However, in R, we are presented with the change in log odds between each level and the base level (for categorical variables) or between x and x + 1 (for continuous variables). The directionality is the opposite of the example above, because R effectively compares the ‘treatment’ to the ‘control’.

Thus, for categorical predictors, an odds ratio of > 1 suggests a positive difference between a level and the base level (the exponential of a positive number is greater than 1), i.e., the base level has lower odds and the treatment level has higher odds. An odds ratio of < 1 suggests a negative difference (the exponential of a negative number lies between 0 and 1).

For continuous predictors, the odds ratio is the odds at x + 1 over the odds at x (i.e., the odds ratio for a unit increase in the continuous predictor variable), and the interpretation is the same as for categorical predictors.
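
To see this concretely with our island data, we can verify that the odds ratio for a one-unit increase in area matches the exponentiated coefficient (the values area = 3 and distance = 5 are arbitrary):

p1 <- predict(m0, newdata = data.frame(area = 3, distance = 5), type = "response")
p2 <- predict(m0, newdata = data.frame(area = 4, distance = 5), type = "response")
(p2 / (1 - p2)) / (p1 / (1 - p1))  # equals exp(coef(m0)["area"]), about 1.79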

To obtain these odds ratios in R, we need to exponentiate the coefficient estimates. Remember that these estimates are on the log odds scale, so exponentiating them returns the odds ratios. The function exp() can do this for us.

Remember that we can extract the coefficient estimates from any model with the function coef(). Put exp() and coef() together to obtain the odds ratios from m0.

exp(coef(m0))
## (Intercept)        area    distance 
## 766.3669575   1.7873322   0.2536142

Since both our predictors are continuous, these are the odds ratios comparing x + 1 to x. The odds ratio for area is 1.79, indicating that for a 1 km² increase in area we expect a 79% increase in the odds of finding the bird on the island. For a 1 km increase in the distance of the island from the mainland, we expect a 75% decrease in the odds of finding the bird.
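
These percentages come straight from the odds ratios:

(exp(coef(m0)) - 1) * 100  # area: +78.7%, distance: -74.6%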

A more detailed discussion of these different representations can be found here: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

Predicted probabilities

We can generate predicted probabilities for both continuous and categorical variables. To do this, we call the predict() function, which can generate predictions for each row in the original data set.

Setting type = 'link' returns the logits; setting type = 'response' returns the predicted probabilities.

Add an extra column to the dataset for the predicted logits, called ‘logit’, by using the predict() function on m0 and setting the correct type. Remember that to add a column we just need to call the data object and add the new column name with a dollar sign.

isolation$logit <- predict(m0, type = 'link')

Add another column for the predicted probabilities, called ‘p’.

isolation$p <- predict(m0, type = 'response')
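
The two new columns are linked by the inverse logit: applying plogis() to the logits recovers the probabilities.

all.equal(plogis(isolation$logit), isolation$p)  # TRUE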

However, to better plot these relationships, we should supply a specific range of x values for which to predict y.

Plotting binary response data

To plot the fitted line of a generalized linear model, we generally have to use the predict() function: although the model is linear on the logit scale, the fitted curve on the response scale is not a straight line, so it cannot be drawn from an intercept and a slope as we did with lm() and abline().

The function predict() is a general wrapper for more specific functions, in this case predict.glm(). We have to specify the object (i.e., the fitted model), the newdata (i.e., the values of x for which we want predictions, named identically to the predictors in the object, as a dataframe or list), and the type of prediction required: either on the scale of the linear predictors (‘link’) or on the scale of the response variable (‘response’).

There are, as usual, at least two ways we could do this. First, we will make two separate individual logistic regressions and make predictions from those.

Make a logistic model for incidence as a function of area, called ma.

ma <- glm(incidence ~ area, family = binomial, data = isolation)

We then need to make a sequence of x values (e.g., $area) from which to predict the y (predicted probability of $incidence). To do this, first generate a sequence of x values from 0 to 9, in steps of 0.1. Call this vector xv.

xv <- seq(from = 0, to = 9, by = 0.1)

Now generate the vector of predicted probabilities using predict() (call it yv). Set object = to your model, newdata = to a dataframe in which xv is named as your predictor (in this case area), and type = to 'response'.

yv <- predict(object = ma, newdata = data.frame(area = xv), type = "response")

Now, we have two vectors, xv and yv, that we can use to plot the fitted (predicted) line.

Add this fitted line to the plot of incidence on area, using the lines() function.

plot(incidence ~ area, data = isolation)
lines(xv, yv)
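
The same approach works for distance (a sketch; md, dv, and pv are names we introduce here):

md <- glm(incidence ~ distance, family = binomial, data = isolation)
dv <- seq(from = 0, to = 10, by = 0.1)
pv <- predict(md, newdata = data.frame(distance = dv), type = "response")
plot(incidence ~ distance, data = isolation)
lines(dv, pv)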