Count data

A lot of data is in the form of counts (whole numbers or integers). For example, the number of species in plots, the number of sunny days per year, the number of restuarants in towns, …

Further, with count data, the number 0 often appears as a value of the response variable.

Straightforward linear regression methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons:

  1. The linear model might lead to the prediction of negative counts. (2) The variance of the response variable is likely to increase with the mean. (3) The errors will not be normally distributed. (4) Zeros are difficult to handle in transformations.

Therefore, Poisson regression is used to model count variables.

The Poisson distribution was made famous by the Russian statistician Ladislaus Bortkiewicz (1868-1931), who counted the number of soldiers killed by being kicked by a horse each year in each of 14 cavalry corps in the Prussian army over a 20-year period. The data can be found here, if you are interested.

To fit a Poisson regression, we run a generalized linear model with family = poisson, which sets errors = Poisson and link = log. The log link ensures that all the fitted values are positive. The Poisson errors take account of the fact that the data are integer and have variances that are equal to their means.

We will illustrate a Poisson regression with data on the number of awards earned by students at one high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general, or academic) and the score on their final exam in maths.

Look at the head of the data, awards.

##   student num_awards    program math_score
## 1       1          0 vocational         40
## 2       2          0 vocational         33
## 3       3          0   academic         48
## 4       4          1   academic         41
## 5       5          1   academic         43
## 6       6          0   academic         46

The response variable is the number of awards earned by each high-school student during the year ($num_awards). The predictor variables are each students’ score on their maths final exam ($math_score, continuous) and the type of program in which the students were enrolled ($program, categorical with three levels: general, academic, vocational).

Check that the data read in correctly, with str().

## 'data.frame':    200 obs. of  4 variables:
##  $ student   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ num_awards: int  0 0 0 1 1 0 1 0 0 1 ...
##  $ program   : Factor w/ 3 levels "academic","general",..: 3 3 1 1 1 1 1 1 3 2 ...
##  $ math_score: int  40 33 48 41 43 46 59 52 52 49 ...

Checking the data

The Poisson distribution assumes that the mean and variance of the response variable, conditioned on the predictors, are equal. We can check that this is the case.

First, the unconditional mean and variance (i.e., the mean and variance of the whole vector). Calculate the mean number of awards given, over all students (i.e. the entire column in the data).

## [1] 0.63

Now, calculate the variance of the number of awards.

## [1] 1.108643

Second, check the conditional mean and variance (i.e., the mean and variance of the respose variable conditional on the levels of categorical predictors).

Calculate the mean number of awards for each program. Call this object a_mean.

a_mean <- tapply(awards$num_awards, awards$program, mean)

Calculate the variance in the number of awards for each program. Call this object a_var.

a_var <- tapply(awards$num_awards, awards$program, var)

The conditional mean and variance look reasonably similar here. Let’s make a histogram. The displayed figure is created from the following code:

barplot(t(table(awards$num_awards, awards$program)), beside = TRUE, xlab = 'Number of Awards', ylab = 'Count')

Fitting the model

Let’s fit a glm! We want to model the number of awards as a function of the program and the maths grade. Remember to set the family = argument to poisson. Use a formula and data = argument, and call this m1.

m1 <- glm(num_awards ~ program + math_score, data = awards, family = 'poisson')

Now display the summary of the model result.

## Call:
## glm(formula = num_awards ~ program + math_score, family = "poisson", 
##     data = awards)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.16327    0.66288  -6.281 3.37e-10 ***
## programgeneral    -1.08386    0.35825  -3.025  0.00248 ** 
## programvocational -0.71405    0.32001  -2.231  0.02566 *  
## math_score         0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for poisson family taken to be 1)
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## Number of Fisher Scoring iterations: 6

Let’s work through this output, which is in the usual format.

First we have the model Call, reiterating the model we just ran.

Second, we have the Deviance Residuals, which are approximately normally distributed if the model is specified correctly. Here, it shows a little bit of skeweness since median is not quite zero.

Then, we have the Coefficients: The Poisson regression coefficients, standard errors, z-scores, and p-values for each variable.

For continuous predictor variables, the coefficient estimate is the expected log count of the response variable for a one-unit increase in the continuous predictor.

For example, the coefficient for $math_score is 0.07. This means that the expected increase in log award number for a one-unit increase in math is 0.07. To obtain the actual counts, we need to calculate the exponential of 0.07.

Extract the coefficient estimate for $math_score from this model summary, and calculate the exponent.

## math_score 
##   1.072672

For categorical predictor variables, the coefficient estimate is the expected difference in log count between the baseline level (in this case, academic) and each other level. To return the difference in actual counts, you would take the exponent.

In this example, the expected difference in log count for general compared to academic is -1.08. The expected difference in log count for vocational compared to academic is -0.7: students in general and vocational programs got fewer awards than students in academic programs.

Finally, we have information on the Deviance. The residual deviance can be used to test for the goodness of fit of the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. The smaller the residual deviance, the better the model fit.

We can eyeball this fit by checking if the residual deviance is similar to the residual degrees of freedom (it should be, because the variance and the mean should be the same). If the residual deviance is larger than residual degrees of freedom, this difference suggests that there is extra, unexplained, variation in the response (i.e., we have a case of overdispersion). Overdispersion can be compensated for by refitting the model using family =quasipoisson` rather than Poisson errors.

A better approach is to explicitly test for a difference. We can do this using a chi-squared goodness of fit test, pchisq(). This function returns a p-value for a specific value (q) and degrees of freedom (df) for the chi-squared distribution. By default, pchisq() gives the proportion of the distribution to the left of the value. To get the proportion more extreme than the difference, you can specify lower.tail = FALSE (or subtract the result from 1).

If the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data.

Run a chi-squared goodness of fit test on the residuals of m1, using pchisq(). Make sure that you extract the deviance and degrees of freedom from the model (df.residual), rather than entering the values manually.

pchisq(m1$deviance, m1$df.residual)
## [1] 0.3817726

We can also use a goodness of fit test to compare this full model with a simpler model, such as one that only includes $math_score. Run a new model, m2, that models the number of awards as a function of math_score only.

m2 <- glm(num_awards ~ math_score, data = awards, family = 'poisson')

Now we can test the overall effect of $program by comparing the deviance of the full model with the deviance of this new model. Use the anova() function, with test = 'Chisq' to compare models m1 and m2.

anova(m2, m1, test = "Chisq")
## Analysis of Deviance Table
## Model 1: num_awards ~ math_score
## Model 2: num_awards ~ program + math_score
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       198     204.02                          
## 2       196     189.45  2   14.572 0.0006852 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The two degree-of-freedom chi-square test indicates that $program is a statistically significant predictor of num_awards.

Plotting the model

As with the binomial glm, we cannot just use the coefficient estimates to plot the fitted line with abline().

We first need to calculate and store predicted values for each row in the original dataset (we have enough data here that, at least for our purposes, we do not need to generate a new vector of math_scores).

Use the predict() function to add a new column ($pred) to the awards data object for the predicted values from the original model, m1. In this case, we do not need to provide any newdata. Further, to get predicted *counts8 (rather than log counts), we need to set `type = ‘response’.

awards$pred <- predict(m1, type = "response")

Now we can make a plot of the data and fitted lines. Think about how we would plot both a continuous and a categorical predictor variables. A good way would be to have the continuous variable as the x-axis, and different points/lines for each level of the categorical variable.

To plot the results, we need to first order the data by $program and $math_score, so that the lines we plot will be curves and not go all over the plot. We can do that using the order() function within the square brackets, as we did before. Run: awards2 <- awards[order(awards\(program, awards\)math_score), ].

awards2 <- awards[order(awards$program, awards$math_score), ]

Now, we can plot the data points (adding some jitter to the counts, because there will be some overlapping points), and then add the predicted lines. First, plot jittered $num_awards as a function of math_score. Use pch = 20, and the argument col = to your factor, program.

plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)