A lot of data is in the form of counts (whole numbers or integers). For example, the number of species in plots, the number of sunny days per year, the number of restaurants in towns, and so on.

Further, with count data, the number 0 often appears as a value of the response variable.

Straightforward linear regression methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons:

1. The linear model might lead to the prediction of negative counts.
2. The variance of the response variable is likely to increase with the mean.
3. The errors will not be normally distributed.
4. Zeros are difficult to handle in transformations.

Therefore, Poisson regression is used to model count variables.

The Poisson distribution was made famous by the Russian statistician Ladislaus Bortkiewicz (1868-1931), who counted the number of soldiers killed by being kicked by a horse each year in each of 14 cavalry corps in the Prussian army over a 20-year period. The original data are available online, if you are interested.

To fit a Poisson regression, we run a generalized linear model with `family = poisson`, which sets `errors = Poisson` and `link = log`. The log link ensures that all the fitted values are positive. The Poisson errors take account of the fact that the data are integer-valued and have variances that are equal to their means.
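We can see both of these properties in a quick simulation (a synthetic sketch; the objects `y` and `eta` are ours, not part of the awards data):

```r
set.seed(42)

# For a Poisson variable, the mean and variance are equal (both = lambda)
y <- rpois(10000, lambda = 3)
mean(y)   # close to 3
var(y)    # also close to 3

# The log link maps any linear predictor, positive or negative,
# onto a strictly positive fitted mean
eta <- c(-5, -1, 0, 2)   # example linear-predictor values
exp(eta)                 # all positive
```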

We will illustrate a Poisson regression with data on the number of awards earned by students at one high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general, or academic) and the score on their final exam in maths.

Look at the head of the data, `awards`.

`head(awards)`

```
## student num_awards program math_score
## 1 1 0 vocational 40
## 2 2 0 vocational 33
## 3 3 0 academic 48
## 4 4 1 academic 41
## 5 5 1 academic 43
## 6 6 0 academic 46
```

The response variable is the number of awards earned by each high-school student during the year (`$num_awards`). The predictor variables are each student's score on their maths final exam (`$math_score`, continuous) and the type of program in which the students were enrolled (`$program`, categorical with three levels: general, academic, vocational).

Check that the data read in correctly, with `str()`.

`str(awards)`

```
## 'data.frame': 200 obs. of 4 variables:
## $ student : int 1 2 3 4 5 6 7 8 9 10 ...
## $ num_awards: int 0 0 0 1 1 0 1 0 0 1 ...
## $ program : Factor w/ 3 levels "academic","general",..: 3 3 1 1 1 1 1 1 3 2 ...
## $ math_score: int 40 33 48 41 43 46 59 52 52 49 ...
```

The Poisson distribution assumes that the mean and variance of the response variable, conditioned on the predictors, are equal. We can check that this is the case.

First, the *unconditional* mean and variance (i.e., the mean and variance of the whole vector). Calculate the mean number of awards given, over all students (i.e. the entire column in the data).

`mean(awards$num_awards)`

`## [1] 0.63`

Now, calculate the variance of the number of awards.

`var(awards$num_awards)`

`## [1] 1.108643`

Second, check the *conditional* mean and variance (i.e., the mean and variance of the response variable conditional on the levels of categorical predictors).

Calculate the mean number of awards for each program. Call this object `a_mean`.

`a_mean <- tapply(awards$num_awards, awards$program, mean)`

Calculate the variance in the number of awards for each program. Call this object `a_var`.

`a_var <- tapply(awards$num_awards, awards$program, var)`

The conditional mean and variance look reasonably similar here. Let's make a barplot of the counts. The displayed figure is created from the following code:

`barplot(t(table(awards$num_awards, awards$program)), beside = TRUE, xlab = 'Number of Awards', ylab = 'Count')`

Let's fit a glm! We want to model the number of awards as a function of the program and the maths score. Remember to set the `family =` argument to `poisson`. Use a formula and a `data =` argument, and call this `m1`.

`m1 <- glm(num_awards ~ program + math_score, data = awards, family = 'poisson')`

Now display the summary of the model result.

`summary(m1)`

```
##
## Call:
## glm(formula = num_awards ~ program + math_score, family = "poisson",
## data = awards)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.8436 -0.5106 0.2558 2.6796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16327 0.66288 -6.281 3.37e-10 ***
## programgeneral -1.08386 0.35825 -3.025 0.00248 **
## programvocational -0.71405 0.32001 -2.231 0.02566 *
## math_score 0.07015 0.01060 6.619 3.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
```

Let's work through this output, which is in the usual format.

First we have the model **Call**, reiterating the model we just ran.

Second, we have the **Deviance Residuals**, which should be approximately normally distributed if the model is specified correctly. Here, they show a little skewness, since the median is not quite zero.

Then, we have the **Coefficients**: The Poisson regression coefficients, standard errors, z-scores, and p-values for each variable.

For continuous predictor variables, the coefficient estimate is the expected *log count* of the response variable for a one-unit increase in the continuous predictor.

For example, the coefficient for `$math_score` is 0.07. This means that the expected increase in log award count for a one-unit increase in maths score is 0.07. To obtain the effect on the actual counts, we need to calculate the exponential of 0.07.

Extract the coefficient estimate for `$math_score` from this model summary, and calculate its exponent.

`exp(coef(m1)['math_score'])`

```
## math_score
## 1.072672
```

For categorical predictor variables, the coefficient estimate is the expected difference in log count between the baseline level (in this case, `academic`) and each other level. To return the difference in actual counts, you would take the exponent.

In this example, the expected difference in log count for `general` compared to `academic` is -1.08. The expected difference in log count for `vocational` compared to `academic` is -0.71: students in `general` and `vocational` programs got fewer awards than students in `academic` programs.
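We can make these differences concrete by exponentiating the two program coefficients, turning them into rate ratios relative to the `academic` baseline (a quick sketch, assuming `m1` has been fitted as above):

```r
# Rate ratios for the two non-baseline program levels
exp(coef(m1)[c("programgeneral", "programvocational")])
# roughly 0.34 and 0.49: for a given maths score, general and vocational
# students earn about a third and a half as many awards, respectively,
# as academic students
```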

Finally, we have information on the **Deviance**. The residual deviance can be used to test for the goodness of fit of the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. The smaller the residual deviance, the better the model fit.

We can eyeball this fit by checking whether the residual deviance is similar to the residual degrees of freedom (it should be, because the variance and the mean should be the same). If the residual deviance is *larger* than the residual degrees of freedom, this difference suggests that there is extra, unexplained variation in the response (i.e., we have a case of overdispersion). *Overdispersion* can be compensated for by refitting the model using `family = quasipoisson` rather than Poisson errors.
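Such a refit would look like this (a sketch only; `m1q` is our name, and this particular data set does not in fact show overdispersion, so it is purely illustrative):

```r
# Refit with quasi-Poisson errors: the coefficient estimates are unchanged,
# but the standard errors are inflated by the estimated dispersion parameter
m1q <- glm(num_awards ~ program + math_score,
           data = awards, family = quasipoisson)
summary(m1q)$dispersion   # the estimated dispersion parameter
```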

A better approach is to explicitly test for a difference. We can do this using a chi-squared goodness of fit test, with `pchisq()`. This function returns a p-value for a specific value (`q`) and degrees of freedom (`df`) of the chi-squared distribution. By default, `pchisq()` gives the proportion of the distribution to the left of the value. To get the proportion more extreme than the difference, you can specify `lower.tail = FALSE` (or subtract the result from 1).

If the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data.

Run a chi-squared goodness of fit test on the residual deviance of `m1`, using `pchisq()`. Make sure that you extract the deviance and degrees of freedom from the model (`deviance` and `df.residual`), rather than entering the values manually.

`pchisq(m1$deviance, m1$df.residual, lower.tail = FALSE)`

`## [1] 0.6182274`

We can also use a goodness of fit test to compare this full model with a simpler model, such as one that only includes `$math_score`. Run a new model, `m2`, that models the number of awards as a function of `math_score` only.

`m2 <- glm(num_awards ~ math_score, data = awards, family = 'poisson')`

Now we can test the overall effect of `$program` by comparing the deviance of the full model with the deviance of this new model. Use the `anova()` function, with `test = 'Chisq'`, to compare models `m1` and `m2`.

`anova(m2, m1, test = "Chisq")`

```
## Analysis of Deviance Table
##
## Model 1: num_awards ~ math_score
## Model 2: num_awards ~ program + math_score
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 198 204.02
## 2 196 189.45 2 14.572 0.0006852 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The two degree-of-freedom chi-square test indicates that `$program` is a statistically significant predictor of `num_awards`.

As with the binomial glm, we cannot just use the coefficient estimates to plot the fitted line with `abline()`.

We first need to calculate and store predicted values for each row in the original dataset (we have enough data here that, at least for our purposes, we do not need to generate a new vector of math_scores).

Use the `predict()` function to add a new column (`$pred`) to the `awards` data object for the predicted values from the original model, `m1`. In this case, we do not need to provide any `newdata`. Further, to get predicted *counts* (rather than log counts), we need to set `type = "response"`.

`awards$pred <- predict(m1, type = "response")`

Now we can make a plot of the data and fitted lines. Think about how we would plot both a continuous and a categorical predictor variable. A good way would be to have the continuous variable on the x-axis, and different points/lines for each level of the categorical variable.

To plot the results, we need to first order the data by `$program` and `$math_score`, so that the lines we plot will be smooth curves and not go all over the plot. We can do that using the `order()` function within the square brackets, as we did before:

`awards2 <- awards[order(awards$program, awards$math_score), ]`

Now, we can plot the data points (adding some jitter to the counts, because there will be some overlapping points), and then add the predicted lines. First, plot jittered `$num_awards` as a function of `$math_score`. Use `pch = 20`, and set the `col =` argument to your factor, `program`.

`plot(jitter(num_awards) ~ math_score, data = awards2, col = program, pch = 20)`
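To add the fitted curves on top of the points, one option (a sketch, assuming `awards2` and its `$pred` column were created as above) is to loop over the program levels and draw each group's predictions with `lines()`. Because `col = program` colours the points by the factor's integer codes, using the loop index as the colour makes each line match its points:

```r
# Add one fitted curve per program, coloured to match the points
for (i in seq_along(levels(awards2$program))) {
  sub <- awards2[awards2$program == levels(awards2$program)[i], ]
  lines(sub$math_score, sub$pred, col = i, lwd = 2)
}
```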