Poisson
and Negative Binomial Regression
The purpose of this session is to
show you how to use R's procedures for count models including Poisson adn
Negative Binomial Regression. We also show how to do various tests for overdispersion
and for discriminating between models.
# We use
McCullagh and Nelder's data on ship accidents.
The variables are:
# Type = Ship type
# TA, TB, TC, TD, TE = Ship Type
indicators
# T6064, T6569, T7074, T7579 = Year constructed
indicators
# O6064, O7579 = Years operated
indicators
# Months = Measure of service amount
# Acc = Accidents.
library(car)
library(stats)
library(MASS)
# Now let's read in the data. Note
that there are some missing data defined by ".". R expects
# missing data to be coded NA. So we
have changed the missing value code on input here.
rm(list = ls())
ShipAccidents <-
read.table("C:/users/B. Dan Wood/My Documents/My Teaching/Maximum
Likelihood/Data/ShipAccidents.txt",
header=TRUE, na.strings=".")
attach(ShipAccidents)
# Let's go ahead and remove the
missing values from the dataset.
detach(ShipAccidents)
ShipAccidents <-
na.omit(ShipAccidents)
attach(ShipAccidents)
# First let's do some descriptive
statistics on the data just to confirm what is
# obvious from looking at the data
(there are a lot of zeros).
summary(ShipAccidents)
library(e1071)
skewness(Acc)
kurtosis(Acc)
qq.plot(Acc, dist= "norm",
labels=FALSE)
# These are discrete data, so let's
do a histogram of their distribution.
# And, just for fun, let's also
superimpose non-parametric regression lines.
hist(Acc, nclass=n.bins(Acc),
probability=T,
ylab='Density')
lines(density(Acc), lwd=2)
points(Acc, rep(0, length(Acc)),
pch="|")
box()
lines(density(Acc, adjust=.5),
lwd=1)
# Note that the discrete data are
non-normal and cannot be approximated by a normal distribution.
# They are both skewed and kurtotic.
This implies the need for some model other than a normal
# distribution. Which model? is the
next question. We'll attempt to answer this question later.
# Let's begin by doing the
estimation using the Poisson model. Again, we use the Zelig interface
# to R. Replicating McCullaugh and
Nelder, we include months of service logged in the analysis
ShipAccidents$Logmonths <-
log(Months)
detach(ShipAccidents)
attach(ShipAccidents)
library(Zelig)
poisson.out <- zelig(Acc ~ TB +
TC + TD + TE + T6569+ T7074 + T7579 + O7579 + Logmonths,
model = "poisson", data = ShipAccidents)
summary(poisson.out)
# Zelig has kluged up the location
for some of their outputs so that we need to get the variance-covariance matrix
# from the summary, rather than the
output. So we also need to define the summary object.
poissonsummary.out <-
summary(poisson.out)
# McCullagh and Nelder assert that
the coefficient on Logmonths is known to be 1. Let's do a
# Wald test for whether the
coefficient on Logmonths is consistent with their assertion.
b <- poisson.out$coefficients
v <-
poissonsummary.out$cov.unscaled
r <-
rbind(c(0,0,0,0,0,0,0,0,0,1))
q <- 1
Wald <- t(r%*%b-q) %*% solve (r
%*% v %*% t(r)) %*% (r%*%b-q)
Wald
Waldpvalue
<- 1-pchisq(Wald, df=j)
Waldpvalue
# Based on the Wald statistic we are
unable to reject the null that the coefficients on Logmonths
# is one.
# Note, McCullaugh and Nelder assert
that the coefficient on Logmonths is known
# to be 1.
In a sense, they cheat by not taking into account the sampling variability
# of this
coefficient. However, to replicate their results we need to fix the coefficient
# on
Logmonths at 1. This can be done using the offset command. Note that you can
also
# restrict to other values than 1 by
making a scalar transformation of the variable to
# be
restricted. Note that R does not report a coefficient for the fixed parameter.
poissonmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579
+ offset(Logmonths),
model = "poisson", data =
ShipAccidents)
summary(poissonmn.out)
# Again, we
need to define a summary object for testing below.
poissonmnsummary.out <- summary(poissonmn.out)
# Let's do
some interpretation of the results. First, the hard way.
# Get the vector of coefficients and
data matrix
x <- as.matrix(cbind(1,ShipAccidents[3:6],ShipAccidents[
ShipAccidents[
y <- as.vector(Acc)
beta <- c(poissonmn.out$coefficients,1)
beta
# Predicted
accidents for each observation.
lambda <- exp(x %*% beta)
lambda
# Predicted
accidents with all independent variables at their means.
xbar <-
as.matrix(mean(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],
ShipAccidents[
Expectation <- exp(t(xbar)
%*% beta)
Expectation
# We could
use the preceding to do a first differences analysis. However, let's use Zelig
# to do the
work for us. First, set all variables at their means.
means.out <- setx(poisson.out)
means.out
# Simulate
expected value as above. Compare the results to the preceding.
fitted.out <- sim(poisson.out,
x = means.out)
summary(fitted.out)
# Simulated first difference when
shifting Logmonths one standard deviation above the mean
Logmonths.low <- setx(poisson.out, Logmonths = mean(Logmonths))
Logmonths.high <- setx(poisson.out, Logmonths = mean(Logmonths) +
sd(Logmonths))
fd.Logmonth <- sim(poisson.out,
x = Logmonths.low, x1 = Logmonths.high)
summary(fd.Logmonth)
# Now let's
test the hypothesis that the period of manufacture has no effect on accidents.
# Let's use
a Wald statistic.
r <-
rbind(c(0,0,0,0,0,1,0,0,0),c(0,0,0,0,0,0,1,0,0),c(0,0,0,0,0,0,0,1,0))
q <- c(0,0,0)
b <- poissonmn.out$coefficients
v <- poissonmnsummary.out$cov.unscaled
j <- nrow(r)
Wald <- t(r%*%b-q) %*% solve (r
%*% v %*% t(r)) %*% (r%*%b-q)
Wald
Waldpvalue <- 1-pchisq(Wald, df=j)
Waldpvalue
# We can
reject the null hypothesis that the period of manufacture has no effect.
# We can
also do this test using a likelihood ratio approach. Re-estimate the model
leaving out
# the
period effects.
poissonR.out <- zelig(Acc ~ TB + TC + TD + TE + O7579 + Logmonths,
model = "poisson", data =
ShipAccidents)
summary(poissonR.out)
names(poissonR.out)
# Now
construct the likelihood ratio test
j <- poissonR.out$df.residual -
poissonmn.out$df.residual
lu <- -.5*poissonmn.out$deviance
lr <- -.5*poissonR.out$deviance
lrtest <- -2*(lr - lu)
lrtest
lrpvalue <- 1-pchisq(lrtest, df=j)
lrpvalue
# We
generated lambda above as the model predictions. Let's also generate the set of
errors.
err <- Acc-lambda
err
# McCullagh and Nelder also assert
that the Poisson model exhibits overdispersion. Let's test
# for overdispersion using a
regression based approach as suggested by Cameron and Trivedi
z <- ((Acc-lambda)^2-Acc)/(sqrt(2)*lambda)
# First the regression on a constant
test1 <- lm(z
~ 1)
summary(test1)
# Now the
regression without a constant
test2 <- lm(z
~ lambda -1)
summary(test2)
# Since
neither t-statistic is significant, there is no evidence of overdispersion in
the data
# using a regression based approach.
# We can
also test for overdispersion using the more general approach suggested on pp.
743-44
# of Greene.
ei2 <- as.vector(err*err)
vi <- as.vector(ei2 - lambda)
vi2 <- as.vector(vi*vi)
eivi <- as.vector(err*vi)
zi <-
cbind(TB,TC,TD,TE,T6569,T7074,T7579,O7579,Logmonths)
mm <- t(zi) %*% diag(vi2) %*% zi
dd <- t(x) %*% diag(ei2) %*% x
md <- t(zi) %*% diag(eivi) %*% x
q <- mm - md %*% solve(dd)
%*% t(md)
r <- t(zi) %*% vi
cm <- t(r) %*% solve(q)
%*% r
cm
# CM is a chi square statistic with
k degrees of freedom, where k is the number of regressors.
# Here we reject the null that the
variance is unrelated to the the regressors in a way that is not
# completely accounted for by the
expected value of y. In other words, we have evidence
# of
heterogeneity due to the regressors. We also reject the null that the mean
equals the variance.
# Now let's
test for overdispersion using a LaGrange Multiplier test as given by
# Green on p. 744.This is a test for
poisson versus negative binomial.
mean(Acc)
ybar <- mean(Acc)
nybar <- nrow(x) * ybar
lmtest <- (t(err) %*% err-nybar)^2/(2*t(lambda) %*% lambda)
lmtest
# The LaGrange Multiplier statistic
is Chi Squared with 1 degree of freedom.
The test is
# non-significant,
so no evidence of overdispersion for this test.
# We can
also test for overdispersion in the context of the Negative Binomial model. If
# the reported heterogeneity
parameter is significant then this is evidence for overdispersion.
startvalues <- poissonmn.out$coefficients
negbinmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579
+ offset(Logmonths),
model = "negbin", start = startvalues,
maxit=100, data = ShipAccidents)
summary(negbinmn.out)
# In this
case the estimation does not converge and no overdispersion parameter is
reported.
# This does
not bode well for McCullaugh and Nelder's analysis.
# All of the preceding interpretational
tools using Zelig can also be applied to the
# negative
binomial model in straight forward fashion. We leave that as an exercise.
# There are
library packages in R that will also estimate other models of heterogeneityh
# in count processes such as the
zero inflated poisson or negative binomial.