Censoring and Truncation
The
purpose of this session is to show you how to use R's procedures for doing
censored and truncated regression. We also estimate Heckman's two-stage
procedure for samples with selection bias which is a form of incidential truncation.
# This
file demonstrates some of R's procedures for doing censored and truncated
regression.
# In
particular, we estimate a lower limit censored regression (i.e., Tobit), Cragg's model
# that
assumes a heterogenous censoring process, Heckman's incidential truncation model for
#
dealing with sample selection bias, and truncated regression.
# We
will use some of the Mroz data on female labor force
participation and income for these
# examples.
The first 428 observations of the Mroz data contained
women who worked in 1975.
# The
remaining 345 bservations contained women who did not
work. We will use only the first
# 50
observations from each of these subsets of the data.
LFP <- read.table("C:/users/B.
Dan Wood/My Documents/My Teaching/Maximum Likelihood/Data/tobit.txt",header=TRUE)
attach(LFP)
summary(LFP)
# Now
let's estimate a Tobit model and also save the log
likelihood for later testing. The dependent
# variable
(WHRS)is the wife's hours worked in 1975. The independent variables are a
constant,
# number
of children less than 6 years old (KL6), number of children between 6 and 18
(KL618),
# wife's
age (WA), and wife's education (WE).
# Currently Zelig does not support the tobit model. However, it is easily obtained using the
survival
# regression
package. We will discuss why in the next lecture period. Note that the element
# WHRS > 0 gives the
non-censored observations. The "left" option implies left censoring.
library(survival)
tobit.out <- survreg(Surv(WHRS, WHRS>0,
type='left') ~ KL6 + K618 + WA + WE,
data=LFP,
dist='gaussian')
summary(tobit.out)
# You
can obtain predicted values from the Tobit regression
as follows.
WHRShat <- predict(tobit.out,type="response")
WHRShat
# We
will also save the log likelihood for the model for later use.
lltobit <- tobit.out$loglik[2:2]
lltobit
# However,
this prediction would assume that the observations are all censored. Following
Greene,
# p. 764 the expected
value must take into account the probability of censoring. First, get the
# standard
deviation of the latent residuals.
SD <- tobit.out$scale
SD
# Now
calculate the inverse Mills ratio called lambda
lambda <- dnorm(WHRShat/SD)/pnorm(WHRShat/SD)
lambda
# Now
calculate the expected value given censoring and X
EYgivenX <- pnorm(WHRShat/SD)*(WHRShat
+ SD*lambda)
EYgivenX
# McDonald and Moffit suggest a useful decomposition of the marginal
effects associated with the
# censored regression
model. They show that a change in the conditional mean due to right side
# variables
derives from two sources:
# 1) It
affects the conditional mean in the uncensored part of the distribution
# 2) It
affects the conditional mean by also affecting the probability that an observation
will
# lie
in the uncensored part of the distribution.
xbar
<- as.matrix(mean(cbind(1,LFP[3:6])))
beta <-coefficients(tobit.out)
BXoverS <- t(beta) %*% xbar/SD
Mu <- dnorm(BXoverS)/pnorm(BXoverS)
P <- pnorm(BXoverS)
P
P1 <- P*(1-BXoverS*Mu-Mu^2)
P1
P2 <- dnorm(BXoverS)*BXoverS+dnorm(BXoverS)*Mu
P2
# Cragg
has suggested that assuming a censoring limit that depends on the same
distribution as
# the
uncensored observations is often incorrect. He suggests a two equation system
in which the
# first
equation estimates the probability of being above the censoring limit and the
second is a
# truncated regression on
the uncensored observations. Below we estimate Cragg's
model using
# Probit
and a Truncated regression procedure. Also, we do a
likelihood ratio test of whether
# Cragg's
model is significantly different than the Tobit
model.
library(Zelig)
probit.out <- zelig(LFP ~ KL6 + K618 + WA + WE,
model = "probit", data = LFP)
summary(probit.out)
llprobit <- -.5 * probit.out$deviance
llprobit
# We
will write our own procedure for the truncated normal regression on the
uncensored observations
# First,
create a new dataset using the uncensored observations.
LFPsubset <- subset(LFP,
subset=LFP>0)
x <- cbind(1,LFPsubset[3:6])
y <- as.vector(LFPsubset$WHRS)
K <- ncol(x)
K1 <- K+1
n <- nrow(x)
# Assign the lower
truncation limit
L <- 0
# Define the function to
be optimized
llik.ltruncnormal <- function(par,X,Y) {
Y <- as.vector(y)
X <- as.matrix(x)
xbeta
<- X%*%par[1:K]
Sig <- par[K1:K1]
sum( log((1/Sig)*dnorm((Y-xbeta)/Sig))-log(1-pnorm((L-xbeta)/Sig)))
}
llik.ltruncnormal
# Now
let's use the above function to estimate the model.
# Get start values from a
regular regression.
regress.out <- lm(WHRS ~ KL6 + K618 +
WA + WE, data = LFPsubset)
startvalues <- c(coefficients(regress.out),120)
# Estimate the lower
truncated regression.
ltrunc.out <- optim(startvalues,llik.ltruncnormal,
method = "BFGS", control = list(trace,maxit=100,fnscale
= -1),
hessian
= TRUE)
ltrunc.out
# Calculate standard
errors, z statistics, and pvalues and report in a
table.
v <- -solve(ltrunc.out$hessian)
se <- sqrt( diag(v))
b <- ltrunc.out$par
zstat
<-b/se
pzstat <- 2* (1 - pnorm(abs(zstat)))
table <- cbind(b,se,zstat,pzstat)
table
# Get the log likelihood
from the truncated regression.
lltruncate <- ltrunc.out$value
lltruncate
# Calculate
Cragg's test statistic
lrtest <- 2*((llprobit+lltruncate)-lltobit)
lrtest
# The
restricted model is Tobit. The unrestricted model is
the two models estimated
# separately.
The test statistic is 14.34, which is chi-squared with 5 degrees of freedom
# for
the number of additional parameters being estimated. The critical value is
11.07, so
# we
reject the null that the restricted model is true. The two equation approach is
therefore
# more
appropriate than Tobit.
# Now
let's turn to estimating a model with sample selection bias. In these cases the
# truncation
is incidental, due to sample selection on another variable that is correlated
# with
the truncation in the dependent variable. As discussed in class, the standard
model is
# Heckman's two stage
procedure. Once again, there is no "canned" procedure in R to do
# the
Heckman model. The easy thing to do in this case would be to use STATA or
LIMDEP.
# However,
for illustrative purposes we develop our own procedure in R.
# First,
estimate the first stage probit.
heckprob.out <- zelig(LFP ~ CIT + KL6, model="probit",
data=LFP)
summary(heckprob.out)
# Now
retrieve the index function from the probit
gamma <- as.matrix(heckprob.out$coefficients)
w <- cbind(1,CIT,KL6)
z <- w %*% gamma
# Now
calculate lambda and delta
LFP$lambda <- dnorm(z)/pnorm(z)
LFP$delta <- lambda *(lambda+z)
detach(LFP)
attach(LFP)
# Now
get the subset of the data that suffers from selection bias
LFPheckman <- subset(LFP,
subset=LFP>0)
# Now
run the second stage regression that includes lambda (the Mills ratio) from the
probit.
heckman.out <- lm( WHRS ~ KL6 + K618
+ WA + WE + lambda, data=LFPheckman)
summary(heckman.out)
# The
coefficients from this OLS regression are now unbiased due to the inclusion of
# the
source of the bias in the equation. However, the standard errors and
inferential
# statistics
will be incorrect due to not taking into account the sampling variability
# contained in lambda.
Also, the residuals are heteroskedastic by
definition.
# We
need to compute a corrected covariance matrix of estimates for both problems.
# First,
get the standard error of the regression corrected for selection.
blambda <- heckman.out$coefficients[6:6]
deltabar <- mean(LFPheckman$delta)
sigma <- sqrt(((t(heckman.out$residuals) %*% heckman.out$residuals)/nrow(LFPheckman))
+ deltabar*blambda^2)
# Next obtain rho in the correction rho*sigma*lambda*wgamma, Greene p. 784
rho2 <-
blambda^2/sigma^2
rho
<- sqrt(rho2)
# Next obtain an n x n
Identity matrix, the Delta matrix, and I-rho^2 Delta.
# These
are from Greene, p 785
n <- nrow(LFPheckman)
I <- matrix(0,nrow=n,ncol=n)
I[row(I)==col(I)] <- 1
Delta <- as.vector(LFPheckman$delta) * I
IminusRho2Delta <- (I -
as.numeric(rho2)*Delta)
# Now
calculate the correction factor Q
xstar
<- cbind(1,LFPheckman$KL6,LFPheckman$K618,LFPheckman$WA,LFPheckman$WE,LFPheckman$lambda)
w <- cbind(1,LFPheckman$CIT,LFPheckman$KL6)
Q <- (t(xstar) %*% Delta %*% w) %*% vcov(heckprob.out) %*%
(t(w) %*% Delta
%*% xstar)
# Now,
obtain the corrected asymptotic covariance matrix of coefficients.
vcovheckman <- as.numeric(sigma^2)
*(solve(t(xstar) %*% xstar)
%*%
(t(xstar) %*% IminusRho2Delta %*% xstar
+ Q) %*% solve(t(xstar) %*% xstar))
# Now
produce corrected standard errors, z statistics and pvalues
along with the coefficients
se <- sqrt( diag(vcovheckman))
b <- coefficients(heckman.out)
zstat
<-b/se
pzstat <- 2* (1 - pnorm(abs(zstat)))
table <- cbind(b,se,zstat,pzstat)
table
# Again,
this section was illustrative only. You will probably want to use STATA or
LIMDEP
# for
most Heckman estimations. However, the preceding demonstrates how to
# program
your own estimator. If you have an econometrics text with the matrices,
# then
R can be used to implement the estimator. As a final note, the preceding
# would have been incorrect
if taken directly from the textbook. Greene's formulation
# for
delta i at 1 on p. 784 is incorrect. So it pays to
have the original
# reference
and to check the math against multiple references.