Introduction to
Parametric Duration Models
The
purpose of this session is to show you how to use STATA's
procedures for estimating parametric duration models. Note that we do not cover
non-parametric or semi-parametric duration models which are an important part
of this literature.
/*
This file demonstrates the STATA procedures for
evaluating duration data. First, let's read in the data. These are data on the
duration of labor strikes taken from Greene, p. 800. The variable T is the
duration of the strike. The variable PROD is the residual from a linear
regression of the log of industrial production on time, time squared, and a set
of monthly dummy variables. It measures aggregate industrial production less
trend and seasonal components. */
use "C:\documents and settings\b. dan wood\my documents\my teaching\maximum likelihood\data\strikes.dta", clear
/*
In contrast to Limdep, STATA does not expect the
dependent variable to be logged. You will want a censoring variable to indicate
whether or not a process is still ongoing. Below we create the censoring
variable, STATUS, for all durations above 80. Note that the censoring variable
can be omitted from the specifications below if there is no censoring. */
gen status=t<81
/*We
need to stset the data set to alert STATA that we
have duration data and want to estimate survival models*/
stset t, failure(status)
/*
Let's first summarize the data */
summarize
/*
Now let's just estimate some duration models with no covariates. This would be
useful for exploring the duration dependence of a process. First the
Exponential model */
streg , dist(exp) time
/*
Let's now get the survival, hazard, and cumulative
hazard functions for the exponential function.*/
stcurve, surv
more
stcurve , haz
more
stcurve, cumhaz
/*
Let's now calculate AIC for later use in model testing
and comparison. */
scalar aicexp = -2*e(ll)+2*(e(df_m)+1)
display aicexp
/*Note
the distinctive hazard and integrated hazard functions for the Exponential
model. This reflects the memory-less property of this distribution. Also, note
that this distribution results in hazard functions that are strictly monotonic.
*/
/*Now
let's estimate the same relationship with the Weibull
model. Weibull is probably the most popular. The
reason is it's flexibility and the fact that it
encompasses the exponential. You have a built-in test for exponential, in the
significance and size of p. Also, the Weibull's
hazard function is more flexible in that it posits exponential change in the
hazard function, which is an advantage if you think that duration
dependence is not linear as in the exponential
function. Note that both the exponential and weibull disributions posit only monotonic change in the hazard
function. */
streg , dist(weibull) time
/*
Let's now get the survival, hazard, and cumulative
hazard functions for the weibull model.*/
stcurve, surv
more
stcurve , haz
more
stcurve, cumhaz
/*
Now let's calculate AIC for later use in model testing and comparison. */
scalar aicweib = -2*e(ll)+2*(e(df_m)+1)
display aicweib
/*STATA
will also estimate the Gamma model. Note
that the GAMMA function allows non-monotonicity in
the hazard function.*/
streg , dist(gamma) time
/*
Let's now get the survival, hazard, and cumulative
hazard functions for the gamma model.*/
stcurve, surv
more
stcurve , haz
more
stcurve, cumhaz
/*
Now let's calculate AIC for later use in model testing and comparison. */
scalar aicgamma = -2*e(ll)+2*(e(df_m)+1)
display aicgamma
/*Generally, note the distinctively different shape of the
hazard functions for the Gamma model versus the Weibull
or exponential. This demonstrates that it does make a difference which model is
selected. If you think the baseline hazard should be non-monotonic, then you
need to specify one of the functions that allow non-monotonicity.
STATA will also estimate a lognormal and loglogistic
model, both of which also allow a non-monotonic hazard function. */
streg , dist(lognormal) time
stcurve, surv
more
stcurve , haz
more
stcurve, cumhaz
more
/*
Calculate AIC for the lognormal model. */
scalar aiclognorm = -2*e(ll)+2*(e(df_m)+1)
display aiclognorm
/*
Calculate AIC for the loglogistic
model. */
scalar aicloglog = -2*e(ll)+2*(e(df_m)+1)
display aicloglog
/*We can select the appropriate model using AIC. We want the
model with minimum AIC */
display aicexp
display aicweib
display aicgamma
display aiclognorm
display aicloglog
/*Based on this, we find that the gamma model provides the
best fit for modeling duration dependence.
However, we might want to make this determination after including
covariates. Note also that this differs from the finding from LIMDEP using the
F distribution encompassing test that the Weibull is
the correct model. */
/*Another diagnostic for model specification utilizes the
cumulative hazard function. A well specified model will have a cumulative
hazard function that begins at zero and increases in linear fashion. For
example, compare the cumulative hazard functions for the gamma versus the weibull models below.*/
streg , dist(gamma) time
stcurve, cumhaz
more
streg , dist(weibull) time
stcurve, cumhaz
/*This diagnostic suggests that the Weibull
may be the best model. */
/*STATA
allows you to include covariates in the duration model. First, let's estimate
the Cox model, which makes no assumptions about the form of duration
dependence, but does assume proportional hazards. We obtain estimates in both
the coefficient form and the hazard ratio form.*/
stcox prod, nohr
stcox prod
/*Now
let's estimate some of the parametric survival models with a covariate. As
observed in class, for these models, the coefficients on covariates must be
interpreted with some caution. The interpretation is relatively straightforward
with an exponential, or some values of the P parameter for other distribution. Otherwise,
exercise caution. */
/*If
you are using one of the accelerated failure time distributions (i.e.,
exponential or weibull), then you can get the
estimates in either the coefficient, hazard ratio, or
time ratio metric. */
streg prod, dist(weibull) time
streg prod, dist(weibull)
streg prod, dist(weibull) tr
/*If
you are using the gamma, log-logistic, or gamma models, you can have the
estimates reported in either the coefficient metric or the time ratio metric,
but not in a hazard ratio metric. */
streg prod, dist(gamma) time
streg prod, dist(gamma) tr
/*The preceding models are all log-linear. However, an
alternative model, the Gompertz, is not log-linear. Note
that with the Gompertz there is no intercept. Below
we report estimates in both hazard ratio and coefficients form.*/
streg prod, dist(gompertz)
streg prod, dist(gompertz) nohr
/*STATA
will also estimate models with heterogeneity. For example, here is a model that
assumes heterogeneity in the coefficients based on whether the data is
censored. Be patient. This one takes a while to converge.*/
set more off
streg prod, dist(weibull) strata(status)
/*Here
is a model that assumes an unknown source of heterogeneity in the durations,
modeled as a gamma distribution. These are the so-called frailty models. */
streg prod, dist(weibull) frailty(gamma)
/*This lesson has provided a summary of parametric models of
survival. As you can see, there is considerable expertise required to use and
interpret these models. A future course in survival analysis might be required
to learn everything about this class of models.*/