Introduction to Parametric Duration Models

 

The purpose of this session is to show you how to use STATA's procedures for estimating parametric duration models. Note that we do not cover non-parametric or semi-parametric duration models which are an important part of this literature.

 

/* This file demonstrates the STATA procedures for evaluating duration data. First, let's read in the data. These are data on the duration of labor strikes taken from Greene, p. 800. The variable T is the duration of the strike. The variable PROD is the residual from a linear regression of the log of industrial production on time, time squared, and a set of monthly dummy variables. It measures aggregate industrial production less trend and seasonal components. */

 

use "C:\documents and settings\b. dan wood\my documents\my teaching\maximum likelihood\data\strikes.dta", clear

 

/* In contrast to Limdep, STATA does not expect the dependent variable to be logged. You will want a censoring variable to indicate whether or not a process is still ongoing. Below we create the censoring variable, STATUS, for all durations above 80. Note that the censoring variable can be omitted from the specifications below if there is no censoring. */

 

gen status=t<81

 

/*We need to stset the data set to alert STATA that we have duration data and want to estimate survival models*/

 

stset t, failure(status)

 

/* Let's first summarize the data */

 

summarize

 

/* Now let's just estimate some duration models with no covariates. This would be useful for exploring the duration dependence of a process. First the Exponential model */

 

streg , dist(exp) time

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the exponential function.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Let's now calculate AIC for later use in model testing and comparison. */

 

scalar aicexp = -2*e(ll)+2*(e(df_m)+1)

display aicexp

 

/*Note the distinctive hazard and integrated hazard functions for the Exponential model. This reflects the memory-less property of this distribution. Also, note that this distribution results in hazard functions that are strictly monotonic. */

 

/*Now let's estimate the same relationship with the Weibull model. Weibull is probably the most popular. The reason is it's flexibility and the fact that it encompasses the exponential. You have a built-in test for exponential, in the significance and size of p. Also, the Weibull's hazard function is more flexible in that it posits exponential change in the hazard function, which is an advantage if you think that duration

dependence is not linear as in the exponential function. Note that both the exponential and weibull disributions posit only monotonic change in the hazard function. */

 

streg , dist(weibull) time

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the weibull model.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Now let's calculate AIC for later use in model testing and comparison. */

 

scalar aicweib = -2*e(ll)+2*(e(df_m)+1)

display aicweib

 

/*STATA will also estimate the Gamma model.  Note that the GAMMA function allows non-monotonicity in the hazard function.*/

 

streg , dist(gamma) time

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the gamma model.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Now let's calculate AIC for later use in model testing and comparison. */

 

scalar aicgamma = -2*e(ll)+2*(e(df_m)+1)

display aicgamma

 

/*Generally, note the distinctively different shape of the hazard functions for the Gamma model versus the Weibull or exponential. This demonstrates that it does make a difference which model is selected. If you think the baseline hazard should be non-monotonic, then you need to specify one of the functions that allow non-monotonicity. STATA will also estimate a lognormal and loglogistic model, both of which also allow a non-monotonic hazard function. */

 

streg , dist(lognormal) time

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

more

 

/* Calculate AIC for the lognormal model. */

 

scalar aiclognorm = -2*e(ll)+2*(e(df_m)+1)

display aiclognorm

 

/* Calculate AIC for the loglogistic model. */

 

scalar aicloglog = -2*e(ll)+2*(e(df_m)+1)

display aicloglog

 

/*We can select the appropriate model using AIC. We want the model with minimum AIC */

 

display aicexp

display aicweib

display aicgamma

display aiclognorm

display aicloglog

 

/*Based on this, we find that the gamma model provides the best fit for modeling duration dependence.  However, we might want to make this determination after including covariates. Note also that this differs from the finding from LIMDEP using the F distribution encompassing test that the Weibull is the correct model. */

 

/*Another diagnostic for model specification utilizes the cumulative hazard function. A well specified model will have a cumulative hazard function that begins at zero and increases in linear fashion. For example, compare the cumulative hazard functions for the gamma versus the weibull models below.*/

 

streg , dist(gamma) time

stcurve, cumhaz

more

streg , dist(weibull) time

stcurve, cumhaz

 

/*This diagnostic suggests that the Weibull may be the best model. */

 

/*STATA allows you to include covariates in the duration model. First, let's estimate the Cox model, which makes no assumptions about the form of duration dependence, but does assume proportional hazards. We obtain estimates in both the coefficient form and the hazard ratio form.*/

 

stcox prod, nohr

stcox prod

 

/*Now let's estimate some of the parametric survival models with a covariate. As observed in class, for these models, the coefficients on covariates must be interpreted with some caution. The interpretation is relatively straightforward with an exponential, or some values of the P parameter for other distribution. Otherwise, exercise caution. */

 

/*If you are using one of the accelerated failure time distributions (i.e., exponential or weibull), then you can get the estimates in either the coefficient, hazard ratio, or time ratio metric. */

 

streg prod, dist(weibull) time

streg prod, dist(weibull)

streg prod, dist(weibull) tr

 

/*If you are using the gamma, log-logistic, or gamma models, you can have the estimates reported in either the coefficient metric or the time ratio metric, but not in a hazard ratio metric. */

 

streg prod, dist(gamma) time

streg prod, dist(gamma) tr

 

/*The preceding models are all log-linear. However, an alternative model, the Gompertz, is not log-linear. Note that with the Gompertz there is no intercept. Below we report estimates in both hazard ratio and coefficients form.*/

 

streg prod, dist(gompertz)

streg prod, dist(gompertz) nohr

 

/*STATA will also estimate models with heterogeneity. For example, here is a model that assumes heterogeneity in the coefficients based on whether the data is censored. Be patient. This one takes a while to converge.*/

 

set more off

streg prod, dist(weibull) strata(status)

 

/*Here is a model that assumes an unknown source of heterogeneity in the durations, modeled as a gamma distribution. These are the so-called frailty models. */

 

streg prod, dist(weibull) frailty(gamma)

 

/*This lesson has provided a summary of parametric models of survival. As you can see, there is considerable expertise required to use and interpret these models. A future course in survival analysis might be required to learn everything about this class of models.*/