Generalized modeling approaches to risk adjustment of skewed outcomes data

https://doi.org/10.1016/j.jhealeco.2004.09.011

Abstract

There are two broad classes of models used to address the econometric problems caused by skewness in data commonly encountered in health care applications: (1) transformation to deal with skewness (e.g., ordinary least squares (OLS) on ln(y)); and (2) alternative weighting approaches based on exponential conditional models (ECM) and generalized linear model (GLM) approaches. In this paper, we encompass these two classes of models using the three-parameter generalized Gamma (GGM) distribution, which includes several of the standard alternatives as special cases: OLS with a normal error, OLS for the log-normal, the standard Gamma and exponential with a log link, and the Weibull. Using simulation methods, we find the tests for identifying distributions to be robust. The GGM also provides a potentially more robust alternative estimator to the standard alternatives. An example using inpatient expenditures is also analyzed.

Introduction

Many past studies of health care costs and their responses to health insurance, treatment modalities or patient characteristics indicate that estimates of mean responses may be quite sensitive to how estimators treat the skewness in the outcome (y) and other statistical problems that are common in such data. Some of the solutions that have been used in the literature rely on transformation to deal with skewness (most commonly, ordinary least squares (OLS) on ln(y)), alternative weighting approaches based on exponential conditional models (ECM) and generalized linear model (GLM) approaches, the decomposition of the response into a series of estimation models that deal with specific parts of the distribution (e.g., multi-part models), or various combinations of these. The default alternative has been to ignore the data characteristics and to apply OLS without further modification.

In two recent papers, we have explored the performance of some of the alternatives found in the literature. In Manning and Mullahy (2001), we compared models for estimating the exponential conditional mean—how the log of the expected value of y varied with observed covariates x. That analysis compared OLS on log transformed dependent variables and a range of GLM alternatives with log links under a variety of data conditions that researchers often encounter in health care cost data. In Basu et al. (2004), we compared log OLS, the Gamma with a log link, and an alternative from the survival model literature, the Cox proportional hazard (PH) regression (Cox, 1972). In both papers, we proposed a set of tests that can be employed to select among the competing estimators, because we found no single estimator dominates the other alternatives or is a close second best.

Our primary interest is in the marginal effect of a covariate $x_1$ on $E(y|x)$, where $x_1$ could be a treatment or behavioral variable of interest, in the context of modeling a response variable $y$ as a function of a vector $x = (x_0, x_1, x_2, \ldots, x_p)^T$ of covariates in a regression model for the mean function $\mu(x) \equiv E(y|x)$. In this paper, we again compare exponential conditional mean models where $\mu(x)$ is assumed to follow a functional form that is the exponentiation of the linear combination of covariates $x$. If $E(y|x)$ is an exponential conditional mean, then the marginal effect, which is nonlinear in $x$, is:

$$m_1(x) = \frac{\partial E(y|x)}{\partial x_1} = \beta_1 e^{x\beta}$$

where $x\beta = \sum_{k=0}^{p} \beta_k x_k$ and $x_0$ may be a vector of ones. But if we log both sides, then we can summarize the marginal effect by:

$$\frac{\partial \ln(E(y|x))}{\partial x_1} = \frac{\partial \ln(m_1(x))}{\partial x_1} = \beta_1$$

In what follows, we focus on this as a summary of the response of y to x.
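The two derivatives above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the coefficient and covariate values are hypothetical, and the helper names `ecm_mean` and `numeric_derivative` are introduced here only for illustration. It compares central finite differences with the analytic results for an exponential conditional mean.

```python
import math

def ecm_mean(x, beta):
    """Exponential conditional mean E(y|x) = exp(x'beta); x[0] is the constant 1."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))

def numeric_derivative(f, x, k, h=1e-6):
    """Central finite difference of f with respect to the k-th covariate."""
    up = list(x)
    up[k] += h
    down = list(x)
    down[k] -= h
    return (f(up) - f(down)) / (2 * h)

# Hypothetical values chosen only for illustration.
beta = [0.5, 0.3, -0.2]   # (beta_0, beta_1, beta_2)
x = [1.0, 2.0, 1.5]       # x[0] = 1 is the intercept term

# Raw-scale marginal effect: depends on x through exp(x'beta).
raw_effect = numeric_derivative(lambda v: ecm_mean(v, beta), x, 1)
analytic_raw = beta[1] * ecm_mean(x, beta)

# Log-scale marginal effect: constant and equal to beta_1.
log_effect = numeric_derivative(lambda v: math.log(ecm_mean(v, beta)), x, 1)

print(f"raw-scale effect: numeric {raw_effect:.6f}, analytic {analytic_raw:.6f}")
print(f"log-scale effect: numeric {log_effect:.6f}, beta_1 {beta[1]:.6f}")
```

The raw-scale effect changes with the point x at which it is evaluated, while the log-scale effect does not, which is why β1 serves as a convenient summary.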

In Manning and Mullahy (2001), we explored the performance of alternative least squares and generalized linear model estimators for the response of the expected value of y to a set of covariates x under a range of data generating processes. No single estimator was dominant or nearly dominant under all circumstances. But two patterns were clear. First, least squares could provide biased estimates of the mean response of the (untransformed) outcome variable if there was heteroscedasticity in the log-scale error. Second, the GLM models would be unbiased but could be quite imprecise if the log-scale error was symmetric but heavy tailed or if the log-scale error variance was large (>1). We proposed a set of tests that would allow analysts to choose among the competing exponential conditional mean (ECM) models.
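The first pattern can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of the original simulation design: the log-scale error is taken to be normal with variance 1 + x, the coefficients and sample size are hypothetical, and `ols_slope` is an illustrative helper. Under these assumptions ln E(y|x) = β0 + β1x + 0.5(1 + x), so the slope of ln E(y|x) is β1 + 0.5, whereas OLS on ln(y) recovers only β1.

```python
import math
import random

random.seed(12345)

beta0, beta1 = 0.0, 1.0   # hypothetical log-scale coefficients
n = 200_000               # hypothetical sample size

xs, log_ys = [], []
for _ in range(n):
    x = random.random()                  # x ~ Uniform(0, 1)
    sigma2 = 1.0 + x                     # assumed: log-scale variance rises with x
    eps = random.gauss(0.0, math.sqrt(sigma2))
    xs.append(x)
    log_ys.append(beta0 + beta1 * x + eps)

def ols_slope(x, y):
    """Slope from a simple least squares regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

log_ols_slope = ols_slope(xs, log_ys)

# With a normal log-scale error, ln E(y|x) = beta0 + beta1*x + 0.5*sigma2(x),
# so the slope of ln E(y|x) with respect to x is beta1 + 0.5.
true_slope = beta1 + 0.5

print(f"OLS slope on ln(y):   {log_ols_slope:.3f}")
print(f"slope of ln E(y|x):   {true_slope:.3f}")
```

The log OLS slope is a consistent estimate of the effect of x on E(ln y | x), but it understates the response of ln E(y|x) here because the variance term also moves with x.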

This paper takes a different approach. It considers the estimation of a regression model using maximum likelihood for a specific distribution—the generalized Gamma (GGM) distribution. The generalized Gamma is appealing because it includes several of the standard alternatives as special cases—OLS with a normal error, OLS for the log-normal, the standard Gamma and Exponential with a log link, and the Weibull. We see two potential advantages of implementing a regression framework based on this distribution. First, it provides nested comparisons for some alternative estimators, and hence a formal alternative to the somewhat cumbersome and incomplete testing procedure in Manning and Mullahy (2001). Second, if none of the standard approaches is appropriate for the data, then the generalized Gamma regression provides an alternative estimator that will be more efficient because it better approximates the distribution than the more restrictive alternatives.
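The nesting can be made concrete with a short numerical check. The sketch below assumes the Stacy parameterization of the generalized Gamma density, f(y) = (p / a^d) y^(d-1) exp(-(y/a)^p) / Γ(d/p), which may differ from the parameterization used in the paper; the function names and parameter values are illustrative only. It verifies that the density reduces to the standard Gamma when p = 1, to the Weibull when d = p, and to the exponential when d = p = 1. The log-normal arises only as a limiting case and is not shown.

```python
import math

def gen_gamma_pdf(y, a, d, p):
    """Generalized Gamma density (Stacy form): scale a, shapes d and p, all > 0."""
    log_f = (math.log(p) - d * math.log(a) + (d - 1) * math.log(y)
             - (y / a) ** p - math.lgamma(d / p))
    return math.exp(log_f)

def gamma_pdf(y, shape, scale):
    """Standard Gamma density."""
    log_f = ((shape - 1) * math.log(y) - y / scale
             - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_f)

def weibull_pdf(y, shape, scale):
    """Weibull density."""
    return (shape / scale) * (y / scale) ** (shape - 1) * math.exp(-(y / scale) ** shape)

def exponential_pdf(y, scale):
    """Exponential density with mean equal to scale."""
    return math.exp(-y / scale) / scale

y = 1.7  # hypothetical evaluation point
checks = {
    "Gamma (p = 1)": (gen_gamma_pdf(y, 2.0, 3.0, 1.0), gamma_pdf(y, 3.0, 2.0)),
    "Weibull (d = p)": (gen_gamma_pdf(y, 2.0, 1.5, 1.5), weibull_pdf(y, 1.5, 2.0)),
    "Exponential (d = p = 1)": (gen_gamma_pdf(y, 2.0, 1.0, 1.0), exponential_pdf(y, 2.0)),
}
for name, (general, special) in checks.items():
    print(f"{name}: generalized Gamma {general:.6f}, special case {special:.6f}")
```

Because each special case corresponds to a restriction on the shape parameters, a test of that restriction within the fitted generalized Gamma model is a nested comparison.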

The plan for the paper is as follows. In the next section, we describe the generalized Gamma distribution in greater detail, showing the connection of the GGM regression framework to more commonly used estimators. Section 3 describes the general modeling approaches that we consider, and our simulation framework. Section 4 summarizes the results of the simulations and examines an application: a study of inpatient expenditures that we have used in previous papers. The final section contains our discussion and conclusions.

Section snippets

Generalized Gamma modelling framework

We confine our discussion here to the case with strictly positive values of y to streamline the analysis. We do not address issues related to truncation, censoring, or the “zeros” aspects of data (or “part one of a two-part model”).

Methods

To evaluate the performance of the generalized Gamma estimator, we rely on Monte-Carlo simulation of how this estimator behaves over a range of data circumstances and compare it with the behavior of alternative estimators from the literature, including one that is optimal in terms of bias and efficiency for the given data generating mechanism. We consider a broad range of data circumstances that are common in health economics and health services research. They are: (1) skewness in the raw-scale

Simulation results

Table 2 provides some of the sample statistics for the dependent measure y on the raw scale across the various data generating mechanisms. As indicated earlier, the intercepts have been set so that E(y) is 1. For each case, the dependent variable y is skewed to the right and heavy tailed. Table 3 provides the results on the consistency and precision in the estimate of β1, the slope of ln(E(y|x)) with respect to x, for each of the alternative estimators for different data generating

Conclusions

In this paper, we have considered the estimation of a regression model using maximum likelihood for a specific distribution—the generalized Gamma—that includes some of the ECM estimators, notably the Gamma and the log-normal, as special cases. Using similar simulation comparisons to our two earlier papers, we find that the GGM performs well against the special cases. It handily rejects alternatives that do not apply to a specific data generating mechanism—for example, the log-normal when the

Acknowledgements

We would like to thank Mindy Drum, Alberto Holly, Joe Hilbe, Joseph Newhouse, Dan Polsky, Paul Rathouz, Frank Windmeijer, and an anonymous reviewer for the Journal of Health Economics for their help and comments. The opinions expressed are those of the authors, and not those of the University of Chicago, or the University of Wisconsin. This work was supported in part by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) Grant 1RO1 AA12664-01 A2.
