Abstract
Objective. To compare the US EQ-5D with the UK EQ-5D and the SF-6D in patients with rheumatoid arthritis (RA). To provide mappings for each of the scales based on clinical variables.
Methods. We studied 12,424 patients with RA with 66,958 longitudinal observations using linear regression. In our mapping models we used the Health Assessment Questionnaire (HAQ) as a continuous predictor variable and as individual items. More complex models included the addition of a visual analog pain scale, the mood scale from the SF-36, and demographic and comorbidity covariates. We compared various models using root mean squared error (RMSE), in-sample and out-of-sample mean absolute error (MAE), and other measures of prediction accuracy and model fit.
Results. At any level of clinical severity, the US EQ-5D always had a higher utility score than the UK EQ-5D, and overall the US scores were 0.094 units higher. The best models explained 64% to 72% of variance in utility scores, with RMSE values of 0.07 (SF-6D), 0.11 (US EQ-5D), and 0.17 (UK EQ-5D). There was a substantial increase in predictive accuracy by using pain and mood as predictor variables in the mapping.
Conclusion. The US EQ-5D differs from the UK version and from the SF-6D in mean scores and ranges. When determined by mapping, the US EQ-5D has a much lower prediction error than the UK EQ-5D. Simple mapping models that use HAQ and pain have acceptable error rates, although more complex models that include mood scores and individual HAQ items substantially improve predictive accuracy.
- EQ-5D UTILITIES
- SF-6D UTILITIES
- HEALTH ASSESSMENT QUESTIONNAIRE
- CONVERSION
Recently, Bansback, et al proposed a method to use the Health Assessment Questionnaire disability index (HAQ) to estimate preference-based single measures of health status or utilities [1]. Almost all current assessments of utilities in rheumatology studies rely on measures that include either the EQ-5D, the Short Form-6D (SF-6D), or the Health Utilities Index II or III (HUI-II, HUI-III) [2,3]. Given the existence of one of these measures, the results of clinical trials can be transformed into quality-adjusted life-years (QALY) gained or lost as a result of the intervention, and this, in turn, can be expressed in cost-utility analyses as cost per QALY. One QALY is the equivalent of one extra year of life lived in perfect health over a specified number of years. The cost per QALY for rheumatoid arthritis (RA) biologic therapy ranges from US $40,000 to US $68,000 [4,5,6,7]. Utilities and QALY allow comparison between treatments in the same disease, for example a comparison of 2 biologics, as well as different treatments across illnesses, thus allowing healthcare economists and regulatory authorities to understand the comparative costs and benefits using a single standard.
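The arithmetic behind a cost per QALY figure can be sketched as follows. This is a minimal illustration only: the utility values, the 6-month spacing, and the US $8,000 incremental cost are hypothetical numbers chosen for the example, and the function names `qalys` and `cost_per_qaly` are ours. No discounting is applied.

```python
def qalys(utilities, years_per_period):
    """Sum utility x time over consecutive periods (no discounting)."""
    return sum(u * years_per_period for u in utilities)


def cost_per_qaly(extra_cost, qaly_treated, qaly_control):
    """Incremental cost divided by incremental QALY."""
    gain = qaly_treated - qaly_control
    if gain <= 0:
        raise ValueError("no QALY gain; the ratio is not meaningful here")
    return extra_cost / gain


# Hypothetical utilities at four consecutive 6-month periods (2 years total).
treated = qalys([0.70, 0.72, 0.74, 0.74], 0.5)
control = qalys([0.64, 0.64, 0.65, 0.65], 0.5)

# Hypothetical incremental cost of treatment over the same 2 years.
ratio = cost_per_qaly(8000.0, treated, control)
print(round(treated, 2), round(control, 2), round(ratio))
```

Because the QALY gain sits in the denominator, small shifts in the utility scale used (for example, US versus UK EQ-5D valuations) can move the ratio considerably.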
Investigators have been most interested in using the EQ-5D because of the relatively restricted range of the SF-6D and its apparently reduced responsiveness, although that finding has recently been called into some question [8,9]. The length and difficulty of administering and scoring the HUI is also somewhat limiting for that questionnaire. The EQ-5D is a 5-item questionnaire that has 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. However, 3 of the dimensions fall within the domain of function. Each of the 5 questions has 3 levels, with 1 denoting no problems and 3 indicating extreme problems [10]. The number of theoretically possible health states is 243 (3^5). The EQ-5D is commonly reported as a preference-based single index number that is derived from the answers to the 5 EQ-5D questions. This index number is obtained by applying algorithms that link the responses with average valuations obtained from the general public. The predominance of functional items in the EQ-5D suggested the possibility that EQ-5D scores could be estimated from the HAQ with sufficient accuracy for use in cost-utility analyses. This HAQ to EQ-5D mapping has been used in cost-utility analyses in multiple studies, and its basis has been analyzed and explored in detail by Bansback, et al [1].
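The count of 243 health states follows directly from 5 dimensions with 3 levels each. A short sketch that enumerates them is shown below; the 5-digit labels (such as "11111" for no problems on any dimension) are used here as a convenient notation and the variable names are ours.

```python
from itertools import product

DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = (1, 2, 3)  # 1 = no problems, 3 = extreme problems

# Every combination of one level per dimension.
states = list(product(LEVELS, repeat=len(DIMENSIONS)))

# Compact label per state, one digit per dimension in the order above.
labels = ["".join(str(level) for level in state) for state in states]

print(len(states), labels[0], labels[-1])
```

A valuation set (UK or US) assigns one index number to each of these labels; the valuation tables themselves are not reproduced here.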
There are a number of practical problems with the use of utilities. Most importantly, Marra, et al have shown in 313 patients with RA that the agreement among 4 different utility scales was poor, and if applied to cost-utility analyses would yield quite different cost/QALY results [11]. A second issue that concerns the EQ-5D is that all RA studies that utilize that questionnaire have been based on the scoring algorithm derived from UK weightings, including Canadian and UK studies. But Johnson, et al reported that the average difference in valuations between US and UK EQ-5D was 0.10, with higher scores being found in the US EQ-5D [12]. They also reported that “the magnitude of the difference in the US and UK valuations was not constant across EQ-5D health states; greater differences in valuations were present in health states characterized by extreme problems.” Their recommendation that “EQ-5D index scores generated using valuations from the US general population should be used for studies aiming to reflect health state preferences of the US general public” would create problems in interpreting multinational studies and in the comparison of results of observational studies that used the different valuations.
While the Bansback study [1] developed predictive models for the UK EQ-5D, they did not address the US valuations. In addition, models that predict utility scores only from the HAQ cannot adequately address the contribution of pain and mood. Very low utility scores and states “worse than death” derive from the contribution of pain and mood [13].
In this report, we provide data that compare valuations of the US EQ-5D, UK EQ-5D, and SF-6D scales at all levels of the HAQ, as well as at important levels of RA outcomes. In addition we describe the differences between the scales at different levels of RA and HAQ severity. Based on a sample size of 12,098 patients with 63,406 observations, we provide a series of maps via regression algorithms that convert HAQ, and HAQ, pain and mood scores, to US and UK EQ-5D and SF-6D results.
MATERIALS AND METHODS
Patients
We used the National Data Bank for Rheumatic Diseases (NDB) longitudinal study of RA outcomes [14,15] to evaluate utility scores, mapping predictors, and the association of clinical outcomes with utility scores. Patients in this study were diagnosed and referred to the NDB by US rheumatologists. They received no compensation for participation. Patients who were referred to the NDB to be participants of drug safety registries were excluded from analysis because they might have been selected because of the severity of their RA. At 6-month intervals, patients completed complex survey questionnaires by mail or by the Internet. Administration of the SF-6D and the EQ-5D began simultaneously in the NDB assessment of July 2002. The ending date for the current report was the questionnaire of January 2009.
Utilities measures
The EQ-5D, described in some detail above, is a 5-item questionnaire that has 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. From the 243 possible health states derived from 5 questions and 3 levels, a single index number for each state is obtained based on valuations obtained from persons in the general population. In the UK these valuations were based on a population survey of 3,995 persons in the UK using 10-year time tradeoffs, and published in 1995 [10]. The valuations were used widely in the US, Canada, and the UK, and are the valuations used in clinical trials [4,5,6,7]. We refer to this version of the EQ-5D as the UK EQ-5D. The worst UK EQ-5D score observed in the current study was −0.59.
In 2005, US valuations of the EQ-5D health states first became available for 42 common health states based on time tradeoff, and were then expanded to all 243 states by regression analysis [12,16]. US EQ-5D scores are known to be higher than UK scores [12]. However, the US EQ-5D has not been studied previously in RA. The lowest US EQ-5D score observed in the current study was −0.11.
The third utility measure studied in this report was the SF-6D [17]. First reported in 2002, it utilizes 11 questions from the SF-36 [18] to create 6 domains that define 18,000 health states. Valuations for a sample of 249 of these health states were obtained from 836 persons in the UK general population using the standard gamble method. The worst score observed in the current study was 0.34.
Covariates and predictor variables
We used a series of predictor variables to estimate EQ-5D values (Table 1). The HAQ disability index is a widely used measure of functional status across rheumatic diseases [19,20]. The HAQ consists of 20 items (scored 0–3) in 8 different domains. Each domain contains 2 to 3 questions based on a common theme: dressing, standing, eating, walking, toileting, reach, grip, and instrumental activities. A score is derived for each domain based on the most abnormal item in that domain. In addition, the use of 14 aids and devices to help with function is taken into consideration in the item scoring by increasing all item scores at the level of “with no difficulty [0]” or “with some difficulty [1]” to “with much difficulty or with help [2]” if an aid or device is used with the item. The final HAQ score ranges from 0 to 3 and is the average score of the 8 categories.
The HAQ-II is a reliable and valid 10-item questionnaire that provides scores in the range 0–3, performs at least as well as the HAQ, and is simpler to administer and score [21]. Psychometrically improved, it has a reduced floor effect, and scores that are very similar to those of the HAQ, thus allowing comparison of group data using the HAQ and HAQ-II [22]. The HAQ-II can be substituted for the HAQ in clinical care [22]. Because the HAQ-II has not been used widely in clinical trials, and details of its performance would double the length of this report, we do not give HAQ-II results here, except as a brief summary at the end of the Results section.
We assessed pain using a 21-point 0–10 visual analog scale (VAS) in which higher values indicated more pain. Mood was assessed using the mood (mental health) scales of the SF-36 [23]. Comorbidity was measured by a patient-reported composite comorbidity index (range 0–9) consisting of 11 present or past comorbid conditions including pulmonary disorders, myocardial infarction, other cardiovascular disorders, stroke, hypertension, diabetes, spine/hip/leg fracture, depression, gastrointestinal (GI) ulcer, other GI disorders, and cancer [24,25].
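The HAQ scoring rule described above can be sketched in code. This is an illustrative reading of the text, not the official scoring program: the allocation of items to domains in `DOMAIN_SIZES` (which sums to 20) is an assumption, the function name `haq_score` is ours, and missing-item handling is not covered.

```python
# Assumed number of items per domain, in questionnaire order (sums to 20).
DOMAIN_SIZES = {
    "dressing": 2,
    "standing": 2,
    "eating": 3,
    "walking": 2,
    "toileting": 3,
    "reach": 2,
    "grip": 3,
    "activities": 3,
}


def haq_score(items, aid_used):
    """Return the HAQ score (0-3) from 20 item scores and 20 aid flags.

    items    -- item scores 0-3, ordered by domain as in DOMAIN_SIZES
    aid_used -- booleans, True if an aid or device is used with the item
    """
    if len(items) != 20 or len(aid_used) != 20:
        raise ValueError("expected 20 item scores and 20 aid flags")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("item scores must be 0, 1, 2, or 3")

    # An aid or device raises an item score of 0 or 1 to 2.
    adjusted = [2 if aid and score < 2 else score
                for score, aid in zip(items, aid_used)]

    # Each domain takes the score of its most abnormal (highest) item.
    domain_scores = []
    start = 0
    for size in DOMAIN_SIZES.values():
        domain_scores.append(max(adjusted[start:start + size]))
        start += size

    # The final score is the average of the 8 domain scores.
    return sum(domain_scores) / len(domain_scores)


# One eating item scored 1, and an aid used with one walking item scored 0.
example_items = [0] * 20
example_items[4] = 1
example_aids = [False] * 20
example_aids[7] = True
example_score = haq_score(example_items, example_aids)
print(example_score)
```

Because the score is an average of 8 integers from 0 to 3, it moves in steps of 0.125, which gives the 25 possible values from 0 to 3.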
The levels of formal education were categorized as 0–8, 9–11, 12, 13–15, and ≥ 16 years. Based on preliminary evaluations, we dichotomized RA duration as < 8 years and ≥ 8 years.
Validation: outcome variables
To characterize the “clinical significance” of differences in utility scores across the 3 utility measures, we compared utility scores at levels of the following clinical measures. Self-rated current health was obtained with a question from the SF-36 questionnaire: “In general, would you say your health is: Excellent, Very Good, Good, Fair, Poor.” Disability status (able to work) was determined by self-report. This measure is a valid, broader measure than an assessment of receipt of work disability pension because it also assesses disability in nonworkers, particularly homemakers and those past the retirement age [26]. Total joint replacement (Yes or No) measures the influence of chronic RA severity and activity, as joint replacement is the end product of RA activity. Total direct medical costs, adjusted to 2007, were calculated from hospitalization, treatment, and utilization data as described [27]. Comorbidity: We used self-reported comorbidities to compute a composite comorbidity index (range 0–9) comprising 11 present or past comorbid conditions including pulmonary disorders, myocardial infarction, other cardiovascular disorders, stroke, hypertension, diabetes, spine/hip/leg fracture, depression, GI ulcer, other GI disorders, and cancer [24,25,28]. Widespread pain index (WPI): In this index patients indicate in which of 19 body areas they had pain during the last week. These areas were those previously described as part of the Regional Pain Scale, now renamed the Widespread Pain Index (WPI) [29]. The WPI is a measure of the degree of widespread pain, and is strongly correlated with poor health status. Fibromyalgia, as measured by survey fibromyalgia criteria [30], is associated with very poor health status.
Predictive models of US EQ-5D, UK EQ-5D, and SF-6D
To predict utilities from surrogate measures, we used linear regression techniques for analysis of each of the utility scales. We also performed analyses using the HAQ-II instead of the HAQ. A central issue for the study analyses was which functional form of the HAQ or HAQ-related variables was best. Although we modeled the HAQ using fractional polynomial regression in preliminary analyses, a fractional polynomial functional form was not an improvement over other forms, and we did not include fractional polynomial regression in the output tables. We used the HAQ as a single continuous variable (HAQ score) and as a categorical variable of 25 categories (0, .125, .25, .375, etc.). However, the categorical form did not perform better than the continuous form, and we elected to use only the continuous form in followup analyses of Table 4. Bansback, et al found that treating 8 HAQ domains as categorical variables provided a useful model [1], and we used categorical HAQ domains as one of our functional forms in initial analyses. Finally, we also used the 20 categorical HAQ items, as did Bansback [1]. In the 20 categorical HAQ items analyses we incorporated the contribution of HAQ aids and devices sections into the item scores and the domain scores, and did not analyze them separately. Domains were not used in HAQ-II analyses, as the HAQ-II does not create domains.
In additional analyses we added covariates. We used the 21-step (0–10) VAS pain scale as a continuous scale, and similarly employed the mood scale as a continuous variable. We considered these variables as primary covariates, as the EQ-5D questionnaire has one item for pain and an additional item for mood. We added the comorbidity index as a categorical variable because we believed that the effect of comorbidity on utility scores might offer information that might not be picked up by the HAQ and other covariates. In addition, we adjusted for age, age-squared, sex, RA duration, and education level. Finally, in many preliminary models we incorporated interaction terms between sex and HAQ and sex and duration. These were not included in final analyses because their effect was mostly nonsignificant, complicated model use, and did not add to overall prediction accuracy.
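The simplest of the mapping models described above, a linear regression of utility on the continuous HAQ score, pain, and mood, can be sketched as follows. The data are synthetic and noise-free, the coefficients in `true_beta` are arbitrary values chosen for the illustration and are not the study coefficients, the 0–10 range assumed for the mood score is a placeholder, and NumPy least squares stands in for the Stata regression used in the study.

```python
import numpy as np


def fit_ols(predictors, utility):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(utility)), predictors])
    beta, *_ = np.linalg.lstsq(design, utility, rcond=None)
    return beta


def predict(beta, predictors):
    """Mapped utility for each row of predictors."""
    design = np.column_stack([np.ones(len(predictors)), predictors])
    return design @ beta


rng = np.random.default_rng(0)
n = 500
haq = rng.integers(0, 25, n) * 0.125   # 25 possible values, 0 to 3
pain = rng.integers(0, 21, n) * 0.5    # 21-point scale, 0 to 10
mood = rng.uniform(0.0, 10.0, n)       # assumed range, for illustration only
predictors = np.column_stack([haq, pain, mood])

# Arbitrary coefficients (intercept, HAQ, pain, mood); NOT from the study.
true_beta = np.array([0.95, -0.10, -0.02, -0.01])
utility = predict(true_beta, predictors)

beta_hat = fit_ols(predictors, utility)
print(np.round(beta_hat, 3))
```

The categorical forms (25 HAQ categories, 8 domains, or 20 items) would replace the single HAQ column with indicator columns; the fitting step is unchanged.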
Model selection
We evaluated each model statistically and graphically. In particular we used quantile-quantile plots to evaluate how well the predicted utility followed the observed utility. In our primary analyses we utilized one randomly selected observation from each of the 12,424 patients. However, while there were 12,424 HAQ and utility scores, there were only 10,895 patients who completed all the 20 HAQ items. Therefore, so that all models would use the same sample, we restricted analyses to the 10,895 patients with complete data.
To test out-of-sample error and to evaluate changes over time, we used a second set of 8,669 observations, one per patient, obtained 6 months after the primary observation. Only 8,669 observations were used because not all patients had 2 consecutive observations within the 6-month window.
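The in-sample versus out-of-sample comparison can be illustrated with a small simulation. Everything here is synthetic: the linear relation, the noise level, and the way the second wave is generated are assumptions made for the sketch, and a one-predictor model is used for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)


def simulate_utility(haq, rng):
    """Synthetic utility: an arbitrary linear relation plus random noise."""
    return 0.9 - 0.15 * haq + rng.normal(0.0, 0.1, haq.size)


n = 2000
# First (development) observation per patient.
haq_first = rng.integers(0, 25, n) * 0.125
util_first = simulate_utility(haq_first, rng)

# Second observation 6 months later: HAQ drifts by at most one step.
drift = rng.choice([-0.125, 0.0, 0.125], n)
haq_second = np.clip(haq_first + drift, 0.0, 3.0)
util_second = simulate_utility(haq_second, rng)

# Fit on the first wave only (polyfit returns slope, then intercept).
slope, intercept = np.polyfit(haq_first, util_first, 1)


def mean_abs_error(haq, utility):
    """MAE of the fitted mapping on the supplied observations."""
    return float(np.mean(np.abs(utility - (intercept + slope * haq))))


ismae = mean_abs_error(haq_first, util_first)    # in-sample MAE
osmae = mean_abs_error(haq_second, util_second)  # out-of-sample MAE
print(round(ismae, 3), round(osmae, 3))
```

When the two values are close, as they are in this simulation by construction, the model is not overfitted to the development sample.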
As we suspected that many models might be useful clinically, it was not our goal to select the best model. Instead, we described each model in terms of its predictive accuracy and fit. For predictive accuracy at the group level, we used the root mean squared error (RMSE) and the mean absolute error (MAE), and at the patient’s level we used the Bland-Altman limits of agreement (LOA) statistic [31] and the correlation between observed and predicted values. The RMSE, also known as the standard error of the estimate (SEE), is the square root of the average squared prediction error. The RMSE favors prediction models that do not produce particularly large errors [1]: “The RMSE attaches greater weight to larger errors and favors prediction models that do not produce particularly large errors at the expense of models that are off by a modest amount.” [32] The MAE represents the average absolute difference between the actual and predicted utility scores. We used the MAE to determine “in-sample” and “out-of-sample” errors. Lower error scores indicate better prediction models. RMSE and MAE should be used in analyses of individual measures (e.g., US EQ-5D) and not used to compare different measures (e.g., US EQ-5D vs UK EQ-5D vs SF-6D). To evaluate model fit, we determined the adjusted R-square, Akaike information criterion (AIC), and Bayesian information criterion (BIC). Higher values indicate better fit for the R-square, and lower values a better fit for the AIC and BIC. Data were analyzed using Stata version 11.0 (Stata Corp., College Station, TX, USA).
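The accuracy and fit statistics named above can be written out as a short sketch. The function names and the four example values are ours and purely illustrative. The RMSE here divides by the number of observations, following the verbal definition in the text; regression software that reports a root MSE based on residual degrees of freedom will give slightly different values.

```python
import numpy as np


def rmse(observed, predicted):
    """Square root of the average squared prediction error."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(observed, predicted):
    """Average absolute difference between observed and predicted."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(err)))


def limits_of_agreement(observed, predicted):
    """Bland-Altman 95% limits: mean difference +/- 1.96 SD of differences."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    half_width = 1.96 * np.std(diff, ddof=1)
    return float(np.mean(diff) - half_width), float(np.mean(diff) + half_width)


def adjusted_r2(observed, predicted, n_predictors):
    """R-square penalized for the number of predictors in the model."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    n = observed.size
    return float(1.0 - (ss_res / ss_tot) * (n - 1) / (n - n_predictors - 1))


# Illustrative values only.
observed = [0.8, 0.6, 0.4, 0.2]
predicted = [0.7, 0.7, 0.5, 0.1]
lower, upper = limits_of_agreement(observed, predicted)
print(rmse(observed, predicted), mae(observed, predicted), lower, upper)
```

The RMSE and MAE summarize error at the group level, whereas the limits of agreement describe how far an individual patient's mapped value may lie from the observed value.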
The study was approved by the Via Christi Institutional Review Board, Wichita, Kansas.
RESULTS
The study data were derived from patients with RA with a median duration of RA of 12.8 years. The mean age was 61.2 (SD 13.0) years, and 21.4% of participants were men (Table 1). Four of the 5 EQ-5D item variables were almost binary (Table 1). For example, “Confined to bed” was endorsed by only 0.3% and “Unable to wash/dress myself” by 0.4%. By contrast, HAQ (1.03, SD 0.73), mood (2.7, SD 1.8), and VAS pain scores (3.8, SD 2.7) had wide variability.
Values for the key scales were SF-6D 0.69 (SD 0.13), UK EQ-5D 0.64 (SD 0.28), US EQ-5D 0.73 (SD 0.19), and HAQ 1.03 (SD 0.73). The mean difference between UK and US EQ-5D scores was 0.094 units. The observed range of the SF-6D was 0.34 to 1.00, with only 5.0% of scores < 0.5. The range of UK EQ-5D was −0.59 to 1, with 15.0% of scores < 0.5. The US EQ-5D ranged from −0.11 to 1, and 14.0% of the scores were < 0.50. Thus the UK EQ-5D scores are shifted to the left and the scale has a lower limit compared with the US EQ-5D.
The consequences of these distributional differences can be seen in Figure 1, where mean utility scores are plotted at each level of HAQ score. Although the SF-6D aligns closely with the UK EQ-5D at HAQ values up to 1.0, the curves diverge after that. The observed minimum mean score of the SF-6D is 0.50, compared with 0.23 observed for the US EQ-5D and −0.06 for the UK EQ-5D. Scores were always higher (“better”) for the US EQ-5D compared with the UK EQ-5D, and the difference in scores increased with increasingly more extreme HAQ scores.
To study the relation between clinical scores and utilities, and the relation between change in scores over 6 months, we utilized a correlation matrix of the key study variables (Table 2). HAQ and pain were correlated with the 3 utility scores at values between 0.625 and 0.681; slightly lower correlations were noted with mood. The correlation between the SF-6D and US EQ-5D was 0.689, and between SF-6D and UK EQ-5D was 0.673. We also examined the correlation between changes in the various scores in questionnaires administered 6 months apart. HAQ change correlated with US EQ-5D change −0.300, UK EQ-5D change −0.289, and SF-6D change −0.258. Pain change correlated with US EQ-5D change −0.363, UK EQ-5D change −0.364, and SF-6D change −0.258. The change in SF-6D was correlated with US EQ-5D change at 0.260 and UK EQ-5D change at 0.250.
We also examined mean utility scores for important clinical subgroups, as shown in Table 3. In conditions when patients were severely affected — as in “poor health” and high regional pain scores — UK EQ-5D scores were very low compared with other scale scores, and SF-6D scores were unable to become low enough to adequately represent the adverse health condition. For example, patients reporting “poor health” had SF-6D scores of 0.53, US EQ-5D scores of 0.45, and UK EQ-5D scores of 0.21. For RA patients with the highest (most abnormal) WPI values, the associated utility scores were 0.63, 0.46, and 0.23; and for fibromyalgia occurring in RA, the scores were 0.57, 0.55, and 0.46. In terms of agreement with each other, the SF-6D and US EQ-5D were similar for “disabled,” “total joint replacement,” high levels of comorbidity, and survey fibromyalgia. Overall, the SF-6D and US EQ-5D appear often to identify similar groups, as suggested by Figure 1, particularly in comparison with the UK EQ-5D.
Mapping the US and UK EQ-5D and the SF-6D
To develop a predictive model, and usable predictive results, we took several approaches. First, we attempted to predict each of the 5 EQ-5D item results by ordered and binary logistic regression using the HAQ items to predict the 3 functional questions, the VAS pain score to predict the pain question, and the SF-36 mood score to predict the anxiety and depression question. This method proved unsuccessful because of very high rates of misclassification of each item. Such results might have been expected by the distribution of the EQ-5D components that, except for pain, are almost binary, and incapable of identifying the nuances of function, pain, and mood (Table 1).
We then turned to linear regression to model the relationship between utility scores and predictor variables. The method of approach is illustrated in detail for the US EQ-5D (Table 4), and in slightly less detail for the UK EQ-5D and the SF-6D. Results of the analyses of Table 4, in terms of beta coefficients that can be used for clinical prediction, are presented in Table 5 and Appendix 1. In Table 4, each additional model is generally shown to provide better fit (adjusted R-square) and predictive accuracy (RMSE, MAE, and for the US EQ-5D: AIC, BIC, Pearson correlation, and LOA). As a measure of the reliability of the study models, we examined the MAE in the development data set (in-sample MAE, ISMAE) and in a second data set (out-of-sample MAE, OSMAE). As the ISMAE and OSMAE results were virtually indistinguishable in each of the 10 models, we present all other data from the primary, in-sample models.
We assumed that the modeling results of this study might be used under conditions when only HAQ data are available for prediction, or where HAQ and pain data are available for prediction, or where HAQ, pain, and mood data are available for prediction. So we provided analyses to cover each of these uses. If only HAQ data are available, using all 20 HAQ items is superior to using just the HAQ score, assuming the individual HAQ items are available. This observation is true across the 3 utility measures. The data indicate that it is always much better to use a model that includes pain (HAQ + Pain section). Better fit and accuracy is obtained when mood scores are added, although the incremental benefit of the addition of this variable is relatively small. In all cases, using the 20 HAQ items improves fit and accuracy. Differences between the models and model improvement can be seen clearly by observing the AIC, BIC, correlation, and LOA changes.
Prediction of the UK EQ-5D was less satisfactory than prediction of the US EQ-5D or the SF-6D. The RMSE was more than double for the UK EQ-5D compared with the SF-6D in all models, and somewhat less than double for the US EQ-5D compared with the SF-6D. The SF-6D R-square improves substantially with the addition of the mood question. This might be expected to happen as the mood questions are (partially) included in the SF-6D.
Much of the improvement in EQ-5D models that is noted by using HAQ items and pain and mood occurs at lower EQ-5D levels, as shown in Figure 2. Model fit and accuracy deteriorate at levels below 0.5. However, only 14% of US and 15% of UK EQ-5D scores are lower than 0.5. Little is to be gained by adding covariates such as age, sex, RA duration, education, and comorbidity. All these variables are usually not reported in studies or are not available in the forms used in this study.
HAQ-II
Although not specifically reported here, the RMSE of the HAQ-II was 0.11 and the adjusted R-square was 0.65 in the HAQ-II item plus pain and mood mapping to the US EQ-5D, and was 0.17 and 0.62 mapped to the UK EQ-5D. Thus, the HAQ-II is almost identical in its predictive ability compared with the HAQ. Specific model details are available from the authors.
DISCUSSION
The EQ-5D and the SF-6D measures used in rheumatic diseases have been based on preference valuations or weights developed in the UK. North American studies, performed mostly in Canada, have also used the UK preference weightings. The UK EQ-5D weights were described in 1996 [10], and the SF-6D was first reported in 2002 [17]. The US EQ-5D preference weights were published in 2005 [16], but have not been reported previously in patients with rheumatic disease. With the publication of the US EQ-5D weights, it was noted that the mean US EQ-5D score was 0.11 units higher than the mean score of the UK EQ-5D [12], and that differences between the 2 measures were most profound in individuals with poor health status.
Marra, et al showed that use of the HUI-2, HUI-3, UK EQ-5D, and SF-6D in patients with RA led to different utility scores and, when applied to cost utility analyses, resulted in different QALY depending on which scale was used [11,33]. We found that for any RA health status state (Table 3), the US EQ-5D always had a higher (better) utility score than the UK version. Overall, we found that the US scores were 0.094 units higher than the UK EQ-5D scores. When the 3 scales were compared using the HAQ as an anchor (Figure 1), the US EQ-5D scores were higher than the SF-6D scores from HAQ values of 0 to 1.75. Thereafter, SF-6D scores were higher owing to the limited scaling of the measure.
In addition to the absolute differences between the UK EQ-5D and the SF-6D scores, the utility score changes after an intervention were larger when the UK EQ-5D was used, compared with the SF-6D. The absolute change differences and responsiveness for the UK EQ-5D and the SF-6D may also depend on baseline RA severity and whether there is improvement or worsening of the clinical state [9] (and Michaud and Wolfe, unpublished data). The US EQ-5D has not yet been studied with respect to changes observed in clinical trials, but it is likely that they will have an intermediate position between the UK EQ-5D and the SF-6D. The above observations present problems in 3 respects: (1) the validity of utility measures, given that they yield different results; (2) the sensitivity of cost utility analyses to the utility measure selected; and (3) the problem of how data should be analyzed when patients in multinational studies are assessed.
The use of mapping of clinical variables to utility variables came about when it was recognized, retrospectively, that economic analyses were valuable, but formal utility scales had not been administered. A wide variety of predictor variables have been mapped in different illnesses [34]. In rheumatic diseases, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) has been mapped to the UK EQ-5D [35,36] and the HUI-3 [32]. In RA, mapping efforts have used the HAQ to predict UK EQ-5D and SF-6D scores. Bansback, et al provided the first careful and detailed analyses of the HAQ as a predictor of the UK EQ-5D and SF-6D [1]. They found that use of selected “domains” from the HAQ best predicted UK EQ-5D and SF-6D scores, with RMSE values of 0.183 and 0.089 and R-squares of 0.57 and 0.51, respectively. A recent study by Amadi, et al found that a HAQ mapped SF-6D score was valid and responsive in early RA [37].
There are essentially 4 approaches to using the HAQ as a predictor: (1) using the calculated final HAQ score in a linear regression; (2) using the final scores in a nonlinear regression or a fractional polynomial regression; (3) using HAQ domains (as used by Bansback, et al); or (4) using individual HAQ items as categorical predictors. In agreement with Bansback, et al, we found domains to be superior to the HAQ score or to a nonlinear application of the HAQ score (Table 4). However, we found that individual HAQ items were the best predictors when the RMSE, MAE, R-square, and other model and prediction statistics were considered (Table 4). Using individual items is burdensome (Appendix 1), although calculable by computer. Another limitation is that often only mean HAQ scores are available from published studies, therefore eliminating the use of individual items as a potential means of predicting utility values. But if study variable data are available, there is considerable advantage to the item method (Tables 4 and 6).
In almost all settings where HAQ is available, a VAS pain scale is also available. There are substantial gains in model accuracy and fit by using the HAQ and pain scores together to predict utilities (Tables 4 and 6). The model improves further by the incorporation of the SF-36 mental health domain score, but this score, although often a part of clinical trial data, is not ordinarily reported in primary trial results. Similarly, some additional improvement in prediction and fit can be obtained by incorporating other covariates such as age, sex, education, comorbidity, and duration of RA (Table 4). However, the added improvement is small and these covariates are not always available (education) or are not collected using a common scale (comorbidity). These results lead us to recommend the continuous HAQ and VAS pain scale (and mood scale, if available) when item data are not available, and the more complex HAQ items as a substitute for the HAQ summary score, if available. The mapping of the utility measures described here expands on methods currently available, and should represent an improvement in validity.
The initial report of clinical relevance of the UK EQ-5D came from Hurst, et al in 1997 [38]. They found that the EQ-5D demonstrated “moderate to high correlations with measures of impairment and high correlations with disability measures,” and was reliable and valid. Recent studies that included change data have confirmed these findings [9]. However, transformation of clinical data to EQ-5D is not without problems. As shown in Table 1, 73.8% of patients endorsed the EQ-5D category of “moderate pain or discomfort.” The VAS pain score for those in that category was 3.7 (SD 2.3), indicating large variance and the inability of the EQ-5D to determine important clinical differences. Similarly, the EQ-5D category of “I have no problems walking about” represents a crude clinical measure.
Scott, et al raised the issue of “real clinical concern” over the use of utility indices to measure the outcome of clinical care [39]. In addition and in particular, they stated that “HAQ and EuroQol are demonstrably not equivalent, [and] economic evaluations of treatment cost effectiveness should not be based on EuroQol data transformed from HAQ.” They noted in their study of 56 patients that 6-month “...changes in HAQ and EuroQol were unrelated (r = 0.08),” while the correlation between changes in the EQ-5D and changes in pain was 0.54. While we found 6-month changes in the HAQ to correlate with changes in the US EQ-5D at r = −0.300 and changes in the US EQ-5D to correlate with changes in pain at r = −0.363 (Table 2), the concerns of Scott, et al are important and reflect the ongoing tensions between clinical measurement and patient and societal valuation [40,41].
They also reflect the omnipresent but often unspoken problem of the use of clinical data for administrative decisions at the level of the patient, particularly in the face of measurement error. As shown in Table 4, which measures the difference between actual (observed) and predicted (mapped) values, the best case Bland-Altman LOA was ± 0.204 units and the worst case was ± 0.270 units. Thus, if mapped values are applied at the individual patient level, an unreliable estimate of the actual health state may be obtained, and these differences do not even consider HAQ measurement error. Most cost-effectiveness studies, however, do not use patient-level data, and the RMSE levels found in our study are acceptable for group use. With respect to mapping of EQ-5D data, incorporation of pain, and possibly of mood, provides additional assurance of utility scores that correlate with clinical experience.
A case can be made that the use of any of the mapped models is acceptable. However, all things being equal, the model with the smallest predictive error should be preferred. As shown by Grootendorst, et al, the confidence intervals around the predictive values depend on the sample size of the study that the predictions are applied to [32], a finding we also noted (data not shown).
Although mapped utilities can be helpful when actual utility scores are not available, mapped utilities can have additional substantial limitations. Barton, et al showed that in patients with osteoarthritis, “mapping models developed from the WOMAC tended to underestimate the QALY gain associated with each of four interventions, compared to that which was derived from actual [UK] EQ-5D scores” [35]. One explanation for this observation is that “prediction errors...tend to be increasingly positive for lower EQ-5D scores and increasingly negative for higher EQ-5D scores,” a finding that we observed in the current study in Figure 2 (error direction is reversed by subtraction method) and others have also noted in RA studies of mapped EQ-5D scales. These findings, and the inherent error in mapping [42,43], lead us to advise the use of the 5-item EQ-5D questionnaire or SF-6D rather than relying on secondary mapping.
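Why sample size matters for group-level use can be seen from a rough calculation. The sketch below assumes that prediction errors are independent with a standard deviation equal to the model RMSE, and it ignores uncertainty in the regression coefficients and any systematic bias. It is therefore a simplification for illustration, not the method of Grootendorst, et al; the function name is ours, and 0.11 is the RMSE reported here for the best US EQ-5D model.

```python
import math


def group_mean_half_width(rmse, n_patients, z=1.96):
    """Approximate 95% half-width for the mean mapped utility of a group.

    Assumes independent prediction errors with SD equal to the RMSE.
    """
    if n_patients <= 0:
        raise ValueError("n_patients must be positive")
    return z * rmse / math.sqrt(n_patients)


small_trial = group_mean_half_width(0.11, 100)
large_trial = group_mean_half_width(0.11, 400)
print(round(small_trial, 4), round(large_trial, 4))
```

Under these assumptions, quadrupling the number of patients halves the interval, which is consistent with mapped values being more defensible for group comparisons than for individual patients.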
In summary, the US EQ-5D differs from the UK version and from the SF-6D in mean scores and ranges. When determined by mapping, the US EQ-5D has a much lower prediction error than the UK EQ-5D. Simple mapping models that use HAQ and pain have acceptable error rates, although more complicated models that include individual HAQ items and mood scores improve predictive accuracy and model fit.
APPENDIX Coefficients for mapping HAQ items, pain, and mood to US EQ-5D, UK EQ-5D, and SF-6D
Footnotes
- Supported by a grant from Pfizer, Inc.
- Accepted for publication March 5, 2010.
REFERENCES
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.