Original Article
Mathematical coupling may account for the association between baseline severity and minimally important difference values

https://doi.org/10.1016/j.jclinepi.2009.10.004

Abstract

Objective

To generate anchor-based values for the “minimally important difference” (MID) for a number of commonly used patient-reported outcome (PRO) measures and to examine whether these values could be applied across the continuum of preoperative patient severity.

Study Design and Setting

Six prospective cohort studies of patients undergoing elective surgery at hospitals in England and Wales. Patients completed questionnaires about their health and health-related quality of life before and after surgery. MID values were calculated using the mean change score for a reference group of patients who reported they were “a little better” after surgery minus the mean change score for those who said they were “about the same.” Pearson's correlation was used to examine the association between baseline severity and change scores in the reference group. Baseline severity was expressed in two ways: first in terms of preoperative scores and second in terms of the average of pre- and postoperative scores (Oldham's method).

Results

Of the 10 PRO measures examined, eight demonstrated a moderate or high positive association between preoperative scores and MID values. Only two measures demonstrated such an association when Oldham's measure of baseline severity was used.

Conclusion

In general, there is little association between baseline severity and MID values. However, a moderate association persists for some measures, and it is recommended that researchers continue to test for this relationship when generating anchor-based MID values from change scores.

Introduction

What is new?

  • We have provided anchor-based MID values for the first time for a number of commonly used generic and disease-specific patient-reported outcome measures.

  • Previous research on the relationship between baseline severity and anchor-based MID values may have reached inappropriate conclusions because of flawed statistical analyses.

  • When an unbiased test is used there is, in general, a low positive association between baseline severity and anchor-based MID values and for some measures there is a low negative association.

Interpretability is a key challenge for researchers interested in measures of health status and health-related quality of life. Patient-reported outcome (PRO) measures do not produce intuitively meaningful data, and this makes it difficult to interpret the meaning of differences between and within groups and individuals. The meaning of unit changes in the measures is unclear, because the metrics change from instrument to instrument and also because of unfamiliarity with their use, unlike, for example, measures of blood pressure [1]. It is also important to ensure that statistically significant results have clinical or social significance [2]. The importance of interpretability was emphasized in guidance issued in 2006 by the United States Food and Drug Administration, which recommended the specification of a “minimally important difference” (MID) when developing PRO measures. The MID has been defined as “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side-effects and excessive cost, a change in the patient's management” [3]. MIDs may be derived using either distribution- or anchor-based methods.

Distribution-based approaches use statistical aspects of representative samples to determine an MID. The most commonly used methods describe change in terms of standard deviation (SD) units. For example, Cohen's effect size formula defines an MID as 0.20 for “small” effects, 0.50 for “moderate” effects, and 0.80 for “large” effects [4]. The effect size approach suffers from sample dependence: The greater the variability within a sample, the higher the SD and the higher the MID. The standard error of the measurement method has been suggested to produce a distribution-based MID that is relatively constant when measured in different samples of patients [5]. The main disadvantage of both methods is the arbitrariness of the qualitative thresholds and the lack of a known relationship to patient experience. The remainder of this article, therefore, focuses on an alternative approach, in which MID values are “anchored” to a known categorical change in health status.
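The two distribution-based approaches described above can be sketched as follows. This is an illustrative sketch only: the function names, the two samples, and the reliability coefficient of 0.9 are hypothetical, and the standard error of measurement is computed with the usual formula SD × √(1 − reliability).

```python
import math
import statistics


def effect_size_mid(baseline_scores, threshold=0.2):
    """MID expressed as a fraction of the baseline SD (0.2 = Cohen's 'small' effect)."""
    return threshold * statistics.stdev(baseline_scores)


def sem_mid(baseline_scores, reliability):
    """One standard error of measurement: SD * sqrt(1 - reliability).

    `reliability` is an assumed coefficient between 0 and 1
    (for example, a test-retest estimate); it is not taken from the article.
    """
    return statistics.stdev(baseline_scores) * math.sqrt(1.0 - reliability)


# Two hypothetical samples with the same mean but different spread.
narrow = [40, 45, 50, 55, 60]
wide = [10, 30, 50, 70, 90]

mid_narrow = effect_size_mid(narrow)
mid_wide = effect_size_mid(wide)
sem_wide = sem_mid(wide, reliability=0.9)

print(f"0.2 SD MID, narrow sample: {mid_narrow:.2f}")
print(f"0.2 SD MID, wide sample:   {mid_wide:.2f}")
print(f"SEM-based MID, wide sample (reliability 0.9): {sem_wide:.2f}")
```

The more variable sample yields the larger effect-size MID even though the instrument is unchanged, which is the sample dependence noted in the text.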

Two types of anchor-based MID can be derived: between group and within group. Between-group values are based on a comparison of scores of patients in different clinical groups. For example, Deyo et al. compared sickness impact profile (SIP) scores in rheumatoid arthritis patients classified into one of four severity categories as defined by the American Rheumatism Association. Differences in average scores between adjacent categories were then used as MID values for the SIP [6]. A similar approach was used by Kulkarni to produce an MID for the Hydrocephalus Outcome Questionnaire. In this case, the average scores for children rated by clinicians as “not at all impaired” were compared with those rated as “very mildly impaired” [7]. A disadvantage of this approach is that clinical severity and patient-perceived health status, although correlated to some extent, are different constructs. Clinical severity, therefore, can only shed limited light on the meaning behind a patient's response. Furthermore, it is unclear if cross-sectional MIDs can be applied in longitudinal studies of health status.

Within-group MIDs can be derived by calculating the mean change score in a reference group of patients deemed to have experienced minimally important change. The reference group may be defined according to external clinical criteria or criteria based on the patient's own perspective. In an example of the clinically driven approach, Eton et al. used the Eastern Cooperative Oncology Group Performance Status Rating scale to identify three types of patients: those who improved, those who were stable, and those who deteriorated over time. MID values for the Functional Assessment of Cancer Therapy–Lung Symptom Index-12 (FACT-LSI12) were then produced by calculating mean change scores on the FACT-LSI12 for each of the three groups [8]. The use of MID reference groups defined according to clinical measures has been criticized for prioritizing criteria that have no known relationship to the patient experience [9]. Revicki et al. recommended that “the patient's perspective be given the most weight since these are patient reported outcome, although the clinician's perspective is considered important as well” [10].

The first use of a patient-referenced approach to MID generation for a quality-of-life instrument was reported by Jaeschke et al. in 1989 [3]. Patients were asked to rate their symptom change on a 15-point Transition Rating Index (TRI) ranging from −7 to +7, where zero represents no change. Those who answered from +3 to −3 on this scale were identified as the reference group, and their mean change score on the quality-of-life instrument was used as the MID. Juniper et al. used the same TRI but included only those patients who scored between +1 and −1 in the reference group [1]. In both the Jaeschke et al. and Juniper et al. studies, patients who reported improvement and deterioration were grouped together. It is now commonly accepted that this should be avoided, because the mean absolute magnitude of change scores in improving and deteriorating patients is not the same [11]. Separate MIDs for patients who report improvement and for those who report deterioration have been derived in more recent studies [12], [13], [14]. A further modification of the Jaeschke et al. approach involves recalibrating change scores in patients who report minimal change to take into account change scores for patients who report no change in their symptoms. For example, Coyne et al. [15] produced an MID for a quality-of-life measure in patients with an overactive bladder by subtracting the mean change score for patients reporting no benefit from the mean change score for patients reporting little benefit. This recalibration is performed because any deviation from zero in the change scores of patients who report no change in symptoms is considered to represent “measurement error.”
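The recalibrated calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the change scores are hypothetical, and the transition labels are borrowed from the wording used in the abstract.

```python
from statistics import mean


def anchor_based_mid(change_scores, transition_ratings,
                     minimal_label="a little better",
                     no_change_label="about the same"):
    """Mean change in the minimal-improvement group minus mean change in the no-change group."""
    minimal = [c for c, r in zip(change_scores, transition_ratings)
               if r == minimal_label]
    unchanged = [c for c, r in zip(change_scores, transition_ratings)
                 if r == no_change_label]
    if not minimal or not unchanged:
        raise ValueError("both reference groups must contain at least one patient")
    return mean(minimal) - mean(unchanged)


# Hypothetical change scores (sign convention chosen so that positive = improvement).
changes = [8, 12, 10, 2, 4, 3, 25]
ratings = ["a little better", "a little better", "a little better",
           "about the same", "about the same", "about the same",
           "much better"]

mid = anchor_based_mid(changes, ratings)
print(mid)  # mean of (8, 12, 10) minus mean of (2, 4, 3)
```

Patients outside the two reference groups (here, the one rated "much better") do not enter the calculation.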

The Department of Health in England has signaled an intention to move toward the routine use of PRO measures to assess the benefits of health care [16]. The Department commissioned our group to perform a systematic review to identify the most appropriate measures to use for five high-volume elective procedures: cataract surgery, groin hernia repair, varicose vein surgery, hip replacement, and knee replacement. We found that procedure-specific MID values had not been produced for the preferred measures [17]. A value for the MID is also unavailable for the Sino-Nasal Outcome Test (SNOT, 22-item version), a measure deemed the best available for sinonasal surgery patients in a recent review [18]. In this study, we generated anchor-based MID values for seven different disease-specific and generic PRO measures that are commonly used to assess the benefits of high-volume elective surgery.

A number of studies suggest that MIDs based on change scores lack stability across the continuum of baseline severity. Specifically, previous studies report that MID values are higher in patients with greater baseline severity and lower in patients with lower severity. The studies cover a diverse set of patients, including those with back pain [19], [20], [21], isolated trauma of the extremities [22], obesity [23], conditions requiring elective surgery [24], osteoarthritis [25], localized musculoskeletal pain [26], conditions requiring emergency care [27], and chronic pain [28]. This evidence has been cited in a number of review articles, and it is now conventional wisdom that MID values are associated with baseline severity [11], [29], [30]. The results are usually explained in psychophysical terms: patients with a high baseline severity need a higher magnitude of change to perceive a clinically meaningful change in their condition. To deal with this phenomenon, it has been suggested that PRO measures should have separate MID values for discrete categories of baseline severity [11], [23], [30]. It has also been suggested that absolute values for MIDs should be replaced with relative values (e.g., percentage change in a reference group that reports minimal improvement), as these have been found less prone to association with baseline severity [21], [24], [26], [28].

Because anchor-based MIDs are simply mean change scores in a subsample of patients, studies that investigate the relationship between such MIDs and baseline severity must deal with the same statistical challenges faced by any enquiry into the relationship between change and initial value. Mathematical coupling occurs “when one variable directly or indirectly contains the whole or part of another” [31]. Because change scores are the pretreatment score minus the posttreatment score, they contain the pretreatment score. Mathematical coupling can lead to an artificially inflated association between initial value and change score when correlation or regression is used. This was demonstrated by Oldham in 1962, who showed that, for two series of independent random numbers x and y with the same SD, a strong correlation (≈0.71) is observed between x and x − y [32]. Using the same random numbers, it can also be shown that mean values for x − y vary across different strata for x. For example, when x and y are bounded by 0 and 100, the mean value of x − y for the lowest quartile of x is approximately −37, and the mean value of x − y for the highest quartile of x is approximately +37.
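The coupling effect can be reproduced with a short simulation. The sketch below assumes that x and y are drawn independently from a uniform distribution on 0 to 100; the seed, the sample size, and the helper `pearson` are illustrative choices rather than details from the original analysis.

```python
import math
import random
from statistics import fmean, quantiles


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(1962)  # arbitrary seed, for reproducibility only
n = 20_000
x = [rng.uniform(0, 100) for _ in range(n)]  # "pretreatment" scores
y = [rng.uniform(0, 100) for _ in range(n)]  # independent "posttreatment" scores
change = [xi - yi for xi, yi in zip(x, y)]

# x and y are unrelated, yet x correlates strongly with x - y.
r_coupled = pearson(x, change)

# Mean change within the lowest and highest quartiles of x.
q1, _, q3 = quantiles(x, n=4)
low_mean = fmean(c for xi, c in zip(x, change) if xi <= q1)
high_mean = fmean(c for xi, c in zip(x, change) if xi >= q3)

print(f"corr(x, x - y)          = {r_coupled:.2f}")
print(f"mean x - y, lowest 25%  = {low_mean:.1f}")
print(f"mean x - y, highest 25% = {high_mean:.1f}")
```

The theoretical values under these assumptions are 1/√2 ≈ 0.71 for the correlation and ∓37.5 for the two quartile means, so stratifying by x produces apparently different "change" values even though x and y carry no information about each other.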

Previous research on the relationship between MIDs derived from change scores and baseline severity has not adequately accounted for mathematical coupling. The studies have used correlations [24], regression methods [26], and the comparison of mean change scores for different baseline strata [19], [20], [21], [22], [23], [25], [27], [28]. As a result, it is unclear to what extent the influence of baseline severity on anchor-based MIDs is a clinical phenomenon or a statistical artifact. To address this question, an unbiased test of the relationship between baseline severity and change is required. Such a test was proposed by Oldham in 1962. Rather than test the correlation between x and x − y, he proposed testing the correlation between (x + y)/2 and x − y. This correlation equals zero for two series of independent random numbers x and y with the same SD and, thus, avoids the statistical problems described earlier. Why should we correlate change with an average of pre- and posttreatment scores when we are interested in the relationship between pretreatment scores and change? It can be shown that this correlation is a test of the differences in the variances between an initial measurement and a repeated measurement and that Oldham's coefficient will equal zero if there is no difference in the variances [31]. Oldham reasoned that if the effectiveness of a treatment is related to baseline severity, we should observe a “shrinking” in the posttreatment variance compared with the pretreatment variance. This will occur, because the proportional response to treatment will cause posttreatment scores to converge around the mean. This can be demonstrated using the hypothetical example of three patients completing a health status measure that is bounded by 0 and 100 and which gives lower scores for patients with greater impairment. Patient 1 has a score of 20 before surgery, patient 2 scores 40, and patient 3 scores 60. Because improvement is greater for patients with higher baseline severity in this example, patient 1 improves by 40 points to 60, patient 2 by 30 points to 70, and patient 3 by 20 points to 80. As a result, the dispersion of scores as measured by the SD around the mean is halved from 20 before surgery to 10 after surgery. Although Oldham's method cannot specify the incremental change produced by increases in baseline severity, it does give an unbiased test of correlation for differential baseline effects.
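Oldham's test and the three-patient example can be checked numerically. The following sketch assumes independent uniform random numbers for the null case; the seed, sample size, and the helper `pearson` are illustrative and not part of the original study.

```python
import math
import random
from statistics import fmean, stdev


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def oldham_correlation(pre, post):
    """Correlation between the pre/post average and the change score (pre - post).

    cov((x + y)/2, x - y) = (var(x) - var(y)) / 2, so the coefficient is zero
    when the pre- and posttreatment variances are equal.
    """
    average = [(a + b) / 2 for a, b in zip(pre, post)]
    change = [a - b for a, b in zip(pre, post)]
    return pearson(average, change)


# Null case: independent random numbers with the same SD.
rng = random.Random(31)  # arbitrary seed
x = [rng.uniform(0, 100) for _ in range(20_000)]
y = [rng.uniform(0, 100) for _ in range(20_000)]
r_null = oldham_correlation(x, y)

# The three hypothetical patients from the text.
pre = [20, 40, 60]
post = [60, 70, 80]
sd_pre, sd_post = stdev(pre), stdev(post)
r_patients = oldham_correlation(pre, post)

print(f"Oldham correlation, independent x and y: {r_null:.3f}")
print(f"SD before surgery: {sd_pre:.0f}, SD after surgery: {sd_post:.0f}")
print(f"Oldham correlation, three patients: {r_patients:.2f}")
```

With only three patients lying on a straight line, the coefficient is trivially ±1; the point of the example is the direction of the effect, namely that a shrinking posttreatment variance yields a nonzero Oldham correlation while equal variances yield approximately zero.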

Our aim was to generate anchor-based MID values for a number of commonly used PRO measures in patients undergoing elective surgery and then use Oldham's method to examine whether these values were related to preoperative severity.

Methods

Patients were from six cohorts undergoing elective operations: sinonasal surgery, cataract surgery, groin hernia repair, varicose vein surgery, hip replacement, and knee replacement. All patients were treated at hospitals in England and Wales, and the research protocols were approved by all relevant institutional ethics committees. Consecutive patients aged 16 years or more were invited to complete questionnaires about their health and quality of life before and after surgery. Patients judged

Results

The numbers of patients in each cohort who consented to participate and completed a baseline questionnaire were as follows: 2,561 sinonasal surgery, 866 cataract surgery, 570 groin hernia repair, 363 varicose vein surgery, 512 hip replacement, and 526 knee replacement. The baseline characteristics of patients are shown in Table 1.

The number of patients in each cohort who returned a postoperative questionnaire was as follows: 2,141 sinonasal surgery patients (response rate, 84%); 750 cataract

Minimally important difference values

We have provided anchor-based MID values for the first time for a number of commonly used generic and disease-specific PRO measures. The TRI method for obtaining anchor-based MID values failed with one measure (VF-14) because of a discrepancy between the VF-14 change scores observed for patients who reported they were “about the same” after surgery and those who reported they were “a little better.” The TRI method had only moderate construct validity in terms of correlation with prospectively

Conclusions

This study generated anchor-based MID values for seven PRO measures that are commonly used with patients undergoing elective surgery. In four instances (VF-14, AVVQ, EQ-5D in hernia, and EQ-5D in hip replacement), there was poor evidence for the validity of the method used, and the MID values produced for these measures should be treated with appropriate caution. For all measures other than the SNOT-22, it is recommended that larger studies be performed to validate the MID values produced.

Acknowledgments

The authors would like to acknowledge the contributions of Liz Jamieson and Lynn Copley in the data collection phase of the project. The authors also thank the Department of Health Policy Research Programme and Commercial Directorate for funding the project on which these analyses are based.

References (48)

  • P.M. ten Klooster et al. Patient-perceived satisfactory improvement (PPSI): interpreting meaningful change in pain from the patient's perspective. Pain (2006)
  • J.T. Farrar et al. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain (2001)
  • D. Revicki et al. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol (2008)
  • P.D. Oldham. A note on the analysis of repeated measurements of the same subjects. J Chronic Dis (1962)
  • G.H. Guyatt et al. A critical look at transition ratings. J Clin Epidemiol (2002)
  • G.R. Norman et al. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol (1997)
  • Y.K. Tu et al. Mathematical coupling can undermine the statistical assessment of clinical research: illustration from the treatment of guided tissue regeneration. J Dent (2004)
  • D. Osoba et al. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol (1998)
  • J. Cohen. Statistical power analysis for the behavioral sciences (1988)
  • K. Wyrwich et al. Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J Clin Epidemiol (1999)
  • R.A. Deyo et al. Physical and psychosocial function in rheumatoid arthritis: clinical use of a self-administered health status instrument. Arch Intern Med (1982)
  • C. Bradley. Feedback on the FDA's February 2006 draft guidance on patient reported outcome (PRO) measures from a developer of PRO measures. Health Qual Life Outcomes (2006)
  • D.A. Revicki et al. Responsiveness and minimal important differences for patient reported outcomes. Health Qual Life Outcomes (2006)
  • J.W.H. Kocks et al. Health status measurement in COPD: the minimal clinically important difference of the clinical COPD questionnaire. Respir Res (2006)