Abstract
Objective. We aimed to evaluate how minimal (clinically) important differences (MCID/MID) were calculated in rheumatology in the past 2 decades and demonstrate how the calculation is compromised by the lack of interval scaling.
Methods. We conducted a systematic literature review on articles reporting MCID calculation in osteoarthritis (OA) and rheumatoid arthritis (RA) from January 1, 1989, to May 9, 2014. We evaluated the methods of MCID calculation and recorded the ranges of MCID for common patient-reported outcome measures (PROM). Taking data from the Health Assessment Questionnaire (HAQ), we showed the effects of performing mathematical calculations on ordinal data.
Results. A total of 330 abstracts were reviewed and 123 articles chosen for full text review. Thirty-six (19 OA, 16 RA and 1 OA-RA) articles were included in the final evaluation. The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) was the most frequently reported PROM with relevant calculations in OA, and the HAQ in RA. Sixteen articles used anchor-based methods alone for calculation of MCID, and 1 article used distribution-based methods alone. Nineteen articles used both anchor and distribution-based methods. Only 1 article calculated MCID using an interval scale. Wide ranges in MCID for the WOMAC in OA and HAQ in RA were noted. Ordinal-based derivations of MCID are shown to understate true change at the margins, and overstate change in the mid-range of a scale.
Conclusion. The anchor-based method is commonly used in the calculation of MCID. However, the lack of interval scaling is shown to compromise validity of MCID calculation.
- MINIMAL CLINICALLY IMPORTANT DIFFERENCE
- ANCHOR-BASED APPROACHES
- DISTRIBUTION-BASED APPROACHES
- PATIENT-REPORTED OUTCOME MEASURES
- RHEUMATOID ARTHRITIS
- OSTEOARTHRITIS
Over the last 25 years the concept of a minimal clinically important difference (MCID) has emerged in the outcomes literature. A clinically important difference is defined as a change or difference in the outcome measure that would be perceived as important and beneficial by the clinician or the patient, assuming the absence of serious adverse effects and excessive costs1,2,3. An MCID is therefore a threshold value for such change. A number of terms that differ slightly in definition have emerged in the literature, and they may be confusing. The most common are the minimally important difference (MID), MCID, minimal clinically significant difference (MCSD), and minimal clinically important improvement (MCII). A review by King details the definitions and methods for determining the MCID4.
In determining MCID, both distribution-based and anchor-based approaches have been described. Distribution-based, or data-driven, approaches depend on the statistical characteristics of the data. From a statistical perspective, a significant difference is one that is unlikely to occur by chance, and is a decision based on probabilistic calculations. Such probabilistic calculations are often affected by sample size; thus a small difference may be regarded as significant owing to a large sample size, yet mean little to the patients or their clinician. Clinicians therefore need to ascertain the importance of statistically significant results for their own patients. A range of indicators such as the standard error of measurement (SEM), minimum detectable change (MDC), effect size (ES), standardized response mean (SRM), or Guyatt’s responsiveness index (GRI) may be used to define variability in the data. In contrast, anchor-based approaches link the change in the outcome measure to a meaningful external anchor that accounts for the patient’s perspective. For example, patients rate their current state relative to their previous state on a scale running from “much worse,” “slightly worse,” “same,” and “slightly better” to “much better.” The MCID for improvement can then be defined as the difference between the mean outcome measure of patients who rated themselves “same” and that of patients who rated themselves “slightly better.” The range of the categorical rating scale varies from study to study, and the selection of the score group(s) used to calculate MCID is arbitrary. In Outcome Measures in Rheumatology (OMERACT) meetings 5 to 7, the anchor-based method was recommended as the method of choice1,5,6, while in OMERACT 8, reporting the proportion of patients achieving anchor-based acceptable status was recommended as important and complementary information in clinical trials7.
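The two families of calculation described above can be made concrete with a small sketch. The Python below is an illustration only: the function names and the reliability argument are assumptions introduced here. The formulas are the conventional textbook ones (ES as mean change over baseline SD, SRM as mean change over SD of change, SEM as SD × sqrt(1 − reliability), MDC as z × sqrt(2) × SEM) and are not necessarily those applied in any of the reviewed articles. The anchor-based function additionally assumes that the quantity compared between anchor groups is each patient's change score.

```python
from math import sqrt
from statistics import mean, stdev


def anchor_based_mcid(changes, ratings, reference="same", target="slightly better"):
    """Mean change in the target anchor group minus mean change in the reference group."""
    ref = [c for c, r in zip(changes, ratings) if r == reference]
    tgt = [c for c, r in zip(changes, ratings) if r == target]
    return mean(tgt) - mean(ref)


def change_scores(baseline, followup):
    """Per-patient change, follow-up minus baseline."""
    return [f - b for b, f in zip(baseline, followup)]


def effect_size(baseline, followup):
    """ES: mean change divided by the SD of baseline scores."""
    return mean(change_scores(baseline, followup)) / stdev(baseline)


def standardized_response_mean(baseline, followup):
    """SRM: mean change divided by the SD of the change scores."""
    changes = change_scores(baseline, followup)
    return mean(changes) / stdev(changes)


def standard_error_of_measurement(baseline, reliability):
    """SEM: baseline SD times sqrt(1 - reliability); reliability is supplied by the caller."""
    return stdev(baseline) * sqrt(1 - reliability)


def minimum_detectable_change(sem, z=1.96):
    """MDC at the confidence level implied by z (1.96 for 95%, 1.645 for 90%)."""
    return z * sqrt(2) * sem
```

Every one of these functions adds, subtracts, or averages scores, and therefore treats a 1-point difference as the same amount of change wherever it occurs on the scale. That is the interval-scale assumption examined in the remainder of the article.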
The US Food and Drug Administration (FDA) guidelines for patient-reported outcome measures (PROM) affirmed that anchor-based methodology was required in reporting the proportion responding to treatment in the evaluation of all medical devices and drugs8. While there may be some debate over the appropriateness of the external anchor, in terms of wording, category levels, and time frame, there is a weakness that is more fundamental. This is connected to the nonlinearity of estimates obtained from PROM. PROM are ordinal scales, in which the difference between 2 levels of a response cannot be assumed to be the same as the difference between 2 other levels. However, the calculations of MCID rely on data that meet criteria for interval scaling9. To inform the debate about the validity of MCID/MID on PROM, we conducted a systematic review of how MCID has been calculated in 2 common rheumatologic conditions, osteoarthritis (OA) and rheumatoid arthritis (RA), in the past 2 decades, focusing on calculation methods and scaling. In addition, we show how the approach lacks validity when used on ordinal data.
MATERIALS AND METHODS
Search strategies
A systematic review of the SCOPUS and MEDLINE databases from January 1, 1989, to May 9, 2014 was performed to identify English-language original research reports related to all the MCID-related terms: (“minimally clinically important difference” OR “minimal clinically important difference” OR “minimum clinically important difference” OR “minimally clinical important difference” OR “minimal clinical important difference” OR “minimum clinical important difference” OR “minimally important difference” OR “minimal important difference” OR “minimum important difference” OR “minimal clinically important improvement” OR “subjectively significant difference” OR “clinically important difference” OR “clinically significant change” OR “minimally important change” OR “minimal clinically important improvement”) AND (“Osteoarthritis” OR “Rheumatoid Arthritis”) in the title or abstract. The references of all retrieved articles were also screened for potentially relevant publications.
Selection of articles
Two reviewers (BDE, YYL, or CP; 2 working on each article) independently assessed inclusion or exclusion of articles, with disputes resolved by another reviewer (AT). We included publications that calculated MCID in the evaluation of any PROM in patients with OA or RA. We excluded articles that used published results of MCID without calculation of MCID and articles that did not report MCID calculation on OA or RA separately.
Data collection
One reviewer (BDE, YYL, or CP) extracted MCID data using a standardized checklist, and results were double-checked by another reviewer. Information was gathered on whether the MCID calculated was based on distribution or anchor-based methods, on anchor characteristics, and on whether the calculation was based on ordinal or interval scales. MCID values of commonly used PROM were pooled. From the selected articles, we extracted anchor-based MCID improvement and deterioration along with distribution-based indicators including ES, SRM, SEM, MDC 90%, MDC 95%, and GRI.
RESULTS
A total of 330 abstracts were identified from the literature search and 123 articles were chosen for full text review (Figure 1). We excluded 188 articles in the abstract review that were commentaries, reviews, metaanalyses or systematic reviews, articles that included patients other than those with OA or RA, articles on outcome measures other than PROM, articles on single-item outcome measures, and articles that used only published MCID for PROM without calculation of MCID. Thirty-six (19 OA, 16 RA, 1 RA-OA) articles were included in the final evaluation10–46.
The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) and the generic Medical Outcomes Study Short Form-36 (SF-36) were the most frequently reported PROM with relevant calculations in OA. The Health Assessment Questionnaire (HAQ), SF-36, and Functional Assessment of Chronic Illness Therapy–Fatigue (FACIT-F) scales were the most frequently reported PROM with relevant calculations in RA (Table 1). In determining the MCID values, anchor-based approaches alone were used in 44.4% of articles, and 52.8% of articles used more than 1 approach (Table 1). Only 1 article used distribution-based methods alone in the calculation of MCID10.
There was heterogeneity in the anchors used in different studies. The majority of articles (n = 27, 75%) used a question asking about patients’ health status compared with an earlier timepoint. Two articles used a patient conversation method, in which patients were required to rate their health status in relation to another patient. There was diversity in the category level and time frame (period of time to compare current health status with the prior health status), and 5 articles had multiple time frames (Table 1). There were 8 articles (22.2%) that used anchors that were part of the clinical assessment and not patient-derived, such as Simplified Disease Activity Index (SDAI), physician’s global assessment, pain visual analog scale (VAS), swollen/tender joint counts, and the 28-joint Disease Activity Score.
The ranges of MCID for WOMAC and SF-36 for knee or hip OA are shown in Appendix 1, and the ranges of MCID for HAQ and FACIT-F for RA are shown in Appendix 2. Data are presented separately for anchor-based and distribution-based methods. Regardless of calculation method, there was a wide range of MCID values.
The majority of articles calculated MCID for PROM using the raw scores, which were all ordinal scales. There was 1 exception that used the Rasch model to convert the raw scores of ABILHAND (a Rasch-built measure of manual ability of the hand) to interval measures10.
Calculating MCID on ordinal data
Table 2 illustrates the problem using the HAQ. The HAQ is typically scored as 8 items with a 0–3 range, giving a raw score of 0–24, which is usually divided by 8 to give a 0–3 range. Each increment of raw score gives a 0.125 increase in score on the 0–3 range (columns 1–3). Columns 4–6 provide the metric equivalent of the HAQ, which is derived from the fit of HAQ data to the Rasch model (mean item residual 0.0758; SD 1.1822; person residual −0.287; SD 1.0693; chi-square 32.316; p = 0.120; alpha 0.90). Location (column 4) is the Rasch model estimate, in logits, associated with each raw score. This “location” is then linearly transformed into Metric-8 to mimic the 0–3 scale of the HAQ-8. Metric-8 is the equivalent of the HAQ represented on a metric or interval scale. The first point to note is that the increment of Metric-8 in column 6 does not match the HAQ-8 (ordinal) increment shown in column 3. In practice, each step from one raw point to the next corresponds to a different increment on the Metric-8. The increment is largest at the margins and smallest in the center of the scale.
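The scoring and rescaling steps just described can be sketched in a few lines of Python. The logit locations used here are HYPOTHETICAL: they come from a stand-in S-shaped curve chosen only to mimic the qualitative pattern described above, and they are not the Rasch estimates reported in Table 2. The function names, and the assumption that the linear transformation maps the lowest location to 0 and the highest to 3, are likewise introduced for illustration.

```python
from math import log

N_ITEMS = 8
RAW_MAX = 24  # 8 items, each scored 0-3


def haq8_score(raw):
    """Classical (ordinal) HAQ-8: raw total divided by the number of items."""
    if not 0 <= raw <= RAW_MAX:
        raise ValueError("raw score must lie between 0 and 24")
    return raw / N_ITEMS


def rescale_locations(locations, low=0.0, high=3.0):
    """Linearly map logit locations onto the interval [low, high]."""
    lo, hi = min(locations), max(locations)
    return [low + (x - lo) * (high - low) / (hi - lo) for x in locations]


# HYPOTHETICAL locations, one per raw score 0..24 (NOT the values in Table 2).
hypothetical_locations = [
    log((r + 0.5) / (RAW_MAX + 0.5 - r)) for r in range(RAW_MAX + 1)
]
metric8 = rescale_locations(hypothetical_locations)

# Compare the constant ordinal increment with the varying metric increment.
for raw in (0, 11, 23):
    ordinal_step = haq8_score(raw + 1) - haq8_score(raw)
    metric_step = metric8[raw + 1] - metric8[raw]
    print(f"raw {raw}->{raw + 1}: ordinal +{ordinal_step:.3f}, metric +{metric_step:.3f}")
```

With this stand-in curve the ordinal step is 0.125 everywhere, while the metric step is several times larger at either end of the scale than in the middle. That is the pattern of unequal increments described for columns 3 and 6.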
Consider the implications of this for an example of a 0.5-point change on the classical HAQ-8, where a shift of 4 raw score points is required. Columns 7 and 8 show the results for the change in raw score of HAQ-8 needed to achieve the 0.5-point change in HAQ, for improvement (column 7) and deterioration (column 8), respectively. On the classical HAQ-8 such an improvement will be obtained by a 4-point raw score (0.5) shift in either direction. On the Metric-8 (when units are of equal size), the 0.5-point change can be obtained by just a 2- to 3-point raw score improvement at the margins, whereas it requires 6 raw score points in the center of the scale. The reason there are differences for the 0.5-point improvement and deterioration is explained in the item threshold distribution histogram as shown in Figure 2. The items of HAQ-8 are now calibrated on a metric scale. An item’s location in logits is the relative difficulty respondents describe regarding that item on a scale. The score points can be considered as a ruler of physical function, from most disabled at the right end to least disabled at the left end. When moving from right to left on the metric ruler, raw score points are lost in a nonlinear fashion (i.e., the patient is improving). For example, a patient starting at 4 logits would quickly lose 3 score points as he improves (1 logit down to 3 logits); but when he continues to improve, he needs to move 2 more logits before he loses another 3 raw score points (with the caveat that this is set within a probabilistic framework). This is typical of an ordinal scale where the distances between each raw score point are not equal. Further, when a patient is becoming more disabled (moving from left to right on the ruler), score points are picked up in a different fashion, thus explaining the differences in improvement versus deterioration on the ordinal scale. 
This also makes a difference for MCID calculated by distribution-based approaches, where the same unit and associated measurement error is assumed for all parts of the instrument.
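The raw-score arithmetic behind the 0.5-point example can be sketched as follows. As a loudly stated assumption, the transformation table is generated from a HYPOTHETICAL S-shaped curve rather than from the Rasch estimates in Table 2, so the counts it yields differ from the 2 to 3 and 6 raw points reported above; only the qualitative pattern is meant to carry over. The helper `raw_points_needed` is a name introduced here for illustration.

```python
from math import log

RAW_MAX = 24  # HAQ-8 raw score range is 0-24

# HYPOTHETICAL logit locations per raw score, rescaled linearly to 0-3.
_locs = [log((r + 0.5) / (RAW_MAX + 0.5 - r)) for r in range(RAW_MAX + 1)]
METRIC8 = [3 * (x - _locs[0]) / (_locs[-1] - _locs[0]) for x in _locs]


def raw_points_needed(metric, start, delta=0.5, step=-1):
    """Count raw points moved from `start` until the metric has changed by `delta`.

    step=-1 models improvement (a falling HAQ score); step=+1 models
    deterioration. Returns None if the end of the scale is reached first.
    """
    raw = start
    while 0 <= raw + step < len(metric):
        raw += step
        if abs(metric[raw] - metric[start]) >= delta:
            return abs(raw - start)
    return None


# On the classical HAQ-8 a 0.5-point change always costs 4 raw points.
for start, step, label in [(24, -1, "improving from the top margin"),
                           (12, -1, "improving from the center"),
                           (0, +1, "deteriorating from the bottom margin"),
                           (12, +1, "deteriorating from the center")]:
    print(label, raw_points_needed(METRIC8, start, step=step))
```

Under this stand-in curve fewer than 4 raw points are needed at the margins and more than 4 in the center, so a fixed 4-point raw threshold corresponds to different amounts of metric change depending on where the patient starts. Because the stand-in curve is symmetric, it does not reproduce the improvement versus deterioration asymmetry seen with the actual item thresholds in Figure 2.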
DISCUSSION
At OMERACT 11, a special interest group (SIG) on Rasch model analysis was formed and reported on the acceptance and increasing use of Rasch model analysis in evaluation and development of PROM47. However, it also highlighted the lack of availability and reporting of transformation tables, which limited the application of transformed interval measurement in clinical practice. Meaningful measurement is based on the arithmetical property of interval scales48, and this applies particularly to the responsiveness of PROM and “discrimination” in the OMERACT filter49. During OMERACT 12 (May 2014), we evaluated the application of interval scaling in the domain of responsiveness and MCID calculation. We noted a significant deficiency in the use of interval scaling, which resulted in potential misinference of reported MCID. During the SIG in OMERACT 12, there was unanimous agreement from 39 participants that it was crucial to promote the use of the Rasch interval scale in measurement. Thus an international collaboration has been established to provide data for the establishment of Rasch transformed scales for commonly used PROM in rheumatology, which should be made available in the public domain.
The use of MCID or similar calculations in musculoskeletal disorders is growing rapidly. As shown in our systematic review for PROM in OA and RA, the majority of cases have failed to recognize that ordinal scales do not support the mathematical calculations required for MCID calculation. Only in 1 case, with the ABILHAND, was the MCID calculated using an appropriate metric49. This situation is not unique to rheumatology, and only rarely can a metric-based MCID be found elsewhere50. In practice, it generally means that where an original MCID calculation has been undertaken on a raw ordinal score, then subsequently patients will be less likely to achieve an MCID when they are moving across the outer 50% of the scale range, and more likely to achieve an MCID across the middle range of the scale. Thus, as we have demonstrated using the HAQ, at the margins they will improve a great deal for a 4-point change, and much less so as they pass across the middle of the score range.
The problems associated with MCID/MID calculations have been noted for some time. There is debate over the appropriate methodology to ascertain patient-perceived change7,51. It has been recognized that MCID/MID appear to differ according to the initial health status of study subjects52,53, and are different for improvement and deterioration. Both issues can be solved when MCID/MID are calculated on the metric or interval scale. Rasch analysis contributes to a valid calculation of MCID/MID by providing an interval scale metric for this purpose, and by revealing how the ordinal-based calculations are invalid and biased. On the other hand, there are issues, particularly concerning the external anchor used in the calculation of MCID on metric scales, that cannot be resolved, including the difference in MCID for improvement/deterioration as determined by the anchor method; this is because patients’ perception of “important” changes for improvement or deterioration may be different. However, the question as to whether the patient weights his/her judgment differentially is a separate matter. Because the Rasch metric shows a difference in the unit change (in relation to the raw score) depending on whether the patient is improving or deteriorating, does it matter if the subjective anchor for that change is also different? We must accept the subjective anchor, and link the Rasch metric accordingly. From a methodological standpoint, the issue is whether the subjective judgment of change (i.e., the anchor) is consistent within and between patients and across samples. Given this, the rheumatology community should now state that MCID and their equivalents should always be calculated on a metric transformation. This may also help to facilitate development of a gold standard for calculating MCID/MID, which has yet to be established; its absence has resulted in problems of interpretation due to varied results.
Our study has some limitations. The systematic review was constrained to English language publications, and information presented in other languages may have been missed. We found large ranges of MCID values calculated for either OA or RA for commonly used instruments, regardless of calculation method. However, we cannot differentiate how much of this variation is attributable to differences in anchors or anchor category levels, and how much to scaling. Because the terms used to indicate MCID have been numerous, evolving, and confusing, we may have missed some terms, although we believe we have included most of those in common use. We used only 1 scale (HAQ) to demonstrate the lack of interval scaling and its effect upon calculating the MCID. However, the nonlinearity of ordinal raw scores in PROM is now well established in the literature, including for other widely used scales such as the WOMAC54.
The existing calculations of MCID/MID based on an ordinal scale are invalid. They make MCID/MID dependent on the initial health status of study subjects and on the direction of change, which leads to potential misinference. This has most likely contributed to the wide range of reported MCID/MID in commonly used PROM. Interval scaling based on Rasch model-transformed ordinal scales is urgently required for the calculation of MCID/MID.
APPENDIX 1.
APPENDIX 2.
Footnotes
PGC is funded in part through a grant from Arthritis Research UK.