Abstract
The OMERACT patient reported outcomes (PRO) working group evaluated the methodologies for measuring responsiveness to change at the Outcome Measures in Rheumatology (OMERACT) 10 meeting. The outcome measures used in PRO studies are often expressed as continuous data at the group level (e.g., mean change in pain on a 0–100 visual analog scale). This is difficult to interpret and cannot easily be translated to the individual level of response. When interpreting scores at the individual level, it is important to take into account the following 4 main concepts: (1) improvement; (2) status of well-being; (3) onset of action; and (4) sustainability. Information from clinical trials on how many patients showed a response, what the level of response was, and how many patients are doing well, would be extremely useful for physicians. The objective of this article is to outline how continuous data may be reported in a clinically relevant manner. We will describe 5 techniques of reporting continuous variables in clinical studies and discuss the relevance of each.
Physicians manage patient treatment at the individual level within their clinics. Investigation, diagnosis, and treatment are often performed based on the experience of the physician. However, in the new millennium, such decisions should be based on the available evidence. Evidence is usually acquired by evaluating the clinical relevance of data obtained in research studies. In these studies, we need to use instruments that are clinically meaningful and responsive to changes in a patient’s health. The Outcome Measures in Rheumatology (OMERACT) Patient Reported Outcomes (PRO) working group evaluated this issue at the OMERACT 10 meeting. This work is part of a larger PRO initiative that includes selecting domains and choosing instruments that are responsive to change. The outcome measures used in PRO studies are often expressed as continuous data at the group level (e.g., mean change in pain on a 0–100 visual analog scale, VAS). This is difficult to interpret and cannot easily be translated to the individual level of response. To better understand the results of clinical trials, continuous variables can be translated into dichotomous variables, such as “therapeutic success (yes/no).” However, in order to turn a continuous variable into a dichotomous variable, the cut-off used for the dichotomization must be clinically relevant. Information from clinical trials on how many patients showed a response, what the level of response was, and how many patients are doing well, would be extremely useful for physicians. This article aims to address these questions.
When interpreting scores at the individual level, 4 main concepts need to be taken into account: Improvement, status of well-being, onset of action, and sustainability. Improvement (to feel better) can be measured using the minimally clinically important difference (MCID) [or minimally important difference (MID)] or minimally clinically important improvement (MCII). MCID is defined as “the smallest difference in a score that is considered to be worthwhile or important.” This necessitates a clinical change in a patient’s health status1. MCII is similar; however, it is only concerned with positive improvements. With the development of more effective therapies in RA over the past decade, mean improvements within treatment groups have generally exceeded MCID, so that twice or 3 times the MCID in Health Assessment Questionnaire (HAQ) scores have frequently been reported. This has prompted definitions of “really important improvements”: for example, “moderate improvement,” which is defined as an improvement of greater than 30%, or “substantial improvement,” defined as an improvement greater than 50%.
Achievement of a status of well-being (to feel good) can be broken down into 3 separate states: the patient acceptable symptoms state (PASS); the low or minimal disease activity state (LDAS); and attainment of a normative state, such as attainment of the goal of the therapy or remission (Figure 1).
The PASS is defined as a state in which patients consider their condition to be satisfactory or acceptable, often interpreted as “feeling well.” LDAS is an intermediate state between activity of the disease and complete remission. Minimal or low disease activity (MDA) has been defined as the state of disease activity deemed a useful target of treatment by both the patient and the physician, given current treatment possibilities and limitations2,3. The attainment of “normative” values refers to whether a patient is able to attain the goals of therapy or alternatively enter into remission.
The third concept refers to the onset of action (to feel good/better as soon as possible), or time taken to achieve the therapeutic success. The fourth concept refers to sustainability (to feel good/better for as long as possible), or the duration of the therapeutic success.
The objective of this article is to outline how continuous data may be reported in a clinically relevant manner. We describe 5 techniques of reporting continuous variables in clinical studies and discuss the relevance of each of these.
TECHNIQUES OF REPORTING SCORES AT THE INDIVIDUAL LEVEL
Continuous data can be reported in a number of ways. The 5 main techniques are defined below:
- Technique A, the conventional technique, presents data at the group level (i.e., mean and standard deviation).
- Technique B presents data at an individual level and requires that a level of change, or response, is defined in a continuous dataset above which the patient can be categorized as a therapeutic success or a responder. Thus we need to know the minimum clinically important difference or minimum clinically important improvement.
- Technique C also presents data at the individual level but instead defines a state of being. This represents the level of the continuous variable below which a patient can be categorized as a therapeutic success, e.g., PASS or LDAS or remission.
- Technique D takes into account the concept of “onset of action.” Data can be expressed as either the percentage of patients achieving therapeutic success or the median time to achieve therapeutic success. This technique could use life-table analysis. In this analysis, the “event” is defined as the first visit during which the therapeutic success is observed, regardless of the values observed during consecutive visits.
- Technique E takes into account the concept of sustainability. This technique could also use life-table analysis. In this analysis, the event is again defined by the first visit during which the therapeutic success is observed but, unlike in technique D, the success must also be maintained during consecutive visits. Alternatively, one can describe the characteristics of the time spent in the state of success (e.g., mean, minimum, maximum, etc.) or assess the number of transitions in and out of the state. The extent to which a period is uninterrupted can be weighted positively. An intuitive method to do this is the ConRew (continuity rewarded) score4.
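The two event definitions used for techniques D and E can be illustrated with a minimal Python sketch. It assumes each patient is represented by a sequence of per-visit success flags; the function names, the `run_length` parameter, and the choice to date a sustained success at the first visit of the run are assumptions made for illustration, not definitions taken from the cited literature.

```python
from typing import Optional, Sequence


def first_success_visit(success: Sequence[bool]) -> Optional[int]:
    """Technique D event: index of the first visit at which success is observed,
    regardless of what happens at later visits. None means no event (censored)."""
    for visit, ok in enumerate(success):
        if ok:
            return visit
    return None


def first_sustained_success_visit(success: Sequence[bool], run_length: int) -> Optional[int]:
    """Technique E event: index of the first visit that starts `run_length`
    consecutive successful visits (hypothetical definition of 'sustained')."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    streak = 0
    for visit, ok in enumerate(success):
        streak = streak + 1 if ok else 0
        if streak == run_length:
            return visit - run_length + 1
    return None


# Hypothetical patient: success at visit 1, lost at visit 2, then maintained.
visits = [False, True, False, True, True, True]
print(first_success_visit(visits))               # 1
print(first_sustained_success_visit(visits, 3))  # 3
```

The resulting event times (or censoring indicators) would then be the input to a life-table or similar time-to-event analysis.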
Technique A
The conventional techniques of reporting data using the mean and standard deviation have been discussed elsewhere in detail.
Technique B1. Minimum Clinically Important Differences
MCID refers to the degree of improvement in PRO that would be perceptible to patients, on an individual basis, and would be considered clinically meaningful to them1,5,6,7. MCID is defined as “the smallest difference in score that patients perceive as beneficial and which would mandate a change in the patient’s management”8. Although these definitions are relevant only on an individual patient basis, when mean changes within a treatment group exceed such a value, it can be estimated that the majority of the group will have attained clinically important improvements. Alternatively, the percentage of patients who report improvements meeting or exceeding MCID is another way to indicate the clinical meaningfulness of treatment associated changes, and to translate group data to the level of individual impact. Next, a few scales commonly used in rheumatology trials will be presented with the defined MCID values.
Pain and global assessment of disease activity by VAS
It is generally accepted that a 10-point change on a 0–100 VAS corresponds to MCID for pain and global assessments.
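As a hedged illustration of technique B, the sketch below dichotomizes change on a 0–100 VAS using the 10-point MCID mentioned above and contrasts the group mean change with the proportion of responders. The patient data are invented for the example, and lower scores are assumed to mean less pain.

```python
from typing import List, Tuple

MCID_VAS = 10.0  # 10-point change on a 0-100 VAS, as stated in the text


def improvement(baseline: float, follow_up: float) -> float:
    """Positive values mean the score decreased (assumed to be an improvement)."""
    return baseline - follow_up


def responder_rate(pairs: List[Tuple[float, float]], mcid: float = MCID_VAS) -> float:
    """Proportion of patients whose improvement meets or exceeds the MCID."""
    responders = sum(1 for b, f in pairs if improvement(b, f) >= mcid)
    return responders / len(pairs)


# Hypothetical (baseline, follow-up) pain scores for 4 patients.
patients = [(60, 45), (70, 65), (50, 40), (80, 82)]
mean_change = sum(improvement(b, f) for b, f in patients) / len(patients)
print(mean_change)               # 7.0: group mean is below the MCID
print(responder_rate(patients))  # 0.5: yet half of the patients responded
```

The example shows why a group mean alone can hide the individual picture: a mean change below the MCID is compatible with a substantial proportion of responders.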
Western Ontario McMaster Universities Osteoarthritis Index Questionnaire (WOMAC)
Definitions of MCID are generally obtained by linking changes in PRO to improvements (or worsening) in anchored scores such as patient-reported global disease activity (VAS or Likert scales) or Guyatt feeling thermometer9,10. Following a variety of treatments in OA, including nonselective nonsteroidal antiinflammatory drugs (nsNSAID), cyclooxygenase 2 (COX-2) inhibitors, and physical therapy, it was demonstrated that a change of about 10 points on a 0–100 point VAS corresponds to MCID for total WOMAC and WOMAC pain and physical function subscores11,12,13,14,15. Functionally useful pain relief is frequently associated with improvements in physical function (so that function may be increased until limited by pain), and composite measures of response in chronic pain conditions such as osteoarthritis (OA), fibromyalgia (FM), and low back pain have included both pain and physical function, measured by WOMAC, Fibromyalgia Impact Questionnaire, Roland Morris Disability Questionnaire (RMDQ), or Medical Outcome Study Short Form-36 (SF-36) physical component score (PCS) as components.
Health Assessment Questionnaire Disability Index (HAQ DI)
The HAQ DI queries patients’ ability to perform activities of daily living as well as the need for help, or use of functional aids. Although a generic measure, it has predominantly been utilized in randomized controlled trials (RCT) in rheumatoid arthritis (RA), and has been shown to perform better in RA and gout than in psoriatic arthritis (PsA) or systemic lupus erythematosus (SLE)16,17,18. It is generally believed that the HAQ DI has its greatest item information content in more disabled populations and SF-36 physical function domain in more normalized ones, and that floor effects are more common with HAQ DI than SF-36 physical function domain. An improvement of −0.22 is considered to represent MCID in HAQ-DI5,6,19. As with all PRO, changes in HAQ have differentiated active from placebo treatment across recent RCT in RA20,21,22. It has also been recognized that improvements in physical function, measured by HAQ, are based on baseline status; higher scores prior to treatment reflect disease duration/severity and an “irreversible” component due to deformities or other irreversible reasons for loss of physical function, including muscle weakness, etc. Thus, HAQ has been recognized as a measure of “state” as well as “change”23. It should, however, be noted that the HAQ has scaling problems and is not in reality a linear or interval measure. Shorter versions of HAQ include the modified HAQ (MHAQ), Multidimensional HAQ (MDHAQ) among others24. MCID is generally recognized to be −0.25 for these instruments18.
Roland Morris Disability Questionnaire
In low back pain a change of −5 points on the RMDQ-24 scale has been defined as MCID25,26.
Functional Assessment of Chronic Illness Therapy Fatigue Scale (FACIT) and Fatigue VAS
FACIT-F scores range from 0 to 52, with higher scores representing less fatigue. The instrument has been validated in the general population and in patients with RA. MCID for FACIT-F in RA was determined to be ≥ 4-point change from baseline27. Fatigue has also been assessed using a standard VAS scale, again with MCID considered to be ≥ 10 point improvement.
Health related quality of life (HRQOL): SF-36
Although a generic instrument, SF-36 has been validated in multiple languages and is sensitive to change across a wide variety of clinical conditions, including rheumatic diseases: RA, OA, SLE, systemic sclerosis (SSc), PsA, ankylosing spondylitis (AS), FM, and gout28,29,30,31,32,33. Further, comparisons to age- and gender-matched normative values in US, UK, Scandinavian countries, the Netherlands, and Turkey are available. Values for MCID in domain and summary scores (PCS and mental component score, MCS) of SF-36 have been derived, based on correlations with patient reported improvements in global disease activity or condition, on an individual patient basis. For example, Kosinski and Ware compared changes in HAQ DI and SF-36 domains and summary scores with patient global assessments and pain in 2 RCT comparing COX-2 selective agents to traditional NSAID in active RA and OA12,34,35,36. Mean changes in SF-36 domain scores corresponding to one point of improvement in patient global assessment of disease activity (by Likert) or 10 mm improvement (i.e., by VAS) ranged from 4.2 to 21.0 and 1.9 to 10.8, respectively.
Thumboo, Strand, Khanna, Singh, Choy, and others have shown that changes of 5–10 points in domain scores and 2.5–5 points in PCS and MCS summary scores can be considered to represent MCID in OA, RA, SLE, FM, and gout, based on correlations with Guyatt feeling thermometer and/or patient global assessments of disease activity11,12,13,14,15,18,23,37,38,39,40,41,42,43,44.
Values for MCID for deterioration are: −2.5 to −5.0 points in domain and −0.8 in PCS and MCS scores, indicating that patients perceive worsening with smaller changes than improvement12.
Another interpretation of clinical meaningful improvements is to focus on key questions in SF-36 to perform content-based analysis of HRQOL changes. Examples include assessing the percentage of subjects reporting improvements ≥ MCID in performance of specific tasks or in answer to certain questions, such as climbing flights of stairs or walking a mile44.
B2. Minimally Important Differences
Alternatively, statistical definitions, such as changes ≥ 0.5 standard deviations of the mean baseline, can be used to reflect minimally important differences (MID), which are not specifically anchored to PRO45. As data have accumulated, it has become apparent that SF-36 MCID and MID values closely correspond and are remarkably consistent across disease states.
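The distribution-based rule just described (0.5 standard deviations of the baseline scores) can be sketched in a few lines of Python. The baseline values are invented, and the use of the sample (n − 1) standard deviation is an assumption; the text does not say which estimator is intended.

```python
import statistics
from typing import Sequence


def distribution_based_mid(baseline_scores: Sequence[float], fraction: float = 0.5) -> float:
    """MID estimated as a fraction (default 0.5) of the baseline standard deviation.
    Uses the sample standard deviation, which is an assumption of this sketch."""
    return fraction * statistics.stdev(baseline_scores)


# Hypothetical baseline scores on a 0-100 scale.
baseline = [40, 50, 60, 70, 80]
mid = distribution_based_mid(baseline)
print(round(mid, 2))  # 7.91
```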
Health Utilities Index Mark 3 (HUI3)
HUI3 is a validated generic HRQOL measure46,47,48. It includes a health-status classification system and preference-based scoring formula based on 8 attributes (vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain). There are 5 rating levels for speech, emotion, and pain [ranging from 1 (able to be understood completely, being pain free, etc.) to 5 (unable to be understood, severe pain, etc.)], and 6 rating levels for vision, hearing, ambulation, dexterity, and cognition (where 1 is positive and 6 is negative). Global HRQOL is calculated by translating these categorical data into a single attribute and overall utility scores (ranging from 0 = death to 1.00 = perfect health). The construction of the scale is one of preference or desirability. Score changes of 0.03 are considered to represent MID49.
EuroQol (EQ-5D)
The EuroQol Group designed EQ-5D to be a simple, self-administered questionnaire measuring HRQOL and health utilities. It is valid, reproducible, and sensitive to change in RA, and has been translated into most major languages and cross culturally validated across a variety of clinical indications50,51. It consists of: (a) descriptive profile consisting of 5 dimensions, namely, mobility, self-care, usual activities, pain/discomfort and anxiety/depression, with each dimension rated from 1–3: 1 = no problems; 2 = moderate problems; and 3 = extreme health problems; and (b) a current general health status index, measured by a 20 cm VAS (EQ-VAS) with endpoints labeled “best imaginable health state” and “worst imaginable health state,” anchored at 100 and 0, respectively. EQ-5D scores have been shown to differ according to disease duration, history of disease modifying antirheumatic drug use, and presence of probable depression or anxiety, in RCT as well as longitudinal observational studies (LOS) in RA47,52,53.
SF-6D
Brazier, et al developed the SF-6D, a preference-based measure of health utilities that used individual item responses from 11 questions of the SF-36 to derive 6 domain classifications of health states, generating 18,000 health states in total54,55. Scores range from 0.296 to 1 in which 0.296 = maximum impairment in all 6 domains and 1 = full health. Recently, a newer method for deriving SF-6D utilizing all 8 domain scores of SF-36 has been developed from published group mean data and validated against the traditional calculation method as well as EQ-5D in a cohort of 6350 patients with various diagnoses including OA56,57.
MID values for EQ-5D and HUI3 were 0.05 and 0.06, respectively, in a cohort of 222 RA patients with mean age 62 years, disease duration 14 years and baseline HAQ scores of 1.1, assessed 6 months apart45. Similarly, MCID or MID is 0.048 for SF-6D calculated using 6 domains50,53,54,58 and 0.041 for the recent revision of SF-6D, which utilizes all 8 SF-36 domains55,56.
B3. Minimum Clinically Important Improvements
Others argue that MCID, and certainly clinically meaningful improvement, should be defined not only by an absolute amount of change but also relative to baseline. This implies change in a positive direction that is more than minimally perceptible. Thus in patients with knee OA, both Tubach’s definition of MCII59 (reduction in the WOMAC physical function subscale of ≥ 26%) and achievement of a PASS59 (WOMAC physical function subscale score < 31) would be required. This definition more closely corresponds to the OMERACT/OARSI definition of responders in OA, as determined by WOMAC60. The OMERACT-OARSI criteria consider a patient to be a strict responder if improvement from baseline in the WOMAC pain or physical function subscore is ≥ 50% with an absolute change ≥ 20 mm. A responder is a patient who reports improvements from baseline ≥ 20% with absolute changes ≥ 10 mm in 2 of 3 measures: WOMAC pain subscore, WOMAC physical function subscore, and patient global assessment of disease activity.
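The responder rules in the preceding paragraph can be written out as a short Python sketch. It assumes all three measures are expressed on a 0–100 mm scale on which lower is better; the class and function names are hypothetical, and the code follows only the thresholds stated in this article, not the full published criteria.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    """Baseline and follow-up score on an assumed 0-100 mm scale (lower is better)."""
    baseline: float
    follow_up: float

    def absolute_change(self) -> float:
        return self.baseline - self.follow_up

    def relative_change(self) -> float:
        # Guard against division by zero for a symptom-free baseline.
        return self.absolute_change() / self.baseline if self.baseline > 0 else 0.0


def meets(m: Measure, min_relative: float, min_absolute: float) -> bool:
    return m.relative_change() >= min_relative and m.absolute_change() >= min_absolute


def classify(pain: Measure, function: Measure, patient_global: Measure) -> str:
    # Strict responder: >= 50% and >= 20 mm improvement in pain or function.
    if meets(pain, 0.50, 20) or meets(function, 0.50, 20):
        return "strict responder"
    # Responder: >= 20% and >= 10 mm improvement in at least 2 of the 3 measures.
    met = sum(meets(m, 0.20, 10) for m in (pain, function, patient_global))
    return "responder" if met >= 2 else "non-responder"


print(classify(Measure(60, 25), Measure(55, 50), Measure(50, 45)))  # strict responder
print(classify(Measure(50, 38), Measure(60, 45), Measure(50, 48)))  # responder
print(classify(Measure(50, 45), Measure(60, 45), Measure(50, 48)))  # non-responder
```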
In low back pain, OMERACT response is defined as ≥ 30% improvement in pain and ≥ 30% improvement in patient global assessment, and no worsening of physical function (change ≤ 2)61.
B4. Moderate, Substantial and Really Important Improvements
With the development of more effective therapies in RA over the past decade, mean improvements within treatment groups have generally exceeded MCID, so that improvements of twice or 3 times the MCID in HAQ scores have frequently been reported. This has prompted definitions of really important improvements62: e.g., changes in HAQ DI of −0.50 and −0.75.
This is consistent with the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) recommendations that improvements in pain be assessed not just by MCID but as moderate (≥ 30%) or substantial (≥ 50%) improvements63,64. These have also been applied to WOMAC pain and physical function sub-scores and are consistent with the OMERACT/OARSI Strict Responder definition.
Technique C. Attainment of Acceptable States
During the past decade, considerable work has been done to propose clinically relevant definitions of MCII, PASS, LDAS, and remission. Cutoff values in a continuous variable that define a clinically relevant improvement or an acceptable status have been derived through different methodologies, as described below:
C1. Patient Acceptable States
PASS values describe attainment of a state rather than a measure of change, meaning that patients consider the level reached to be acceptable. They are based on changes observed in RCT in OA and longitudinal observational studies in RA and have been defined as58,65:
- WOMAC Physical Function score: < 31
- HAQ (RA): 1.04
- Patient global VAS (RA): 36
- Pain VAS (RA): 34
- Fatigue VAS (RA): 50
- SF-36 domains - physical function: 50; role physical: 41; general health: 45; vitality: 40; social functioning: 75; mental health: 68.
C2. Low Disease Activity
This concept is similar to that of PASS but generally refers to a composite index evaluating the activity of a disease and comprising several domains. In the field of rheumatology, we have the examples of the DAS28-ESR (Disease Activity Score 28-erythrocyte sedimentation rate) for RA, and AS-DAS-CRP (C-reactive protein) for AS. These composite indices are usually presented as continuous variables, but several methodologies have been used to propose a threshold below which the condition of the patient is considered as acceptable. For example, the thresholds are DAS28-ESR < 3.2 in RA and AS-DAS-CRP < 2.1 in AS.
C3. Attainment of Normative Values
HAQ scores have decreased over time in longitudinal observational studies of newly diagnosed patients, and are generally higher, with larger potential to show a reduction, in subjects with active RA accrued into RCT. Another high bar for comparison has therefore been to assess the proportion of subjects who attain normative values of HAQ-DI scores (≤ 0.5)66.
SF-36 PCS and MCS scores are derived from z transformed and normative based data, such that 50 is considered the norm with SD of 10. More informative, however, is to compare individual domain scores at baseline and endpoint with age/gender matched normative data in subjects without arthritis, as a goal for treatment and a means to assess treatment associated improvements42.
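The norm-based scaling described above (norm of 50, SD of 10) corresponds to a linear transformation of a z-score. The Python sketch below shows only that final step; the reference mean and SD are invented, and the actual PCS/MCS derivation (which aggregates weighted domain z-scores) is not reproduced here.

```python
def norm_based_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Rescale a raw score so that the reference population has mean 50 and SD 10."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return 50.0 + 10.0 * z


# Hypothetical reference population: mean 50, SD 20 on the raw scale.
print(norm_based_score(20, 50, 20))  # 35.0, i.e., 1.5 SD below the norm
print(norm_based_score(50, 50, 20))  # 50.0, i.e., exactly at the norm
```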
MCII and PASS
Once such cutoffs had been proposed, studies were conducted to answer questions such as:
- Should MCII and PASS be treatment-specific, or the same regardless of the treatment being evaluated?
- What is the relation between MCII and PASS?
- What is the impact of various parameters (e.g., country location, gender, age) on the MCII and PASS estimates?
Determining treatment-specific MCII and PASS values [i.e., PASS for evaluating NSAID therapy vs PASS for evaluating anti-tumor necrosis factor-α (anti-TNF-α)] allows for taking into account the different levels of patients’ expectations for the treatment. Actually, whether patients consider a state (or a change) satisfactory independently of the treatment they receive (i.e., whether the PASS values are related to patients’ expectations of the treatment) is not known. One may hypothesize, for instance, that patients expect stronger effects from a TNF-α antagonist than from NSAID therapy and thus would consider a lower level of symptoms as satisfactory with TNF-α antagonist therapy. This point should be investigated in further studies. The drawback of using treatment-specific PASS or MCII values is that these values should be regularly updated as treatment options and knowledge and expectations about them evolve.
Concerning the potential relation between MCII and PASS, the relative meaning of the MCII and PASS is unknown. Whether the concept of improvement or remission or both should be recommended was discussed, and this point was addressed in a survey following the special interest group session, reported below. The results on how the MCII and PASS interrelate in a study of hip and knee OA and acute rotator cuff syndrome were presented59. The MCII appears to be the needed change to achieve the PASS. It seems that patients consider that they experienced an important improvement only if this improvement allows them to achieve a satisfactory state. Consequently, it seems that what is important to patients is to feel acceptable or satisfactory (concept of PASS) rather than to feel better (concept of MCII).
At the disease level (e.g., disease activity of RA), several studies suggest also that the status after therapy is more important not only for the patient but also for reducing the risk of subsequent structural damage.
One could argue that the MCII and PASS estimates can vary according to different parameters. In the hip or knee OA study, the MCII was shown to vary greatly across tertiles of baseline scores and age. This impact of the baseline level of symptoms was only partly reduced when using relative change instead of absolute change. Gender and disease duration did not appear to affect the MCII value. The impact of the baseline value had been previously demonstrated in low back pain, in which the MCID varied between 3 and 13, depending on the baseline range of scores (from the Roland Morris Back Pain Questionnaire). Patients dealing with the most severe symptoms have to experience a greater change to consider themselves improved. Where the data on responsiveness are available, it may be possible to adjust or stratify MCII and PASS values for age and gender.
In the hip or knee OA study, the PASS was more constant across tertiles of the baseline score than the MCII; and age, gender and disease duration did not affect the PASS value. An important aspect of any desirable state is the time spent in that state. In the AS study31, the PASS was shown to be stable over 10 weeks. This key finding supports the use of the PASS values to describe patients achieving and maintaining such a state for a specified period of time. This point should be confirmed in a study with a longer followup.
Improvements and States
In summary, one can develop a series of degrees of improvement on a spectrum from MCID through moderate and substantial changes — all best defined for pain — to attainment of state that may be PASS or reaching age/gender matched normative levels (≤ 0.5 in HAQ, ≤ 31 in WOMAC, or age/gender matched normative values in SF-36 domain scores, as examples) (Figure 1).
In fact, PRO all conform to a spectrum ranging from minimal or just detectable to moderate, substantial, and really important differences. Reported improvements may also be looked at in parallel as attainments of state or normative values, even as goals of therapy.
The interpretation of improvements as clinically meaningful adds another dimension to statistically significant changes. In large RCT, treatment changes in HRQOL scores may be statistically different but not meaningfully different for most patients, and vice versa. These changes may also be used to calculate values for number needed to treat (NNT): the number of subjects required to receive a therapy to achieve an additional desired outcome. For example, NNT values may be based on the percentage of subjects who report improvements in PRO scores ≥ MCID, or moderate (≥ 30%) or substantial (≥ 50%) changes67. If 40% of subjects receiving active therapy versus 10% in placebo report clinically meaningful changes, then NNT would be calculated as 1/(proportion responding in active − proportion responding in control) or 1/(0.40 − 0.10) ≈ 3.3, conventionally rounded up to 4; meaning that about 4 patients must receive the treatment intervention to result in one more patient achieving “success” than would have been found with the control intervention. Similarly NNT estimates may be derived based on improvements ≥ MCID in SF-36 domain or summary scores or health utility scores based on SF-6D or EQ-5D.
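The NNT arithmetic in the preceding paragraph can be checked with a minimal Python sketch. The function name is hypothetical. Note that 1/0.30 evaluates to about 3.3; the sketch also reports the value rounded up to the next whole patient, which is a common reporting convention.

```python
import math


def number_needed_to_treat(p_active: float, p_control: float) -> float:
    """NNT = 1 / (proportion responding on active - proportion responding on control)."""
    difference = p_active - p_control
    if difference <= 0:
        raise ValueError("active response must exceed control response")
    return 1.0 / difference


# Example from the text: 40% responders on active therapy versus 10% on placebo.
nnt = number_needed_to_treat(0.40, 0.10)
print(round(nnt, 2))   # 3.33
print(math.ceil(nnt))  # 4
```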
Technique D: Onset of Action and E: Sustainability
Both the concepts of onset of action and sustainability have been recently recognized as important enough to be included in the EULAR/ACR collaborative recommendations for reporting results from clinical trials in RA68. These 2 concepts are also considered important from a patient’s perspective69. However, there is still no consensus concerning how to report the analysis of such concepts. It was recently proposed that the life-table analysis approach be used to analyze the concept of onset of action (e.g., time to reach success) and of sustainability (e.g., time to reach a sustained success). When addressing the concept of sustainability, it is also possible to use other techniques such as the ConRew scoring system. Using prespecified observation periods (e.g., every month) the continuity rewarded (ConRew) score sums up periods in remission, and rewards extended periods by placing more value on uninterrupted periods than on interrupted periods.
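To make the idea of rewarding continuity concrete, the Python sketch below scores a sequence of observation periods so that uninterrupted remission counts for more than interrupted remission. The weighting used here (1 point per period in remission, plus 1 bonus point when the previous period was also in remission) is an assumption chosen for illustration; it is not the published ConRew formula, for which the cited reference should be consulted.

```python
from typing import Sequence


def continuity_weighted_score(in_remission: Sequence[bool]) -> int:
    """Illustrative continuity-rewarding score (hypothetical weighting, not ConRew itself):
    1 point per period in remission, +1 bonus if the preceding period was also in remission."""
    score = 0
    previous = False
    for current in in_remission:
        if current:
            score += 2 if previous else 1
        previous = current
    return score


# Two hypothetical patients, each with 4 of 6 periods in remission.
uninterrupted = [True, True, True, True, False, False]
interrupted = [True, False, True, False, True, True]
print(continuity_weighted_score(uninterrupted))  # 7
print(continuity_weighted_score(interrupted))    # 5
```

Both patients spend the same number of periods in remission, but the patient with the uninterrupted run receives the higher score, which is the property the text attributes to continuity-rewarding approaches.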
CONCLUSION
The advantages of reporting data at the individual level (e.g., percentage of responders, percentage of patients in good condition) versus reporting at the group level (e.g., mean ± standard deviation) are well understood. During the last decades, a large body of work has greatly increased awareness of these concepts across the medical community. This recognition has recently been translated into the EULAR/ACR initiative aimed at providing recommendations for reporting disease activity of RA in clinical trials. These recommendations considered the 4 concepts we have discussed, i.e., the presentation of the data at the individual level (importance of reporting both the responders and the good condition), but also the concepts of onset of action and sustainability/durability.
There remains, however, some discussion about the best way to define the cutoff values, and even the nomenclature that should be employed. The US Food and Drug Administration, for some years, has required inclusion of graphs of cumulative distribution functions of all levels of responders in product labels for analgesics. The advantage of presenting data in this way is that each reader can decide for themselves which cutoff values are relevant. For many, however, this would add confusion compared to a simple statement such as “> 90% of the people taking this medication get more than 50% better.” Recently it has been argued, in the context of treating chronic pain, that the choice of cut-points depends on the purpose of the study70. Further, clinically important improvement may depend upon the balance between adverse and beneficial effects, a point made at a previous OMERACT meeting71 and reinforced by Dworkin at this one72.
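A cumulative distribution of responders of the kind mentioned above can be tabulated with a short Python sketch: for each candidate cutoff, it gives the proportion of patients whose percentage improvement meets or exceeds that cutoff. The data and the function name are invented for illustration.

```python
from typing import List, Sequence, Tuple


def cumulative_responder_curve(
    percent_improvements: Sequence[float], cutoffs: Sequence[float]
) -> List[Tuple[float, float]]:
    """For each cutoff, the proportion of patients improving by at least that percentage."""
    n = len(percent_improvements)
    return [
        (cutoff, sum(1 for p in percent_improvements if p >= cutoff) / n)
        for cutoff in cutoffs
    ]


# Hypothetical percentage improvements in pain for 8 patients (negative = worsening).
improvements = [10, 25, 35, 55, 60, 80, -5, 45]
for cutoff, proportion in cumulative_responder_curve(improvements, [0, 30, 50]):
    print(f">= {cutoff}% improvement: {proportion:.3f}")
```

Reading across the cutoffs lets each reader pick the threshold that matters for their purpose, for example the 30% (moderate) or 50% (substantial) levels discussed earlier.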
At the final plenary session after these issues were discussed, participants voted on 2 questions: 1. Do participants agree that evaluation of relevant changes has to consider the specificities of the PRO (e.g., considering additional factors such as coping, adaptation over time, etc.)? 68% voted yes (13% no, 19% don't know). 2. Do participants agree that evaluation of PRO changes such as PASS, time to response, sustainability, etc., requires further standardization? 82% voted yes (5% no, 13% don't know). Thus, a research agenda centered around 2 areas of development has been delineated: the context of disease evaluation, including social and personal adaptation; and the development of agreed and standardized definitions, or possibly approaches to deriving definitions, of the nature and significance of different changes.