Abstract
Objective. To assess the validity, responsiveness, and reliability of single-joint outcome measures for determining target joint (TJ) response in patients with inflammatory arthritis.
Methods. Patient-reported outcomes (PRO), consisting of responses to single questions about TJ global status on a 100-mm visual analog scale (VAS; TJ global score), function on a 100-mm VAS (TJ function score), and pain on a 5-point Likert scale (TJ pain score) were piloted in 66 inflammatory arthritis subjects in a phase 1/2 clinical study of an intraarticular gene transfer agent and compared to physical examination measures (TJ swelling, TJ tenderness) and validated function questionnaires (Disabilities of the Arm, Shoulder and Hand scale, Rheumatoid Arthritis Outcome Score, and the Health Assessment Questionnaire). Construct validity was assessed by evaluating the correlation between the single-joint outcome measures and validated function questionnaires using Spearman’s rank correlation. Responsiveness or sensitivity to change was assessed through calculating effect size and standardized response means (SRM). Reliability of physical examination measures was assessed by determining interobserver agreement.
Results. The single-joint PRO were highly correlated with each other and correlated well with validated functional measures. The TJ global score exhibited a moderate effect size and a moderate SRM, and correlated well with the patient’s assessment of response on a 100-mm VAS. Physical examination measures exhibited high interrater reliability, but correlated less well with validated functional measures and the patient’s assessment of response.
Conclusion. Single-joint PRO, particularly the TJ global score, are simple to administer and demonstrate construct validity and responsiveness in patients with inflammatory arthritis. (ClinicalTrials.gov identifier NCT00126724)
Outcome measures, such as the American College of Rheumatology responder criteria [1] and the European League Against Rheumatism response as measured by the Disease Activity Score [2,3], have been developed and validated to assess response to systemic agents in inflammatory arthritis. In contrast, no validated measures for assessing the response of single joints exist [4,5], despite a history of local treatment of individual refractory joints with intraarticular steroid injections [6], radioactive synovectomy [7], and surgery [8]. Now that new local therapies for refractory joints are under development, it is important that single-joint outcome measures be developed and validated as well [4,5]. These single-joint outcome measures need to be developed in parallel with local therapies for refractory joints, so that the novel therapies can be properly evaluated. Systemic outcome measures cannot be used to assess single joints, because they are not sensitive enough to evaluate changes in a single joint after local therapies.
The phase 2 portion of a phase 1/2 study of an intraarticular gene transfer agent [9] provided an opportunity to compare the utility of patient-reported outcomes (PRO) and physical examination in assessing response of a single joint to local treatment. The Outcome Measures in Rheumatology Clinical Trials (OMERACT) filter [10,11] was applied to assess truth (face validity, construct validity), discrimination (reliability, responsiveness), and feasibility of these measures.
MATERIALS AND METHODS
Study setting and subjects
Single-joint outcome measures were piloted and evaluated in the phase 2 portion of a multicenter, randomized, double-blind, placebo-controlled trial of intraarticular administration of a gene transfer agent in subjects with rheumatoid arthritis (RA), psoriatic arthritis (PsA), or ankylosing spondylitis (AS), and persistent moderate or severe inflammation in the knee, ankle, elbow, wrist, or metacarpophalangeal (MCP) joint [9]. In the parent study (ClinicalTrials.gov identifier NCT00126724), subjects were randomized to 1 of 3 doses of the gene transfer agent or placebo. The assessments outlined below were completed by study subjects and study staff without knowledge of treatment assignment. The research was conducted in compliance with the Helsinki Declaration. The trial was approved by the Institutional Review Board at each participating site.
Patient-reported outcomes
Subjects completed validated questionnaires to assess target joint (TJ) function, consisting of the Disabilities of the Arm, Shoulder and Hand (DASH) scale [12,13] for subjects whose TJ was in the upper extremity, and the Rheumatoid Arthritis Outcome Score (RAOS) [14,15] for subjects whose TJ was in the lower extremity. Subjects also completed the Health Assessment Questionnaire (HAQ) [16,17] as an assessment of overall function.
After completing the questionnaires, subjects responded to 3 more questions about the TJ, each on a 100-mm visual analog scale (VAS): (1) “Considering all the ways your target joint affects you, place a vertical line on the scale to show how you have been doing in the past week,” (0 = no symptoms and 10 = severe symptoms; TJ global score); (2) “To what extent has your target joint impaired (prevented) you from doing your usual activities during the past week?” (0 = no impairment and 10 = severe impairment; TJ function score); and (3) “Comparing your target joint today with how it felt just prior to your last injection, how satisfied are you with the results of the study drug injection?” (0 = not satisfied at all and 10 = very satisfied; TJ response score).
The DASH and HAQ scores were calculated according to published criteria [12,17]. For the RAOS, 3 subscales (pain, stiffness, and function) were calculated using the questions that were originally derived from the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and scored in a similar manner [18]. The response to each question was on a 5-point scale ranging from 0 = none to 4 = extreme. The pain subscale was calculated by adding the responses to 5 questions about pain, dividing by 20, and multiplying by 100, to obtain a score that ranged from 0 to 100. The stiffness subscale was calculated by adding the responses to 2 questions about stiffness, to obtain a score that ranged from 0 to 8. The function subscale was calculated by adding the responses to the 17 questions about physical function, dividing by 68, and multiplying by 100 to obtain a score that ranged from 0 to 100.
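The subscale arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's scoring software: the function names are hypothetical, and it assumes every item in a subscale has been answered (the text does not describe how missing items were handled).

```python
def sum_subscale(responses, n_items, rescale=True):
    """Sum item responses scored 0 (none) to 4 (extreme).

    With rescale=True the sum is divided by the maximum possible total
    (4 * n_items) and multiplied by 100, giving a 0-100 score.
    Assumption: all items are answered; missing data are not handled.
    """
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    if any(r not in range(5) for r in responses):
        raise ValueError("each response must be an integer from 0 to 4")
    total = sum(responses)
    return 100 * total / (4 * n_items) if rescale else total


def raos_pain(responses):
    # 5 pain items: sum / 20 * 100, range 0-100
    return sum_subscale(responses, 5)


def raos_stiffness(responses):
    # 2 stiffness items: raw sum, range 0-8
    return sum_subscale(responses, 2, rescale=False)


def raos_function(responses):
    # 17 physical function items: sum / 68 * 100, range 0-100
    return sum_subscale(responses, 17)
```

For example, a subject answering 2 ("moderate") to all 5 pain items would score 10 / 20 × 100 = 50 on the pain subscale.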
Physical examination
The TJ was evaluated by 2 independent examiners for tenderness on a 4-point scale ranging from 0 = none to 3 = severe (TJ tenderness) and swelling on a scale ranging from 0 = none to 3 = severe (TJ swelling), based on guidelines published in the Dictionary of Rheumatic Diseases [19]. The 2 examiners performed the joint assessments independently, each without knowledge of the other’s assessment.
Outcome measure selection
Prior to analysis of the data, PRO measures considered to be most clinically relevant were prospectively selected by 2 rheumatologists. Both are clinician researchers, and one cochairs the OMERACT single-joint assessment working group, which is charged with conducting research in this subject area. The outcome measures selected included the TJ global score, the TJ function score, and the TJ pain score, the last being a question about the average severity of pain in the past week that is included in both the DASH and RAOS, with responses ranging from 0 = none to 4 = extreme. Outcome measures considered to be of interest as surrogate markers of disease included TJ swelling and TJ tenderness.
Analysis
The OMERACT filter was applied to assess truth (face validity, construct validity), discrimination (reliability, responsiveness), and feasibility of the selected outcome measures [10,11]. Face validity and feasibility were assessed through item selection by the expert panel of rheumatologists as outlined above. Construct validity was assessed by comparing the results of the candidate single-joint outcome measures (TJ global score, TJ function score, TJ pain score, TJ swelling, TJ tenderness) with validated functional measures (HAQ scores; RAOS pain, stiffness, and function subscales; and DASH scores) 12 weeks after injection of a study drug using Spearman’s rank correlation coefficient. Week 12 was chosen as the evaluation timepoint because the dataset at that time had few missing datapoints and because sufficient time had passed since study entry to allow a greater range of responses; TJ swelling was required to be moderate (grade 2) or severe (grade 3) at study entry.
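As an illustration of the construct validity comparison, Spearman’s rank correlation coefficient can be computed as the Pearson correlation of the ranks, with tied values given their average rank. The sketch below uses only the Python standard library. The function names are hypothetical and the approach to ties is an assumption, since the statistical software and tie handling used in the study are not stated.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired, with at least 2 subjects")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one measure is constant")
    return cov / (var_x * var_y) ** 0.5
```

Average ranks matter here because several of the measures (TJ pain score, TJ swelling, TJ tenderness) take only 4 or 5 distinct values, so ties are unavoidable.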
The sensitivities to change for each measure were compared using both effect sizes and standardized response means (SRM). Effect size was defined as the absolute mean change from baseline to Week 12 for each measure divided by the baseline SD of that measure [20,21]. SRM was defined as the absolute mean change from baseline to Week 12 divided by the SD of the change from baseline to Week 12 [22]. Effect sizes and SRM were considered large (> 0.8), moderate (0.5 to 0.8), or small (0.2 to 0.5) [22,23]. In addition, the correlation between the patient-reported response (TJ response score) and the change from baseline for each measure was assessed using Spearman’s rank correlation coefficient.
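The two responsiveness statistics defined above differ only in the denominator, which a short Python sketch makes explicit. It is a minimal illustration under stated assumptions: the function names are hypothetical, the sample SD (n − 1 denominator) is assumed, values falling exactly on a cutoff are assigned to the higher band, and the "negligible" label for values below 0.2 is added here for completeness and does not come from the text.

```python
from statistics import mean, stdev


def effect_size(baseline, week12):
    """|mean change from baseline to Week 12| / SD of the baseline scores."""
    changes = [w - b for b, w in zip(baseline, week12)]
    return abs(mean(changes)) / stdev(baseline)


def standardized_response_mean(baseline, week12):
    """|mean change from baseline to Week 12| / SD of the change scores."""
    changes = [w - b for b, w in zip(baseline, week12)]
    return abs(mean(changes)) / stdev(changes)


def magnitude(value):
    """Classify an effect size or SRM using the cutoffs given in the text."""
    if value > 0.8:
        return "large"
    if value >= 0.5:
        return "moderate"
    if value >= 0.2:
        return "small"
    return "negligible"  # assumed label; the text defines no band below 0.2
```

Because the effect size is scaled by between-subject variability at baseline and the SRM by variability in the change itself, the same measure can fall into different bands on the two statistics. This is the case for TJ swelling, where entry criteria restricted the baseline range.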
Reliability of the physical examination measures of TJ swelling and TJ tenderness was determined by assessing interobserver agreement. The proportion of observations with complete agreement, defined as both observers assigning an identical grade for swelling or tenderness, was determined at each timepoint after the first injection of the study drug. In addition, weighted κ coefficients were calculated.
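Both agreement statistics can be sketched for a pair of examiners grading the same joints on the 0 to 3 scale. The weighting scheme used for κ in the study is not stated, so linear weights (disagreement penalized in proportion to the number of grades apart) are assumed here, and the function names are hypothetical.

```python
def complete_agreement(rater_a, rater_b):
    """Proportion of joints given an identical grade by both examiners."""
    pairs = list(zip(rater_a, rater_b))
    return sum(a == b for a, b in pairs) / len(pairs)


def weighted_kappa(rater_a, rater_b, n_categories=4):
    """Cohen's weighted kappa with linear weights (an assumed choice).

    Grades must be integers from 0 to n_categories - 1.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be paired and non-empty")
    n, k = len(rater_a), n_categories
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[a][b] += 1
    # Marginal proportion of each grade for each examiner.
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 for agreement, 1 for 3 grades apart
            observed += weight * counts[i][j] / n
            expected += weight * row[i] * col[j]
    if expected == 0:
        raise ValueError("kappa is undefined when both examiners use one grade")
    # 1 minus the ratio of observed to chance-expected weighted disagreement.
    return 1 - observed / expected
```

Unlike the proportion of complete agreement, weighted κ corrects for agreement expected by chance and gives partial credit when examiners differ by a single grade. This is why the two statistics can diverge when most joints share the same grade.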
RESULTS
Study population
Single-joint outcome measures were piloted in 66 subjects in the study, including 55 (83%) with RA, 9 (14%) with PsA, and 2 (3%) with AS. The mean age of the 51 women and 15 men was 53.3 years, with a range of 22 to 76 years. The TJ included 20 knees (30%), 14 ankles (21%), 18 wrists (27%), 9 MCP joints (14%), and 5 elbows (8%). The construct validity and responsiveness of the proposed single-joint outcome measures were assessed in 61 subjects with complete data at baseline and Week 12. The HAQ was administered to all 61 subjects. The RAOS was administered to 31 subjects whose TJ included 18 knees and 13 ankles. The DASH scale was administered to 30 subjects whose TJ included 16 wrists, 9 MCP joints, and 5 elbows. The reliability of the physical examination measures was assessed in 63 subjects whose TJ was assessed by 2 independent examiners on the same day at 1 or more times.
Construct validity
The distributions of the proposed single-joint measures at baseline and Week 12 are shown in Figures 1, 2, and 3. The TJ global score and TJ function score were fairly evenly distributed at both baseline and Week 12, without ceiling or floor effects (Figure 1). TJ swelling was either moderate or severe at baseline as per study entry criteria, with the exception of 1 subject whose swelling decreased from moderate to mild between screening and baseline (Figure 3A). TJ swelling became more normally distributed by Week 12. Similarly, TJ tenderness (Figure 3B) and the TJ pain score (Figure 2) were more severe at baseline than at Week 12.
The correlation between single-joint outcome measures and validated functional measures is shown in Table 1. The single-joint PRO (TJ global score, TJ function score, TJ pain score) were highly correlated with each other, with Spearman’s rank correlation coefficients ranging from 0.76 to 0.90 at Week 12, all with high levels of statistical significance. The single-joint PRO also correlated well with the RAOS and moderately well with the DASH and HAQ. Physical examination measures (TJ swelling, TJ tenderness) correlated well with each other (r = 0.58, p < 0.0001), but TJ tenderness was better correlated with single-joint PRO and validated functional measures than TJ swelling. TJ swelling did not correlate with single-joint PRO or validated functional measures. Of note, the RAOS pain, stiffness, and function subscales were highly correlated with each other, with Spearman’s rank correlation coefficients ranging from 0.78 to 0.87 at Week 12, all with p < 0.0001 (data not shown).
Responsiveness
Measurements of responsiveness or sensitivity to change are presented in Table 2. TJ swelling had a large effect size and moderate SRM. The RAOS stiffness subscale had a moderate effect size and a large SRM. TJ global score, TJ pain score, and RAOS pain subscale had both a moderate effect size and a moderate SRM. TJ tenderness and the RAOS function subscale had either a moderate effect size or moderate SRM. The HAQ and DASH scale had both a small effect size and a small SRM. Correlation with the TJ response score was strongest for the TJ global score among all joints (r = −0.37, p = 0.01), and for the DASH among upper extremity joints (r = −0.42, p = 0.04).
Reliability
The reliability of physical examination measures, as assessed by the proportion of observations with complete agreement, is shown in Figure 4. Interobserver reliability was very good, with rates of complete agreement between assessors ranging from 67% to 78% for TJ swelling and 75% to 84% for TJ tenderness at 7 timepoints. Major disagreements (a difference of 2 or more points) were noted for TJ swelling at rates of 0% (3 timepoints) to 5% (1 timepoint) and for TJ tenderness at rates of 0% (1 timepoint) to 3% (3 timepoints). Weighted κ coefficients ranged from 0.45 (Week 0) to 0.87 (Weeks 18 and 24) for TJ swelling and from 0.61 (Week 0) to 0.75 (Week 24) for TJ tenderness.
DISCUSSION
The need to develop single-joint outcome measures has become more pressing, as more local therapies for refractory joints are under development [4,5]. Ideally, a single-joint outcome measure would be simple to administer and applicable to all joints. The DASH and RAOS have been developed to assess upper and lower extremities, respectively, but are cumbersome, and neither can be applied to all types of joints.
The single-joint outcome measures (TJ global score, TJ function score, and TJ pain score) evaluated here are simple to administer. Their face validity and feasibility were judged to be good, based on their selection by the expert panel of rheumatologists. The 3 single-joint outcome measures are highly correlated, suggesting that one may be used in lieu of the others. In comparing the 3 proposed single-joint outcome measures, the TJ global score may prove to be most useful. It demonstrated good construct validity, as seen by the high correlation with the HAQ, RAOS subscales, and DASH score, and good responsiveness, as demonstrated by the moderate effect size and moderate SRM.
Single-joint physical examination measures (TJ swelling, TJ tenderness) were very reliable, with high rates of interobserver agreement, but they did not correlate as well with validated functional measures as the single-joint PRO. The effect sizes for TJ swelling and TJ tenderness were large and moderate, respectively, with SRM that were moderate and small, respectively, but changes in these physical examination measures did not correlate well with the patient’s assessment of response (TJ response score). Physical examination and PRO measure different features of affected joints, and so are unlikely to be highly correlated. Both have their advantages and limitations, and should be used in parallel.
The lack of a “gold standard” for single-joint response hampers the evaluation of prospective single-joint measures in this study. Ideally, the prospective single-joint measures would be compared with a measure of biologic activity. Magnetic resonance imaging (MRI) scans were planned as part of this clinical study protocol, but too few MRI scans were conducted to allow for a meaningful comparison. The rheumatologists considered TJ swelling to be a surrogate for disease activity, but TJ swelling did not correlate well with the patient’s symptoms. The candidate single-joint measures were compared with the patient’s assessment of response, as recorded on a 100-mm VAS as the TJ response score. However, the TJ response score did not meet criteria of face validity, as the expert panel of rheumatologists felt that patients would have difficulty remembering what their TJ felt like 12 weeks previously.
Based on this preliminary assessment, single-joint PRO, specifically the TJ global score, show promise as valid measures of response in the single joint. However, additional work is required to establish the best single-joint outcome measure. Input from patients with inflammatory arthritis should be obtained to increase the face validity of prospective single-joint outcome measures. Similar to the RAOS pain, stiffness, and function subscales, the TJ global score, TJ function score, and TJ pain score were highly correlated. Input from patients with inflammatory arthritis may provide valuable insight into distinguishing these measures, or concluding that the concepts are indistinguishable for a single joint from the patient perspective. In addition, there may be room for improvement in the measures themselves. For example, the TJ pain score was based on the response on a 5-point Likert scale to a question embedded within DASH and RAOS, since that was what was available in this study. A pain score based on a 100-mm VAS, similar to the TJ global score and TJ function score, may prove to have better metrics.
Additional work is also required to increase the reliability and responsiveness of the prospective outcome measures. The test-retest reliability should be assessed by administering the measures at 2 different times in close proximity. If no suitable “gold standard” of response can be identified, it would be helpful to apply these measures after administration of a local treatment that is known to be effective, rather than evaluating them in the context of a clinical study, where it is not known if the intervention is effective. Although a simple patient-reported outcome is desirable for feasibility, a composite measure incorporating both a patient-reported outcome and physician assessment of inflammation may prove to be more useful in determining single-joint response.
Acknowledgments
The 13G01 Study Team: Philip J. Mease, Nathan Wei, Edward J. Fudman, Alan Kivitz, Joy Schechtman, Robert G. Trapp, Kathryn Hobbs, Maria Greenwald, Antony Hou, Stephen Bookbinder, Galen Graham, Craig Wiesenhutter, Larry Willis, Eric Ruderman, Joseph Z. Forstot, Michael Maricic, Charles Pritchard, Kathryn Dao, Francis Burch, Darrell Fiske, and Malin Prupas.
Footnotes
- Supported by Targeted Genetics Corporation.
- Accepted for publication December 18, 2009.