Abstract
Objective. To determine the extent to which instruments that measure core outcome domains in acute gout fulfill the Outcome Measures in Rheumatology (OMERACT) filter requirements of truth, discrimination, and feasibility.
Methods. Patient-level data from 4 randomized controlled trials of agents designed to treat acute gout and 1 observational study of acute gout were analyzed. For each available measure, construct validity, test-retest reliability, within-group change using effect size, between-group change using the Kruskal-Wallis statistic, and repeated measures generalized estimating equations were assessed. Floor and ceiling effects were also assessed and minimal clinically important difference was estimated. These analyses were presented to participants at OMERACT 11 to help inform voting for possible endorsement.
Results. There was evidence for construct validity and discriminative ability for 3 measures of pain [0 to 4 Likert, 0 to 10 numeric rating scale (NRS), 0 to 100 mm visual analog scale (VAS)]. Likewise, there appears to be sufficient evidence for a 4-point Likert scale to possess construct validity and discriminative ability for physician assessment of joint swelling and joint tenderness. There was some evidence for construct validity and within-group discriminative ability for the Health Assessment Questionnaire as a measure of activity limitations, but not for discrimination between groups allocated to different treatment.
Conclusion. There is sufficient evidence to support measures of pain (using Likert, NRS, or VAS), joint tenderness, and swelling (using Likert scale) as fulfilling the requirements of the OMERACT filter. Further research on a measure of activity limitations in acute gout clinical trials is required.
At the 11th Outcome Measures in Rheumatology (OMERACT) meeting, held in May 2012, the focus of the Gout Module was to obtain endorsement of specific instruments that measure each of the 5 core domains identified at OMERACT 9 as key outcomes in acute gout trials1. To assist participants in determining whether specific instruments met the OMERACT filter of truth, discrimination, and feasibility necessary for adequate technical performance of outcome instruments, we aimed to calculate the key psychometric properties from recent trials or observational studies of acute gout.
MATERIALS AND METHODS
Patient-level data were generously provided by Merck Sharp & Dohme Corp. (MSD), Novartis, Pfizer, and Regeneron concerning 4 trials of treatment with etoricoxib, canakinumab, celecoxib, and rilonacept, respectively. Treatment allocation was not made available for the canakinumab study (Novartis) because trial results were in publication at the time of this analysis2, nor was it made available for the etoricoxib (MSD) dataset. In addition, data from a small observational cohort study of acute gout were provided by Professor Keith Rome (Auckland University of Technology)3. The key characteristics of each study are shown in Tables 1 and 2. Note that all studies were active-controlled, although the celecoxib study included an arm with a lower than recommended dose of celecoxib. These studies were pragmatically selected on the basis of availability of patient-level data with which to perform secondary analysis, inclusion of drugs with different biological mechanisms, and inclusion of both randomized controlled trials (RCT) and longitudinal observational studies. A systematic review of published trials of acute gout was performed separately and is reported in a companion article4.
Each of the included studies had previously received ethical approval from an appropriate ethical review board, and provision of patient-level data to the authors was within the permission given by patients at informed consent.
Construct validity, or the extent to which the instrument was closely associated with similar concepts and not closely associated with dissimilar concepts, was assessed using Spearman correlation coefficients between each instrument measured at the baseline timepoint. Floor and ceiling effects were calculated as the percentage of participants scoring the minimum and maximum possible at baseline and final visit. Within-group discrimination was assessed within each study by pooling the change scores of each instrument and calculating the effect size (ES). Between-group discrimination was assessed by calculating the Kruskal-Wallis statistic for the difference between the final reported value of each measure across treatment arms. Within- and between-group change was also assessed using repeated measures generalized estimating equations with ordinal regression to maximize information available from multiple timepoints (for example, pain was measured at several timepoints).
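The descriptive statistics named above can be illustrated with a minimal sketch in standard-library Python. It is not the analysis code used for this report. The function names and the example scores are hypothetical, and the ES definition (mean change divided by the standard deviation of baseline scores) is an assumption, because the text does not state which standardizer was used.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def floor_ceiling(scores, lowest, highest):
    """Percentage of participants at the minimum and maximum possible score."""
    n = len(scores)
    return (100 * sum(s == lowest for s in scores) / n,
            100 * sum(s == highest for s in scores) / n)


def effect_size(baseline, final):
    """Assumed definition: mean improvement divided by SD of baseline scores."""
    changes = [b - f for b, f in zip(baseline, final)]
    return mean(changes) / stdev(baseline)


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        total += sum(ranks[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction  # undefined if every pooled value is identical


# Hypothetical 0-4 Likert pain scores for six participants.
pain_baseline = [4, 3, 4, 2, 3, 4]
pain_final = [1, 0, 2, 0, 1, 1]
print(floor_ceiling(pain_baseline, 0, 4), floor_ceiling(pain_final, 0, 4))
print(effect_size(pain_baseline, pain_final))
```

Working on ranks keeps both the correlation and the between-group test appropriate for ordinal Likert scores. The repeated measures generalized estimating equations are not sketched here because they need a dedicated statistics package.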
Test-retest reliability was calculated using patient global assessment (PGA) of response to identify a subset of participants who perceived no change. To identify a stable group in the etoricoxib clinical trial, we selected cases with the same patient perception of response at days 2 and 5 and at days 5 and 8, in 2 separate estimations of reliability. In the celecoxib clinical trial, we selected the low-dose celecoxib cases for the analysis over the first 12 h and cases with poor or fair response at Day 9 for the analysis over 9 days. The intraclass correlation (ICC) used a mixed-effects model for single measure absolute agreement in stable cases. The standard error of measurement (SEM) was calculated as the square root of the error variance from the analysis of variance table from which the ICC was calculated. Smallest detectable difference (SDD) was calculated as SEM × √2 × 1.96 (reference 5). The minimal clinically important difference (MCID) was calculated as the median value of change in each measure for the “fair” category of patient global response to treatment, where this was available6.
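The reliability quantities described above can be sketched as follows, again in standard-library Python and with hypothetical function names and data. Two assumptions are made: the ICC is taken to be the single-measure absolute-agreement form computed from two-way analysis of variance mean squares, and the "error variance" is taken to be the residual mean square from that table.

```python
from math import sqrt
from statistics import mean, median


def reliability(scores):
    """Return (ICC, SEM, SDD) for stable cases.

    scores: one row per participant, one column per measurement occasion.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Mean squares from the two-way ANOVA table (participants x occasions).
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Single-measure absolute-agreement ICC (assumed form).
    icc = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    sem = sqrt(ms_err)            # assumed: residual mean square as error variance
    sdd = sem * sqrt(2) * 1.96    # SDD = SEM x sqrt(2) x 1.96, as in the text
    return icc, sem, sdd


def mcid(changes, global_response, anchor="fair"):
    """Median change among participants whose global response equals the anchor."""
    return median(c for c, g in zip(changes, global_response) if g == anchor)


# Hypothetical scores for three stable participants on two occasions.
stable_scores = [[1, 2], [2, 1], [3, 3]]
print(reliability(stable_scores))

# Hypothetical change scores paired with global response categories.
print(mcid([1, 2, 3, 1, 2], ["fair", "good", "fair", "poor", "fair"]))
```

Because the absolute-agreement ICC penalizes systematic shifts between occasions, any real improvement in supposedly stable patients would lower the estimate.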
RESULTS
Feasibility (time to completion, cost, respondent burden) was not formally assessed in any study, but all instruments appear to be easy to complete, with little or no need for training and little or no cost.
Pain Measures
Three pain measures were used in different studies: 0–4 point Likert-like scale, 0–100 mm visual analog scale (VAS), and 0–10 numeric rating scale (NRS). Data for the NRS were derived from a single unpublished study, and therefore most discussion focused on the Likert and VAS scales, for which there were data from more than 1 RCT and more than 1 class of drugs (Table 2).
Likert-like scale
A 0–4 point Likert scale was used in most studies with categories of “none” (0), “mild,” “moderate,” “severe,” and “extreme” (4) pain. The Likert scale had good construct validity (Table 3): strong correlation with patient global (Spearman’s correlation coefficient, 0.72) and NRS pain score (0.55 and 0.73), moderate-strong correlation with disability (0.58 and 0.31) and moderate correlation with joint tenderness (0.34, 0.36, 0.13), but weaker correlation with joint swelling (0.18, 0.18, 0.19).
ES ranged from 1.20 to 2.84, demonstrating a large ES over time (Table 4). The Likert scale discriminated well between treatment groups, with minimal clinically important difference (MCID) ranging from a change of 1 to 2. Floor effects were appreciable at final visit and ceiling effects were appreciable at baseline (Table 5).
Pain visual analog scale (VAS)
A VAS pain scale 0 to 100 mm was used in 2 studies. The VAS pain scale had good construct validity: strong correlation with patient global (0.72 and 0.73 in 2 studies), and with disability (0.58 and 0.66), but weak correlation with joint swelling (0.19) or joint tenderness (0.13).
ES ranged from 1.58 to 4.46, demonstrating a large ES over time. The VAS pain scale discriminated well between treatment groups, as recently reported7, with an MCID of 19 on the 0–100 mm scale. Floor effects at final visit (14%) and ceiling effects at baseline (13%) were minimal.
Numeric rating scale
One study of rilonacept used both Likert scale and NRS. Based on this single study, NRS pain seemed to have face, content, and construct validity, and was sensitive to change (within and between group).
Joint Swelling
A 0–3 point Likert scale used in most studies was examined in this analysis, typical categories being “no swelling” (0), “palpable,” “visible,” and “bulging beyond the joint margins” (3) in the index joint, as assessed by a physician. The Likert scale had evidence for construct validity, with moderate correlation with patient global (0.47), with activity limitation as measured by the Health Assessment Questionnaire (HAQ; 0.25), and with joint tenderness (0.25, 0.37), and weak correlation with pain (0.14, 0.18). In treatment trials of canakinumab, the Likert scale showed between-group differences, as reported2, and within-group differences (Table 6). ES ranged from 2.3 to 2.9. In this analysis, the MCID for joint swelling corresponded to a change of 1 on the Likert scale. Floor effects at final visit (47 to 64%) and ceiling effects at baseline (27 to 56%) were substantial.
Joint tenderness
Joint tenderness was also measured using a 0–3 point Likert scale in most studies. An example is the 0–3 point Likert scale used in the Novartis studies: no pain (0), patient states that “there is pain” (1), patient states “there is pain and winces” (2), and patient states “there is pain, winces and withdraws” on palpation or passive movement of the affected study joint, as assessed by a physician (3). The joint tenderness Likert scale had strong correlation with patient global (0.56), and moderate correlation with joint swelling (0.25, 0.37, 0.46) and with pain (0.19, 0.34, 0.36; Table 3). The ES for the Likert scale ranged from 2.3 to 3.2, and the measure discriminated between treatment groups in 1 study that we analyzed, as well as in a recently published analysis of duplicate RCT for canakinumab2. The MCID for joint tenderness ranged from 1 to 2. We observed substantial floor effects at final visit (44 to 55%) and ceiling effects at baseline (39 to 58%).
Patient Global Assessment
The patient global measure used in most studies was a 0–4 point Likert scale of global assessment of response to therapy. For example, in the etoricoxib clinical trial, the global response to treatment was assessed with the question: “How would you rate the study medication you received for gout?” with these response options: Excellent = 0, Very good = 1, Good = 2, Fair = 3, Poor = 4. The only study that used a global assessment of current status was the Auckland University of Technology observational study that used a 100 mm VAS, asking participants to rate how well they were doing overall.
PGA is usually the external benchmark for all other outcome measures, including several described above. Therefore, it has face, content, and construct validity almost by definition. Typically, PGA relates to assessment of current disease status; however, all but 1 study provided data for PGA of response to treatment. Application of the OMERACT filter to a transition scale such as this is problematic. Reliability could not be determined, because we used the responses on this measure to define a stable subgroup. Within-group change was not meaningful for a measure that had no meaning at baseline. For the single study that used a conventional PGA, an ES of 1.46 suggested adequate within-group change sensitivity for that format.
In the only RCT that provided both treatment allocation and measured a global response to treatment (celecoxib study), we did not observe a between-group difference (Table 5).
Activity Limitation
Activity limitation data were available from 3 studies. Two studies used the HAQ-disability index or HAQ-II, and one study used a 0–10 NRS item from the Work Productivity and Activity Impairment: Specific Health Problem (WPAI:SHP) scale as a measure of activity limitations.
Health Assessment Questionnaire
HAQ scores showed strong correlation with patient global (0.50, 0.73), moderate correlation with joint swelling (0.31), moderate to strong correlation with pain (0.26, 0.33, 0.37, 0.66), and moderate correlation with joint tenderness (0.46). The ES was moderate to large, ranging from 1.04 to 1.72, suggesting adequate within-group discrimination. Unfortunately, in the only RCT that used the HAQ, treatment allocation data were not made available to us, so between-group discrimination could not be ascertained, and the data on change in HAQ were not reported in the recent publication from that study2. MCID for HAQ-DI was estimated at 0.5 in the 2 replicate clinical trials of canakinumab. There was a floor effect at followup visits (33 to 46%), but the ceiling effect was minimal (0 to 17%).
0–10 NRS from WPAI:SHP
This single item, used only in the Regeneron study, was expressed at the baseline visit as “During the past 7 days prior to your gout attack, how much did your gout attack affect your ability to do your regular daily activities, other than work at a job?” The response was given on a scale from 0 (“Gout attack had no effect on my daily activities”) to 10 (“Gout attack prevented me from doing my daily activities”). This was administered as one of several items from the WPAI:SHP. At the followup visit at Day 7, the question was reworded slightly as “During the past 7 days, how much did your gout attack affect your ability to do your regular daily activities other than work at a job?” This item showed moderate correlation with pain measures (0.31, 0.39) and floor effects at the Day 7 visit (33.2%). We observed a trend toward between-group discrimination for this single item measured at Day 7 (Table 5).
DISCUSSION
The measurement properties for instruments in the core domains for acute gout studies were examined in 4 RCT and 1 cohort study. Overall, there appears to be sufficient evidence for construct validity and discriminative ability for 3 measures of pain (Likert, NRS, VAS). Floor and ceiling effects for pain measures suggested that either the scale for measuring pain needs to be somewhat broader or that the patients with severe pain of acute gout respond very well to treatment and that entry criteria for a particular level of pain limited the range of possible values at baseline. There is some variation in the floor and ceiling effects for the different pain measures across all studies, which is not unexpected given the differences in instrument and study setting.
The correlation of pain with disability was high when disability was measured by HAQ but modest when measured by a single item in the Regeneron study. It is possible that the single-item instrument used to measure disability was inadequate. The correlation between pain and joint swelling was consistently weak. This is not especially surprising because the 2 concepts are quite different and the measurement of joint swelling by a 4-point scale may have insufficient variability to give strong correlation coefficients.
There appears to be sufficient evidence for a 0–3 point Likert scale to possess construct validity and discriminative ability for measuring joint swelling and joint tenderness. There was some evidence for construct validity and within-group discriminative ability for HAQ as a measure of activity limitations, but it has yet to be shown that any measure of activity limitations can discriminate between groups allocated to different treatment.
Demonstration of the psychometric properties of the PGA of response to treatment is difficult. Construct validity tends to be assumed and was not measured by any other global patient-reported outcome in the data examined to enable a sensible comparison. Test-retest reliability could not be assessed. We did not demonstrate between-group discriminative ability in the only dataset available to us in which this could be examined, but the canakinumab study has been reported recently as showing a between-group difference in global response to treatment with a proportional odds regression OR of 2.19 (95% CI 1.6 to 3.1) at 72 h and 1.97 (95% CI 1.4 to 2.8) at 7 days2. We did not have treatment allocation data for that dataset, so were unable to reproduce this analysis.
The assessment of reliability and the associated estimates of SDD should be considered cautiously because acute gout is a highly dynamic condition with rapid changes in clinical status. It is possible that even in patients who self-identified as showing no response to treatment, their condition had improved. Therefore, the calculated ICC values especially during the first few days of acute gout are likely to be underestimates.
At OMERACT 11, these analyses were presented to participants and were useful as a basis for discussion and final conclusions regarding measurement properties of instruments for acute gout studies. This is outlined in a companion paper.
Acknowledgment
We gratefully acknowledge Pfizer Inc. (through an Investigator Initiated Grant), Merck Sharp & Dohme Corp., Novartis, Regeneron and Auckland University of Technology for making available the datasets for this study. We also acknowledge Mike Frecklington from Auckland University of Technology for the data collection in that study.
Footnotes
Supported with resources and use of facilities at the Birmingham VA Medical Center, Alabama, USA (J.A.S. and D. Redden). N. Dalbeth has received consulting fees from Ardea Biosciences, Metabolex, Novartis, and Takeda. Her institution has received funding from Fonterra, and she is a named inventor on a patent related to milk products and gout. H.R. Schumacher has been a consultant for Regeneron, Novartis, Pfizer, Savient, Ardea, Metabolex, and BioCryst, and he has received a grant from Takeda. N.L. Edwards has received consultant fees from Novartis, Takeda Pharmaceutical, Savient Pharmaceutical, Ardea Biosciences, Regeneron Pharmaceuticals, Metabolex Pharmaceuticals, and BioCryst Pharmaceuticals. L.L. Simon has served on the board of directors for Savient Pharmaceuticals, and has consulted for Takeda. M.R. John is employed by Novartis and sometimes owns shares in the company. M.N. Essex is employed by Pfizer and owns shares in the company. D.J. Watson is an employee of and owns stock in Merck & Co. Inc., the marketing authorization holder for etoricoxib and sponsor of the etoricoxib clinical trials that contributed data for this work. R. Evans is employed by Regeneron and owns shares of stock. J.A. Singh has received research grants from Takeda and Savient and consultant fees from Savient, Takeda, Ardea, Regeneron, Allergan, URL Pharmaceuticals and Novartis. He is a member of the executive of OMERACT, an organization that develops outcome measures in rheumatology and receives arms-length funding from 36 companies; a member of the American College of Rheumatology’s Guidelines Subcommittee of the Quality of Care Committee; and a member of the Veterans Affairs Rheumatology Field Advisory Committee.