Abstract
Objective. The aims of this study were to assess the construct validity and the test-retest reliability of Patient Reported Outcomes Measurement Information System (PROMIS) computerized adaptive tests (CAT) in patients with systemic lupus erythematosus (SLE).
Methods. Adults with SLE completed the Medical Outcomes Study Short Form-36, LupusQoL-US version (“legacy instruments”), and 14 selected PROMIS CAT. Using Spearman correlations, PROMIS CAT were compared with similar domains measured with legacy instruments. CAT were also correlated with the Safety of Estrogens in Lupus Erythematosus National Assessment–Systemic Lupus Erythematosus Disease Activity Index (SELENA-SLEDAI) disease activity and the Systemic Lupus International Collaborating Clinics/American College of Rheumatology Damage Index (SDI) scores. Test-retest reliability was evaluated using ICC.
Results. There were 204 outpatients with SLE enrolled in the study and 162 completed a retest. PROMIS CAT showed good performance characteristics and moderate to strong correlations with similar domains in the 2 legacy instruments (r = −0.49 to 0.86, p < 0.001). However, correlations between PROMIS CAT and the SELENA-SLEDAI disease activity and SDI were generally weak and not statistically significant. PROMIS CAT test-retest ICC were good to excellent, ranging from 0.72 to 0.88.
Conclusion. To our knowledge, these data are the first to show that PROMIS CAT are valid and reliable for many SLE-relevant domains. Importantly, PROMIS scores did not correlate well with physician-derived measures. This disconnect between objective signs and symptoms and the subjective patient disease experience underscores the crucial need to integrate patient-reported outcomes into clinical care to ensure optimal disease management.
The accurate measurement of health-related quality of life (HRQOL), an important patient-reported outcome (PRO), is critical to providing patient-centered care. This is especially important in diseases such as systemic lupus erythematosus (SLE), in which dramatically lower mortality rates have refocused care on minimizing morbidity1. Physicians and patients have different perceptions of the effect of SLE. For example, patients focus on functional status whereas physicians focus on laboratory values2. Further, it is well known that SLE significantly decreases HRQOL3, but exactly how HRQOL should best be defined and measured is unclear.
The US Food and Drug Administration, the European Medical League, and the Outcome Measures in Rheumatology Clinical Trials (OMERACT) group have identified HRQOL as a crucial outcome measure for clinical trials and observational studies in SLE4,5,6. They recommend the use of both generic and disease-specific measures that would allow comparisons with healthy individuals while also ensuring the inclusion of domains that are meaningful to patients.
A number of generic and disease-specific instruments have been validated for the measurement of PRO in SLE, but all have significant limitations7,8. The Medical Outcomes Study Short Form-36 (SF-36)9 is a widely used generic measure in SLE, but has variable longitudinal responsiveness10,11,12 and lacks multiple domains of relevance to patients with SLE, such as fatigue, sleep, and cognition13,14,15. The LupusQoL, the most extensively validated SLE-specific instrument, includes several of these SLE-specific domains, such as fatigue, body image, and planning, but has significant floor and ceiling effects16. In addition, both measures can be challenging to administer and score at the point of care.
The Patient Reported Outcomes Measurement Information System (PROMIS) is a novel publicly available psychometrically validated system developed by the US National Institutes of Health to efficiently measure PRO in populations with a wide range of chronic diseases17. PROMIS instruments increase measurement precision and reduce responder burden relative to traditional instruments because they use item response theory and include computerized adaptive tests (CAT). CAT select the most informative questions from an item bank based on subjects’ previous responses, permitting the use of fewer questions per domain with more precision. PROMIS item banks are generic, scored as T scores normalized to the general population in the United States, and include numerous domains of relevance to patients with SLE that are not found in the SF-36, such as fatigue, sleep, and cognition.
The performance characteristics of PROMIS CAT have not yet been evaluated in SLE. Our study describes the validity and reliability of 14 PROMIS CAT compared with both the SF-36 and the LupusQoL in adult outpatients with SLE. We also evaluate the correlation of PROMIS CAT with physician assessments of disease activity and damage.
MATERIALS AND METHODS
English-speaking adults ≥ 18 years receiving care at the Hospital for Special Surgery (HSS) Lupus Center of Excellence and meeting ≥ 4 of the American College of Rheumatology (ACR) 1997 SLE criteria were eligible to participate in our prospective cohort study18. Patients on dialysis and those with active malignancy, other than nonmelanomatous skin cancer, were excluded.
Patients with SLE were identified by their treating rheumatologists and medical records were reviewed to confirm eligibility. Patients consented to participate in our study at the time of an outpatient visit. Patients could complete the Web-based surveys on-site during their visit by computer or tablet with the technical assistance of a study investigator. Alternatively, patients could complete the surveys remotely on a computer, tablet, or smart-phone by an e-mailed study-specific URL. Consenting subjects were registered in the Assessment Center (www.assessmentcenter.net), a free, secure online research management tool maintained at the Northwestern University Research Data Center.
Fourteen PROMIS CAT were selected for testing based on prior focus group studies in which patients with SLE identified quality of life domains of critical importance to them14,15,19. Administered CAT included physical function (version 1.2), mobility (v1.2), pain behavior (v1.0), pain interference (v1.1), ability to participate in social roles (v2.0), satisfaction with social roles and activities (v2.0), fatigue (v1.0), sleep disturbance (v1.0), sleep-related impairment (v1.0), applied cognition-abilities (v1.0), applied cognition-general concerns (v1.0), anger (v1.1), anxiety (v1.0), and depression (v1.0)20. PROMIS items ask about the 7 preceding days, with the exception of items in the physical and social health domains, which do not specify a recall time frame. CAT were programmed to administer enough items to achieve a standard error (precision estimate) of ≤ 0.3, with a minimum of 4 to a maximum of 12 items per CAT.
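The stopping rule described above can be sketched in a few lines. This is a minimal illustration and not the Assessment Center implementation: the function names and the example standard-error sequence are hypothetical, and only the 3 thresholds (standard error ≤ 0.3, minimum of 4 items, maximum of 12 items) come from the text.

```python
from typing import Sequence

SE_TARGET = 0.3   # precision target stated in the text
MIN_ITEMS = 4     # minimum items per CAT stated in the text
MAX_ITEMS = 12    # maximum items per CAT stated in the text


def should_stop(n_items: int, standard_error: float) -> bool:
    """Return True once the CAT may stop (hypothetical helper)."""
    if n_items < MIN_ITEMS:
        return False          # always administer the minimum number of items
    if n_items >= MAX_ITEMS:
        return True           # never exceed the maximum number of items
    return standard_error <= SE_TARGET


def items_administered(se_after_each_item: Sequence[float]) -> int:
    """Count items given for a sequence of standard errors, one per item."""
    for n_items, se in enumerate(se_after_each_item, start=1):
        if should_stop(n_items, se):
            return n_items
    return len(se_after_each_item)


# Made-up standard errors after items 1 to 5. The target is already met at
# item 3, but the minimum of 4 items still applies.
print(items_administered([0.60, 0.41, 0.29, 0.27, 0.25]))  # prints 4
```

Under these assumptions, a CAT that reaches the precision target quickly stops at the 4-item minimum, which is consistent with the short average test lengths reported in the Results.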
Patients completed 2 legacy PRO measures: the SF-36 standard, US version 1.0, a frequently used generic PRO instrument validated for use in SLE clinical trials, and the LupusQoL-US, an extensively validated SLE-specific PRO questionnaire adapted for use in the United States9,21. Both legacy instruments refer to a 4-week recall period.
All self-report questionnaires were administered through the Assessment Center and all participants completed both PROMIS CAT and legacy instruments. Half the participants were randomly assigned to complete PROMIS CAT first, and the other half completed legacy PRO instruments first.
To assess PROMIS CAT test-retest reliability, all participants were contacted by telephone or e-mail within 1 week of enrollment to complete PROMIS CAT a second time. A 7-point Likert scale anchor question was used to identify any changes in patients’ disease activity. Only patients reporting that the effect of SLE on their general health was “about the same” were included in the test-retest analysis because their PRO should not have changed.
PROMIS CAT were scored through the Assessment Center using a T score metric, in which the mean T score in the US general population is 50 with an SD of 10. Higher T scores reflect more of the trait being measured, so that higher scores for physical and social function are desirable, whereas higher symptom scores (e.g., fatigue, depression, anxiety) indicate a greater burden of symptoms. The SF-36 is divided into 8 scales, each with a score ranging from 0 to 100, with higher scores reflecting better HRQOL. Scores can also be reported as the physical component summary (PCS) and mental component summary (MCS), in which related scales are grouped and reported as a single score, normalized to the general US population with a score of 50 representing the population mean. The LupusQoL contains 34 questions in 8 domains, with scores ranging from 0 to 100, with higher scores indicating better HRQOL.
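As a worked illustration of the T score metric, the sketch below converts a T score into SD units relative to the US general population and accounts for scale direction. The helper names and the example scores are hypothetical; only the population mean of 50 and SD of 10 are taken from the text.

```python
POPULATION_MEAN = 50.0  # mean T score in the US general population
POPULATION_SD = 10.0    # SD of T scores in the US general population


def sd_from_population_mean(t_score: float) -> float:
    """Signed distance from the population mean, in SD units."""
    return (t_score - POPULATION_MEAN) / POPULATION_SD


def sd_worse_than_population(t_score: float, higher_is_better: bool) -> float:
    """Positive values mean worse than the general population.

    Function domains (e.g., physical function) are better when higher;
    symptom domains (e.g., fatigue) are worse when higher.
    """
    z = sd_from_population_mean(t_score)
    return -z if higher_is_better else z


# Hypothetical scores, chosen only for illustration.
print(round(sd_worse_than_population(44.0, higher_is_better=True), 2))   # physical function
print(round(sd_worse_than_population(56.0, higher_is_better=False), 2))  # fatigue
```

Both hypothetical examples correspond to 0.6 SD worse than the general population, even though one score lies below 50 and the other above it.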
Sociodemographic information including age, sex, race, and ethnicity was obtained by patient self-report. Disease activity and damage at the time of the study visit were assessed by the subject’s treating rheumatologist using a physician’s global assessment (PGA), the Safety of Estrogens in Lupus Erythematosus National Assessment-Systemic Lupus Erythematosus Disease Activity Index (SELENA-SLEDAI), and the Systemic Lupus International Collaborating Clinics/American College of Rheumatology Damage Index (SDI)22,23. The PGA ranges from 0 to 3, SELENA-SLEDAI scores range from 0 to 105, and SDI scores range from 0 to 46. Higher scores reflect greater disease activity and more end-organ damage.
Statistical analysis
Means and SD were calculated for continuous variables, and frequencies and percentages for categorical variables. Floor and ceiling effects for each instrument were analyzed by calculating the percentage of respondents achieving the minimum and maximum possible scores, respectively. Construct validity of PROMIS CAT was assessed through Spearman correlation coefficients (r) with legacy PRO instruments, with coefficients of at least 0.7 indicating good convergent validity24. Correlations between PROMIS CAT and disease activity and damage measures were also evaluated with Spearman r. Test-retest reliability was evaluated in participants completing the questionnaires twice within the 7-day time frame. Agreement between scores for each questionnaire was assessed with an intraclass correlation coefficient (ICC)25. ICC of at least 0.7 indicate acceptable test-retest reliability26. All statistical analyses were performed with SAS version 9.3.
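The Spearman correlation used for construct validity can be sketched as a Pearson correlation computed on ranks. The analyses in this study were run in SAS; the Python below is only an illustration using the standard library, with hypothetical function names. Treating the 0.7 threshold as applying to the absolute value of r is an assumption made here because some instruments are scored in opposite directions.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman r: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def good_convergent_validity(r: float, threshold: float = 0.7) -> bool:
    """Assumption: the threshold is applied to the magnitude of r."""
    return abs(r) >= threshold
```

For example, `spearman(promis_scores, sf36_scores)` would be called with two equal-length lists of paired scores, one pair per participant; both variable names are placeholders.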
The study was reviewed and approved by the HSS Institutional Review Board (IRB# 14125).
RESULTS
The study questionnaires were completed by 204 patients with SLE (Table 1), with 164 (80%) completing them remotely. One hundred sixty-two subjects (79%) completed the retest within 1 week. Subjects were predominantly women (93%) with a mean (SD) age of 40.0 (13.2) years. They were racially and ethnically diverse: 38% identified as white, 30% black, 13% Asian, and 28% Hispanic or Latino. The average (SD) SELENA-SLEDAI score was 4.2 (3.5), indicating mild disease activity, though 19.6% were flaring as per SELENA-SLEDAI at the time of assessment. The mean (SD) SDI was 1.2 (1.7), consistent with minimal end-organ damage.
PROMIS CAT and legacy instrument score distributions are shown in Table 2. The mean CAT scores across all PROMIS domains were worse than the general population by an average of 0.6 SD. Mean SF-36 PCS and MCS scores were 1.3 and 0.7 SD worse than the general population, respectively. PROMIS CAT were generally normally distributed, except for pain behavior and fatigue, which had slight positive skews. Similarly, SF-36 scale scores were relatively normally distributed except for the physical function, role physical, and role emotional scales, which were positively skewed. All domains in the LupusQoL were positively skewed. The SF-36 had large floor and ceiling effects in the role physical and role emotional scales (23%–49%), while the LupusQoL had notable ceiling effects across all domains (6%–32%). PROMIS CAT had smaller floor and ceiling effects, with fewer than 5% of patients scoring the lowest or highest possible score in most domains.
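Floor and ceiling effects, as defined in the statistical analysis, are the percentages of respondents at the minimum and maximum possible scores. A minimal sketch follows; the function name and the example scores are hypothetical and do not reproduce study data.

```python
from typing import Sequence, Tuple


def floor_ceiling_percent(
    scores: Sequence[float], min_possible: float, max_possible: float
) -> Tuple[float, float]:
    """Return (floor %, ceiling %) for one instrument scale."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == min_possible)
    at_ceiling = sum(1 for s in scores if s == max_possible)
    return 100.0 * at_floor / n, 100.0 * at_ceiling / n


# Made-up scores on a 0 to 100 scale: 1 of 4 at the floor, 2 of 4 at the ceiling.
floor_pct, ceiling_pct = floor_ceiling_percent([0, 50, 100, 100], 0, 100)
print(floor_pct, ceiling_pct)  # prints 25.0 50.0
```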
The number of items and time per instrument are shown in Table 3. On average, PROMIS CAT administered 4 items per domain and the median time per CAT was 32 s.
Correlations of PROMIS CAT with legacy instruments
Correlations between PROMIS CAT and legacy instruments are shown in Table 4. PROMIS physical function and mobility CAT correlated strongly with the physical function domains in the SF-36 and LupusQoL (r = 0.81–0.86), and with the SF-36 PCS (r = 0.75–0.81). Correlations between PROMIS pain interference and legacy instrument pain domains were also strong (r = −0.79). PROMIS fatigue correlated more strongly with the corresponding domain in the LupusQoL (r = −0.75) than with the SF-36 vitality scale (r = −0.67). Similarly, in the domain of mental health, PROMIS anger, anxiety, and depression CAT showed strong correlations with the LupusQoL emotional health domain (r = −0.69 to −0.75), and more moderate correlations with all of the SF-36 mental health–related scales (r = −0.49 to −0.76). PROMIS social function CAT correlated moderately to strongly with the corresponding domains in the SF-36 and LupusQoL (r = 0.55–0.75). All correlations were statistically significant with p < 0.001.
There were no analogous legacy instrument domains with which to compare the 4 PROMIS CAT evaluating cognition and sleep. However, these CAT showed strong correlations with fatigue. The correlations of fatigue with sleep-related impairment and with applied cognition-general concerns were both 0.68, while the correlation between sleep-related impairment and sleep disturbance was 0.62, and that between applied cognition-abilities and applied cognition-general concerns was −0.74 (p < 0.001 for all).
Correlations of PROMIS CAT with physician-derived measures
Correlations between PROMIS CAT and physician-derived measures of SLE disease activity and disease-related damage are shown in Table 5. Correlations were generally weak and nonsignificant, with the highest correlations observed between CAT in the domains of physical function and pain and the PGA and SDI (r = 0.27–0.37, p < 0.001).
Test-retest reliability
Of the 162 participants who completed PROMIS CAT a second time within 7 days (average 6.9 days), 90 reported no change in the effect of SLE on their health. Among these 90 subjects, ICC were > 0.7 across all domains (Table 6).
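The test-retest analysis can be sketched as 2 steps: keep only participants whose anchor response was “about the same,” then compute an ICC on their paired scores. The text does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), is an assumption here, as are the record keys and function names.

```python
from typing import Dict, List, Sequence, Tuple


def stable_pairs(records: Sequence[Dict]) -> List[Tuple[float, float]]:
    """Keep (test, retest) pairs from participants reporting no change.

    The keys "anchor", "test", and "retest" are hypothetical.
    """
    return [
        (r["test"], r["retest"])
        for r in records
        if r["anchor"] == "about the same"
    ]


def icc_2_1(pairs: Sequence[Tuple[float, float]]) -> float:
    """ICC(2,1) from two-way ANOVA mean squares (assumed ICC form)."""
    n, k = len(pairs), 2
    grand = sum(a + b for a, b in pairs) / (n * k)
    row_means = [(a + b) / k for a, b in pairs]
    col_means = [sum(p[j] for p in pairs) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for p in pairs for x in p)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Made-up T scores; the participant reporting a change is excluded.
records = [
    {"anchor": "about the same", "test": 40.0, "retest": 42.0},
    {"anchor": "about the same", "test": 50.0, "retest": 49.0},
    {"anchor": "somewhat worse", "test": 55.0, "retest": 63.0},
    {"anchor": "about the same", "test": 60.0, "retest": 61.0},
]
icc = icc_2_1(stable_pairs(records))
print(round(icc, 3), icc >= 0.7)  # 0.7 is the acceptability threshold from the Methods
```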
DISCUSSION
To our knowledge, our study is the first to demonstrate the validity and reliability of PROMIS CAT in outpatients with SLE. PROMIS CAT showed strong correlations with the SF-36 and LupusQoL across analogous domains, supporting the construct validity of the PROMIS measures. PROMIS CAT also showed high test-retest reliability in participants self-reporting no change in the effect of SLE on their health.
Although to our knowledge no prior studies have evaluated PROMIS CAT in adults with SLE, there has been some work evaluating PROMIS short forms (i.e., PROMIS questions administered as part of a standard questionnaire without using computerized adaptive testing) in patients with SLE. The PROMIS-29, a 29-question short form composed of items from 7 PROMIS item banks (physical function, fatigue, pain interference, anxiety, depression, sleep disturbance, and satisfaction with social roles), was administered to 333 patients with self-reported SLE recruited from patient advocacy organizations27. PROMIS-29 domain scores were associated with self-reported disease severity, but the study did not validate cases of SLE or compare the PROMIS-29 with established legacy instruments. Katz, et al evaluated the PROMIS-29 in 240 patients with rheumatologist-diagnosed SLE, demonstrating convergent validity with domains of the SF-36, but also noted significantly larger ceiling effects in 5 of the 7 PROMIS-29 domains compared with the SF-3628. In contrast, our study found similar convergent validity between PROMIS CAT and the SF-36, but significantly decreased floor and ceiling effects in PROMIS CAT, suggesting increased precision over both SF-36 and PROMIS-29 short forms.
Mahieu, et al evaluated the internal consistency of 7 PROMIS short forms (physical function, fatigue, pain interference, anxiety, depression, sleep disturbance, and sleep-related impairment) in 123 adults with SLE, finding strong internal consistency among the measures (Cronbach’s alpha 0.91–0.98)29. They also showed strong correlations between the PROMIS fatigue short form and the self-report Fatigue Severity Scale scores (Spearman r = 0.84). The authors found that physical activity, measured with an accelerometer, was positively associated with PROMIS physical function (r = 0.33) and negatively associated with pain interference (r = −0.29). While these results legitimize the use of PROMIS short forms in SLE, these pre-set groups of questions are much longer than CAT, which average only 4 items per domain. The ability of PROMIS CAT to decrease responder burden without compromising precision or reliability is a significant advantage over both legacy instruments and PROMIS short forms.
PROMIS instruments have also been evaluated in children with rheumatic disease, with pediatric item banks demonstrating construct validity in 228 children with juvenile idiopathic arthritis and pediatric short forms demonstrating construct validity and responsiveness in 100 children with SLE30,31. Of note, PROMIS instruments were developed with the goal of creating single metrics to measure domains across the lifespan32. This unique advantage of PROMIS over legacy instruments is particularly important in SLE, which can begin in childhood and continue through adulthood with waxing and waning course.
In our study evaluating PROMIS CAT in SLE, participants scored at least one-half SD worse than the general population across most PROMIS CAT, with the largest differences in the domains of physical function, mobility, pain interference, fatigue, sleep-related impairment, and applied cognition-concerns. These findings are concordant with those of Mahieu, et al, who reported that subjects with SLE scored one-half SD worse than the general population in physical function, pain interference, fatigue, sleep disturbance, and sleep-related impairment short forms, which are scored using the same T scale metric as CAT29. These findings also provide face validity because CAT scores should trend lower than average given the known lower HRQOL in patients with SLE33.
While prior studies have suggested that PROMIS short forms have good reliability and precision in patients with SLE, ours is the first study to compare the performance characteristics of PROMIS CAT with the SF-36 and LupusQoL, 2 legacy PRO instruments commonly used in clinical research. In our study, in contrast to legacy instruments, PROMIS CAT demonstrated a normal distribution across domains and had smaller floor and ceiling effects, with fewer than 5% of subjects scoring the lowest or highest possible score. Certain domains, notably pain interference, pain behavior, and depression, did exhibit clustering at the lowest observed score, suggesting that perhaps this score represents the “de facto” floor of the instrument. Similar minimum scores for these domains were observed in validations of the PROMIS in other rheumatic disease populations34,35. The significant floor and ceiling effects observed in the SF-36 and LupusQoL are consistent with score distributions reported in other studies and may contribute to the variable responsiveness of the measures in longitudinal studies10,12,36. PROMIS CAT are better able to discriminate among individuals at the extremes of the spectrum, and importantly, may be more sensitive to change over time within individuals because their score distribution is less skewed relative to legacy instruments and the PROMIS-29.
Importantly, PROMIS CAT correlated poorly with physician-derived measures of SLE disease activity and damage, supporting the principle that PRO measures identify unique information37,38. In SLE, where defining appropriate outcome measures for clinical trials remains challenging39, the patient perspective is particularly important and needs to be reliably measured. Currently, the SF-36 MCS and PCS are often used to benchmark HRQOL in study populations. Our findings suggest that the MCS and PCS correlate moderately to strongly with domains relevant to patients with SLE; correlations with PROMIS CAT related to physical function and mental health ranged from 0.62 to 0.81. PROMIS CAT offer an improved method of accurately and efficiently measuring patient-centered outcomes in SLE while at the same time allowing comparisons with the general population.
In addition to their well-established use in research, PRO measures have great potential to improve the clinical care of SLE. There is increasing recognition that measuring PRO may improve patient engagement and shared decision making40,41. PROMIS CAT are well suited for use at the point of care in SLE because of their favorable performance characteristics and decreased responder burden. Further studies are necessary to evaluate barriers to and facilitators of implementing PROMIS CAT in the clinical care of SLE, as well as the effect of regular PRO collection on patient engagement and outcomes.
Our study has many strengths, including its large and diverse cohort of subjects with SLE, all validated according to the ACR criteria. Patients had a range of disease activity, with almost one-fifth flaring. However, the majority of patients had mild disease, reflecting the reality of most outpatients with SLE. There was also a high rate of participation in the retest questionnaire, supporting the generalizability of our findings. The choice of which PROMIS CAT to administer was informed by literature review, ensuring inclusion of domains that had been identified as important by patients with SLE themselves, and a large number of CAT were administered.
Our study has certain limitations. To decrease responder burden, not all PROMIS domains were validated against gold standard instruments. For example, the sleep- and cognition-related CAT had no corresponding domains in the SF-36 and LupusQoL. Further studies should validate these CAT against relevant legacy instruments, including potentially the LupusPRO15, which includes both of these domains. Conversely, PROMIS lacks some domains that are present in the LupusQoL, including body image, planning, and intimate relationships, areas valued by patients with SLE. This points to a knowledge gap; these item banks need to be developed. In our study, PROMIS CAT were evaluated in outpatients; the validity among inpatients, who may have worse HRQOL and worse disease activity, may differ and will need to be analyzed. Importantly, our study is cross-sectional and future studies need to evaluate the longitudinal responsiveness of PROMIS CAT. Finally, our study was limited to English speakers. PROMIS item banks have been translated into many other languages (Spanish, German, Dutch, Chinese, etc.)42; additional studies are needed to validate PROMIS CAT in non-English–speaking patients with SLE.
Our study is the first, to our knowledge, to demonstrate the validity and reliability of PROMIS CAT in outpatients with SLE. PROMIS CAT are an efficient method of evaluating HRQOL in patients with SLE. They provide an accurate metric for measuring relevant patient domains, and future work should evaluate their performance in both clinical research and routine clinical care.
Acknowledgment
The authors thank Rima Abhyankar and Kelly McHugh for their assistance with patient recruitment.
Footnotes
Supported by the Rheumatology Research Foundation Scientist Development Award.
Accepted for publication March 14, 2017.