Abstract
Objective. To systematically identify the outcome measures and instruments used in clinical studies of polymyalgia rheumatica (PMR) and to evaluate evidence about their measurement properties.
Methods. Searches based on the MeSH term “polymyalgia rheumatica” were carried out in 5 databases. Two researchers were involved in screening, data extraction, and risk of bias assessment. Once outcomes and instruments used were identified and categorized, key instruments were selected for further review through a consensus process. Studies on measurement properties of these instruments were appraised against the COSMIN-OMERACT (COnsensus-based Standards for the selection of health Measurement Instruments–Outcome Measures in Rheumatology) checklist to determine the extent of evidence supporting their use in PMR.
Results. Forty-six studies were included. In decreasing order of frequency, the most common outcomes (and instruments) used were markers of systemic inflammation [erythrocyte sedimentation rate (ESR), C-reactive protein (CRP)], pain [visual analog scale (VAS)], stiffness (duration in minutes), and physical function (elevation of upper limbs). Instruments selected for further evaluation were ESR, CRP, pain VAS, morning stiffness duration, and the Health Assessment Questionnaire. Five studies evaluated measurement properties of these instruments, but none met all of the COSMIN-OMERACT checklist criteria.
Conclusion. Measurement of outcomes in studies of PMR lacks consistency. The critical patient-centered domain of physical function is poorly assessed. None of the candidate instruments considered for inclusion in the core outcome set had high-quality evidence, derived from populations with PMR, on their full range of measurement properties. Further studies are needed to determine whether these instruments are suitable for inclusion in a core outcome measurement set for PMR.
Polymyalgia rheumatica (PMR) is the most common inflammatory rheumatic condition of older people1 and is characterized by proximal pain and stiffness, raised inflammatory markers, and a therapeutic response to glucocorticoids2. A recent UK study using the Clinical Practice Research Datalink found an annual incidence of 96 per 100,000 people aged over 40 years, with incidence rising markedly with increasing age3.
Although it is common, PMR remains underresearched, and there are many unanswered questions about its management4. A core outcome measurement set of standardized instruments for use in clinical studies of PMR would make it easier to synthesize future research evidence.
In 2016, a core domain set (“what” to measure) was endorsed by the Outcome Measures in Rheumatology (OMERACT) group. This comprises pain, stiffness, physical function, and systemic inflammation5. We now need to establish “how” to best measure these domains. A previous systematic review6 found a wide range of instruments had been used but was limited in its search strategy and inclusion criteria, and did not assess the quality of the evidence found. Further, no review of the evidence for measurement properties of instruments in PMR has been carried out.
We therefore set out to systematically (1) identify all of the outcome measures and instruments previously used in clinical studies of PMR, and (2) evaluate the literature on the measurement properties of selected instruments to determine whether they sufficiently met the OMERACT Filter 2.1 requirements for discriminative ability7.
METHODS
Protocol and registration. The review protocol was registered in PROSPERO (www.crd.york.ac.uk/prospero; registration number CRD42017080058).
Eligibility criteria. Studies were eligible if they included patients with PMR and reported original quantitative data on outcomes of PMR. A range of study types, including randomized controlled trials (RCT), other interventional trials, prospective cohort studies, case control studies, and cross-sectional studies, were eligible for inclusion. Editorials, commentaries, review articles, case reports, and letters were excluded.
Studies evaluating measurement properties of an instrument in patients with PMR were included and tagged to identify them for the second part of the review process.
Studies that considered patients with PMR and giant cell arteritis (GCA) as a single group (i.e., PMR-specific data not available), diagnostic studies, and studies that solely reported outcomes not pertaining directly to PMR (e.g., cardiovascular events in patients with PMR) were excluded.
Information sources. Five databases [MEDLINE (OVID), CINAHL (EBSCO), Embase (HDAS), Web of Science, and the Cochrane Library] were searched from inception until September 30, 2017.
Clinical trial registries (ClinicalTrials.gov, ISRCTN, and the EU Clinical Trials Register) were reviewed to track any unpublished studies. Experts in the field were contacted to see if they were aware of any ongoing studies of relevance.
Searches. The search strategy (Table 1) was developed by the lead author (HT) with advice from a specialist health librarian. It was based on the MeSH term “polymyalgia rheumatica” and adapted for each database.
Study selection. Identified studies were imported into EndNote X8 (endnote.com) and duplicates removed. HT screened these titles and uploaded eligible studies to Covidence (www.covidence.org). HT screened all abstracts and full texts against the inclusion and exclusion criteria, and each was independently screened by 1 other review author (CO, SM, CDM, CM, or CH). Disagreements were resolved by discussion and, if needed, by consensus with a third reviewer (SH).
Data collection. Data from all included studies were extracted by HT. A second review author (CO, SM, CDM, CM, CH, or SH) checked the extracted data for each. Extracted data comprised lead author, journal, and year of publication; study design; setting; criteria used to define PMR; sample size; participant age and sex distribution; type of intervention; duration of follow-up; outcomes measured; instruments used; and key findings.
Data extraction for the review of measurement properties was carried out independently by HT and CO. The additional information extracted for studies of measurement properties comprised measurement properties evaluated, methods used, and findings in relation to the measurement properties.
Risk of bias. To inform judgment of overall study quality, risk of bias was assessed using criteria from 3 domains of the Quality In Prognosis Studies (QUIPS) tool8: domains 1 (study participation), 2 (study attrition), and 4 (outcome measurement). The other 3 domains of the QUIPS tool were not applied, as they were not relevant to all study types in the review. Additional relevant criteria from the Cochrane Risk of Bias tool9 were applied to included RCT (adequacy of the randomization and blinding process, and whether the groups were treated equally throughout).
Risk of bias assessment was carried out at the same time as data extraction. Studies were categorized as high, moderate, or low risk for each domain. HT carried out this process with review by a second team member (CO, SM, CDM, CM, CH, or SH). Any disagreements were discussed, and consensus was reached.
The assessment of risk of bias for each study was used in critical judgment of the weight given to the study when deciding which outcome measures to take forward for evaluation of their measurement properties.
Strengths and limitations of studies of measurement properties were evaluated independently by HT and CO. Studies were assessed against the COSMIN-OMERACT Good Methods checklist (Table 2) and given a rating to signify whether they should be used as evidence for each measurement property evaluated (red = no, do not use this as evidence; amber = some cautions but this will be used as evidence; green = yes, likely low risk of bias). Results of this assessment were discussed with the wider review team and used to inform overall judgment on whether there was sufficient evidence to support the use of the instrument in PMR.
Planned methods of analysis. Outcomes and instruments were categorized according to the core domain set agreed upon in 2016 by the OMERACT PMR Working Group5. Instruments measuring domains that were not in the core set were also collated to establish other constructs assessed in studies of PMR to inform the future research agenda. A narrative review of the results was carried out.
The findings and quality assessment of all studies on individual measurement properties of each selected instrument were tabulated. This information was synthesized into an overall rating of the body of evidence for each measurement property of each instrument in PMR.
RESULTS
Study selection. Forty-six studies were selected for inclusion in the review (Figure 1). No additional studies meeting the eligibility criteria were identified from reference lists or through contacting experts in PMR. Eight ongoing or unpublished studies were identified from clinical trial registries.
Study characteristics. The 46 included studies were carried out between 1995 and 2017. Forty were carried out in Europe, 5 in North America, and 1 in Japan. Only 1 study recruited exclusively from primary care10.
Study types. The most frequent study type was prospective cohort study (n = 23), followed by RCT (n = 10). There were 5 pilot efficacy/safety studies, 3 nonrandomized, noncontrolled intervention studies, 3 case series, and 2 case-control studies.
Numbers of participants and follow-up. The sample size of individual studies ranged from 4 (reference 11) to 652 (reference 10). Aside from the study by Cawley, et al10, all studies had < 150 participants. In longitudinal studies, follow-up duration ranged from 4 weeks to 4 years.
Age and sex of participants. Mean age ranged from 62 to 78 years, and most studies (n = 42) had more female than male participants.
Criteria used for diagnosis. A range of classification criteria were used to identify participants with PMR. The most commonly used were the Healey12 and Chuang13 criteria (9 and 8 studies, respectively). Five studies used the 2012 American College of Rheumatology/European League Against Rheumatism criteria14, 6 used Bird criteria15, and 6 used Jones and Hazleman criteria16. Twelve studies used clinician diagnosis or a specified combination of clinical features.
Risk of bias within studies. Thirteen of 46 studies were judged to have low risk of bias using the study participation domain as a marker of overall risk of bias. Twenty-five were judged to have a moderate risk of bias, and 8 were judged to have a high risk of bias. The most common reasons for high risk of bias rating were inadequate information about the recruitment process/response rate and small sample size for the study design.
Studies judged to be at low risk of bias did not measure noticeably different outcomes from studies where risk of bias was higher, and therefore the rating did not significantly influence the decision on which outcome measures to evaluate further.
Outcomes measured. A summary of outcomes measured by domain is given in Table 3. Eighteen of 46 studies measured an outcome representing each of the core OMERACT domains, of which only 2 were RCT17,18.
Laboratory markers of inflammation. Laboratory markers of inflammation were reported in 43 of 46 studies. Most studies measured both erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP; n = 32). The 5 measuring only ESR were from before the year 2000, whereas the 5 measuring only CRP were published after.
Pain. Thirty-two of 46 studies assessed pain. The most common instrument used (n = 29) was a pain severity visual analog scale (VAS), but the anchor question was rarely stated.
Stiffness. Twenty-eight of 46 studies included an assessment of stiffness. In 26 studies, duration of morning stiffness in minutes was recorded. Four studies additionally assessed stiffness severity using either a VAS or numeric rating scale (NRS).
Physical function. Twenty-two of 46 studies assessed physical function, with 8 of these using > 1 measure of function. In 13 studies, the functional assessment was “elevation of the upper limbs” on a 0–3 scale, measured as part of the composite Polymyalgia Rheumatica Activity Score (PMR-AS)19, which is defined as follows:
$$\text{PMR-AS} = \text{CRP} + \text{MST} \times 0.1 + \text{VAS}_{\text{pain}} + \text{VAS}_{\text{physician}} + \text{EUL}_{0\text{-}3}$$
where CRP is measured in mg/dL, MST is morning stiffness duration in minutes, each VAS has a possible range of 0–10, and EUL is elevation of the upper limbs (possible range 0–3).
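The composite score defined above can be computed directly from its five components. The following is a minimal sketch in Python; the function name `pmr_as`, the parameter names, and the range checks are illustrative assumptions and are not taken from the cited PMR-AS publication.

```python
def pmr_as(crp_mg_dl, morning_stiffness_min, vas_pain, vas_physician, eul):
    """Return the PMR-AS as the sum of its five components.

    crp_mg_dl             -- C-reactive protein in mg/dL
    morning_stiffness_min -- duration of morning stiffness in minutes
    vas_pain              -- patient pain VAS, 0 to 10
    vas_physician         -- physician global VAS, 0 to 10
    eul                   -- elevation of the upper limbs, 0 to 3
    """
    # Range checks are an assumption added for illustration only.
    if crp_mg_dl < 0 or morning_stiffness_min < 0:
        raise ValueError("CRP and morning stiffness must be non-negative")
    if not (0 <= vas_pain <= 10 and 0 <= vas_physician <= 10):
        raise ValueError("VAS values must lie between 0 and 10")
    if not 0 <= eul <= 3:
        raise ValueError("EUL must lie between 0 and 3")

    # Morning stiffness is weighted by 0.1; all other terms enter unweighted.
    return (crp_mg_dl
            + 0.1 * morning_stiffness_min
            + vas_pain
            + vas_physician
            + eul)


# Hypothetical patient: CRP 2.0 mg/dL, 60 minutes of morning stiffness,
# pain VAS 5, physician VAS 4, EUL 1.
score = pmr_as(2.0, 60, 5, 4, 1)
```

Because elevation of the upper limbs contributes at most 3 points to the total, the function makes it easy to see how small a share of the composite score reflects physical function.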
Twelve studies used the Health Assessment Questionnaire (HAQ)20 in some form, either the Health Assessment Questionnaire–Disability Index (HAQ-DI; n = 9) or the modified HAQ (mHAQ; n = 3).
Disease activity/global assessment. Thirteen of 46 studies recorded PMR-AS19. Six studies that did not use the PMR-AS included a physician global assessment VAS. Nine studies included some form of patient global assessment. The wording of the questions and the scales for the global VAS varied between studies.
Imaging. Nine of 46 studies included a form of imaging in their outcome set. In 5 of these, assessment of the utility of the imaging technique in PMR was part of the study’s aims.
Ongoing or unpublished studies. Five of the ongoing or unpublished studies specified their outcomes. While there were no new outcomes used among these, 3/5 measured fatigue, and 2/5 measured stiffness severity as well as duration of morning stiffness, possibly suggesting a trend toward these factors being attributed greater importance.
Evaluation of measurement properties. The OMERACT PMR Special Interest Group, comprising clinicians, researchers, and patient partners, met in 2018 to determine whether instruments mapping to the core domains had satisfied tests for domain match and feasibility, and if they should continue through the remaining steps of the OMERACT Filter 2.1. This process has been described in detail in a previous publication21. Results from the first part of the review informed this discussion, and the following instruments were selected for further evaluation: laboratory markers of inflammation (CRP and ESR), pain (VAS and NRS), stiffness (VAS, NRS, and duration of morning stiffness), and function (mHAQ and HAQ-DI).
Through the search strategy described, 5 studies were identified that evaluated measurement properties of these instruments. Results of the appraisal of these studies are summarized in Table 4. Table 5 presents an overview of the quality of evidence that exists for each instrument.
The standardized OMERACT Summary of Measurement Properties tables were also completed for each instrument; the example for pain VAS is provided as Supplementary Material (available from the authors on request).
Pain VAS. No studies explicitly aimed to assess construct validity, but the reporting of the change in pain VAS in response to treatment, and the correlation between pain VAS and other instruments demonstrated in Leeb22 and Matteson23, can be taken as some evidence supporting the validity of this measure in assessing PMR-related pain. However, neither study set out hypotheses about the expected relationship with other outcomes, and the comparator measures used were either not themselves validated in PMR or they measured a different construct altogether. Both were rated red against the Good Methods checklist.
Responsiveness of the pain VAS was evaluated in 2 studies24,25. Neither study stated hypotheses about the anticipated change in response to treatment or the magnitude of the anticipated effect size a priori, and again, both were rated red for this measurement property.
Test-retest reliability of a pain VAS was evaluated by Matteson, et al23. The methods were appropriate and the result suggests good reliability, but the small sample size (n = 14) meant that this study was rated amber.
The percentage minimal detectable change (%MDC) for pain VAS was calculated in the same small subgroup in this study (n = 14)23. This was the only study looking at any thresholds of meaning for a pain VAS in PMR. The authors did not evaluate what a minimally important change might be for patients, and the study was rated red for this measurement property as well.
Duration of morning stiffness. The 4 studies that evaluated measurement properties of pain VAS all also evaluated duration of morning stiffness22,23,24,25. The limitations to the methods discussed above also applied for this outcome measure, and test-retest reliability was poorer. All were rated red for all measurement properties.
HAQ-DI. Kalke, et al24 evaluated the construct validity and responsiveness of the HAQ as an assessment of function in PMR, but significant limitations meant it was rated red for both measurement properties.
Construct validity was evaluated by studying correlation of the HAQ with duration of morning stiffness, pain VAS, and CRP, none of which are measures of function. The correlation was good (> 0.6 in each case), but no hypotheses about the magnitude of change or strength of correlation were stated. Responsiveness was evaluated using the standardized response mean (SRM). The SRM was higher for the HAQ than for the other measures in this study, suggesting greater responsiveness to change, but no a priori hypotheses were stated.
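For readers unfamiliar with the statistic, the SRM is conventionally calculated as the mean of the individual change scores divided by the standard deviation of those change scores. The sketch below illustrates that conventional definition in Python; it is not the analysis code of the cited study, and the function name and example values are hypothetical.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Return mean(change) / SD(change) for paired scores."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow_up must be paired")
    if len(baseline) < 2:
        raise ValueError("at least two participants are required")

    # Change score for each participant (follow-up minus baseline).
    changes = [after - before for before, after in zip(baseline, follow_up)]

    sd_change = stdev(changes)  # sample standard deviation
    if sd_change == 0:
        raise ValueError("SRM is undefined when all changes are identical")
    return mean(changes) / sd_change


# Hypothetical HAQ scores before and after treatment (not study data).
haq_before = [1.5, 2.0, 1.0, 1.75]
haq_after = [0.5, 1.5, 0.25, 0.75]
srm = standardized_response_mean(haq_before, haq_after)
```

With this sign convention, an instrument on which lower scores mean better function gives a negative SRM when patients improve, so SRMs are usually compared across instruments by absolute magnitude.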
mHAQ. Two studies evaluated the mHAQ, covering the full range of measurement properties between them23,25, but they were rated red for all measurement properties except test-retest reliability.
Both studies provide some evidence toward the construct validity of the mHAQ through demonstrating its improvement in response to treatment23,25. McCarthy, et al also demonstrated correlation of the mHAQ with other outcome measures25, but the comparator measures were not measures of function.
Responsiveness of the mHAQ was evaluated by McCarthy, et al using appropriate statistical methods, but no hypothesis about the magnitude of change was given25.
Test-retest reliability of the mHAQ was evaluated by Matteson, et al23. The intraclass correlation coefficient (ICC) was 0.72, but the small sample size prevented the study from being rated green27. The %MDC was calculated in the same study, but there was limited information on the methods and no attempt to determine a minimally important difference to patients.
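Because the methods used to derive the %MDC were not fully reported, the sketch below shows one commonly used approach only, and it should not be read as the calculation performed in the cited study. It assumes the standard error of measurement is SEM = SD × √(1 − ICC), that the minimal detectable change at 95% confidence is 1.96 × √2 × SEM, and that %MDC expresses this as a percentage of the mean score. The function names and the input values other than the ICC of 0.72 are hypothetical.

```python
import math


def minimal_detectable_change(sd, icc, z=1.96):
    """Return the MDC from a test-retest ICC and the score SD."""
    if not 0 <= icc <= 1:
        raise ValueError("ICC must lie between 0 and 1")
    # Standard error of measurement derived from the reliability estimate.
    sem = sd * math.sqrt(1 - icc)
    # sqrt(2) accounts for measurement error on both test occasions.
    return z * math.sqrt(2) * sem


def percent_mdc(sd, icc, mean_score, z=1.96):
    """Return the MDC expressed as a percentage of the mean score."""
    if mean_score <= 0:
        raise ValueError("mean_score must be positive")
    return 100 * minimal_detectable_change(sd, icc, z) / mean_score


# ICC of 0.72 as reported for the mHAQ; the SD and mean are hypothetical.
mdc = minimal_detectable_change(sd=0.5, icc=0.72)
pct = percent_mdc(sd=0.5, icc=0.72, mean_score=1.0)
```

The MDC describes the smallest change that exceeds measurement error. It does not indicate whether that change matters to patients, which is why a separate estimate of the minimally important difference is needed.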
ESR/CRP. Construct validity was supported by 3 studies22,23,25, which all confirmed that ESR and CRP improved with treatment of PMR. McCarthy, et al found moderate correlation between ESR/CRP and the mHAQ25, but these instruments do not measure the same construct. None of the studies set out hypotheses about expected relationships, and all 3 studies were rated red.
Responsiveness was evaluated in 2 studies24,25, but neither set out hypotheses about magnitude of change a priori. One study28 addressed thresholds of meaning for ESR and CRP, and was rated amber. This study found that CRP was superior to ESR in detecting active disease and disease remission.
DISCUSSION
We identified all the outcome measures and instruments used to date in studies of PMR and categorized them using the PMR Core Domain Set endorsed by OMERACT in 2016. Results from the first part of the review informed the decision on which instruments to evaluate as candidates for inclusion in a core instrument set. Only 5 studies evaluating measurement properties of candidate instruments in populations with PMR were identified. Crucially, none of the studies were rated green for any of the measurement properties when assessed against the COSMIN-OMERACT Good Methods criteria. For pain VAS and the mHAQ, there was 1 study of test-retest reliability, which achieved amber, and there was 1 study considering thresholds of meaning for ESR/CRP, which was also rated amber.
The majority of PMR studies included in this review were cohort studies, with only 10 RCT. Almost all had sample sizes of fewer than 150 participants. We found that outcome measures used in studies of PMR varied widely and were often poorly defined. This makes comparing results across studies very difficult and prevents synthesis of current data to improve the evidence base.
Systemic inflammation was the most frequently assessed of the 4 PMR core domains, followed by pain and stiffness. Physical function was measured least often. This contrasts with findings from qualitative studies, in which patients with PMR have highlighted disability and stiffness as having a significant effect on their quality of life29,30.
Pain was the most commonly assessed patient-reported outcome, with VAS being the most frequently used measurement instrument. However, as noted in previous reviews6,31, there is little consistency in the questions and scales used or in the time frame being considered. Each measurement property of pain VAS has been evaluated in PMR, but there is only sufficient evidence on its test-retest reliability.
Stiffness was measured in 28/46 studies in this review. Given that it is a cardinal symptom of PMR, this is notably low. No studies evaluated a stiffness severity VAS despite the widely acknowledged limitations of “duration of morning stiffness” as an outcome measure30,32,33. We did not find sufficient evidence for any measurement property of duration of morning stiffness to support its use in PMR.
Physical function was assessed in the least consistent way of the core domains. Most frequently, it was measured as part of the PMR-AS, an overall assessment of disease activity that includes evaluation of “elevation of the upper limbs” on a 0–3 scale. This is a very limited assessment of overall function and is insufficient to represent this domain29,30. Therefore, the measurement properties of mHAQ and HAQ-DI were reviewed. We found that neither mHAQ nor HAQ-DI had high-quality evidence to support their use as an outcome measure in PMR. Since physical function is of prime importance to people’s daily lives, the failure to measure it in a meaningful, reliable way that allows comparison across studies of PMR needs addressing.
Where inflammatory markers are used in studies of PMR, ESR and CRP are usually both measured. In studies that chose one over the other, more recent studies tended to use CRP. ESR and CRP are used to evaluate many rheumatological conditions and are frequently incorporated into disease activity scores. Certain properties of biomarkers, such as face validity and feasibility, are likely to be transferrable across conditions. However, properties such as responsiveness and test-retest reliability may vary between conditions, and the limited evaluation in patients with PMR is therefore of note. Indeed, up to 20% of people with PMR may have normal ESR or CRP before treatment; the relationship between these biomarkers and PMR disease activity is not straightforward34.
A small number of studies measured domains that were outside of the core set but included in the “important” or “research agenda” list by the OMERACT 2016 group35. These include fatigue, psychological effect, and overall health status. Although these constructs are heavily intertwined, with each other and with pain, stiffness, and function, this may signify a gap in the core domain set. An overall measure of PMR-related quality of life could be of value in addressing this gap.
The exclusion of papers considering PMR and GCA as a single group is a potential source of bias. However, the risk of bias from including participants with GCA is high and outweighs the small risk of having missed any outcome measure of relevance. One exception to this rule was made in 2 papers (arising from 1 study) by McCarthy, et al, in which 1 participant out of 60 had biopsy-proven GCA as well as PMR25,28. This decision was made by the team because there were so few studies on measurement properties of instruments in PMR that these 2 papers contributed substantially to the available data, and it was felt that there was minimal risk of bias from 1 participant having a dual diagnosis.
Risk-of-bias assessment of included studies added value in this review, as it had not been done previously, to our knowledge. This is a subjective process but was carried out using an established tool and verified by a second assessor. That only 13 of the 46 studies demonstrated low risk of bias shows the limitations of the evidence base in PMR and has implications for the ability to draw firm conclusions from this review. This highlights the need to identify high-quality, well-documented datasets from modern clinical studies of PMR for further evaluation of instrument properties, as well as the need for a core outcome measurement set incorporating the best-performing instruments in order to standardize secondary outcomes across future trials.
Measurement of outcomes in studies of PMR lacks consistency. The critical patient-centered domain of physical function is the least frequently measured of the OMERACT core domains and, when it is measured, is often assessed only by ability to elevate the upper limbs. Overall, none of the candidate instruments considered for inclusion in the core outcome set had high-quality evidence, from studies in populations with PMR, on their full range of measurement properties. This is in part because there are very few published instrument validation studies. We are planning further studies reexamining individual patient data to determine whether the selected instruments are suitable for a core outcome measurement set for PMR.
ACKNOWLEDGMENT
We would like to thank the wider OMERACT PMR Working Group for their contribution to this study.
Footnotes
This work was supported by a Wellcome Trust PhD Programme for Primary Care Clinicians [203921/Z/16/Z], which supports Helen Twohig. CDM is funded by the National Institute for Health Research (NIHR) Applied Research Collaboration West Midlands, the NIHR School for Primary Care Research and a NIHR Research Professorship in General Practice (NIHR-RP-2014-04-026). The views expressed in this paper are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care.
S.L. Mackie declares consultancy to Roche, Chugai, and Sanofi on behalf of University of Leeds (no money paid to her directly in last 3 years); Patron of PMRGCAuk; current or recent site investigator on clinical trials for GSK and Sanofi; and EULAR2019 attendance supported by Roche. S. Muller is a trustee of PMRGCAuk.
Full Release Article. For details see Reprints and Permissions at jrheum.org
- Accepted for publication July 13, 2020.
- Copyright © 2021 by the Journal of Rheumatology