Abstract
Objective. To systematically review the measurement properties of outcome instruments used in large-vessel vasculitis (LVV).
Methods. MEDLINE, Embase, Cochrane, and Scopus databases were searched for studies published from inception to July 14, 2020, that addressed measurement properties of instruments used in giant cell arteritis (GCA) and Takayasu arteritis (TA). The measurement properties of the instruments identified were collected following the Outcome Measures in Rheumatology (OMERACT) and Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) frameworks. Instruments were grouped according to the following domains measured: disease activity/damage, organ function, and health-related quality of life (HRQOL)/health status.
Results. From 3534 articles identified, 13 met the predefined criteria. These studies addressed 12 instruments: 4 specific to TA, 2 designed for all types of systemic vasculitis, and 6 non–disease-specific instruments. No instruments specific to GCA were identified. Regarding TA, the Indian Takayasu Clinical Activity Score (ITAS) showed very good consistency, adequate reliability, but doubtful validity for disease activity. The Disease Extent Index-Takayasu (DEI-Tak) showed adequate construct validity but doubtful discriminating validity for disease activity/damage. Instruments, including the Vasculitis Damage Index and the Birmingham Vasculitis Activity Score, were poorly assessed for disease activity/damage. In total, 6 non–vasculitis-specific patient-reported outcome (PRO) instruments showed inadequate validity in GCA/TA.
Conclusion. The measurement properties of 12 outcome instruments for LVV covering the OMERACT domains of disease activity/damage, organ function, and HRQOL were assessed. The ITAS and the DEI-Tak were the instruments with the most adequate measurement properties for disease activity/damage in TA. Disease activity/damage instruments specific to GCA, as well as validated PROs for both GCA and TA, are lacking.
- giant cell arteritis
- large-vessel vasculitis
- measurement properties
- OMERACT
- outcome measures
- Takayasu arteritis
Giant cell arteritis (GCA) and Takayasu arteritis (TA) are 2 major subtypes of chronic, progressive large-vessel vasculitis (LVV) of unknown etiology. In LVV, disease flares are common and disease burden is high.1-5 Standardized measurement of LVV outcomes is of the utmost importance for understanding the course of the disease and for measuring the efficacy of treatment in clinical trials. Not all outcome measurement instruments currently used in LVV research are fully validated, and the study of their measurement performance, as well as the development of new instruments as needed, should be a priority.6 In response to this need, in 2016, the Outcome Measures in Rheumatology (OMERACT) Vasculitis Working Group, through an international Delphi exercise, developed a preliminary core set of domains for LVV (ie, GCA and TA).7 The group highlighted the importance of having a common set of domains and outcome measurement instruments for GCA and TA, supplemented with disease-specific elements. A draft core set of domains for LVV was proposed, which included pain, fatigue, mortality, organ involvement, arterial function, and biomarkers. Future steps have been identified, including the formulation of core contextual factors and the formulation of core adverse events. The ultimate goal of the OMERACT Vasculitis Working Group LVV Task Force is to develop an OMERACT-endorsed core set of outcome measures for LVV for use in clinical trials. Over the past several years, more data about the use of outcome measure instruments in LVV have accumulated, and a heterogeneous set of instruments has been used in trials of LVV (ie, instruments assessing different domains and with different performances). A first definition of disease activity in TA was proposed by Kerr et al8 based on the presence of constitutional symptoms, new bruits, acute-phase reactants, and angiographic features.
More recently, the European Alliance of Associations for Rheumatology (EULAR) proposed new consensus definitions for disease activity in LVV. These included the presence of typical signs or symptoms of active LVV, activity on imaging or biopsy, ischemic complications attributed to LVV, and elevated inflammatory markers.4 On the other hand, there is no consensus definition of disease damage in LVV. Indeed, according to the report from OMERACT 2018, the disease states in LVV are not well-defined, and the complexity of the disease makes it difficult to differentiate “activity” from “damage.”9 Nevertheless, it is accepted that damage consists mainly of the presence of irreversible lesions (stenotic or aneurysmal) that have occurred since the onset of the disease.10
OMERACT uses a staged process to establish core sets by first establishing the key domains of illness and then identifying validated instruments to assess the domains. Systematic reviews of clinical trials help catalog the outcome measures used and the domains targeted. Groups then seek agreement on the core domains, review the measurement properties of instruments measuring each domain, and hold a final vote on the core set of domains and matched instruments.11 This process aligns with the principles of the Core Outcome Measures in Effectiveness Trials (COMET) initiative12; in addition, the process uses the OMERACT filter to critically appraise the instruments identified13 and uses a reduced version of the COSMIN (Consensus-Based Standards for the Selection of Health Measurement Instruments) checklist.14
The EULAR Outcome Measures Library (OML) is an international collaborative initiative that acts as an open access repository of outcome measures in rheumatology.15 One approach to populate the EULAR-OML is to conduct systematic reviews of existing instruments for any given disease or domain and to appraise their measurement properties. The OML uses the COSMIN checklist to appraise the instruments.
Based on the interest of the vasculitis community to better understand measurement properties of outcome instruments used to measure core domains of vasculitis, this systematic review was designed—in collaboration with the OMERACT Vasculitis Working Group and the EULAR-OML—to evaluate the measurement properties of all available outcome instruments used in LVV.
METHODS
A protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) prior to the initiation of this systematic review (PROSPERO No. CRD42020181949). This review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.16
Search strategy, eligibility criteria, and selection process. The clinical research question was formulated according to the Population, Instrument of interest, Measurement properties (PIM) framework of OMERACT.17 The population included patients with GCA or TA without any age restrictions; the instruments of interest were those measuring disease progression, disease exacerbation, and disease severity indices; those measuring treatment outcomes, physician global assessment, and patient global assessment; and any instrument measuring any of the following domains: disease activity, disease damage, organ function, and health-related quality of life (HRQOL)/health status. The measurement properties of interest were validity, interobserver reliability, intraobserver reliability, sensitivity to change, and feasibility. Studies that analyzed miscellaneous types of vasculitis without providing separate information on GCA or TA were excluded. A comprehensive systematic literature search based on the PIM framework was undertaken from inception of each of the following databases to July 14, 2020: Ovid MEDLINE, Ovid Embase, Ovid Cochrane Central Register of Controlled Trials, Ovid Cochrane Database of Systematic Reviews, and Scopus.
With the supervision of an expert librarian (LCH) and with input from the study’s principal investigators, the search strategies for the different databases were generated (see Supplementary Tables S1 and S2 for full details of the search strategy, available from the authors upon request).
Titles and abstracts of the retrieved studies were screened by 2 independent reviewers (GB and AB). The full-text articles were retrieved where abstracts were felt to be relevant. Any duplicate articles were excluded. Reference lists of relevant articles were screened to ensure that no relevant publications were overlooked.
Data extraction. Two reviewers (GB and AB) collected the data independently in predesigned and tested extraction forms. Once the data had been collected, data extraction sheets were compared; these were checked for discrepancies with the original article, if needed, and verified with 2 reviewers experienced in outcome measures (SR and LC) when in doubt.
The data were collected within 2 categories: (1) studies, including design and sample description, validation-related objectives, and risk of bias, and (2) instruments, where information was compiled from several studies. Data elements collected included the following:
1. Study design and population: type (ie, list options), country, diseases (ie, GCA or TA), sample size, ages, and sex distribution.
2. Instrument: type (ie, questionnaires, index, or scales). Biomarkers and imaging instruments were excluded.
3. Practical applications: method of administration; score interpretation; cutpoints; smallest detectable change, if described; completion time by assessor; strengths; and limitations.
4. Instrument measurement properties: validity, reliability, responsiveness, and feasibility (ie, OMERACT Filter of Truth, Feasibility, and Discrimination).13 Qualitative evaluations of each measurement property were based on the method by Streiner and Kottner.18
After reviewing all studies and the properties per instrument, we had meetings to decide, by consensus, the final rating of each property based on the studies and their results. The decisions were systematically based on the results of the studies with lower risk of bias.
We considered Pearson and Spearman correlation values < 0.5 indicative of inadequate validity, values between 0.5 and 0.75 indicative of doubtful validity, values between 0.75 and 0.9 indicative of adequate validity, and values > 0.90 indicative of very good validity. Regarding reliability, we considered intraclass correlation coefficient (ICC) and Cronbach α < 0.5 indicative of inadequate reliability, values between 0.5 and 0.75 indicative of doubtful reliability, values between 0.75 and 0.9 indicative of adequate reliability, and values > 0.90 indicative of very good reliability.19
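The rating bands above can be sketched as a small lookup function. This is only an illustrative sketch, not the authors' procedure: the function name is hypothetical, the handling of the shared boundary values (0.5, 0.75, and 0.90, which the stated ranges leave ambiguous) is an assumption, and taking the absolute value so that negative correlations are rated by magnitude is also an assumption.

```python
def rate_coefficient(value: float) -> str:
    """Map a correlation, ICC, or Cronbach alpha to a qualitative rating.

    Assumptions (not specified in the text): boundary values fall into
    the lower band (0.75 -> "doubtful", 0.90 -> "adequate"), 0.5 itself
    is "doubtful", and negative coefficients are rated by magnitude.
    """
    magnitude = abs(value)
    if magnitude < 0.5:
        return "inadequate"
    if magnitude <= 0.75:
        return "doubtful"
    if magnitude <= 0.90:
        return "adequate"
    return "very good"


if __name__ == "__main__":
    # Example inputs chosen only to exercise each band.
    for coefficient in (0.35, 0.73, 0.85, 0.97):
        print(coefficient, rate_coefficient(coefficient))
```

The same bands serve both validity (Pearson and Spearman correlations) and reliability (ICC and Cronbach α), so a single function covers both uses.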
Risk of bias (quality) assessment. Risk of bias (ie, quality) for each study was assessed according to the COSMIN checklist.20 The checklist contains 10 boxes. These boxes contain standards for the included measurement properties: patient-reported outcome (PRO) measure development, content validity, structural validity, internal consistency, cross-cultural validity, reliability, measurement error, criterion validity, hypotheses testing, and responsiveness. The studies were evaluated by rating each property, when present, with the following levels of quality: “not available,” “inadequate,” “doubtful,” “adequate,” and “very good.” Risk of bias (ie, quality) for each study was assessed separately and independently by 2 reviewers (GB and AB). When the same instrument was evaluated in different studies, an average rating was given for each measurement property after taking the quality of each study into account.
RESULTS
In total, 3534 references were identified in the initial search strategy; the full texts of 129 of them were reviewed, of which 13 were included21-33 (Supplementary Figure S1 and Table S3, available from the authors upon request). Of these, 1 was a development study,21 11 were validation studies,22-33 and 1 was both a development and validation study.22 All studies included only adults, except for 1 study32 that assessed patients with TA who had a median age of 12 years.
The characteristics of the included studies and the risk of bias (ie, study quality) of each measurement property are shown in Table 1. Baseline characteristics of the populations of the 13 studies included are shown in Supplementary Table S4 (available from the authors upon request). These studies provided information on 12 instruments:
Table 1. Characteristics of the 13 studies included and quality of the measurement property assessment (risk of bias).
· Four instruments were specific to TA: the Indian Takayasu Clinical Activity Score 2010 (ITAS2010),22 the Indian Takayasu Clinical Activity Score A (ITAS.A),22 the Disease Extent Index-Takayasu (DEI-Tak),24 and the Takayasu Arteritis Damage Score (TADS).31,32
· Two instruments were designed for all forms of systemic vasculitis: the Vasculitis Damage Index (VDI)21 and the Birmingham Vasculitis Activity Score (BVAS).26
· Six were general, non–disease-specific instruments: the Brief Illness Perception Questionnaire (BIPQ),32 the Activities of Daily Vision Scale (ADVS),27 the 36-item Short Form Health Survey (SF-36),29,30 the Vision Core Measurement 1 (VCM1),28 the patient global assessment (PtGA) of disease activity,30 and the Multidimensional Fatigue Inventory (MFI).30 No instruments specific to GCA were identified. A detailed description of each included instrument is shown in Supplementary Table S5 (available from the authors upon request).
Regarding the domains of LVV assessed, 4 instruments evaluated disease activity exclusively (ie, ITAS2010, ITAS.A, BVAS, and PtGA), 2 instruments evaluated disease damage exclusively (ie, VDI and TADS), 1 instrument evaluated both disease activity and disease damage (ie, DEI-Tak), 3 instruments evaluated HRQOL (ie, BIPQ, SF-36, and MFI), and 2 instruments evaluated visual function (ie, ADVS and VCM1).
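As a quick consistency check, the domain grouping just described can be written as a mapping and tallied. This is a minimal sketch whose variable names are illustrative; the instrument abbreviations and domain labels are taken from the paragraph above.

```python
# Domain -> instruments, as listed in the paragraph above.
instruments_by_domain = {
    "disease activity": ["ITAS2010", "ITAS.A", "BVAS", "PtGA"],
    "disease damage": ["VDI", "TADS"],
    "disease activity and damage": ["DEI-Tak"],
    "HRQOL/health status": ["BIPQ", "SF-36", "MFI"],
    "visual function": ["ADVS", "VCM1"],
}

# Flatten the mapping to confirm each instrument appears exactly once.
all_instruments = [
    name for names in instruments_by_domain.values() for name in names
]
instrument_count = len(all_instruments)
has_duplicates = len(set(all_instruments)) != instrument_count

if __name__ == "__main__":
    for domain, names in instruments_by_domain.items():
        print(f"{domain}: {len(names)}")
    print("total:", instrument_count)
```

The five groups contain 4, 2, 1, 3, and 2 instruments, which accounts for all 12 instruments with no instrument counted in more than one group.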
Measurement properties of the instruments. There was a high degree of heterogeneity in the measurement properties assessed for each instrument. Validity was the property most frequently assessed among the instruments (11/12, 92%), followed by responsiveness (4/12, 33%). In terms of study quality (ie, risk of bias), of the 32 measurement property assessments, 34% (11/32) were of very good quality, 38% (12/32) of adequate quality, 3% (1/32) of doubtful quality, and 25% (8/32) of inadequate quality (Table 1).
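The quality percentages can be reproduced from the reported counts. The sketch below is illustrative only: the helper name is hypothetical and rounding half up to whole percentages is an assumption about how the figures were rounded.

```python
def round_half_up(x: float) -> int:
    """Round a non-negative number to the nearest integer, halves up."""
    return int(x + 0.5)


# Counts of property assessments per quality level, as reported above.
quality_counts = {
    "very good": 11,
    "adequate": 12,
    "doubtful": 1,
    "inadequate": 8,
}

total_assessments = sum(quality_counts.values())
quality_percentages = {
    level: round_half_up(100 * count / total_assessments)
    for level, count in quality_counts.items()
}

if __name__ == "__main__":
    print("total:", total_assessments)
    for level, pct in quality_percentages.items():
        print(f"{level}: {pct}%")
```

The counts sum to the stated denominator of 32, and the rounded shares match the reported 34%, 38%, 3%, and 25%.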
Table 2 gives a summary of the adequacy of measurement properties of the instruments identified, and Table 3 gives a detailed overview of the results retrieved.
Table 2. Summary of the adequacy of evidence for measurement properties of the LVV instruments identified.
Table 3. Summary of the results on the psychometric properties of the 12 LVV instruments identified.
Disease activity. Regarding TA, the ITAS2010 showed very good internal consistency (α = 0.97), doubtful intraobserver reliability (ICC = 0.60), and adequate interobserver reliability (ICC = 0.92). This score showed doubtful construct and discriminating validity, with moderate correlations with the PtGA (r = 0.73) and the BVAS (r = 0.75) and poor correlation with the National Institutes of Health (NIH) score (r = 0.35). The ITAS.A showed doubtful intraobserver reliability (ICC = 0.59) and very good interobserver reliability (ICC = 0.92). The ITAS.A showed very good correlation with the ITAS2010 (r = 0.98) and poor correlations with the PtGA (r = 0.29) and the NIH score (r = 0.35). Responsiveness was, however, poor for both instruments. Validity was poorly tested (ie, few measurement property studies) for the PtGA.
Concerning instruments designed for use in all types of vasculitis, the BVAS showed doubtful construct validity in the assessment of disease activity in GCA (physician global assessment, r = 0.50). One PRO instrument—the PtGA—showed doubtful construct validity, with moderate correlations with erythrocyte sedimentation rate and C-reactive protein (ρ = 0.71) and poor correlation with the positron emission tomography vascular activity score (PETVAS; ρ = −0.21 to −0.32) in GCA/TA.
Among all instruments assessed, the ITAS2010 and ITAS.A performed best, but with a limited assessment of validity and with demonstrated poor responsiveness. However, it is important to note that no assessment of responsiveness was found in the literature for any other instrument measuring disease activity.
Disease damage. In TA, the DEI-Tak showed adequate construct validity (r = 0.85 with NIH score). The TADS showed inadequate discriminative validity, with poor correlations with disease duration (r = 0.19) and with cumulative corticosteroid dose (r = 0.19).
Concerning instruments designed for use in all types of vasculitis, the VDI showed inadequate discriminating validity (correlation with disease duration, r = 0.25; cumulative glucocorticoid dose, r = 0.19). Neither reliability nor responsiveness was assessed.
Overall, the DEI-Tak was the instrument with the best measurement properties for disease damage.
HRQOL/health status. In total, 3 nonspecific instruments assessing HRQOL/health status were analyzed. The SF-36 physical component score (PCS) and the SF-36 mental component score (MCS) were poorly correlated with the VDI (r = −0.34 and r = −0.23, respectively). However, higher VDI values were detected in patients with PCS values of less than 50. Both the PCS and MCS were not significantly correlated to the PETVAS (ρ = −0.05 and ρ = −0.12, respectively). The BIPQ was significantly correlated to the MFI, the PtGA, the SF-36 PCS, and the SF-36 MCS (ρ = 0.50-0.70, P < 0.001), but it did not correlate with the physician global assessment (ρ = 0.13, P = 0.13). The MFI was significantly negatively correlated with the PETVAS (ρ = −0.23).
Organ function. Visual function was assessed exclusively in GCA through the ADVS and the VCM1. The former showed poor correlation with cumulative steroid dose (r = 0.45). The latter showed poor correlations with all of the SF-36 subscales, except bodily pain (r = −0.22 to −0.40), and inadequate discriminating validity, with similar median VCM1 scores between GCA (4.0, IQR 1-14.5) and non-GCA groups (2.0, IQR 0.25-8.5).
DISCUSSION
To our knowledge, this is the first systematic review summarizing the measurement properties of instruments developed or validated for LVV. In this study, the measurement properties of 12 outcome measurement instruments for GCA and TA covering the domains of disease activity, damage, organ function, and HRQOL/health status were assessed. The domains identified in this systematic review and endorsed by OMERACT as the core set of outcomes for randomized controlled trials of antineutrophil cytoplasmic antibody–associated vasculitis (AAV)34 have been suggested as potential domains for future clinical investigation in LVV.7,35 Our study identified the ITAS2010, ITAS.A, and DEI-Tak as the instruments with the most adequate measurement properties for disease activity and/or damage in TA; these instruments could, therefore, be recommended for research and/or clinical practice.
In TA, despite the identification of specific outcome measurement instruments for disease activity (ie, ITAS2010 and ITAS.A) and disease damage (ie, DEI-Tak) with adequate measurement properties, a combination of clinical symptoms, acute-phase reactants, imaging and glucocorticoid-sparing effects, as well as other composite scores (eg, NIH score) were still largely used in recent clinical trials.36-39 However, since some of the identified tools (ie, ITAS2010, ITAS.A, and DEI-Tak) have been developed and validated by the same group, one should be cautious when interpreting the results.
In contrast to TA, no instruments measuring disease activity or damage specific to GCA were identified. The OMERACT 2016 workshop cited the NIH score, the BVAS, the DEI-Tak, and the ITAS2010 as the main outcome instruments used in clinical research for TA and highlighted that similar disease-specific tools do not exist for GCA.7 The BVAS or a combination of clinical symptoms, acute-phase reactants, glucocorticoid dose and duration, and imaging have been used in previous GCA clinical trials.40-43 This systematic review identified 1 study that evaluated the performance of the BVAS in the assessment of GCA disease activity.26 This study revealed substantially limited utility of the BVAS in GCA, with a considerable number of patients (11%) with active disease having a BVAS of 0. Moreover, ischemic symptoms secondary to vasculitis are not included in the BVAS. This contrasts with the good performance of the BVAS in AAV reported in the literature, which might be related to the fact that this tool was better designed and validated in small-vessel vasculitis than in LVV.44
Regarding PROs reflecting disease activity, only the widely used generic PtGA was identified in this review. This instrument was used in both GCA and TA trials; however, measurement properties were insufficiently assessed for this instrument. Although patients’ evaluation of disease activity/damage is usually a difficult goal to reach, developing a composite measure combining the perspectives of patients and physicians, as has been done for other systemic inflammatory diseases,45 could improve the evaluation of LVV.
There has been growing interest in the importance of integrating patient perspectives regarding the effects of their disease, and this has been proposed by OMERACT as a mandatory area to be assessed in LVV clinical trials. The measurement properties of instruments measuring HRQOL were assessed using generic instruments not specifically validated for LVV. We acknowledge that HRQOL is not a domain that is part of the LVV preliminary/draft core domains, but as we were aiming at being inclusive in this systematic literature review, we collected all data on all existing instruments and matched them to the OMERACT LVV draft core domains, where possible. The instruments measuring HRQOL are, therefore, an example of instruments that do not cover any of the OMERACT domains, so HRQOL is a core theme of those instruments but not an OMERACT domain. Indeed, HRQOL has not been investigated extensively in patients with LVV, and whether generic PROs are sensitive to change in GCA and TA has not been demonstrated. Although the SF-36 does not strictly measure HRQOL and is rather an indicator of overall health status, we decided to include this instrument in our analysis because we believe that HRQOL is intimately related to health status. Indeed, the SF-36 is a widely used generic PRO that covers 8 different domains, including physical and social functioning and mental health,46 and has been widely used in trials of rheumatic and musculoskeletal diseases47,48 and AAV.34 However, this review demonstrated that there is currently inadequate evidence to suggest using the SF-36 as a generic instrument in LVV clinical trials. Further, the ADVS and the VCM1 have been used for subjectively evaluating visual function in GCA but with inadequate construct and discriminating validity according to our analysis.
Surprisingly, the number of studies identified as assessing measurement properties of instruments was limited. This might be a consequence of the selection criteria chosen, which excluded some studies where miscellaneous types of vasculitis were analyzed and did not provide separate information on GCA or TA. This review opted for the most conservative and cleanest approach to specifically collect data on the measurement properties of GCA and TA measurement instruments. Another limitation of this systematic review is the heterogeneity of the included studies, with no hierarchy or settled format of properties required to properly validate a specific instrument, as well as the heterogeneity in how similar properties were assessed, which limits direct comparison of the performance of different instruments. When assessing construct validity, contrary to recommendations, most studies did not specifically report, or did not formulate, a priori hypotheses. One should also emphasize that most of the included studies reported on 1 or 2 measurement properties only, making it difficult to have an overall assessment of the instrument. It is important to note that reliability and validity are not fixed properties of a scale; they depend on the testing situation. Indeed, these properties pertain to the results obtained with an evaluation instrument and not to the instrument itself.18
In addition, it is challenging to directly compare the measurement properties of different instruments, since the comparators are not always the same. Therefore, we collected and reported all the data in the papers in a systematic fashion to provide an overview of what has been published, avoiding direct or head-to-head comparisons.
Finally, even though OMERACT and COSMIN frameworks might function as the backbone of these assessments, the heterogeneity, together with the scarcity of the studies, led to limited available evidence on the measurement properties of the instruments analyzed.
In conclusion, this systematic review demonstrated that specific tools for the assessment of outcome domains in LVV are lacking, particularly for GCA. GCA and TA are both very rare conditions; distinguishing and separately measuring damage vs activity, which are often related, is not straightforward. With recent advances in imaging, incorporating composite scores might help to better assess these conditions. Moreover, most of the instruments included in the analysis were only partially validated. Our systematic review also highlighted the need for specific PROs to evaluate both disease activity/damage and HRQOL for both GCA and TA.
Footnotes
The authors declare no conflicts of interest relevant to this article.
- Accepted for publication October 12, 2022.
- Copyright © 2023 by the Journal of Rheumatology






