Abstract
Objective. To evaluate metric properties of the SpondyloArthritis Research Consortium of Canada (SPARCC) score of the sacroiliac (SI) joints.
Methods. Patients with back pain (≥ 3 months, ≤ 2 years, onset < 45 years) were included in the SPACE cohort (SpondyloArthritis Caught Early). Patients with (possible) axial spondyloarthritis had followup visits after 3 and 12 months and were treated according to clinical practice. Magnetic resonance imaging (MRI) of the SI joints (MRI-SI) was scored in 2 independent campaigns (campaign 1: at baseline and 3 months; campaign 2: at baseline, 3 months, and 12 months) by 2 different blinded reader pairs, applying the Assessment of Spondyloarthritis International Society (ASAS) definition (MRI-SI+ vs MRI-SI−; discordant cases were adjudicated by a third reader) and SPARCC score (mean of 2 agreeing readers). Calculations were made for agreement between SPARCC score cutoff values and a consensus judgment of MRI-SI+ (ASAS definition) as external standard, change in SPARCC score, and smallest detectable changes (SDC) over 3 and 12 months.
Results. SPARCC score ≥ 2 showed best agreement with MRI-SI+ in both campaigns. Regarding observed changes in relation to SDC, in campaign 1 the SPARCC score changed over 3 months in 70/151 patients; 26/70 patients changed > SDC (3.4), of whom 20 received stable treatment. In campaign 2, 20/68 patients showed changes in SPARCC score over 3 months; 11/20 changed > SDC (2.1), of whom 8 received stable treatment. Over 1 year, 23/74 patients in campaign 2 changed their SPARCC score; 14/23 changed > SDC (2.4), of whom 7 received stable treatment.
Conclusion. SPARCC score ≥ 2 can be used as a surrogate for a consensus judgment of MRI-SI+ (ASAS definition) in clinical trials. The SDC ranged from 2.1 to 3.4 depending on reader pair and was close to the proposed minimally important change of 2.5.
A positive magnetic resonance imaging (MRI) of the sacroiliac joints according to the Assessment of Spondyloarthritis International Society (ASAS) definition (positive MRI)1 is part of the ASAS axial spondyloarthritis (axSpA) criteria2 and is increasingly used to test eligibility of patients with axSpA for clinical trials3,4,5. Within clinical trials, MRI-SI is often repeated over short periods (e.g., 12 weeks) to test the efficacy of treatment (especially biological) regarding changes in inflammation. For this efficacy read, the SpondyloArthritis Research Consortium of Canada (SPARCC) score is frequently used because it measures inflammation on a continuous scale with good sensitivity to change6,7. It is unknown what SPARCC score cutoff value is the equivalent of a positive MRI, a value that is needed to link the read for eligibility and the efficacy reading. For example, this information would be useful to define groups with MRI scored according to SPARCC scores as having or not having a “positive MRI,” to study differences in treatment response over time4.
Treatment with biologicals may dramatically influence inflammatory signs on MRI8,9,10,11, but inflammation may also spontaneously change over time in patients without treatment and in patients taking stable nonbiological treatment12,13,14. However, it is not clear how many SPARCC score units these spontaneous changes represent12,13,14. Moreover, these spontaneous changes are likely to differ with the length of followup. A minimally important change (MIC) of 2.5 SPARCC units has been proposed, based on the patient global assessment as external anchor15. It is known that interreader reliability of SPARCC scores ranges from moderate to high, both for a fixed timepoint (ICC 0.69–0.97; references 16,17,18) and for changes over time (ICC 0.51–0.89; references 7,8,18)6. It would be of additional value to know about interreader reliability in terms of smallest detectable change (SDC), to be able to judge whether the SDC is sufficiently small to detect the proposed MIC.
The aim of our study is 3-fold: first, to define which SPARCC score best approximates a “positive MRI” judgment; second, to establish an SDC for a 3-month period and for a 1-year period; third, to describe which variation in SPARCC score over a 3-month and 1-year period can be expected in patients without (change in) treatment.
MATERIALS AND METHODS
Study population
Data from the SPondyloArthritis Caught Early (SPACE) cohort are used for this analysis. An extensive description of the SPACE cohort is given elsewhere19. In short, the SPACE cohort is an ongoing cohort started in January 2009, including patients aged 16 years and older with back pain (≥ 3 months, ≤ 2 years, onset age < 45 years) visiting the rheumatology outpatient clinics of 5 participating centers. Patients were not included if they had other painful conditions (not related to SpA) that could interfere with the evaluation of the disease. After signing informed consent forms, all patients underwent a diagnostic examination at baseline, including MRI and plain radiographs of the SI joints, HLA-B27 testing, and examination for other SpA features. Patients fulfilling the ASAS axSpA criteria or patients with possible axSpA were included for followup visits after 3 and 12 months. Possible axSpA was defined as the presence of at least 1 specific SpA feature with a high positive likelihood ratio (LR+ above 6) or at least 2 less specific SpA features (LR+ below 6), but not fulfilling the ASAS axSpA criteria20.
MRI of the SI joints
MR imaging was performed on a 1.5T scanner, acquiring T1-weighted turbo spin echo (T1TSE; TR 550/TE 10) and short-tau inversion recovery (STIR; TR 2500/TE 60) sequences. Slices were of 4 mm thickness in the coronal oblique view of the SI joints.
All readers in our study (n = 4) were extensively trained in reading MRI according to the ASAS definition and the SPARCC score during a calibration session, supervised by a senior radiologist (MR) and a senior rheumatologist (DvdH), discussing definitions of lesions, examples, and pitfalls. Next, all readers independently read 30 blinded MRI to calculate agreement on the ASAS definition (κ = 0.75 to κ = 0.87 for the different pairs of readers), and to calculate agreement on SPARCC scores (ICC 0.81 to ICC 0.95 for the different reader pairs on status scores and ICC 0.78 to ICC 0.97 on change scores). The mean baseline SPARCC score was 10.3 (SD 11.7); the mean followup score was 7.4 (10.0); the mean change score was −2.9 (9.9). A consensus meeting followed, in which the same supervising rheumatologist and radiologist participated. In addition, all readers participated in a reading exercise in which the original developers of the SPARCC score participated as well. The mean SPARCC score of the 32 evaluated cases was 5.3 (7.1). Agreement on the status SPARCC scores was acceptable (ICC 0.77) for all readers including the original developers. A consensus meeting with the original developers was organized, and agreement was considered sufficiently high to start scoring the SPACE cohort.
Two separate reading campaigns were performed by different pairs of readers (RvdB and MdH in campaign 1; PB and MdH in campaign 2), with partly overlapping patients and images. Patients in campaign 1 were included between January 2009 and November 2012 in 5 different centers and patients in campaign 2 were included between January 2009 and October 2013 in 1 center. In campaign 1, baseline and 3-month MRI-SI were evaluated; in campaign 2, baseline, 3-month, and 1-year MRI-SI were evaluated. In both campaigns, MRI-SI were independently read by the 2 trained readers for the fulfillment of the ASAS definition1 and according to the SPARCC score6, blinded for the time sequence of the MRI-SI as well as for clinical and laboratory data.
Every inflammatory lesion typical for SpA was marked according to the SPARCC score. In the next step, the readers reviewed the MRI again and marked whether the MRI was positive or negative according to the ASAS definition, based on the global evaluation of the entire MRI-SI. An MRI-SI can be marked positive according to the ASAS definition if ≥ 1 bone marrow edema (BME) lesion highly suggestive of SpA is present on ≥ 2 consecutive slices, or if several BME lesions highly suggestive of SpA are visible on a single slice. The presence of synovitis, enthesitis, or capsulitis only, without BME, is not sufficient for a positive MRI-SI1. In case the 2 readers disagreed on the presence of a positive MRI, a third trained reader served as adjudicator (VNC in campaign 1; RvdB in campaign 2).
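As an illustration only, the quantitative part of this rule can be written as a small function. The sketch below assumes that each MRI-SI is summarized as a list of per-slice counts of BME lesions highly suggestive of SpA (a hypothetical representation, not the readers' actual scoring form). It takes "several" to mean 2 or more, which is an assumption because the definition as quoted gives no number, and it does not verify that lesions on adjacent slices are the same lesion.

```python
from typing import Sequence


def asas_positive(lesions_per_slice: Sequence[int], several: int = 2) -> bool:
    """Apply the quantitative rule to per-slice BME lesion counts.

    `lesions_per_slice` and `several` are illustrative names; `several=2`
    is an assumed threshold for "several lesions on a single slice".
    """
    # Several lesions on any single slice are sufficient.
    if any(n >= several for n in lesions_per_slice):
        return True
    # Otherwise at least one lesion must be present on two consecutive slices.
    pairs = zip(lesions_per_slice, lesions_per_slice[1:])
    return any(a >= 1 and b >= 1 for a, b in pairs)


# One lesion each on slices 1, 4, and 6 (scattered): rule not fulfilled.
scattered = asas_positive([1, 0, 0, 1, 0, 1])
# One lesion visible on slices 2 and 3 (consecutive): rule fulfilled.
consecutive = asas_positive([0, 1, 1, 0, 0, 0])
```

The scattered case corresponds to the situation raised in the Discussion, in which lesions contribute to the SPARCC score without fulfilling the ASAS definition.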
According to the SPARCC score, the presence of increased signal corresponding to BME lesions highly suggestive of SpA is marked on 6 consecutive slices of an MRI-SI, starting on the slice on which at least 1 cm of vertical height of the cartilage compartment is visible, from anterior to posterior, assessing the cartilaginous compartment of the SI joints and the antero-inferior portion of the SI joint. At the posterior aspect of the SI joints there is a natural division into upper and lower quadrants by intervening fat and fibrous tissue. When less than 1 cm of a quadrant is visible, it is no longer scored18. Each SI joint is divided into 4 quadrants (upper iliac, lower iliac, upper sacrum, and lower sacrum)6,18. The maximum score for 2 SI joints on each slice is 8. In addition to these 8 points per slice, a score for “intensity” (adding 1 point) may be assigned to each SI joint if an “intense signal” is seen in any quadrant on each slice. The signal from presacral blood vessels serves as the reference for a lesion that is scored as intense. Further, a score for depth (adding 1 point) may be assigned to each SI joint if a homogeneous and unequivocal increase in signal extends over a depth of at least 1 cm from the articular surface on each slice, resulting in a maximum score of 12 points per slice. The total maximum SPARCC score is 72 (references 6,18). The mean SPARCC scores of the 2 readers were used; in case there was a third reader involved because of disagreement among the 2 initial readers regarding a positive MRI according to the ASAS definition, the mean of the SPARCC scores of the 2 readers in agreement on a “positive MRI” for that particular case was used.
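The arithmetic of the score (8 quadrant points, 2 intensity points, and 2 depth points per slice, over 6 slices) can be sketched as follows. The `SliceReading` container and the function names are hypothetical and serve only to make the point counts explicit. The sketch does not model slice selection or the 1 cm visibility rule, and it does not check that an intensity or depth point is accompanied by a scored quadrant.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Quadrants = Tuple[bool, bool, bool, bool]  # upper iliac, lower iliac, upper sacrum, lower sacrum


@dataclass
class SliceReading:
    """Hypothetical record of one reader's findings on one slice."""
    left_quadrants: Quadrants
    right_quadrants: Quadrants
    left_intense: bool = False
    right_intense: bool = False
    left_deep: bool = False
    right_deep: bool = False


def slice_score(s: SliceReading) -> int:
    """Points for one slice: up to 8 (quadrants) + 2 (intensity) + 2 (depth) = 12."""
    quadrant_points = sum(s.left_quadrants) + sum(s.right_quadrants)
    intensity_points = int(s.left_intense) + int(s.right_intense)
    depth_points = int(s.left_deep) + int(s.right_deep)
    return quadrant_points + intensity_points + depth_points


def sparcc_score(slices: Sequence[SliceReading]) -> int:
    """Total over at most 6 slices, so the maximum is 6 x 12 = 72."""
    if len(slices) > 6:
        raise ValueError("at most 6 slices are scored")
    return sum(slice_score(s) for s in slices)


def reader_mean(score_a: float, score_b: float) -> float:
    """Mean of the 2 readers' total scores, as used in the analyses."""
    return (score_a + score_b) / 2
```

A slice with every quadrant marked and both joints scored as intense and deep yields 12 points, and 6 such slices yield the maximum of 72.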
For both the SPARCC score and the ASAS definition assessment on the STIR sequence, the readers took into account the findings on the T1TSE sequences as well.
Treatment
Patients in the SPACE cohort are not treated according to a fixed protocol, but according to usual clinical practice by their rheumatologist. Treatment with nonsteroidal antiinflammatory drugs (NSAID) was recorded according to the ASAS recommendations, resulting in a 0–100 score whereby 0 means no NSAID intake at all and 100 means a daily intake at a full dose over the whole period of interest21. Treatment with disease-modifying antirheumatic drug (DMARD) and anti-tumor necrosis factor (TNF) therapy was recorded as present or absent.
To investigate variation in SPARCC scores over time, patients were categorized according to their treatment over the period of interest: no treatment, stable NSAID and/or DMARD intake, and change in NSAID and/or DMARD intake. Patients receiving anti-TNF therapy during the period of interest were excluded from the analysis on variation in SPARCC scores.
Statistical analysis
Baseline characteristics of patients in both groups were investigated using descriptive statistics. Agreement (Cohen κ) between MRI positivity based on several SPARCC score cutoff values (≥ 1, ≥ 2, ≥ 3, and ≥ 4) and the consensus judgment of a “positive MRI” as external standard was calculated using cross-tabulation. Agreement on positive cases (positive agreement) and on negative cases (negative agreement) was also calculated22.
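For illustration, the cross-tabulation and the 3 agreement measures can be sketched as below. The function names and the example counts are hypothetical and are not study data. Positive and negative agreement are computed here as proportions of specific agreement, 2a/(2a + b + c) and 2d/(2d + b + c); this is a commonly used form and is an assumption, because the text cites the method without giving the formula.

```python
from typing import Sequence, Tuple


def cross_tab(scores: Sequence[float], reference: Sequence[bool],
              cutoff: float) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) for 'score >= cutoff' against the external standard."""
    tp = fp = fn = tn = 0
    for score, ref_positive in zip(scores, reference):
        test_positive = score >= cutoff
        if test_positive and ref_positive:
            tp += 1
        elif test_positive:
            fp += 1
        elif ref_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def agreement_stats(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (Cohen kappa, positive agreement, negative agreement).

    Undefined (division by zero) for degenerate tables, e.g. when one
    category is never used by either classification.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of the 2x2 table.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    positive_agreement = 2 * tp / (2 * tp + fp + fn)
    negative_agreement = 2 * tn / (2 * tn + fp + fn)
    return kappa, positive_agreement, negative_agreement


# Hypothetical 2x2 counts for one cutoff, for illustration only.
kappa, pos_agree, neg_agree = agreement_stats(tp=40, fp=5, fn=1, tn=54)
```

Repeating `cross_tab` and `agreement_stats` for each candidate cutoff (≥ 1, ≥ 2, ≥ 3, and ≥ 4) gives the quantities that are compared between cutoffs.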
Changes in SPARCC score over the period of interest [baseline to 3 months (both campaigns); baseline to 1 year (campaign 2)] were visualized in cumulative probability plots in which patients were grouped based on treatment. Next, SDC were calculated based on a 95% level of agreement (95% LoA) between the 2 readers on the change scores for both baseline to 3-month and baseline to 1-year intervals, using the following formula: $\mathrm{SDC} = 1.96 \times \mathrm{SD}_{\Delta} / (\sqrt{2} \times \sqrt{k})$, whereby $\mathrm{SD}_{\Delta}$ represents the SD of the interreader differences in change scores and k represents the number of readers (2 in our study)23. The SDC are also displayed in Bland-Altman plots that plot the mean SPARCC score changes of the 2 readers (X axis) and the interreader differences in SPARCC score changes (Y axis). In addition, the mean of the interreader differences in SPARCC score changes (which is a reflection of the systematic error between the 2 readers) and the 95% LoA are presented in these plots. SPSS software version 20.0 was used for statistical analysis.
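A minimal sketch of this calculation is given below. It assumes the commonly used form SDC = z × SD of the interreader differences / (√2 × √k) with z = 1.96 for the 95% LoA, and uses the sample SD; both are assumptions of the sketch. The function names and the example change scores are hypothetical. The analyses themselves were run in SPSS, not with this code.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def bland_altman_points(changes_a: Sequence[float],
                        changes_b: Sequence[float]) -> List[Tuple[float, float]]:
    """(X, Y) per patient: mean change of the 2 readers, interreader difference."""
    return [((a + b) / 2, a - b) for a, b in zip(changes_a, changes_b)]


def smallest_detectable_change(changes_a: Sequence[float],
                               changes_b: Sequence[float],
                               z: float = 1.96,
                               k: int = 2) -> Tuple[float, float, Tuple[float, float]]:
    """Return (SDC, mean interreader difference, limits of agreement)."""
    diffs = [a - b for a, b in zip(changes_a, changes_b)]
    bias = mean(diffs)      # systematic error between the 2 readers
    sd_diff = stdev(diffs)  # sample SD of the interreader differences
    limits = (bias - z * sd_diff, bias + z * sd_diff)
    sdc = z * sd_diff / (math.sqrt(2) * math.sqrt(k))
    return sdc, bias, limits


# Hypothetical change scores of 2 readers for 5 patients.
reader_a = [2.0, -2.0, 0.0, 2.0, -2.0]
reader_b = [0.0, 0.0, 0.0, 0.0, 0.0]
sdc, bias, limits = smallest_detectable_change(reader_a, reader_b)
```

With k = 2 the denominator equals 2, so under this assumed form the SDC is half of the LoA half-width.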
RESULTS
Patients with available baseline MRI-SI were included in the analysis of the agreement between the SPARCC score cutoff value and positive MRI [n = 294 (campaign 1) and n = 249 (campaign 2)]. There is a partial overlap (49.1%) between patients included in campaign 2 and those included in campaign 1. In both campaigns the population was young, with short symptom duration. Around one-third of the patients were male and around one-third fulfilled the ASAS axSpA criteria (Table 1).
A 3-month followup MRI-SI was available for 154 patients in campaign 1. However, 3/154 patients received anti-TNF therapy during this period and were therefore excluded from the followup part of the analysis of the SPARCC score changes over time and SDC. In campaign 2, a 3-month followup MRI-SI was available in 70 patients, and in 76 patients a 1-year followup MRI-SI was available. Two patients received anti-TNF therapy, leaving MRI-SI of 68 (3-month period) and 74 patients (1-year period) for followup analyses.
SPARCC score cutoff
In both campaigns, there was a high level of agreement between MRI positivity based on all tested SPARCC score cutoff values and the consensus judgment of a “positive MRI” as external standard (Table 2). A cutoff value of ≥ 2 showed the highest κ values (0.94 in campaign 1 and 0.98 in campaign 2) and provided the best balance in terms of misclassifications in comparison to the external standard; 5 false-positive and 1 false-negative classification in campaign 1; 0 false-positive and 1 false-negative classification in campaign 2.
Smallest detectable change of SPARCC score
Of the patients with available followup MRI, the mean SPARCC score at baseline was 4.0 (SD 8.3) and 2.3 (SD 5.7) in campaigns 1 and 2, respectively. At 3 months, the mean SPARCC score was 3.4 (SD 6.7) and 1.6 (SD 3.8) in campaigns 1 and 2, respectively, and at 1 year the mean SPARCC score was 1.4 (SD 4.0; campaign 2).
Bland-Altman plots show the mean of the 2 readers in SPARCC score changes over the 3-month (campaign 1; Figure 1) and over the 3-month and 1-year period (campaign 2; Supplementary Figure 1, available online at jrheum.org) against the difference between the 2 readers in SPARCC score changes over those periods. The plots show that a large number of observations is clustered around the mean difference of 0, and that differences between readers occur with similar amplitude across the entire range of the SPARCC score (a homoscedastic pattern). To visualize the high number of overlapping observations, series of ranges were defined. All observations were grouped into their corresponding range, increasing exponentially on the positive side of 0 and decreasing exponentially on the negative side, displayed on the X axis. The SDC based on the 95% LoA in campaign 1 over the 3-month period is 3.4 SPARCC units, depicted in Figure 1 as the dark grey area reflecting the SDC of both increased and decreased SPARCC scores over time. The SDC (95% LoA) in campaign 2 over the 3-month period is 2.1 SPARCC units (Supplementary Figure 1, top panel, available online at jrheum.org) and over the 1-year period, 2.4 SPARCC units (Supplementary Figure 1, bottom panel).
In comparison, the SDC based on the 80% LoA are 2.2 (80% LoA −4.3 to 4.5), 1.4 (80% LoA −2.5 to 2.9), and 1.6 (80% LoA −3.2 to 3.0) SPARCC units, respectively.
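The relation between these limits and the SDC can be checked with a few lines of arithmetic. The sketch assumes that the SDC equals the LoA half-width divided by √2 × √k with k = 2 readers, which is the commonly used form and an assumption here. The function name is hypothetical, and the labels attached to the 3 pairs of limits follow the order in which the 95% SDC were reported.

```python
import math


def sdc_from_loa(lower: float, upper: float, k: int = 2) -> float:
    """SDC implied by limits of agreement, assuming SDC = half-width / (sqrt(2) * sqrt(k))."""
    half_width = (upper - lower) / 2
    return half_width / (math.sqrt(2) * math.sqrt(k))


# 80% LoA as reported in the text, with the SDC reported alongside each.
reported = {
    "campaign 1, 3 months": ((-4.3, 4.5), 2.2),
    "campaign 2, 3 months": ((-2.5, 2.9), 1.4),
    "campaign 2, 1 year": ((-3.2, 3.0), 1.6),
}
implied = {label: sdc_from_loa(*loa) for label, (loa, _) in reported.items()}
```

The implied values are approximately 2.2, 1.35, and 1.55, which agree with the reported 2.2, 1.4, and 1.6 after rounding to 1 decimal.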
Change in SPARCC scores over 3 months and 1 year
Eighty-one out of 151 patients in campaign 1 (53.6%) showed no change in SPARCC score over the 3-month period of which 75/81 (92.6%) had a SPARCC score of 0 at both timepoints. In the 70 out of 151 patients (46.4%) showing a change in SPARCC score, 27 increased and 43 decreased [mean change −1.1 (6.3); median change −0.5 (range −16.5 to 16.0); Figure 2 and Table 3]. In 26 out of 70 patients (37.1%) with SPARCC score changes, the change was more than the SDC (3.4); in 16 patients the SPARCC score decreased (2 patients without treatment, 11 with stable NSAID intake, 2 with stable NSAID and DMARD intake, 1 started NSAID intake) and in 10 patients it increased (2 without treatment, 7 with stable NSAID intake, 1 started NSAID intake). In the remaining 44 patients (62.9%) the SPARCC score changes were within the area still compatible with measurement error.
In campaign 2, two followup intervals for the same patients were available. Over the 3-month period, SPARCC score did not change in 48 out of 68 patients (70.6%); 46/48 patients (95.8%) had a SPARCC score of 0 at both time-points. In the remaining 20 patients (29.4%) the SPARCC score changed; 14 patients showed a decrease and 6 patients an increase [mean change −3.1 (4.6); median change −1.5 (range −12.5 to 5); Supplementary Figure 2a, available online at jrheum.org, and Table 3]. Eleven out of 20 patients (55.0%) showed a SPARCC score change > SDC (2.1); 10 patients showed a decrease (1 without treatment, 6 with stable NSAID intake, 2 with stable NSAID and DMARD intake, 1 started NSAID intake) and 1 patient increased (started NSAID intake). The remaining 9 patients (45.0%) had SPARCC score changes still compatible with measurement error.
The results over the 1-year period in campaign 2 are similar to the results over the 3-month period in campaign 2, although more variation between patients is seen; 51/74 patients (68.9%) did not show a change in SPARCC score; of them, 50 patients (98.0%) had a SPARCC score of 0 at both timepoints. The remaining 23 patients (31.1%) showed a change in SPARCC score; 16 patients showed a decrease and 7 an increase [mean change −2.9 (7.5); median change −1.0 (range −18.0 to 12.0); Supplementary Figure 2b, available online at jrheum.org, and Table 3]. Fourteen out of the 23 patients (60.9%) showed a SPARCC score change of more than the SDC (2.4); 10 patients showed a decrease (2 without treatment, 4 with stable NSAID intake, 2 with stable DMARD intake, 1 stopped NSAID intake, 1 started but stopped again NSAID intake), and 4 patients showed an increase (1 with stable NSAID intake, 1 stopped NSAID intake, 1 started NSAID intake, 1 stopped DMARD intake but continued NSAID intake). In the remaining 9 patients (39.1%), SPARCC score changes were not beyond measurement error.
At least half of the patients showing changes in SPARCC score of more than the SDC in each analysis [20/26 (76.9%; campaign 1), 8/11 (72.7%; 3-month period campaign 2), and 7/14 (50.0%; 1-year period campaign 2)] had stable NSAID and/or DMARD intake.
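The 3-way split used in this section (no change, change within the SDC, change beyond the SDC) can be sketched as follows. The function names and the example change scores are hypothetical. The threshold is applied strictly (> SDC), following the wording used above.

```python
from collections import Counter
from typing import Iterable


def classify_change(change: float, sdc: float) -> str:
    """Label one patient's change in SPARCC score relative to the SDC."""
    if change == 0:
        return "no change"
    if abs(change) > sdc:
        return "beyond SDC"
    return "within measurement error"


def summarize(changes: Iterable[float], sdc: float) -> Counter:
    """Count patients per category."""
    return Counter(classify_change(c, sdc) for c in changes)


def percent(part: int, whole: int) -> float:
    """Percentage rounded to 1 decimal, as reported in the text."""
    return round(100 * part / whole, 1)


# Hypothetical change scores, classified against the campaign 1 SDC of 3.4.
counts = summarize([0, 0, 1.5, -4.0, 3.4, 5.0], sdc=3.4)

# Proportions reported for campaign 1: 26 of 70 changed more than the SDC.
beyond_pct = percent(26, 70)
within_pct = percent(44, 70)
```

A change exactly equal to the SDC is counted as within measurement error under this strict threshold.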
DISCUSSION
Our study, performed in the SPACE cohort, has shown in 2 campaigns that a cutoff value of 2 SPARCC units is best compatible with a consensus judgment of a positive versus negative MRI according to the ASAS definition. These results were expected because the ASAS definition of a positive MRI-SI includes (apart from a qualitative part, i.e., BME lesions highly suggestive of spondyloarthritis) a quantitative part that requires at least 1 BME lesion visible on at least 2 consecutive slices or several lesions on a single slice1. However, in theory, a SPARCC score can be high because of the presence of several small lesions (highly suggestive of SpA) scattered over several slices (e.g., 1 lesion on slice 1, another lesion on slice 4 and another lesion on slice 6), but still not fulfilling the ASAS definition. A SPARCC score can also be high if 1 lesion is assigned as “intense” or “deep” while it is visible on only 1 slice. Moreover, the SPARCC score prescribes that lesions are scored in the 6 slices representing the largest proportion of the cartilaginous component of the SI joints, while the ASAS definition takes all slices into account1,6,18.
Occasionally, part of a lesion may be visible on only 1 of the 6 selected slices, while the remaining part of the lesion is visible outside those 6 selected slices, or a slice outside those selected 6 shows several lesions. However, these considerations are mainly theoretical and do not appear very frequently (1 case in our study). Therefore, a SPARCC cutoff level of 2 units may serve as a surrogate for the ASAS definition of a positive MRI and could be used in clinical trials with central efficacy reading to derive a dichotomy (positive vs negative).
The SDC in campaign 2 (2.1 SPARCC units over 3 months and 2.4 over 1 year) are close to the proposed MIC of 2.5 SPARCC units, which was calculated using pooled changes over 12 and 52 weeks15, but the SDC of campaign 1 (3.4) is slightly higher. This suggests that the previously proposed MIC is close to measurement error in our study based on 2 different reader pairs and different periods of followup.
A large proportion of the SPARCC score changes seen in the patients in both reading campaigns could be considered noise because these changes were smaller than the SDC [62.9% and 45.0% (3 months, campaigns 1 and 2, respectively) and 39.1% (1 year, campaign 2)]. To investigate the influence of nonbiological treatment on inflammation on MRI-SI, only patients with SPARCC score changes greater than the SDC were taken into account. Somewhat surprisingly, at least half of the patients with a change in SPARCC score greater than the SDC were on stable NSAID and/or DMARD treatment. Some patients taking stable doses of NSAID increased in SPARCC score while others who also had stable NSAID intake showed a decrease in SPARCC score. These results are in line with the results found in trials where patients using NSAID (either in an open-label trial or in a placebo group) also showed both increased and decreased inflammation scores on MRI-SI over 6 and 16 weeks, respectively12,24. Moreover, patients with stable background treatment in the placebo group of the ABILITY-1 trial had slightly decreased SPARCC scores at group level, as we found in our study4.
Although too few patients in the SPACE cohort used DMARD to draw conclusions on the effect of DMARD, comparable effects can be expected. The comparator group in the ESTHER trial using sulfasalazine showed a mean decrease of 1.7 and 1.9 SPARCC units over 24 and 48 weeks, respectively14. In the comparator group of another trial where patients used methotrexate, a mean of 1.4 (95% CI −0.8, 3.5) inflammatory lesions resolved over 30 weeks25. Although an overall decrease in inflammation score was seen in these trials, some patients increased in inflammation score on MRI-SI when looking at the individual level14,25. These results indicate that in patients with stable treatment, changes may occur in BME on MRI-SI that are beyond measurement error, which may point to true fluctuation in inflammatory activity over time.
Direct comparison of our results with the results of drug efficacy trials is difficult because the SPACE cohort is an observational cohort including unselected patients with back pain of short duration, resulting in a heterogeneous patient population with few positive MRI and low baseline mean SPARCC scores, while drug efficacy trials select patients with high levels of disease activity. In patients selected because of a high level of disease activity, a decrease in scores is more likely (regression to the mean) in comparison to an unselected group of patients. Thus, the patients in the SPACE cohort will likely not be representative of patients in trials. Nevertheless, we have also observed an overall decrease in the SPACE cohort, just as in the trials. This might occur because patients preferably seek help in case of maximum complaints, which is by default the timepoint of inclusion in the SPACE cohort. It is possible that the results would have been different if our study had been performed in a longstanding or severely diseased group of patients. Further, the SPACE cohort is not designed to investigate the effects of treatment for inflammation on MRI. For example, and in contrast to drug efficacy trials, the start date of therapy is not closely linked to the date of the MRI.
Another possible limitation is that the readers have given their judgment based on the ASAS definition immediately after the evaluation according to the SPARCC score. Because the quantitative part of the ASAS definition resembles a SPARCC score of 2, the choice of the value of 2 as the best SPARCC score to serve as cutoff level for negative and positive MRI may not be entirely independent. Yet, it should be stressed that the readers were not trained in scoring the MRI positive according to the ASAS definition if the SPARCC score was 2 or higher, but based their judgment of a positive MRI only on the complete MRI view. Nevertheless, it would have been better if different scores were acquired independently or even by different readers, as is frequently the case in clinical trials. Besides, we repeat that our study was not primarily designed to develop lesion-based cutoffs of a positive MRI for the purpose of disease classification.
A SPARCC score of 2 as cutoff value best reflects the caesura between a positive and negative MRI according to the ASAS definition. This cutoff can be used (in clinical trials) to create a dichotomous MRI variable of potential prognostic interest. The SDC we have obtained in our 2 campaigns are close to the proposed MIC of 2.5 SPARCC units, which adds credibility to a cutoff level of 2.5 units in that it represents a true difference rather than only measurement error. Surprisingly, while patients received stable treatment, true (> SDC) changes in SPARCC score over time (both increases and decreases) were frequently observed. This observation strongly suggests that MRI activity fluctuates over time.
ONLINE SUPPLEMENT
Supplementary data for this article are available online at jrheum.org.
Acknowledgment
The authors thank Maarten Boers, MD, PhD, from the VU University Medical Center, Amsterdam, the Netherlands, for his help and advice in creating the Bland-Altman plots.
Footnotes
Supported by the Dutch Rheumatism Association (Reumafonds).
Accepted for publication March 12, 2015.