Abstract
Objective. To determine the reproducibility of evaluation of sacroiliac joint (SIJ) radiographs among readers with varying levels of experience, and to identify potential drivers of disagreement in classification among 5 predefined radiographic lesion types.
Methods. The study sample consisted of 104 consecutive patients aged 18–40 years with low back pain of ≥ 3 months’ duration who met the Assessment of SpondyloArthritis international Society (ASAS) definition for a positive SIJ magnetic resonance image, or were HLA-B27–positive and had ≥ 1 spondyloarthritis (SpA)-related clinical/laboratory feature according to the ASAS classification criteria for axial SpA. Seven blinded readers (2 musculoskeletal radiologists, 5 rheumatologists) classified pelvic radiographs according to the modified New York criteria (mNY) and recorded presence/absence of 5 lesion types in both SIJ: erosion, sclerosis, ankylosis, joint space widening, and joint space narrowing. Reproducibility of mNY classification among 21 reader pairs was assessed and potential drivers of disagreement were identified among 5 lesion types. A generalized linear mixed logistic regression model served to analyze to what extent discordance in lesion type was associated with discrepant mNY classification.
Results. Mean κ values (percent concordance) were 0.39 (84.1%) for mNY classification over 21 reader pairs, 0.46 (79.8%) between 2 musculoskeletal radiologists, and 0.55 (86.5%) and 0.36 (77.9%) between the most experienced rheumatologist and the 2 radiologists. Erosion showed the lowest agreement (25%) among patients with discordant classification and gave the highest OR of 13.5 for disagreement.
Conclusion. Reproducibility of radiographic SIJ classification in an SpA inception cohort was only fair to at best moderate among 7 readers with varying levels of experience, questioning the applicability of mNY in early SpA.
Radiographic evaluation of sacroiliac joints (SIJ) according to the modified New York criteria (mNY)1 is the gold standard in the classification of axial spondyloarthritis (axSpA) and may affect treatment decisions in this chronic inflammatory condition. However, several studies2,3,4 have consistently shown limited agreement among trained readers in radiographic classification of SIJ, with κ values around 0.5. The limited reproducibility of SIJ evaluation on pelvic radiographs of patients suspected of having SpA was also featured at a public hearing of the US Food and Drug Administration5. Two interventional trials in patients with nonradiographic axSpA used the mNY, assessed by local rheumatology and radiology readers from different sites, as an inclusion criterion. A post hoc analysis by trained central readers resulted in the reclassification of 36% and 37% of the patients regarding fulfillment of the radiographic mNY6,7.
These concerns about low reliability of radiographic mNY were confirmed by a report highlighting at best moderate reproducibility of SIJ evaluation on pelvic radiographs by rheumatologist and radiologist readers, which even put the role of radiographic sacroiliitis for classification of axSpA into question3. However, possible data-driven explanations for the marked variability in interpretation of SIJ radiographs are scarce. We therefore hypothesized that certain radiographic lesion types contained in the radiographic mNY such as erosion, sclerosis, or joint space variation may contribute more to interreader disagreement than others.
The objectives of our study in an SpA inception cohort recruited from primary care were (1) to determine the reproducibility of radiographic SIJ classification according to the mNY among 7 rheumatology and radiology readers with varying levels of experience in imaging in SpA, and (2) to identify potential drivers of disagreement in classification among 5 predefined radiographic lesion types according to the mNY.
MATERIALS AND METHODS
Patients
Our study sample was recruited from the cohort Spines of Southern Denmark, which has been described in detail elsewhere8,9,10. Briefly, the cohort consisted of 1037 patients aged 18–40 years referred to the Spine Centre of Southern Denmark, Middelfart, for evaluation of low back pain of 2–12 months’ duration that was refractory to treatment in primary care.
All referred patients were screened according to a standardized protocol, which included a clinical visit, back pain questionnaires, laboratory testing [HLA-B27, high-sensitivity C-reactive protein (CRP)], and magnetic resonance imaging (MRI) of the SIJ and the entire spine. Patients with back pain of ≥ 3 months’ duration, who either fulfilled the Assessment of SpondyloArthritis international Society (ASAS) criteria for a positive SIJ MRI11 or were HLA-B27–positive with at least 1 concomitant clinical or laboratory feature suggestive of SpA according to ASAS classification criteria for axSpA12 were referred for clinical evaluation by 1 of 3 specialists in rheumatology (AGL, LHH, or OH). ASAS concomitant clinical or laboratory features suggestive of SpA were inflammatory back pain according to ASAS criteria13, arthritis, heel enthesitis, uveitis, dactylitis, psoriasis, inflammatory bowel disease, good response to nonsteroidal antiinflammatory drugs, family history of SpA, and elevated CRP.
Our study sample consisted of 104 patients in whom a diagnosis of axSpA was considered possible by the clinical rheumatologic assessment, and in whom pelvic radiographs of sufficient technical quality were available. Among the 104 patients, 92 met the ASAS criteria for a positive SIJ MRI and 12 were HLA-B27–positive showing ≥ 1 clinical or laboratory SpA feature. Eighty-one patients (77.9%) met the ASAS criteria for axSpA: 56 (53.8%) through the imaging arm only (MRI-only), 8 (7.7%) through the clinical arm only, and 17 (16.3%) through both arms. Twenty-three patients (22.1%) did not meet the ASAS criteria for axSpA: 19 (18.3%) with a positive SIJ MRI only and 4 (3.8%) being HLA-B27–positive with only 1 SpA feature.
The study was approved by the Danish Data Protection Agency and by the Ethics Committee of the Region of Southern Denmark (project ID S-20110029). All participating patients gave written informed consent.
Evaluation of SIJ radiographs
SIJ radiographs were obtained according to local protocols used in daily routine in 6 radiology departments in Denmark. Among the 104 SIJ radiographs, 88 (84.6%) were standard anteroposterior pelvic radiographs, 14 (13.5%) were radiographs of the lumbar spine including the SIJ, and 2 examinations (1.9%) consisted of oblique SIJ projections. All 104 digital SIJ radiographs were centrally anonymized and randomized. Seven readers (2 musculoskeletal radiologists, 5 rheumatologists) blinded to clinical, biochemical, and MRI data independently assessed the SIJ radiographs in random order on electronic workstations. First, the readers classified the SIJ radiographs according to the mNY, which was considered met if there was at least bilateral grade 2 or unilateral grade 3 sacroiliitis1. Second, the readers recorded the presence/absence of 5 radiographic lesion types in both SIJ as described in the mNY: erosion, sclerosis, ankylosis, joint space widening (JSW), and joint space narrowing (JSN). Erosion and sclerosis were recorded per 4 joint surfaces, i.e., on the sacral and the iliac side of the right and left SIJ, whereas ankylosis, JSW, and JSN were reported separately for the right and left SIJ. We followed the definitions of SIJ grades and radiographic lesion types as stated in the mNY1: grade 0 = normal, grade 1 = suspicious changes, grade 2 = minimal abnormality (small localized areas with erosion or sclerosis, without alteration in the joint width), grade 3 = unequivocal abnormality (moderate or advanced sacroiliitis with erosion, evidence of sclerosis, widening, narrowing, or partial ankylosis), and grade 4 = severe abnormality or total ankylosis. SIJ scores and radiographic lesions were entered into a standardized electronic data sheet identical to the one used during reader calibration.
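The classification rule described above can be written as a small function. The following Python sketch is illustrative only: the function name `meets_mny` and the encoding of SIJ grades as integers 0 to 4 are assumptions made here and are not part of the study protocol.

```python
def meets_mny(right_grade: int, left_grade: int) -> bool:
    """Radiographic mNY rule as stated in the text:
    at least bilateral grade 2, or at least unilateral grade 3."""
    for grade in (right_grade, left_grade):
        if grade not in (0, 1, 2, 3, 4):
            raise ValueError(f"SIJ grade must be 0-4, got {grade!r}")
    bilateral_grade_2 = min(right_grade, left_grade) >= 2
    unilateral_grade_3 = max(right_grade, left_grade) >= 3
    return bilateral_grade_2 or unilateral_grade_3


# Hypothetical gradings (right SIJ, left SIJ), for illustration only
for grades in [(2, 2), (2, 1), (3, 0), (1, 1)]:
    print(grades, meets_mny(*grades))
```

Under this rule a unilateral grade 2 reading, such as (2, 1), remains mNY-negative even when a reader has recorded erosion or sclerosis.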
Reader calibration
The 7 readers consisted of 2 senior musculoskeletal radiologists having more than 20 years each of experience in interpretation of pelvic radiographs (AGJ, SN), and of 3 senior and 2 junior staff rheumatologists from 1 institution (King Christian 10th Hospital for Rheumatic Diseases, Gråsten, Denmark). The 2 radiologists came from different institutions and were not involved previously in shared imaging research. One of the rheumatologist readers (UW), who had more than 10 years of research experience in conventional and tomographic imaging in SpA, was responsible for calibration of the reader team.
All 7 readers were calibrated using reference images of pelvic radiographs covering all mNY grades. The reference images were derived from clinical practice in patients with various stages of SpA to best match the original grading description, which lacks standardized and validated lesion definitions. The definitions of the 5 grades were adopted from the original description of the mNY1, which was based on the Atlas of Standard Radiographs in Arthritis14. Because of their longstanding experience in scoring SIJ on pelvic radiographs, the 2 musculoskeletal radiologists did not participate in the additional calibration for the rheumatologists. The 5 rheumatologists had three 2-h calibration sessions and independently performed a training readout. The first session consisted of an introduction to the scoring method, a review of the relevant literature, and a group discussion of 10 pelvic radiographs. This was followed by an independent evaluation of 15 pelvic radiographs by each rheumatologist according to the same scientific protocol that was later used in the main study. SIJ scores and radiographic lesion types reported in this training readout were evaluated in a second calibration session. A third calibration session with group discussion of another 10 pelvic radiographs served to refine the reference image set. All pelvic radiographs used in the training sessions were unrelated to the main study.
Descriptive analysis
Categorical demographic, clinical, and laboratory variables were described as proportion of subjects showing these features, and continuous variables as median [interquartile range (IQR)]. We expressed the presence of single radiographic features and fulfillment of the mNY as mean proportion of study subjects over 7 readers, and as mean proportions stratified according to level of reader experience. To determine the frequency of advanced sacroiliitis in our sample, we calculated the proportion of study subjects showing SIJ scores > 2 in the right and left SIJ separately. Presence of erosion and sclerosis was defined as ≥ 1 lesion on ≥ 1 of the 4 joint surfaces on both sides, while ankylosis, JSW, and JSN were defined as ≥ 1 lesion in ≥ 1 of the 2 joints, respectively. The frequency of the 5 radiographic lesions was calculated as mean proportion of patients having each lesion type over 7 readers, and as proportion of each lesion type among mNY-positive and mNY-negative study subjects for all 7 readers individually. Finally, we calculated the frequency of ≥ 2 concomitant lesion types per patient.
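The lesion definitions above reduce per-surface and per-joint recordings to one presence flag per lesion type and patient, which is then averaged over readers. A minimal Python sketch of this aggregation follows; the data layout (a dictionary per reading with 4 flags for erosion and sclerosis and 2 flags for ankylosis, JSW, and JSN), the function names, and the toy readings are assumptions for illustration and do not reproduce the study data sheet.

```python
from statistics import mean

LESIONS = ("erosion", "sclerosis", "ankylosis", "JSW", "JSN")


def patient_lesions(reading):
    """Collapse per-surface (4 flags) or per-joint (2 flags) recordings
    into one presence flag per lesion type: present if >= 1 flag is set."""
    return {lesion: any(reading[lesion]) for lesion in LESIONS}


def mean_proportion(readings_by_reader, lesion):
    """Mean over readers of the proportion of patients showing the lesion."""
    per_reader = []
    for readings in readings_by_reader.values():
        flags = [patient_lesions(r)[lesion] for r in readings]
        per_reader.append(sum(flags) / len(flags))
    return mean(per_reader)


def has_concomitant_lesions(reading, minimum=2):
    """True if at least `minimum` different lesion types are present."""
    return sum(patient_lesions(reading).values()) >= minimum


# Toy data: 2 hypothetical readers, 2 hypothetical patients
clear = {"erosion": [0, 0, 0, 0], "sclerosis": [0, 0, 0, 0],
         "ankylosis": [0, 0], "JSW": [0, 0], "JSN": [0, 0]}
toy = {
    "reader_A": [dict(clear, erosion=[1, 0, 0, 0], sclerosis=[1, 1, 0, 0]), clear],
    "reader_B": [dict(clear, sclerosis=[0, 1, 0, 0]), clear],
}
print(mean_proportion(toy, "erosion"))      # 0.25
print(mean_proportion(toy, "sclerosis"))    # 0.5
print(has_concomitant_lesions(toy["reader_A"][0]))  # True
```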
Interreader agreement
Interreader agreement for classification according to the mNY and for the 5 radiographic features was assessed by means of 2 × 2 tables, calculating percent agreement (total; positive/negative) and Cohen’s κ15. Interreader agreement for the ordinal SIJ grades for both sides separately was evaluated by weighted Cohen’s κ. Agreement was interpreted according to Landis and Koch16 as slight (κ < 0.2), fair (0.2 ≤ κ < 0.4), moderate (0.4 ≤ κ < 0.6), substantial (0.6 ≤ κ < 0.8), and almost perfect (0.8 ≤ κ ≤ 1.0). The computations were made for each reader pair and for all readers jointly as mean value over all 21 reader pairs. For the pairwise κ values, a bootstrap CI based on 1000 bootstrap replications and computed at a confidence level of 95% was provided. We additionally compared 5 selected reader pairs regarding agreement: the 2 musculoskeletal radiologists, the most experienced rheumatologist versus each of the 2 musculoskeletal radiologists, the 2 senior rheumatologists, and the 2 junior rheumatologists. The proportion of concordant single grades according to mNY among ≥ 2 readers (any reader pair) and ≥ 4 readers (majority of readers) was described for the right and left SIJ separately.
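The agreement measures described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study (which was run in R): the function names are hypothetical, ratings are assumed to be coded 0/1 in the same patient order for every reader, and the bootstrap interval uses the simple percentile method, which the text does not specify.

```python
import random
from itertools import combinations
from statistics import mean


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equally long lists of 0/1 ratings."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                       # positive rates
    p_e = pa * pb + (1 - pa) * (1 - pb)                   # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")


def landis_koch(kappa):
    """Verbal category using the cutoffs quoted in the text."""
    for upper, label in ((0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")):
        if kappa < upper:
            return label
    return "almost perfect"


def mean_pairwise_kappa(ratings):
    """ratings: dict reader -> list of 0/1. Returns (mean kappa, number of pairs)."""
    pairs = list(combinations(ratings, 2))
    return mean(cohen_kappa(ratings[r1], ratings[r2]) for r1, r2 in pairs), len(pairs)


def bootstrap_ci(a, b, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for kappa, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k == k:  # skip replicates where kappa is undefined (NaN)
            stats.append(k)
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[int((1 + level) / 2 * len(stats)) - 1]
    return lower, upper


# Toy ratings for 10 hypothetical patients by 2 hypothetical readers
a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
k = cohen_kappa(a, b)
print(round(k, 3), landis_koch(k))
print(bootstrap_ci(a, b))
```

With 7 readers, `combinations` yields the 21 reader pairs over which the mean κ is taken.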
Candidate lesion types driving discrepancies in mNY classification
To assess the relative contribution of each of the 5 lesion types to disagreement in mNY classification, we first identified patients with discrepant mNY classification for each reader pair. Among these, we computed the proportion of patients with concordance for each radiographic lesion type for all reader pairs.
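For a single reader pair, the two steps above (select patients with discrepant mNY classification, then compute the proportion with concordant lesion recording) can be sketched as follows. The function name and the toy vectors are hypothetical; ratings are assumed to be coded 0/1 per patient.

```python
def lesion_concordance_among_discrepant(mny_a, mny_b, lesion_a, lesion_b):
    """Among patients on whom two readers disagree about mNY,
    return the proportion on whom they agree about one lesion type.
    Returns None if the pair has no mNY-discrepant patients."""
    discrepant = [i for i, (x, y) in enumerate(zip(mny_a, mny_b)) if x != y]
    if not discrepant:
        return None
    concordant = sum(lesion_a[i] == lesion_b[i] for i in discrepant)
    return concordant / len(discrepant)


# Toy data for 6 hypothetical patients read by 2 hypothetical readers
mny_a = [1, 0, 0, 1, 0, 1]
mny_b = [0, 0, 1, 1, 0, 0]
erosion_a = [1, 0, 0, 1, 0, 1]
erosion_b = [0, 0, 0, 1, 1, 0]
print(lesion_concordance_among_discrepant(mny_a, mny_b, erosion_a, erosion_b))
```

A low value for a given lesion type indicates that readers who disagree on mNY also tend to disagree on that lesion, which is what marks it as a candidate driver of discordance.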
Finally, a generalized linear mixed logistic regression model was computed to estimate the relative effect size of each individual radiographic lesion type. Results were expressed as OR for disagreement in mNY classification with 95% CI. P values ≤ 0.05 were considered significant.
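To make the meaning of an OR for disagreement concrete, the sketch below computes a crude OR with a Wald-type 95% CI from a single 2 × 2 table of lesion discordance versus mNY discordance. This is a deliberate simplification and not the mixed model used in the study: it ignores the random effects that account for repeated readings and adjusts for no other lesion types. The function name and the counts are invented for illustration.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio with Wald CI from a 2 x 2 table.
    a: lesion discordant and mNY discordant
    b: lesion discordant and mNY concordant
    c: lesion concordant and mNY discordant
    d: lesion concordant and mNY concordant"""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts, not study data
or_, lower, upper = odds_ratio_wald(12, 8, 6, 40)
print(or_, lower, upper)
```

Because the interval is built on the log scale, the point estimate lies at the geometric mean of the two CI limits.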
All computations were done with R (R Core Team, version 3.1.1).
RESULTS
Descriptive analysis
Of the 104 patients, 38.5% were men and 33.7% were HLA-B27–positive (Table 1). Median age was 33.0 years. Over all 7 readers, a mean proportion of 15.7% of the patients met the mNY, and 8.1% showed mNY grades 3 or 4 (Table 1). Sclerosis and erosion were the 2 most frequent lesions reported in 50.1% and in 25.7% of the patients, respectively. The 3 more experienced readers scored more lesions of all types than the 4 less experienced readers, and they also considered more patients to be mNY-positive (21.5% vs 11.3%). Patients with erosion concomitantly showed sclerosis in 93.5%, JSN in 48.5%, JSW in 27.8%, and ankylosis in 19.5%. The distribution of the 5 lesion types among mNY-positive and -negative patients for all 7 readers individually is shown in Figure 1. Among the 5 radiographic lesion types, erosion and sclerosis showed the largest variation between individual readers. The most frequent constellation when reporting erosion in mNY-negative patients was unilateral grade 2 sacroiliitis (data not shown). Both more and less experienced readers reported joint space alterations in a small minority of subjects classified as mNY-negative.
Interreader agreement
Mean κ (percent agreement) for mNY classification was 0.39 (84.1%) over all 21 pairs of the 7 readers, 0.46 (79.8%) between the 2 musculoskeletal radiologists, and 0.55 (86.5%) and 0.36 (77.9%) between the most experienced rheumatologist and each of the 2 musculoskeletal radiologists, respectively (Table 2). Among the rheumatologists less experienced in radiographic SIJ assessment, agreement between the 2 senior and between the 2 junior rheumatologists was 0.34 (84.6%) and 0.27 (87.5%), respectively. Among the 5 radiographic lesion types, ankylosis showed the highest (κ 0.34) and JSW the lowest agreement (κ 0.12) over 21 reader pairs. Reliability among all 21 reader pairs for standard pelvic versus lumbar spine radiographs was in the agreement category “fair” as defined above with mean κ values of 0.39 and 0.33, respectively.
Candidate lesion types driving discrepancies in mNY classification
Averaged over the 21 reader pairs, 15.9% of the patients had discrepant mNY classification. Among patients with discordant mNY classification, erosion was the lesion with the lowest interreader agreement: the proportion of mNY-discrepant patients with concordance for erosion was only 0.25 (IQR 0.09; Figure 2). The assessment of the effect size of each of the 5 lesion types showed that erosion was the strongest driver of discordance in mNY classification. Disagreement on erosion was associated with statistically significantly higher odds of discrepant mNY classification (OR 13.5, 95% CI 9.1–20.1; Table 3).
Figure 3 shows a pelvic radiograph in which 4 of 7 readers considered the mNY as being met and 5 of 7 readers scored erosion.
DISCUSSION
Our study on the reliability of radiographic SIJ classification according to the mNY in an SpA inception cohort suggests that SIJ erosion may be the primary driver of interreader disagreement. The only fair to at best moderate level of concordance for mNY (κ 0.39) among 7 radiology and rheumatology readers with varying experience in imaging in SpA was even lower than the only moderate agreement (κ 0.54) reported between 2 central readers in another axSpA inception cohort3.
The limited reproducibility of radiographic SIJ classification according to the mNY is well documented. However, the characteristics of a given study sample may affect the level of interreader agreement. Previous reports suggest that the higher the proportion of patients with ankylosing spondylitis (AS) in a given study sample, the better the concordance in radiographic mNY. A study from Turkey applying the radiographic mNY in patients with Behçet disease recorded pre/post-training κ agreement of 0.32/0.19, 0.32/0.36, and 0.44/0.41 for 3 reader pairs (1 radiologist and 2 rheumatologists, respectively)2. Our study with 15.7% mNY-positive patients showed a κ concordance of 0.39 among 7 radiology and rheumatology readers. In a report on patients with inflammatory back pain suggestive of axSpA, 21.1%/26.6% (central/local reading) had obvious sacroiliitis; concordance for radiographic mNY by κ values was 0.54 among 2 trained central rheumatologist readers and 0.55 for central versus local radiologist and rheumatologist reading3. A study assessing the rate of radiographic sacroiliitis progression over 2 years included a high proportion (54.8%) of patients with AS4. Interreader agreement between 2 trained rheumatology readers blinded to time sequence was moderate at baseline with a κ value of 0.57, but increased to a substantial κ of 0.67 at followup together with a progression rate from nonradiographic SpA to AS of 11.6% over 2 years. The highest interobserver concordance was reported in a study of 217 patients with AS who all met the mNY17. Kappa values between 2 trained rheumatologist readers were 0.68, 0.69, and 0.66 at baseline, 1-year followup, and 2-year followup, respectively.
Our agreement for classification according to the radiographic mNY (κ 0.39) was higher than for any single one of the 5 radiographic lesion types (κ values 0.12–0.34). This is in line with the above-mentioned report on inflammatory back pain patients, in which κ values for the single lesion types ranged between 0.12 and 0.44 as opposed to an agreement of 0.54 for mNY3.
A potential source of disagreement is the lack of standardized and validated definitions for each of the radiographic lesion types contained in the original description1,14, which were used in our study. However, it remains to be shown whether an attempt to standardize and validate lesion definitions might facilitate agreement in view of the broad morphologic spectrum of the radiographic lesion types.
All lesions except sclerosis contributed to discordant mNY classification, but erosion was the main driver of disagreement. Technical issues such as bowel overlapping the SIJ or various radiographic SIJ projections can only partially explain this finding because they also affect recognition of other radiographic lesions such as sclerosis or joint space variation. Our results need to be confirmed in other cohorts of patients with clinically suspected axSpA because erosion is widely regarded as a key lesion indicating radiographic sacroiliitis.
Our SpA inception cohort recruited from primary care with low back pain of ≥ 3 months’ duration showed a low frequency of HLA-B27 positivity and male sex. Multiple studies in other early axSpA cohorts have shown lower proportions of male sex and HLA-B27 positivity when compared with AS18,19,20,21,22. However, these cohorts were not or not entirely based on recruitment from primary care, and usually excluded patients with just suspected SpA, which may explain their higher prevalence of male sex and HLA-B27 positivity when compared to ours. A Dutch cohort of patients with suspected axSpA similar to ours and also recruited from primary care23 showed an even lower proportion of HLA-B27 positivity of 20% among patients meeting the ASAS criteria for axSpA. Our cohort reflects daily routine in which young patients with treatment-refractory back pain referred from primary care with suspected early SpA often need to be followed over time before a final diagnosis can be made. Moreover, pelvic radiographs are often performed as 1 element of the rheumatologic evaluation in such a clinical setting of suspected early SpA, despite the limited evidence of whether they may enhance confidence in a diagnosis of early SpA.
The mNY, which were derived from a cohort of 183 HLA-B27–positive patients with AS, their HLA-B27–positive or –negative first-degree relatives, and population controls1, may not be directly applicable to chronic back pain patients clinically suspected of having axSpA. Further, there are no normative data regarding frequency and morphology of the 5 radiographic mNY lesion types in healthy controls, mechanical back pain patients, subjects with increased physical activity, or multiparous women. A back pain cohort from chiropractic practices in Canada with a recruitment mode similar to ours but with older patients showed degenerative SIJ changes in 35.2% of 142 women aged 18–60 years, which might be a factor leading to reader disagreement in low grade sacroiliitis in women24. In our study, sclerosis was the most frequently reported lesion type by all readers among patients classified as not meeting the radiographic mNY.
A Dutch report on radiographic assessment of sacroiliitis by 100 rheumatologists and 23 radiologists showed only modest sensitivity and specificity for sacroiliitis and sizable intraobserver variation25. Reevaluation of the same image set 3–6 months later, after individual training and workshops, did not show improved performance. However, unlike in our study, no pairwise analysis among all possible reader pairs was performed; instead, the scores of an expert panel (2 rheumatologists, 1 epidemiologist, and 1 radiologist) served as gold standard. Future studies with pairwise analysis of all possible reader pairs involving both radiologists and rheumatologists are needed to determine whether training and calibration in recognition of various radiographic SIJ lesion types, especially erosion, might improve agreement in classification according to the radiographic mNY.
Our SIJ radiographs were acquired according to local protocols in 6 radiology centers resulting in 84.6% standard anteroposterior pelvic radiographs, the remaining being lumbar spine radiographs including the SIJ and 2 oblique SIJ projections. The different visualizations of the SIJ might have had an effect on reproducibility. The lack of a full calibration of the 2 musculoskeletal radiologists may have affected interreader agreement as well. However, both limitations regarding imaging protocols and reader calibration reflect the conditions in daily routine. Another potential limitation is that κ statistics inherently perform less well in cases of skewed distribution of the variables under observation26,27,28,29, as with our relatively low prevalence of mNY grades 3–4 of only 8.0%.
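The dependence of κ on prevalence noted above can be illustrated numerically. The sketch below holds the observed agreement fixed and varies the proportion of positive ratings; it assumes, purely for illustration, that both readers of a pair have the same positive rate, which was not the case for individual readers in our study. The function name is hypothetical.

```python
def kappa_from_marginals(p_observed, prev_a, prev_b):
    """Cohen's kappa for a binary rating, given the observed agreement
    and each reader's proportion of positive ratings."""
    p_chance = prev_a * prev_b + (1 - prev_a) * (1 - prev_b)
    return (p_observed - p_chance) / (1 - p_chance)


# Same observed agreement (84.1%), two different prevalence scenarios
skewed = kappa_from_marginals(0.841, 0.157, 0.157)    # low prevalence
balanced = kappa_from_marginals(0.841, 0.5, 0.5)      # balanced prevalence
print(round(skewed, 3), round(balanced, 3))  # 0.399 0.682
```

With identical percent agreement, the skewed scenario yields a κ in the fair range while the balanced scenario yields a substantial κ, because chance agreement on the frequent negative category is much higher when positives are rare.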
Reproducibility of SIJ classification according to the mNY in an SpA inception cohort was only fair to at best moderate among 7 radiology and rheumatology readers with varying experience in imaging in SpA. Erosion was the main driver of discordant classification. These findings question the applicability of the radiographic mNY in back pain patients clinically suspected of having early axSpA, particularly in healthcare settings where access to SIJ MRI is readily available.
Acknowledgment
The authors thank Laila Dungart and Henning Jakobsen from the Radiology Department at King Christian 10th Hospital for Rheumatic Diseases, Gråsten, Denmark, for anonymization and randomization of the pelvic radiographs; Lone Holm Hansen (LHH) for clinical evaluation of patients at Hospital Lillebaelt, Vejle, Denmark; Charlotte Drachmann and Lis Schubert at King Christian 10th Hospital for Rheumatic Diseases, Gråsten, Denmark, for high-sensitivity C-reactive protein and HLA-B27 analysis; Tue Secher Jensen at the Spine Centre of Southern Denmark, Denmark, for his role in the conception and design of the Spines of Southern Denmark Cohort; and the radiologic departments at these Danish hospitals for kindly providing the radiographs used in this study: Hospital Lillebaelt, Vejle; Odense University Hospital; Odense University Hospital at Svendborg Hospital; Hospital South West Jutland; Hospital of Nykøbing Falster; and King Christian 10th Hospital for Rheumatic Diseases, Gråsten.
Footnotes
Dr. Rufibach is founder and owner of Rufibach rePROstat and is an employee of F. Hoffmann-La Roche, Basel, Switzerland. The Hospital of Southern Jutland, University of Southern Denmark, Hospital Lillebaelt, Vejle, and Knud og Edith Eriksens Mindefond funded Dr. Christiansen’s salary during the course of a PhD program, including this study.
- Accepted for publication August 31, 2016.