Abstract
OBJECTIVE: Magnetic resonance imaging (MRI) of the spine is increasingly important in the assessment of inflammatory activity in clinical trials of patients with ankylosing spondylitis (AS). We investigated the feasibility, inter-reader reliability, sensitivity to change, and discriminatory ability of 3 different scoring methods for MRI activity and change in activity of the spine in patients with AS. METHODS: Thirty sets of spinal MRI at baseline and after 24 weeks of follow-up, derived from a randomized clinical trial comparing a tumor necrosis factor (TNF)-blocking drug (n = 20) with placebo (n = 10) and selected to cover a wide range of activity at baseline and change in activity, were presented electronically in a partial Latin-square design to 9 experienced readers from different countries (Europe, Canada). Readers scored each set of MRI 3 times, using 3 different methods: the Ankylosing Spondylitis spine Magnetic Resonance Imaging-activity [ASspiMRI-a, grading activity (0-6) per vertebral unit in 23 units]; the Berlin modification of the ASspiMRI-a; and the Spondyloarthritis Research Consortium of Canada (SPARCC) scoring system, which scores the 6 vertebral units considered by the reader as the most abnormal, with additional scores for "depth" and "intensity." Both the order of the methods used by each reader and the timepoints (before/after treatment) were randomized. Feasibility of each scoring system was evaluated by measuring the mean time needed to score each set of MRI, and inter-reader reliability was evaluated by smallest detectable change (SDC) and by intraclass correlation coefficients (ICC) for all readers together and for all possible reader pairs separately. Sensitivity to change was investigated by calculating Guyatt's effect size on change scores. Discriminatory ability was assessed using Z-scores (Mann-Whitney test) comparing change in score between patients treated with TNF-blocking drug and placebo.
RESULTS: The mean time to score one set of MRI was shortest for the Berlin method. SDC was lowest for the Berlin method and highest for SPARCC. Overall inter-reader ICC per method were between 0.49 and 0.77 for scoring activity status, and between 0.46 and 0.72 for scoring activity change. ICC for all possible reader pairs showed much more fluctuation per method, with lowest observed values of about 0.05 (very low agreement) and highest observed values over 0.90 (excellent agreement). In general, ICC for SPARCC were consistently higher than for the other systems. Sensitivity to change differed per reader, and was more consistent with SPARCC than with the other methods, but was in general excellent for all 3 methods. Discrimination between groups (TNF-blocker vs placebo) assessed by Z-scores was good and comparable among methods. CONCLUSION: This experiment demonstrates the feasibility of multiple-reader MRI scoring exercises for method comparison, and provides evidence for the feasibility, reliability, sensitivity to change, and discriminatory capacity of all 3 tested scoring systems for assessing spinal activity on MRI in patients with AS in clinical trials. On the basis of these results, it is not possible to prioritize one of the 3 methods.
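
The two group-level statistics named in the methods, Guyatt's effect size and the Mann-Whitney Z-score, can be illustrated with a short Python sketch. This is not the study's analysis code. It assumes the commonly used definition of Guyatt's effect size (mean change in the treated group divided by the standard deviation of change in the placebo group) and the normal approximation to the Mann-Whitney U statistic without tie or continuity correction. The function names and the change scores are hypothetical and are not data from the trial.

```python
from math import sqrt
from statistics import mean, stdev


def guyatt_effect_size(treated_changes, placebo_changes):
    """Mean change under treatment divided by the SD of change under placebo.

    Assumed definition; the abstract does not spell out the formula.
    """
    return mean(treated_changes) / stdev(placebo_changes)


def mann_whitney_z(group_a, group_b):
    """Z-score from the normal approximation to the Mann-Whitney U statistic.

    Simplification: no tie correction and no continuity correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    # U counts the pairs in which a exceeds b; tied pairs count as one half.
    u = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return (u - mu) / sigma


# Hypothetical change scores (week 24 minus baseline) from a single reader.
treated = [-8.0, -6.5, -7.0, -5.0, -9.5]
placebo = [-1.0, 0.5, 1.0, -0.5, 0.0]

effect_size = guyatt_effect_size(treated, placebo)
z_score = mann_whitney_z(treated, placebo)
print(f"Guyatt's effect size: {effect_size:.2f}")
print(f"Mann-Whitney Z: {z_score:.2f}")
```

In a multi-reader exercise such as this one, both statistics would presumably be computed per reader and per scoring method, so that the spread across readers can be compared between methods.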