Abstract
Magnetic resonance imaging (MRI) of the sacroiliac (SI) joints and the spine is increasingly important in the assessment of inflammatory activity and structural damage in clinical trials of patients with ankylosing spondylitis (AS). We investigated the inter-reader reliability and sensitivity to change of several scoring systems for assessing disease activity and change in disease activity in patients with AS. Twenty sets of consecutive MRI, derived from a randomized clinical trial comparing an active drug with placebo and selected on the basis of the presence of activity at baseline, were presented electronically to 7 experienced readers from different countries in Europe and Canada. Readers scored the MRI by 3 different methods: a global score (grading activity per SI joint); a more comprehensive global score (grading activity per SI joint per quadrant); and a detailed scoring system [Spondyloarthritis Research Consortium of Canada (SPARCC) scoring system], which scores 6 images, divided into quadrants, with additional scores for "depth" and "intensity." A fourth and a fifth scoring system were constructed afterwards. The fourth method consisted of the SPARCC score without the additional scores for "depth" and "intensity," and the fifth method consisted of the score of the SPARCC slice with the maximum score. Inter-reader reliability was investigated by calculating intraclass correlation coefficients (ICC) for all readers together and for all possible reader pairs. Sensitivity to change was investigated by calculating standardized response means (SRM) on change scores that were made positive. The overall inter-reader ICC per method ranged from 0.47 to 0.58 for status scores and from 0.40 to 0.53 for change scores. ICC for individual reader pairs fluctuated much more within each method, with the lowest observed values close to zero (no agreement) and the highest observed values over 0.80 (excellent agreement).
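As a rough illustration of the reliability analysis described above, the sketch below computes an ICC for all readers together and for every reader pair. The abstract does not state which ICC form was used; the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption made here for illustration. The function names `icc_2_1` and `pairwise_icc` and the toy scores are hypothetical and are not study data.

```python
from itertools import combinations


def icc_2_1(scores):
    """ICC(2,1) for a table of scores (assumed form, see lead-in).

    scores: list of rows, one row per MRI set, one column per reader.
    """
    n = len(scores)        # number of MRI sets (targets)
    k = len(scores[0])     # number of readers
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-target mean square
    msc = ss_cols / (k - 1)                 # between-reader mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def pairwise_icc(scores):
    """ICC(2,1) for every pair of readers, keyed by column indices."""
    k = len(scores[0])
    return {
        (a, b): icc_2_1([[row[a], row[b]] for row in scores])
        for a, b in combinations(range(k), 2)
    }


# Toy scores for 6 targets and 4 readers (illustrative only).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
overall = icc_2_1(toy)
pairs = pairwise_icc(toy)
```

With 7 readers, as in this exercise, `pairwise_icc` would return 21 pairwise coefficients per scoring method, which is where the wide spread between reader pairs would become visible.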
In general, agreement on status scores was somewhat better than agreement on change scores, and agreement with the comprehensive SPARCC scoring system was somewhat better than agreement with the more condensed systems. Sensitivity to change differed per reader, but in general was somewhat better for the comprehensive SPARCC system. This experiment, conducted under "real-life," far from optimal conditions, demonstrates the feasibility of scoring exercises for method comparison, provides evidence for the reliability and sensitivity to change of scoring systems to be used in assessing activity of the SI joints in clinical trials, and sets the conditions for further validation research in this field.
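The SRM used to express sensitivity to change is the mean of the change scores divided by their standard deviation. The minimal sketch below computes it for one reader and one scoring method. It assumes that "made positive" means change is taken as baseline minus follow-up, so that a decrease in activity counts as positive; the abstract does not spell this out. The function name and the example scores are hypothetical.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """SRM = mean(change) / SD(change) for paired scores.

    Change is taken as baseline minus follow-up (an assumption),
    so a drop in the activity score yields a positive change.
    """
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow_up must have equal length")
    changes = [b - f for b, f in zip(baseline, follow_up)]
    # stdev() is the sample standard deviation (n - 1 denominator).
    return mean(changes) / stdev(changes)


# Illustrative scores for four patients (not study data).
example_baseline = [10, 8, 12, 6]
example_follow_up = [6, 5, 7, 4]
srm = standardized_response_mean(example_baseline, example_follow_up)
```

Computing this value separately per reader and per scoring method would give the kind of reader-by-method comparison summarized above.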