Abstract
Objective. Reliable assessment of spinal and sacroiliac joint (SIJ) inflammation on magnetic resonance imaging (MRI) is difficult. We developed 2 Web-based training modules for scoring inflammation by MRI in the spine and SIJ using the SPARCC method. These provide explicit details on methodology and define the parameters of abnormalities scored in the spine and SIJ. Our objective was to assess the influence of rigorous standardization of methodology offered by Web-based training modules on the reliability of SPARCC scores for SIJ and spinal inflammation.
Methods. We studied 32 patients randomized 1:1 to either anti-tumor necrosis factor–α (anti–TNF–α) therapy or placebo for 12 weeks, with MRI examination of the SIJ and spine being conducted at baseline and 12 weeks. MRI scans (as described at www.arthritisdoctor.ca) were assessed blinded to timepoint and treatment allocation by 3 readers who had no prior experience scoring inflammation by MRI and 2 experienced SPARCC readers. The first readings by the inexperienced readers were conducted after verbal instructions on the scoring method. The second readings were conducted after formal training using the Web-based training modules. Interreader reliability was compared before and after training using the 2 SPARCC readers as “gold standard” comparators.
Results. After training, a consistent improvement in reproducibility was observed, which was particularly evident for SIJ inflammation and for change scores. After completion of the training modules the inexperienced readers scored to a similar level of reproducibility as the 2 SPARCC readers.
Conclusion. Systematic evaluation of SIJ and spinal inflammation by MRI can be significantly improved using Web-based training modules.
- SPONDYLOARTHRITIS
- MAGNETIC RESONANCE IMAGING
- INFLAMMATION
- WEB-BASED TRAINING MODULE
- SACROILIAC JOINTS
- SPINE
Magnetic resonance imaging (MRI) is now established as the preferred imaging modality for the detection and evaluation of active inflammatory lesions in the spine and sacroiliac joints (SIJ) of patients with spondyloarthritis (SpA). MRI of active inflammatory lesions is of value not only for diagnostic purposes but also for the assessment of therapeutic agents that alleviate inflammation in SpA. In particular, the administration of anti-tumor necrosis factor–α (anti-TNF–α) therapies has been shown to lead to rapid amelioration of active inflammatory lesions on MRI in patients with SpA1,2. This has led to the development of scoring methodologies for active inflammatory lesions in the spine and SIJ that are principally directed at the clinical trial evaluation of new therapies but may also be used in observational studies.
Two scoring methodologies have been described for evaluation of active inflammatory lesions in the spine. Both are based on assessment of a discovertebral unit (DVU), which represents the region between 2 imaginary lines drawn through the middle of 2 adjacent vertebrae. The first method, the ASspiMRI-a (AS spinal MRI activity) index, scores the severity of bone edema and erosions at each DVU according to a scoring scheme of 0 to 6, with higher values assigned to the presence of erosions3. The score for edema is based on the total area involved in the DVU according to a < 25%, 25%–50%, and > 50% grading scheme. The range of scores is 0–138. An adaptation of this method, the Berlin method4, does not score erosions but otherwise is the same as ASspiMRI-a, and its scoring range is 0–69. For both methods all 23 DVU in the spine are scored in a single sagittal plane of view. The ASspiMRI-a method has been shown to be reproducible and to discriminate between treatment groups in trials of anti–TNF therapies3. However, reliability of the ASspiMRI-a method was rather poor in a multireader exercise, although the Berlin method was better4. The second method, the Spondyloarthritis Research Consortium of Canada (SPARCC) MRI spinal inflammation index, evaluates the extent of the acute lesion in 3 dimensions, which is appropriate because inflammatory lesions are often asymmetrical5. This method has been shown to have excellent reproducibility and to be highly discriminatory between anti–TNF–α treatments and placebo so that only small numbers of patients (20–30) are required for proof-of-concept studies1. Moreover, a multireader exercise confirmed that this method demonstrated the most consistent reproducibility between different reader pairs, which included readers with no prior experience with application of this method4. 
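The score ranges quoted above follow from the per-DVU maxima multiplied by the 23 DVU. As a quick check (the per-DVU maximum of 3 for the Berlin method is inferred here from the stated 0–69 range and the three-level edema grading; it is not stated explicitly in the text):

```latex
\[
\text{ASspiMRI-a: } 23 \times 6 = 138,
\qquad
\text{Berlin: } 23 \times 3 = 69 .
\]
```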
Although additional methods for scoring acute lesions have been cited in the literature, none has been described in adequate detail or sufficiently validated to enable widespread use.
As for the spine, the SPARCC SIJ scoring method evaluates acute lesions in 3 dimensions and has undergone validation for both reproducibility and discrimination1,6. In addition to being highly discriminatory in placebo-controlled trials of anti–TNF–α therapy, it has been shown to be consistently reproducible in a multireader exercise7.
An essential prerequisite for more widespread implementation of any outcome measure is development of a validated mechanism for knowledge translation. Where expertise for a particular outcome measure is limited, direct and widespread knowledge transfer using traditional formats such as seminars is not feasible. Web-based learning has been implemented in various fields and has the potential to reach a wide audience but has rarely been the subject of formal validation. We describe the development and validation of Web-based training modules for the quantitative evaluation of acute lesions in the spine and SIJ according to the SPARCC method.
METHODS
Development of a Web-based training module for the quantitative assessment of acute lesions in the SIJ
The SIJ training module was developed by 2 radiologists and a rheumatologist, who are the original developers of the SPARCC SIJ scoring method for acute lesions. The module focuses on evaluation of SIJ in the semicoronal orientation as described in the SPARCC method along the following steps:
- A description of the normal anatomy of the SIJ as visualized on both short-tau inversion recovery (STIR) and T1-weighted sequences. A point of emphasis is that MRI scans are difficult to reproduce with identical orientation of the joints, and landmarks are variable.
- Basic imaging requirements. In particular, the simultaneous availability of T1-weighted sequences greatly enhances the interpretation of SIJ anatomy.
- Defining the rules for dividing each SIJ into quadrants, which constitute the basic unit for the quantification of active inflammatory lesions (Figure 1). It is recommended that SIJ semicoronal images are read in a consistent manner from anterior to posterior, with the first anterior slice being the slice where there is at least 1 cm of vertical height in at least one SIJ (Figure 2). A visible joint of less than 3 cm vertical height is defined as having only upper sacral and upper iliac quadrants. A visible joint of 3 cm or more of vertical height is defined as having 4 quadrants divided at the midpoint into equal upper and lower sacral and iliac quadrants (Figure 3). At the posterior aspect of the SIJ there is a natural division of the joint into upper and lower quadrants by intervening fat and fibrous tissue (Figure 4). When less than 1 cm of a quadrant is visible, it is no longer scored (Figure 5).
- Defining the reference for scoring of the abnormal increase in bone marrow signal on the STIR sequence that indicates an acute lesion. The primary reference for normal bone marrow signal intensity is the center of the sacrum, at the same craniocaudal level as the bone marrow being assessed. This site is most likely to be normal and is less subject to fatty change or inflammation.
- The approach to scoring bone marrow edema. Each quadrant is scored dichotomously for the presence of bone edema (present/absent). For each semicoronal slice the scoring range for quadrants is therefore 0–8. Depth and intensity of bone edema are scored for each joint as a whole, so that the scoring range per semicoronal slice is 0–2 for each feature. The total scoring range per semicoronal slice is therefore 0–12. Scores are entered into a customized Web-based software scoring program that displays a diagram of the joint quadrants in a semicoronal slice (Figure 6). Brightness of “intense bone marrow edema” is defined as being comparable to, or greater than, the appearance of blood in presacral veins. “Depth” of bone marrow edema is defined as positive when 1 cm or more of continuous increase in STIR signal extends in a horizontal direction away from the joint space. The SPARCC SIJ score was originally developed to score acute lesions in 6 consecutive semicoronal slices since, in general, this number of slices permits detailed evaluation of the synovial portion of the SIJ. The maximum score over 6 slices is then 72. Using the approach outlined in this module, additional slices may be evaluated should the reader wish to score all slices incorporating the joint.
- Technical considerations aimed at optimizing image display and viewing conditions and at enhancing the detection and recording of lesions.
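The scoring arithmetic described in these steps can be sketched in a few lines of Python. This is an illustrative sketch only, not the Web-based scoring program used in the study: the names `JointSlice`, `quadrants_for_joint`, `slice_score`, and `total_score` are hypothetical, and the reader's visual judgments (edema present, deep, intense) are assumed to be supplied as booleans.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical quadrant labels for one sacroiliac joint on one slice.
QUADRANTS: Tuple[str, ...] = (
    "upper_iliac", "upper_sacral", "lower_iliac", "lower_sacral",
)


def quadrants_for_joint(height_cm: float) -> Tuple[str, ...]:
    """Quadrants defined for a joint of the given visible vertical height.

    Under 3 cm only the upper iliac and upper sacral quadrants are defined;
    at 3 cm or more the joint is divided into all 4 quadrants.
    """
    if height_cm < 3.0:
        return ("upper_iliac", "upper_sacral")
    return QUADRANTS


@dataclass
class JointSlice:
    """Reader findings for one joint (left or right) on one semicoronal slice."""
    edema: Dict[str, bool] = field(default_factory=dict)  # quadrant -> present
    deep: bool = False     # >= 1 cm of continuous signal away from the joint space
    intense: bool = False  # comparable to, or brighter than, presacral veins

    def score(self) -> int:
        # One point per quadrant with edema (0-4), plus one point each for
        # depth and intensity, which are scored for the joint as a whole.
        quadrant_points = sum(1 for q in QUADRANTS if self.edema.get(q, False))
        return quadrant_points + int(self.deep) + int(self.intense)


def slice_score(left: JointSlice, right: JointSlice) -> int:
    """Score for one semicoronal slice (range 0-12)."""
    return left.score() + right.score()


def total_score(slices: List[Tuple[JointSlice, JointSlice]]) -> int:
    """Sum over the scored slices (range 0-72 for 6 slices)."""
    return sum(slice_score(left, right) for left, right in slices)
```

With every quadrant positive and both features positive in both joints, one slice scores 12 and six such slices score 72, matching the ranges given above.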
Development of a Web-based training module for the quantitative assessment of acute lesions in the spine
The spine training module was developed by a rheumatologist and a radiologist, who are the original developers of the SPARCC spine scoring method for acute lesions. The module focuses on evaluation of the spine in the sagittal orientation as described in the SPARCC method in the following steps:
- A description of the normal anatomy of the spine as visualized on both STIR and T1-weighted sagittal sequences. A point of emphasis is that the anatomy of the posterior and lateral spine is more complex due to overlapping structures.
- Basic imaging requirements. In particular, the simultaneous availability of T1-weighted sequences greatly enhances the interpretation of spinal anatomy.
- Defining the reference for scoring abnormal increase in bone marrow signal on the STIR sequence that indicates an acute lesion. The primary reference for normal bone marrow signal intensity is the center of the vertebral body (Figure 7).
- Defining the basic unit for scoring acute lesions at each spinal level. This is the discovertebral unit, which is defined by the region between 2 imaginary lines through the midpoint of adjacent vertebrae and includes the disc, vertebral endplates, and adjacent bone marrow (Figure 8).
- The approach to scoring bone marrow edema on sagittal slices of the spine. Each DVU is divided into quadrants and each quadrant is scored dichotomously for the presence of bone edema (present/absent). This is repeated in 3 consecutive sagittal slices, giving a maximum score of 12. Depth and intensity of bone edema are scored for each DVU as a whole, so that the maximum additional score per DVU for 3 consecutive sagittal slices is 3 for each feature, and the total maximum score per DVU is then 18 (Figure 9). Scores are entered into a diagrammatic representation of the DVU in 3 consecutive sagittal slices (Figure 10). Brightness of “intense bone marrow edema” is defined as comparable to or greater than the appearance of cerebrospinal fluid. “Depth” of bone marrow edema is defined as positive when 1 cm or more of continuous increase in STIR signal extends in a vertical direction away from the vertebral endplate.
- Selection of the 6 most severely affected DVU, as recommended when the scoring system is used for evaluation of therapeutic interventions in clinical trials. This follows a global evaluation of the entire spine for inflammatory lesions on both pre- and post-treatment images because a lesion may no longer be evident at one timepoint if the treatment is very effective. All 23 DVU may also be scored for acute lesions, and this may be preferable for longitudinal studies, but this may force the scoring of regions of artefact and is clearly more time-consuming.
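The per-DVU arithmetic and the 6 DVU versus 23 DVU totals can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring software used in the study: the function names `dvu_score` and `spine_score` and the level labels are hypothetical, and the choice of which DVU to include is assumed to come from the reader's global evaluation rather than from the code.

```python
from typing import Dict, Iterable, Optional, Sequence

SLICES_PER_DVU = 3     # 3 consecutive sagittal slices
QUADRANTS_PER_DVU = 4  # each DVU is divided into 4 quadrants


def dvu_score(
    quadrant_edema: Sequence[Sequence[bool]],
    deep: Sequence[bool],
    intense: Sequence[bool],
) -> int:
    """Score one discovertebral unit (range 0-18).

    quadrant_edema: 3 slices x 4 quadrants, True where edema is present.
    deep, intense:  one flag per slice, scored for the DVU as a whole.
    """
    if len(quadrant_edema) != SLICES_PER_DVU or any(
        len(s) != QUADRANTS_PER_DVU for s in quadrant_edema
    ):
        raise ValueError("expected 3 slices of 4 quadrants each")
    if len(deep) != SLICES_PER_DVU or len(intense) != SLICES_PER_DVU:
        raise ValueError("expected one depth and one intensity flag per slice")
    edema_points = sum(bool(q) for s in quadrant_edema for q in s)  # 0-12
    depth_points = sum(bool(d) for d in deep)                       # 0-3
    intensity_points = sum(bool(i) for i in intense)                # 0-3
    return edema_points + depth_points + intensity_points


def spine_score(
    scores_by_level: Dict[str, int],
    selected_levels: Optional[Iterable[str]] = None,
) -> int:
    """Sum DVU scores over the reader-selected levels, or over all levels."""
    levels = scores_by_level.keys() if selected_levels is None else selected_levels
    return sum(scores_by_level[level] for level in levels)
```

Passing the same `selected_levels` for the baseline and follow-up scans keeps the comparison on the same DVU, which is the reason the selection is made after viewing both timepoints.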
Both training modules can be accessed at www.arthritisdoctor.ca.
Validation of training modules
Validation was conducted using MRI scans from 32 patients with AS fulfilling the modified New York criteria8 who were recruited to a randomized (1:1) placebo-controlled trial of anti–TNF–α therapy. MRI scanning was conducted at baseline and 12 weeks. All scans were reviewed on work stations with large screens and image manipulation software ideally suited for this trial (Merge Healthcare eFilm, Milwaukee, WI, USA). This system permitted simultaneous display of all 8 MRI sequences (T1 and STIR for upper and lower spine for both timepoints) at original (life-size) dimensions, and this could be repeated for the semicoronal images of the SIJ. Readers had full-windowing capability and could choose to score the timepoints in any order. Scores were recorded electronically using a custom-designed Web-based computer-assisted masked reading program; the reader was able to see all scores and all scans simultaneously before committing to the final score. A unique MRI study number was allocated to each of the 32 patients, allocation being done by a technologist unconnected with the study using computer-generated random numbers. Each scan was rated by 3 independent readers, who were rheumatology fellows from Denmark, Thailand, and Mexico, and who were blinded to patients’ identities, treatments, and imaging timepoints. One reader (SJP) had some prior experience in reading and interpreting MRI, but the other 2 readers had minimal prior training in MRI. Readers conducted their first readings of these scans after receiving basic training in the principles of MRI and then reading the original manuscripts describing the SPARCC MRI spine and SIJ scoring systems. One month later, readers reviewed the online training modules and then repeated their readings of the MRI scans.
Magnetic resonance imaging
All MRI scans of the spine were performed with 1.5 Tesla systems (Siemens, Erlangen, Germany) using appropriate surface coils. Sagittal sequences were obtained with 3–4 mm slice thickness and 11–15 slices were acquired. Sequence parameters were: (1) T1-weighted spin echo [repetition time (TR) 517–618 ms, echo time (TE) 13 ms]; (2) STIR [TR 3000–3170 ms, inversion time (TI) 140 ms, TE 38–61 ms]. Field of view was 380 to 400 mm and matrix was 512 × 256 pixels. The spine was imaged in 2 parts: (1) upper half comprising the entire cervical and most of the thoracic spine; (2) lower half comprising the lower portion of the thoracic spine and entire lumbar spine. The specific MRI parameters for acquiring spinal images are provided on our website (www.arthritisdoctor.ca).
Statistics
The same MRI scans had been scored a year earlier by 2 experienced SPARCC readers, and these scores constituted the benchmark for the assessment of reliability. Interobserver reliability of the SPARCC MRI spine and SIJ scores between the rheumatology fellows and SPARCC readers was assessed with the intraclass correlation coefficient (ICC), which was calculated using an analysis of variance model, with total SPARCC score as the dependent variable, and patient and reader (fixed factor) as independent variables. ICC values > 0.6, > 0.8, and > 0.9 were taken to indicate good, very good, and excellent reproducibility, respectively. We calculated ICC values for baseline and 12-week scores, as well as the change in score from baseline to 12 weeks.
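An ICC of this kind can be computed from a patients-by-readers table of total scores using the mean squares of a two-way analysis of variance. The sketch below assumes the single-measure consistency form often labeled ICC(3,1), which is one common choice when reader is treated as a fixed factor; the text does not state which ICC variant was used, so this is an assumption. The function names and the "below good" label are hypothetical.

```python
from typing import Sequence


def icc_consistency(scores: Sequence[Sequence[float]]) -> float:
    """ICC(3,1)-style estimate from an n-patients x k-readers score table.

    Uses two-way ANOVA mean squares (patients as rows, readers as columns):
    ICC = (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between readers
    ss_error = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def interpret_icc(icc: float) -> str:
    """Apply the reproducibility thresholds stated in the text."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.8:
        return "very good"
    if icc > 0.6:
        return "good"
    return "below good"  # hypothetical label; the text names no lower category
```

The same function applies to status scores (baseline or 12-week totals) and to change scores, by passing a table of the 12-week score minus the baseline score for each patient and reader.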
RESULTS
Training required about 1 hour per reader for each module even though English was not the first language of any of the 3 readers. An improvement in ICC values for interobserver reliability of SPARCC SIJ inflammation scores was observed with all combinations of inexperienced reader with SPARCC reader save one (Table 1). Less improvement was evident in status scores than in change scores, but status score ICC values were already high prior to training.
Improvement was also evident after training with the spine module, although both status and change score ICC values for spinal inflammation scores were already high prior to training (Table 2). The greatest improvement was observed for both status and change in the 23 DVU score.
Post-training ICC values approximated those of the gold standard SPARCC readers, especially for spinal inflammation scores.
DISCUSSION
These MRI reading exercises, conducted by rheumatology fellows with minimal prior training in MRI, support the following observations. First, we show that a simple Web-based training module requiring minimal time is an effective knowledge transfer technique. Second, we show that training in the use of the SPARCC scoring methods can be so effective that inexperienced readers score as reliably as the original developers of the SPARCC methods. Third, we demonstrate the importance of standardization of imaging protocol, methodological details, and definitions relevant to features of inflammation. Fourth, the high pretraining reliability scores for spinal inflammation attest to the methodological simplicity, feasibility, comprehension, and therefore external validity of the SPARCC MRI spine inflammation score.
Reader A had the least experience with MRI prior to this reading exercise, and the improvement in both status and change score ICC values after training was therefore not surprising. However, improvement in change score ICC values after training was noted for all readers. It is more difficult to attain high reliability for change scores as measured by the ICC, but it is also more desirable since the primary purpose of such an instrument is to reliably detect change and to discriminate effectively between treatment interventions. The ICC estimates the proportion of variance in the data that is due to differences between the subjects rather than differences between the readers, and as such reflects the concept of the signal-to-noise ratio. Consequently, since the differences between patients in status scores are much greater than differences in change scores, ICC values for change scores will tend to be lower on methodological grounds. Assessment of the reliability of change scores is therefore a more rigorous test of training. The greater improvement in the reliability of the 23 DVU score compared to the 6 DVU score is not surprising because the 6 DVU score focuses on the most severely affected lesions. Training is required to reliably detect the more subtle lesions. This also accounts for the high pretraining ICC values for the 6 DVU score. In other words, for the 6 DVU score excellent reliability was evident with minimal training.
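The signal-to-noise argument can be made explicit with a simplified variance-ratio form of the ICC. The numbers below are purely illustrative and are not data from this study; they assume the same reader-related error variance for status and change scores.

```latex
\[
\mathrm{ICC} \approx
\frac{\sigma^2_{\text{patients}}}{\sigma^2_{\text{patients}} + \sigma^2_{\text{error}}}
\]
\[
\text{Status scores: } \frac{36}{36 + 4} = 0.90,
\qquad
\text{Change scores: } \frac{6}{6 + 4} = 0.60 .
\]
```

With identical reader error, the smaller between-patient variance of change scores alone is enough to lower the ICC.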
The principal limitation of the study is that the Web-based training modules were not compared with “usual” training in MRI. A problem with this approach, however, is that most rheumatology fellows do not receive any formal training in musculoskeletal MRI during the course of most fellowship training programs. The external validity of such a study would therefore be questionable. It is possible that a component of the improvement may reflect recall from the pretraining readings even though MRI scans were identified only by study number and scans were read 1 month apart. The retention of knowledge was also not studied, so it is unknown how long readers will maintain their expertise if training is not reinforced.
The post-training reliability data compare favorably with prior studies where these methods were used. A multireader exercise using a similar study design conducted among rheumatologist and radiologist experts in MRI reported reliability data on several methods proposed for the quantitative assessment of spinal inflammation by MRI4. The highest and most consistent ICC values were reported for the SPARCC method, but the median ICC value for change scores (0.78, range 0.42–0.93) was less than that noted in our present study after training. This multireader exercise4 was not preceded by formal standardization of scoring methodology or training exercises and suggests that substantial improvement in reliability would be possible even among experts in the field. The importance of standardization of methodology, availability of reference images, and calibration of readers has been emphasized in the development of a scoring system to evaluate inflammation and structural damage by MRI in patients with rheumatoid arthritis by the OMERACT MRI group9. Although this group has developed a set of reference images and has standardized the methodological approach, a formal mechanism to facilitate calibration and knowledge transfer has not yet been proposed.
In conclusion, we developed Web-based training modules aimed at standardization of the SPARCC MRI scoring methodology for assessing the extent of inflammation in the SIJ and spine in patients with SpA. These modules were shown to be effective in improving reliability to a level where inexperienced rheumatology fellows were able to evaluate active inflammatory lesions to a degree of reliability comparable to that of experienced SPARCC readers. This is therefore an effective approach to formalization of the process of calibration between readers. We recommend the use of these modules prior to application of the SPARCC MRI scoring methods and the adoption of a similar approach by developers of other instruments based on imaging modalities, to facilitate the process of knowledge transfer.
Acknowledgments
We are grateful for the support provided through the National Research Initiative from The Arthritis Society.