Abstract
Objective. Serial magnetic resonance imaging (MRI) examinations are often needed in chronic nonbacterial osteomyelitis (CNO) to determine the objective response to treatment. Our objectives in this study were (1) to develop a consensus-based MRI scoring tool for clinical and research use in CNO; and (2) to evaluate interrater reliability and agreement using whole-body (WB)-MRI from children with CNO.
Methods. Eleven pediatric radiologists discussed definitions and grading of signal intensity, size of signal abnormality within bone marrow, and associated features on MRI through monthly conference calls and a consensus meeting, using a nominal group technique in July 2017. WB-MRI scans from children with CNO were deidentified for training reading and an interrater reliability study. The reading by each radiologist was conducted in a randomized order. Interrater reliability for abnormal signal and severity was assessed using free-marginal κ statistics.
Results. Radiologists reached a consensus on grading CNO-specific MRI findings and on describing bone units based on anatomy. A total of 45 sets of WB-MRI scans, including 4 sets of non-CNO MRI examinations, were selected for the final reading. The mean κ of each category of bones was > 0.7, with the majority > 0.9, demonstrating substantial/almost perfect interrater reliability of readings among radiologists. The agreement on signal intensity and the size of signal abnormality within the most commonly affected bones (femur and tibia) was lower than that for other bones.
Conclusion. The chronic nonbacterial osteomyelitis magnetic resonance imaging scoring (CROMRIS) tool is a comprehensive standardized scoring tool for MRI in children with CNO. Our interrater study demonstrated good interrater reliability and agreement of readings.
Chronic nonbacterial osteomyelitis (CNO) is a pediatric autoinflammatory bone disease challenging to physicians because of its occult nature and the difficulty of assessing disease activity. It is also known as chronic recurrent multifocal osteomyelitis and synovitis, acne, pustulosis, hyperostosis, and osteitis (SAPHO) syndrome. Physical examination and traditional inflammatory markers are not sensitive metrics to monitor disease progression because of occasionally minimal or absent findings on physical examination, normal laboratory values, and lack of correlation between them1. Radiographs are only 13–16% sensitive in detecting skeletal lesions in CNO2 and bone scintigraphy was shown to be only 70% sensitive compared to magnetic resonance imaging (MRI)3. The current gold standard imaging modality is whole-body (WB)-MRI2,4,5, especially at the initial evaluation. However, the imaging findings of CNO can be nonspecific and bone biopsy may be necessary.
CNO can affect virtually any bone, and there is no uniform approach to assess all bones identically. Previously, CNO lesions on MRI were reported by the number of active lesions5,6,7,8,9,10 and their anatomical locations. Detailed scoring systems have been reported11,12. WB-MRI has the dual advantages of greater sensitivity and lack of ionizing radiation when compared to skeletal scintigraphy3, and is more commonly used in pediatric rheumatology across the world2,4,13,14. Standardized reporting of each imaging characteristic across all bones of patients with CNO is critical in establishing imaging outcome measurements in CNO for future studies. Our objective is to develop a practical and consensus-based MRI scoring tool for clinical and research use in CNO. Further, interrater agreement and reliability will be evaluated using WB-MRI from children with CNO.
MATERIALS AND METHODS
The development of the chronic nonbacterial osteomyelitis MRI scoring (CROMRIS) tool consisted of 3 steps: (1) a literature review of previously reported MRI scoring tools of CNO, (2) initial development of a standardized MRI scoring tool for CNO, and (3) a consensus meeting. Subsequently, the interrater agreement and reliability were assessed.
We conducted a literature review on previously reported MRI scoring tools of CNO as preparation for the meetings. The results of the review were presented at the conference call meetings and consensus conference. Members of an international CNO musculoskeletal radiologist working group initiated the process to develop a standardized MRI scoring tool for CNO at the Society for Pediatric Radiology annual conference in Vancouver, British Columbia, Canada, in 2015. Since the first meeting, 11 pediatric radiologists, each with at least 5 years of experience reading musculoskeletal and CNO MRI from 7 different pediatric hospitals in North America and Europe, were identified by soliciting pediatric radiologists within the CNO work group. Group members discussed definitions and grading of signal intensity, size of signal abnormality within bone marrow and surrounding tissue, physis damage, and vertebral compression on MRI through monthly conference calls. Representative MRI images of active bone inflammation [short-tau inversion recovery (STIR) sequence from a 1.5T or 3T scanner, except for the skull, for which a T2 sequence was used] were assembled by members using a separate set of images to establish an atlas to illustrate the proposed scoring system.
Consensus meeting
There were 7 radiologists and 2 pediatric rheumatologists (YZ, PJF) at the face-to-face conference (Seattle, July 2017). The facilitators (YZ and PJF) participated in the discussion but were not eligible to vote. Nominal group technique was used to achieve consensus (defined as ≥ 70% agreement within the group) on all questions considered during the meeting.
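The consensus rule above can be illustrated with a minimal sketch. It assumes that the voting group consisted of the 7 radiologists present (the facilitators were not eligible to vote); the function name and the use of exact fractions are illustrative choices, not part of the study.

```python
from fractions import Fraction

# Consensus was defined in the text as >= 70% agreement within the group.
CONSENSUS_THRESHOLD = Fraction(7, 10)


def consensus_reached(votes_in_favor: int, eligible_voters: int) -> bool:
    """Return True when the share of eligible voters in favor meets the threshold.

    Exact fractions avoid floating-point edge cases at the 70% boundary.
    """
    if eligible_voters <= 0:
        raise ValueError("eligible_voters must be positive")
    if not 0 <= votes_in_favor <= eligible_voters:
        raise ValueError("votes_in_favor must be between 0 and eligible_voters")
    return Fraction(votes_in_favor, eligible_voters) >= CONSENSUS_THRESHOLD


# Hypothetical votes with 7 eligible voters (assumed group size).
print(consensus_reached(5, 7))  # 5/7 is about 71%
print(consensus_reached(4, 7))  # 4/7 is about 57%
```

With 7 eligible voters, at least 5 votes in favor would be needed under this rule.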
Interrater agreement and reliability
The interrater agreement and reliability study was approved by the institutional review board from Iowa Children’s Hospital (# 201609778). Written informed consent was waived owing to the retrospective nature and use of anonymized images. A total of 82 sets of preexisting WB-MRI scans (STIR sequence with 3–4 mm thickness from 1.5T or 3T scanner) between January 2013 and August 2016 from children with CNO or other diseases at the University of Iowa Children’s Hospital were used for training reading and for assessing interrater agreement and reliability. A video tutorial was produced for training and an interrater calibration exercise. Nine sets of MRI examinations were used for the training reading to improve familiarity with the tool before the reliability study. Of the 82 sets of MRI, the following were excluded: 4 from subjects older than 18 years, 9 sets used for training, and 1 set from a patient with leukemia. To assess interrater agreement and reliability, 45 sets of MRI examinations from 45 patients were selected from the remaining 68 sets (19 patients had MRI at more than 1 timepoint), limited to 1 set per patient and using the set from the beginning of the disease course if more than 1 set was available; these included 4 sets of MRI studies from non-CNO patients. Each radiologist read the selected sets in a randomized order. Controls were included in the analyses to ensure variability in the sample. Data were recorded with a detailed scoring form (Supplement 1, available with the online version of this article). There was no gold standard defined for comparisons.
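The selection and randomization steps described above could be sketched as follows. This is an illustration only: the tuple layout, the function names, the use of the earliest scan date as a proxy for "the beginning of the disease course," and the seeded shuffle are all assumptions, not the procedure actually used in the study.

```python
import random
from datetime import date


def earliest_scan_per_patient(scans):
    """Keep one scan per patient: the one with the earliest date.

    `scans` is an iterable of (patient_id, scan_date, scan_id) tuples
    (hypothetical layout). Returns {patient_id: scan_id}.
    """
    earliest = {}
    for patient_id, scan_date, scan_id in scans:
        current = earliest.get(patient_id)
        if current is None or scan_date < current[0]:
            earliest[patient_id] = (scan_date, scan_id)
    return {pid: scan_id for pid, (_, scan_id) in earliest.items()}


def reading_orders(scan_ids, rater_ids, seed=0):
    """Give every rater an independently shuffled reading order.

    A fixed seed makes the assignment reproducible.
    """
    rng = random.Random(seed)
    orders = {}
    for rater in rater_ids:
        order = list(scan_ids)
        rng.shuffle(order)
        orders[rater] = order
    return orders


# Hypothetical data: patient p1 has two timepoints, p2 has one.
scans = [
    ("p1", date(2014, 5, 1), "s1"),
    ("p1", date(2013, 2, 1), "s0"),
    ("p2", date(2015, 1, 1), "s2"),
]
selected = earliest_scan_per_patient(scans)
orders = reading_orders(sorted(selected.values()), ["rater_1", "rater_2"], seed=42)
print(selected)
print(orders)
```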
Statistical analysis
For the interrater agreement and reliability study, descriptive analysis was performed to assess the prevalence of abnormalities at each site defined as agreement among > 70% of the radiologists. Data were presented combining similar types of bones per patient. Absolute agreement for each site was defined as the proportion of patients for whom the ratings were the same for all 11 radiologists.
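The two descriptive measures defined above can be sketched in a few lines. The data layout (one list of ratings per patient for a given site, with truthy values meaning "abnormal") and the function names are assumptions made for illustration.

```python
def site_is_abnormal(ratings, threshold=0.70):
    """True when more than `threshold` of the raters marked the site abnormal."""
    positive = sum(1 for r in ratings if r)
    return positive / len(ratings) > threshold


def prevalence(ratings_by_patient, threshold=0.70):
    """Share of patients whose site is abnormal by the > 70% rule."""
    flagged = sum(1 for r in ratings_by_patient if site_is_abnormal(r, threshold))
    return flagged / len(ratings_by_patient)


def absolute_agreement(ratings_by_patient):
    """Share of patients for whom all raters gave the same rating."""
    unanimous = sum(1 for r in ratings_by_patient if len(set(r)) == 1)
    return unanimous / len(ratings_by_patient)


# Hypothetical ratings for one site from 11 raters across 4 patients.
site_ratings = [
    [0] * 11,            # unanimous normal
    [1] * 11,            # unanimous abnormal
    [1] * 8 + [0] * 3,   # 8/11 (about 73%) abnormal, not unanimous
    [1] * 7 + [0] * 4,   # 7/11 (about 64%) abnormal, below the threshold
]
print(prevalence(site_ratings))          # 2 of 4 patients meet the > 70% rule
print(absolute_agreement(site_ratings))  # 2 of 4 patients are unanimous
```

Note that a site can count as abnormal under the > 70% rule while still failing absolute agreement, as in the third hypothetical patient.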
We assessed interrater reliability (i.e., how well the persons can be distinguished from each other despite measurement errors) using the free-marginal κ statistic described by Brennan and Prediger15. The free-marginal κ statistic is recommended when raters are not instructed about the number of observations that should be assigned to each category15 and when the distribution of ratings is highly skewed16. The κ coefficients were interpreted according to Landis and Koch17. Mean κ (and range) was calculated by categories of bones: the spine, complex bone, flat bones, hand/foot, and long bones. The long bones were further divided into proximal epiphysis, proximal metaphysis, diaphysis, distal metaphysis, and distal epiphysis. All analyses were conducted using R18 (version 3.5.1).
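As an illustration of the statistic, the sketch below computes a free-marginal κ for multiple raters, in which chance agreement is fixed at 1/k for k rating categories and observed agreement is the average proportion of agreeing rater pairs per subject. The multirater pairwise formulation and the Python implementation are assumptions for illustration; the study's analyses were run in R, and the exact computation used there is not described in the text. The labels follow the commonly cited Landis and Koch bands.

```python
from collections import Counter


def free_marginal_kappa(ratings_by_subject, n_categories):
    """Free-marginal kappa: (P_o - 1/k) / (1 - 1/k).

    P_o is the mean, over subjects, of the proportion of rater pairs
    that assigned the same category. Every subject needs >= 2 ratings.
    """
    if n_categories < 2:
        raise ValueError("n_categories must be at least 2")
    total = 0.0
    for ratings in ratings_by_subject:
        n = len(ratings)
        if n < 2:
            raise ValueError("each subject needs at least 2 ratings")
        counts = Counter(ratings)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        total += agreeing_pairs / (n * (n - 1))
    p_observed = total / len(ratings_by_subject)
    p_chance = 1.0 / n_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch descriptive bands."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical signal-intensity grades (0 = absent, 1 = less than fluid,
# 2 = similar to fluid) from 11 raters for 3 subjects.
grades = [
    [0] * 11,
    [1] * 10 + [2],
    [2] * 9 + [1] * 2,
]
kappa = free_marginal_kappa(grades, n_categories=3)
print(round(kappa, 3), landis_koch_label(kappa))
```

Because chance agreement depends only on the number of categories, this statistic is not pulled down by highly skewed rating distributions, which is the property the text cites as the reason for choosing it.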
RESULTS
Literature review
A search was conducted in PubMed using the following MeSH terms: (SAPHO[All Fields] OR “chronic recurrent multifocal osteomyelitis”[All Fields] OR “chronic nonbacterial osteomyelitis”[All Fields] OR “non-bacterial osteitis”[All Fields]) AND (“magnetic resonance imaging”[MeSH Terms] OR (“magnetic”[All Fields] AND “resonance”[All Fields] AND “imaging”[All Fields]) OR “magnetic resonance imaging”[All Fields] OR “mri”[All Fields]) AND (Score[All Fields] OR scoring[All Fields]). Five peer-reviewed publications were identified and one4 was excluded because it did not mention a scoring system. A total of 3 separate tools were reported in the remaining 4 eligible articles. Two articles reported an MRI scoring system for osteitis lesions that ranged from zero to 2 points; the highest score among lesions was used to indicate disease severity in SAPHO19,20. Bone marrow edema, bone erosions, or synovitis (with or without joint effusion) were ascertained. The presence of only 1 finding was scored 1 point and 2 or more findings, 2 points. A second tool used a semiquantitative approach to evaluate the characteristics of CNO lesions from MRI in children11. A comprehensive grading system for the evaluation of the extent of bone edema and soft tissue inflammation was reported, as well as the presence or absence of periosteal reaction, hyperostosis, physeal damage, and vertebral compression11. A third tool, a radiologic index for WB-MRI in patients with nonbacterial osteitis (RINBO), defined the size of active lesions by the absolute measurements and clustered the number of active lesions into 3 categories as unifocal, paucifocal (2, 3, or 4 lesions), and multifocal (5 or more lesions)12. Soft tissue inflammation, periosteal reaction, and hyperostosis were classified as extramedullary findings and spinal involvement was distinguished between active with abnormal STIR signal and chronic with deformation. Surrounding soft tissue inflammation was not included. Points were assessed for each of 4 areas of interest [number of radiologic active lesions (RAL), maximum size of RAL, extramedullary affection, and spine involvement], with a maximum score of 10.
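Two of the previously reported rules summarized above are simple enough to sketch: the 0 to 2 point osteitis lesion score (with the highest lesion score indicating severity) and the RINBO clustering of the number of active lesions. The function names and the "none" label for zero lesions are assumptions added for illustration; the point values assigned within RINBO are not given in the text and are not reproduced here.

```python
def osteitis_lesion_score(bone_marrow_edema, bone_erosion, synovitis):
    """Score one lesion: 0 findings -> 0, exactly 1 -> 1, 2 or more -> 2."""
    findings = sum(bool(x) for x in (bone_marrow_edema, bone_erosion, synovitis))
    return min(findings, 2)


def severity_from_lesions(lesions):
    """Disease severity is the highest lesion score; 0 if there are no lesions.

    `lesions` is an iterable of (edema, erosion, synovitis) flags.
    """
    return max((osteitis_lesion_score(*lesion) for lesion in lesions), default=0)


def lesion_count_category(n_active_lesions):
    """Cluster the number of active lesions as described for RINBO."""
    if n_active_lesions < 0:
        raise ValueError("n_active_lesions must not be negative")
    if n_active_lesions == 0:
        return "none"  # assumed label; the text defines only the 3 categories below
    if n_active_lesions == 1:
        return "unifocal"
    if n_active_lesions <= 4:
        return "paucifocal"
    return "multifocal"


# Hypothetical patient with two lesions.
lesions = [(True, False, False), (True, True, True)]
print(severity_from_lesions(lesions))       # highest lesion score
print(lesion_count_category(len(lesions)))  # category for 2 lesions
```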
Typical WB-MRI protocols include coronal images of the entire body and sagittal images of the entire spine, acquired with a fluid-sensitive sequence (STIR, turbo-inversion-recovery–magnitude, or fat saturation) without contrast. Axial sequences of the pelvis and knees, and sagittal images of the ankles and feet, were also included at one of the centers (Iowa), which enhanced lesion identification in commonly affected sites. This protocol was therefore adopted by the group with consensus. T1-weighted images have been used to confirm findings from fluid-sensitive sequences in CNO. T1-weighted imaging was considered optional because it adds scanning time. Diffusion-weighted imaging (DWI) was reported21; however, it was not routinely performed in participating institutions. One study did not show a difference in sensitivity of differentiating CNO lesions between the STIR sequence alone and the combination of T1-weighted, DWI, and STIR sequences22. Thus, T1-weighted and DWI sequences were not included in the scoring system, but use of DWI should be reconsidered when more data are available on its use in CNO. Detailed discussion based on the reported scoring tools led to the newly developed tool.
The consensus process of the final CROMRIS tool
At the 2017 conference, consensus defined as ≥ 70% agreement within the group23 was reached on all questions considered during the meeting. The complete atlas developed following the consensus meeting includes evaluation of 20 sites using 4 different variables (Supplement 2, available with the online version of this article).
Inclusion and definition of various characteristics of MRI findings in CNO
As presented in Table 1, hyperintensity of bone marrow was defined as increased STIR signal within bone marrow compared to the nearby normal marrow, as per the interpreting radiologist’s assessment. Terminology of bone edema was discussed and replaced by bone marrow hyperintensity with consensus for scientific clarity and the uncertainty of pathology. Linear metaphyseal lines caused by bisphosphonate were included in the atlas to avoid misinterpretation as bone marrow hyperintensity. Periosteal reaction was deemed difficult to confirm by MRI whereas soft tissue inflammation was readily detectable. Thus, “hyperintensity of surrounding tissue” was included with consensus to report the presumed inflammation within soft tissue and periosteum. Hyperostosis was a common term used in radiography though identifiable on MRI as bony expansion. Thus the latter term was adopted by the group. Vertebral compression and joint effusion were included. Growth plate irregularity was discussed and voted not suitable for assessment in MRI with consensus. Kyphosis and limb hypertrophy were assessable in WB-MRI and thus included in this tool. Leg length discrepancy cannot be assessed reliably in MRI and thus was voted not to be included as part of this tool. None of the above measures was assigned as acute or chronic at this stage because a prospective longitudinal study is required to distinguish among them.
Grading scale of variables and definition of bone units
In general, signal intensity of bone marrow was graded with 3 levels: absent, less than fluid signal, and similar to fluid signal. Confidence level of identifying abnormal signal was also recorded as low, medium, or high. The size of signal intensity within each unit/segment was graded using relative measurement because of various body sizes and bone sizes in affected patients. Small was defined as < 25% of estimated volume, medium as 25–50% of estimated volume, and large as > 50% of estimated volume. When imaging was inadequate for a confident estimate of the size, “unable to estimate the size” was recorded. The following variables were graded as present or absent: signal hyperintensity of surrounding tissue (soft tissue/periosteum), bony expansion, continuity of signal abnormality between diaphysis and adjacent segment in long bones, hypertrophy of limbs, signal intensity of posterior and/or lateral elements in spine, and kyphosis of entire spine. Vertebral compression was graded as normal, presence of some height loss, or plana (defined as complete flattening of a vertebral body).
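The relative size grading described above maps directly onto a small function. The sketch assumes that the reader supplies an estimated fraction of the bone unit's volume that shows signal abnormality, and uses `None` to represent imaging that is inadequate for an estimate; both conventions and the function name are illustrative.

```python
def grade_lesion_size(fraction_involved):
    """Grade the size of signal abnormality relative to the bone unit.

    small: < 25% of estimated volume
    medium: 25-50% of estimated volume
    large: > 50% of estimated volume
    `None` means the imaging did not allow a confident estimate.
    """
    if fraction_involved is None:
        return "unable to estimate the size"
    if not 0 < fraction_involved <= 1:
        raise ValueError("fraction_involved must be in (0, 1]")
    if fraction_involved < 0.25:
        return "small"
    if fraction_involved <= 0.50:
        return "medium"
    return "large"


# Hypothetical estimates for three bone segments and one inadequate image.
for estimate in (0.10, 0.25, 0.60, None):
    print(estimate, "->", grade_lesion_size(estimate))
```

Grading by fraction rather than by absolute measurement keeps the categories comparable across children of different body and bone sizes, which is the rationale given in the text.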
The division of bone units and segments was discussed, and the consensus was to follow anatomical divisions in complex bones and group bones into 1 unit in less commonly affected sites (hands and fore-/midfoot) and less well visualized sites. Long bones were divided into the following 5 segments anatomically: proximal epiphysis, proximal metaphysis, diaphysis, distal metaphysis, and distal epiphysis. The spine was graded as individual vertebrae from the cervical to the lumbar region. However, in addition to the grading of the anterior vertebral body, there were reports of abnormal signals within “lateral and posterior elements” including pedicles, lamina, and posterior processes. Based on existing literature, signal hyperintensity is less common within metatarsal bones than in the talus and calcaneus24. Therefore, the consensus was to grade any signal hyperintensity within metatarsal bones as abnormal, and only signal hyperintensity with confluence in the talus or calcaneus as abnormal.
Total scores as reported by RINBO12 were not recommended because our first step was to describe and grade lesions from each individual bone unit reliably. Future studies will be needed to determine the exact weight of each characteristic using a much larger representative cohort.
Interrater agreement and reliability
The 45 subjects were mainly females with a median age of 11 years [interquartile range (IQR) 9–15] and a median disease duration of about 3.3 years (Table 2). About 80% of WB-MRI were collected with additional axial images of pelvis and knees, and sagittal images of ankles/feet, in addition to the coronal plane images of the entire body and sagittal sequences of entire spine, as done in 20% of subjects. The 11 raters were mainly from the United States, with a median 7 years of experience (IQR 6–10; Supplement 3, available with the online version of this article).
Lower extremities were the bones most commonly affected by CNO, with abnormal bone marrow signal (Figure 1). Upper extremities, including humerus, radius, and hand, were reported at 2–9% presence among these patients. Along the spine, the thoracic spine was the most commonly affected site. Pelvic bones, clavicle, and mandible were well represented. Lesions were absent within this cohort in the cervical spine, manubrium/sternum, rib, scapula, skull, and ulna. Hyperintensity within surrounding tissue was detected adjacent to tibia, femur, fibula, foot, humerus, periacetabulum, clavicle, and mandible. Bony expansion was present only in the femur, humerus, clavicle, and mandible. Vertebral compression was mostly present in the thoracic spine. Detailed data from individual bone units (i.e., left femur, right mandible) are available in Supplement 4 (available with the online version of this article).
The signal intensity of bone marrow hyperintensity had low absolute agreement (< 60%) in more commonly affected bones such as the femur, tibia, fore-/midfoot, hindfoot, and clavicle (Figure 2A). The majority of less commonly affected bones, including the spine, pelvis, hands, scapula, patella, and radius, had absolute agreement near or above 80%. The presence of hyperintensity within surrounding tissue and bony expansion agreed very well (> 80%) in all bones (Figure 2B, 2C). Detailed data from individual bone units are in Supplement 5 (available with the online version of this article). Most segments of femur and tibia had lower agreement for the size of bone marrow hyperintensity compared to other long bones (Figure 3A). Among other bones, all had good absolute agreement (> 80%) except for the clavicle, mandible, fore-/midfoot, and hindfoot (Figure 3B). The severity of vertebral compression assessed by radiologists showed excellent absolute agreement in all patients (Figure 3C).
The mean κ of each category was > 0.7, with a majority > 0.9, demonstrating substantial/almost perfect reliability (Table 3). The lowest κ coefficient was observed in bone marrow hyperintensity for the tibia (right, 0.60, 95% CI 0.49–0.71) and the corresponding absolute agreement was only 29% (Supplement 6, available with the online version of this article). Spine, complex bones (pelvis), and flat bones had higher agreement in bone marrow hyperintensity than did hands/feet and long bones. The signal size of bone marrow hyperintensity within each category showed substantial to almost perfect agreement, although hands/feet and proximal/distal metaphysis of long bones had the lowest κ scores. The reliability of the presence of hyperintensity within surrounding tissue, the presence of bony expansion, and vertebral compression was almost perfect in each case. Detailed data from individual bone units are available in Supplement 6. Joint effusion data showed excellent agreement (Supplement 6). Most low- and medium-confidence readings were from more commonly affected sites such as the femur, tibia, and foot (Supplement 7, available with the online version of this article).
DISCUSSION
This is the first consensus-based MRI scoring tool for children with CNO and the first comprehensive assessment of interrater reliability of such a tool. Our tool includes the most commonly described characteristics seen in children with CNO from MRI and the grading system can be used as a potential research tool after further development and validation. An atlas and training video were developed that may guide radiologists who are less familiar or less experienced in reporting MRI from these affected children.
We have further defined these variables and developed a semiquantitative scoring system as an assessment tool for longitudinal studies to measure the response to treatments. Compared with the RINBO system12, our scoring tool included bone marrow hyperintensity (bone edema), size of bone lesion, vertebral compression, and bony expansion (hyperostosis). Several key differences between these 2 tools are (1) periosteal reaction was deemed not reliable by our group in consensus and so was not included in the current tool; (2) the size of lesion was reported in the current tool as relative to the bone, which is more appropriate for a pediatric population; and (3) a total score was not proposed because further studies are needed to determine the weight of each variable.
Defining the minimum abnormal signal is challenging because of individual scoring variations, as suggested by the low absolute agreement of signal hyperintensity in the commonly affected bones (tibia and femur). Therefore, we used a predefined 70% agreement as a threshold to determine whether a “true” abnormal signal existed in bone marrow. Based on this principle, we found a distribution of lesions among the entire skeleton similar to previous reports9,10,25,26. Abnormal signal within surrounding tissue and bony expansion were present at most long bones, but were uncommonly seen in the clavicle and mandible.
In addition, the absolute agreement of the intensity of signal abnormality was poor in commonly affected sites, suggesting that individual radiologists differ in their assessment of the various levels, either because of inadequate calibration/training or because of an inherent challenge in the defined classification. Most low- and medium-confidence readings were from commonly affected sites. These results suggest that adding a mandated calibration exercise with a special focus on less conspicuous lesions might improve the interrater agreement. In contrast, the absolute agreements of abnormal signal in surrounding tissue and bony expansion were > 80% except for the tibia. Although these findings were less common than bone marrow hyperintensity, it is likely that these features were more distinguishable by radiologists, and thus there was more agreement among radiologists.
The κ analysis showed moderate to substantial agreement on the MRI size readings of the most commonly affected bones (tibia and femur). When bones were grouped into large categories such as long bones or spine, the agreement significantly increased, which was likely due to the relatively fewer abnormal signals. Hands and feet were scored as regions by grouping multiple bones, so the size of bone marrow hyperintensity may not have been estimated well; this may explain why the agreement for this variable was the lowest among all categories. Similarly, the signal size of bone marrow hyperintensity of proximal and distal metaphyses of long bones also had the least agreement because of the difficulty of clearly identifying the border/definition of this segment within long bones. These are very helpful observations that will allow further improvements of our scoring system. Future studies will aim to answer the following questions: (1) is this scoring tool sensitive to change of clinical disease activities in CNO from a longitudinal study, and what is the intrarater reliability?; (2) what is the interrater reliability of this tool in a validation cohort?; and (3) how should the researcher integrate scores from each body site as a total score for disease activity on a whole-body level, and can this score differentiate patients with CNO from those without CNO?
Our study had limitations. First, even with our large sample size of subjects, some bone sites were not well represented for the interrater study. Future studies using a different subset of MRI with enriched prevalence of signal abnormalities in less commonly affected sites and inadequately scanned areas (i.e., upper extremities) are needed to validate our findings. Second, joint effusion was not adequately scored but owing to its complexity and less weight in managing these patients, we decided that this should be a separate effort. Third, there was no gold standard of the abnormal signals identified by radiologists for our study. A more objective approach of identifying signal threshold is needed and may be accomplished through machine learning by creating a consensus reading result. Fourth, lower agreement and reliability may have been obtained as a result of unequal familiarity with the tool despite the training. Fifth, even with radiologists from 7 centers, this consensus may not be completely representative. Finally, the correlation of abnormal signals on MRI and the actual pathology from CNO was not confirmed. Therefore, a longitudinal study with detailed clinical characterization in children with CNO and healthy children may shed light on the clinical significance of these variables. Nevertheless, we developed a comprehensive MRI scoring tool for CNO with a consensus from experienced radiologists across 7 centers and 2 continents and showed excellent reliability and agreements in each category of bones and moderate to substantial reliability and agreements in readings from individual bones.
The CROMRIS tool was developed as a comprehensive standardized scoring tool for MRI in children with CNO. Our interrater study demonstrated good interrater reliability and agreement of readings from a group of radiologists. Because CNO is a rare disease and collaborative research is needed in this field, a consensus-based system, such as the CROMRIS tool, representing experienced radiologists from different centers and countries, will likely be adopted by future studies. This tool can be validated in a prospective study and may become a key element of disease activity assessment in CNO.
Acknowledgment
The authors thank other physician participants who contributed in initial conference calls: Nancy Chauvin, Kirsten Ecklund, and Andrea Doria. We are grateful for the excellent advice from Dr. Maarten Boers to help improve the presentation of figures. Drs. Matthew Basiaga and Eric Allenspach critically reviewed the manuscript.
Footnotes
This study was funded by a Childhood Arthritis and Rheumatology Research Alliance–Arthritis Foundation small grant. P.J. Ferguson is supported by R01AR059703 from the US National Institutes of Health/National Institute of Arthritis and Musculoskeletal and Skin Diseases. Y. Zhao is supported by the Clinical Research Scholar Program from Seattle Children’s Research Institute and Bristol-Myers Squibb. The Parker Institute, Bispebjerg and Frederiksberg Hospital (S.M. Nielsen), is supported by a core grant from the Oak Foundation (OCAY-13-309).
Accepted for publication September 18, 2019.
REFERENCES
ONLINE SUPPLEMENT
Supplementary material accompanies the online version of this article.