Abstract
Objective. To develop and test the interreader reliability of the OMERACT Hand Osteoarthritis Magnetic Resonance Scoring System (HOAMRIS) for assessment of structural and inflammatory hand OA features in the interphalangeal joints.
Methods. The HOAMRIS was developed through an iterative process. Selection of features and their scaling was agreed upon through consensus by members of the OMERACT Magnetic Resonance Imaging (MRI) Task Force, using the Oslo Hand Osteoarthritis (OA) MRI Score system as a template. Two reliability exercises were performed, in which 6 and 4 readers participated, respectively. After the first exercise, an atlas was developed and used in the second exercise to facilitate reading. In each exercise, readers independently scored 8 MRI scans from the Oslo Hand OA cohort (coronal/axial short-tau inversion recovery and coronal/axial/sagittal T1-weighted fat-suppressed pre-/post-Gadolinium images). Interreader reliability was assessed by intraclass correlation coefficients (ICC), percentage exact and close agreement (PEA/PCA).
Results. The preliminary OMERACT HOAMRIS included assessment of synovitis, erosive damage, cysts, osteophytes, cartilage space loss, malalignment, and bone marrow lesions (BML), all of which were scored on a 0–3 scale for normal, mild, moderate, and severe (with increments of 0.5 for synovitis, erosive damage, and BML). In the first exercise, most features showed good to very good ICC values (0.64–0.94), except synovitis (0.34). In the second exercise using the atlas, the ICC values were > 0.74 for all MRI features, and the PEA/PCA values were higher than in the first exercise.
Conclusion. A preliminary HOAMRIS with good to very good interreader reliability was developed. Longitudinal studies are needed to assess its sensitivity to change.
Hand osteoarthritis (OA) is a frequent disease in the general population1 that can lead to considerable pain and functional limitation2. Currently, there is limited knowledge about the disease course in hand OA. Magnetic resonance imaging (MRI) has the advantage of whole-joint assessment of all the affected joint tissues including cartilage, bone, and soft tissues. Further, MRI is able to demonstrate bone marrow lesions (BML), which may be a potential target for therapeutic interventions. MRI is therefore a valuable tool to increase the understanding of OA processes, and in future clinical trials may serve as an important outcome measure.
A group from Oslo recently proposed the first MRI scoring system for assessment of hand OA features in the distal interphalangeal (DIP) and proximal interphalangeal (PIP) joints3. This scoring system included evaluation of osteophytes, joint space narrowing, erosions, cysts, BML, malalignment, collateral ligament pathology, synovitis, and flexor tenosynovitis. Despite good to very good reliability, associations between certain MRI features and pain, as well as high sensitivity in detection of OA structural features3,4,5, the authors also noted limitations of the proposed system. First, it was time-consuming because of the inclusion of many features and the separate scoring of the proximal and distal parts of the joint. Further, features such as collateral ligament pathology and flexor tenosynovitis were uncommon, had lower reliability, and were not associated with pain.
MRI outcome measures for both rheumatoid arthritis and psoriatic arthritis were developed by the OMERACT MRI Task Force6,7 and validated using the OMERACT filter8. The aim of our study was to develop a preliminary OMERACT MRI Scoring System for Hand OA (OMERACT HOAMRIS) using OMERACT methodology.
METHODS
Development of the preliminary OMERACT HOAMRIS
Members of the OMERACT MRI Task Force met for a full day in May 2011. Using the Oslo MRI scoring system as a template, meeting participants made modifications, by consensus, to the selection, definitions, and grading of its pathological features. We also proposed the preferred acquisition plane(s) for assessment of the various features and the situations in which use of gadolinium contrast was preferable.
Iterative reliability exercises
The first interreader reliability exercise was conducted using 6 readers (IKH, IE, FM, PB, FG, VF). The readers comprised rheumatologists (n = 4), a rheumatology research fellow (n = 1), and a radiologist (n = 1), all of whom had previous experience in reading MRI scans in different rheumatic diseases affecting the hands. No training was performed prior to the exercise. The readings were performed on different workstations (11–27 in screens) using different image analysis systems. Each reader scored MRI scans acquired with a 1.0 T extremity MRI scanner (ONI, GE Healthcare) of the DIP and PIP joints in the dominant hand of 8 patients from the Oslo hand OA cohort. The patient images had been selected to cover a wide range of radiographic hand OA structural severity based on Kellgren-Lawrence scores. The sequences included short-tau inversion recovery (STIR) images in coronal and axial planes (TE 16.3 and 21 ms, TR 2850 and 3150 ms, slice thickness 2–3 mm, gap between slices 0.2 and 1 mm) and T1-weighted gradient-echo fat-suppressed pre- and post-gadolinium images in coronal, axial, and sagittal planes (TE 5 ms, TR 20 ms, slice thickness 1 mm, gap between slices 0 mm).
After the first exercise, a Web-based meeting was arranged to discuss the initial results. To facilitate the reading and to improve interreader reliability, an atlas of the scoring system was developed. The atlas was distributed to and approved by all readers prior to the second exercise. For standardization, the axial plane (pre- and post-gadolinium) was chosen as the preferred plane for assessment of synovitis in the second exercise.
In the second reliability exercise, 4 of the 6 readers from the first exercise (IKH, IE, FG, VF) participated, and 8 new patient images from the Oslo hand OA cohort were selected.
The data collection in the Oslo hand OA cohort was approved by the regional ethics committee and the data inspectorate. All patients signed informed consent.
Statistical analysis
Interreader reliability was assessed by calculation of percentage exact agreement (PEA), percentage close agreement (PCA), and average measure intraclass correlation coefficients (AvmICC) using mixed effect models (absolute agreement). All features were scored on 0–3 scales with increments of 0.5 for synovitis, erosions, and BML. Hence, PEA was defined as a difference of 0 or 0.5 between minimum and maximum scores in a single joint among all readers, whereas PCA was defined as a difference ≤ 1 between minimum and maximum scores in a single joint among all readers. For calculation of ICC we used the total scores for all 8 joints. ICC values < 0.20 were considered as poor reliability, 0.20 ≤ ICC < 0.40 as fair, 0.40 ≤ ICC < 0.60 as moderate, 0.60 ≤ ICC < 0.80 as good, and 0.80 ≤ ICC ≤ 1.00 as very good reliability (i.e., same cutoffs as recommended for kappa)9. We also calculated the median and interquartile range (IQR) for each of the MRI features based on the reader mean values.
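The agreement measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study. The function names and example scores are invented, and two cutoffs are assumptions: close agreement is taken as a min–max spread of at most 1 point, and the lower bound of each ICC band is treated as inclusive.

```python
from typing import Dict, List


def agreement_percentages(
    scores: List[List[float]],
    exact_cutoff: float = 0.5,   # spread of 0 or 0.5 counts as exact agreement
    close_cutoff: float = 1.0,   # assumed inclusive cutoff for close agreement
) -> Dict[str, float]:
    """scores[r][j] is the score reader r gave joint j for one MRI feature."""
    n_joints = len(scores[0])
    exact = close = 0
    for j in range(n_joints):
        joint_scores = [reader[j] for reader in scores]
        spread = max(joint_scores) - min(joint_scores)
        if spread <= exact_cutoff:
            exact += 1
        if spread <= close_cutoff:
            close += 1
    return {"PEA": 100.0 * exact / n_joints, "PCA": 100.0 * close / n_joints}


def icc_band(icc: float) -> str:
    """Map an ICC value to the reliability label used in the text."""
    if icc < 0.20:
        return "poor"
    if icc < 0.40:
        return "fair"
    if icc < 0.60:
        return "moderate"
    if icc < 0.80:
        return "good"
    return "very good"


# Invented scores: 4 readers (rows) x 4 joints (columns), one feature.
example_scores = [
    [1.0, 2.0, 0.0, 3.0],
    [1.5, 2.0, 1.0, 1.5],
    [1.0, 2.5, 0.5, 2.0],
    [1.0, 2.0, 0.0, 3.0],
]

print(agreement_percentages(example_scores))  # {'PEA': 50.0, 'PCA': 75.0}
print(icc_band(0.34))  # fair
```

In the invented example, the per-joint spreads are 0.5, 0.5, 1.0, and 1.5, so 2 of 4 joints reach exact agreement and 3 of 4 reach close agreement.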
RESULTS
Preliminary OMERACT HOAMRIS
Detailed definitions of MRI features in the HOAMRIS, and their grading, are presented in Table 1. Flexor tenosynovitis and collateral ligament pathology were excluded prior to the reliability exercises. All MRI features were scored on a 0–3 scale for normal, mild, moderate, and severe. The distal and proximal parts of each DIP and PIP joint were scored together, instead of being graded separately as in the Oslo scoring system. Increments of 0.5 were provided for synovitis, erosive damage, and BML. The coronal plane was recommended for evaluation of all MRI features, except synovitis, for which the axial plane was recommended. Both coronal and sagittal planes were used for assessment of osteophytes. For evaluation of erosions, using 1 other plane in addition to the coronal plane was considered ideal but not mandatory. For the majority of features, the T1-weighted images were the preferred sequence for evaluation of pathology, except BML, which was evaluated using STIR images. Postcontrast images were recommended for assessment of synovitis.
Reliability exercises
The demographic and clinical variables for the patients whose images were included in the reliability exercises are presented in Table 2.
In the first reliability exercise (6 readers), we demonstrated good to very good interreader reliability for most of the MRI features, except for a fair AvmICC for synovitis. Close agreement was found in > 62.5% of the joints for all MRI features, whereas the exact agreement was generally low (Table 3).
In the second reliability exercise (4 readers), the readers used an atlas to facilitate scoring and preferably the axial plane for assessment of synovitis. Interreader reliability was generally higher, especially for synovitis, compared to the first reliability exercise (except for BML). All features showed good to very good ICC values, and higher PCA and PEA values compared with the first exercise (Table 4).
We also calculated the ICC values in the first round excluding the 2 readers who did not attend the second exercise. For most features, the ICC values in the first round were lower when examining 4 readers only, making the difference between the first and second exercises even larger (data not shown).
DISCUSSION
In our study, the OMERACT MRI Task Force developed a preliminary OMERACT HOAMRIS and tested its interreader reliability in 2 iterative scoring exercises. The scoring system comprised 7 MRI features, all scored on 0–3 scales for normal to severe. When the readers used an atlas with examples of images, good to very good reliability was demonstrated for all the MRI features, suggesting that MRI can reliably assess structural and inflammatory features in the DIP and PIP joints in patients with hand OA.
The Oslo hand OA MRI scoring system was used as a starting point3, and the experience and results from the validation studies performed in Oslo were taken into account during the development of the HOAMRIS4,5. Haugen, et al found that flexor tenosynovitis had only moderate reliability and was not related to OA severity or associated with pain in the same joint3,4. Concerns were also raised regarding a potential magic angle phenomenon (i.e., increase in signal intensity occurring when collagen fibers are oriented 55 degrees relative to the static magnetic field) in the assessment of collateral ligament discontinuity, which may result in false-positive readings of collateral ligament pathology10. Consequently, we chose to exclude flexor tenosynovitis and collateral ligament pathology from the scoring system.
The definitions of the features were reassessed and modified. In the Oslo system, the definition of erosion was based mainly on loss of bone volume, and subchondral bone attrition (flattening or depression of the joint plate, leading to small loss of bone volume) was scored as grade 1 only3. We felt the severity of bone attrition was not captured by the Oslo system. In the HOAMRIS system, the grading of erosions was based on both the volume of the erosions and the extent to which the joint surface was affected by bone damage. With this definition, the severity of OA central erosions with bone attrition is expected to be better captured.
When scoring erosions, cysts, and bone marrow lesions, one should assess the total volume from both the distal and proximal side. Hence, 2 erosions each estimated at 15% bone loss, 1 on the proximal and 1 on the distal side (in total 30%), and a single erosion estimated at 30% on 1 side should both be scored as grade 2.
Terminology was also revised: the term for cartilage loss assessment was changed from “joint space narrowing” to “cartilage space loss” to describe the feature more precisely. The thickness of the cartilage in the small DIP and PIP joints is difficult to assess directly owing to current image resolution, and the definition was therefore based on the interbone distance.
All features were scored on 0–3 scales with increments of 0.5 for synovitis, erosive damage, and BML. These features were considered to be most important in future clinical trials; thus, an increment of 0.5 was chosen to improve sensitivity to detect more subtle alterations, cross-sectionally and longitudinally. An increment of 0.5 should be used when a reader is uncertain about 2 adjacent categories, e.g., grade 1.5 if uncertain about whether to assign grade 1 or 2. In longitudinal studies or clinical trials, a 0.5 increment can be used when there is increase/decrease without change of category. The use of 0.5 increments may potentially lead to reduced reliability, and there is therefore a need for further evaluation of this in future reliability studies. Training sessions between the readers and more example images in the atlas may improve the reliability between readers.
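The scale described above can be written down as a small lookup, for example to check reader entries in a scoring sheet. This is a minimal sketch; the feature keys and function names are illustrative and are not part of the HOAMRIS itself.

```python
from typing import List

# Features for which half-point increments are allowed, per the text.
HALF_STEP_FEATURES = {"synovitis", "erosive damage", "bone marrow lesions"}
# Remaining features, scored in whole grades only.
WHOLE_STEP_FEATURES = {"cysts", "osteophytes", "cartilage space loss", "malalignment"}


def allowed_scores(feature: str) -> List[float]:
    """Return the permitted scores (0 = normal ... 3 = severe) for a feature."""
    if feature in HALF_STEP_FEATURES:
        return [i / 2 for i in range(7)]  # 0.0, 0.5, ..., 3.0
    if feature in WHOLE_STEP_FEATURES:
        return [0.0, 1.0, 2.0, 3.0]
    raise ValueError(f"unknown feature: {feature!r}")


def is_valid_score(feature: str, score: float) -> bool:
    """True if the score lies on the permitted scale for the feature."""
    return score in allowed_scores(feature)


print(is_valid_score("synovitis", 1.5))    # True
print(is_valid_score("osteophytes", 1.5))  # False
```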
Our results clearly show the importance of calibration between readers. Higher reliability was found in the second exercise, in which the readers used an atlas with example images. The atlas that was used in the second exercise was based on images from the first exercise, and did not include example images of all categories for all features. More work is therefore needed to improve and complete the atlas. Only a few validation studies of MRI have been performed in hand OA4,5,11, and they are limited by the lack of a gold standard as reference. MRI can detect more erosions and osteophytes than conventional radiography4, but these results need to be confirmed in studies using computed tomography (CT) or histology to show that these MRI features represent true findings. Suboptimal resolution of the MRI also makes it difficult to distinguish marginal erosions from cysts, and cysts from bone marrow lesions, emphasizing the need for validation studies using CT and/or histology. Longitudinal studies may also clarify the temporal relationship between, for example, bone marrow lesions and cysts, which have been shown to be temporally related in knee OA12.
In the current reliability exercises we used MRI scans from patients in the Oslo hand OA cohort3. In this cohort the patients had MRI of their dominant hand. However, we have not added any recommendations for which hand to scan, as future studies may want to use other selection criteria, such as the hand with the most symptoms or radiographic criteria.
One limitation is that the current scoring system does not include the base of the thumb. Thumb base involvement is important for both pain and function in patients with hand OA13. In the Oslo hand OA cohort, only the DIP and PIP joints were scanned, since inclusion of the thumb base joint would have required a separate acquisition. In future studies, we need to examine whether we can use the same definitions for the thumb base joint as for the DIP and PIP joints. Further, the interpretation of the ICC values is dependent on the range of the measuring scale14; the wider the range, the higher the ICC. The patient variation was larger in the second exercise (i.e., broader IQR) for some of the features. Larger patient variation may lead to a larger ICC as an artefact, i.e., patient variance being a larger component of the total variance (the denominator). However, PEA and PCA were also higher in the second exercise, supporting a true improvement of reliability. We also found good reliability for features that were less frequently present, such as cysts, malalignment, and BML. A potential limitation is the use of contrast-enhanced images for evaluation of synovitis in a typically elderly population of patients with OA. Future studies should look into whether MRI without contrast has similar sensitivity and specificity in detection of synovitis in hand OA as contrast-enhanced MRI. In both exercises, we used MRI scans from 8 patients. A greater number of patients would have allowed us to examine subsets of patients and explore how differences between patients affect the reliability.
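The dependence of the ICC on between-patient spread can be illustrated numerically. The sketch below uses the standard ANOVA-based formula for a two-way, absolute-agreement, average-measures ICC, (MS_rows − MS_err) / (MS_rows + (MS_cols − MS_err) / n). This is an assumption about the form of the coefficient, not a reproduction of the study's analysis. All scores are invented: the same reader disagreement pattern is applied to a narrow and a wide range of patient totals.

```python
from typing import List


def icc_average_absolute(x: List[List[float]]) -> float:
    """Two-way, absolute-agreement, average-measures ICC.

    x[i][j] is the total score for patient i from reader j.
    """
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between readers
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


# Invented reader disagreement for 8 patients x 3 readers (identical in both runs).
DISAGREEMENT = [
    [0, 1, -1], [1, 0, -1], [-1, 0, 1], [0, -1, 1],
    [1, -1, 0], [-1, 1, 0], [0, 1, -1], [1, 0, -1],
]


def simulate(true_totals: List[float]) -> List[List[float]]:
    """Add the fixed disagreement pattern to each patient's underlying total."""
    return [[t + d for d in row] for t, row in zip(true_totals, DISAGREEMENT)]


icc_narrow = icc_average_absolute(simulate([10, 11, 12, 13, 10, 11, 12, 13]))
icc_wide = icc_average_absolute(simulate([2, 5, 8, 11, 14, 17, 20, 23]))
print(round(icc_narrow, 2), round(icc_wide, 2))
```

With identical reader disagreement, the wide-range sample yields the higher ICC, which is why the agreement percentages are a useful complement when patient spread differs between exercises.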
The next step in the validation of the preliminary OMERACT hand OA MRI score is to examine the feasibility of the scoring system and whether it is sensitive to change. At this time, the OMERACT MRI Task Force does not have available data from any observational longitudinal cohorts or clinical trials using MRI as an outcome measure. However, we expect that within a few years we will have such data available to explore responsiveness.
The OMERACT MRI Task Force has developed a preliminary hand OA MRI scoring system, the OMERACT HOAMRIS, and tested its interreader reliability. Our results suggest that MRI can reliably assess OA features when readers use an atlas with example images. However, further validation of the MRI features and assessment of sensitivity to change are needed before MRI can be recommended as an outcome measure in clinical hand OA trials.
Acknowledgment
We thank Barbara Slatkowsky-Christensen for her contribution to the data collection in the Oslo hand OA cohort, and Tore K. Kvien, Sølve Sesseng, and Désirée van der Heijde for their important contributions to the development of the Oslo hand OA MRI scoring system.
Footnotes
The Oslo hand OA cohort is supported by grants from the South-Eastern Norway Regional Health Authority.