Abstract
Objective. To evaluate the interreader reliability of change scores and the responsiveness of the OMERACT Hand Osteoarthritis (OA) Magnetic Resonance Image (MRI) Scoring System (HOAMRIS).
Methods. Paired MRI (baseline and 5-yr followup) from 20 patients with hand OA were scored with known time sequence by 3 readers according to the HOAMRIS: synovitis, erosive damage, cysts, osteophytes, cartilage space loss, malalignment, and bone marrow lesions (BML; 0–3 scales with 0.5 increments for synovitis, erosive damage, and BML). Interreader reliability for status and change scores was assessed by intraclass correlation coefficients (ICC), percentage exact agreement and percentage close agreement (PEA/PCA), and smallest detectable change (SDC). Responsiveness was assessed by standardized response means (SRM).
Results. Cross-sectional interreader ICC were good to very good (≥ 0.74) for all features except synovitis, cysts, and malalignment (ICC 0.50–0.58). The range of change values was small, leading to low ICC for change scores. The SDC values for sum scores (total range 0–24) varied between 1.97 and 3.05 (except 1.08 for malalignment). For status scores, PEA/PCA on scores in individual joints across the readers were 8.1–50.0% and 43.8–78.1%, respectively. Similarly, PEA/PCA for change scores were 20.6–63.8% and 66.3–93.1%, respectively. All features except cysts and BML demonstrated good responsiveness with higher SRM for sum scores (range 0.46–1.62) than for scores in individual joints (range 0.24–0.73).
Conclusion. Good to very good interreader ICC values were found for cross-sectional readings, whereas the longitudinal reliability was lower because of a smaller range of change scores. All features, except cysts and BML, showed good responsiveness.
Osteoarthritis (OA) is a “whole-joint” disease1, and magnetic resonance imaging (MRI) has the ability to visualize all affected joint structures2. Previous hand OA studies have shown that MRI is more sensitive than radiographs in detection of structural features3,4. Further, MRI findings such as synovitis and bone marrow lesions (BML) are associated with joint tenderness5. Hence, MRI is a valuable tool to increase the understanding of the pathogenesis of OA, and in future clinical trials may serve as an important outcome measure.
The Oslo hand OA MRI scoring system included assessment of several structural (osteophytes, joint space narrowing, erosions, cysts, BML, malalignment, collateral ligament pathology) and inflammatory (synovitis and flexor tenosynovitis) features in the distal interphalangeal (DIP) and proximal interphalangeal (PIP) joints6. Despite good to very good reliability4,5,6, there were several limitations. Because many features were included, and because the proximal and distal parts of the joint were scored separately, the scoring system was time-consuming. Further, features such as collateral ligament pathology and flexor tenosynovitis were uncommon, had lower reliability than other features, and showed no association with joint tenderness4,5,6.
Based on this tool, the OMERACT MRI Working Group iteratively developed a preliminary OMERACT hand OA MRI scoring system (OMERACT HOAMRIS) using OMERACT methodology7. In a reliability exercise with cross-sectional readings, the interreader reliability was good to very good for all features7.
To develop a satisfactory tool for clinical trials, assessment of the reliability of change scores and the responsiveness of HOAMRIS is warranted. However, longitudinal hand OA MRI studies are uncommon. We assessed the reliability and the responsiveness of status and change scores of the HOAMRIS in 20 patients from the Oslo hand OA cohort with hand MRI scans at 2 timepoints 5 years apart.
MATERIALS AND METHODS
Three readers including 1 radiologist (IE) and 2 rheumatologists (FG, VF) participated in the reliability exercise. All readers participated in the previous hand OA MRI exercise7.
Patients in the Oslo hand OA cohort had MRI scans of the DIP and PIP joints of the dominant hand acquired with a 1.0T extremity scanner (ONI, GE Healthcare) at 2 examinations (2008–2009 and 2013). MRI sequences included short-tau inversion recovery images in coronal and axial planes (TE 16.3 and 21 ms, TR 2850 and 3150 ms, slice thickness 2–3 mm, gap between slices 0.2 and 1 mm) and T1-weighted gradient-echo fat-suppressed pre- and post-gadolinium images in coronal, axial, and sagittal planes (TE 5 ms, TR 20 ms, slice thickness 1 mm, gap between slices 0 mm).
Data collection in the Oslo hand OA cohort was approved by the regional ethics committee and the data inspectorate. All patients signed informed consent.
Calibration Exercise
A calibration exercise was performed in March 2014. Three readers (IE, FG, VF) each scored 3 patients from the Oslo hand OA cohort (1 timepoint) according to the proposed HOAMRIS. The HOAMRIS includes assessment of synovitis, erosive damage, cysts, osteophytes, cartilage space loss, malalignment, and BML (all features on 0–3 scales with 0.5 increments for synovitis, erosive damage, and BML; see Appendix 1, which was presented in the previous publication)7. After reading, scoring discrepancies were discussed in a Web-based meeting.
An updated and extended version of the atlas was distributed and approved by all readers prior to the reliability exercise. The coronal plane was recommended for evaluation of all MRI features, except synovitis, for which we agreed to use the axial plane for standardization purposes (as previously proposed). Both coronal and sagittal planes were used for assessment of osteophytes. The scoring system (including definitions, grading, and recommended planes) was not changed as compared to the original publication7.
Reliability and Responsiveness Exercise
An interreader reliability exercise was performed between April and May 2014. Each reader scored 20 patients who had MRI scans at 2 timepoints. Paired MRI scans were read with known time sequence using the HOAMRIS. MRI scans were selected by a nonreader based on availability of appropriate sequences, a wide range of progression of radiographic hand OA structural severity [based on changes in Kellgren-Lawrence (KL) scores], and a wide range of changes in clinical inflammatory features (based on changes in swollen joint counts).
Statistical Analysis
We calculated the median and interquartile range for each MRI feature based on the reader mean values.
Interreader reliability was calculated for status and change scores. We calculated percentage exact agreement (PEA), percentage close agreement (PCA), and average-measure intraclass correlation coefficients (ICC); the ICC were estimated using mixed-effect models (absolute agreement). PEA was defined as a difference of 0 or 0.5 between the minimum and maximum scores across the 3 readers, whereas PCA was defined as a difference of ≤ 1 between the minimum and maximum scores. To determine whether the change of sum score in an individual patient was beyond the measurement error, we calculated the smallest detectable change (SDC), which is 1.96 times the standard error of measurement of the change score (based on the residual error) divided by the square root of the number of readers (in this case, 3)8. ICC values < 0.20 were considered as poor reliability, 0.20 ≤ ICC < 0.40 as fair, 0.40 ≤ ICC < 0.60 as moderate, 0.60 ≤ ICC < 0.80 as good, and 0.80 ≤ ICC ≤ 1.00 as very good reliability9.
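The agreement and SDC definitions above can be illustrated with a minimal Python sketch. It is not the analysis code used in the study: the function names, the example scores, and the standard error of measurement passed to the SDC function are hypothetical, and the ICC itself (which requires a fitted mixed-effect model) is not computed here, only categorized.

```python
from math import sqrt


def agreement_flags(scores):
    """Return (exact, close) agreement for one joint.

    `scores` holds one score per reader. Exact agreement: max - min is
    0 or 0.5. Close agreement: max - min is <= 1.
    """
    spread = max(scores) - min(scores)
    return spread <= 0.5, spread <= 1.0


def percent_agreement(joint_scores):
    """Return (PEA, PCA) in percent over a list of per-joint reader scores."""
    flags = [agreement_flags(s) for s in joint_scores]
    n = len(flags)
    pea = 100.0 * sum(exact for exact, _ in flags) / n
    pca = 100.0 * sum(close for _, close in flags) / n
    return pea, pca


def smallest_detectable_change(sem_change, n_readers=3):
    """SDC as described in the text: 1.96 * SEM of the change score,
    divided by the square root of the number of readers."""
    return 1.96 * sem_change / sqrt(n_readers)


def icc_category(icc):
    """Map an ICC value to the reliability label used in the text."""
    if icc < 0.20:
        return "poor"
    if icc < 0.40:
        return "fair"
    if icc < 0.60:
        return "moderate"
    if icc < 0.80:
        return "good"
    return "very good"


# Hypothetical scores for 4 joints, each read by 3 readers.
example_joints = [
    (1.0, 1.0, 1.5),  # spread 0.5: exact and close agreement
    (2.0, 1.0, 1.0),  # spread 1.0: close agreement only
    (0.0, 2.0, 0.5),  # spread 2.0: neither
    (3.0, 3.0, 3.0),  # spread 0.0: exact and close agreement
]
pea, pca = percent_agreement(example_joints)  # 50.0 and 75.0
```

Because exact agreement is defined on the spread between the minimum and maximum score, a single outlying reader is enough to break agreement for a joint, which helps explain why PEA can be low even when the ICC is good.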
Responsiveness was assessed by standardized response means (SRM) at joint level and patient level (i.e., sum scores for the 8 DIP and PIP joints). The change score for each patient was averaged across the 3 readers (“averaged change score”). Thereafter, the SRM were computed by dividing the mean “averaged change scores” by the SD of the “averaged change scores.” SRM values ≥ 0.80 were considered as good responsiveness, 0.50 ≤ SRM < 0.80 as moderate, and SRM < 0.50 as low responsiveness10.
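The SRM calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's own code: the function names and example change scores are hypothetical, and the sample SD (n − 1 denominator, as returned by `statistics.stdev`) is assumed because the text does not specify which SD was used.

```python
from statistics import mean, stdev


def standardized_response_mean(change_by_reader):
    """Compute the SRM from per-patient change scores.

    `change_by_reader` holds one sequence per patient with each reader's
    change score. Scores are first averaged across readers, then the mean
    of these averaged change scores is divided by their SD.
    Assumption: sample SD (n - 1 denominator).
    """
    averaged = [mean(patient) for patient in change_by_reader]
    return mean(averaged) / stdev(averaged)


def srm_category(srm):
    """Map an SRM value to the responsiveness label used in the text."""
    if srm >= 0.80:
        return "good"
    if srm >= 0.50:
        return "moderate"
    return "low"


# Hypothetical change scores for 4 patients, each read by 3 readers.
example_changes = [
    (1.0, 2.0, 3.0),
    (0.0, 1.0, 2.0),
    (3.0, 3.0, 3.0),
    (2.0, 2.0, 2.0),
]
srm = standardized_response_mean(example_changes)
label = srm_category(srm)
```

Averaging across readers before computing the SRM reduces reader-specific noise in the denominator, which is consistent with the lower SRM values reported when each reader was analyzed separately.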
RESULTS
The demographic and clinical variables for the 20 patients are presented in Table 1. Among the 160 assessed joints, radiographic hand OA was present in 119 [74.4%; KL grade 2 in 58 (36.3%) joints, KL grade 3 in 31 (19.4%) joints, and KL grade 4 in 30 (18.8%) joints]. The mean (SD) followup time between the 2 MRI examinations was 4.6 (0.3) years. During followup, we observed progression of all features (range 2.0–3.0 for sum scores for most features except malalignment and BML; Table 2). For synovitis, BML, and cysts, decrease during followup was observed by 1 or more reader(s) in 46/160 (28.8%) joints, whereas increasing synovitis, cysts, and BML scores were observed by 1 or more reader(s) in 108/160 (67.5%), 114/160 (71.3%), and 86/160 (53.8%) joints, respectively.
Reliability
The cross-sectional interreader ICC values were good to very good for erosive damage, osteophytes, cartilage space loss, and BML, whereas reliability was moderate for synovitis, cysts, and malalignment. Close agreement was found in ≥ 43.8% of the joints for all MRI features (lowest value for osteophytes), whereas the exact agreement was generally low (Table 3).
The interreader ICC values for change scores were good for erosive damage and BML and moderate for cysts and osteophytes. Poor to fair reliability was observed for synovitis, cartilage space loss, and malalignment. For most features (except malalignment), the SDC values varied between 1.97 and 3.05 (Table 4). Close agreement was found in ≥ 66.3% of the joints for all MRI features (lowest for synovitis, BML, and cysts). The exact agreement ranged from 20.6% to 63.8% for synovitis and malalignment, respectively (Table 4).
In general, the PIP joints demonstrated higher exact agreement than the DIP joints in the cross-sectional evaluation of synovitis, cysts, malalignment, and BML (data not shown). For change scores, exact agreement for cysts, osteophytes, malalignment, and BML, and close agreement for synovitis, were higher in the PIP joints than in the DIP joints (data not shown).
Higher cross-sectional ICC values were found for all features except synovitis and BML in patients with mild to moderate radiographic hand OA at baseline (n = 9 with KL sum score 8–16) compared to patients with more severe radiographic hand OA (n = 11 with KL sum score 19–26; data not shown). The changes were larger for the majority of MRI features (except for similar degree of change for cartilage space loss) in the group with more severe radiographic hand OA. Higher ICC values for change scores were observed for synovitis, cartilage space loss, malalignment, and BML in patients with severe radiographic OA, whereas the reliability of changes in cysts was better in mild disease. There were no differences in ICC values for osteophytes and erosions (data not shown).
Responsiveness
The SRM values for the MRI sum scores were good for all features except cysts and BML, which showed moderate and low SRM values, respectively (Table 5). Analyzing the responsiveness for the MRI sum scores for each reader separately revealed lower SRM values (data not shown). The SRM values for MRI features at individual joint level were low to moderate (Table 5).
DISCUSSION
In our study, the OMERACT MRI working group tested the reliability of status scores and change scores and the responsiveness of the HOAMRIS.
As previously shown7, good reliability was found for the cross-sectional readings. Compared to the previous exercise7, the interreader ICC values were slightly lower despite the same readers. However, the patients in the current study had generally more severe disease when comparing the median values for the MRI features. Joints with severe structural abnormalities may be more difficult to score, which may decrease reliability. In the current study, we found higher cross-sectional reliability for all features except synovitis and BML in patients with mild to moderate radiographic hand OA, supporting our hypothesis.
Whereas the PEA and PCA values were similar for both status and change scores, we found considerably lower interreader ICC values for change scores compared to the ICC values for status scores. This may be explained by a smaller range for change scores leading to lower total variance. Further, in joints without close agreement, the magnitude of the difference was larger for change scores than for status scores (because both increase and decrease were possible for change scores). In contrast to cross-sectional reliability, which overall was higher in patients with milder disease, we found higher reliability for change scores of several MRI features in patients with more severe disease. These differences are most likely due to larger changes of pathology in patients with severe OA, whereas more subtle changes in patients with milder disease were difficult to assess reliably.
Low reliability may further be explained by the small size of the joints, insufficient pre-exercise training, and the quality of the 1.0T MRI scans. Using these scans, the distinction between marginal erosions and cysts, and between cysts and BML, may be difficult, emphasizing the need for validation studies using CT and/or histology. Consistent with the effect of joint size, we generally found higher reliability for the PIP joints than for the smaller DIP joints, as shown in the Oslo hand OA cohort6. Further, the readers used different software and screens of different sizes, which may also decrease reliability. The 20 patients with hand OA in this exercise were carefully selected based on the amount of radiographic OA progression and changes in swollen joints. Hence, there was a spectrum of pathology for all MRI features. Therefore, the relatively low number of 20 patients is not the major cause of the poor reliability observed for certain features.
Responsiveness was evaluated by calculation of SRM values. In general, the SRM values were good for most features (except cysts and BML) using the sum score, as opposed to looking at responsiveness in individual joints. Hence, in clinical trials, sum scores may be the most responsive measure. It must be noted, however, that not blinding the readers to time order of the images could have introduced a bias in favor of progression. BML and cysts showed lower responsiveness, which is at least partly related to the natural history of these features. As shown in knee OA11,12,13, BML and cysts frequently decreased. Further, BML may develop into cysts14. Decrease of synovitis was also frequently present, as shown by Kortekaas, et al15. However, high SRM was observed for synovitis. Other factors affecting responsiveness may relate to reliability and image quality. The current study included mostly women, because the majority of participants in the Oslo hand OA cohort are women. Women may experience a larger progression of hand OA features than men16, and may therefore show better responsiveness.
In the Oslo hand OA cohort, only the DIP and PIP joints were covered by the coil, and imaging of the thumb base joint would have required a separate acquisition. In future studies, the definitions of HOAMRIS will need to be validated for the thumb base, which is frequently affected by OA and is important for both pain and function17.
Our results are in line with a previous MRI study by Haugen, et al, showing a high prevalence of MRI features including synovitis in patients with hand OA4, emphasizing that hand OA is a disease of the whole joint with a substantial inflammatory component. With respect to future clinical studies, our results suggest that MRI may be a responsive tool in hand OA clinical trials. However, calibration of readers should be carefully done in all studies to optimize the reliability of readings.
The ICC values for change scores were especially low for synovitis and cartilage space loss. Both are important features of OA, and should probably be addressed in clinical OA trials. Using the current MRI sequences, we were not able to assess the cartilage directly in these small finger joints. Hence, cartilage space loss was used as a marker similar to joint space narrowing on conventional radiographs. A previous study has demonstrated higher sensitivity of conventional radiography in detection of joint space narrowing4, and radiography may therefore be a better imaging modality to assess cartilage until we have MR images with more optimal resolution and sequences. Synovitis will probably be an important outcome in future clinical trials. Other assessment tools such as clinical examination and ultrasound are also hampered by modest reliability. MRI may represent the most promising tool to detect changes in inflammation owing to pairwise comparisons of images. Because of the modest reliability for change scores in the current study, we recommend more intensive training of the readers as well as better MRI quality to facilitate better interreader reliability. Future studies should confirm the high responsiveness and reevaluate the longitudinal reliability using MRI scans of higher quality. Currently, the OMERACT MRI working group does not have available data from any observational longitudinal cohorts or clinical trials using 1.5T or 3.0T MRI as an outcome measure.
Our results suggest that MRI is sensitive to change in hand OA. For cross-sectional readings the reliability was good, with high ICC values. The range of change scores was small, leading to lower ICC values for change scores. Further validation of MRI measurements is needed.
Acknowledgment
We thank Barbara Slatkowsky-Christensen for her contribution in data collection in the Oslo hand OA cohort, and Siri Lillegraven, Espen A. Haavardsholm, Tore K. Kvien, Sølve Sesseng, and Désirée van der Heijde for their important contributions in the development of the Oslo hand OA MRI scoring system.
APPENDIX 1.
Footnotes
I.K. Haugen has received funding from the Norwegian Rheumatology Foundation (Norsk Revmatikerforbund)/Ekstrastiftelsen. Data collection in the Oslo hand OA cohort is supported by grants from Grethe Harbitz’ Legacy For Combating Rheumatic Diseases and the Dr. Trygve Gythfeldt og Wife’s Research Fund.