Abstract
Objective. To assess the longitudinal reliability of the Outcome Measures in Rheumatology (OMERACT) Thumb base Osteoarthritis Magnetic resonance imaging (MRI) Scoring system (TOMS).
Methods. Paired MRI of patients with hand osteoarthritis were scored in 2 exercises (6-mo and 2-yr followup) for synovitis, subchondral bone defects (SBD), osteophytes, cartilage assessment, bone marrow lesions (BML), and subluxation. Interreader reliability of delta scores was assessed.
Results. Little change occurred. Average-measure intraclass correlation coefficients were good-excellent (≥ 0.71), except synovitis (0.55–0.83) and carpometacarpal-1 osteophytes/cartilage assessment (0.47/0.39). Percentage exact/close agreement was 52–92%/68–100%, except BML in 2 years (28%/64–76%). Smallest detectable change was below the scoring increment, except in SBD and BML.
Conclusion. TOMS longitudinal reliability was moderate-good. Limited change hampered assessment.
The thumb base, including the first carpometacarpal (CMC-1) and scaphotrapeziotrapezoid (STT) joints, is often involved in hand osteoarthritis (OA). Thumb base OA is associated with particular risk factors and requires distinct therapeutic interventions compared to interphalangeal finger OA1. Therefore, outcome measures specifically assessing thumb base OA are needed.
In response, the Outcome Measures in Rheumatology (OMERACT) magnetic resonance imaging (MRI) working group developed a scoring system of MRI findings in the thumb base: Thumb base OA MRI Scoring system (TOMS)2. This tool has been shown to exhibit good cross-sectional reliability, but data concerning longitudinal reliability are lacking2. By the term longitudinal reliability, we mean the ability to reliably score sequential images, taking into account interreader variability. Understanding the reliability of TOMS for measuring change is needed for effectively implementing this tool.
Our study investigated the longitudinal reliability of TOMS in 2 settings: a prospective observational study with longterm followup and a clinical trial with short-term followup3,4.
MATERIALS AND METHODS
Reliability exercises
Two reliability exercises were performed. An atlas was available to facilitate scoring5. Features assessed were synovitis, subchondral bone defects (SBD), osteophytes, cartilage assessment, bone marrow lesions (BML), and subluxation2. All features but subluxation were evaluated on 0–3 scales in the CMC-1 and STT joints, with 0.5 increments for synovitis, SBD, and BML. Proximal and distal joint parts were scored separately for SBD, osteophytes, and BML. Subluxation was scored absent/present in the CMC-1 joint. In both exercises, MRI were selected to represent a large range of pathology.
In the first exercise, paired MRI (baseline, 2-yr followup) of 25 patients from the Hand Osteoarthritis in Secondary Care (HOSTAS) prospective cohort study (Leiden University Medical Center6) were scored in known time-order by 3 independent readers [1 rheumatologist (FG) and 2 rheumatology fellows (SvB, FK), all experienced in using TOMS]. Coronal and axial T1-weighted (T1W) fast spin echo (FSE), and T2W FSE images with fat-suppression (FS) were obtained on a 1.5T extremity MRI unit (ONI, GE; Supplementary File, available with the online version of this article). No contrast agent was used. Therefore, synovitis was scored on T2W-FS images, as per the original scoring system2.
The second exercise was conducted by an experienced radiologist (CP) and a rheumatology fellow (FK). Paired MRI (baseline, 6-mo followup) of 24 patients with hand OA from a multicenter randomized double-blind trial comparing lutikizumab to placebo7 were scored for synovitis and BML. For logistical reasons, one reader (CP) scored in unknown time-order and the other (FK) in known time-order. Coronal and axial T1W-FS images with/without gadolinium-based contrast enhancement, and short-tau inversion recovery or T2W-FS images were obtained according to standardized protocol. Because of incomplete coverage, the STT could only be assessed in 16 patients, and the trapezoid bone was not evaluated.
Data collection for both studies was approved by local ethics committees (P09.004, NCT02384538). All participants provided written informed consent.
Statistical analyses
Separate scores of distal and proximal joint compartments were combined into 1 sum score per joint where applicable. Median and interquartile range (baseline status scores) or range (delta scores) were calculated for each feature, based on the average of the readers. Interreader reliability of delta scores was assessed by calculating intraclass correlation coefficients (ICC; average measure, mixed-effect models, absolute agreement), and percentage exact and close agreement (PEA/PCA). ICC ≤ 0.20 were considered poor, > 0.20 to < 0.40 fair, ≥ 0.40 to < 0.60 moderate, ≥ 0.60 to < 0.80 good, and ≥ 0.80 excellent reliability8. PEA/PCA were defined as a difference of 0/≤ 1 between minimum and maximum scores across readers. For each feature, the smallest detectable change (SDC) was calculated9. We determined how many patients changed beyond measurement error (i.e., change score > SDC), and whether the smallest scoring increment for each feature could be scored reliably (i.e., smallest increment > SDC).
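As an illustration of the agreement definitions and ICC categories given above, the following is a minimal Python sketch and not the analysis code used in the study. The function names and the example delta scores are hypothetical; only the thresholds and the minimum/maximum rule are taken from the text.

```python
from typing import Sequence, Tuple


def agreement_percentages(delta_scores: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (PEA, PCA) in percent.

    Each inner sequence holds one patient's delta scores, one per reader.
    Exact agreement: max - min == 0; close agreement: max - min <= 1.
    """
    exact = 0
    close = 0
    for reader_scores in delta_scores:
        spread = max(reader_scores) - min(reader_scores)
        if spread == 0:
            exact += 1
        if spread <= 1:
            close += 1
    n = len(delta_scores)
    return 100.0 * exact / n, 100.0 * close / n


def icc_category(icc: float) -> str:
    """Map an ICC value to the reliability categories defined in the text."""
    if icc <= 0.20:
        return "poor"
    if icc < 0.40:
        return "fair"
    if icc < 0.60:
        return "moderate"
    if icc < 0.80:
        return "good"
    return "excellent"


# Hypothetical delta scores for 4 patients scored by 3 readers (not study data).
example = [
    [0.0, 0.0, 0.0],   # spread 0   -> exact and close
    [0.5, 0.0, 0.5],   # spread 0.5 -> close only
    [1.0, -0.5, 0.0],  # spread 1.5 -> neither
    [0.0, 0.0, 1.0],   # spread 1   -> close only
]
pea, pca = agreement_percentages(example)
```

Because every additional reader is another score that must fall inside the same spread, this definition becomes stricter as the number of readers increases.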
RESULTS
Table 1 presents baseline characteristics of patients from both reliability exercises10. Thirteen trial participants received placebo and 11 lutikizumab. Baseline scores of MRI features were generally low (Table 2). Highest scores were given for CMC-1 osteophytes. Overall, more MRI abnormalities were seen in the CMC-1 compared to the STT joint.
Baseline scores of synovitis and BML were comparable in the 2 studies. On average, little change was observed after 6 months and 2 years (Table 2). However, individual patients showed change in synovitis and BML, both increasing and decreasing (Supplementary Figure 1, available with the online version of this article). Cartilage and bone features generally showed less improvement and more deterioration over time.
Table 3 presents the longitudinal reliability in both studies. ICC for most features in both thumb base joints were good to excellent. Fair to moderate ICC were found for cartilage assessment and osteophytes in the CMC-1 joint. ICC for synovitis in the different studies and joints varied from moderate to excellent. ICC could not be estimated for some features (STT synovitis in the clinical trial, STT osteophytes, and subluxation).
Since calculation of ICC was influenced by the small amount of change that occurred over time in both studies, PEA and PCA values were also calculated. PEA/PCA of all features in both joints ranged from 52–92% and 68–100%, except for BML in the CMC-1 in the 3-reader exercise (PEA 28%/PCA 64%). PEA values in that exercise were all lower than for the clinical trial.
The SDC was calculated for all features and should be considered in light of the range and smallest increment of that feature’s score (Table 3). Most SDC were lower than that feature’s smallest scoring increment, although the SDC of SBD and BML in particular were higher than the increment of 0.5. In the cohort study, the SDC for BML in the CMC-1 was even higher than 1 (SDC = 1.27), although in the clinical trial the SDC was better (SDC = 0.87). Most participants did not change more than the SDC (Supplementary Table 1, available with the online version of this article). The largest number of participants with a delta score larger than the SDC, either increasing or decreasing, occurred for synovitis and BML. Features related to cartilage and bone generally deteriorated. Of these, SBD showed the most participants with change.
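The comparison between SDC and scoring increment can be made concrete with a short Python sketch. It uses only the 2 SDC values for CMC-1 BML quoted in the text (1.27 and 0.87) and the 0.5 increment; the function names are hypothetical, and the use of the absolute value of the delta score is an assumption based on counting both increases and decreases. The helper that returns the smallest scorable change above the SDC is an illustrative addition, not part of the reported analysis.

```python
import math


def changed_beyond_error(delta: float, sdc: float) -> bool:
    """True if a change score exceeds the SDC in either direction (assumed two-sided)."""
    return abs(delta) > sdc


def increment_is_reliable(increment: float, sdc: float) -> bool:
    """True if the smallest scoring increment is larger than the SDC."""
    return increment > sdc


def smallest_scorable_change_above(sdc: float, increment: float) -> float:
    """Smallest multiple of the scoring increment that is strictly larger than the SDC."""
    steps = math.floor(sdc / increment) + 1
    return steps * increment


# CMC-1 BML values quoted in the text; the scale uses 0.5 increments.
cohort_sdc = 1.27
trial_sdc = 0.87
increment = 0.5

cohort_needed = smallest_scorable_change_above(cohort_sdc, increment)
trial_needed = smallest_scorable_change_above(trial_sdc, increment)
```

Under these assumptions, a CMC-1 BML change would need to reach 1.5 in the cohort study and 1.0 in the clinical trial before it exceeds the SDC.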
DISCUSSION
In our report, we show the longitudinal reliability of TOMS, a recently developed OMERACT MRI scoring system to assess inflammatory and structural features in thumb base OA. Based on ICC, PEA, and PCA values, our investigation showed that reliability of assessment of delta scores using the TOMS was good.
The longitudinal reliability of the similar Hand Osteoarthritis Magnetic Resonance Imaging Scoring System (HOAMRIS) to evaluate interphalangeal joints was previously published11. Because the HOAMRIS and TOMS assess similar features, similar reliability is expected. Reliability of change scores in the HOAMRIS exercise (20 patients, 3 readers) for erosive damage and cysts was similar to that for SBD in TOMS. BML were also reliably assessed in both studies. However, our results for synovitis, osteophytes, and cartilage assessment were better than those for HOAMRIS. Observed differences between the studies may partly be explained by a higher number of assessed joints for the HOAMRIS, leading to lower PEA/PCA values. Interphalangeal joints are also smaller, and the field strength of the magnetic resonance scanner was lower, which made reliable assessment more difficult.
ICC of the previous cross-sectional reliability exercise of the TOMS were generally higher, while PEA/PCA values were lower2. This pattern is expected: ICC of delta scores in a cohort with little change over time are generally lower, because ICC values depend not only on measurement error, but also on between-subject variability. Between-subject variability is part of the calculation used to produce ICC values, and low between-subject variability can cause unreasonably low ICC values12. Results of the 2 exercises performed in our study were generally comparable, although the difference in blinding for time-order among readers of the clinical trial may have lowered agreement between these readers. PEA values in the 3-reader exercise were all lower than for the 2-reader exercise, which can at least partially be attributed to the higher number of readers who have to reach exact agreement in the former.
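The dependence of ICC on between-subject variability can be illustrated with a simplified Python sketch. It uses the generic single-measure form of the ICC (between-subject variance divided by between-subject plus error variance), which is an assumption made for illustration and not the average-measure, mixed-effect, absolute-agreement model used in our analyses. The function name and the variance values are hypothetical.

```python
def simple_icc(var_between: float, var_error: float) -> float:
    """Simplified ICC: share of total variance attributable to between-subject differences."""
    return var_between / (var_between + var_error)


# Same measurement error in both scenarios; only between-subject variance differs.
error_variance = 0.25

# Heterogeneous sample (e.g., status scores covering a wide range of pathology).
icc_wide = simple_icc(var_between=4.0, var_error=error_variance)

# Homogeneous sample (e.g., delta scores when few patients change).
icc_narrow = simple_icc(var_between=0.1, var_error=error_variance)
```

With identical measurement error, the ratio falls sharply when between-subject variance is small, which is why low ICC for delta scores do not by themselves indicate poor reader agreement.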
Assessment of longitudinal reliability was hampered by the small magnitude of change. Continuous change scores and the number of patients changing more than the SDC were low. Both cohorts reflect the characteristic disease course. In the cohort study, no intervention was given, and inflammatory features were not expected to change. However, over a 2-year period, cartilage and bone damage were expected to increase, which they did, though only mildly. Generally, radiographic progression in the CMC-1 over 2 years is slow13. Moreover, we selected participants with and without thumb base OA for this methodological exercise, which may have contributed to the low amount of change that was observed over time.
Most SDC were low and below the feature’s smallest scoring increment, showing that a change of 1 increment reflects a measurable change in that feature. Only SBD and BML had an SDC above their defined smallest increment of 0.5, and it could be argued that 0.5 increments are too small to be reliably assessed for these features.
Results from our study provide evidence that the OMERACT TOMS can be used to evaluate thumb base MRI in studies of different settings. Future studies are warranted, in particular positive clinical trials, to evaluate sensitivity to change, as well as validation studies.
ONLINE SUPPLEMENT
Supplementary material accompanies the online version of this article.
Acknowledgment
We are indebted to AbbVie (North Chicago, Illinois, USA) and the Department of Radiology of the Leiden University Medical Center (Leiden, the Netherlands) for providing the magnetic resonance images for the reliability exercises.
Footnotes
PGC is supported in part by the UK National Institute for Health Research infrastructure at Leeds.
Accepted for publication November 7, 2018.