Abstract
Objective. To determine the reliability of radiographic assessment of knee osteoarthritis (OA) by nonclinician readers compared to an experienced radiologist.
Methods. The radiologist trained 3 nonclinicians to evaluate radiographic characteristics of knee OA. The radiologist and nonclinicians read preoperative films of 36 patients prior to total knee replacement. Intrareader and interreader reliability were measured using the weighted κ statistic and intraclass correlation coefficient (ICC). Weighted κ ≤ 0.20 indicated slight agreement, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.0 almost perfect agreement.
Results. Intrareader reliability among nonclinicians (weighted κ) ranged from 0.40 to 1.0 for individual radiographic features and from 0.72 to 1.0 for Kellgren-Lawrence (KL) grade; ICC for the Osteoarthritis Research Society International (OARSI) summary score ranged from 0.89 to 0.98. Interreader κ among nonclinicians ranged from 0.45 to 0.94 for individual features and from 0.66 to 0.97 for KL grade; ICC for the OARSI summary score ranged from 0.87 to 0.96. Interreader κ between nonclinicians and the radiologist ranged from 0.56 to 0.85 for KL grade; ICC for the OARSI summary score ranged from 0.79 to 0.88.
Conclusion. Intrareader and interreader agreement was variable for individual radiographic features but substantial for the summary KL grade and OARSI summary score. Investigators face tradeoffs between cost and reader experience. These data suggest that in settings where costs are constrained, trained nonclinicians may be suitable readers of radiographic knee OA, particularly if a summary score (KL grade or OARSI summary score) is used to determine radiographic severity.
Knee osteoarthritis (OA) is characterized by degenerative changes in cartilage, bone, meniscus, and other joint structures, in concert with pain, stiffness, and functional loss. The severity of structural damage in knee OA can be assessed by radiographic evidence of joint space narrowing (JSN) and osteophyte formation1,2. In clinical research, knee OA is often staged by radiologists or other physicians using ordinal scales such as the Kellgren-Lawrence (KL) grade or the Osteoarthritis Research Society International (OARSI) score. Having experienced clinicians to grade radiographic OA in research settings can be expensive, raising the question of whether nonradiologist readers can be trained to provide reliable, valid readings.
Several studies have measured variability in radiographic assessment of knee OA by clinicians3,4,5,6. Intrareader and interreader reliability of the KL score vary widely across studies, with weighted κ ranging from 0.26 to 0.88 and 0.56 to 0.80, respectively4,5,6. One prior study examined the interreader reliability of radiographic assessment of knee OA between nonclinician readers and an experienced clinical reader. This study documented κ statistics for radiographic features of tibiofemoral OA ranging from 0.12 to 0.80, suggesting this approach merits further research7.
In our current study, we aimed to determine the interreader and intrareader reliability of radiographic assessment of severe knee OA among 3 junior nonclinician readers and to assess the agreement between their readings and those of an experienced radiologist.
MATERIALS AND METHODS
Study population
The data presented in this report were collected as part of the Adding Value in Knee Arthroplasty (AViKA) Postoperative Care Navigation study, a randomized controlled trial conducted at Brigham and Women’s Hospital in Boston, Massachusetts, USA. The trial prospectively evaluated a behavioral intervention to optimize postoperative outcomes following primary total knee replacement (TKR), and enrolled 309 patients ≥ 40 years of age with a primary diagnosis of OA who underwent primary TKR8. We chose 39 AViKA participant radiographs at random for our study. Demographic information for the subjects analyzed is presented in Table 1. Our study was approved by the Partners Healthcare Institutional Review Board (protocol 2010P002597).
Training
Two nonclinician readers (medical students) studied the OARSI Atlas of Individual Radiographic Features in Osteoarthritis 2 to learn to grade osteophytes and JSN on standing bilateral radiographs. They spent 5 hours reading films before attending 2 hour-long training sessions with an experienced radiologist who has graded knee OA for several cohort studies and trials. The training sessions for the medical students took place before a research assistant was included in the study. Because of feasibility concerns (limited study funding and time constraints of the radiologist), the medical students provided the initial training for the research assistant to assess radiographic features of OA in 3 hour-long sessions. The 3 nonclinician readers (2 medical students and 1 research assistant) had a final training session with the radiologist before reading films for reliability analyses.
Data collection procedures
The nonclinician readers and the radiologist viewed 39 standing bilateral preoperative radiographs in Centricity Web Version 3.0 and graded individual features of knee OA, blinded to the ratings of other readers. Three subjects were excluded from the analysis because readers inadvertently graded different radiograph views. For each of the 36 subjects included in the analysis, we examined both the left and right knees; however, we were unable to grade radiographic features of OA in knees with implants. Six out of 36 subjects analyzed had 1 knee replaced before enrolling in AViKA. Thus, out of 72 knees (36 × 2), we were able to analyze 66 knees (72 − 6) for interreader reliability. Standing bilateral films were used to assess tibiofemoral features of OA [preferably the posteroanterior (PA) view, or the anteroposterior view if PA was unavailable]. These clinical radiographic protocols did not use a standardized positioning device. Sunrise views were used to grade patellofemoral features. Radiology technicians routinely assess image quality and repeat images that are inadequate; therefore, all images used for this reliability analysis were acceptable.
To assess intrareader reliability, 2–3 weeks following their initial reading, 2 nonclinician readers re-graded 17 radiographs (31 knees) and 1 nonclinician reader re-graded 36 radiographs (66 knees). We analyzed left and right knees separately, with 1 film per subject, because analyzing both knees of a subject together would have introduced clustering of observations within subjects and required a less transparent analysis.
Radiographic measures
Anatomic alignment (in degrees) was defined as the angle formed by the intersection of a line drawn from the intercondylar notch to the center of the femoral shaft and another line drawn from the center of the tibial spines to the center of the tibial shaft. The angle of anatomic alignment and direction of deformity (varus or valgus) were documented. The deformity was considered varus when the tibia was angled inward with respect to the femur, and valgus when the tibia was angled outward with respect to the femur. The length of tibia and femur available to measure on each study was not standardized. Osteophytes and JSN were graded on a 4-point scale (0–3) as per OARSI guidelines2. Grade 0 was considered normal, grade 1 indicated mild osteophytes or narrowing, grade 2 moderate osteophytes or narrowing, and grade 3 severe osteophytes or narrowing.
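As an illustration of the alignment measurement, the sketch below computes the angle between the two axes from 2D image coordinates of the four landmarks. The function name and coordinate convention are hypothetical (this is not the software used in the study), and determining varus versus valgus would additionally require knowing laterality and axis orientation.

```python
import math

def anatomic_alignment_angle(notch, femoral_shaft, tibial_spines, tibial_shaft):
    """Angle in degrees between the femoral axis (intercondylar notch to
    center of femoral shaft) and the tibial axis (center of tibial spines
    to center of tibial shaft), given (x, y) image coordinates."""
    # Axis vectors for femur and tibia
    fx, fy = femoral_shaft[0] - notch[0], femoral_shaft[1] - notch[1]
    tx, ty = tibial_shaft[0] - tibial_spines[0], tibial_shaft[1] - tibial_spines[1]
    # Angle from the dot product, clamped to guard against rounding error
    dot = fx * tx + fy * ty
    norm = math.hypot(fx, fy) * math.hypot(tx, ty)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```

For example, collinear femoral and tibial axes give 0° (neutral anatomic alignment), and the measured deviation from 180° of the full femur-tibia angle corresponds to the degree of deformity.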
Osteophytes were assessed in the lateral femoral, lateral tibial, medial femoral, and medial tibial compartments. Medial and lateral patellofemoral osteophytes were assessed, but only the largest score was recorded. Lateral and medial tibiofemoral JSN were graded in addition to patellofemoral JSN. Joint space width (JSW) was measured in mm at the narrowest point in each of these compartments. The radiologist graded osteophytes and JSN only.
We used individual tibiofemoral osteophyte and JSN scores to generate a KL grade and OARSI summary score for each knee; patellofemoral osteophytes and JSN are not included in these summary scores. If the highest osteophyte score was 0 and the highest JSN score was 0 or 1, we considered the knee KL 0. The knee was KL 1 if the highest osteophyte score was 1 and the highest JSN score was 0 or 1. The knee was KL 2 if the highest JSN score was 0 or 1 and the highest osteophyte score was ≥ 2. The knee was KL 3 if the highest JSN score was 2, regardless of osteophyte scores. The knee was KL 4 if the highest JSN score was 3, regardless of osteophyte scores. We determined the OARSI summary score by adding all tibiofemoral osteophyte and JSN scores.
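The derivation rules above can be sketched in code as follows. The inputs are the four tibiofemoral osteophyte scores and the two tibiofemoral JSN scores (each 0–3) described in the text; the function names are illustrative, not the study's actual software.

```python
def kl_grade(osteophytes, jsn):
    """KL grade from the paper's decision rules, using the maximum
    tibiofemoral osteophyte and JSN scores (each graded 0-3)."""
    max_ost, max_jsn = max(osteophytes), max(jsn)
    if max_jsn == 3:          # marked narrowing -> KL 4, regardless of osteophytes
        return 4
    if max_jsn == 2:          # moderate narrowing -> KL 3, regardless of osteophytes
        return 3
    # Remaining cases all have a highest JSN score of 0 or 1
    if max_ost >= 2:
        return 2
    if max_ost == 1:
        return 1
    return 0

def oarsi_summary(osteophytes, jsn):
    """OARSI summary score: sum of all tibiofemoral osteophyte and JSN scores."""
    return sum(osteophytes) + sum(jsn)
```

For instance, a knee with osteophyte scores (2, 1, 0, 0) and JSN scores (1, 0) would be KL 2 with an OARSI summary score of 4.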
Statistical analysis
We define intrareader agreement as a measure of a reader’s own consistency. Interreader agreement is a measure of a reader’s consistency compared to other readers. To calculate intrareader agreement, we compared the first and second reads of the same radiographs by the same reader. To calculate interreader agreement, we compared each reader’s first read of the same radiographs. To assess agreement on categorical variables, we used weighted κ coefficients. Weighted κ scores take into account the ordering of the categorical levels of a variable9. A weighted κ score that evaluates agreement between readers A and B is calculated through a matrix in which reader A’s ratings are arrayed in the rows and reader B’s ratings in the columns. Cells on the diagonal represent exact agreement. Each off-diagonal cell is assigned a weight reflecting the severity of disagreement, which increases as the chosen levels grow farther apart. These Cicchetti-Allison weights are linear and calculated as w_ij = 1 − |i − j| / (K − 1), where i is the level assigned by Rater 1, j is the level assigned by Rater 2, and K is the number of levels of the variable9. The κ statistic ranges from −1 to 1; weighted κ ≤ 0.20 indicates slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.0 almost perfect agreement10.
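As a concrete sketch of this computation (illustrative only, not the software used for the analyses), a linearly weighted κ for two raters scoring an ordinal variable with levels 0 to K − 1 can be implemented as:

```python
def linear_weighted_kappa(a, b, k):
    """Weighted kappa for two raters' ordinal scores in 0..k-1, using
    linear Cicchetti-Allison agreement weights w_ij = 1 - |i - j|/(k - 1)."""
    n = len(a)
    # Observed joint proportions (rows: rater A, columns: rater B)
    obs = [[0.0] * k for _ in range(k)]
    for i, j in zip(a, b):
        obs[i][j] += 1.0 / n
    # Marginal proportions for each rater
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected agreement
    po = sum((1 - abs(i - j) / (k - 1)) * obs[i][j]
             for i in range(k) for j in range(k))
    pe = sum((1 - abs(i - j) / (k - 1)) * row[i] * col[j]
             for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)
```

With these weights, identical ratings yield κ = 1, while a one-level disagreement is penalized less than a two-level disagreement, matching the ordinal structure of KL and OARSI grades.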
To assess agreement on continuous variables, we used intraclass correlation coefficients (ICC), which quantify the similarity of grouped measures between observers. We used the fixed Shrout-Fleiss ICC. The ICC is estimated by (BMS − EMS)/BMS, where BMS is the between-subject mean square, or the variation explained by differences between subjects, and EMS is the residual mean square, or the variance left over11. The ICC ranges from −1 to 1; values closer to 1 indicate stronger reliability, while values at or below 0 indicate poor reliability. The sample size was chosen to provide reasonable precision as reflected in the 95% CI around the estimates of κ.
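A minimal sketch of the (BMS − EMS)/BMS estimator, assuming a complete subjects × raters table of scores with no missing values (the function name is hypothetical):

```python
def icc_3k(ratings):
    """Fixed-raters, average-measures ICC estimated as (BMS - EMS)/BMS.
    `ratings` is a list of rows (one per subject), each a list of scores
    (one per rater), with no missing values."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    # Two-way ANOVA mean squares: between-subject (BMS), between-rater (JMS)
    bms = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    jms = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    # Residual mean square (EMS) from the total sum of squares
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ems = (ss_total - bms * (n - 1) - jms * (k - 1)) / ((n - 1) * (k - 1))
    return (bms - ems) / bms
```

Because the between-rater mean square is excluded from the numerator and denominator, this form measures consistency: raters who agree up to a constant offset still achieve an ICC of 1.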
RESULTS
Gold standard comparison
The agreement in readings of the gold standard radiologist and the nonclinician readers for individual radiographic features of tibiofemoral OA was fair to substantial, with κ statistics ranging from 0.39 to 0.76 (Table 2). Agreement was generally higher for JSN than for osteophytes and for the tibiofemoral joint structures than for the patellofemoral joint. Interreader reliability for KL scores was moderate to almost perfect, with κ statistics ranging from 0.56 to 0.85 (Table 2). The OARSI summary score showed excellent interreader agreement between the radiologist and nonclinician readers, with ICC ranging from 0.79 to 0.88 (Table 3).
Intrareader reliability among nonclinician readers
The intrareader reliability among nonclinicians for individual radiographic features of tibiofemoral OA was fair to almost perfect, with κ statistics ranging from 0.40 to 1.0. Agreement was generally higher for JSN than for osteophytes and for tibiofemoral than for patellofemoral joints. The KL score showed substantial to almost perfect intrareader agreement, with κ statistics ranging from 0.72 to 1.0. Intrareader reliability for alignment (varus or valgus) was substantial to almost perfect, with κ statistics ranging from 0.87 to 1.0 (Table 2).
The OARSI summary score showed excellent intrareader agreement, with ICC ranging from 0.89 to 0.98 (Table 3). Intrareader reliability of knee angle measurement (in degrees) was also excellent, with ICC ranging from 0.92 to 0.98. The ICC for lateral JSW ranged from 0.82 to 0.96, and for medial JSW from 0.94 to 1.0 (Table 3).
Interreader reliability among nonclinician readers
The interreader reliability among nonclinician readers for individual radiographic features of tibiofemoral OA was moderate to almost perfect, with κ statistics ranging from 0.45 to 0.94. Interreader agreement for KL scores was substantial to almost perfect, with κ statistics ranging from 0.66 to 0.97 (Table 2). Interreader reliability of alignment (varus or valgus) κ statistics ranged from 0.76 to 0.93 (Table 2).
The OARSI summary score showed excellent interreader reliability, with ICC ranging from 0.87 to 0.96 (Table 3). Interreader agreement of anatomic alignment angle (in degrees) was also excellent, with the ICC ranging from 0.88 to 0.94. ICC ranged from 0.83 to 0.92 for lateral JSW and from 0.93 to 0.96 for medial JSW (Table 3).
Patellofemoral findings
Patellar osteophytes and patellofemoral JSN are not addressed in the OARSI atlas2. Therefore, our analyses of these features are exploratory. The interreader reliability between the radiologist and nonclinician readers for patellofemoral features of OA was slight to substantial, with κ statistics ranging from 0.17 to 0.79. The intrareader and interreader agreement among nonclinician readers for patellofemoral features was better, with κ statistics ranging from 0.43 to 0.84 for intrareader agreement and from 0.28 to 0.83 for interreader agreement (Table 2).
DISCUSSION
The current study examines the intrareader and interreader reliability of radiographic assessment of knee OA among 3 nonclinicians and measures the agreement between their readings and those of an experienced radiologist. Interreader agreement between nonclinician readers and the radiologist was moderate to almost perfect for the KL score and excellent for the OARSI summary score. There was substantial to almost perfect intrareader reliability among nonclinician readers for the KL and OARSI summary scores, and interreader agreement between them was substantial to almost perfect for these summary measures.
Of the individual radiographic features of OA, osteophytes showed lower agreement than JSN. The agreement for patellar osteophytes and patellofemoral JSN was lowest, possibly because patellar features are not addressed in the OARSI atlas2. While agreement between the radiologist and nonclinicians was moderate to almost perfect for the KL and OARSI summary scores, reliability varied widely for individual radiographic characteristics, depending on the feature considered.
Agreement for KL grade was more variable than agreement for the OARSI summary score. The KL classification has been criticized for its sensitivity to osteophyte size12, which we found to be less reliable than JSN. The OARSI summary score, which sums osteophyte and JSN scores in all but the patellofemoral compartment, is less affected by 1-point differences in individual radiologic features. Therefore, we suggest that the OARSI summary score may be a stronger and more reliable measure of OA severity than the KL score.
Several prior studies have examined the reliability of radiographic measurement of OA. Spector, et al reported on variability in the assessment of knee OA in a longitudinal female cohort study. For KL scores, κ coefficients for intrareader and interreader agreement ranged from 0.66 to 0.88 and from 0.56 to 0.80, respectively5. In a study conducted by Gossec, et al, 3 rheumatologists graded 50 standing radiographs to assess interreader reliability. For KL grade, κ statistics for interreader and intrareader agreement were 0.56 and 0.61, respectively6. Agreement between rheumatologists in Gossec, et al and between readers in Spector, et al is comparable to the agreement observed between nonclinicians and the gold standard readers in our current study.
Riddle, et al examined the reliability of radiographic assessment of knee OA between 2 experienced and 2 inexperienced orthopedic surgeons for 116 patients in the Osteoarthritis Initiative (OAI), a multicenter study of patients who have or are at risk for OA4. They assessed the validity of their readings by comparison to the gold standard, an adjudicated reading by experienced radiologists in the OAI. Two central readers at the OAI evaluated radiographic knee OA at baseline, and if they disagreed on a particular score, a third reader reviewed the film. If the third reviewer agreed with either of the original readers, that score was final. If the third reviewer did not agree with either of the original readers, the 3 readers came to a consensus score together. Comparison to the gold standard KL grade was fair to substantial, with weighted κ statistics ranging from 0.36 to 0.804. In another study from the OAI, Guermazi, et al assessed the reliability between the central and site-specific readings and reported that κ statistics for interreader agreement for lateral and medial JSN were 0.65 and 0.71, respectively, and 0.37 for osteophytes13. Interreader agreement for KL grade was moderate, with κ equaling 0.52. The findings of these 2 OAI reliability studies resemble ours and provide further evidence that reliability even among experienced readers is generally modest.
Many reliability studies are conducted using an experienced clinician reader as the gold standard. In contrast, Sheehy, et al compared assessment of radiographic knee OA to assessment by magnetic resonance imaging (MRI). They found that KL grading, OARSI JSN scoring, and the compartmental grading scale for OA correlated well with MRI findings, with correlation coefficients equaling 0.836, 0.840, and 0.773, respectively14.
To our knowledge, only 1 other study measuring the reliability of radiographic assessment of knee OA among nonclinician readers has been conducted. In a cohort of patients with early symptomatic knee OA (KL 0 or KL 1 at baseline), Damen, et al assessed the interreader reliability among 4 research assistants and a general practitioner (GP) who was experienced in grading knee OA7. The average agreement for KL grade ≥ 1 between the nonclinicians and the GP was moderate, with κ equal to 0.58. Average κ statistics for individual radiographic features, graded based on the OARSI atlas, ranged from 0.12 to 0.80. Agreement between the GP reader and nonclinician readers was similar to that observed between nonclinicians and the radiologist in this analysis. In contrast to our current study, Damen, et al did not report intrareader or interreader reliability between the nonclinician readers, nor did they report agreement for the OARSI Summary Score7.
Our current study has certain limitations. First, the sample size is small. Additionally, all study subjects underwent TKR and had moderate to advanced radiographic OA. Fifty-five of 66 knees were classified as KL 4 by the experienced radiologist, and our results should be generalized cautiously to a population-based sample or to a population with less severe disease. In contrast, the study conducted by Damen, et al assessed the reliability of nonclinician readers in grading early knee OA (KL 0 and KL 1 at baseline) and found agreement similar to ours7; that study may be useful in evaluating the reliability of nonclinician assessment of early knee OA.
Another limitation of the current study is that it was not longitudinal and did not take into account the ability of nonclinician readers to evaluate OA progression. Therefore, we were unable to evaluate the sensitivity and specificity of assessment of structural changes over time. Moreover, while the nonclinician readers participated in the same protocol, they trained in 2 waves; this may have led to minor departures from uniformity in training. Lastly, short view radiographs were read for this study, because full-length radiographs were unavailable. Thus, anatomic alignment and deformity were based on the anatomical and not mechanical axis. Similarly, the study radiographs were done for clinical and not research purposes. For example, this study used clinical protocols that did not involve standardized positioning devices designed to give a metatarsophalangeal view. This limitation should not influence reliability because the different readers used a single image; all readers were exposed to the same image; and the readers did not assess longitudinal change.
These results have significant implications for research on populations with severe OA. It can be expensive to use radiologists to grade knee OA in research settings, and reliability varies even among clinician readers3,4,5,6. The current study highlights the tradeoffs involved in using trained nonclinician readers. The findings show that nonclinician readers assess features of severe knee OA with a level of reliability that may be acceptable for certain study settings. The balance between cost and reader experience must be weighed carefully, and our data will help in this regard. Because our current study focused on a population with advanced knee OA, further research is needed on the reliability of nonclinician assessment of knee OA in populations spanning a range of disease severity. Future studies may also focus on enhancing training for nonclinicians to improve their agreement with expert readers. Additional sessions with the radiologist reader in our study, as well as more independent practice prior to assessing knee OA for the reliability analysis, may have improved the accuracy of nonclinician assessment of OA. Still, our results are in line with reliability studies conducted by experienced clinicians, suggesting that radiographic characterization of knee OA is inherently subjective.
Acknowledgment
We thank Piran Aliabadi, MD, for training the nonclinician readers and for reading radiographic films to provide the gold standard read in this analysis.
Footnotes
Supported by grants from the US National Institutes of Health/National Institute of Arthritis and Musculoskeletal and Skin Diseases: K24AR057827, T32AR055885.
- Accepted for publication March 9, 2016.