Abstract
Previously reported data on 5 computer-based programs for measuring joint space width, focusing on discriminating ability and reproducibility, are updated with new data. Four of the 5 programs were more discriminating than observer scoring for change in narrowing over the 0–12 month interval. Three of 4 programs were more discriminating than observer scoring over the 0–18 month interval. The program that failed to discriminate over the 0–12 month interval was not the same program that failed over the 0–18 month interval. The committee agreed at an interim meeting in November 2007 that an important goal for computer-based measurement programs is a 90% success rate in measuring joint pairs in followup studies; that is, the same joint must be measured in the images at both timepoints in order to assess change over time in serial radiographs. None of the programs met this 90% threshold, but 3 achieved success rates of 85%–90%. Intraclass correlation coefficients for assessing change in joint space width in individual joints were 0.98 or 0.99 for 4 programs. The smallest detectable change was < 0.2 mm for 4 of the 5 programs, representing 29%–36% of the change within the 99th percentile of measurements.
A committee to examine and test the feasibility and reliability of computer-based measurements of features in hand and foot radiographs was established by OMERACT in 2002. Individuals known to be working on computer methods of measurement were invited to join, and 5 groups have actively participated1–8. Two groups, Ziekenhuis Groep Twente, Hengelo, and the Medical University of Vienna, Vienna, Austria, collaborated with technical departments, with graduate students contributing much or all of the programming. Two new groups have recently joined the committee’s efforts.
As its initial project, the committee undertook the testing of programs that measure joint space width9,10. There is agreement that a successful computer-based program for measuring joint space width in metric units must demonstrate sensitivity to change equal to or greater than that of observer scoring in discriminating between treatment arms in clinical trials. To this end, 5 developers have measured joint space width in the COBRA trial image set to compare computer measurements with observer scoring10.
Since the data were presented at OMERACT 8, one program that had not yet completed measurements on the full COBRA set has now done so, and another program has been extensively revised9,10. The additional data are included in Table 1 (also available from: www.omeract.org; originally published in9,10). The data demonstrate that computer-based measurements are more discriminatory than observer scoring in 11 of 14 comparisons. It should be noted that data were taken into consideration only if the program successfully measured at least 50% of the required joint pairs per patient; this prerequisite explains why the numbers of patients assessed differed across methods.
Table 1. Discriminatory ability of 5 computer programs compared with semiquantitative observer scoring in measuring change in joint space width in the monotherapy and COBRA therapy groups of the COBRA trial, taking all measured joints into consideration. Published in part in a previous OMERACT report10. Since then, Duryea has completed the measurements with his program, one program (method D in10) was omitted because of measurement failure, and a revised Sharp program was added. Data presented here include the mean change over all measured joints; the type and number of joints measured per program differ.
The committee has agreed that, for automated scoring to be used in studies, data should be complete for at least 90% of paired measurements. This implies successful measurement of joint space width for the same joints at both timepoints in serial radiographs, a requirement for calculating change in joint space width as an outcome measure.
The committee has identified multiple causes of measurement failure, which are listed in Table 2. To gain insight into the practical reasons for measurement failure, the reasons were recorded during reevaluation of the entire set of 107 cases with the revised Sharp program, which measures a set of 34 individual joints. Overall, 2.9% of all single-joint measurements failed for one of these reasons; only 0.2% failed because the program was unable to find the joint margins. The criterion of at least 90% successful joint assessments per patient (≥ 31/34 joints) was met in 92% of cases when only one successful timepoint was required, and in 87% of cases when 2 successful timepoints (for assessing change) were required. This result is comparable to those of the 2 best performing programs in the previous report10.
Table 2. Potential reasons for measurement failure in the assessment of joint space width by computer programs.
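To make this completeness bookkeeping concrete, the following minimal sketch (Python, with simulated data; the array layout, failure rate, and variable names are hypothetical) checks the per-patient criterion of ≥ 31/34 successfully measured joints, both for a single timepoint and for pairs of timepoints:

```python
# Sketch: per-patient completeness of joint space width measurements.
# Simulated data; not the committee's actual pipeline.
import numpy as np

N_PATIENTS, N_JOINTS = 107, 34
rng = np.random.default_rng(42)

# width[t, p, j] = measured width (mm) at timepoint t; NaN = measurement failure
width = rng.uniform(0.5, 3.0, size=(2, N_PATIENTS, N_JOINTS))
width[rng.random(width.shape) < 0.03] = np.nan   # ~3% single-joint failures

measured = ~np.isnan(width)
paired = measured[0] & measured[1]               # joint measured at BOTH timepoints

single_ok = measured[0].sum(axis=1) >= 31        # one timepoint suffices
change_ok = paired.sum(axis=1) >= 31             # both timepoints (change scores)

print(f"criterion met, single timepoint: {single_ok.mean():.0%} of patients")
print(f"criterion met, paired timepoints: {change_ok.mean():.0%} of patients")
```

With roughly 3% of single-joint measurements failing at random, such a simulation reproduces the qualitative pattern described above: the paired criterion is harder to meet than the single-timepoint one.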
Although our committee has not proposed a standard for reliability, intraclass correlation coefficients (ICC) give an indication of the relative agreement, and smallest detectable changes (SDC) an indication of the absolute agreement, of the measurement programs. To test the reproducibility of the 5 computer programs, the complete assessment was repeated in a set of 30 selected paired cases, all belonging to the COBRA dataset. This set included radiographs taken at baseline and at 18 months’ followup. Some of these data have been published in aggregated form10. Table 3 summarizes the results of 4 of the 5 previous computer programs, plus the revised Sharp method, more comprehensively, so that reliability per program and per joint group can be examined.
Table 3. Relative and absolute agreement of 5 computer programs with respect to measuring joint space width and change in joint space width, assessed twice by all methods in a set of 30 paired images from the COBRA trial.
As expected, ICC for status scores were higher than ICC for change scores for all programs. When all joints were taken into account, 4 of the 5 programs yielded ICC for status scores between 0.97 and 0.99, and ICC for change scores above 0.80 for the same 4 methods, reflecting acceptable relative agreement. However, the SDC varied from 29% to 41% of the measured range (99th percentile), indicating considerable residual measurement error.
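For readers who wish to compute such agreement statistics themselves, the sketch below (Python, simulated data) derives a two-way random-effects, absolute-agreement ICC and an SDC from a set of change scores assessed twice. The SDC formula shown (1.96 × SD of the test-retest differences / √2) is one common formulation and may differ from the exact definition used in this report.

```python
# Sketch: relative (ICC) and absolute (SDC) agreement from twice-assessed
# change scores. Simulated data; formulas are common choices, not
# necessarily those used in the report.
import numpy as np

rng = np.random.default_rng(0)
true_change = rng.normal(0.1, 0.3, size=30)        # 30 paired cases
read1 = true_change + rng.normal(0, 0.05, 30)      # first assessment
read2 = true_change + rng.normal(0, 0.05, 30)      # repeat assessment

# ICC(A,1): two-way random effects, absolute agreement, single measure
x = np.stack([read1, read2], axis=1)               # cases x assessments
n, k = x.shape
grand = x.mean()
ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# SDC from the spread of test-retest differences
diff = read1 - read2
sdc = 1.96 * diff.std(ddof=1) / np.sqrt(2)

print(f"ICC(A,1) = {icc:.3f}, SDC = {sdc:.3f} mm")
```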
DISCUSSION
Theoretically, measuring change in joint space width in metric units should be more reproducible than semiquantitative observer scoring, since a well constructed computer program may reduce operator/observer influence to near zero. Greater reproducibility may translate into greater sensitivity to change over time, and the studies done by this committee support this contention. In addition, metric units make more sense to most people than van der Heijde, Larsen, or Sharp units: for someone not regularly using score data, a statement that a joint has narrowed by 0.1 or 0.5 mm is more readily visualized than a narrowing score that has increased by 1 or 2 units.
Are computer-based programs for measuring joint space width ready for use in clinical trials? The committee members believe they are, provided the conditions for digitizing the images meet the high standards employed in the full COBRA trial, in which images were digitized at 50-micron pixel size. Under these conditions an acceptable success rate was initially obtained by 3 computer programs; the other 2 programs were unable to measure a sufficient number of paired joints to be useful in clinical trials. The 50-micron pixel size (20 pixels/mm) is a higher resolution than is routinely employed by many clinical research organizations, which use 100-micron pixel size (10 pixels/mm) in clinical trials. Unless future image resolution is standardized at the higher value, computer programs need to be validated for use on images recorded at 100-micron pixel size. The issue of digitization resolution needs to be resolved before measurements with the current instruments can be recommended for clinical trials employing resolutions coarser than 50-micron pixel size.
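The practical import of pixel size can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not any program's actual error model; it shows how an uncertainty of one pixel at each joint margin translates into millimeters at the two resolutions:

```python
# Back-of-the-envelope: how digitization resolution bounds precision.
for pixel_um in (50, 100):
    px_per_mm = 1000 / pixel_um          # 50 um -> 20 px/mm, 100 um -> 10 px/mm
    span_px = 2.0 * px_per_mm            # pixels spanned by a 2 mm joint space
    bound_mm = 2 / px_per_mm             # 1-pixel error at each margin, worst case
    print(f"{pixel_um} um pixels: {px_per_mm:.0f} px/mm, "
          f"2 mm space = {span_px:.0f} px, ~{bound_mm:.2f} mm localization bound")
```

At 100-micron pixels, a 2-pixel worst-case uncertainty already approaches the ~0.2 mm SDC reported above, which is one reason validation at the coarser resolution matters.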
Additionally, since COBRA included only patients with early disease, and consequently little baseline damage, the programs need to be validated in patients with more extensive baseline damage. Theoretically, this may lead to lower success rates per patient, since measurement failure is more common in damaged joints.
Although the high standard of a 90% success rate in measuring both images of paired image sets was not reached by any of the programs, the majority of failures were due to image problems or structural abnormalities that precluded measurement. These factors may also influence observer scoring. No record was available of how the readers who scored the radiographs handled these structural problems, other than subluxation, which is treated as joint space narrowing in the Sharp-van der Heijde method. Methods for imputing data missing for technical reasons should be carefully considered; multiple imputation based on mixed-effects analysis, as suggested recently by Baron, et al, may offer the smallest bias11.
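As a rough illustration of the imputation idea, the sketch below fits a random-intercept mixed model to observed joint-level change scores and fills in failed joints repeatedly with model-based draws. This is a simplified sketch, not the procedure of Baron, et al; the data, column names, and number of imputations are hypothetical, and a full multiple imputation would also propagate parameter uncertainty.

```python
# Simplified multiple-imputation sketch for joints that failed measurement.
# Hypothetical data; not the exact method of Baron, et al.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_pat, n_joint = 20, 34
df = pd.DataFrame({
    "patient": np.repeat(np.arange(n_pat), n_joint),
    "joint": np.tile(np.arange(n_joint), n_pat),
})
# Simulated change in joint space width (mm) with a per-patient effect
pat_effect = rng.normal(0, 0.08, n_pat)
df["change"] = pat_effect[df["patient"]] + rng.normal(0.05, 0.15, len(df))
df.loc[rng.random(len(df)) < 0.05, "change"] = np.nan   # ~5% failures

obs = df.dropna(subset=["change"])
fit = smf.mixedlm("change ~ 1", obs, groups=obs["patient"]).fit()
resid_sd = np.sqrt(fit.scale)                           # residual SD

totals = []
for _ in range(10):                                     # 10 imputed datasets
    filled = df["change"].copy()
    miss = filled.isna()
    # Predicted mean for each missing joint: fixed intercept + patient effect
    mu = fit.fe_params["Intercept"] + np.array(
        [fit.random_effects[p].iloc[0] for p in df.loc[miss, "patient"]])
    filled[miss] = mu + rng.normal(0, resid_sd, miss.sum())
    totals.append(filled.groupby(df["patient"]).sum().mean())

print(f"mean total change per patient: {np.mean(totals):.2f} mm "
      f"(between-imputation SD {np.std(totals):.3f})")
```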
Most of the other issues that can affect the success rate relate to image quality, and many of them have undoubtedly been ignored for too long. The failure of radiographers to include all structures, such as the wrist, little finger, and little toe, is an inexcusable lapse in meeting appropriate standards for investigational studies; from personal experience we can say that such failures are common in clinical trials. Poor dynamic range of images is frequently due to the use of cheap film. OMERACT should take a strong and positive stand on this issue.
In the future the subcommittee on computer-based measurements plans studies to evaluate the effect of image resolution (50- versus 100-micron pixel size) on measurements and to examine whether recording images at 8- or 16-bit gray scale affects computer-based measurements. As indicated above, further evaluations need to be done in patients with more extensive damage. Beyond that, we will begin studies on measuring erosions.
Footnotes
The lead author, John T. Sharp, died September 14, 2008. Professor emeritus and retired rheumatologist, he chaired the OMERACT subcommittee on automated joint measurement. The members of the subcommittee and co-authors of this article commemorate Prof. Sharp as an inspiring, scholarly, and ever-enthusiastic forerunner in the development of automated joint measurement, and intend to continue the scientific work on automated joint measurement in his spirit.