Extended report
Chronological reading of radiographs in rheumatoid arthritis increases efficiency and does not lead to bias
  1. Lilian H D van Tuyl (1),
  2. Désirée van der Heijde (2),
  3. Dirk L Knol (3),
  4. Maarten Boers (1,3)
  1. Department of Rheumatology, VU University Medical Center, Amsterdam, The Netherlands
  2. Department of Rheumatology, Leiden University Medical Center, Leiden, The Netherlands
  3. Department of Epidemiology & Biostatistics, VU University Medical Center, Amsterdam, The Netherlands
  Correspondence to Dr Lilian H D van Tuyl, Department of Rheumatology, VU University Medical Center, PO Box 7057, Amsterdam 1007 MB, The Netherlands; L.vantuyl{at}vumc.nl

Abstract

Objectives To evaluate the difference between chronological and random sequence reading in a series of radiographs with 11 years’ follow-up. In addition, the influence of the starting point and length of series was evaluated.

Methods Two experienced readers independently and repeatedly scored digitised radiographs of 62 patients at time points 0, 2, 5, 8 and 11 years of follow-up from the COBRA follow-up database according to the Sharp/van der Heijde method. A linear mixed model was fitted to the data.

Results Over 11 years the mean scores increased by 3.8 points per year. Compared to random reading, chronological reading resulted in a slightly higher progression rate (by 0.4 points per year; p=0.008) and a lower standard error of the mean total progression rate (0.30 compared to 0.35 for random reading). Over 11 years, this results in a small difference in progression estimates of about five points, but a highly relevant reduction of over 25% in the number of patients needed in a study to find a difference in radiological outcome between two groups. Reading of short series, or series including a baseline radiograph, results in a significantly higher yearly progression rate compared to reading of long series, or series not including a baseline measurement.

Conclusions Chronological reading of radiographs is preferred above random reading, due to decreased variability around the estimation of the progression rate; this increased efficiency translates into smaller sample sizes, or increased power to detect small differences. For studies with long-term follow-up, the same two readers should read all radiographs, including baseline.

  • Rheumatoid Arthritis
  • Outcomes research
  • Epidemiology

Introduction

Rheumatoid arthritis (RA) is characterised by chronic joint inflammation, followed by structural damage of joints through cartilage degradation and subchondral bone erosion.1 Radiological damage of joints is one of the key outcome measures of both randomised clinical trials (RCTs) and observational research in RA.2,3

Despite the prominent place of radiographic outcome in RA research, there is still debate on the optimal method to read radiographs of RA patients. To evaluate radiographic damage, films of hands and feet can be read with knowledge of time sequence (chronological or unblinded reading) or without knowledge of time sequence (random sequence or blinded reading). From an epidemiological point of view, the latter method seems less susceptible to bias. With this in mind, drug regulatory agencies decided that films in RCTs need to be read by a reader who is blinded to time sequence in order for a drug to be considered for registration.4,5 However, there is compelling evidence that reading with a known sequence is methodologically superior to reading with an unknown sequence, resulting in a better signal-to-noise ratio and improved detection of clinically relevant changes in the patient without serious overestimation of non-relevant findings.6,7 To date, this has only been shown in studies with up to 3 years’ follow-up; higher progression rates seen in short series read chronologically might eventually result in much higher long-term damage scores than series read in random order. When follow-up increases, reading without known sequence becomes increasingly difficult to arrange; it also becomes more artificial, since in most patients damage will increase over the years, unblinding the reader to the time order.

With long-term follow-up studies, short and long-term results are usually studied separately, which gives rise to another question: do prior time points have to be re-read to assess progression beyond the last evaluated time point? There are no data available on the influence of incorporating the baseline film on the outcome of radiographic progression in RA; most long-term follow-up studies re-read a small sample of prior films, including the baseline film, alongside the new long-term follow-up time point.8,9 However, re-reading films is time consuming; moreover, the difference in progression between treatment groups for the initial trial period (ie, 1 year) is already known, which makes the null hypothesis of ‘no difference’ in many cases a priori untrue.

To resolve the uncertainty about the optimal method to read radiographs, especially in long-term follow-up studies in RA, we studied the difference between chronological reading and random reading. In addition, we evaluated the influence of the starting point as well as the length of series of radiographs on outcome.

Methods

Design

Radiographs from the COBRA trial were available for this study. The 1-year results of the COBRA trial were published in 1997; in 2004 and 2011, follow-up studies were conducted including radiographs; all studies were approved by medical ethical committees and all patients gave written informed consent.10–12 All radiographs were converted to digital images. Thus, a unique digital radiology database of hand and foot films of all COBRA patients with 11 years’ follow-up was available for this study.

Specifically, radiographs of 62 patients with films of hands and feet available at 0, 2, 5, 8 and 11 years were selected. Sets were specifically selected based on completeness of follow-up, quality of the films and, most importantly, distribution of damage progression (results from previous readings) to optimally represent low or high damage at baseline, and low or high damage progression as present in the dataset, with the median as cut-off point. Thus, equally sized subsets of patients were available to represent one of four subgroups: initial damage and progression both low; initial damage and progression both high; initial damage low and progression high; and finally, initial damage high and progression low.

Four series were constituted, two long versus two short series and two series starting at baseline versus two series starting at 2 years’ follow-up (figure 1). The four series were coded differently for the purpose of random versus chronological reading, resulting in eight series. There was no logical link between the codes for chronological versus random films. For details on coding and ordering of films, see online supplementary appendix.

Figure 1

Study design. One box represents 62 sets of hand and foot images, read in chronological order (black) or random order (grey). Series A and B include a baseline reading, while series C and D start at 2 years’ follow-up. Series A and C are short series with two time points, while series B and D are long series with four time points.

Two experienced readers were asked to independently read a selection of radiographs according to the Sharp–van der Heijde (SvH) method.13,14

Before the start, the two readers were asked to read a test series of films of 10 patients and two time points; scores were discussed with a third experienced reader (DvdH) to create consensus over details of the reading method.

Then, each reader received the digital database of radiographs on hard disk, organised as described in the online supplementary appendix. Readers were given access to a state-of-the-art computer with two high resolution (1920×1200 px) 24-inch screens. Readers were aware of the study question, but unaware of the number of repeated readings.

The readers entered data into software specifically designed for the SvH method (IRDE 1.02, van der Heijde).

Statistical analysis

Data were checked for discrepancies between readers using a Bland and Altman plot. If readers differed by more than 15 SvH points in progression score in any series (ie, twice the standard deviation of the difference between the two readers), the entire series was offered to both readers to re-read. To avoid bias in the estimation of the different variance components, the new scores were used as the final score. The complex coding of the data was back-translated to the original trial coding for analysis.
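
The re-read rule can be sketched in a few lines. This is a minimal illustration and not the software used in the study: the function name `series_to_reread`, the data layout and the example scores are hypothetical; only the 15-point threshold is taken from the text.

```python
def series_to_reread(progression_a, progression_b, threshold=15.0):
    """Return the ids of series whose progression scores differ between
    the two readers by more than `threshold` SvH points.

    progression_a / progression_b: dicts mapping a series id to the
    progression score (last minus first time point) from reader A / B.
    """
    flagged = []
    for series_id in sorted(progression_a):
        diff = progression_a[series_id] - progression_b[series_id]
        if abs(diff) > threshold:
            flagged.append(series_id)
    return flagged


# Hypothetical progression scores for three series (illustration only).
reader_a = {"s01": 12.0, "s02": 40.0, "s03": 5.0}
reader_b = {"s01": 10.0, "s02": 22.0, "s03": 21.0}
flagged = series_to_reread(reader_a, reader_b)
print(flagged)  # ['s02', 's03']
```

The comparison is strict, so a difference of exactly 15 points would not trigger a re-read in this sketch.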

In some cases radiographs were of suboptimal quality and one or both readers decided not to enter a score for one or more joints. Since techniques to account for missing data can influence the research question, a missing data analysis was done to see if there was a difference in the number of missing data on the joint level for: (a) random versus chronological reading; (b) little damage progression versus large damage progression; and (c) short versus long series.

A sensitivity analysis was done by creating a second dataset that excluded two patients with the most missing data and comparing the outcomes to the original dataset.

Intra-class correlation coefficients (ICCs) for each of the eight series, averaged over all time points in that series, were calculated as a measure of agreement between readers. The ICCs were calculated according to a design where patients are nested within patient groups (baseline damage×progression), crossed with the factors ‘observer’ and ‘time’, with patient and observer random, resulting in the following formula: ICC = (patient variance + patient×time variance)/total variance.15
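
As an illustration of the ICC formula, the sketch below computes the ratio from variance components. The function name, the grouping of all remaining components into a single `var_other` argument, and the numerical values are assumptions made for this example; they are not estimates from the study.

```python
def icc_agreement(var_patient, var_patient_time, var_other):
    """ICC = (patient variance + patient x time variance) / total variance.

    var_other is the sum of all remaining variance components
    (for example observer-related and residual variance).
    """
    signal = var_patient + var_patient_time
    total = signal + var_other
    if total <= 0:
        raise ValueError("total variance must be positive")
    return signal / total


# Hypothetical variance components (illustration only).
icc = icc_agreement(var_patient=300.0, var_patient_time=100.0, var_other=100.0)
print(round(icc, 2))  # 0.8
```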

Linear mixed modelling

A linear mixed model was fitted to the data, with unrestricted intercepts and a modelled linear effect of time. The following effects were entered a priori: reading method, length of a series (short/long), (incorporation of) baseline (0/2-year start), baseline damage (high/low), progression (high/low). The latter two variables define four patient groups. Hence we started with a model with 64 fixed effects (32 intercepts and 32 slopes). In a backward stepwise procedure, non-significant (p>0.05) interactions with time were removed from the model, starting with the highest order interaction. This resulted in five first-order interactions (baseline damage×time; progression×time; baseline×time; length of a series×time; reading method×time) and two second-order interactions (progression×baseline×time; baseline×length of a series×time). Apart from the residual term, intercept and the slope of time were random.
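
The count of 64 fixed effects follows from crossing the five two-level factors: each of the 2^5 = 32 combinations gets its own intercept and its own slope over time. The short sketch below reproduces that count; the factor and level labels are chosen here for illustration only.

```python
from itertools import product

# The five two-level factors entered a priori (labels are illustrative).
factors = {
    "reading_method": ("chronological", "random"),
    "series_length": ("short", "long"),
    "start": ("baseline", "2-year"),
    "baseline_damage": ("low", "high"),
    "progression": ("low", "high"),
}

# Every combination of factor levels defines one cell of the full model.
cells = list(product(*factors.values()))
n_intercepts = len(cells)  # one intercept per cell
n_slopes = len(cells)      # one time slope per cell
n_fixed_effects = n_intercepts + n_slopes
print(n_intercepts, n_slopes, n_fixed_effects)  # 32 32 64
```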

To evaluate efficiency of the two reading methods, a larger model with separate covariance structures for each of the methods was fitted, since the simplified model above was optimised for calculating progression rates, but not SE of progression rates. Efficiency was calculated with SEs of progression rates between chronological (c) and random (r) reading as: (SEc/SEr)^2.
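
A worked example of this ratio, using the standard errors reported for the two reading methods (0.30 for chronological and 0.35 for random reading). The function name is hypothetical, and reading one minus the ratio as the relative saving in patients rests on the usual assumption that the required sample size scales with the variance of the estimate.

```python
def relative_efficiency(se_chronological, se_random):
    """Squared ratio of standard errors, (SEc/SEr)^2."""
    return (se_chronological / se_random) ** 2


ratio = relative_efficiency(0.30, 0.35)
# Assumption: required sample size is proportional to the variance,
# so 1 - ratio approximates the relative reduction in patients needed.
saving = 1.0 - ratio
print(f"{ratio:.3f} {saving:.1%}")  # 0.735 26.5%
```

Under that assumption the result agrees with the reported saving of over 25% of patients.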

Raw analysis

Mean (SD) and progression of scores per series and time point were calculated and differences in SvH scores between combinations of series were analysed using Student t tests.

The mean of the total SvH scores of the two readers was used in all analyses. Progression scores and differences in progression scores are presented with 1 decimal, ICCs and SEs are presented with 2 decimals. Series are coded A to D to reflect the different conditions according to figure 1. All series are read in chronological and random order but differ in length of follow-up and starting point; series A consists of two time points including a baseline radiograph, series B consists of four time points including a baseline radiograph, series C consists of two time points starting at 2 years’ follow-up and series D consists of four time points starting at 2 years’ follow-up.

Results

Of the 62 selected patients, 70% were female; at baseline, their mean (SD) age was 47 (12) years and disease duration was 5 (5) months. Since the SvH system scores 44 joints for erosions and 44 joints for narrowing, with 62 patients and 24 readings per patient (see figure 1), each reader read 1488 images, with a total of 130 944 joint scores.
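
The reading workload can be checked with simple arithmetic from the design in figure 1. The variable names below are chosen for illustration; the numbers are those given in the text.

```python
# Time points per series according to the study design (figure 1).
time_points_per_series = {"A": 2, "B": 4, "C": 2, "D": 4}
reading_methods = 2               # chronological and random
patients = 62
joint_scores_per_image = 44 + 44  # erosion scores plus narrowing scores

readings_per_patient = sum(time_points_per_series.values()) * reading_methods
images_per_reader = patients * readings_per_patient
joint_scores = images_per_reader * joint_scores_per_image
print(readings_per_patient, images_per_reader, joint_scores)  # 24 1488 130944
```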

There was no difference in the number of missing data between chronological and random reading (3243 vs 3227 joints on either erosion or narrowing score, respectively, of a total of 130 944 scores, 2.5%). Likewise, the number of missing data between patients with more or less damage at a time point prior to the evaluated time point was equal, as was the number of missing data between short versus long series (mean 2.2 vs 2.1 per film, respectively). Hence, we decided to choose a simple approach to account for missing data, by imputing a score of 0 to a missing erosion or narrowing score in a joint. As a consequence, patients with missing data in a damaged joint would score slightly lower in that assessment.
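
The effect of this imputation on a total score can be illustrated as follows. The function name and the joint scores are hypothetical; only the rule (a missing joint score counts as 0) and the counts of missing scores come from the text.

```python
def total_score(joint_scores):
    """Sum joint-level scores, counting a missing score (None) as 0."""
    return sum(0 if score is None else score for score in joint_scores)


# Hypothetical joint scores for one film (illustration only).
complete = [2, 3, 0, 1]
with_gap = [2, None, 0, 1]  # the damaged joint (score 3) could not be read
print(total_score(complete), total_score(with_gap))  # 6 3

# Share of missing joint scores for chronological reading, as reported.
missing_share = 3243 / 130944
print(round(missing_share * 100, 1))  # 2.5
```

The second film scores lower only because the unreadable joint was damaged, which is the downward shift described above.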

In total, 27 (11%) of chronological series and 59 (24%) of random series were re-read to correct discrepancies in the original reading, and the results of the second reading were used in the analysis, as described in Methods. The ICCs of the sets read in chronological order ranged between 0.74 and 0.85 and those of the sets read in random order between 0.61 and 0.73. There were no differences between ICCs from short versus long sets.

The mean yearly progression rate (averaged over all iterations) in this dataset was 3.8 SvH points. Figure 2 illustrates the mean total SvH scores at each time point and the progression of SvH over time per series.

Figure 2

Mean and 95% CI of the total Sharp–van der Heijde (SvH) scores at each time point and the progression of SvH score over time per series, split for chronological and random reading method. Series A and B include a baseline reading, while series C and D start at 2 years’ follow-up. Series A and C are short series with two time points, while series B and D are long series with four time points.

Chronological reading is more efficient

Compared to random reading, chronological reading resulted in a slightly higher yearly progression rate (4.0 vs 3.6 points per year; p=0.008). Chronological reading was also more precise: the SE of the mean progression rate was 0.30 compared to 0.35 for random reading. Over 11 years, the difference of 0.4 points per year in radiological progression between reading methods results in a small difference in progression of about five points, but a highly relevant gain in efficiency: over 25% fewer patients are needed in a study to find a difference in radiological outcome between treatment groups. Post hoc, we studied the efficiency increase in the patients with low progression rates (two subgroups having either low or high initial damage scores, n=30) to emulate current trials. In these patients the efficiency gain of chronological reading was lower at 8%.

The raw analyses confirmed the findings of the statistical model regarding the difference between chronological reading and random reading. Although there were no significant differences in absolute SvH scores per time point between the two reading methods, progression scores between the two reading methods were significantly different (with an exception for series A). The chronological reading method scored more progression than the random reading method (figure 2).

The baseline film should be included

The progression rate is estimated to be significantly higher (1.1 points per year, p<0.001) when the baseline radiograph is read than when the baseline radiograph is not read. This difference is more pronounced in short series (1.7 points per year, p<0.001) than in long series (0.4 points per year, p=0.034), but equally large for chronological versus random reading.

Moreover, the estimation of the score at time point 2 is 2.6 points higher for series including the baseline value compared to series not including the baseline value, when read in chronological order (p=0.001). When read in random order, this difference is smaller and does not reach significance (1.1 point; p=0.16). In addition, the estimation of the score at time point 2 is 2.7 points higher for short series including the baseline value compared to short series not including the baseline value (p=0.004). For long series, this difference is smaller and does not reach significance (0.9 point; p=0.16).

The mean yearly progression rate for short series is 1.1 points per year higher than the progression rate for long series (p<0.001). Moreover, the CI around the estimation of the progression rate for short series is wider than that for long series (figure 3) and the standard error of the progression rate for short series is larger than for long series (0.38 vs 0.28).

Figure 3

Mean yearly progression scores with 95% CI for short versus long series and series including/not including a baseline measurement, split for chronological reading (black) and random reading (grey). Series A and B include a baseline reading, while series C and D start at 2 years’ follow-up. Series A and C are short series with two time points, while series B and D are long series with four time points.

The findings of the model concerning time point 2 years were mostly confirmed in the raw analysis; at 2 years, series that include the baseline reading have a 2.4-point higher score than series that do not include the baseline: 20.4 (22.7) vs 17.9 (19.3) (p=0.071). When looking at 2 years for chronological and random reading separately, this difference is slightly larger for chronological reading (2.9 (1.8), p=0.11) than for random reading (2.0 (2.0), p=0.32). There is no significant difference in absolute scores at T2 between short (19.6 (22.1)) vs long (18.1 (20.1)) series. For series read chronologically, but not for series read in random order (both 20 (22)), there is a slight, non-significant difference between scores at T2 for short (19.3 (21.6)) vs long (17.4 (18.5)) series.

Sensitivity analyses on a subset of the original dataset that excluded the two patients with the most missing data provided highly similar results (data not shown).

Discussion

This long-term study strongly suggests that chronological reading of radiographs should be preferred above random reading: the estimates obtained by the two methods were highly similar but precision was increased by chronological reading. This increase in precision represents an important gain in efficiency of RCTs, resulting in smaller sample sizes for the same power, or smaller detectable differences at the same power. This is becoming an increasingly relevant feature of measurement tools, since placebo controlled studies are no longer the standard and differences between effective treatment groups become smaller. Moreover, chronological reading more closely resembles the routine clinical situation.

Strengths of this study include the extensive efforts made in the design to minimise the chance of bias, and the advanced analyses that allowed estimation of all known sources of variability that might influence progression rates in clinical studies. A weakness is the poor quality of some images of the digitised radiological dataset, an almost unavoidable feature in the majority of clinical (observational) studies to date. With most centres now switched to digital imaging at the source, this problem is likely to become less common in the future. However, positioning of hands and feet remains a common technical problem. In this study we demonstrated that the missing data did not influence the results.

This study is not the first to investigate methods of reading radiographs in RA. In 1987, Fries et al showed that the precision of paired scoring (reading more than one film simultaneously) was greater than reading single films separately.16 In 1997 two studies confirmed this finding.17,18 In 1999, van der Heijde et al showed that reading films in chronological order results in higher progression rates and a better signal-to-noise ratio than scoring films in blinded pairs or singly.6 In 2002 Bruynesteyn et al7 reported that knowing the chronological sequence leads to an increase in detecting clinically relevant changes in the patient without serious overestimation of non-relevant findings.

Despite this evidence, a review on reporting of radiographic measures in RA showed that of RCTs with radiographic outcomes, 53% of studies used a random reading technique and 44% used chronological reading.19

The continued popularity of random reading is most likely due to the regulatory agencies that require radiographs to be scored in random order. This requirement comes from a time when there were doubts whether chronological reading would introduce an expectation bias, since the readers are aware of time sequence and expect scores to remain stable or worsen. Conceptually it remains a challenge to follow the reasoning behind this requirement, for two reasons: first, even if such a bias were present, it would affect all patients and groups in a trial in a similar way, and thus not bias the estimate of the difference in progression between the groups; and second, the expectation would increase the estimate of damage, which is an adverse outcome. Assuming the existence of damage progression in an RA group is the conservative position, and documenting some damage progression makes it more difficult to claim the ‘arrest’ of damage progression. In any case the problem of expectation bias has partly been solved by the new scoring guidelines that allow for improvement of scores and thus diminish the expectation of scores to stabilise or worsen.20 For any remaining doubts the current study provides evidence for the absence of bias when reading series in chronological order, since the difference in progression between the two methods is very small. Given the absence of any relevant difference in the estimate of progression, the gain in efficiency becomes of prime importance, because time and money can be saved in order to achieve the same results.

It can be questioned whether the results of our analysis can be directly translated to RCTs on treatments that can (almost) completely suppress radiological progression. The data used for the current analysis included a balanced sample of patients with and without progression, and with or without baseline damage. In the subgroup of patients with low progression rates, the gain in efficiency was smaller. This is to be expected, since the noise around small differences is likely to be smaller than the noise around large differences. However, even in trials with little progression, it is advisable to use the most efficient method of reading radiographs.

The study also demonstrates that reading the baseline radiograph makes a difference. This implies that ‘extrapolation’ to earlier time points should be avoided. Our advice for studies with long-term follow-up is to have the same two readers read all radiographs, including baseline.

Finally, the study showed that the estimated yearly progression rate for short series is significantly higher than that for long series, with a wider CI and larger standard error for short series. This seems logical, with the accuracy of the estimate increasing as more data (observation years) accumulate. However, effectiveness of new drugs in clinical trials is always evaluated based on short-term data. Fortunately, the higher progression rate will influence both treatment groups equally and will therefore not influence the evaluation of effectiveness of a study drug. But care must be taken when extrapolating results found in short series over longer periods, since the progression rate is likely to be slower over longer periods.

In conclusion, chronological reading of radiographs yields similar estimates of progression with higher precision, and should thus be preferred above random reading; this increased efficiency translates into smaller sample sizes, or increased power to detect small differences. In addition, chronological reading has more face value as it more closely resembles the routine clinical situation. For studies with long-term follow-up, ideally the same two readers should read all radiographs including baseline.

Acknowledgments

We thank Dr M Thabet and Dr J van Nies for reading the radiographs in this study and the Dutch Arthritis Association for funding this study.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Handling editor Tore K Kvien

  • Contributors All authors contributed significantly to the conception and design, analysis and interpretation of data, drafting of the article and revising it critically for important intellectual content. In addition, all authors approved the final version to be published.

  • Funding Provided by the Dutch Arthritis Association, which had no role in the study design, collection, analysis and interpretation of data, writing the report and the decision to submit the paper for publication.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.