Interpreting Studies of Diagnostic Accuracy

In this issue of The Journal, Payet, et al examine a test for elevated anticyclic citrullinated peptide antibody (anti-CCP) levels and demonstrate its inability to identify rheumatoid arthritis (RA) among anti-CCP-positive patients with rheumatic disorders¹. Their results seem to be at odds with previous findings, which have shown the high diagnostic accuracy of anti-CCP tests for differential diagnosis of RA²,³. It is therefore worth considering how to interpret studies of diagnostic accuracy appropriately and how to assess their generalizability, whether one is planning, reporting, or reading such studies.

Pepe⁴ lists the following 6 criteria for identifying settings where diagnostic tests would be useful: (1) the disease should be potentially serious, (2) the disease should be relatively prevalent in the target population, (3) the disease should be treatable, (4) the treatment should be available to those who test positive, (5) the test should not harm the individual, and (6) the test should accurately classify diseased and non-diseased individuals. Given that RA is a chronic disease (with a worldwide prevalence of 1%⁵) that can lead to severe disability, premature mortality⁶, and a loss of quality of life⁷, and given that appropriate therapeutic intervention can greatly enhance clinical outcomes⁶, it is clear that the first 4 criteria have been met in this setting. Anti-CCP antibody tests satisfy the fifth criterion, so it remains to establish that they can be used to accurately classify diseased and non-diseased individuals, which motivates studies of diagnostic accuracy such as that considered by Payet, et al¹. Note that there is evidence that this sixth criterion could be met because anti-CCP tests have been shown to be useful in identifying patients with early-stage RA⁸ and predicting which patients will progress from undifferentiated arthritis to RA³,⁵,⁹. However, as we will discuss, it is important to avoid extrapolating diagnostic study results beyond the particular use of the test under study.

Studies of diagnostic accuracy involve evaluating the ability of a novel index test to detect a target condition whose true status is determined through the use of a reference standard¹⁰. The agreement between a binary index test and reference standard can be summarized in a 2 × 2 table and can be used to derive a number of different measures of diagnostic accuracy including sensitivity and specificity, positive and negative predictive values, and positive and negative diagnostic likelihood ratios (DLR), as demonstrated in Table 1. These measures summarize different aspects of the test results: sensitivity and specificity summarize the degree to which the test reflects disease status, and the predictive values summarize the likelihood of disease given the test result, while the DLR reflect the degree to which test results affect a potential diagnosis⁴.

Table 1.

Agreement between a test for high levels of anticyclic citrullinated peptide antibodies (Y) and diagnosis of RA (R) in Payet, et al¹.
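The measures derived from a 2 × 2 table can be sketched in a few lines of Python. This is only an illustration of the standard definitions: the class name `TwoByTwo` and its field names are our own, and the counts in `example` are hypothetical placeholders, not the data of Payet, et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from cross-classifying a binary index test against a reference standard."""

    tp: int  # index test positive, reference positive
    fp: int  # index test positive, reference negative
    fn: int  # index test negative, reference positive
    tn: int  # index test negative, reference negative

    def sensitivity(self) -> float:
        # Proportion of reference-positive subjects who test positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of reference-negative subjects who test negative.
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        # Proportion of test-positive subjects who are reference positive.
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        # Proportion of test-negative subjects who are reference negative.
        return self.tn / (self.tn + self.fn)

    def dlr_positive(self) -> float:
        # Factor by which a positive result multiplies the pretest odds.
        return self.sensitivity() / (1 - self.specificity())

    def dlr_negative(self) -> float:
        # Factor by which a negative result multiplies the pretest odds.
        return (1 - self.sensitivity()) / self.specificity()


# Hypothetical counts for illustration only (assumption, not study data).
example = TwoByTwo(tp=90, fp=20, fn=10, tn=80)
print(f"sensitivity = {example.sensitivity():.2f}")
print(f"specificity = {example.specificity():.2f}")
print(f"PPV = {example.ppv():.2f}, NPV = {example.npv():.2f}")
print(f"DLR+ = {example.dlr_positive():.2f}, DLR- = {example.dlr_negative():.3f}")
```

Note that the predictive values depend on the mix of diseased and non-diseased subjects in the table, whereas sensitivity, specificity, and the DLR are computed within each reference-standard column.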

Studies of diagnostic accuracy have important implications in patient care, but the design and reporting of such studies have often been less than ideal¹⁰. The Standards for Reporting of Diagnostic Accuracy (STARD) initiative resulted in a broad set of guidelines for the reporting of studies of diagnostic accuracy¹⁰,¹¹; these STARD guidelines allow readers to assess the generalizability of study results. This editorial will focus on only 2 specific potential sources of error in generalizing results of diagnostic studies: spectrum bias and imperfect reference standard bias.

Spectrum bias occurs when study results are extrapolated to a population that differs from the study sample in terms of patient characteristics⁴. It is well known, for example, that changes in disease prevalence will directly affect the predictive values of a diagnostic test. Consider the test for elevated anti-CCP whose diagnostic accuracy for RA among anti-CCP-positive patients with rheumatic disorders is summarized in Table 1; if this same test were applied as a screening tool in the general population where the prevalence of RA is 1%, then even if the test retained the same sensitivity, specificity, and DLR, the positive predictive value would drop to 0.0107 from 0.83, while the negative predictive value would rise to 0.9928 from 0.23. The posttest probability of disease is obtained by multiplying the appropriate DLR by the pretest odds of disease (i.e., disease prevalence divided by 1 minus disease prevalence) to get the posttest odds of disease, which are then converted to a probability. For a positive test, 1.07 × 0.01/0.99 = 0.0108 and 0.0108/(1 + 0.0108) = 0.0107; for a negative test, 0.71 × 0.01/0.99 = 0.0072 and 1 − 0.0072/(1 + 0.0072) = 0.9928. Fagan¹² presented a nomogram to graphically represent this Bayesian relationship. This calculation, however, relies on the tenuous assumption that the DLR are the same in these 2 different situations⁴. While sensitivity, specificity, and DLR are not directly affected by changes in disease prevalence, such changes are often indicative of underlying differences in patient characteristics that will directly affect these measures of accuracy¹³. Therefore, studies should not be interpreted as assessing some absolute diagnostic accuracy of a test, but rather as assessing a particular use of a test in a particular setting¹⁴. Diagnostic tests that are effective for use in primary care, for example, may be useless in tertiary care settings¹⁵. This issue can potentially be mitigated through the use of regression modeling, which can help to control for important confounders and identify important subpopulations⁴; however, care should always be taken to avoid extrapolating beyond the population represented by the sample under study.
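The conversion from pretest to posttest probability used in the screening example can be written as a short Python sketch. The function name `posttest_probability` is our own; the DLR values (1.07 and 0.71) and the 1% prevalence are those quoted in the text, and the calculation assumes, as the text cautions, that the DLR carry over unchanged to the new population.

```python
def posttest_probability(pretest_probability: float, dlr: float) -> float:
    """Apply a diagnostic likelihood ratio to a pretest probability of disease."""
    # Convert the pretest probability (e.g., prevalence) to odds.
    pretest_odds = pretest_probability / (1 - pretest_probability)
    # The DLR multiplies the pretest odds to give the posttest odds.
    posttest_odds = dlr * pretest_odds
    # Convert the posttest odds back to a probability.
    return posttest_odds / (1 + posttest_odds)


prevalence = 0.01   # prevalence of RA in the general population, from the text
dlr_positive = 1.07  # positive DLR quoted in the text
dlr_negative = 0.71  # negative DLR quoted in the text

# Probability of disease after a positive test (positive predictive value).
ppv_screening = posttest_probability(prevalence, dlr_positive)
# Probability of no disease after a negative test (negative predictive value).
npv_screening = 1 - posttest_probability(prevalence, dlr_negative)

print(f"PPV at 1% prevalence: {ppv_screening:.4f}")
print(f"NPV at 1% prevalence: {npv_screening:.4f}")
```

Because this sketch does not round the intermediate odds, its output may differ from the hand calculation in the last decimal place.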

Imperfect reference standard bias occurs when the reference standard to which the index test is compared is not a perfect indicator of true disease status⁴. In this case, measures of diagnostic accuracy can be over- or underestimated, depending on the error inherent in the reference standard. Suppose, for example, that the reference standard R used in Table 1 is only able to identify late stage RA, and thus is an imperfect reference standard for the true diagnosis of RA, which we will call D. The true diagnostic accuracy of the test might actually be better represented by Table 2, where we have assumed that R, the reference summarized in Table 1, misdiagnosed 40 early-stage RA patients with elevated anti-CCP as having a non-rheumatoid rheumatic disorder. In this case, many of the true measures of diagnostic accuracy would be underestimated using the results from Table 1. Alternatively, rather than viewing R as an imperfect reference for D and the results of Table 1 as biased estimates of the results of Table 2, one might interpret these as representing different uses of the same test: in Table 1, we are summarizing the utility of our test for identifying patients with late-stage RA, while in Table 2 we are summarizing the utility of our test in diagnosing RA more generally. The appropriate interpretation of study results very much depends on the reference standard and patient spectrum included in the study; any attempt to extrapolate to other settings is likely to be problematic.

Table 2.

Agreement between a test for high levels of anticyclic citrullinated peptide antibody (Y) and a hypothetical true diagnosis of RA (D), given that R in Table 1 is an imperfect reference standard.
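The effect of an imperfect reference standard can be illustrated with a small Python sketch. The cell counts below are hypothetical and are not those of Table 1 or Table 2; only the reclassification of 40 test-positive patients from reference negative (under R) to disease positive (under D) follows the scenario described in the text.

```python
def accuracy_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return the basic accuracy measures for one 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical agreement between the index test Y and the imperfect
# reference R (illustrative counts only; assumption, not study data).
counts_vs_r = {"tp": 200, "fp": 100, "fn": 50, "tn": 150}

# Scenario from the text: R labeled 40 test-positive patients with
# early-stage RA as not having RA. Against the true status D, those
# patients move from the false-positive cell to the true-positive cell.
missed_early_ra = 40
counts_vs_d = dict(
    counts_vs_r,
    tp=counts_vs_r["tp"] + missed_early_ra,
    fp=counts_vs_r["fp"] - missed_early_ra,
)

against_r = accuracy_measures(**counts_vs_r)
against_d = accuracy_measures(**counts_vs_d)

for name in against_r:
    print(f"{name}: vs R = {against_r[name]:.3f}, vs D = {against_d[name]:.3f}")
```

With these illustrative counts, sensitivity, specificity, and positive predictive value are all lower when measured against R than against D, while the negative predictive value is unchanged because only test-positive patients were reclassified.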

When reading the results of Payet, et al¹, as with any study of diagnostic accuracy, it is very important to consider the study population and the reference standard before deciding whether the results can be generalized to your setting. The utility of the test under study would be very different in a generally healthy population (anti-CCP distributions differ between patients with rheumatic disorders and the general population⁵), or even among all patients with non-rheumatoid rheumatic disorders (according to the results of Payet, et al, specificity of a test for elevated anti-CCP would be as high as 93% if not restricting to the anti-CCP-positive group where specificity is only 21%). Additionally, it is important to note that Payet, et al used diagnoses of RA based on the American College of Rheumatology 1987 revised criteria¹⁶ rather than the 2010 criteria⁶ because of the latter’s reliance on anti-CCP testing. This was done in an effort to avoid incorporation bias, which could have artificially inflated measures of diagnostic accuracy because of the lack of independence between the index and reference tests¹⁷. However, as acknowledged in their discussion, this approach potentially led to an under-identification of cases of early-stage RA⁶, which is where anti-CCP testing is particularly useful⁵. It could, therefore, result in imperfect reference standard bias if one attempted to extrapolate these conclusions as an assessment of the utility of the test for diagnosing early-stage RA or predicting RA development. Such conclusions should only be drawn based on longitudinal studies that compare baseline test results to disease statuses measured at a later stage using time-dependent measures of diagnostic accuracy⁴.

It is necessary to understand the setting in which a test was conducted, to avoid extrapolation biases. Such biases and misunderstandings can be mitigated if those conducting studies of diagnostic accuracy follow these 4 guidelines: (1) explicitly define the particular use of the test of interest, (2) carefully consider whether the population and the reference standard under study are consistent with this use, (3) use regression models to control for important concomitant factors when comparing tests, and (4) follow STARD guidelines in reporting results to ensure that readers can appropriately assess the generalizability of study results and examine potential sources of error.

REFERENCES

1. Payet J, Goulvestre C, Bialé L, Avouac J, Wipff J, Job-Deslandre C, et al. Anticyclic citrullinated peptide antibodies in rheumatoid and nonrheumatoid rheumatic disorders: Experience with 1162 patients. J Rheumatol 2014;41:xxxx.
2. Kudo-Tanaka E, Ohshima S, Ishii M, Mima T, Matsushita M, Azuma N, et al. Autoantibodies to cyclic citrullinated peptide 2 (CCP2) are superior to other potential diagnostic biomarkers for predicting rheumatoid arthritis in early undifferentiated arthritis. Clin Rheumatol 2007;26:1627–33.
3. Taylor P, Gartemann J, Hsieh J, Creeden J. A systematic review of serum biomarkers anti-cyclic citrullinated peptide and rheumatoid factor as tests for rheumatoid arthritis. Autoimmune Dis 2011;2001:815038.
4. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2003.
5. Avouac J, Gossec L, Dougados M. Diagnostic and predictive value of anti-cyclic citrullinated protein antibodies in rheumatoid arthritis: a systematic literature review. Ann Rheum Dis 2006;65:845–51.
6. Aletaha D, Neogi T, Silman AJ, Funovits J, Felson DT, Bingham CO, et al. 2010 rheumatoid arthritis classification criteria: an American College of Rheumatology/European League Against Rheumatism collaborative initiative. Arthritis Rheum 2010;62:2569–81.
7. Aggarwal R, Liao K, Nair R, Ringold S, Costenbander KH. Anti-citrullinated peptide antibody assays and their role in the diagnosis of rheumatoid arthritis. Arthritis Care Res 2009;61:1472–83.
8. Takasaki Y, Yamanaka K, Takasaki C, Matsushita M, Yamada H, Nawata M, et al. Anticyclic citrullinated peptide antibodies in patients with mixed connective tissue disease. Mod Rheumatol 2004;14:367–75.
9. van der Linden MP, van der Woude D, Ioan-Facsinay A, Levarht EW, Stoeken-Rijsbergen G, Huizinga TW, et al. Value of anti-modified citrullinated vimentin and third-generation anti-cyclic citrullinated peptide compared with second-generation anti-cyclic citrullinated peptide and rheumatoid factor in predicting disease outcome in undifferentiated arthritis and rheumatoid arthritis. Arthritis Rheum 2009;60:2232–41.
10. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003;138:W1–12.
11. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem Lab Med 2003;41:68–73.
12. Fagan TJ. Nomogram for Bayes theorem [letter]. N Engl J Med 1975;293:257.
13. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test’s sensitivity and specificity with disease prevalence. CMAJ 2013;185:E537–44.
14. Streiner DL. Diagnosing tests: using and misusing diagnostic and screening tests. J Pers Assess 2003;81:209–19.
15. Knottnerus JA, Buntinx F, editors. The evidence base of clinical diagnosis: theory and methods of diagnostic research. 2nd ed. Oxford: Wiley-Blackwell; 2009.
16. Arnett FC, Edworthy SM, Bloch DA, McShane DJ, Fries DF, Cooper NS, et al. The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis. Arthritis Rheum 1988;31:315–24.
17. Pewsner D, Battaglia M, Minder C, Marx A, Bucher HC, Egger M. Ruling a diagnosis in or out with “SpPIn” and “SnNOut”: A note of caution. BMJ 2004;329:209–13.
