Abstract
Objective. To test the reliability of the consensus-based ultrasound (US) definitions of elementary gout lesions in patients.
Methods. Eight patients with microscopically proven gout were evaluated by 16 sonographers for signs of double contour (DC), aggregates, erosions, and tophi in the first metatarsophalangeal joint and the knee bilaterally. The patients were examined twice using B-mode US to test agreement and inter- and intraobserver reliability of the elementary components.
Results. The prevalence of the lesions was DC 52.8%, tophus 61.1%, aggregates 29.8%, and erosions 32.4%. The intraobserver reliability was good for all lesions except DC, where it was moderate. The best reliability per lesion was seen for tophus (κ 0.73, 95% CI 0.61–0.85) and the lowest for DC (κ 0.53, 95% CI 0.38–0.67). The interobserver reliability was good for tophus and erosions, but fair for aggregates and moderate for DC. The best reliability was seen for erosions (κ 0.74, 95% CI 0.65–0.81) and the lowest for aggregates (κ 0.21, 95% CI 0.04–0.37).
Conclusion. This is the first step in testing consensus-based US definitions of elementary lesions in patients with gout. Intraobserver reliability was good for all elementary lesions except DC when the definitions were applied in patients, whereas interobserver reliability was lower for DC and aggregates. Further studies are needed to improve the interobserver reliability, particularly for DC and aggregates.
Gout is a common inflammatory joint disease caused by the formation and deposition of monosodium urate (MSU) crystals in joints or soft tissues. The diagnosis of gout is conventionally based on the history, clinical examination, serum uric acid levels, and subsequent polarization microscopy of joint or tophus aspirates. Polarization microscopy is the definitive way to diagnose gout. In patients with suspected gout but without synovial effusion or clinical tophi, sampling relevant material to examine for MSU is challenging, and other diagnostic procedures are warranted.
Uncontrolled hyperuricemia may be associated with chronic kidney disease and cardiovascular disease with subsequent increased morbidity and mortality1. This has underlined the importance of an early, accurate diagnosis of gout, and with the development of new therapeutic options, imaging modalities have been investigated to determine whether they may improve disease assessment. In this respect, ultrasound (US) has been shown to be promising because it allows direct visualization of the crystal deposits and is increasingly available in the clinical setting. Although several published studies have highlighted the use of US for the assessment of elementary components in gout, a systematic literature review emphasized the lack of definitions of US elementary lesions2. To improve the use of US in the evaluation of gout, an Outcome Measures in Rheumatology Clinical Trials (OMERACT) task force subgroup was formed.
The first step in the standardization of US as a tool for diagnosis and monitoring of gout was to obtain consensus-based definitions of the US elementary lesions, which the systematic literature review2 identified as double contour (DC), aggregates, tophi, and erosions. The consensus-based definitions were obtained through a Delphi exercise and tested in a subsequent Web exercise3. In the latter, in which the definitions were tested on static images, reliability was good to excellent for all lesions except aggregates, for which it was moderate3. The second step in the standardization process, and the aim of our present study, was to test the reliability of the consensus-based definitions in patients with known gout to ensure that the pathologies were present. This second step is mandatory before the ability of US to serve as a diagnostic tool can be tested.
MATERIALS AND METHODS
Study design and setting
Our present study was performed according to a prespecified protocol. The reporting of the OMERACT reliability exercise followed the recommendations from the Enhancing the QUAlity and Transparency Of health Research network4 using the Guidelines for Reporting Reliability and Agreement Studies statement5.
Ethics committee approval of our study was obtained from the Berlin Medical Association (Berliner Ärztekammer, Eth-17/13). All patients gave informed consent. Following the consensus-based definitions obtained through a Delphi and Web exercise process, a workshop was conducted to evaluate the reliability of detecting the elementary components in patients and assessing agreement between sonographers with experience in US of gout.
Measurements
The following definitions of the US elementary lesions, obtained in the first step3, were tested in the patients with gout: (1) Double contour: “Abnormal hyperechoic band over the superficial margin of the articular hyaline cartilage, independent of the angle of insonation and which may be either irregular or regular, continuous or intermittent and can be distinguished from the cartilage interface sign”. (2) Tophus [independent of location (e.g., extra-articular/intra-articular/intra-tendinous)]: “A circumscribed, inhomogeneous, hyperechoic and/or hypoechoic aggregation (which may or may not generate posterior acoustic shadow) which may be surrounded by a small anechoic rim”. (3) Aggregates [independent of location (intra-articular/intra-tendinous)]: “Heterogeneous hyperechoic foci that maintain their high degree of reflectivity even when the gain setting is minimized or the insonation angle is changed and which occasionally may generate posterior acoustic shadow”. (4) Erosions: “An intra- and/or extra-articular discontinuity of the bone surface (visible in 2 perpendicular planes)”3. Examples can be seen in Figure 1.
Patients
To ensure that all elementary lesions would be detectable, 8 patients with polyarticular, tophaceous gout, verified by polarization microscopy, from the Medical Center of Rheumatology Berlin-Buch volunteered for our study. Their clinical data are shown in Table 1. Each patient underwent a bilateral B-mode US examination (including dynamic examination) of the first metatarsophalangeal (MTP) joints, the cartilage of the intercondylar region of the knees (ICR), and the proximal (PPT) and distal (DPT) parts of the patella tendons. The MTP joints were examined dorsally, from the medial to the lateral aspect. The tendons were also examined from the medial to the lateral aspect and at the enthesis, where signs of erosive changes were also evaluated. The lower extremities were chosen for feasibility reasons. All examinations were performed during a morning and an afternoon session on the same day, in the same room, using 8 Esaote MyLab Twice/Class machines equipped with 6–18 MHz broadband linear array transducers. The B-mode settings of all 8 US machines were identical.
Ultrasonographers
Sixteen of the rheumatologists previously involved in the development of the consensus-based US definitions for gout participated in the workshop. All rheumatologists were experienced in musculoskeletal US.
Outcomes and rating process
Each patient (numbered 1–8) was assigned to 1 machine (numbered 1–8), and the ultrasonographers (numbered 1–16) then moved from 1 patient to the next in a predefined (randomized) sequence, with 13 min allocated for scanning and scoring the findings. Each examiner scanned every patient twice so that intrareader reliability could be assessed. The data were collected immediately after the session to ensure no communication between examiners.
Statistical analysis
Intra- and interobserver reliability were estimated using the Cohen κ coefficient. These κ coefficients and the corresponding 95% CI were interpreted according to Landis and Koch6: κ values of 0–0.20 were considered poor, 0.20–0.40 fair, 0.40–0.60 moderate, 0.60–0.80 good, and 0.80–1 excellent. The percentage of observed agreement (i.e., the percentage of observations that obtained the same score) and the prevalence of the observed lesions were also calculated.
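The κ coefficient and the interpretation bands described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names are hypothetical, the ratings are invented, and because the bands as listed share their boundary values, the sketch assumes that a boundary value falls into the lower band.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating sequences of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items that received the same score twice.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, computed from the marginal frequencies of each series.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Band labels as listed in the text; boundary values go to the lower band (assumption)."""
    for upper, label in [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"), (0.80, "good")]:
        if kappa <= upper:
            return label
    return "excellent"


# Hypothetical scores (1 = lesion present, 0 = absent) from two scanning rounds.
first_round = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
second_round = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
print(f"kappa = {cohen_kappa(first_round, second_round):.2f}")

# Band labels for three kappa values reported in this study.
for value in (0.73, 0.53, 0.21):
    print(value, interpret_kappa(value))
```

In this invented example the two rounds agree on 8 of 10 items and chance agreement is 0.5, so κ is about 0.6. The labels returned for 0.73, 0.53, and 0.21 match the wording used for tophus, DC, and aggregates in the Results.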
Because our study used a hierarchical design, with repeated measures across 8 patients and 16 rheumatologists, Cohen κ for measuring agreement between 2 raters had to be extended according to Light and Fleiss for use with multiple raters, as well as repeated measures within patients7,8. Thus, we used a crossed design in which all raters (1–16) evaluated all patients (8) in duplicate (2 tests), in all anatomical positions (4 positions: DPT, ICR, MTP1, and PPT), and on both sides of the body (right, left). The outcomes were the facet of differentiation, and they were nested in patient and rheumatologist. For intraobserver reliability, to adjust for the “clustered data” we inflated the standard error (SE) of the κ estimate (tophus, aggregates, and erosions) by multiplying the unadjusted SE by 2.828 (i.e., 8 × 16 = 128 independent observations rather than the apparent 1024 clustered observations; √(1024/128) ≈ 2.828); for DC, the unadjusted SE was multiplied by 2.000 (i.e., 8 × 16 = 128 independent observations rather than the apparent 512 clustered observations; √(512/128) = 2). Interobserver reliability was assessed by Light’s κ (mean κ across all pairs of observers) based on the second test scenario8.
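The two quantities in this paragraph can be sketched in Python as follows. The multipliers 2.828 and 2.000 quoted above are consistent with the square root of the ratio of apparent to independent observations, which is the assumption made here; Light's κ is taken, as stated in the text, as the mean Cohen κ over all pairs of raters. The function names and the toy scores are hypothetical and are not study data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def se_inflation_factor(apparent_n, independent_n):
    """Multiplier for the unadjusted SE: sqrt(apparent / independent) (assumed form)."""
    return sqrt(apparent_n / independent_n)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical scores.

    Undefined (division by zero) if both raters use one identical category throughout.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)


def lights_kappa(scores_by_rater):
    """Light's kappa: mean Cohen's kappa over all unordered pairs of raters."""
    pairs = list(combinations(scores_by_rater, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Counts from the text: 8 patients x 16 raters are treated as independent.
independent = 8 * 16
print(round(se_inflation_factor(1024, independent), 3))  # 2.828 (tophus, aggregates, erosions)
print(round(se_inflation_factor(512, independent), 3))   # 2.0 (DC)

# Hypothetical scores (1 = lesion present) from three raters on four sites.
toy_scores = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(round(lights_kappa(toy_scores), 3))  # 0.667
```

In the toy data the first two raters agree perfectly (κ = 1) and each agrees with the third at κ = 0.5, so the mean over the three pairs is 2/3. With 16 raters the same function would average over 120 pairs.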
RESULTS
All patients were men with a mean age of 67 years (range 48–74) and all received urate-lowering therapy (Table 1). The prevalence of the lesions is listed in Table 2, showing that DC and tophus were the most frequently observed (> 50%), especially in the ICR and MTP1, respectively, while aggregates and erosions were each observed in about 30% of the anatomical areas, most frequently in the MTP1 joint for both.
Table 3 summarizes the reliability of the US elementary lesions in patients. The observed intraobserver agreement was 87% (891/1024) for tophi, 84% (857/1024) for aggregates, 87% (889/1024) for erosions, and 76% (391/512) for DC. The intra- and interobserver κ values are also shown. For the intraobserver reliability, the best reliability per lesion was seen for tophus (κ 0.73, 95% CI 0.61–0.85) and the lowest for DC (κ 0.53, 95% CI 0.38–0.67). The interobserver reliability per lesion was best for erosions (κ 0.74, 95% CI 0.65–0.81) and lowest for aggregates (κ 0.21, 95% CI 0.04–0.37).
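As a quick arithmetic check, the observed intraobserver agreement percentages can be recomputed from the counts given in this paragraph. The short Python sketch below uses only those counts; the variable and function names are hypothetical.

```python
# Agreement counts as reported in the text: (concordant pairs, total pairs).
agreement_counts = {
    "tophus": (891, 1024),
    "aggregates": (857, 1024),
    "erosions": (889, 1024),
    "double contour": (391, 512),
}


def percent_agreement(agree, total):
    """Observed agreement as a percentage of all paired observations."""
    return 100.0 * agree / total


for lesion, (agree, total) in agreement_counts.items():
    print(f"{lesion}: {percent_agreement(agree, total):.1f}%")
```

Rounded to whole numbers, the results reproduce the reported 87%, 84%, 87%, and 76%.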
DISCUSSION
Because of the introduction of new treatment options for both acute and chronic gout, research activities have focused on developing validated outcome measures to evaluate treatment effects. This has been the case, in particular, with regard to tophus regression and joint inflammation, including joint swelling9. For both chronic and acute gout, the suggested variables in the OMERACT core domain set mentioned above may be evaluated by US. To validate US as a possible outcome instrument, we set out to standardize US by first defining the US elementary lesions in gout. After the initial step of defining the elementary lesions and testing these on static images of well-illustrated lesions3, the next step was to test the reliability of the definitions in patients with known gout, adding image acquisition to the exercise.
The intraobserver agreement was found to be good for all lesions except DC, for which it was moderate. The interobserver reliability was fair for aggregates (κ 0.21), moderate for DC (κ 0.47), and good for the other components. These findings are in line with previous studies on the reliability of US lesions and regions10,11, including in gout12. Aggregates are heterogeneously described in the literature2 and are believed to be deposits of crystals in the soft tissues that are not large enough to be defined as a tophus3, while DC is created by the deposits of crystals on the surface of the cartilage13 and may be detected by US in up to 60% of joints, including asymptomatic joints in patients with gout14,15,16. In our study, the overall DC prevalence was 53% (Table 2). The US definition obtained in the Delphi for aggregates is less specific than the other definitions3 and was proposed to define the soft tissue hyperechogenicity often seen in patients with gout2. Although the good intraobserver agreement suggests that individual sonographers scored consistently, they appear to have had different perceptions of the definition because the interobserver agreement was only fair. Even when looking at the individual sites, no joint or tendon site had better reliability for detecting aggregates.
The poor performance of the aggregate definition may also be partly related to statistical factors. The prevalence of aggregates was low overall, and this may explain the fair κ. When the prevalence is very low or very high, low κ values can be obtained even though the overall agreement is high (the κ paradox)17. This is because, with a very low or very high prevalence, the agreement expected by chance is very high, and κ reflects only the agreement that remains after chance agreement has been discounted. This may also partly explain the lower site agreement for DC.
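The prevalence effect described above can be made concrete with a small Python sketch. The two 2 × 2 tables are invented for illustration and are not study data: both show 84% observed agreement, but one has balanced prevalence and the other a low prevalence of positive findings. The function name is hypothetical.

```python
def kappa_from_2x2(both_pos, only_first, only_second, both_neg):
    """Return (observed agreement, chance agreement, kappa) for a 2 x 2 table."""
    n = both_pos + only_first + only_second + both_neg
    p_o = (both_pos + both_neg) / n
    # Proportion of positive scores given by each rater (the marginals).
    first_pos = (both_pos + only_first) / n
    second_pos = (both_pos + only_second) / n
    # Chance agreement: both positive by chance plus both negative by chance.
    p_e = first_pos * second_pos + (1 - first_pos) * (1 - second_pos)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical tables with identical observed agreement (84 of 100).
balanced = kappa_from_2x2(both_pos=42, only_first=8, only_second=8, both_neg=42)
skewed = kappa_from_2x2(both_pos=4, only_first=8, only_second=8, both_neg=80)

for name, (p_o, p_e, kappa) in (("balanced", balanced), ("skewed", skewed)):
    print(f"{name}: observed={p_o:.2f} chance={p_e:.2f} kappa={kappa:.2f}")
```

With balanced prevalence, chance agreement is 0.50 and κ is 0.68; with the skewed table, chance agreement rises to about 0.79 and κ falls to about 0.24, although the raters agree on the same number of items in both cases.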
DC is perceived to be indicative of gout18. The reliability for DC was moderate for both intra- and interobserver agreement, which is in contrast to the interobserver reliability of the DC in previous publications, where it was found to be excellent in local research groups14,15,18,19. This may partly be related to more exercise time on pathological features in small study groups and may indicate that more scanning time together in a group is necessary. Further, the US definition for DC is more detailed, and the discrepancy between the interobserver agreement found in the Web exercise and that found in the patient workshop may in part be related to the image acquisition and scanning technique. In static image Web exercises, only clear images of the pathologies are chosen, which might not always resemble the clinical setup with patients. Another possible pitfall might be that the DC may resemble the cartilage interface sign, which can only be seen in an area where the insonation angle is 90° and appears as a white line on the surface of the cartilage. The DC is seen as a white line (punctate or linear) on the surface of the cartilage, also in areas where the insonation angle is < 90°, and will move with the cartilage during dynamic exercise because the crystals are deposited on the surface of the cartilage. Though this differentiation between the cartilage interface sign and DC is well known, it may have had an effect on the findings during image acquisition in patients, especially in the ICR, where the presence of even a minimal effusion makes the cartilage interface sign more frequent. Further studies are needed to evaluate the optimal site for cartilage pathology.
Not surprisingly, both intra- and interreader reliability were good for erosions, which can be explained by the group's longstanding experience in scanning this pathology in rheumatoid arthritis (RA). The exercise also demonstrated good intra- and interreader reliability for tophus, which is an aspect of the urate burden in patients with gout. However, aggregates were the lesions with the lowest reliability in the static image exercise3, and their reliability was even lower in patients, especially between observers. This raises the issue of whether the definition truly covers the US lesions it is supposed to describe (soft tissue hyperechogenicity). Because a tophus is also a collection of aggregates, there is a risk that these 2 elementary lesions overlap and impair the agreement between observers. Further steps are needed to improve the US definition of aggregates before they can be used as an elementary US lesion in gout as part of future outcome measures.
For DC, the reliability was moderate and further studies are needed to improve the reliability by focusing on MTP joint cartilage before the definitions may be used in multicenter studies and before testing the diagnostic sensitivity of US in gout20.
Future steps that include collaboration with histologists may help to improve the definitions, followed by Web exercises focusing specifically on these 2 features (aggregates and DC) in images resembling daily clinical situations. Patient workshops are needed to develop optimal image acquisition techniques, and finally the development of a US atlas of pathological lesions may aid in the reliability process because this has been found to greatly increase the reliability of scoring synovitis in RA between sonographers21.
Consensus-based US definitions of elementary lesions in gout were tested in patients and showed high intraobserver reliability but lower interobserver reliability, especially for aggregates. DC had lower interreader reliability than tophus and erosions. Further studies are needed to refine the US definition of aggregates to improve the reliability between sonographers.
Acknowledgment
The authors are indebted to the patients for their participation in the study and to Vanessa Schmidt for collecting data from the workshop.
APPENDIX 1
List of study collaborators. OMERACT Ultrasound Gout Task Force members: Ingrid Möller, David Bong, Marcin Szkudlarek, Eugenio De Miguel, Veronica Sharp, Christian Dejaco, Eugene Kissin, Petra Hanova, Frederique Gandjbakhch, Jane Freeston, Juhani Koski, Nanno Swen, Oscar Epis, Sibel Aydin, Viviana Ravagnani, Anthony Reginato, and Richard J. Wakefield.
Footnotes
Financial support from Outcome Measures in Rheumatology Clinical Trials (OMERACT). LT is funded by the Danish Rheumatism Association and RC is funded through grants from the Oak Foundation.
Accepted for publication July 23, 2015.