Abstract
Objective To assess the reproducibility of patient-reported tender joint counts (TJCs) and swollen joint counts (SJCs) in patients with rheumatoid arthritis (RA) compared with counts performed by trained clinicians.
Methods We conducted a systematic literature review and metaanalysis of studies comparing patient-reported TJCs and/or SJCs to clinician counts in patients with RA. We calculated pooled summary estimates for correlation. Agreement was compared using a Bland-Altman approach.
Results Fourteen studies were included in the metaanalysis. There was a strong correlation between clinician and patient TJCs (0.78, 95% CI 0.76–0.80) and a moderate correlation between clinician and patient SJCs (0.59, 95% CI 0.54–0.63). TJCs had good reliability, ranging from 0.51 to 0.85. SJCs had moderate reliability, ranging from 0.28 to 0.77. Agreement for TJCs decreased at higher TJC values, suggesting a positive bias for self-reported TJCs; this was not observed for SJCs.
Conclusion Our metaanalysis has identified a strong correlation between patient- and clinician-reported TJCs, and a moderate correlation for SJCs. Patient-reported joint counts may be suitable for use in annual review for patients in remission and in monitoring treatment response for patients with RA. However, they are likely not appropriate for decisions on commencement of biologics. Further research is needed to identify patient groups in which patient-reported joint counts are unsuitable.
Rheumatoid arthritis (RA) is characterized by synovial joint inflammation leading to loss of function and disability if untreated. The systematic evaluation of joints by healthcare professionals (HCPs) is crucial in the monitoring and treatment of RA as part of the “treat-to-target” approach recommended in North American and European guidelines.1,2,3 Regular monitoring of disease activity allows up-titration of medication to achieve and maintain remission.1 Beyond clinical practice, joint counts are used widely in clinical research and are included in the Outcome Measures in Rheumatology (OMERACT) RA core dataset.4
Physician-measured joint counts have been shown to be predictive of mortality5 and are included in disease activity indices such as the Disease Activity Score (DAS), Simplified Disease Activity Index (SDAI), and the Clinical Disease Activity Index (CDAI). The DAS in 28 joints (DAS-28) is a composite measure of disease activity that combines weighted values of the 28-joint swollen and tender joint counts, the patient global assessment, and an acute-phase reactant such as C-reactive protein (CRP). In the UK, the DAS-28 is the primary tool used for the assessment of RA and is central to National Institute for Health and Care Excellence technology appraisals for RA therapies.3
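For orientation, the joint counts enter these composite indices with different weights. The sketch below shows commonly published formulations of the DAS28-CRP, SDAI, and CDAI; it is illustrative only, the function and argument names are ours, and the original index publications remain the authoritative definitions.

```python
from math import sqrt, log

def das28_crp(tjc28, sjc28, crp_mg_l, patient_global_0_100):
    """DAS28-CRP with the commonly published weights; CRP in mg/L,
    patient global assessment on a 0-100 mm visual analog scale."""
    return (0.56 * sqrt(tjc28) + 0.28 * sqrt(sjc28)
            + 0.36 * log(crp_mg_l + 1) + 0.014 * patient_global_0_100 + 0.96)

def sdai(tjc28, sjc28, patient_global_0_10, evaluator_global_0_10, crp_mg_dl):
    """Simplified Disease Activity Index: unweighted sum of its components;
    global assessments on 0-10 cm scales, CRP in mg/dL."""
    return tjc28 + sjc28 + patient_global_0_10 + evaluator_global_0_10 + crp_mg_dl

def cdai(tjc28, sjc28, patient_global_0_10, evaluator_global_0_10):
    """Clinical Disease Activity Index: as the SDAI but omitting CRP."""
    return tjc28 + sjc28 + patient_global_0_10 + evaluator_global_0_10
```

Because the joint counts contribute directly to each index, any systematic bias in a self-reported TJC or SJC propagates into the calculated disease activity score.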
Swollen joint counts (SJCs) and tender joint counts (TJCs) are performed mostly by HCPs. Several studies have evaluated the reproducibility of patient self-reported joint counts in the assessment of disease activity, as these have the potential to increase patient engagement and encourage self-management behavior. Improved self-management has been associated with beneficial outcomes across health status, pain, and fatigue.6 A 2015 systematic review reported high intra- and interobserver reliability for patient-reported TJCs but lower intra- and interobserver reliability for patient-reported SJCs.7 A metaanalysis performed in 2009 reported a summary estimate Pearson correlation coefficient of 0.61 (95% CI 0.47–0.75) for TJC and 0.44 (95% CI 0.15–0.73) for SJC.8 The key measure of reproducibility for patient-reported joint counts is agreement. Reproducibility is an umbrella term for the concepts of agreement and reliability.9 Specifically, agreement is concerned with measurement error, whereas reliability relates to the ability of an assessment to discriminate between people or objects; it concerns the ratio of between-participant variability to the total variability, which includes measurement error. While reliability is important, agreement measures are typically preferred when an instrument is used for evaluative purposes, as is the case for TJCs and SJCs.9 Previous reviews have neglected agreement, focusing instead on reliability or on correlational approaches that ignore systematic intra- and interindividual differences. This review builds on previous work by using a Bland-Altman–type approach to assess agreement, and thereby the reproducibility of patient-reported joint counts in clinical practice.
The coronavirus disease 2019 pandemic has rapidly altered the rheumatological landscape, with a transition to video and telephone clinics. It is likely that reduced face-to-face clinical interaction will be sustained. Now more than ever there is a need to understand the reproducibility of self-reported joint counts. We aimed to conduct a systematic review and metaanalysis of the published evidence around the reproducibility of self-reported TJCs and SJCs for use in calculating disease activity indices including the DAS-28, SDAI, and CDAI.
METHODS
Literature search. This study was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and registered with the international prospective register of systematic reviews (PROSPERO 2020 CRD42020189116). A systematic search of the EMBASE and MEDLINE databases was performed for studies published between January 1, 1990, and June 1, 2020. After reviewing 2 previously published systematic searches,7,8 the following search was conducted: “rheumatoid arthritis OR rheumatoid” AND “joint OR joints OR disease activity” AND “patient-report OR patient-assess OR self-report OR self-assess OR self-monitor OR self-monitor* OR self-administ* OR self-evalua* OR self-examin* OR self-rate OR self-rating”. Reference lists of previous systematic reviews were assessed for additional eligible studies.
The search was limited to studies with human participants, and the following English-language publication types: article, article in press, clinical trial, comparative study, or observational study.
Eligibility criteria. Eligible studies included a patient and a trained assessor of TJCs and/or SJCs, and a direct comparison between patient and trained assessor joint counts in patients with RA. Review articles, letters to the editor, and conference abstracts were excluded.
Study selection. Titles and abstracts of studies retrieved using the search strategy detailed above, as well as those identified from reference lists of selected publications, were reviewed independently by 2 investigators (VP, SR), and any disagreement was adjudicated by a third reviewer (MY). The data from the eligible studies were extracted into a table.
Data extraction. Two investigators (VP, SR) each extracted data from half the eligible studies. Data extracted included the following: authors, year, country in which the study was performed, study design, number of subjects, number of assessors, blinding, types of assessors compared, patient education level, study inclusion/exclusion criteria, age, sex, number of joints examined, mean SJCs and TJCs, agreement and correlation measures, agreement value, and intraobserver reliability.
Quality assessment. Risk of bias was assessed using the Quality Appraisal of Reliability Studies (QAREL) checklist, an 11-item instrument covering 7 key principles in diagnostic reliability studies (Supplementary Table 1, available with the online version of this article).10 The 7 principles are as follows: spectrum of examiners, spectrum of subjects, examiner blinding, order effects of examination, suitability of the time interval among repeated measurements, appropriate test application and interpretation, and appropriate statistical analysis.10
Statistical methods. Correlation coefficients for TJCs and SJCs were tabulated, with a separate metaanalysis performed for each measure using a random effects model (Supplementary Data 1, available with the online version of this article). Fisher Z transformation of correlation coefficients was used for metaanalysis and results were displayed graphically using forest plots. Given that the Pearson and Spearman methods are on the same metric, these were combined in the same metaanalysis, with the Pearson method used where both were reported. Sensitivity analyses stratified by correlation method confirmed no difference in the estimates. Statistical heterogeneity was described using the I2 statistic.
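To illustrate the pooling step, the sketch below combines study-level correlation coefficients on the Fisher z scale and back-transforms the pooled estimate; it uses the DerSimonian-Laird between-study variance estimator as one common random effects choice and is a simplified stand-in for the Stata routines actually used, with illustrative variable names.

```python
import math

def pool_correlations(r_values, n_values):
    """Random effects pooling of correlation coefficients via the Fisher z
    transformation (DerSimonian-Laird estimate of between-study variance)."""
    z = [math.atanh(r) for r in r_values]            # Fisher z transform
    v = [1.0 / (n - 3) for n in n_values]            # within-study variance of z
    w = [1.0 / vi for vi in v]                       # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                    # between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]         # random-effects weights
    z_pooled = sum(wi * zi for wi, zi in zip(w_star, z)) / sum(w_star)
    se = 1.0 / math.sqrt(sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # I-squared (%)
    ci = (math.tanh(z_pooled - 1.96 * se), math.tanh(z_pooled + 1.96 * se))
    return math.tanh(z_pooled), ci, i2               # pooled r, 95% CI, I2
```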
As detailed by de Vet, et al,9 we distinguished the use of agreement and reliability parameters, which collectively can be referred to as “reproducibility parameters.” Agreement parameters assess the closeness of repeated measurements by estimating measurement error. Reliability parameters assess whether study objects can be distinguished from each other despite measurement error and are related to variability between study objects.9 Intraclass correlation coefficient (ICC) and Cohen κ are reliability parameters. Agreement parameters are expressed on the actual scale of measurement, whereas reliability parameters are expressed as a dimensionless value between 0 and 1.9 It is also important to note that Spearman and Pearson correlation coefficients are neither measures of reliability nor of agreement; they are measures of association.
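For orientation, in the simplest single-rater variance components formulation, a reliability parameter such as the ICC can be written as σ²between / (σ²between + σ²error), whereas an agreement parameter such as the standard error of measurement is √σ²error, expressed in the units of the joint count itself; more elaborate ICC models exist, and this form is given only as an illustration.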
Studies that reported reliability estimates, in terms of intraclass correlations or Cohen κ, were not used in the metaanalysis but were described narratively in the results.11,12,13,14 We performed a subgroup analysis exploring whether the format in which self-reported joint counts were obtained (text or mannequin) affected correlation or reliability. At the study level, agreement between patient and HCP joint counts was compared using study means, following a Bland-Altman–type approach.15 Analyses were performed using Stata 16 (StataCorp LLC).
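To make the agreement analysis concrete, the sketch below computes the Bland-Altman bias and 95% limits of agreement for paired measurements; in this review the paired values were study-level mean joint counts rather than individual patients, and the function and variable names are illustrative.

```python
import math

def bland_altman(patient_counts, clinician_counts):
    """Bland-Altman bias and 95% limits of agreement for paired measurements
    (here, study-level mean joint counts)."""
    diffs = [p - c for p, c in zip(patient_counts, clinician_counts)]
    means = [(p + c) / 2 for p, c in zip(patient_counts, clinician_counts)]
    bias = sum(diffs) / len(diffs)                   # mean patient-minus-clinician difference
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1))
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)       # 95% limits of agreement
    return means, diffs, bias, loa                   # plot diffs against means
```

Plotting the differences against the means shows whether bias is constant across the range of joint counts, which is the feature examined in the Results.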
RESULTS
Search results. The electronic database search identified 1530 articles following removal of duplicates. Title and abstract review removed 1488 articles, leaving 42 articles for full-text review. Reference searching identified 2 additional publications. A further 24 articles were excluded after full-text review, leaving 20 eligible publications. Further details can be found in the flow diagram (Figure 1).
Study and patient characteristics. Details on included studies can be found in Table 1.11–14,16–31 The median sample size was 64 (range 30–447). The mean age of patients ranged from 49 to 65 years old, with 60–92% being female. Five studies reported race, with a range of 75–97% White.13,14,20,22,29
All 20 studies collected data on TJCs, and 15 also assessed SJCs. A total of 14/20 (70%) studies used a 28-joint count. Of the remaining 6 studies, 1 assessed fewer joints (20 joints) and 5 used more extensive counts (range 36–60 joints).
Joint counts were measured by physicians only in 14 studies, by trained nurses only in 1 study, by a trained assessor or research assistant only in 2 studies, and by a combination of a nurse or research assistant and a physician in 3 studies. Nine studies detailed how patients were trained to measure joint counts; training was typically a 5- to 10-minute session.
Five studies detailed patient education level or socioeconomic position.14,19,22,23,28 Education levels ranged from 8 to 13 years spent in education, with a wide representation of educational backgrounds. Studies from Peru and Colombia had a higher proportion of patients with a low educational background. In these studies, the association between patient- and clinician-reported joint counts was comparable to that in the overall analyses.
Summary correlation coefficients, reliability estimates, and agreement. For TJCs, 7 studies used Pearson and 6 studies used Spearman correlation coefficients. One study reported a correlation coefficient without specifying the type. For TJC reliability, 4 studies reported ICC, 1 reported Cohen κ, and 1 reported Kendall W (Table 2). For SJCs, 5 studies used Pearson and 5 studies used Spearman correlation coefficients. For SJC reliability, 3 studies reported ICC, 1 reported Cohen κ, and 1 reported Kendall W (Table 3).
The correlation coefficients between patient-reported and clinician joint counts are detailed in Table 2 and Table 3. There was a strong correlation between clinician and patient TJCs of 0.78 (95% CI 0.76–0.80; I2 = 83.4%) and a moderate correlation between clinician and patient SJCs of 0.59 (95% CI 0.54–0.63; I2 = 72.4%). The summary estimates are presented in a forest plot in Figure 2.
TJC correlation coefficients ranged from moderate to strong (range 0.37–0.94), whereas SJC correlation coefficients ranged from weak to strong (range 0.16–0.93). Correlations for both TJCs and SJCs were higher when joint counts were recorded in mannequin rather than text format. For TJCs, mannequin correlation was strong (range 0.60–0.94), whereas text correlation ranged from moderate to strong (range 0.37–0.89). For SJCs, mannequin correlation ranged from moderate to strong (range 0.43–0.93), whereas text correlation ranged from weak to strong (range 0.16–0.58).
Reliability coefficients (ICC, Cohen κ, Kendall W) for patient- and clinician-reported joint counts are detailed in Table 2 and Table 3. The strength of reliability coefficients was categorized as described by Landis and Koch.32 For SJC reliability, values ranged from fair to substantial (range 0.28–0.77), whereas for TJC reliability, values ranged from moderate to almost perfect (range 0.51–0.85). Higher TJC reliability was found for mannequin than for text format, whereas no studies measured SJC reliability using a text format. For TJCs, mannequin reliability ranged from moderate to almost perfect (range 0.51–0.85), whereas text reliability was moderate (0.55).
Bland-Altman plots were used to visualize data from all studies that provided mean TJCs and/or SJCs, with limits of agreement calculated to provide an estimate of measurement error (Figure 3). These provide additional insight into the reliability measures from the studies that explicitly reported them. From this more inclusive analysis, it was apparent that agreement was better for SJCs, and that agreement for TJCs was better at lower joint counts. On average, patients reported 1.1 more tender joints than clinicians, but because joint counts are skewed, this bias was not constant. Specifically, bias was negligible for TJCs < 5, whereas when clinician TJCs exceeded 5 joints, the bias caused by patient overestimation increased. For SJCs, the difference was negligible and any bias appeared to run in the opposite direction, with patients tending to report lower values than clinicians.
Reliability of individual joints. Three studies assessed reliability for individual joints between patients and physicians13,20,31 and 1 study assessed reliability for individual joints between patients and a trained assessor.13 Reliability was measured by Cohen κ or Kendall W. Reliability varied substantially between joints, with a median (IQR) of 0.49 (0.39–0.63) for TJC and 0.26 (0.20–0.38) for SJC. No clear pattern emerged, although reliability appeared to be higher for larger joints (shoulders, knees, and elbows).
Intrapatient reproducibility of TJCs and SJCs. Two studies reported intrapatient correlation coefficients and 2 studies reported reliability coefficients for TJCs and SJCs.16,19,24,25 For TJCs, correlation ranged from 0.87 to 0.96 and reliability ranged from 0.90 to 0.94. For SJCs, correlation ranged from 0.87 to 0.97 and reliability ranged from 0.56 to 0.89. Due to the small number of studies, summary estimates were not calculated. The interval between repeated joint counts ranged from 30 minutes to 7 days (Supplementary Tables 2 and 3, available with the online version of this article).
Effect of training. Two studies analyzed the effect of patient training.11,18 Radner, et al11 reported a paradoxical reduction in the ICC after training for TJCs (from 0.75 to 0.59) and a small improvement for SJCs (from 0.32 to 0.35). Levy, et al18 observed an improvement in Pearson correlation for both TJCs and SJCs (from 0.79 to 0.94 and from 0.41 to 0.93, respectively).
Risk of bias. No study met all the QAREL checklist criteria. Common problems included a lack of clarity as to whether participants (patients or HCPs) were blinded to discriminatory information such as inflammatory markers. Few studies explicitly commented on blinding of results between assessors. No details were available regarding the sequence of assessments (HCP vs patient, TJC vs SJC; Supplementary Table 1, available with the online version of this article).
DISCUSSION
To our knowledge, we present the most comprehensive metaanalysis of self-reported joint count reproducibility to date, describing measures of correlation, reliability, and agreement between patients and HCPs. The key finding is that the existing evidence supports self-reported joint counts as a reasonable measure to aid clinical decision making as part of disease activity indices such as the DAS, SDAI, and CDAI, although there are important caveats.
It is important to highlight the difference between measures of correlation and agreement. Assessing agreement assumes that 2 measures are comparing a common construct. In contrast, correlation can be used to describe unrelated constructs, and correlation can be high even if agreement is low. For example, if a patient consistently scored their SJCs lower than an HCP, correlation could be very good, but with low agreement. Few studies evaluated agreement, despite agreement representing a vital component of reproducibility.
Correlation between HCPs and patients was strong for TJCs and moderate for SJCs; the higher correlation for TJCs is consistent with a previous metaanalysis.8 One explanation may be the greater difficulty for individuals in discriminating a truly swollen joint from bony deformity or swelling of other nearby structures.26 Tenderness is reliant on symptoms, whereas swelling relies more on an objective assessment by the examiner.12,13
Few studies reported measures of reliability (ICCs), but those that did showed a similar pattern, with lower reliability for SJCs than TJCs. The Bland-Altman plots offer additional insight into the reproducibility of the assessments, drawing upon data from all included studies. Across the studies, mean differences between patient- and clinician-reported scores were lower for SJCs than TJCs. The mean differences were stable across the range of SJC values, whereas for TJCs, agreement was excellent at low values but diminished as TJC values increased, demonstrating a positive bias for the pain-dependent measure. This could be interpreted as evidence that self-reported joint assessments are in greater agreement at lower values, but that as disease activity rises, agreement of the self-reported joint count decreases. The clinical interpretation is that a self-reported DAS indicating low disease activity or remission is suitable for decision making, whereas caution is needed when interpreting moderate or high DAS scores based upon self-reported joint counts. The latter point is relevant to decisions about biologic or targeted immune-modulating therapies, for which patient-reported joint counts may be unsuitable.
An important question is whether differences in patient- and HCP-reported counts are above a clinically significant threshold. Detection of swollen joints may be more important than tender joints, as it is the persistence of objective inflammatory disease that predicts radiographic progression.33 We are unable to describe how accurately self-reported joint counts could classify people into discrete disease activity bands, such as remission, low, moderate, or high activity. From the information we present, it is likely that accuracy of self-reported joint counts will be better for remission and low disease activity states compared to more active disease.
A subsequent question is whether variability between patients and HCPs differs from interobserver variability between HCPs. There is a paucity of published data on interobserver variability for joint counts performed by HCPs, but the available research suggests that variability between clinicians is of a similar magnitude and that agreement between clinicians is likewise worse for SJCs than TJCs.34,35
There are limitations to our metaanalysis. Studies have been included from several decades, and over this time understanding of the effect of health literacy on health outcomes and equity within health care has evolved, potentially adding confounding over time. The studies were heterogeneous in design and risk of bias was substantial. For example, it was often unclear how many assessors were involved, or whether they were blinded to objective measures of disease activity (inflammatory markers or imaging results) at the time of performing joint counts.
The studies lacked detailed information on patient socioeconomic position, educational background, or prior health awareness. In clinical research, more educated patients tend to volunteer to participate in studies.36 This is a pertinent issue as health literacy and patient educational level may have an effect on the reliability of patient-reported joint counts. Future studies should aim to capture health literacy level and ensure inclusion of a diverse patient population representative of patients seen within clinical practice.
Finally, concomitant fibromyalgia (FM) was not accounted for. Patients with FM and RA report higher TJCs and pain scores but not SJCs.37
The increased use of remote monitoring in RA management requires a greater understanding of the reliability and agreement of self-reported disease activity measures. We present evidence to inform the use of self-reported joint counts. There is good correlation between patient- and clinician-reported joint counts. Reliability is lower than correlation, although fewer studies reported it. In our Bland-Altman–type analysis, agreement was better at lower TJC values. Self-reported joint counts in RA without concomitant FM now have sufficient reproducibility to justify their use in routine practice.
Footnotes
MY is funded by Versus Arthritis.
S. Rampes and V. Patel contributed equally to this work.
The authors declare no conflicts of interest relevant to this article.
Accepted for publication May 5, 2021.
Copyright © 2021 by the Journal of Rheumatology