Abstract
Objective. To assess the methodological quality of randomized controlled trials (RCT) of medical and surgical therapy in patients with arthroplasty.
Methods. We conducted a Medline database search for all arthroplasty RCT published in 1997 and 2006. The quality of the methods of all eligible RCT was assessed by a trained abstractor. We used a checklist of trial quality characteristics, and the overall trial quality was assessed by 3 scales: Jadad (range 0–5), Delphi list (range 0–9), and numeric rating scale (NRS; range 1–10), based on Users’ Guides to the Medical Literature.
Results. A total of 196 articles were included in the analysis; most included hip (n = 81) or knee (n = 80) or both hip/knee arthroplasty (n = 19); 66 (34%) assessed pharmacological treatments, 117 (60%) nonpharmacological treatments, and 13 (7%) both. Mean (SEM) overall quality scores of arthroplasty RCT were low: Jadad score 2.36 (1.4), Delphi list 5.33 (1.6), and NRS score 4.30 (2.6). Multivariable analyses revealed that nonpharmacological intervention RCT had lower odds (odds ratio 0.28–0.39; p = 0.008–0.033) and those with no funding had lower odds (OR 0.28–0.50; p = 0.014–0.119) of being in the highest tertiles of the 3 overall quality scores. In contrast, multicenter RCT had 1.8–4.7 times higher odds of being in the highest tertiles of quality scores (p = 0.017–0.185).
Conclusion. Methodological deficiencies in reporting of hip/knee arthroplasty RCT offer an opportunity for improvement. Type of intervention, number of trial centers, and presence of funding were independently associated with overall trial quality. In the future, conducting multicenter rather than single-center RCT, and modeling the protocols of single-center RCT on the rigor of multicenter RCT, may improve the quality of arthroplasty RCT.
In the US, arthritis and other rheumatic conditions affected an estimated 70 million people in 2001 [1], led to 744,000 hospitalizations and 44 million ambulatory care visits in 1997 [2], and cost $149 billion in direct and indirect costs in 1992 (2.5% of the gross national product) [3]. Arthritis leads to significant physical and psychological morbidity [4,5] and is the leading cause of disability in adults in the US [6]. Joint arthroplasty is the most significant advance in treatment of patients with endstage arthritis; 202,500 primary total hip arthroplasty (THA) and 402,100 primary total knee arthroplasty (TKA) procedures were performed in the US in 2003 [7]. Arthroplasty is associated with relief of pain and improvement of function and quality of life [7]. THA has been called “the operation of the century” [8].
Because of the significant public health burden and cost associated with hip and knee arthroplasty, physicians and patients need high quality evidence on which to base their decisions. To our knowledge, there are no published reports assessing the quality of arthroplasty randomized controlled trials (RCT).
A recent systematic review of RCT in osteoarthritis found differences between pharmacological and nonpharmacological RCT9. We conducted a systematic review of the available literature to examine the quality of reporting across randomized trials in arthroplasty. Specifically, we aimed (1) to examine the methodological quality of arthroplasty RCT; and (2) to study whether intervention (pharmacological vs nonpharmacological), trial (funding source, number of centers, number of patients per trial), or publication (year of publication, type of journal, journal impact factor) characteristics were associated with overall trial quality (Jadad score, etc.) and with specific quality standards (allocation concealment, use of placebo, etc.).
MATERIALS AND METHODS
Search strategy
Medline was searched by a librarian from the Cochrane Library Systematic Review Group (IR) using the following search terms for arthroplasty: “exp arthroplasty, Replacement, Knee/ or exp Joint Prosthesis/ or exp Arthroplasty, Replacement/ or joint arthroplasty.mp. or exp Arthroplasty, Replacement, Hip.” This search was further limited to RCT published in the 2 calendar years 2006 (the most recent year at the time of review) and 1997 (a year about a decade earlier), to examine if the quality of RCT had changed over a decade. Upon review of the titles and abstracts by a senior author (JS), articles were excluded if they were letters/editorials, were nonrandomized, were published in a language other than English, were not arthroplasty-related, or did not include clinical outcomes (e.g., economic analyses). There were no restrictions by the journal name or specialty.
Detailed evaluation of study quality
Training of a single abstractor (SM) by the senior epidemiologist (JS) consisted of: (1) review of the literature and key articles describing the quality assessments of trials; (2) detailed discussion of key assessment components, including allocation concealment, blinding, etc.; (3) 3 rounds of independent abstraction of articles (14 articles) by both the senior author (JS) and the trained abstractor (SM), which led to > 95% agreement on all abstracted data. After the training period, SM, who was blinded to the study hypotheses, assessed and abstracted trial quality data from all included studies using a structured abstraction form, modified from that used by Boutron, et al9. Data were entered into forms created using Microsoft Access 2003 (Microsoft, Redmond, WA, USA) (Appendix 1).
We obtained the following characteristics for each included study: (1) year of publication, journal, title; (2) body region involved — upper (shoulder, elbow, hand) or lower extremity (hip, knee, foot, long bones); (3) financial support — public, private, neither, both, or not clear; (4) number of centers involved — single center, multicenter, not clear; (5) number of patients/study: ≤ 50, 51–100, 101–200, 201–500, and > 500; (6) treatment classification: pharmacologic (oral, topical, intramuscular, intravenous, intraarticular, or other) versus nonpharmacological (surgery, arthroscopy, joint lavage, acupuncture, rehabilitation, behavioral, or other); (7) type of study — original versus followup/subgroup analyses; (8) type of journal — orthopedics/surgery, anesthesia, internal medicine/medical subspecialties, and rehabilitation/others; and (9) journal impact factor — classified as ≤ 0.5, > 0.5–1, > 1–2, > 2–5, > 5–10, and > 10. Impact factor and number of patients were categorized due to a skewed distribution.
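The categorization of impact factor and number of patients described above can be sketched in a few lines. This is a minimal illustration, assuming the category boundaries are exactly those listed in the text; the function and constant names are hypothetical and are not taken from the study's abstraction form.

```python
from bisect import bisect_left

# Upper bounds of each category, as listed in the text.
# All names here are illustrative assumptions, not the study's own code.
IMPACT_EDGES = [0.5, 1, 2, 5, 10]
IMPACT_LABELS = ["<= 0.5", "> 0.5-1", "> 1-2", "> 2-5", "> 5-10", "> 10"]

PATIENT_EDGES = [50, 100, 200, 500]
PATIENT_LABELS = ["<= 50", "51-100", "101-200", "201-500", "> 500"]


def categorize(value, edges, labels):
    """Return the label of the first category whose upper bound is >= value."""
    # bisect_left places a value equal to an edge into the lower category,
    # which matches the "<= upper bound" convention used in the text.
    return labels[bisect_left(edges, value)]


def impact_factor_category(impact_factor):
    """Map a journal impact factor to one of the six categories."""
    return categorize(impact_factor, IMPACT_EDGES, IMPACT_LABELS)


def patient_count_category(n_patients):
    """Map a trial's number of patients to one of the five categories."""
    return categorize(n_patients, PATIENT_EDGES, PATIENT_LABELS)
```

For example, under these assumptions a journal with impact factor 2 falls in the "> 1-2" category and a trial with 51 patients falls in "51-100".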
We examined whether the CONSORT (Consolidated Standards of Reporting Trials) criteria10 were reported in a flowchart or in the text and whether the loss to followup was < 20%. We assessed trial design, mode of randomization, blinding, and outcome assessment. The CONSORT checklist was not used, since this was described in 2001, after one of the years of included articles (1997). Generation of randomization sequence was considered (1) appropriate if selection bias was prevented by use of random numbers, computerized random number generation, pharmacy controlled, opaque sealed envelopes, numbered or coded bottles; (2) inappropriate if patients were allocated alternately, according to date of birth, date of admission, hospital number etc.; and (3) indeterminate. Allocation was considered concealed if both patients and investigators enrolling patients in the study could not foresee the assignments due to centralized randomization/pharmacy control/opaque envelopes, etc.
We assessed if blinding of patients, care providers, and outcome assessors was reported, if it was appropriate11, whether it was theoretically efficient, and whether it was tested. Appropriateness of blinding was categorized as follows: (1) appropriate — stated that neither person doing assessments nor study participant could identify the intervention being tested or use of active placebos, identical placebos, or dummies; (2) inappropriate — comparison of tablet versus injection with no double-dummy; and (3) indeterminate.
The following details regarding the intervention were extracted: (1) Was the intervention individualized (i.e., treatment modification according to individual’s profile)?; (2) Was the intervention described in enough detail to be reproducible?; (3) Was there a control intervention? If so, was this placebo, active control, usual care, waiting list, or other?; (4) Was the potential placebo effect of each treatment similar?; (5) Was the quality of intervention and control intervention assessed?; (6) Could care providers influence the treatment effect? If so, was this due to their experience, learning curve, or training of care providers at the beginning of the trial?; (7) Was there a contamination of the 2 groups (by providing intervention to the control group)?; (8) Were concomitant treatments reported?; and (9) Was treatment compliance tested?; (10) If tested, how was it assessed (pill counts, patient report, video, reporting diary, not reported)?
The statistical analysis section was examined to determine whether a trial reported a justification for sample size, whether the analyses were described as intention to treat analysis (ITT), i.e., all participants randomized were included in the analysis and kept in the original groups12, or modified ITT, i.e., analysis excluded those who never received treatment or who were never evaluated while receiving treatment.
Study outcomes
Outcomes included reporting of each trial quality characteristic and the overall quality assessment. Trial quality was assessed in detail by examining the adequacy of reporting of allocation concealment, generation of allocation (i.e., randomization) sequence, use of placebo, CONSORT diagram, reproducibility of intervention, loss to followup, adverse events, sample size justification, use of intention to treat analysis, and blinding of patients, care providers and outcome assessors.
Overall trial quality was assessed using 3 validated measures: Jadad score13,14, Delphi list’s overall score15, and overall subjective assessment of validity of the study as described in the Users’ Guides to the Medical Literature16. The Jadad scale assesses the appropriateness of randomization, blinding, and loss to followup, and ranges from 0 to 5. The Delphi list includes 9 items that assess trial characteristics on a 0–9 score, including randomization, similarity at baseline, eligibility criteria, allocation concealment, blinding of outcome assessor, patient and care provider, inclusion of ITT, and report of point estimates and variability. The overall subjective evaluation of the study’s quality was assessed on a numerical rating scale (NRS) ranging from 1 to 10 by answering the question, “To what extent were systematic errors or bias avoided in this report?” We included multiple scales of overall quality for 2 reasons: (1) the Jadad scale is heavily weighted to double-blinding, which is often not possible in surgical RCT; we therefore included the Delphi list, which has no points for use of placebo and awards only one point each for blinding of care providers, assessors, and patients; and (2) for robustness of analyses. For all 3 measures, a higher score indicates higher quality.
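To make the 0–5 range of the Jadad scale concrete, the following sketch scores a single trial. It assumes the commonly cited form of the scale (one point each for being described as randomized, being described as double-blind, and describing withdrawals and dropouts, with one point added or deducted for an appropriate or inappropriate method of randomization or blinding). The function name, argument names, and the three-level method coding are illustrative assumptions; this is not the abstraction form used in this study.

```python
def jadad_score(randomized, randomization_method,
                double_blind, blinding_method,
                withdrawals_described):
    """Score one trial on the 0-5 Jadad scale (illustrative sketch).

    randomization_method and blinding_method take one of
    "appropriate", "inappropriate", or "indeterminate".
    """
    score = 0

    # Randomization: 1 point if described as randomized,
    # +1 if the method is appropriate, -1 if it is inappropriate.
    if randomized:
        score += 1
        if randomization_method == "appropriate":
            score += 1
        elif randomization_method == "inappropriate":
            score -= 1

    # Blinding: same structure as randomization.
    if double_blind:
        score += 1
        if blinding_method == "appropriate":
            score += 1
        elif blinding_method == "inappropriate":
            score -= 1

    # Withdrawals and dropouts: 1 point if described.
    if withdrawals_described:
        score += 1

    return score
```

Under this scheme, an unblinded surgical trial can score at most 3, which illustrates why a scale with less weight on blinding, such as the Delphi list, was also used.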
Statistical analysis
For continuous measures, we calculated mean and standard error of the mean and for categorical variables the frequencies and percentages. We used chi-square and independent sample Student’s t tests to examine the univariate association of trial, intervention, and publication characteristics with trial quality — assessed by both individual quality characteristic (allocation concealment, etc.) and the overall trial quality (Jadad scale, Delphi list, and subjective overall score), respectively. We performed 3 separate multivariable-adjusted logistic regression analyses to assess which of the trial characteristics significant in the univariate analyses were independently associated with overall trial quality, outcome being the highest tertiles of Jadad, NRS, and Delphi list scores. The cutoffs for the highest tertiles were ≥ 3 on Jadad score, ≥ 6 on NRS, and ≥ 6 on the Delphi list score. Variables with a right-skewed distribution, i.e., the journal impact factor and the number of patients, were categorized into dichotomous and categorical variables, respectively, allowing enough numbers in each category.
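The descriptive statistics and the dichotomized outcome used in the logistic models can be illustrated as follows. This is a minimal sketch using only the Python standard library. The cutoffs are those stated above, while the function names and the dictionary keys are hypothetical, and the regression models themselves are not reproduced.

```python
import math
import statistics

# Cutoffs for the highest tertile of each scale, as stated in the text.
HIGHEST_TERTILE_CUTOFF = {"jadad": 3, "nrs": 6, "delphi": 6}


def in_highest_tertile(scale, score):
    """Return True if a trial's score falls in the highest tertile of the scale."""
    return score >= HIGHEST_TERTILE_CUTOFF[scale]


def mean_sem(values):
    """Return (mean, standard error of the mean) for a list of scores.

    SEM is the sample standard deviation divided by sqrt(n).
    """
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem
```

The boolean returned by `in_highest_tertile` corresponds to the binary outcome of each of the 3 logistic regression models.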
Sensitivity analyses were done for the above-described multivariable analyses by considering 2 predictors, impact factor and number of patients, as continuous variables instead of categorical variables as in the previous models. All tests were 2-sided, and we considered p < 0.05 statistically significant.
RESULTS
Characteristics of included studies
After screening the abstracts and full text, excluding non-English language, 196 articles were eligible for abstraction, 67 from year 1997 and 129 from 2006 (Figure 1). Of these, 130 articles assessed nonpharmacological therapy and 79 assessed pharmacological therapies (13 articles assessed both). Eighty articles included surgical interventions, 17 rehabilitation therapy, 3 education intervention, 3 behavior therapy, 22 oral medications, 25 parenteral, 40 intraarticular, and 30 other interventions (Figure 1).
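The counts above can be reconciled with the total of 196 articles by inclusion-exclusion. In the worked arithmetic below, N and P are symbols introduced here for illustration only, denoting the sets of articles assessing nonpharmacological and pharmacological therapy.

```latex
\begin{align*}
|N \cup P| &= |N| + |P| - |N \cap P| = 130 + 79 - 13 = 196,\\
|N \setminus P| &= 130 - 13 = 117 \quad \text{(nonpharmacological only)},\\
|P \setminus N| &= 79 - 13 = 66 \quad \text{(pharmacological only)}.
\end{align*}
```

These values match the 117, 66, and 13 articles reported in the Abstract.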
Figure 1. The process of articles selected for review. *Totals add up to more than a simple sum since many studies had multiple types of interventions.
Of the 196 studies included for analyses, 81 included only THA, 80 only TKA, and 19 both THA and TKA (Table 1). Over one-third of studies (35%) included sample sizes of 50 patients or less. The studies were primarily published in orthopedics (64%) or anesthesia journals (17%), with a few in internal medicine and related subspecialty (14%) and rehabilitation/other journals (6%).
Table 1. Characteristics of studies. Values are N (%). Numbers are rounded to the nearest digit; total may add up to > 100, since many trials had > 1 type of intervention.
Methodological quality of included studies — univariate analyses for overall quality and individual quality characteristics
The overall quality of studies was low: Jadad score was 2.36 (range 0–5), Delphi list scale score 5.33 (range 0–9), and overall NRS score 4.3 (range 1–10); scores were at or below the mean of the range of each scale (Table 2). Univariate analyses showed that type of intervention, number of centers, number of patients, funding source, type of journal, and journal impact factor were significantly associated with overall quality (Table 2). The year of publication was not associated with overall RCT quality in univariate analyses.
Table 2. Association of trial characteristics with overall quality as assessed by Jadad (range, 0–5), Delphi list (0–9), and numeric rating scale (1–10) scores. Values are mean (SEM).
Examination of individual study characteristics revealed that a low proportion of studies described the following: adequate generation of allocation sequence (43%); allocation concealment (39%); CONSORT diagram (10%); blinding of patients (31%), care providers (17%), and outcome assessors (45%); use of ITT or modified ITT for analyses (10%); and sample size justification (36%). Only 18% of the studies used placebo and 51% reported < 20% loss to followup. Since it may not be possible or ethical to blind patients or care providers or to use placebo in surgical trials, we also restricted these assessments to pharmacological trials; the proportions remained low, at 61%, 42%, and 47%. On the other hand, some quality indicators were reasonably well described, including adverse event reporting (57%), potential similarity of placebo to treatment (64%), and reproducibility of intervention (97%) (Table 3).
Table 3. Characteristics of randomized arthroplasty trials by the type of intervention. Values are n (%).
RCT of pharmacological interventions or those that had both pharmacologic and nonpharmacological interventions had significantly better quality standards than nonpharmacological intervention RCT (Table 3). Specifically, trials of surgical or rehabilitation interventions had significantly lower use of placebo, blinding of patients, care providers or outcome assessors, or sample size justification (Appendix 1). Similar deficits were noted in individual quality characteristics in small sample size RCT, compared to larger sample size RCT (Table 4).
Table 4. Characteristics of randomized trials of arthroplasty by number of patients and number of centers. Values are n (%).
Studies published in 1997 were significantly less likely than those published in 2006 to describe allocation concealment or provide sample size justification, but were more likely to describe the blinding of care providers or outcome assessors (Appendix 2). Studies published in internal medicine journals (Appendix 2) or in journals with higher impact factor (Appendix 3) were significantly more likely to have better reported methodological quality.
Multivariable correlates of overall quality
Multivariable models of overall RCT quality included all variables significant in univariate analyses, namely, type of intervention, number of centers, number of patients, funding source, type of journal, and journal impact factor. We found that compared to pharmacological intervention RCT, nonpharmacological intervention RCT had lower odds of being in the highest tertiles of Jadad, Delphi list, and NRS scores [odds ratio (OR) 0.28–0.39, p = 0.008–0.033] (Table 5). Higher number of centers was significantly associated with higher Delphi list score (OR 4.7, p = 0.017) and lack of funding was significantly associated with lower NRS score (OR 0.28, p = 0.014). Number of patients, journal type, and journal impact factor were no longer significantly associated with overall RCT quality in multivariable-adjusted analyses. Sensitivity analyses that treated journal impact factor and number of patients as continuous variables (instead of categorical, as in the main analyses) did not change these findings.
Table 5. Multivariable adjusted predictors of overall trial quality.
DISCUSSION
What does this report add to the literature?
In this first systematic review of a large number of arthroplasty RCT, we found many methodological deficiencies in allocation concealment, blinding, use of placebo, ITT/modified ITT and sample size calculations, with most scores ranging from 20% to 50%, resulting in low overall trial quality. Our multivariable-adjusted analyses suggested that nonpharmacological intervention, lack of funding support, or single-center location were independent predictors of lower overall trial quality.
Limitations of our study
Our study has several limitations. First, we examined the quality of RCT reporting, not the quality of RCT; it is possible that reporting of methods was inadequate for some RCT that were conducted with more rigor17. However, readers have access only to published reports and we suggest that due attention be paid to the reporting of the RCT. Second, certain methodological aspects such as blinding and use of placebo may not be easily amenable to improvement in RCT of nonpharmacological interventions18. Nevertheless, many potential areas of improvement exist in conducting and reporting arthroplasty trials of both pharmacological and nonpharmacological interventions, including adequate use of ITT, sample size calculation, allocation sequence generation/concealment, outcome assessor blinding, etc. Third, Jadad score focuses primarily on double-blinding, which may not fairly evaluate the quality of nonpharmacological interventions, as discussed above18. We included Delphi list score to avoid this bias since Delphi list awards only 2 of the 9 points for blinding of patients and providers, but scores were low on Delphi list for both pharmacological and nonpharmacological trials. In addition, individual quality standards were still met in < 50% of cases, even for pharmacological trials, confirming that the surgical nature of 60% of the arthroplasty trials does not completely explain these deficits in trial quality reporting. Fourth, limiting to English language may limit generalizability; however, < 10% of articles were non-English, so inclusion of these articles is unlikely to have substantially changed our findings or conclusions. Last, due to multiple comparisons, approximately 8 statistically significant differences in our study may have been due to chance (total comparisons, about 150). We acknowledge this as a limitation, and thus our findings should be interpreted with some caution until confirmatory studies are available.
However, we are fairly confident that differences with p values < 0.001 are unlikely to be due to chance. We found consistent patterns for most of the differences, examining individual quality and overall quality standards and with sensitivity analyses, and we had stated our hypotheses a priori.
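The multiple-comparisons estimate in the limitations above follows from a simple expectation. As a worked sketch, assume every one of the m comparisons is tested at the significance level α = 0.05 stated in the Methods and that all null hypotheses are true; the symbols m and α are introduced here for illustration.

```latex
\[
\mathbb{E}[\text{number of false-positive results}] = m\,\alpha \approx 150 \times 0.05 = 7.5
\]
```

This is an upper-end, worst-case figure, since it assumes no true association exists in any comparison; at a threshold of 0.001, the same calculation gives 150 × 0.001 = 0.15 expected chance findings.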
Predictors of overall trial quality in multivariable models
An important finding in our study is the observation of significant independent association of nonpharmacological intervention with lower trial quality in multivariable analyses, which confirms and extends the previous similar findings from univariate analyses of osteoarthritis and general RCT9,19. Nonpharmacological arthroplasty RCT scored low on most quality standards including use of placebo and blinding of patients (which are less amenable to improvement due to ethical/practical issues). However, these trials also scored low on potential similarity of interventions, sample size justification, and ITT/modified ITT analyses, which are amenable to improvement, as much in surgical as in nonsurgical RCT. Thus, opportunities for improving arthroplasty trial reporting exist, especially for nonpharmacological trials. These improvements should be made in conjunction with reporting additional CONSORT criteria specifically focused on nonpharmacological RCT, as described by Boutron, et al18. Such improvements in study design and reporting will not only result in better study design for surgical RCT, but will allow for replication of results of RCT in different study populations, making interventions more generalizable.
Neither journal type nor journal impact factor was significantly associated with overall study quality in the multivariable analyses. This implies that RCT quality cannot be inferred by journal type or journal impact factors. Readers should be aware that high impact journals may not necessarily publish high quality studies. Simply relying on impact factor may, in fact, provide a false sense of assurance of study validity. Critically reviewing the methodology of individual trials, regardless of journal or impact factor, remains the most important safeguard.
Our findings further support inclusion of more centers as a significant predictor of RCT quality. This is likely correlated with increasing sample size and reflects that studies with larger sample sizes were also more likely to be multicenter studies. In principle, larger studies are more likely to be coordinated carefully due to increased complexity of their conduct (e.g., multiple investigators). Thus, multicenter trials may, in fact, be a surrogate for study quality. Smaller trials are often single-center, single-investigator studies with limited funding and resultant methodological pitfalls such as insufficient sample size and limited study power (Type II errors or beta errors). Based on our findings, we recommend investigators carefully consider patient-important outcomes and adequately power their studies to have high probability of success. The resultant larger sample sizes will often require multicenter rather than single-center trials. Alternatively, when arthroplasty RCT are being done as single-center RCT, authors should consider examining methods/protocol from multicenter RCT to improve the RCT quality.
Our finding of independent association of presence of funding with better overall trial quality confirms similar univariate associations noted for industry-funded trials20,21. In our study this was noted in univariate and confirmed in multivariable analyses. On further analysis, we found that the difference noted was primarily due to better reporting for RCT with private funding compared to those with no/unclear funding. No significant differences were noted between privately and publicly funded RCT. Due to limited funding resources, obtaining funding may be beyond the control of investigators in many circumstances. However, presence of financial support seems to correlate with better quality RCT, likely due to availability of better resources to plan, conduct, analyze, and report RCT.
Variation in study quality in univariate analyses
The lack of a significant association of year of publication with RCT quality in univariate analysis did not support one of our hypotheses, that study quality would have improved over time. This observation is similar to that reported for RCT of antibacterial agents22 and of low back pain23 over time, but is in contrast to studies of RCT in sepsis24 and colorectal/laparoscopic resections25, which showed improvement in quality over time. Arthroplasty RCT published in 2006 reported < 60% for most quality standards, identifying several areas for improvement.
One published study reported weak correlation of 0.21 between impact factor and trial quality of oncology RCT26. We found a significant increase in overall trial quality for journals with higher impact factor in univariate, but not in multivariable adjusted analyses. This was most notable for journals with impact factor > 2 and may have been due to more methodological rigor in higher impact journals. Our study confirmed a previous report of a better overall quality score in RCT published in internal medicine/rheumatology journals versus orthopedics/rehabilitation/surgery journals in univariate analysis9,27,28.
Comparison with previous similar studies
Compared to the earlier study of osteoarthritis RCT9, we report even lower use of placebo (18% vs 52%, respectively); blinding of patients (31% vs 65%), care providers (17% vs 47%), and outcome assessors (45% vs 85%); use of ITT/modified ITT (20% vs 56%); and sample size justification (36% vs 52%) in arthroplasty RCT. Allocation concealment (39% vs 21%, respectively) was higher, and reproducibility of intervention (97% vs 91%) and allocation sequence generation were similar to osteoarthritis RCT (43% vs 49%). These differences seem to be attributable primarily to higher proportion of RCT of nonpharmacological interventions among arthroplasty RCT (60%) versus osteoarthritis RCT (45%), which had lower quality than pharmacological RCT, both in this and in a previous study9, and difference in study populations of arthroplasty versus osteoarthritis.
Methodological reviews of RCT from various fields of medicine and surgery have found many deficiencies in their reporting9,19,22,29–32. The low overall quality score we found for arthroplasty RCT is in agreement with previous studies that included RCT in surgical specialties33,34, as well as reviews in other fields, including headache35, physical therapy19, and infectious disease22. Our findings of quality deficits (< 50% reporting) in arthroplasty trials are similar to studies of RCT in surgical specialties: a review of ophthalmology RCT found that < 50% of RCT reported sequence generation, randomization restriction, allocation concealment, allocation implementation, patient flow diagrams, and sample size calculation33. Less than one-third of obstetrics and gynecology RCT reported allocation sequence generation or allocation concealment34.
On the other hand, quality standards such as description of intervention to be reproducible and use of placebo with similar potential effect as the intervention were described in the majority of arthroplasty trials (64%–97%). Learning curve, standardization and reproducibility of the procedure, center’s volume of care, and care provider expertise36 are specific important methodological issues for surgical trials18, and therefore this finding is reassuring.
Blinding of patients and/or surgeons is a challenge in surgical RCT37,38. One study found that only 33% of surgery RCT were blinded39. Another study found that it was impossible to blind in 72% of the orthopedics RCT40. The same study also reported that in 16%, 50%, and 50% of RCT where blinding was possible for providers, patients, and assessors, respectively, the RCT did not blind or did not describe blinding40. This implies that even for orthopedics RCT that have challenges with regard to blinding of patients and in some cases providers, blinding is still possible in the majority and should be done when possible. The most room for improvement exists in blinding outcome assessors, which is possible in most instances. It is also important to ensure assessors are independent of the surgeons/providers. This alone has great potential for reducing observer bias and improving RCT quality.
In summary, we found methodological deficiencies in several areas of reporting of arthroplasty RCT. Overall trial quality is associated with trial and intervention characteristics. We have identified many areas of improvement for conduct and reporting of arthroplasty RCT.
Acknowledgment
We thank Indy Rutks from the Cochrane Library for performing the literature search, and Ruth Brady (research associate), Pearlita Ochoa (administrative assistant), and Amy Anderson for their administrative help.
Appendix 1
Characteristics of arthroplasty trials by the specific intervention type. Values are n (%)
Appendix 2
Characteristics of arthroplasty trials by journal type and year of publication. Values are n (%)
Appendix 3
Characteristics of arthroplasty randomized trials by journal impact factor. Values are n (%)
Footnotes
Supported by the NIH CTSA Award 1 KL2 RR024151-01 (Mayo Clinic Center for Clinical and Translational Research); and the Minneapolis VA Medical Center, Minneapolis, MN.
Accepted for publication July 9, 2009.