Abstract
Systematic reviews (SRs) are a structured means of knowledge synthesis used by a variety of healthcare practitioners to aid in medical decision making. The SR, if conducted rigorously, is considered to be at the top of the hierarchy of research studies. In addition to synthesizing evidence, SRs identify research priorities, address questions that may not be answerable by individual studies, and identify gaps to be addressed in future primary research. Several steps need to be taken when developing SRs to provide the best available evidence—the most essential being the assessment of risk of bias (ROB). Several ROB tools have been developed for use according to study design. Increasingly used is the assessment of certainty of evidence using approaches such as those developed by the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) working group. Whereas ROB is assessed for individual studies, the certainty of evidence is assessed for each critical or important outcome across studies. Analysis can be quantitative (meta-analysis) or qualitative (narrative), with the former intended to develop estimates of the effect measure (ie, the statistic that compares collated data), with confidence limits around that estimate. This review will focus on the steps required to develop SRs, from registration of the review protocol to the conduct, analysis, and reporting, with a focus on the assessment of ROB and certainty of evidence to ensure a methodologically rigorous process.
Systematic reviews (SRs) use structured methods to identify, analyze, and collate scientific literature.1 If conducted methodically, the SR is considered the highest tier of research studies.2 SRs generate an aggregate of knowledge and can be applied by a variety of users, including patients, healthcare providers, researchers, and policymakers.1 SRs are focused and unbiased, with explicit methods for the identification and analysis of collated research data.3 Further, SRs provide the best available evidence to inform decision making, both for clinical practice and healthcare policy.3,4
SRs differ from other types of reviews, such as scoping reviews. SRs are designed to address key clinical questions, analyze global evidence, address practice variation and conflicting results to guide decision making, identify evidence gaps, and inform future research. In contrast, a scoping review—also a type of knowledge synthesis that uses a systematic approach—has a broader scope and aims to identify concepts, theories, sources, and knowledge gaps pertaining to the objectives.5,6 Scoping reviews identify the types of evidence (eg, cohort studies, clinical trials), explore how research has been conducted, identify concept characteristics, and often precede the conduct of SRs.6 Scoping reviews are not intended to answer a clinical question (such as appropriateness or effectiveness of therapy) or to inform practice.6 The assessment of risk of bias (ROB) or critical appraisal is not an essential component of scoping reviews.5 Criteria for features to be included in a scoping review have been established,5 and descriptions of other types of reviews beyond the scope of this paper have been extensively reviewed elsewhere.7-9
SRs can assess the effectiveness and safety of a treatment, procedure, or policy; determine the accuracy of a test for diagnosis and/or prognosis; compare outcomes with different exposures described in observational studies; provide incidence estimates from single-arm studies; and analyze perspectives and experiences through qualitative evidence syntheses.10 The analysis in SRs can be quantitative or nonquantitative/qualitative, and both methods have a systematic analytic approach. The quantitative SR features a metaanalysis, which is an analysis of all pertinent and clinically significant measures of effect—whether dichotomous (eg, mortality) or continuous (eg, duration of hospitalization)—that includes confidence intervals (CIs) and an assessment of heterogeneity (variability).11 A metaanalysis uses statistical techniques to combine outcomes of individual studies to provide an overall summary statistic, with the aims of providing a more precise estimate of the effect of an intervention on an outcome and reducing uncertainty.10 A qualitative review is a descriptive review developed when data are not amenable to a metaanalysis, such as when data are sparse, come from studies of different designs, come from studies of low quality,12 and/or are too heterogeneous for statistical aggregation (Table 1).10,11
Adapted from Treadwell et al.12
Both quantitative and qualitative SRs adhere to the same criteria for conduct, including developing and registering a protocol for the SR, a systematic approach to search for relevant references, an analysis for bias, and a summary according to the best available evidence.
Prior to performing an SR, bibliographic databases and registries for SRs on the same or similar research questions should be searched to avoid duplication.13 An SR team is assembled that includes content and methodological experts who ideally do not have conflicts of interest or involvement in important decisions required for the review.14 Establishing a team allows for the completion of tasks including the selection of eligible studies, data extraction, and assessment of the ROB by ≥ 2 people independently to minimize the probability of errors.14
Patient and public involvement in SRs is essential, similar to that in randomized controlled trials (RCTs), and is increasingly being described. An analysis of 56 SRs demonstrated that 59% solely involved patients, 18% solely engaged the public, and 23% included both. Patients and the public were involved at various phases of the review process, though predominantly in the development of the question and the interpretation of findings. Involvement can include focus groups or ongoing patient participation. Acknowledgment or authorship may be considered for the latter, although this is not routinely offered.15
Registration of a protocol for an SR
SRs are developed to be transparent, robust (ie, the degree to which minor alterations in data do not alter conclusions),12 and as free from bias as possible.4 Conducting a high-quality SR requires the development of a protocol that defines the main objectives, design, and planned analyses for the review. A protocol written in advance of the review and completed prior to determining study eligibility is ideal to ensure that the review methods are transparent and reproducible. Publication of protocols (and completed reviews) permits the tracking of revisions, enabling an examination of the effect that changes may have on the results of the review.16
Prospective registration of SR protocols may also prevent unintended duplication.13 Protocol registration differs from publication of a manuscript for a protocol as the latter will undergo peer review.13 There are several options for registering a protocol for an SR, including registries that are specific to SRs and those that include SRs.13 Organizations conducting or commissioning SRs, such as the Cochrane Collaboration or the Joanna Briggs Institute (JBI), have their own databases of ongoing and published SRs that are restricted to reviews performed within their organizations.13 The International Prospective Register of Systematic Reviews (PROSPERO), established in 2011, is one of the most commonly used registries. PROSPERO includes several mandatory fields to describe the clinical question, inclusion/exclusion criteria, data collection process, critical appraisal, primary and secondary outcomes, data synthesis, and investigators.17 The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines—originally developed for RCTs—have been extended to include reporting guidelines for SR protocols (PRISMA-P) and include similar fields to PROSPERO.18 Completion of protocols prior to conducting the literature search leads to a more comprehensive literature search strategy. PRISMA-P is intended to facilitate the process of reporting a protocol and registration with PROSPERO.19 Publication of well-developed SRs now includes registration information and often the submission of the SR protocol as a supplementary appendix. 
Registration of an SR and the required steps in the development of a protocol as defined by PRISMA-P19 are akin to registration with clinical trial websites such as ClinicalTrials.gov and the International Standard Randomised Controlled Trial Number (ISRCTN) in the United Kingdom, as well as the items necessary for protocols for RCTs using the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) checklist.20 For both SRs and RCTs, these checklists are designed to optimize transparency, reproducibility, and completeness in design and conduct.20
Steps in the development of an SR
Various agencies have published guidance for the development of SRs, including the Agency for Healthcare Research and Quality (USA),21 JBI (Australia),22 and the Cochrane Collaboration (UK).14 The general criteria used for the development of SRs are similar across these organizations. Regardless of the criteria used for quantitative research, reporting of SRs is predominantly expected to follow the widely endorsed PRISMA reporting guidelines developed for RCTs and the Meta-analysis of Observational Studies in Epidemiology (MOOSE) criteria.23,24 Several PRISMA reporting checklists have been developed to include reporting of SRs for studies assessing diagnostic accuracy25; outcome measurement instruments (ie, how an outcome is measured), such as laboratory tests and scales26; complex interventions (ie, those with multiple components)27; and those reporting harms,28 among others.29 PRISMA and MOOSE checklists aim to improve the reporting of SRs and metaanalyses and include many of the criteria used by agencies to guide the development of SRs.1 In addition to referring to guidelines for the development of SRs, it is beneficial to refer to reporting guidelines early in the review process to ensure all elements are included when planning the SR. Fillable PRISMA and MOOSE checklists are available to facilitate the completion of this step.30,31 The features described in PRISMA and MOOSE have been grouped below into 4 categories for simplicity, each of which is essential for methodical development.
1. Clinical/research question
A carefully formulated research question is the critical first step in the process of developing an SR. The rationale, objectives, and scope are described in detail to permit progression into the next phases: describing the eligibility of patients and populations; the intervention, screening method, or diagnostic test; the selection of comparators; and the selection of primary and secondary outcomes, including surrogates. This step is summarized in both qualitative and quantitative studies in PICOT (Patient/Population – Intervention/Test – Comparison/Comparator – Outcome – Time) format, sometimes including study design. The selected outcomes should reflect those pertinent to clinical practice, not merely those reported in studies. The research question serves as the guide for the systematic search strategy, study eligibility, and citation selection.
2. Study selection
Search strategy. The design of a search strategy for the selection of eligible studies requires the assistance of an information specialist/librarian with experience in conducting SRs, using accurate search terms and sources to ensure the transparency and reproducibility of the search. Generally, ≥ 2 databases are searched to ensure all eligible studies are included.32 Search sources outside of medical databases include grey literature (eg, website and policy publications), government sources, and nongovernmental documents.33 The inclusion of grey literature is routinely recommended by some agencies.34 Inclusivity and completeness are key; thus, avoiding the sole use of English-language sources, searching clinical trial registries, and contacting authors for incomplete information are advised.34 The search strategy should ideally be peer reviewed, as this is deemed to improve the quality and comprehensiveness of the search strategy.35 The Peer Review of Electronic Search Strategies (PRESS) is a structured tool that includes guidance and checklists for the completion of the peer review process.35 A detailed search strategy for ≥ 1 medical database is generally included in the publication of an SR to ensure reproducibility.
Selection of studies. The selection criteria and the selection process (ie, the assessment of citations/references by reviewers, which should be completed independently by at least 2 reviewers, as well as the approach to resolving divergences) are determined prior to completion of the search. Criteria used for the selection of citations are generally piloted to ensure that studies of relevance are included. The process of study selection is summarized in a PRISMA diagram (Figure).1 Notably, separate searches may be needed for quantitative SRs if adverse events or harms for an intervention are infrequent and are not adequately assessed.
PRISMA flow diagram for new systematic reviews, which includes searches of databases, registers, and other sources.1 PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses.
The study designs to be included should be chosen with consideration of what designs are available, for example, the availability of RCTs in SRs focusing on interventions. If large, well-designed RCTs are available, including only RCTs in the SR is a consideration. In the absence of RCTs, or where RCTs have small sample sizes or have not been rigorously conducted, observational studies are incorporated into the SR and analyzed separately from RCTs.
3. Data abstraction
The data abstraction process comprises the abstraction of data and a description of the abstraction process, similar to the selection of citations. This typically outlines the reviewers who will complete the data abstraction and whether the abstraction will be conducted in duplicate (ie, 2 reviewers or more) and independently. Overall, the data abstraction section specifies the characteristics that will be used to satisfy the PICOT criteria, including the conduct of the study, outcome measures, and financial support (eg, industry support).36 For example, study population characteristics should include elements such as sex, age, and comorbidities, among others, to determine similarity to the population of interest. The intervention is described to enable comparability; examples include medication formulation, and route and frequency of administration. Outcome features comprise definitions of outcomes and similarity to the outcomes of interest, including the use of actual outcomes (eg, mortality) or surrogate outcomes (eg, disease-free survival instead of overall survival), statistical measures/units, and a description of the scale (eg, validation) and its administration (eg, self-administered or administered by the research team), if applicable. Software options for citation libraries, mapping of selection criteria, and data abstraction are available (eg, Distiller SR, Covidence37,38), as are examples of manual data abstraction forms (eg, Cochrane Collaboration36).
Assessment of ROB. An assessment of ROB—also referred to as quality of a study—is an essential component of the data abstraction process. Bias is assessed at the levels of study design, conduct, and analysis, and refers to the confidence in the estimates for outcomes that have been generated.39 Limitations in the study design, conduct, or analysis can lead to systematically inaccurate results.40 Bias in studies can be classified into 4 general categories41: (1) selection bias, the process that leads to groups not being comparable (eg, when allocation is according to prognosis); (2) performance bias, the process of providing dissimilar care (eg, when allocation is not concealed); (3) detection bias, wherein an outcome is influenced by knowledge of the intervention, which tends to be more important for subjective outcomes (eg, pain assessment) than objective outcomes (eg, mortality); and (4) attrition bias, which occurs when participants are lost to follow-up, when there are missing data, or when there are deviations from a study protocol, as remaining participants may not be representative of the population.41 ROB assessments are intended to detect these biases to ensure that outcomes reflect true estimates, as low-quality studies generally exaggerate treatment effects.42
Numerous tools are available to assess ROB for quantitative and qualitative research. ROB assessments for quantitative research can be checklists (eg, the checklist developed by JBI), scales (eg, the Jadad scale for RCTs),43 or domain-based (ie, an assessment at different stages of a study, such as the Cochrane Collaboration ROB tool for RCTs).44 Checklists provide a variety of quality measures scored individually, whereas scales also provide a total score by summing features, assuming equal weights for each individual feature (although weights may not be equal) or assigning more emphasis to specific features. Scales have previously been demonstrated to be problematic, as pooling studies according to various scales will lead to different estimates of effect and CIs.42,45 The assessment of domains provides a descriptive summary of and emphasis on individual components that can lead to bias, such as allocation concealment as a measure of performance bias.
The necessity for critical appraisal in qualitative research has been established; however, tools used to assess ROB are thought to represent a unified approach without differentiating the distinct methodological approaches for qualitative research (such as grounded theory, interpretative phenomenology, or discourse analysis) and methods for data collection (such as interviewing, use of focus groups and observations).46,47 The available tools include checklists and frameworks for ROB assessment.46,47 Checklists for qualitative studies are similar to those used in quantitative studies, whereas frameworks assess concepts of (1) transferability (ie, the ability to make connections between data and wider community settings); (2) credibility (ie, the appropriateness of participants’ accounts, as interpreted by the researcher); and (3) reflexivity and transparency (ie, the influence of the researcher on the analysis, rather than grounding the analysis and its transparency).46
Instruments used to assess ROB (ie, checklists/scoring systems, scales, domain scores) for quantitative SRs have been evaluated. The methods are similar to each other in that their intent is to determine whether results are plausible and without flaws, and they permit the inclusion of other biases that may be specific to the clinical question, such as variable duration times for outcome assessment.48 Each method has advantages and disadvantages. An analysis of scoring systems suggested that an overall score may not necessarily correlate with the overall quality of a study, and scales may provide different results for the same study.45 The components used in checklists differ somewhat; although all incorporate consistent features such as masking and allocation concealment, they do not require as detailed a description as the use of domains.49 The use of domains is considered a standardized approach to ROB, but interrater agreement may be variable and time to completion may be lengthy.50,51 The more commonly used methods, and published advantages and disadvantages specific to these methods, are described in Table 2.40,50-63
Subjectivity and judgment are required in the assessment of ROB of studies in an SR. As an example, in a study addressing transfusion, blinding may not be considered critical, whereas the blinding of participants for subjective outcomes such as pain would be considered critical.48 Further, an overall categorization of ROB (ie, high or low ROB) for each study subsequently needs to be determined depending on the importance of each domain on the outcomes.44 Conducting dual, independent reviews will limit additional unnecessary subjectivity.
4. Data synthesis
A metaanalysis is a statistical method to collate outcomes of studies in an SR,64,65 and is conducted when outcomes—as well as the measurement statistics used for those outcomes—are predominantly the same. Metaanalyses have the potential to (1) improve precision, particularly if there are many small studies that cannot provide convincing evidence of the effect of an intervention in isolation; (2) answer new questions not addressed in individual studies; and (3) address controversies arising from conflicting results of studies as well as explore differences.66 The intent to include a metaanalysis is prespecified in an SR protocol and in the registration of the SR. The rationale for selection of an effect measure (the statistic that compares outcome data) in a metaanalysis67 is also generally prespecified. Common effect measures include odds ratios (ORs) or risk ratios for dichotomous/binary outcomes and mean differences or standardized mean differences for continuous outcome variables. The risk ratio (relative risk) and OR are relative measures, whereas the risk difference is an absolute measure.66 Table 3 defines these measures and describes considerations for selection.
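As an illustration of these effect measures, the short sketch below computes a risk ratio, OR, and risk difference from a 2 × 2 table; the counts are invented for illustration and are not drawn from any study discussed here.

```python
# Hypothetical 2x2 table for a dichotomous outcome (eg, mortality):
#                 events   nonevents
# intervention      15        85      (n = 100)
# control           30        70      (n = 100)
a, b = 15, 85  # intervention arm: events, nonevents
c, d = 30, 70  # control arm: events, nonevents

risk_intervention = a / (a + b)  # 0.15
risk_control = c / (c + d)       # 0.30

risk_ratio = risk_intervention / risk_control        # relative measure
odds_ratio = (a / b) / (c / d)                       # relative measure
risk_difference = risk_intervention - risk_control   # absolute measure

print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}, RD = {risk_difference:.2f}")
```

Note that for these counts the OR (0.41) is further from 1 than the RR (0.50): when the outcome is common, the OR is more extreme than the RR, which is one consideration when selecting between these relative measures.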
Selection of specific summary statistics also depends on whether values are consistent, have the same mathematical properties, and can be easily understood.66 The estimate of the effect measure is generally expressed with the degree of uncertainty, such as a CI or standard error.65,67 A CI provides a range of probabilities within which the true estimate lies, and is a measure of precision as it reflects the adequacy of the sample size used for the true estimate.68 Several software options are available to estimate effect measures.69,70 An assessment of the effect of overall ROB (or aspects that are considered more significant in the assessment, such as allocation concealment) on the effect measure (ie, sensitivity analysis, which is the primary analysis with the substitution of alternate values according to ROB)66 permits for the examination of the robustness of the metaanalysis.41 Subgroup analysis (ie, dividing participants or studies into subgroups, such as an analysis of male vs female individuals or studies of different geographic locations) may be conducted to assess variability or to explore specific questions.66
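To make the link between a CI and precision concrete, the following sketch computes a 95% CI for a risk ratio on the log scale, the usual approach for ratio measures; the counts are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical counts: events / total in each arm (illustrative only)
events_int, n_int = 18, 150   # intervention arm
events_ctl, n_ctl = 30, 145   # control arm

rr = (events_int / n_int) / (events_ctl / n_ctl)

# Standard error of log(RR); the CI is computed on the log scale and
# back-transformed, because the log of a ratio is approximately normal
se_log_rr = math.sqrt(1 / events_int - 1 / n_int + 1 / events_ctl - 1 / n_ctl)
ci_low = math.exp(math.log(rr) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

A wider interval reflects fewer events (a smaller effective sample size); larger studies shrink the standard error and therefore the CI, which is why the CI is read as a measure of precision.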
The synthesis model for a metaanalysis can be a fixed-effect or random-effects model.71 A fixed-effect synthesis presumes that there is a common treatment effect across all study settings, whereas in a random-effects metaanalysis, treatment effects vary from study to study.71 In a random-effects model, the differences in observed effect sizes are due not only to random error, as in a fixed-effect model, but also to variation in true treatment effects (referred to as heterogeneity).71 The summary effect from a fixed-effect model is an estimate of the assumed common underlying treatment effect; in contrast, for the random-effects model, the summary effect is an estimate of the average of the distribution of treatment effects across various study settings.71 As between-study heterogeneity is common and may not be identifiable, the random-effects model is the standard (ie, default) model for metaanalyses and is conducted if prespecified in a protocol, even in the presence of high heterogeneity.72
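The difference between the two models can be sketched with an inverse-variance pooling example. The per-study effects below are hypothetical log risk ratios invented for illustration, and the between-study variance (tau²) uses the common DerSimonian-Laird estimator, one of several available methods.

```python
import math

# Hypothetical per-study log risk ratios and standard errors (illustrative only)
log_rr = [-0.69, -0.22, -0.51, 0.10]
se = [0.25, 0.18, 0.30, 0.20]

# Fixed-effect model: inverse-variance weights, one common underlying effect
w_fixed = [1 / s**2 for s in se]
pooled_fixed = sum(w * y for w, y in zip(w_fixed, log_rr)) / sum(w_fixed)

# DerSimonian-Laird estimate of the between-study variance (tau^2)
q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(w_fixed, log_rr))  # Cochran's Q
df = len(log_rr) - 1
c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q)  # I^2: proportion of variability due to heterogeneity

# Random-effects model: weights also incorporate tau^2, so small studies gain
# relative weight and the summary estimates an average of true effects
w_random = [1 / (s**2 + tau2) for s in se]
pooled_random = sum(w * y for w, y in zip(w_random, log_rr)) / sum(w_random)

print(f"fixed RR = {math.exp(pooled_fixed):.2f}, "
      f"random RR = {math.exp(pooled_random):.2f}, I2 = {i2:.0%}")
```

With these invented data the two summary estimates differ because the random-effects weights are more equal across studies; when tau² is estimated as zero, the two models coincide.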
Certainty of evidence. In addition to the assessment of individual studies, the overall certainty of evidence must be assessed for each outcome; the most widely used method is that developed by the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) working group (Table 4).73,74 Evaluating the reliability and validity of the data for each outcome, by determining the methods used to assess them in each individual study, is required, as the quality of data for an outcome may differ across studies.64 For instance, an outcome may be primary for 1 study and be systematically measured, but it may be a secondary or tertiary outcome in another study and may not be as carefully measured.64
Assessment of certainty of evidence for outcome assessment across studies according to GRADE.39
GRADE categorizes studies into 2 groups: RCTs and observational studies. The former is assumed to be associated with less ROB but can be downgraded in quality depending on the overall ROB as well as 4 additional features: (1) the directness of evidence (whether the outcome directly answers the health question), (2) precision (the extent of confidence in the estimate of effect to support a decision),75 (3) the inconsistency of results (differing estimates of treatment effects across studies), and (4) the presence/absence of publication bias. Publication bias refers to studies not being submitted or published because of the strength and direction (ie, negative) of the trial result.76 Studies with statistically significant results are more likely to be published, and those with negative results often face delayed publication.77 Visual assessments (ie, a funnel plot, in which asymmetrical representation of studies suggests publication bias) and statistical tests for asymmetry can be conducted to gauge publication bias. The certainty in the quality of evidence can also be upgraded for observational studies and nonrandomized studies, such as in cases where there is an evident dose-response relationship (Table 4). The low certainty of evidence assigned by GRADE for nonrandomized studies reflects the fact that causation cannot generally be determined by nonrandomized studies. These nonrandomized studies, however, play a considerable role in identifying associations; can be complementary (eg, provide information in different populations), often providing long-term outcomes of benefit (not available in RCTs) or harm; and may be more reflective of usual practice.
Thus, nonrandomized studies may provide higher-quality evidence than RCTs.78 The GRADE approach is also used for qualitative research and assesses methodological limitations (ie, limitations in design or conduct), adequacy of data (ie, richness and quantity), coherence (ie, clarity and rationale of the fit of data), and relevance (ie, applicability).79
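One of the statistical tests for funnel plot asymmetry mentioned above is Egger's regression test, which regresses the standardized effect on precision; an intercept far from zero suggests asymmetry. The sketch below uses invented effect sizes deliberately constructed so that smaller studies show larger effects.

```python
# Hypothetical study effects (log ORs) and standard errors; smaller studies
# (larger SEs) were given larger effects to mimic publication bias
effects = [-0.8, -0.6, -0.5, -0.4, -0.3]
ses = [0.40, 0.30, 0.22, 0.15, 0.10]

# Egger's test: regress the standardized effect (y/SE) on precision (1/SE);
# in a symmetric funnel plot the regression intercept is close to zero
y = [e / s for e, s in zip(effects, ses)]
x = [1 / s for s in ses]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x  # Egger's bias estimate

print(f"Egger intercept = {intercept:.2f}")
```

A full implementation would also test the intercept against zero (eg, with a t test on the regression intercept); here the clearly nonzero intercept simply reflects the asymmetry built into the invented data.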
Overall, certainty in the evidence for an outcome is presented as high, moderate, low, or very low, with high certainty suggesting that future research is unlikely to change the confidence in the estimate of effect, and very low certainty representing an uncertain estimate of effect.80 GRADEpro (Evidence Prime), software for GRADE analyses, generates tables that summarize the features used in the categorization of the certainty of evidence (ie, the summary of findings table) as well as estimates of effect based on data from metaanalyses.9,81 In the absence of an estimate from a metaanalysis, a descriptive assessment of overall certainty of evidence can be used.
A completed SR will itself be evaluated for quality and ROB. Two commonly used tools, A Measurement Tool to Assess Systematic Reviews, version 2 (AMSTAR2)82 and Risk of Bias in Systematic Reviews (ROBIS),83 use a domain-based assessment of bias for SRs of RCTs and observational studies, and include such items as protocol registration (AMSTAR2), adequacy of the literature search, eligibility of individual studies, ROB of included individual studies, appropriateness of metaanalytical methods, and consideration of ROB during interpretation. AMSTAR2 also includes conflicts of interest and is considered easier to use, whereas ROBIS requires more expertise.84 Prior to conducting an SR, awareness of the requirements of PRISMA for reporting of SRs, and of AMSTAR2 and ROBIS for the evaluation of an SR, is of similar importance.
Conclusion
To conduct a rigorous SR and provide the best available evidence for decision making, (1) awareness of other published reviews or protocols to avoid duplication and (2) adherence to criteria for developing and reporting an SR are essential. Although the steps in conducting an SR are structured, several judgments are needed in the SR process; these allow for a transparent, reproducible, and credible method when a detailed description is provided. The ROB of each study and the certainty of evidence for each outcome are critical for assessing the plausibility and accuracy of findings. Advantages and disadvantages have been described for each method of assessing ROB. The selection of a method should be based on ensuring confidence in the estimates of the study outcomes.
Footnotes
CONTRIBUTIONS
NS designed the framework, conducted the search, critically reviewed and interpreted the intellectual content of the studies, prepared the manuscript, approved the final version, and is accountable for all aspects of the work. RD contributed to the interpretation of data, reviewed for important intellectual content, approved the final version, and is accountable for all aspects of the work.
FUNDING
The authors declare no funding or support for this research.
COMPETING INTERESTS
The authors declare no conflicts of interest relevant to this article.
ETHICS AND PATIENT CONSENT
Institutional review board approval and patient consent were not required for this work.
- Accepted for publication March 18, 2025.
- Copyright © 2025 by the Journal of Rheumatology
This is an Open Access article, which permits use, distribution, and reproduction, without modification, provided the original article is correctly cited and is not used for commercial purposes.