Abstract
Objective. Outcome Measures in Rheumatology (OMERACT) Filter 2.1 revised the process used for core outcome measurement set selection to add rigor and transparency in decision making. This paper describes OMERACT’s methodology for instrument selection.
Methods. We presented instrument selection processes, tools, and reporting templates at OMERACT 2018, introducing the concept of “3 pillars, 4 questions, 7 measurement properties, 1 answer.” Truth, discrimination, and feasibility are the 3 original OMERACT pillars. Based on these, we developed 4 signaling questions. We introduced the Summary of Measurement Properties table that summarizes the 7 measurement properties: truth (domain match, construct validity), discrimination [test-retest reliability, longitudinal construct validity (responsiveness), clinical trial discrimination, thresholds of meaning], and feasibility. These properties address a set of standards which, when met, answer the one question: Is there enough evidence to support the use of this instrument in clinical research of the benefits and harms of treatments in the population and study setting described? The OMERACT Filter 2.1 was piloted on 2 instruments by the Psoriatic Arthritis Working Group.
Results. The methodology was reviewed in a full plenary session and facilitated breakout groups. Tools to facilitate retention of the process (i.e., “The OMERACT Way”) were provided. The 2 instruments were presented, and the recommendation of the working group was endorsed in the first OMERACT Filter 2.1 Instrument Selection votes.
Conclusion. Instrument selection using OMERACT Filter 2.1 is feasible and is now being implemented.
- OUTCOME MEASURES
- OUTCOME ASSESSMENT
- HEALTH STATUS INDICATOR
- OMERACT
- REPRODUCIBILITY OF RESULTS
- RELIABILITY AND VALIDITY
Core outcome sets (COS) are increasingly recognized as a minimum set of outcomes that will be measured across all clinical trials in a given field to facilitate comparisons of interventions and metaanalyses, and to avoid selective outcome reporting bias1. Since its inception in 1992, Outcome Measures in Rheumatology (OMERACT) has promoted and supported the development of COS2. Although its main focus has been musculoskeletal disorders and rheumatologic conditions3, the OMERACT approach has also found application in other fields4,5.
OMERACT divides the task of creating a COS into 2 components: first, determining what needs to be measured (core domain sets), and second, deciding how to measure each of the domains, also referred to as “instrument selection.” This in turn leads to a core outcome measurement set, when there is at least 1 outcome measurement instrument identified for each domain. In 2012, OMERACT voted to revise its processes to recognize the growth of both the organization and the literature available on measurement properties of any given outcome measurement instrument. The creation of a core domain set was outlined by Boers, et al6 and is expanded on in this issue in 2 companion papers7,8. The purpose of this paper is to describe the data-driven, evidence-based instrument selection process of the OMERACT Filter 2.1 methodology.
MATERIALS AND METHODS
Foundations of the OMERACT Filter: 3 pillars, 4 questions, 7 measurement properties, 1 answer
Truth, discrimination, and feasibility are the pillars of the OMERACT Filter9. Truth refers to whether the measure’s scores can be shown to be truthful, measuring what was intended. Discrimination asks whether the measure discriminates between situations of interest, such as between treatment arms in a clinical trial. Finally, feasibility answers questions about the practicality of using the tool: time, cost, and burden. Together, these 3 pillars describe a set of standards which, when met, answer one question: Is there enough evidence to support the use of this instrument in clinical research of the benefits and harms of treatments in the population and study setting described?
In OMERACT Filter 2.1, we recognized that the 3 pillars of the original OMERACT Filter are best represented by 4 signaling questions (Figure 1A). Two questions split the truth pillar into a practical appraisal of the instrument and its content with “Is it a match with the target domain?”, and a more data-driven, hypothesis-testing assessment of the instrument’s scores with “Do the numeric scores make sense (i.e., are the scores relating to other measures or the testing situation in a way it should if it measures the domain well)?” The question reflecting the discrimination pillar is “Can it discriminate between groups of interest?”, assessing whether the instrument identifies differences between treatment and control groups found in clinical trials. The signaling question, “Is it practical to use?” i.e., in cost, burden, and access, reflects the feasibility pillar.
In practice, when this method is used to assess an instrument, the signaling questions are slightly reordered, putting practical appraisals of concept match and feasibility ahead of the review of the evidence available on the more data-driven features of testing truth and discrimination. This saves time and resources because it allows instruments to be set aside if they are not identifying the target domain concept or are not feasible for use in the target application. This reordering is seen in the bottom of Figure 1B.
The 4 signaling questions and the traffic light ratings they receive are linked in the OMERACT Filter 2.1 Instrument Selection Algorithm (Figure 2). Ratings are completed for each question and then combined into an overall rating for the instrument. Red always means “stop, do not continue,” amber means “a caution is raised, but you can continue,” and green means “go, this question is definitely answered affirmatively.” White circles indicate an absence of evidence. In that case, the working group decides whether to create the evidence needed or to treat the absence as a gap, in which case further evaluation stops because evidence is missing. Once all 4 questions are answered, the working group uses this evidence to recommend an overall level of endorsement (Figure 2 bottom panel).
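The walk through the signaling questions can be sketched in code. The following is a minimal illustration only, not the algorithm of Figure 2: the combination rule (halt on the first red or white rating, otherwise report amber if any caution was raised) and the names `Light`, `QUESTIONS`, and `overall_rating` are assumptions made for this example. The questions are listed in the reordered sequence used in practice.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Light(Enum):
    RED = "red"      # stop, do not continue
    AMBER = "amber"  # a caution is raised, but review can continue
    GREEN = "green"  # question answered affirmatively
    WHITE = "white"  # absence of evidence


# Signaling questions in the reordered sequence used in practice:
# practical appraisals first, data-driven questions afterwards.
QUESTIONS = (
    "Is it a match with the target domain?",
    "Is it practical to use?",
    "Do the numeric scores make sense?",
    "Can it discriminate between groups of interest?",
)


def overall_rating(ratings: Dict[str, Light]) -> Tuple[Light, Optional[str]]:
    """Return (overall light, question that halted the review or None).

    Assumed rule for this sketch: a red or white rating halts the review
    at that question; otherwise the result is amber if any question was
    amber, and green only if every question was green.
    """
    overall = Light.GREEN
    for question in QUESTIONS:
        light = ratings[question]
        if light in (Light.RED, Light.WHITE):
            return light, question
        if light is Light.AMBER:
            overall = Light.AMBER
    return overall, None
```

In this sketch a white rating is treated as a gap that halts review; a working group that chooses to create the missing evidence would instead re-rate the question once the new study is available.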
Instrument selection using the OMERACT methodology
The step-by-step process of OMERACT’s instrument selection methodology will be described briefly here following the steps illustrated in Figure 3, “How to choose an instrument the OMERACT Way.” A detailed description of these steps is available in The OMERACT Handbook10.
Revisit the domain definition. Prior to embarking on any instrument selection process, working groups should review the domain(s) each instrument is trying to identify. This is done using the definitions described in the OMERACT Onion document7 and the OMERACT Filter 2.1 Framework8.
Find candidate instruments. Creating a new instrument is a difficult task, and groups can often identify existing instruments by searching the literature11,12,13 or speaking to experts in the field.
Is the instrument a match for the target domain? Working groups then address the signaling questions described above. Armed with the domain definition and the candidate instrument, working groups can identify whether the instrument or outcome measure (terms used here interchangeably) matches the intended target domain. This is done by seeking the experiences of those who will respond to the instrument. Working groups should talk to people, particularly those with the lived experience of the disease and domain, to see whether the instrument identifies the breadth and depth of the experience. Templates for surveying respondents are provided in the OMERACT Instrument Selection Workbook (www.omeract.org/resources). Available data can be used to examine whether the response distribution for the scale is appropriate. High ceiling or floor effects in people experiencing the domain (e.g., physical limitation) could flag that the scale will not detect the differences of interest in the relevant population, although they could also reflect an expected level for certain indices or aggregate scores14. Cognitive interviews can be used at this stage to examine how items are interpreted; for example, whether respondents with lived experience would prefer different question stems, anchors, or response options15.
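As an illustration of examining the response distribution, the proportion of respondents scoring at the extremes of a scale can be computed directly from available data. This is a sketch under stated assumptions: the function names are hypothetical, and the 15% cut-off is an illustrative default chosen for this example, not an OMERACT requirement.

```python
from typing import Dict, Sequence, Tuple


def floor_ceiling_proportions(
    scores: Sequence[float], scale_min: float, scale_max: float
) -> Tuple[float, float]:
    """Return the proportions of scores at the scale minimum and maximum."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    at_floor = sum(1 for s in scores if s == scale_min) / n
    at_ceiling = sum(1 for s in scores if s == scale_max) / n
    return at_floor, at_ceiling


def flag_effects(
    scores: Sequence[float],
    scale_min: float,
    scale_max: float,
    threshold: float = 0.15,  # assumed illustrative cut-off
) -> Dict[str, bool]:
    """Flag a floor or ceiling effect when the proportion exceeds threshold."""
    at_floor, at_ceiling = floor_ceiling_proportions(scores, scale_min, scale_max)
    return {"floor": at_floor > threshold, "ceiling": at_ceiling > threshold}
```

A flag raised this way is a prompt for discussion rather than a verdict, because a concentration of scores at one extreme may be the expected level for certain indices or aggregate scores.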
Is it feasible to use this outcome measure? Feasibility is a practical assessment of the burden of use, where burden could be cost, time, equipment, personal burden for the respondent (e.g., language, health literacy) or administrator (e.g., required training), the interpretability of the scores, and other similar considerations16. Some of these features can be assessed using surveys or checklists compiled with working group and other input (see OMERACT Instrument Selection Workbook; www.omeract.org/resources) or through other structured techniques in focus groups or nominal group processes. Occasionally assessments of feasibility (time to complete the assessment or survey, complexity of language, or technical demands of interpreting imaging results) are published in the literature; however, OMERACT will also accept the appraisal of the working group for the answer to this question.
Narrowing the number of candidate measures. At the next stage, the working group determines whether there is a clear match of an outcome measurement instrument with the target domain and whether the instrument is feasible for use in the intended setting. An instrument that is not a good match to the target domain definition or is not feasible should be set aside, because these shortcomings are unlikely to be easily addressable. This is a key step in the process and often leads to a shortened list of candidate instruments. Working groups are asked to record the level of agreement within their working group and any comments made when either proceeding with or setting aside an outcome measurement instrument at this point.
Gather evidence for the next 2 signaling questions. The last 2 questions (“Do the numeric scores make sense?” and “Can it discriminate between groups of interest?”) are represented by 5 additional measurement properties that require data-oriented answers: construct validity (scores relate to other known measures in a way that is consistent with the underlying domain of interest), test-retest reliability (no change in score when patients are stable, estimate of day-to-day variability), longitudinal construct validity (responsiveness; ability to detect change when it has occurred), ability to discriminate in a clinical trial (specific ability to detect change between arms in a clinical trial), and thresholds of meaning (benchmarking scores and changes in score for interpretation; as seen in Figure 1B). The evidence to support performance of an instrument on each of these properties is based on the growing body of literature on measurement properties17,18. In response to this, the OMERACT Filter 2.1 has adopted standard systematic review techniques as described by Slavin19 to identify and process available literature. Slavin describes the stages of such a review as (i) gathering the evidence, (ii) appraisal of quality of the evidence, (iii) data extraction, and (iv) synthesis of findings. The result is parallel systematic reviews, one for each of the measurement properties of interest. The process is described briefly here and in more detail in The OMERACT Handbook10.
Gathering the evidence on the measurement properties. Systematic literature searches are conducted with the support of library scientists, using standard search term templates available to working groups. The search terms focus on the measurement properties and the relevant patient population for the outcome measure. Searches are often run by a librarian or information scientist; the working group then screens the titles and abstracts to confirm that they match the instrument and are about measurement properties. Positive or possible articles are obtained for full-text review of their relevance, and to see which measurement properties each article addresses. At this point, working groups begin building their Summary of Measurement Properties (SOMP) table, where the relevant articles are listed (Table 1) and the measurement properties covered are recorded. Importantly, only the 7 measurement properties relevant to the application of an existing measure in a clinical trial are reviewed. Tracking of the yield and selection of articles should be rigorous and reported in a PRISMA flow chart (http://www.prisma-statement.org/).
Evidence for OMERACT endorsement can also be created by the working groups by conducting a study to address any gaps found in the SOMP table. The methods and results of these studies are independently reviewed by at least 2 members of the Technical Advisory Group of OMERACT (https://omeract.org/tag) before they are considered for inclusion.
Quality assessment. All evidence, both that found in the literature and new evidence created by the working group, undergoes quality assessment. Several quality assessment tools exist in the literature, though few specifically address our goal of excluding studies with critical flaws that could lead to a risk of bias in the estimation of the measurement property performance. COSMIN (4-point checklist version) is one frequently used critical appraisal tool for measurement studies20. Only certain items in the checklist offer a “poor” response category. This rating is reserved for situations in which the methods reported are flawed enough that the evidence should not be included in the review owing to risk of bias. In 2015, we worked with the COSMIN group and reworded these specific items into a positive, dichotomized response to identify whether the study reported good methods and had successfully avoided the risk of bias indicated by that poor rating. Focusing only on measurement properties needed for OMERACT Filter 2.1, we added 2 measurement properties important to OMERACT that were not in COSMIN (clinical trial discrimination and thresholds of meaning), to produce the COSMIN-OMERACT Good Methods Checklist found in our current OMERACT Handbook10.
The Good Methods Checklist items are assessed independently by 2 persons, and agreement is sought between them. Any newly created evidence has the Good Methods check done by 2 members of the technical advisory group independent of the working group. This is rated in traffic light format again and the color entered in the cells of the SOMP table (Table 1), with green or amber indicating good methods, and red indicating a high risk of bias. Only studies that have passed the Good Methods Check move to the next stage of extracting information and the results of the measurement property tests.
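The dual-rater Good Methods check can be represented with a small record per study and measurement property. The sketch below is illustrative only: the class and function names, and the `agreed` field used to record the rating reached after discussion, are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Green or amber indicates good methods; red indicates a high risk of bias.
PASSING = {"green", "amber"}


@dataclass
class GoodMethodsCheck:
    study: str                    # hypothetical study label
    measurement_property: str     # e.g., "construct validity"
    rater_a: str                  # "green", "amber", or "red"
    rater_b: str
    agreed: Optional[str] = None  # rating reached after discussion, if raters differ

    def final_rating(self) -> Optional[str]:
        """Return the agreed rating, or None while a disagreement is unresolved."""
        if self.rater_a == self.rater_b:
            return self.rater_a
        return self.agreed


def studies_for_extraction(checks: List[GoodMethodsCheck]) -> List[GoodMethodsCheck]:
    """Keep only the studies that passed the Good Methods check."""
    return [c for c in checks if c.final_rating() in PASSING]


def unresolved(checks: List[GoodMethodsCheck]) -> List[GoodMethodsCheck]:
    """List the checks on which the 2 raters still need to reach agreement."""
    return [c for c in checks if c.final_rating() is None]
```

Keeping both individual ratings alongside the agreed rating preserves the audit trail that transparent reporting requires.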
Data extraction. The results of the testing of measurement properties are extracted from the publications and placed into a narrative summary of the testing procedures, study characteristics, and results. Enough detail is provided in a data extraction table to allow a user of the data to follow the logic and rationale for the decisions made. Results are compared to international recommendations for acceptable performance in terms of results of a measurement property study. In the SOMP table, a “+” is placed for a positive performance, “+/–” for equivocal, and “–” for inadequate performance.
Synthesis. The next step is the synthesis of evidence that has been appraised as at least adequate-quality evidence (green or amber color), into a rating of the performance of the instrument for each of the 7 measurement properties in the SOMP. Both published and new studies are considered. Our synthesis methods are based on the practices of several groups in different fields4,20,21, which emphasize the importance of having Quality information (using studies with good methods); Quantity (at least 2 good methods studies), showing Consistency of the findings across these pieces of evidence; and adequate Performance in the tests of that measurement property. Combining these elements, Quality, Quantity, Consistency, and Performance (QQC-P), a synthesis statement is made for each measurement property. The working group then decides on a recommendation based on their good quality evidence.
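The QQC-P logic can be sketched as a short function applied to the evidence for one measurement property. This is a simplified illustration under stated assumptions, not the synthesis rules of The OMERACT Handbook: the mapping of outcomes to traffic light colors, the function name `synthesize`, and the ASCII marks "+", "+/-", and "-" (standing in for the SOMP symbols) are choices made for this example.

```python
from typing import List, Tuple

# One piece of evidence: (Good Methods color, performance mark).
Evidence = Tuple[str, str]


def synthesize(evidence: List[Evidence]) -> str:
    """Return an assumed synthesis color for one measurement property."""
    # Quality: keep only studies that passed the Good Methods check.
    marks = [mark for color, mark in evidence if color in ("green", "amber")]

    # Quantity: at least 2 good methods studies are needed.
    if len(marks) < 2:
        return "white"  # assumed: too little evidence to rate

    # Consistency: do all good methods studies report the same result?
    consistent = len(set(marks)) == 1

    # Performance: consistently adequate, consistently inadequate, or mixed.
    if consistent and marks[0] == "+":
        return "green"
    if consistent and marks[0] == "-":
        return "red"
    return "amber"  # assumed: equivocal or inconsistent findings raise a caution
```

For example, two good methods studies that both report adequate performance yield green in this sketch, whereas a single such study yields white regardless of its result, because the Quantity element is not met.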
Identify the “winners” (best instruments). In the last row of the SOMP, the working group identifies the instrument(s) that have passed the Filter 2.1 requirements with either a green (endorsed) or amber (provisionally endorsed) rating at the instrument level. All amber-rated instruments must have a clearly defined research agenda of what additional work is needed to bring this instrument to a green for full endorsement.
Bring it to a vote. Core to the OMERACT decision-making process is engaging the OMERACT community in evaluating the results of the instrument selection process and seeking a vote of support from that community regarding the rigor and conclusions of that process. When evidence about an instrument is gathered, and a decision is made as to the level of endorsement the working group thinks it should receive, the group will bring this to the OMERACT Technical Advisory Group for review. If the evidence is deemed to be of sufficient quality, the group may have an opportunity to present its findings at a full plenary session, called a workshop, during a face-to-face OMERACT biennial meeting. Seventy percent agreement by the OMERACT community (voting at that session) will be considered support for the endorsement.
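The voting threshold can be expressed as a simple check. This is a minimal sketch with a hypothetical function name; it assumes that the denominator is the number of votes cast at the session and that exactly 70% counts as support. Integer arithmetic avoids rounding problems at the boundary.

```python
def vote_supports_endorsement(
    votes_in_favor: int, votes_cast: int, threshold_percent: int = 70
) -> bool:
    """Return True when agreement reaches the threshold (inclusive)."""
    if votes_cast <= 0:
        raise ValueError("votes_cast must be positive")
    if not 0 <= votes_in_favor <= votes_cast:
        raise ValueError("votes_in_favor must be between 0 and votes_cast")
    # Compare as integers: in_favor / cast >= threshold_percent / 100.
    return votes_in_favor * 100 >= threshold_percent * votes_cast
```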
In addition to the guidance in the Instrument Selection chapter of The OMERACT Handbook10, the OMERACT Master Checklist and Workbook for Instrument Selection have been developed to help working groups keep track of their progress and to ensure full and transparent reporting. These resources are available on the OMERACT website (https://omeract.org/resources). No ethics approval was required for this work because it did not involve human subjects.
RESULTS
Results of the initial application of the OMERACT Filter 2.1 Instrument Selection Algorithm
At OMERACT 2018, a presentation was given in the opening plenary to describe the instrument selection process delineated above and in The OMERACT Handbook. The OMERACT methods for instrument selection figure, known as “The OMERACT Way,” and the OMERACT Filter 2.1 Instrument Selection Algorithm were provided for reference throughout the meeting. The Psoriatic Arthritis Working Group presented 2 instruments for endorsement by the OMERACT community, becoming the first group to move through the Filter 2.1 Instrument Selection process. The first was the 66-joint swollen joint count and 68-joint tender joint count (SJC66/TJC68 joint counts) as instruments to reflect the domain of musculoskeletal disease activity in the peripheral joints. The second was the Psoriatic Arthritis Impact of Disease questionnaire (PsAID12) for the measurement of the core domain psoriatic arthritis-specific health-related quality of life. The working group presented its final recommendations at the plenary session, highlighting the strengths and weaknesses of the 2 candidate instruments. Both the SJC66/TJC68 and PsAID12 achieved consensus (i.e., 70% or greater vote) by the OMERACT community and were the first instruments to be passed through OMERACT Filter 2.1 as fully and provisionally endorsed measures, respectively22,23.
DISCUSSION
The OMERACT Filter 2.1 revisions address instrument selection within an evolving paradigm of measurement instrument assessment. These methods emphasize the increasing need for an outcome measure’s scores to have enough evidence to engender confidence in its use in a particular setting. The process has its foundation in the original OMERACT pillars of truth, discrimination, and feasibility that are still critical requirements for instruments to meet, and adds systematic approaches to gathering, appraising, and synthesizing evidence on the performance of the instrument. The OMERACT Technical Advisory Group will continue to work with OMERACT working groups to operationalize the instrument selection process to ensure we are achieving the goal of transparent, rigorous, evidence-based instrument selection for core outcome measurement sets.
Acknowledgment
The authors thank Dr. Caroline Terwee for her valuable input to the development of the COSMIN-OMERACT good methods checklist. In-kind support from the Knowledge Transfer and Exchange team and the Measurement Sciences group at the Institute for Work & Health is acknowledged for the development of the OMERACT Way graphics and input into the methods used.
Footnotes
PGC is funded in part by the NIHR Leeds Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the UK National Health Service, the NIHR, or the Department of Health. JAS is supported by the resources and the use of facilities at the VA Medical Center at Birmingham, Alabama. LMM is a Principal Investigator on the Australian Rheumatology Association Database, which has received arms-length funding from AbbVie Australia, Pfizer Australia, Janssen Australia, and Eli Lilly Australia. OMERACT is a registered nonprofit independent medical research organization whose goal is to improve and advance the health outcomes for patients with musculoskeletal conditions. OMERACT receives unrestricted educational grants from the American College of Rheumatology, the European League Against Rheumatism, and several pharmaceutical companies. The grants are used to support fellows, international patient groups, and a major international biennial conference that results in many peer-reviewed publications. The views expressed in this article are those of the authors and do not necessarily reflect the position of the US Department of Veterans Affairs or the US government.
- Accepted for publication January 24, 2019.