Abstract
Objective. At OMERACT 8 a framework for levels of evidence was proposed for the validation of biomarkers as surrogate outcome measures. We aimed to adapt this scheme in order to apply it in the setting of soluble biomarkers proposed to replace the measurement of damage endpoints in rheumatoid arthritis (RA), psoriatic arthritis (PsA), and ankylosing spondylitis (AS). We also aimed to generate consensus on minimum standards for the design of longitudinal studies aimed at validating biomarkers.
Methods. Before the meeting, the Soluble Biomarker Working Group prepared a preliminary framework and discussed various models for association and prediction related to the statistical strength domain. In addition, 3 Delphi exercises addressing longitudinal study design for RA, PsA, and AS were conducted within the working group and among members of the Assessment of SpondyloArthritis international Society (ASAS) and the Group for Research and Assessment of Psoriasis and Psoriatic Arthritis (GRAPPA). This formed the basis for discussions among OMERACT 9 participants.
Results. The proposed framework was accepted by consensus. In the study design domain a requirement for both prospective observational studies and randomized controlled trials (RCT) in different drug classes was noted. A template for determining the level of statistical strength was proposed. The addition of a new domain on biomarker assay performance was considered essential, and participants suggested that for any biomarker this domain should be addressed first, i.e., before starting clinical validation studies. Participants agreed on most elements of a longitudinal study design template. Where consensus was lacking the working group has drafted solutions that constitute a basis for prospective validation studies.
Conclusion. The OMERACT 9 Soluble Biomarker Group has successfully formulated a levels of evidence scheme and a study design template that will provide guidance to conduct validation studies in the setting of soluble biomarkers proposed to replace the measurement of damage endpoints in RA, PsA, and AS.
- RHEUMATOID ARTHRITIS
- PSORIATIC ARTHRITIS
- ANKYLOSING SPONDYLITIS
- BIOMARKERS
- STUDY DESIGN
- STRUCTURAL DAMAGE
In healthcare all interventions should be aimed at improving patient outcome, defined as “how a patient feels, functions and survives”1. As longterm outcomes are often difficult to identify in the setting of a clinical trial, measurement of biomarkers that could serve as surrogate outcomes is an attractive possibility, but proper validation of a biomarker for use in this setting is difficult. At the OMERACT 8 conference a scheme was proposed that grades the level of evidence in support of a biomarker meeting the definition of a surrogate outcome (see Appendix)1. The soluble biomarker group felt that such a scheme could be adapted for a step earlier in the development process of a drug, i.e., to validate a biomarker that could replace the measurement of damage endpoints in early proof of concept studies. Development and validation of such biomarkers reflecting structural damage currently constitutes a high priority objective both for the drug discovery process and for the practising clinician, particularly for inflammatory disorders of joints and spine where damage progression is slow.
The scheme proposed at OMERACT 8 is based on 4 domains: target, study design, statistical strength, and penalties. For the domains target (the outcome for which the marker substitutes), study design (of the best evidence), and statistical strength, the scores are additive. Penalties are then applied if there is serious counter-evidence. A total score (0 to 15) determines the 5 levels of evidence, with Level 1 the strongest and Level 5 the weakest. There was also agreement with the proposal that biomarkers validated at only Levels 3 to 5 constitute disease-centered variables with no immediate or obvious meaning to patients or clinicians, while biomarkers that attain Levels 1 or 2 validation constitute patient-centered variables with obvious patient and clinical relevance. It was proposed that the term “surrogate” be restricted only to markers attaining Levels 1 or 2.
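The additive logic of the generic scheme can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, the 0 to 5 range for each of the 3 scored domains is inferred from the 0 to 15 total, and flooring the total at 0 after penalties is an assumption. The cutoffs that map a total score to Levels 1 to 5 are not stated in this text and are therefore not encoded.

```python
def total_evidence_score(target, design, strength, penalties=()):
    """Sum the three domain scores, then subtract penalties (illustrative only)."""
    for name, score in (("target", target), ("design", design), ("strength", strength)):
        # Assumption: each scored domain is graded 0-5, giving the 0-15 total.
        if not 0 <= score <= 5:
            raise ValueError(f"{name} score must be between 0 and 5")
    raw = target + design + strength
    # Assumption: penalties for counter-evidence are subtracted,
    # and the total is floored at 0.
    return max(0, raw - sum(penalties))


print(total_evidence_score(3, 4, 2))                 # no counter-evidence
print(total_evidence_score(3, 4, 2, penalties=[2]))  # one penalty applied
```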
In discussions at that conference it was recommended for the study design domain that the rankings be more explicit in the minimum standards of design for both observational and randomized controlled studies. Work on the statistical strength domain was deferred to a statistics working group2. An important omission from the generic framework that is particularly relevant to soluble biomarkers is the absence of a performance domain that stipulates recommended standards for the handling and processing of soluble biomarker samples.
The Soluble Biomarker Working Group outlined 3 objectives in the program of work for OMERACT 9: (1) To adapt the generic biomarker levels of evidence framework for soluble biomarkers. (2) To set minimum standards for study design that validates biomarkers as reflecting structural damage in rheumatoid arthritis (RA), psoriatic arthritis (PsA), and ankylosing spondylitis (AS). (3) To propose a framework for quantifying the statistical strength of the association between the biomarker and the damage endpoint.
METHODS
Levels of evidence framework
The generic biomarker framework was presented at a specially convened meeting of the OMERACT 9 Soluble Biomarker Working Group that was held over 2 days in London, England, in November 2007. The primary objectives of biomarkers for RA, PsA, and AS were first discussed and agreed upon, and the generic scheme for assigning levels of evidence developed at OMERACT 8 was reviewed. This was followed by discussion and critique of the framework with respect to its application to the validation of soluble biomarkers. A proposal for an adaptation of the framework for the validation of soluble biomarkers reflecting damage endpoints was then drafted. This new proposal was presented to participants at OMERACT 9, together with a document that highlighted the proposed modifications to the OMERACT 8 generic framework. The framework was discussed independently by 2 groups at the breakout sessions, and the statistical strength domain was discussed by a separate working group of methodologists and biostatisticians. Rapporteurs summarized the principal issues and concerns and the proposed modifications to the draft soluble biomarker framework at the report-back plenary session. After further discussion in the plenary session, modifications to the framework that generated consensus were incorporated into the new scheme, and participants were then asked to vote on the following question: “The working group has adopted the framework and domains outlined in the OMERACT 8 Surrogate Superworkshop for generating levels of evidence. Do you agree with the new framework for soluble biomarkers?”
Principal requirements for longitudinal study design
The principal aim of this initiative was to propose a minimum set of standards with respect to study design, principal outcomes, processing of biomarker samples, and documentation of potential confounders for the conduct of a longitudinal study aimed at the validation of a soluble biomarker reflecting damage endpoints. This was conducted using a Delphi approach. The principal design issues were identified at the London meeting using the framework for longitudinal studies generated at OMERACT 4 that highlighted core domains (health status, disease process, damage), potential covariates, demographic variables, and study design features that ought to be addressed when planning a longitudinal study3.
The discussions in London constituted the first phase of the Delphi exercise, the solicitation of items, and addressed issues relevant to all 3 categories of arthritis. The subsequent steps in the Delphi were conducted separately for RA, PsA, and AS. Three steps in the Delphi exercise were organized electronically for each of the 3 different disease categories. The first electronic exercise solicited additional domains organized under categories of health status (symptoms, physical function, psychosocial function), disease process (joint tenderness/swelling, global disease, acute-phase reactants), and damage (imaging). Working group members were also asked to propose potentially confounding covariates and relevant demographic variables. Members were asked to propose items for core study methodology organized under the following headings: inclusion criteria, disease phenotype, study duration, approach to selection of patient cohort, treatment strategy, analysis of radiographic endpoint, frequency of clinical assessment, type of biomarker sample collected, frequency/time of biomarker sample collection, biomarker sample processing, and biomarker sample transport and storage. For RA, the convenor of the Delphi (WPM) provided a draft template of items based on discussions at the London meeting to the OMERACT 9 Biomarker Working Group members as a basis for further solicitation of items. For PsA and AS, solicitation of items was conducted electronically among the membership of the Group for Research and Assessment of Psoriasis and Psoriatic Arthritis (GRAPPA) and the Assessment of SpondyloArthritis international Society (ASAS), respectively, after a draft template was provided by the convenors of the AS (WPM) and PsA (OF) Delphi exercises.
Electronic voting was then conducted in the subsequent 2 rounds of the Delphi exercise among OMERACT 9 Biomarker Working Group members. In addition, voting was conducted among GRAPPA members for PsA, among ASAS members for AS, and among OMERACT 9 registrants for RA. Two types of voting questions were presented. One type requested selection of an item among a range of options. Consensus for selection of a particular item was defined on the basis of ≥ 70% of participants voting in favor of that item, while consensus for exclusion was defined as ≤ 30% of participants voting for that item in any round of voting. The second type of question was presented in a Likert format comprising 5 scoring categories ranging from 1 = definitely unacceptable and/or unnecessary, exclude from study design, to 5 = definitely acceptable, essential that it be included in the study design. An additional option was provided, namely, “don’t know/not an expert.” Consensus for selection or exclusion of a particular item was defined on the basis of ≥ 70% or ≤ 30%, respectively, of participants voting a score of 4 or 5 on the Likert scale in any round of voting. The results of the Delphi exercise were presented at OMERACT 9, and participants were presented with a summary handout at the soluble biomarker group plenary session. Principal areas of disagreement were highlighted in the handout and discussed at the plenary session.
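The consensus thresholds described above can be expressed as a small Python sketch. The function names are hypothetical, and one point is an assumption rather than something stated in the text: “don’t know/not an expert” responses are kept in the denominator when computing the percentage in favor.

```python
def consensus(votes_in_favor, total_voters):
    """Classify an item as 'include', 'exclude', or 'no consensus'."""
    if total_voters <= 0:
        raise ValueError("total_voters must be positive")
    # Integer arithmetic avoids floating-point edge cases at exactly 70% or 30%.
    if votes_in_favor * 100 >= 70 * total_voters:
        return "include"
    if votes_in_favor * 100 <= 30 * total_voters:
        return "exclude"
    return "no consensus"


def likert_consensus(scores, dont_know=0):
    """Apply the same rule to Likert scores (1-5); a score of 4 or 5 counts in favor.

    Assumption: 'don't know/not an expert' responses stay in the denominator.
    """
    in_favor = sum(1 for s in scores if s >= 4)
    return consensus(in_favor, len(scores) + dont_know)


print(consensus(80, 100))
print(likert_consensus([5, 4, 4, 5, 2, 3], dont_know=4))
```

Under this assumption, a large share of “not an expert” responses can by itself prevent an item from reaching the 70% threshold.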
Development of statistical strength domain
For the statistical strength domain, the strength of association and prediction models is the central theme. Various models for assessing the association between marker change and target change, and for assessing prediction of the effect of treatment on marker change and target change were presented and discussed at the OMERACT 9 Soluble Biomarker Working Group meeting in London. Relevant models presented were based on: change in the biomarker during therapy; change in the target outcome in the long term, including measuring the outcome repeatedly for greater insight into the progression of the target outcome; and change in pertinent covariates during therapy and/or repeatedly during the longterm period for greater insight into the nature of the “confounding” relationship. Depending on the nature of the data, various models were considered, including: regression analysis of target outcome on change in biomarker; multiple regression analysis of target outcome on change in biomarker and the covariates; longitudinal multiple regression analysis of target outcome on change in biomarker and the time-dependent covariates; and mixed model repeated measures.
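The simplest of the models listed above, regression of the target outcome on change in the biomarker, can be sketched as follows. This is a minimal ordinary least squares illustration in plain Python, not the analysis performed by the working group; the function name is hypothetical and the numbers are made up purely to exercise the code.

```python
def simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, R^2 is the squared correlation coefficient.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical values for illustration only (not trial data).
biomarker_change = [-40.0, -25.0, -10.0, 0.0, 15.0, 30.0]
damage_progression = [1.0, 2.0, 4.0, 5.0, 8.0, 9.0]
intercept, slope, r2 = simple_regression(biomarker_change, damage_progression)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

The multiple regression and mixed model variants add covariates and repeated measures to this structure and would normally be fitted with a statistics package rather than by hand.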
The data used for demonstrating the various models were from the Combinatietherapie Bij Reumatoide Artritis (COBRA) trial dataset4. This trial showed that step-down combination therapy with prednisolone, methotrexate, and sulfasalazine (SSZ) was superior to SSZ monotherapy for suppressing disease activity and radiologic progression of RA. The analysis focused on investigating whether urinary C-terminal cross-linking telopeptide of type II collagen (CTX-II), a specific biochemical marker of cartilage degradation, was associated with radiological damage and progression in patients with RA. Various regression-based models were presented and discussed, and the need for a template to determine the statistical strength for such models was identified. A research agenda was determined to review various schemas that could be used for categorizing and determining levels of statistical strength.
RESULTS AND DISCUSSION
Levels of evidence framework
The generic framework for a levels of evidence scheme was adapted for soluble biomarkers at the London meeting (see Appendix), and following modification at OMERACT 9 was accepted by 85% of workshop participants (Table 1). Agreement was reached on the following adaptations to the domains:
- Target outcome domain. The grading of 0 (disease-centered, reversible) to 5 (death) was not considered relevant to the validation of a biomarker reflecting structural damage. A grading of 0 to 3 [patient centered, irreversible, minor organ/clinical morbidity (radiography)] was considered appropriate, with radiography being accepted as a patient-centered outcome. Some argued that the target outcome has already been defined as radiography in formulating the principal objectives of the validation process and that there is, therefore, no need to include this domain in the scheme. The counter-argument was that other measures of damage, e.g., magnetic resonance imaging (MRI), may be increasingly relevant as validation data increase. For example, it has been shown that bone marrow edema on an MRI has predictive validity for radiographic damage and can be reliably detected and quantified5. As clinicians increasingly target and require guidance in the treatment of pre-radiographic disease, MRI may increasingly constitute a relevant outcome for biomarker validation studies.
- Study design domain. The grading of 0 (animal studies, case reports, cross-sectional, retrospective) to 5 (≥ 3 RCT each of different drug class, ≥ 3 randomized surrogate objective trials) was modified to incorporate an equal weighting for RCT and prospective observational studies. Longitudinal studies would have to be consistent with the minimum standards for longitudinal study design advocated by the group (see below). Randomized surrogate endpoint trials were considered too high a hurdle for the objectives of this biomarker validation process. The proposed ranking recognizes the importance of biomarker validation with different drug classes. For example, it has now been consistently demonstrated that C-reactive protein has predictive validity for structural damage in patients with RA receiving methotrexate, but not in those receiving anti-tumor necrosis factor therapies6–8. Both longitudinal cohort studies and RCT are deemed essential. The former address validation in a wider spectrum of patients and over longer time periods, while the latter can more readily address validation with different drug classes and potential confounders.
- Statistical strength domain. In evaluating the association between marker change and target change, or the prediction of the effect of treatment on marker change and target change, regression-based modeling is the primary statistical technique, and fitting the model yields a goodness-of-fit statistic (such as R², the percentage of the variation explained by the model). The statistical evidence of the association or prediction can be determined using a modification of the Sterne and Smith interpretation of the p value, taking the number of observations into consideration9, whereas the statistical strength per se can be based on the model, with the effect size determined using the coefficient for the biomarker [e.g., the odds ratio (OR)]. This effect estimate can be translated using Cohen’s standardized mean difference (SMD), i.e., an OR value can be transformed into an SMD10:
\[ \mathrm{SMD} = \frac{\sqrt{3}}{\pi}\,\ln(\mathrm{OR}) \]
and levels of strength can be derived based on the usual thresholds for interpreting “fair” (0.2), “good” (0.5), “very good” (0.8).
- Penalties domain. The grading in the generic template proposal was largely adopted although with the stipulation that rather than being additive for different studies, the highest score would be applied as a penalty and that the same minimum standards be applied to the evaluation of study design.
- Performance domain. This domain is not a component of the generic template but it was agreed that biomarkers should have been validated according to the criteria comprising this domain before proceeding with clinical validation studies. The criteria address standards of reproducibility, feasibility (readily accessible, availability of international standards, costs), biomarker stability, and evaluation of confounders that are defined in the OMERACT 9 biomarker validation draft criteria under the categories of feasibility and discrimination (criteria 1, 2, 4, and 5).
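The OR-to-SMD conversion described under the statistical strength domain, SMD = (√3/π) ln OR, is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, and 3 choices are assumptions not stated in the text, namely that the thresholds 0.2, 0.5, and 0.8 are treated as lower bounds, that the absolute value of the SMD is used so that protective associations (OR below 1) are graded on the same scale, and that values under 0.2 are labeled “below fair”.

```python
import math


def odds_ratio_to_smd(odds_ratio):
    """Convert an odds ratio to a standardized mean difference: (sqrt(3)/pi) * ln(OR)."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.sqrt(3) / math.pi * math.log(odds_ratio)


def strength_label(smd):
    """Label an SMD using the 0.2 / 0.5 / 0.8 thresholds.

    Assumptions: thresholds are lower bounds, the absolute value is used,
    and 'below fair' is a label introduced here for values under 0.2.
    """
    size = abs(smd)
    if size >= 0.8:
        return "very good"
    if size >= 0.5:
        return "good"
    if size >= 0.2:
        return "fair"
    return "below fair"


for odds_ratio in (1.2, 2.0, 3.0, 5.0):
    smd = odds_ratio_to_smd(odds_ratio)
    print(f"OR={odds_ratio}: SMD={smd:.2f} ({strength_label(smd)})")
```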
Table 1. OMERACT 9 Levels of Evidence framework for validation of a soluble biomarker reflecting damage endpoints in rheumatoid arthritis, psoriatic arthritis, and ankylosing spondylitis (adapted from the generic biomarker framework at OMERACT 8)1.
Longitudinal study design consensus
The following design issues and key recommendations were highlighted at the London meeting for consideration in the Delphi voting exercise: principal inclusion criteria, study design (RCT vs observational), treatment strategy, study duration, appropriate damage endpoints, frequency of assessment, and sample collection and processing. A total of 52 ASAS and 45 GRAPPA members provided additional input into the items proposed for the AS and PsA Delphi exercises, respectively. In the first round, 130 OMERACT 9 participants, 46 ASAS members, and 53 GRAPPA members took part in the Delphi voting exercise for RA, AS, and PsA, respectively. In the second round of voting, the corresponding numbers were 113 OMERACT 9 participants, 43 ASAS members, and 46 GRAPPA members. The final results of these 3 Delphi exercises are presented in Table 2.
Table 2. Summary results of 3-stage Delphi consensus exercise addressing minimum standards for longitudinal study design for validation of a biomarker reflecting damage endpoints in rheumatoid arthritis (RA), psoriatic arthritis (PsA), and ankylosing spondylitis (AS). Items lacking consensus are indicated in bold type (percentage of respondents voting in support of the item is indicated in parentheses).
Failure of consensus was evident for 2 key items that were discussed further at OMERACT 9. The first focused on the diagnostic inclusion criterion for a validation study of an RA biomarker. There were 2 principal schools of thought on this matter. Some considered it desirable for a validation study, especially the first, to stipulate the American College of Rheumatology (ACR) classification criteria on the premise that such patients would be not only relatively homogeneous but also more likely to demonstrate disease progression, which increases statistical power. Inclusion of patients with a wide spectrum of disease activity and severity was also considered desirable, since some biomarkers may reflect radiographic progression better in early versus late disease and vice versa. Other participants were supportive of differentiating patients on the basis of the anti-cyclic citrullinated peptide (CCP) antibody test in early disease on the premise that these patients are a distinct group both prognostically and on the basis of pathophysiology11–13. The latter could, therefore, imply quantitative and/or qualitative differences in the relationship between a particular biomarker and radiographic damage. A compromise proposal was to include patients on the basis of the ACR criteria but then to prespecify analysis stratified by anti-CCP status. Both RCT and observational studies were considered equally desirable to ensure generalizability of study findings, although RCT for AS were not considered feasible because progression of radiographic change is not reliably detected prior to 2 years in patients on standard therapy14. Validation in studies employing diverse and flexible treatment strategies was considered desirable since the real clinical utility of a biomarker is dependent on the demonstration that levels of the biomarker are independently associated with structural damage regardless of treatment approach.
Consensus was not achieved in regard to the minimum standards for the handling of biomarker samples because a substantial minority of respondents refrained from voting in the Delphi exercise as they assigned themselves the designation “not an expert.” It was decided by consensus that the biomarker group should develop a proposal for the systematic handling of biomarker samples, which is presented in Table 3. First, the group has recommended the collection of both urine and serum. Although feasibility is an obvious advantage for serum, it is important to standardize the collection of serum in view of previous reports that preanalytical handling of serum influences certain biomarker levels, such as metalloproteinases, which are released from platelets and leukocytes particularly when using collection tubes that enhance clotting (kaolin-coated)15,16. Ideally, possible interfering factors should be identified as discussed under the Performance criterion assay-related confounders (Table 1) and recommendations for standardization of sample collection clarified prior to clinical studies. A practical problem is that several biomarkers are often tested simultaneously, and collection procedure may not be optimal for all biomarkers. In addition, samples are often collected as a routine during observational studies and RCT and then analyzed retrospectively. It is obvious that analysis of the individual biomarkers should only be done on samples that have been obtained and handled in a way that ensures reliable measurement with respect to diurnal variation, centrifugation, freezing temperature, stability to freeze/thaw cycles, etc. These characteristics may vary considerably from marker to marker. It will not always be known in advance which markers will later be analyzed. Therefore, a default approach is to recommend standardized operating procedures for sample collection as outlined in Table 3.
Table 3. OMERACT 9 Soluble Biomarker Working Group minimum standards for the handling and processing of biomarker samples.
CONCLUSIONS AND FURTHER DIRECTIONS
The OMERACT 9 Soluble Biomarker Working Group has laid the groundwork for a systematic and standardized approach to biomarker validation studies. These recommendations constitute draft proposals until tested in prospective studies. These prospective studies will form the basis for further revision of these recommendations in preparation for subsequent OMERACT meetings.
Appendix 1. Ranking surrogate validity: domains, criteria, and ranks. From Lassere, et al. J Rheumatol 2007;34:607–15
