Abstract
Objective Tocilizumab (TCZ) has shown similar efficacy when used as monotherapy as in combination with other treatments for rheumatoid arthritis (RA) in randomized controlled trials (RCTs). We derived a remission prediction score for TCZ monotherapy (TCZm) using RCT data and performed an external validation of the prediction score using real-world data (RWD).
Methods We identified patients in the Corrona RA registry who used TCZm (n = 452), and matched the design and patients from 4 RCTs used in previous work (n = 853). Patients were followed to determine remission status at 24 weeks. We compared the performance of remission prediction models in RWD, first based on variables determined in our prior work in RCTs, and then using an extended variable set, comparing logistic regression and random forest models. We included patients on other biologic disease-modifying antirheumatic drug monotherapies (bDMARDm) to improve prediction.
Results The fraction of patients observed reaching remission on TCZm by their follow-up visit was 12% (n = 53) in RWD vs 15% (n = 127) in RCTs. Discrimination was good in RWD for the risk score developed in RCTs, with area under the receiver-operating characteristic curve (AUROC) of 0.69 (95% CI 0.62–0.75). Fitting the same logistic regression model to all bDMARDm patients in the RWD improved the AUROC on held-out TCZm patients to 0.72 (95% CI 0.63–0.81). Extending the variable set and adding regularization further increased it to 0.76 (95% CI 0.67–0.84).
Conclusion The remission prediction scores, derived in RCTs, discriminated patients in RWD about as well as in RCTs. Discrimination was further improved by retraining models on RWD.
An expanding treatment armamentarium means more treatment options for patients with rheumatoid arthritis (RA); however, clinicians face difficult decisions when attempting to make evidence-based recommendations regarding which disease-modifying antirheumatic drug (DMARD) treatment will be most effective in a given patient. While the majority of patients with RA will find an effective treatment, not all do; many spend months trying medications that may not work for them.1 Prior investigations have attempted to find biomarkers that can help personalize treatments, but most efforts have not produced useful results.2 While an exhaustive comparison between all treatment options is a desirable goal, a natural first step is to identify and understand predictors of a single drug’s success.
We recently examined clinical data from several randomized controlled trials (RCTs) and were able to derive and validate a prediction score for remission among patients using tocilizumab monotherapy (TCZm).3 Monotherapy with TCZ has been found more effective than monotherapy with some targeted therapies for RA.4 However, RCT data may not always replicate in typical practice with real-world data (RWD).5 These differences may derive from different patient populations, different treatment patterns, or other more subtle differences.5
RWD offer important advantages over RCTs in that patients are more heterogeneous, with a greater variety of clinical characteristics as well as experience with other biologic and targeted DMARD treatments. We examined the performance of our original prediction score3 for remission among patients using TCZm among patients in the Corrona registry, a large RWD set from the United States.6 We employed various machine learning algorithms to take advantage of the copious data contained within Corrona.
METHODS
Study design. No research occurred before patients gave written informed consent for these analyses. The research was reviewed and approved by the New England Institutional Review Board (IRB; Corrona-RA-100, IRB Tracking #120160610), a centralized human ethics committee.
Our study sought to answer the following questions regarding remission in patients with RA using TCZm: (1) To what extent do the findings of Collins, et al3 replicate in RWD? (2A) Will expanding the set of remission predictors improve model fit? (2B) Can data from patients on other therapies improve a predictive model for patients on TCZm? (3) What gains are there from applying nonparametric estimators of remission probability?
Question 1: Evaluating baseline model in RWD. As a baseline model, we used the remission model from Collins, et al3 derived from patients on TCZm from 2 RCTs.7,8 In the most successful model from the Collins study,3 12 variables representing demographics, basic RA characteristics, and treatment history were included in a logistic regression (LR) model based on 2 criteria: an a priori baseline set of covariates and model odds ratio (OR). We refer to this model as LR-OR.
In the analysis by Collins, et al,3 2 RCTs were used for derivation and 2 for validation.4,7,8,9 In the current analyses, our focus was predictive discrimination in RWD. Hence, all 4 trials were used for derivation of models validated in the RWD. Access to the RCT and RWD data was granted following deidentification and IRB approval from Partners Healthcare Human Studies. Variables from the 4 RCTs were harmonized by Collins, et al.3 The < 5% of subjects with missing values were removed from the study.
We evaluated the baseline LR-OR model using patients from the RWD dataset. This was pursued using the parameter values fit in the 4 RCTs, and also by refitting the parameters of the LR-OR model to RWD. Using both methods serves to estimate the extent to which model fit is affected by the cohort discrepancies between RCT and RWD.
Question 2: Expanding the variable set and derivation population. The original variable set used in LR-OR was limited to the covariates collected in the 4 RCTs underlying its derivation.4,7,8,9 The Corrona RWD used in the current analyses contains a greatly expanded feature set not available in the RCTs (see below and in the Supplementary Material, available from the authors on request). All models fit with this expanded set were trained and evaluated using only RWD. Adding covariates comes at a “statistical power price,” however, since they contribute to increased variance if the cohort remains of fixed size. To address this, we used a regularized logistic regression model (LR-Reg), which penalizes models with many large coefficients.
To further reduce the variance of our model parameters, we also fit models to an expanded population, including patients on any biologic DMARD monotherapy (bDMARDm), such as TCZm. Variables that predict remission in RA are likely to be predictive for patients on different therapies. By including an indicator for TCZm therapy, this design choice greatly increased the cohort size while enabling the model to remain predictive for our cohort of interest. Validation was then pursued for the cohort including only patients using TCZm.
Question 3: Applying machine learning algorithms in prediction of remission. A limitation of using LR for predicting remission is that it models the log-odds of an event as a linear function of the covariates. As an alternative, we used nonparametric random forest estimators: ensembles of tree-structured decision rules.10 Random forests are capable of discovering meaningful interactions and transformations of variables while mitigating increased variance through the use of bootstrapping. A drawback is that it is often difficult to describe ensembles of learned decision rules concisely.11 For this reason, we used random forests primarily to get an indication for how limited linear models were in this task.
Study populations. Following Collins, et al,3 we used 4 RCTs—ACT-RAY, FUNCTION, ADACTA, and AMBITION—for derivation of our baseline model.4,7,8,9 Our RWD cohort was extracted from the Corrona RA registry, the largest prospective cohort study of RA in the world.6 The registry comprises medical history including conditions, diagnoses, laboratory results, and treatments, as well as demographic and lifestyle data. Records are collected at regular visits through 2 questionnaires filled in by the patient and their physician. We used a version of the registry exported on February 4, 2018, containing visits from 54,646 patients recorded from October 2001 to December 2017.
Patients in the Corrona RWD were eligible for inclusion if they were aged > 18 years and on bDMARDm for a minimum of 3 months, with at least 1 follow-up visit no later than 9 months after initiation. Patients starting a bDMARD in combination with other DMARDs were included if the monotherapy with the target drug was started at most 3 months after target drug (not necessarily monotherapy) initiation; this occurred when the non-bDMARD was stopped, resulting in bDMARDm. For a list of bDMARDs, see Supplementary Material (available from the authors on request). If patients were eligible at multiple timepoints, only the first instance was included. For models fit to all bDMARD subjects, an indicator for TCZ treatment was added.
The most striking difference between the RWD and RCT cohorts was Clinical Disease Activity Index (CDAI) at baseline. For this reason, we evaluated models both for all RWD patients on TCZm and for the subset of patients with baseline CDAI > 20 (Table 1 and Table 2).
Study outcome (RA remission). The primary outcome of interest was disease remission at 24 weeks following initiation of monotherapy, to match the outcome of the RCTs used by Collins, et al.3 Remission was defined by a CDAI < 2.8.12 A benefit of the CDAI remission criteria is that it does not require access to a laboratory measurement and is thus widely available in RWD. In the Corrona RWD, remission status was evaluated at the follow-up visit closest to 24 weeks after start of monotherapy, but no sooner than 3 months or no later than 9 months after initiation.
Potential predictors. The variables used in the models derived from the RCTs were identical to the “OR” set used by Collins, et al.3 These included demographic variables, RA characteristics, and previous use of DMARDs (biologic or nonbiologic). The baseline variable set is listed in its entirety in Table 1.
The extended variable set derived from the RWD included additional disease activity scores; history of cancer, hypertension, rheumatoid factor, joint erosions, and/or deformity; additional comorbidities; prescriptions of nonsteroidal antiinflammatory drugs and glucocorticoids; work status; education; general medical problems; physical disability; current DMARDs; and number of previous DMARDs. A full description of this set is provided in the Supplementary Material (available from the authors on request).
Most of the RWD were collected through questionnaires filled out at each Corrona patient visit (typically every 6 months) for each patient. Baseline features for the Corrona subjects were defined as the last recorded measurements taken prior to initiation of the target drug (TCZm or bDMARDm).13 In particular, if the target drug was prescribed for the first time between patient visits, the data from the last visit before prescription were used. Both baseline (original) and extended variable sets were extracted from the registry data. Missing values were imputed with the R package MICE (v3.13.0; www.jstatsoft.org/v45/i03; Supplementary Material, available from the authors on request).
Statistical analyses. The envisioned clinical use case for the developed risk scores is to aid in treatment of new patients. Therefore, out-of-sample and out-of-distribution generalization are primary concerns. To address the former, we used sample splitting when deriving and validating models on RWD. For a single experiment, the full RWD cohort was first split at random into a derivation set and a validation set, the former used for fitting model parameters and the latter only for evaluation. The overall quality of each method was then computed as an average over a large number of repeated experiments. To assess out-of-distribution generalization, we evaluated the baseline model fit to the RCT cohorts on the RWD. The reverse (RWD to RCT) was not considered here as the extended variable set is not available in the RCTs. The primary quality metric was the area under the receiver-operating characteristic curve (AUROC), which measures the extent to which models successfully rank subjects’ probability of remission. Standard errors were computed using the classical model of Hanley and McNeil.14 Calibration of estimated probabilities, following Platt scaling fit to held-out data,15 was evaluated in the standard way.
Models from 3 different families were fit to multiply imputed data, pooled and evaluated: LR, LR-Reg, and random forests (see Supplementary Material, available from the authors on request, for details). For logistic models, fit to standardized variables, we used the magnitudes of regression coefficients as proxies for the variables’ importance. For random forests, as is common, we measured variable importance by a variable’s ability to discriminate between subjects with different outcomes when used in splitting nodes in the trees, captured by the mean decrease in impurity.
Weighting of subjects in the extended cohort. Extending the cohort to include patients on bDMARDs other than TCZ (see Question 2) induced a shift between the derivation (all bDMARDm) and evaluation cohorts (only TCZm). In particular, TCZm patients make up a small proportion of the overall RWD cohort and would have limited effect on an unweighted model. To mitigate this, we made use of inverse propensity weighting, as is standard practice for handling distributional shift between treatment groups. An LR model was fit to estimate the propensity of patients in the RWD to receive TCZm treatment compared to receiving any bDMARDm therapy (including TCZm). This was used to weight samples to emphasize those who had higher propensity to be put on TCZm. The weights were then used to fit weighted LR and weighted random forest models tailored to TCZm patients.
RESULTS
Patient sample characteristics. From the RCTs, a total of 853 subjects were enrolled in the TCZm arms and had complete data. Among these, 80% were female and 80% were White. At baseline, the mean CDAI was 40.1 and 52% of subjects had been treated previously with both methotrexate (MTX) and another DMARD. In the RWD, out of 54,646 subjects, 3204 subjects were identified as fitting our criteria for bDMARDm. Of these, 76% were female and 93% were White. The mean baseline CDAI was 17.4 and 83% were previously treated with both MTX and another DMARD. In the bDMARDm cohort, 452 were treated with TCZm.
Missingness at baseline in the RWD, before imputation, was low (< 2%) for variables in the original feature set, with the exception of disease duration (11%), erythrocyte sedimentation rate (30%), and Health Assessment Questionnaire–Disability Index (HAQ-DI; 25%). For the extended feature set, indicator variables representing certain past comorbidities, joint erosion, rheumatoid factor, smoking, and previous pregnancy had high (> 30%) missingness. For evaluation of models fit to the RWD, the RWD was repeatedly split into a validation set, containing 50% (n = 226) of TCZm subjects and 20% (n = 550) of other bDMARDm subjects, and a derivation set containing remaining subjects.
Derivation and validation of the prediction model. The full results of our evaluation of different sets of predictors (original/extended), models (LR, LR-Reg, random forest) and derivation sets (RCT, RWD TCZm, RWD bDMARDm) are presented in Table 2. Each combination was evaluated within 2 cohorts, each containing only TCZm subjects: the full RWD TCZm population and, for a closer comparison with RCTs, the subset of patients among these with baseline CDAI > 20. The cohorts were comparable on demographic characteristics (Table 1). The calibration of the LR and random forest models, following Platt scaling, is illustrated in Figure 1.
All LR-Reg and random forest models trained on the extended feature set demonstrated larger AUROCs than models using the original feature set. For example, when trained on all bDMARDm subjects, LR-Reg (extended) achieved 0.76 (95% CI 0.67–0.84) AUROC compared to 0.72 (95% CI 0.63–0.81) for LR (original), with numbers in parentheses indicating the 95% CIs. This suggests that there are gains to be made in predictive performance by including additional comorbidity, lifestyle, and treatment variables in the risk score. Note, however, that the CIs overlap. We saw no advantage of using the random forest model over the regularized LR model in either setting, indicating that the remaining variance in the outcome is unlikely to be due to underutilized interactions or nonlinearities.
We found that expanding the derivation cohort to include non-TCZ bDMARDm patients improved the AUROC for both the TCZm and the TCZm high-CDAI cohorts. Compare, for example, 0.75 (95% CI 0.66–0.83) AUROC for LR (extended) trained on bDMARDm to 0.65 (95% CI 0.54–0.75) for the same model fit to TCZm patients only. For LR-Reg, the AUROCs were 0.76 (95% CI 0.67–0.84) and 0.74 (95% CI 0.65–0.82) when fitting to bDMARDm and TCZm, respectively. Comparable gains were seen for other models in the full group and for all models in the CDAI > 20 group, but due to its small validation set, the variance in the latter results is high. The estimated probability of reaching remission was low (around 10%) for most subjects. Subjects who are more likely to reach remission than others may be identified by thresholding estimated probabilities. For thresholds at 10%, 12.5%, and 15%, the sensitivity/specificity of the best-performing model, LR (extended), were (0.75/0.54), (0.64/0.64), and (0.58/0.72), respectively. The original model, derived in the RCTs, performed substantially worse in the high-CDAI cohort than in the full TCZm cohort, even though the criterion CDAI > 20 was meant to increase the similarity to the RCTs. However, as we can see in Table 1, significant differences in the cohorts remained. The results may be partially explained by higher variance in outcome for the high-CDAI cohort, after controlling for baseline disease severity.
For all models and derivation sets, different measures of disease severity (e.g., Disease Activity Score in 28 joints and CDAI) at baseline were consistently highly predictive of remission. In Table 3, we list the features with highest estimated importance in the LR-Reg and random forest models trained on the extended feature set of the full bDMARDm cohort, ordered by feature importance (random forests) or coefficient magnitude (LR-Reg). The highest-ranked features are mostly unsurprising: the majority pertain to measures of disease severity either explicitly (CDAI, moderate disease activity state, physician global assessment) or implicitly (larger number of previous DMARDs). For the LR-Reg model, disability and unusual fatigue were associated with lower chances of remission.
DISCUSSION
Machine learning applied to RWD may offer new opportunities to better define the course of disease and to identify better treatment strategies. In RA, the expanded treatment options present a challenge for clinicians and patients, as predictors for response to specific treatments are lacking. In prior work, we used RCT data to derive and validate predictors of remission among patients initiating TCZm.3 In the current study, we tested this prediction rule using RWD and attempted to refine the prediction rule using machine learning. We found that the original prediction rule held up well in RWD from Corrona, despite notable differences between the RCT and RWD populations. This, and the fact that the original rule contains only commonly available variables, point toward the feasibility of implementing these rules in clinical practice.
The implications of this work are several. First, we examined the validity of prediction models derived and validated in RCTs in RWD. RCTs, while appropriate for estimating average treatment effects on the selected cohort, are limited in their generalizability to a broader population. RWD offer an insight into how patients are treated in the healthcare system, and what their outcomes are in the absence of strict inclusion criteria and potential experiment effects. Thus, RWD provide a “laboratory” for testing findings from RCTs. Indeed, we found that the RCT and RWD cohorts differed substantially in terms of disease duration and severity, as well as treatment history. The FUNCTION trial enrolled subjects that were MTX-naïve with short disease duration,4 whereas the other RCTs enrolled subjects that showed inadequate response to MTX.7,8,9 Despite this, the discrimination between patients was good for the transferred model. This indicates that (1) good predictors in the RCTs are good predictors in the RWD, and (2) the variables observed at baseline appropriately controlled for the cohort differences. However, the overall performance characteristics (sensitivity and specificity) of our models were modest and gains could be expected with additional data. Thus, the current prediction rules should not be used in daily practice without further improvements.
Second, we developed and validated in RWD several prediction models for remission with TCZm. As noted above, TCZ when given as monotherapy seems to be more effective than other bDMARDs as monotherapy.4 The variables identified in our prediction rule were disabled working status, prior DMARDs, and baseline disease severity. These variables are not surprising and may be obvious to some clinicians. Our prediction rule attempts to put together variables (some new and some old) that have never been put together in a single prediction rule; this may have utility in real-world clinical care. It also points out the value of several nonclinical variables. We anticipate further refining this rule in other datasets with more variables and less missing data. After improving the rule, we may test it among other bDMARDs; it may be that future iterations of this rule could be programmed in an electronic medical record and help clinicians identify therapy likely to be effective in patients with a given set of characteristics.
Finally, this set of analyses used a robust RWD dataset (Corrona) with an expanded set of variables. Because of the presence of many potentially correlated variables, we used machine learning algorithms to analyze these data. We recognize that disabled working status is contained in the HAQ-DI, so there is the potential for overlap between potential predictor variables. In high-dimensional settings such as these, machine learning may be used together with sample splitting to discover models with reduced predictive variance at the cost of a small increase in bias.16 This was confirmed in this work, in particular for the regularized LR models. Such an approach is appropriate, particularly in applications where out-of-sample prediction is the goal, rather than parameter identification. The benefits of machine learning are smaller when domain knowledge is strong enough to identify a successful model without the need to search over a large set of variables. In some cases, a model with slightly lower predictive accuracy may be preferred if it is easier to interpret, explain, or communicate.17
Strengths of the current analyses include the validation of a previously derived and validated algorithm using RWD as an external validation dataset.3 However, several limitations should be noted. The work needs to be expanded to consider the prediction rule across other bDMARDs; this is planned future work. We had significant rates of missing values for some variables in the RWD, which is typical but likely had some effect on model fit. Our strategy of pooling model estimates on multiply imputed data is standard practice, but not immune to bias. Corrona encompasses patients from North America only; while more generalizable and much larger than many RCTs, this is a limitation.
In conclusion, we were able to test a prediction rule for remission with TCZm among patients with RA in RWD and found that it worked well. Additional variables enhanced the prediction rule further. Moreover, using data from other bDMARDs allowed us to improve the model fit. We encourage other investigators to derive and validate prediction models for RA treatment across RCTs and RWD. Machine learning algorithms may play important roles in optimizing prediction rules.
Footnotes
This work was supported by a research grant from Roche/Genentech to Brigham and Women’s Hospital.
DHS receives salary support from research grants to Brigham and Women’s Hospital from AbbVie, Amgen, Corrona, Janssen, Pfizer, and Roche/Genentech; and serves on the editorial board of Arthritis & Rheumatology and on the FDA Arthritis Advisory Committee. FDJ receives salary support from research grants to Chalmers University of Technology from AstraZeneca. SCK received research grants to Brigham and Women’s Hospital from AbbVie, Pfizer, Roche, and Bristol Myers Squibb for unrelated studies. JEC is a consultant to Boston Imaging Core Labs and serves as Associate Editor for Statistics at Osteoarthritis and Cartilage.
- Accepted for publication April 9, 2021.
- © 2021 The Journal of Rheumatology