Introduction

The rapid identification of genetic markers for multifactorial diseases from genome-wide association studies is fuelling interest in investigating the predictive ability and health care utility of genetic risk models. Genetic risk models are investigated for their potential to target diagnostic, preventive and therapeutic interventions for multifactorial diseases. Implementation of these models in health care requires a series of studies that encompass all phases of translational research [1, 2], starting with a comprehensive evaluation of genetic risk prediction.

Various measures are available for the assessment of risk prediction models, each addressing a different aspect of performance and utility [3, 4]. The GRIPS Statement recommends that transparent and complete reporting should provide a description of the risk factors and the risk model by reporting univariate and multivariate odds ratios for the predictors, present risk distributions for individuals with and without the outcome of interest, and report measures of model fit, predictive ability and others, if pertinent [5, 6]. Examples of measures include the Hosmer–Lemeshow statistic [7] and Nagelkerke’s R² [8] for model fit, the area under the receiver operating characteristic (ROC) curve (AUC) [9] and integrated discrimination improvement (IDI) [10] for predictive ability, and percentages of total reclassification [11] and net reclassification improvement (NRI) [10] for clinical utility.

Even though the assessment of risk prediction models is relatively standard, there is no single statistical package that would allow for the computation and production of all these measures and plots. Therefore, we developed PredictABEL, a freely available R package, which contains functions to obtain all descriptive tables, measures and plots that are used in genetic risk prediction studies.

Description of PredictABEL

The core part of PredictABEL comprises functions for the assessment of risk prediction models. The measures and plots covered in PredictABEL are listed in Table 1. Most functions can be applied to predicted risks, risk scores or any other continuous predictor variable, but some can be applied to predicted risks (probabilities) only. Predicted risks and genetic risk scores can be obtained using functions in the package, but they can also be imported from other programs. The functions to obtain predicted risks using logistic regression analysis are specifically written for models that include genetic variables, possibly in addition to non-genetic factors, but they can also be applied to construct models based on non-genetic risk factors only. Genetic risk scores can be computed as unweighted or weighted risk scores, where weights are obtained from uploaded data or imported from meta-analyses, e.g., as beta coefficients.
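The difference between unweighted and weighted genetic risk scores can be illustrated with a minimal sketch. This is not PredictABEL code: the function names, the genotype coding (0, 1 or 2 risk alleles per variant) and the odds ratios are hypothetical, and the sketch assumes that weights are supplied as log odds ratios (beta coefficients).

```python
import math


def unweighted_risk_score(genotypes):
    """Sum the risk-allele counts (0, 1 or 2 per variant) of one individual."""
    return sum(genotypes)


def weighted_risk_score(genotypes, betas):
    """Sum the risk-allele counts, each multiplied by its beta coefficient."""
    if len(genotypes) != len(betas):
        raise ValueError("genotypes and betas must have the same length")
    return sum(g * b for g, b in zip(genotypes, betas))


# Hypothetical individual genotyped at three variants.
genotypes = [2, 1, 0]

# Hypothetical odds ratios, e.g. taken from a meta-analysis; the weights
# are assumed to be the corresponding log odds ratios.
odds_ratios = [1.5, 2.0, 1.2]
betas = [math.log(odds_ratio) for odds_ratio in odds_ratios]

unweighted = unweighted_risk_score(genotypes)
weighted = weighted_risk_score(genotypes, betas)

print(f"unweighted score: {unweighted}")
print(f"weighted score:   {weighted:.3f}")
```

In the unweighted score every risk allele counts equally, whereas the weighted score lets variants with stronger effects contribute more.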

Table 1 Measures and plots covered in PredictABEL (version 1.1)

The tables and plots generated using PredictABEL are saved as separate files in the working directory. Tables can be saved as Excel or tab-delimited text files and figures can be saved as publication-quality EPS or TIFF files or as JPEG files for insertion in manuscripts. All figures are available in a ready-made layout, but they can be customized to the journal style or preferences of the user. A hypothetical dataset and examples of use are included in the package to demonstrate all functions.

Example

The hypothetical dataset included in the package was reconstructed from an empirical study on age-related macular degeneration (AMD) [12], using a simulation method that has been described in detail elsewhere [13]. Based on published frequencies and odds ratios of the genetic variants and non-genetic risk factors implicated in AMD and on published population disease risks, we created a dataset that contains genotype data and disease status for 10,000 individuals. Predicted risks were obtained using logistic regression analysis, for which the code is provided in the package. Two risk models were constructed: a model based on non-genetic risk factors only and a model based on genetic and non-genetic predictors.

Figure 1 presents three examples of plots that are produced by PredictABEL. Figure 1a shows distributions of predicted risks based on genetic and non-genetic factors for individuals with and without AMD. The degree of overlap between the two histograms is indicative of the discriminative accuracy of the risk model: the less the histograms overlap, the better the model discriminates. This discriminative accuracy is assessed by the AUC and visualized in a ROC plot. Figure 1b presents the ROC curves for the two risk models. The figure shows that the model with genetic factors had a higher AUC than the model without. Using the same function, the AUC values were quantified as 0.80 for the model with genetic factors and 0.74 for the model without. Finally, Fig. 1c presents the calibration plot for the risk model based on the genetic and non-genetic variables as predictors, which shows how well predicted risks match observed risks. The calibration plot suggests that the model was well calibrated, which was supported by the non-significance of the Hosmer–Lemeshow test (P = 0.65).

Fig. 1

Example graphs produced by PredictABEL. a Distributions of predicted risks in individuals with and without age-related macular degeneration (AMD); b ROC plot presenting risk models without and with genetic variants; and c Calibration plot comparing predicted risks with observed risks. Panels a and c present the risk model based on genetic and non-genetic risk factors.
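The AUC reported alongside the ROC plot can be read as the probability that a randomly chosen individual with the disease has a higher predicted risk than a randomly chosen individual without it. The sketch below computes the AUC directly from that pairwise definition. It is an independent illustration, not the PredictABEL implementation; the function name and the six predicted risks are hypothetical.

```python
def auc(predicted_risks, outcomes):
    """Area under the ROC curve from pairwise comparisons.

    Counts the fraction of (case, non-case) pairs in which the case has
    the higher predicted risk; tied pairs count as one half.
    Outcomes are assumed to be coded 1 (case) and 0 (non-case).
    """
    cases = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    noncases = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not cases or not noncases:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for case_risk in cases:
        for noncase_risk in noncases:
            if case_risk > noncase_risk:
                concordant += 1.0
            elif case_risk == noncase_risk:
                concordant += 0.5
    return concordant / (len(cases) * len(noncases))


# Hypothetical predicted risks and disease status for six individuals.
risks = [0.90, 0.80, 0.40, 0.35, 0.30, 0.10]
status = [1, 1, 0, 1, 0, 0]

print(f"AUC = {auc(risks, status):.3f}")
```

The pairwise loop is quadratic in the number of individuals, which is acceptable for a small illustration; for a dataset of 10,000 individuals a rank-based calculation would be the more practical choice.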

Finally, Table 2 presents an example of the reclassification table and statistics that are produced by PredictABEL. The reclassification table presents the categorization into risk groups according to the initial and updated risk models. The table provides information about the total number of individuals who change between risk categories and about correct and incorrect reclassification. The percentage of total reclassification and the NRI are calculated from the reclassification table. The table indicates that a net 8.8% of the individuals without AMD and a net 9.6% of those with AMD would be correctly reclassified when the clinical model was updated by the addition of genetic factors.

Table 2 Reclassification table comparing clinical risk models without and with genetic factors
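How the percentage of total reclassification and the NRI follow from category changes can be sketched as below. This is a minimal illustration under stated assumptions, not PredictABEL code: the function names, the risk cutoffs (0.1 and 0.3) and the eight individuals are hypothetical. The sketch assumes the usual categorical definition of the NRI, in which moving to a higher risk category is correct for individuals with the outcome and moving to a lower category is correct for individuals without it.

```python
def risk_category(risk, cutoffs):
    """Return the category index: the number of cutoffs at or below the risk."""
    return sum(risk >= cutoff for cutoff in cutoffs)


def reclassification_summary(old_risks, new_risks, outcomes, cutoffs):
    """Summarize category changes between an initial and an updated model.

    Outcomes are assumed to be coded 1 (with outcome) and 0 (without).
    """
    up = {0: 0, 1: 0}      # moved to a higher category, by outcome
    down = {0: 0, 1: 0}    # moved to a lower category, by outcome
    total = {0: 0, 1: 0}   # number of individuals, by outcome

    for old, new, y in zip(old_risks, new_risks, outcomes):
        total[y] += 1
        old_cat = risk_category(old, cutoffs)
        new_cat = risk_category(new, cutoffs)
        if new_cat > old_cat:
            up[y] += 1
        elif new_cat < old_cat:
            down[y] += 1

    if total[0] == 0 or total[1] == 0:
        raise ValueError("need individuals with and without the outcome")

    # Net correct reclassification: upward moves are correct for cases,
    # downward moves are correct for non-cases.
    net_cases = (up[1] - down[1]) / total[1]
    net_noncases = (down[0] - up[0]) / total[0]
    moved = up[0] + up[1] + down[0] + down[1]

    return {
        "total_reclassified": moved / (total[0] + total[1]),
        "net_cases": net_cases,
        "net_noncases": net_noncases,
        "nri": net_cases + net_noncases,
    }


# Hypothetical predicted risks from an initial and an updated model.
cutoffs = [0.1, 0.3]
old_risks = [0.05, 0.15, 0.25, 0.35, 0.08, 0.20, 0.40, 0.12]
new_risks = [0.04, 0.08, 0.35, 0.45, 0.15, 0.35, 0.25, 0.05]
outcomes = [0, 0, 1, 1, 1, 0, 0, 0]

summary = reclassification_summary(old_risks, new_risks, outcomes, cutoffs)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The NRI is the sum of the two net proportions, so it rewards an updated model only when correct moves outnumber incorrect moves; the percentage of total reclassification, by contrast, counts every move regardless of direction.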

Conclusions

PredictABEL is a comprehensive software package designed for the development and assessment of genetic risk prediction models. PredictABEL is part of the GenABEL software suite for statistical genomics [14, 15] and is for that reason written in R, which enables easy transfer of data from gene discovery to genetic prediction studies. A detailed manual is available that demonstrates and explains all the functions in the package. The manual is accessible to researchers who do not regularly use R software. The manual and the package are freely available from the GenABEL project website (http://www.genabel.org) and from CRAN (http://cran.r-project.org/).

The current version of PredictABEL (version 1.1) includes all basic descriptive tables, measures and plots that are used in the assessment of risk prediction models. Planned extensions of the package include other strategies to construct risk models, e.g., using Cox proportional hazards analysis for prospective data, and functions to construct simulated data for the evaluation of genetic risk models [13]. Furthermore, we will optimize the interconnectivity between PredictABEL and other packages in the GenABEL suite.

Where the GRIPS Statement aims to improve the transparency, quality and completeness of reporting [5, 6], PredictABEL has similar goals for the assessment of genetic risk prediction studies. The collection of all measures and plots in a single software package gives a comprehensive overview of the various measures that are available for the assessment of risk prediction studies. This overview emphasizes that different measures are available to answer different questions in the assessment of risk models and facilitates the selection of the most appropriate measure for the question under study.