Special Series: Missing Data
Review: A gentle introduction to imputation of missing values
Introduction
Missing data are a common problem in all types of medical research. There are various methods of handling missing data. Simple and frequently used methods include complete or available case analysis, the missing-indicator method [1], and overall mean imputation. However, these methods lead to inefficient analyses and, more seriously, commonly produce severely biased estimates of the association(s) investigated [2], [3], [4], [5], [6]. There are more sophisticated (imputation) techniques to handle missing data, such as multiple imputation, that give much better results [2], [3], [4], [5], [6]. With these techniques, a subject's missing value is replaced by a value that is predicted from the subject's other, known characteristics. Presently, these sophisticated techniques are easily accessible and available in standard software such as SAS and S-Plus. Nevertheless, there seems to be a general lack of understanding that has limited their use in epidemiological research.
In this short report we give a gentle introduction to the logic behind these sophisticated imputation techniques for missing data. We will not go into technical details, nor into details on how to perform these analyses. For these we refer to the literature [2], [3], [4], [5], [6], [7], [8]. Instead, to assist medical researchers in their future data analyses, we aim to clarify in simple terms why (more sophisticated) imputation is a better, more valid method than the simple and frequently used techniques for handling missing data. We start with a brief introduction to the different types of missing data and the principles of imputation in general, followed by an explanation of single and multiple imputation and of why frequently used methods fail. All this is illustrated using data from a simple simulation study.
Section snippets
Types of missing data
If subjects who have missing data are a random subset of the complete sample of subjects, missing data are called missing completely at random (MCAR) [9]. Typical examples of MCAR are when a tube containing a blood sample of a study subject is broken by accident (such that the blood parameters cannot be measured) or when a questionnaire of a study subject is accidentally lost. The reason for missingness is completely random, i.e., the probability that an observation is missing is not related
Imputation is replacement
We start this section by noting that in the classical (frequentist) statistical view, conclusions drawn from any study should not depend on the particular sample involved in the study. Should the study be repeated with a different sample, nearly identical results should be obtained. The conclusions do not depend on the given set of subjects in the sample. This implies that every subject in a randomly chosen sample can be replaced by a new subject that is randomly chosen from the same source
Single imputation
Direct replacement of subjects by new subjects from an identifiable source population based on observed subject characteristics may be feasible when the number of study variables is limited, as in our diagnostic example study where only two variables, the test result and disease status, were used. Commonly, however, the number of covariates is large. Suppose a nondiseased male subject, aged 39, with a body mass index of 24.5, and a systolic blood pressure of 110 has a missing test result. If
Multiple imputation
To obtain correct estimates of the standard errors and P-values, we should take into account the imprecision caused by the fact that the distribution of the variables with missing values is estimated. This can be done by creating not a single imputed data set, but several or multiple imputed data sets in which different imputations are based on a random draw from different estimated underlying distributions [4], [5]. There are various approaches to creating these multiple imputed data sets.
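Once several imputed data sets have been analyzed, the separate results have to be combined into one estimate with one standard error. The text shown here does not spell out how this is done; a common choice is the set of combining rules usually attributed to Rubin, which the following minimal Python sketch assumes. The function name `pool_rubin` and all numbers are hypothetical and serve only as illustration; they are not taken from the article.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine results from m imputed data sets (Rubin's rules, assumed here).

    estimates: the m point estimates (e.g., regression coefficients)
    variances: the m squared standard errors of those estimates
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputed data sets with matching inputs")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from five imputed data sets (illustration only).
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled, total = pool_rubin(estimates, variances)
print("pooled estimate:", round(pooled, 3))
print("pooled standard error:", round(total ** 0.5, 3))
```

The between-imputation term is what a single imputed data set cannot provide: it carries the extra uncertainty due to the imputed values being estimated, so the pooled standard error is larger than the one obtained from any single completed data set of comparable size.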
Simulation study
We performed a simulation study based on our diagnostic example to illustrate that single imputation yields unbiased estimates with too narrow confidence intervals and multiple imputation indeed yields unbiased estimates with correct standard errors and P-values. We simulated 1,000 samples of 500 subjects using R [14]. The samples were drawn from a population consisting of equal numbers of diseased and nondiseased subjects. The true regression coefficient in a logistic regression model linking
Indicator method
A still popular method for handling missing values is the so-called missing-indicator method [1]. For each independent variable with missing values a new dummy or indicator (0/1) variable is created, with “1” indicating a missing value on the original variable and “0” indicating an observed value. For the original variable the missing values are recoded as “0.” For (original) categorical variables this in fact means creating an extra value category for the missing values. When estimating the
Final comments
Our purpose was to provide insight into how sophisticated imputation works, to facilitate the understanding and cooperation between medical researchers and statisticians, and to make the data analysis a success. Complete and available case analyses provide inefficient though valid results when missing data are MCAR, but biased results when missing data are MAR, which is the more common form of missingness in epidemiological research. Other frequently used methods to handle missing data such as
Acknowledgments
We gratefully acknowledge the support by The Netherlands Organization for Scientific Research (ZON-MW 904-10-006 and 917-46-360).
References (15)
- et al. Developing a prognostic model in the presence of missing data. An ovarian cancer case study. J Clin Epidemiol (2003)
- et al. Diagnostic research on routine care data: prospects and problems. J Clin Epidemiol (2003)
- Theoretical epidemiology. Principles of occurrence research in medicine (1985)
- et al. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol (1995)
- Logistic regression with missing values in the covariates (1994)
- Multiple imputation for nonresponse in surveys (1987)
- Analysis of incomplete multivariate data (1997)