Analysis of longitudinal randomized clinical trials using item response models

https://doi.org/10.1016/j.cct.2008.12.003

Abstract

Patient-relevant outcomes, such as impairments, disability and health-related quality of life, are becoming increasingly popular as outcome measures in clinical research. These outcomes are generally assessed using questionnaires. In a longitudinal randomized clinical trial where the outcome is measured by a questionnaire or some other instrument consisting of a set of discretely scored items, treatment effects can be analyzed using item response theory. The problem addressed is how to take the estimation error in the estimates of the latent outcome variables into account in the estimation of the treatment effects. Three approaches are compared: plausible value imputation (PVI), concurrent marginal maximum likelihood (MML) estimation and a limited information two-step marginal maximum likelihood method. The results show that the power of the former two methods to detect small and moderate effect sizes is considerably larger than the power of the latter approach. An additional advantage of the PVI method as compared to MML is that the treatment effects can be estimated with standard software. An example using data from a longitudinal randomized clinical trial illustrates the use of the methods in a practical setting. It is shown that even when responses on different sets of items for different groups of patients are used for the data analysis, the power to detect the experimental effects is comparable to the power obtained when responses to all items for all patients are used in the analysis. This creates considerable flexibility in the design and use of measures in experiments.

Introduction

The patient's perspective on the impact of a disease and the effectiveness of a treatment is becoming more and more important. This has led to the development of a large number of concepts and instruments to measure patient-relevant outcomes. Cases in point are the measurement of pain [1], the measurement of disability [2] and the measurement of health-related quality of life [3]. Usually, these outcomes are assessed using questionnaires, and Item Response Theory (IRT) models are often used to analyze such outcomes. Using IRT to model the responses to such questionnaires has many advantages. If it can be shown that a unidimensional IRT model fits the response data, this supports the construct validity of the instrument: it serves as evidence that the scores can be meaningfully attributed to an underlying unidimensional variable, a so-called latent variable. Further, IRT distinguishes between item characteristics (item parameters) and the characteristics of the respondents (the values of persons on the latent variable). This separation of parameters facilitates the analysis of data collected in a so-called incomplete design, that is, a design where different persons respond to different sets of items (given that the design is linked in some way). This, for instance, supports data collection designs where the pre-test and post-test differ, or where the instrument is targeted at the level of the respondents on the latent variable. One step further is computerized adaptive testing, where every respondent is administered a unique set of items selected from a calibrated item bank using a sequential statistical optimization algorithm [4].

Patient-relevant outcomes measured using questionnaires are becoming increasingly important in clinical trials. After collection of the response data and estimation of the latent outcome variables, the next step in the analysis of a clinical trial is to evaluate group differences on the outcome measure, possibly in a longitudinal design. Hypotheses can be tested using analysis of variance and regression models for the latent outcome variables. However, if the estimation error in the latent outcome variables is ignored, inferences can be very misleading [5]. Several methods are available to properly estimate the parameters of linear models on latent outcome variables. A first method is to estimate the parameters of the IRT measurement model (item parameters) and the structural model (the regression parameters) concurrently using marginal maximum likelihood (MML) [5]. As an alternative, this estimation procedure can also be divided into two separate steps, where the parameters of the measurement model are estimated first, followed by MML estimation of the structural parameters. This procedure will be labeled two-step MML, and abbreviated MML2. As another alternative, Fox and Glas [6] consider concurrent estimation in a fully Bayesian framework where computations are made using the Gibbs sampler. The advantage of MML and Bayesian methods is that they are based on a well-founded statistical framework. Disadvantages are the numerical complexity of the methods and the need to use specialized and not readily available software. A widely used alternative is to estimate the value of the latent outcome variable for every respondent and to perform the analysis of variance or regression on these estimates. A major problem with this approach is that the outcome variables are not direct observations but estimates with an estimation error. A solution to this problem is to use multiple imputations drawn from the posterior distributions of the latent variables. The variance of these draws accounts for the uncertainty in the estimates. The method is generally known as plausible value imputation [7]. The method consists of three steps. In the first step, the IRT model is estimated and validated using generally available software such as BILOG [8], MULTILOG [9] or PARSCALE [10]. In the second step, values of the latent variables are drawn from their posterior distribution. In the last step, these so-called plausible values are imputed into the structural model, say, an analysis of variance model. The last step can be performed using standard software, such as SPSS, SAS or STATA.
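The last step of plausible value imputation, combining the analyses of the imputed data sets, can be sketched as follows. This is a minimal illustration that assumes the results are pooled with Rubin's rules for multiple imputation; the function name and the example numbers are hypothetical and are not taken from the article.

```python
from statistics import mean, variance


def pool_plausible_value_results(estimates, squared_ses):
    """Pool one estimate and one squared standard error per imputed data set.

    Assumes Rubin's rules: the pooled estimate is the average estimate, and
    the total variance adds the between-imputation variance to the average
    within-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(squared_ses):
        raise ValueError("need at least two matched estimates and variances")
    pooled = mean(estimates)                # average point estimate
    within = mean(squared_ses)              # average sampling variance
    between = variance(estimates)           # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return pooled, total


# Hypothetical treatment-effect estimates from five sets of plausible values.
effects = [0.42, 0.47, 0.39, 0.51, 0.45]
squared_ses = [0.020, 0.021, 0.019, 0.022, 0.020]
effect, total_var = pool_plausible_value_results(effects, squared_ses)
print(f"pooled effect = {effect:.3f}, pooled SE = {total_var ** 0.5:.3f}")
```

The between-imputation term is what carries the estimation error of the latent variables into the standard error of the treatment effect; analyzing a single set of estimates would omit it.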

The methods and simulation studies presented here are an extension of the work by Holman, Glas and de Haan [11]. These authors examined the power of the MML2 method in a two-arm trial with the two-parameter logistic model (2PL model) as a measurement model. They conclude that the number of respondents in each arm of a randomized trial required to detect certain effect sizes varies with the number of items used. They also conclude that as long as at least 20 dichotomously scored items are used, the number of items barely affects the number of respondents needed to detect effect sizes of 0.5 and 0.8 (in terms of Cohen's d-metric [12]), with a power of 80%.
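To give a feel for the magnitudes involved, the sketch below computes the number of respondents per arm with the standard normal-approximation formula for a two-sided, two-sample comparison. It is not the simulation-based procedure of Holman, Glas and de Haan. The `reliability` argument is an assumption borrowed from classical test theory, where measurement error attenuates a standardized effect by the square root of the reliability; it only illustrates why shorter questionnaires call for more respondents.

```python
from math import ceil
from statistics import NormalDist


def respondents_per_arm(effect_size, power=0.80, alpha=0.05, reliability=1.0):
    """Approximate sample size per arm for a two-sided two-sample test.

    effect_size is Cohen's d on the latent scale; reliability (0 < r <= 1)
    is a hypothetical attenuation factor, not a quantity from the article.
    """
    standard_normal = NormalDist()
    observed_effect = effect_size * reliability ** 0.5
    z_sum = standard_normal.inv_cdf(1 - alpha / 2) + standard_normal.inv_cdf(power)
    return ceil(2 * (z_sum / observed_effect) ** 2)


for d in (0.5, 0.8):
    print(d, respondents_per_arm(d), respondents_per_arm(d, reliability=0.8))
```

Under this approximation the required sample size grows as the reliability falls, which is consistent in direction with the finding that the number of respondents needed depends on the number of items.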

In the present article, their research is generalized in several directions. With respect to estimation methods, the two versions of MML estimation and plausible value imputation are studied. The results show that in applications where the number of respondents is relatively small (which is usually the case in clinical trials) the power of hypothesis testing using plausible value imputation is larger than the power of MML2. Further, the approach is generalized to a two-way longitudinal design and an IRT model for polytomously scored items.

Section snippets

Item response theory

Consider polytomously scored items labeled i = 1,…, K. Every item has response categories labeled j = 0,…, m_i (in the sequel, the index i of m_i is dropped for convenience). Item responses will be coded by stochastic variables U_ntij, for respondents labeled n = 1,…, N and time points labeled t = 1,…, T, with realizations u_ntij. Further, u_ntij = 1 if a response was given in category j and zero otherwise. The probability of scoring in response category j on item i is given by a response function P_ij(θ_nt) = P(U
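The excerpt breaks off before the response function is written out, so the following is only an illustration of what a response function for a polytomous item can look like. It assumes the generalized partial credit model, which appears in the reference list; the function name, the parameter names and the example values are hypothetical.

```python
from math import exp


def category_probabilities(theta, discrimination, steps):
    """Category probabilities for j = 0,…, m under an assumed GPCM.

    theta is the latent value of one respondent at one time point;
    steps holds one step parameter for each of the categories 1,…, m.
    """
    # Cumulative sums of discrimination * (theta - step); category 0 is fixed at 0.
    logits = [0.0]
    for step in steps:
        logits.append(logits[-1] + discrimination * (theta - step))
    largest = max(logits)  # subtract the maximum for numerical stability
    weights = [exp(value - largest) for value in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


# A hypothetical four-category item (m = 3) for a respondent with theta = 0.3.
print(category_probabilities(0.3, discrimination=1.2, steps=[-1.0, 0.0, 1.0]))
```

With a single step parameter the expression reduces to the 2PL model for a dichotomously scored item.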

Marginal maximum likelihood

Marginal maximum likelihood (MML) estimation is a widely used technique for item calibration. For the 2PL and 3PL models, the theory was developed by Bock and Aitkin [21]. Under the label “Full Information Factor Analysis”, multidimensional versions of the 2PL and 3PL models were developed by Bock, Gibbons, and Muraki [22]. MML estimates of the regression coefficients β and the covariance matrix Σ can be obtained either concurrently with the item parameters or treating the item parameters
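The central quantity in MML is the probability of a response pattern with the latent variable integrated out. The sketch below approximates that integral for the 2PL model on a fixed grid of quadrature points, assuming a normal latent distribution whose mean may differ between groups. The grid, the item parameters and the function names are assumptions made for illustration; this is not the estimation routine used in the article.

```python
from itertools import product
from math import exp


def two_pl(theta, a, b):
    """Probability of a positive response under the 2PL model."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def quadrature(mean=0.0, sd=1.0, points=41, width=5.0):
    """Equally spaced nodes with normalized normal-density weights."""
    nodes = [mean + sd * (-width + 2 * width * k / (points - 1)) for k in range(points)]
    density = [exp(-0.5 * ((x - mean) / sd) ** 2) for x in nodes]
    total = sum(density)
    return nodes, [d / total for d in density]


def marginal_probability(pattern, items, mean=0.0, sd=1.0):
    """Probability of a 0/1 response pattern, averaged over the latent variable."""
    nodes, weights = quadrature(mean, sd)
    result = 0.0
    for theta, weight in zip(nodes, weights):
        likelihood = 1.0
        for u, (a, b) in zip(pattern, items):
            p = two_pl(theta, a, b)
            likelihood *= p if u == 1 else 1.0 - p
        result += weight * likelihood
    return result


items = [(1.0, -0.5), (1.3, 0.0), (0.8, 0.7)]  # hypothetical (a, b) pairs
patterns = list(product((0, 1), repeat=len(items)))
print(sum(marginal_probability(p, items) for p in patterns))  # sums to 1 up to rounding
print(marginal_probability((1, 1, 1), items, mean=0.0),
      marginal_probability((1, 1, 1), items, mean=0.5))
```

Maximizing the sum of the logarithms of these marginal probabilities over the item parameters and the group means gives concurrent estimation; holding the item parameters fixed at previously estimated values and maximizing over the means only corresponds to the two-step variant.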

Dichotomously scored items, two groups

The first set of simulations pertains to the simulations done by Holman, Glas and de Haan [11]. These authors use the MML2 method, in which the item parameters are fixed. In the present report, concurrent MML estimation and plausible value imputation are also considered. Holman et al. consider a design with 30, 40, 50, 100, 200, 300, 500 or 1000 subjects in each of the two groups and questionnaire lengths of 5, 10, 15, 20, 30, 50, 70 or 100 dichotomously scored items. It is beyond the scope of the
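One replication of such a design can be generated along the following lines. The grid of sample sizes and questionnaire lengths is the one listed above; the distributions of the item parameters, the seed and the function name are assumptions made for the sketch, since the excerpt does not state how the data were generated.

```python
import random
from itertools import product
from math import exp

SUBJECTS_PER_GROUP = [30, 40, 50, 100, 200, 300, 500, 1000]
ITEMS_PER_QUESTIONNAIRE = [5, 10, 15, 20, 30, 50, 70, 100]
DESIGN = list(product(SUBJECTS_PER_GROUP, ITEMS_PER_QUESTIONNAIRE))  # 64 conditions


def simulate_trial(n_per_group, n_items, effect_size, seed=1):
    """Simulate 0/1 responses for a control group and a treatment group.

    Latent values are normal with unit variance; the treatment group mean
    is shifted by effect_size. Item parameters are drawn from hypothetical
    distributions (discrimination uniform on 0.5 to 1.5, location standard normal).
    """
    rng = random.Random(seed)
    items = [(rng.uniform(0.5, 1.5), rng.gauss(0.0, 1.0)) for _ in range(n_items)]
    data = []
    for group in (0, 1):
        for _ in range(n_per_group):
            theta = rng.gauss(group * effect_size, 1.0)
            responses = [
                int(rng.random() < 1.0 / (1.0 + exp(-a * (theta - b))))
                for a, b in items
            ]
            data.append((group, responses))
    return items, data


items, data = simulate_trial(n_per_group=30, n_items=5, effect_size=0.5)
print(len(DESIGN), len(data), data[0])
```

Estimating the power for one condition would mean repeating this for many seeds, fitting the model to each data set, and recording the proportion of replications in which the treatment effect is significant.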

An example

The example serves to illustrate how latent regression models can be used in practice. The data are from a study of the effect of a combination of thalassotherapy, exercise and patient education in patients with fibromyalgia. Patients with fibromyalgia were selected from a rheumatology out-patient department and from the members of the Dutch fibromyalgia patient association. The patients were randomized to receive either 2.5 weeks of treatment in a Tunisian spa resort, including

Summary and discussion

Item response theory (IRT) models are increasingly used to analyze data obtained from randomized clinical trials. The resulting latent outcome variables are then used in a structural model to estimate the effect of the experimental treatment, possibly over time. Unfortunately, the latent outcome variables are often treated as fixed, so that the estimation error of the parameters is ignored. It is well known that inferences ignoring measurement error can be very misleading [37]. Three

References (37)

  • E. Muraki et al., PARSCALE: Parameter scaling of rating data (2002)
  • J. Cohen, Statistical power analysis for the behavioral sciences (1988)
  • E. Muraki, A generalized partial credit model: application of an EM algorithm, Appl Psychol Meas (1992)
  • G.N. Masters, A Rasch model for partial credit scoring, Psychometrika (1982)
  • A. Birnbaum, Some latent trait models and their use in inferring an examinee's ability
  • G. Rasch, Probabilistic models for some intelligence and attainment tests (1960)
  • H. Goldstein, Multilevel mixed linear models analysis using iterative generalized least squares, Biometrika (1986)
  • A.S. Bryk et al., Hierarchical Linear Models (1992)