Abstract
Rheumatology research often involves correlated and clustered data. A common error is to analyze these data as if the observations were independent. This can lead to incorrect statistical inference. The data used are a subset of the 2017 study from Raheel et al consisting of 633 patients with rheumatoid arthritis (RA) between 1988 and 2007. RA flare and the number of swollen joints served as our binary and continuous outcomes, respectively. Generalized linear models (GLM) were fitted for each, while adjusting for rheumatoid factor (RF) positivity and sex. Additionally, a generalized linear mixed model with a random intercept and a generalized estimating equation were used to model RA flare and the number of swollen joints, respectively, to take additional correlation into account. The GLM’s β coefficients and their 95% confidence intervals (CIs) are then compared to their mixed-effects equivalents. The β coefficients compared between methodologies are very similar. However, their standard errors increase when correlation is accounted for. As a result, if the additional correlations are not considered, the standard error can be underestimated. This results in an overestimated effect size, narrower CIs, increased type I error, and a smaller P value, thus potentially producing misleading results. It is important to model the additional correlation that occurs in correlated data.
Rheumatology research often involves correlated and clustered data. Many consortia and registries, such as the Rheumatology Informatics System for Effectiveness (RISE)1 registry, have been formed to pool together data from multiple institutions to effectively study rare diseases. Patients from the same institution may be more similar to each other than they are to patients from other institutions because of institutional differences in patient care or documentation, or because of regional differences in environmental factors. This commonality (or more technically, correlation) between patients from the same institution needs to be accounted for when conducting statistical analyses. Similarly, rheumatology studies often involve data gathered at multiple timepoints per patient, which are often referred to as repeated measures. Thus, these types of data are common, and errors commonly occur when the within-cluster (or within-person) correlation is ignored in the analysis.2,3
The objective of this article is to provide a high-level overview of the importance of properly analyzing correlated data, the consequences of ignoring the correlation, and the recommended approaches and statistical software to conduct the analysis. Data from a population-based study of patients with rheumatoid arthritis (RA) are used as an example. Ethics approval was obtained from the Mayo Clinic Institutional Review Board (17-002593).
Negative consequences of using incorrect analysis methods
As the data within a cluster tend to be more alike than the data from a different cluster, the data from the same cluster are not independent. Treating multiple observations from the same patient as if they came from different patients is incorrect. The conventional statistical approaches, such as 2-sample t test, analysis of variance (ANOVA), or logistic regression would not be appropriate as the assumption of independence of observations is violated, resulting in incorrect results for the standard error (SE), confidence interval (CI), test statistic, and P value. The invalid statistical inference may result in misleading conclusions.
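The consequence described above can be illustrated with a small simulation. This is a minimal sketch on simulated data, not an analysis of the study data: the sample sizes, variance values, and variable names are all assumptions chosen for illustration. Each simulated patient has a patient-level binary covariate with no true effect, repeated visits share a patient-specific random shift, and ordinary least squares is fitted as if all visits were independent.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (assumed) settings: 100 patients, 10 visits each.
n_patients, n_visits, n_sims = 100, 10, 500
sd_between, sd_within = 1.0, 1.0  # patient-level and visit-level SDs

betas, naive_ses = [], []
for _ in range(n_sims):
    # Patient-level binary covariate (e.g., a marker status); true effect is 0.
    x = np.repeat(rng.integers(0, 2, n_patients), n_visits).astype(float)
    # Shared random shift per patient induces within-patient correlation.
    u = np.repeat(rng.normal(0.0, sd_between, n_patients), n_visits)
    y = u + rng.normal(0.0, sd_within, x.size)

    # Ordinary least squares that treats every visit as independent.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (y.size - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)

    betas.append(coef[1])
    naive_ses.append(np.sqrt(cov[1, 1]))

betas = np.array(betas)
naive_ses = np.array(naive_ses)

empirical_sd = betas.std(ddof=1)    # actual sampling variability of the estimate
mean_naive_se = naive_ses.mean()    # what the naive model reports
type1_rate = np.mean(np.abs(betas / naive_ses) > 1.96)  # nominal level is 0.05

print(f"empirical SD of beta: {empirical_sd:.3f}")
print(f"mean naive SE:        {mean_naive_se:.3f}")
print(f"type I error rate:    {type1_rate:.3f}")
```

Under these assumed settings, the naive SE is expected to be well below the actual sampling variability of the estimate, so the null hypothesis is rejected far more often than the nominal 5%.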
Common statistical approaches used to analyze correlated data
Depending on the type of data and the goal of the analysis, several analytical approaches are available. Of necessity, some of the approaches are more complex than the statistical approaches that assume independence of observations (eg, 2-sample t test and chi-square test). Paired data (ie, 2 observations per person or cluster) can be analyzed using either a paired t test or the Wilcoxon signed-rank test when comparing continuous outcomes. The McNemar test can be used for a binary outcome.
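As a rough illustration of the paired tests named above, the following sketch uses SciPy on simulated data. The data, sample sizes, and variable names are hypothetical. The McNemar P value is computed here in its exact form, as a two-sided binomial test on the discordant pairs, rather than with a dedicated McNemar routine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired continuous data: a measurement at 2 visits
# for the same 30 patients (simulated, not from the study).
baseline = rng.poisson(6, size=30).astype(float)
followup = baseline - 1.0 + rng.normal(0.0, 2.0, size=30)

# Paired t test and Wilcoxon signed-rank test on the within-patient differences.
t_stat, t_p = stats.ttest_rel(baseline, followup)
w_stat, w_p = stats.wilcoxon(baseline, followup)

# Hypothetical paired binary data (e.g., flare yes/no at 2 visits).
# Only discordant pairs inform the McNemar test:
#   b = event at visit 1 only, c = event at visit 2 only.
b, c = 15, 5
# Exact McNemar test: two-sided binomial test of b versus c with p = 0.5.
mcnemar_p = min(1.0, 2.0 * stats.binom.cdf(min(b, c), b + c, 0.5))

print(f"paired t test P = {t_p:.4f}")
print(f"Wilcoxon signed-rank P = {w_p:.4f}")
print(f"exact McNemar P = {mcnemar_p:.4f}")
```

All 3 tests work on within-pair differences (or discordant pairs), which is how they respect the pairing instead of treating the 2 measurements as independent samples.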
When there are more than 2 observations per person or cluster, or when the analysis needs to adjust for other covariates or confounders, a regression approach can be used. For continuous or binary outcomes, generalized linear mixed models (GLMMs) or generalized estimating equations (GEEs) can be used. The GLMM or conditional approach uses a mixed-effects model that includes fixed effect(s) (ie, exposure status, covariates of interest) and a cluster-specific random effect to capture the within-cluster correlation (eg, if the patient data are clustered in hospitals, random intercepts can be specified to capture the clustering effect and to allow a hospital-specific intercept in the model). It models the correlated response conditional on the clusters and covariates. GLMMs are often used because of their flexibility in specifying the distribution of the dependent variable and the link function, and because of the options for different covariance structures for the random effect. Alternatively, a GEE or marginal approach models the mean response and treats the covariance structure as a nuisance parameter. Similar to the GLMM, the distribution of the dependent variable and the working correlation structure need to be specified. In contrast to GLMM, GEE provides the population-averaged or pooled estimates.
For time-to-event outcomes, multiple adaptations to the Cox model can be used to analyze correlated data. A frailty or conditional approach is similar to GLMM, whereby a cluster-specific random effect is included in the Cox proportional hazards model to account for the homogeneity within the cluster. This model provides the inference conditional on the cluster. A marginal approach similar to the GEE approach generates the pooled estimates across the clusters.
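In symbols, the 2 Cox adaptations can be written as follows. The notation is an assumption introduced here for illustration: patient j in cluster i has covariates x_ij, h_0(t) is the baseline hazard, and b_i is the cluster-specific random effect (the log-frailty).

```latex
% Conditional (shared frailty) model: the hazard is conditional on the cluster effect b_i
h_{ij}(t \mid b_i) = h_0(t)\,\exp\!\left(x_{ij}^{\top}\beta + b_i\right)

% Marginal model: no random effect in the hazard; within-cluster correlation
% is handled through a robust (sandwich) variance estimate for \hat{\beta}
h_{ij}(t) = h_0(t)\,\exp\!\left(x_{ij}^{\top}\beta\right)
```

In the first form, β is interpreted within a cluster; in the second, β is a population-averaged effect and only its variance estimate is adjusted for clustering.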
Example
To demonstrate the effect of failure to account for the correlated data structure, we used data from a study by Raheel et al.4 The description of the study variables was published previously.4 Patients with RA in Olmsted County, Minnesota, were included in the analytical set. There were 17,270 visits from 633 patients with RA between 1988 and 2007. In addition to demographic characteristics, clinical information after the onset of RA (the date when RA criteria were met) was recorded. Descriptive statistics relevant to this example are summarized in Table 1. The outcomes of interest in this example are flare of RA disease (binary) and the number of swollen joints (continuous). The distribution of the number of swollen joints is likely to be skewed; however, it was chosen for the purpose of demonstration. Another analytic approach, such as a Poisson or negative binomial model, could be considered, but discussion of the choice of model is outside the scope of this paper. Rheumatoid factor (RF) positivity and sex were the independent variables considered in the analysis. Because the data were a sample of patients with RA from a single county, it was possible that some patients were treated in different facilities. In this example, the facility identifier was not available and, therefore, only the correlation from the repeated measurements of the same patient was considered. If the identifier of the treatment facility had been in the data, the clustering effect of treatment facility would have been included in the analysis. The models with and without accounting for the within-patient correlation were fitted, and the model estimates (ie, regression coefficients [β], 95% CI, and P values) were reported. Both univariable and multivariable models were assessed. For the models assuming independence in observations, logistic regression and linear regression (GLM) were used for binary and continuous variables, respectively.
GLMM with random intercepts for each patient and GEE were fitted with and without the clustering effect. In the models that did not account for the correlated data structure, no covariance structure was specified because the observations were assumed to be independent. In the models with repeated measurements, the correlation of the outcomes from the same patient was specified by the covariance structure or the working correlation structure. For detailed information regarding the distribution of the outcomes, link function, and covariance/correlation structure, please see the footnotes in Table 2 and Table 3. All analyses were conducted using SAS 9.4 (SAS Institute).
The results from binary models are presented in Table 2. In the univariable models assuming independent records, the model estimates for RF positivity (β) and 95% CIs were almost identical from the logistic regression, GLMM, and GEE models (β 0.43, 95% CI 0.33-0.53 or 0.33-0.52). When the correlation from the repeated measurements was considered, the model estimate from the GLMM did not change much (β 0.44) and the width of the 95% CI increased 125% ([0.67–0.22]/[0.53–0.33]). In GEE models, the value of β decreased from 0.43 to 0.33, and the width of the 95% CI increased 116% (from width 0.19 to 0.41). Theoretically, the value of the β should be the same in the models with and without the clustering effect. The discrepancy in the β values between GLMM and GEE models was due to the different algorithms used to estimate the model parameters. A wider CI is equivalent to a larger SE. This increase was due to the additional variability estimated from the within-patient correlation. Note that in this example the variance component was used to estimate the between-patient correlation. The increase of SE will still be observed if another type of covariance structure is chosen. Similar patterns were observed in the multivariable models that included both RF positivity and sex as the independent variables.
Table 3 includes the results from the models with a continuous dependent variable. Using the univariable model as an example, the beta estimates from both GLMM and GEE models were all the same (β 0.07). However, the 95% CIs were wider for the models where the clustering effect was accounted for. The width increased 20% from 1.69 to 2.03 in the GLMM and 22% from 1.70 to 2.07 in the GEE model. For the multivariable models, the impact of the within-patient correlation was consistent with the observed results in the univariable models.
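The percentage increases in CI width quoted for Table 2 and Table 3 can be reproduced from the limits and widths given in the text. The short check below uses only those reported numbers. The helper name `pct_increase` is introduced here for illustration.

```python
def pct_increase(width_before, width_after):
    """Percentage increase in CI width after accounting for clustering."""
    return 100.0 * (width_after - width_before) / width_before

# Binary outcome, GLMM: 95% CI 0.33-0.53 (independence) vs 0.22-0.67 (clustered).
glmm_binary = pct_increase(0.53 - 0.33, 0.67 - 0.22)

# Binary outcome, GEE: CI widths as reported in the text.
gee_binary = pct_increase(0.19, 0.41)

# Continuous outcome: CI widths as reported in the text.
glmm_cont = pct_increase(1.69, 2.03)
gee_cont = pct_increase(1.70, 2.07)

# For a normal-based 95% CI, width = 2 * 1.96 * SE, so the ratio of
# widths equals the ratio of SEs.
se_ratio_glmm_binary = (0.67 - 0.22) / (0.53 - 0.33)

print(round(glmm_binary), round(gee_binary), round(glmm_cont), round(gee_cont))
print(round(se_ratio_glmm_binary, 2))
```

Because the width of a normal-based 95% CI is proportional to the SE, a 125% increase in width corresponds to an SE that is 2.25 times as large.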
Based on this empirical example, ignoring the repeated measurement data structure resulted in an underestimated SE, overestimated effect size, narrower 95% CI, higher type I error, and a smaller P value. An improper statistical approach can affect the statistical inference, and thus, result in inaccurate interpretations of the findings. The same concept can be extended to multiple levels of correlation (ie, patients nested in a facility).
Other considerations, reference materials, and statistical software
In contrast to the example, where only 2 relevant covariables were considered in the analysis, a real research situation typically includes many more relevant covariates. Overfitting is a common problem in the analysis of correlated data, as the number of random effects needed to represent all the clusters may also be considerable. It may be necessary to limit the number of covariables to avoid overfitting the model. Alternatives include using penalized methods (eg, least absolute shrinkage and selection operator) or simplifying the covariance structure to reduce the number of estimated parameters in the model.
More information on the acceptable approaches to analyze correlated data can be found in these references: Moen,5 Snijders,6 Liang,7 Therneau,8 and Zeger.9
Suggested software options in SAS and R are shown in Table 4.
Conclusions
Failure to account for correlation due to multiple observations per patient or multiple patients per institution, or other issues, can result in an erroneous estimate of variance that may yield unwarranted significant results or biased estimates of the effect size. Proper analysis methods to account for correlated data are readily available, but the added complexity of these methods may require a higher level of statistical knowledge and understanding. It is recommended that researchers work with an experienced statistician to ensure correlated data are appropriately accounted for in the analysis of a research study. It is important to recognize correlated data and avoid analytical blunders through proper study design, careful data assessment, and the use of statistical methods that do not assume independent observations.
Footnotes
This work was supported by grants from the National Institutes of Health (NIH) National Institute of Arthritis and Musculoskeletal and Skin Diseases (R01 AR46849). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
The authors declare no conflict of interest relevant to this article.
- Accepted for publication May 9, 2023.
- Copyright © 2023 by the Journal of Rheumatology
This is an Open Access article, which permits use, distribution, and reproduction, without modification, provided the original article is correctly cited and is not used for commercial purposes.