Abstract
Clinical manifestations and disease progression often vary substantially among patients with rheumatic diseases, complicating diagnosis and treatment. A better understanding of this disease heterogeneity may allow for personalized treatment strategies. Cluster analysis is a class of statistical methods that aims to identify subgroups or patterns within a dataset. It is a type of unsupervised learning, meaning there are no outcomes or labels to guide the analysis (ie, there is no ground truth). This makes it difficult to assess the accuracy or validity of the identified clusters, and these methods therefore require thoughtful planning and careful interpretation. Here, we provide a high-level overview of clustering, including different types of clustering methods and important considerations when undertaking clustering, and review examples from the rheumatology literature.
Introduction
Disease heterogeneity poses a challenge to both research and patient care in rheumatic and musculoskeletal diseases, as clinical presentation and disease course can differ substantially between patients.1 Researchers are often interested in identifying patients who are similar to each other with respect to clinical, demographic, or disease characteristics. This could create opportunities for precision medicine, including tailored treatment strategies, improved prognostic models, and targeted clinical trial recruitment. Cluster analysis is an exploratory data analysis technique for grouping together similar observations. Examples include investigating disease heterogeneity in osteoarthritis (OA) using biochemical biomarkers and defining multimorbidity clusters in patients with ankylosing spondylitis (AS).2,3
Cluster analysis is a type of unsupervised learning. “Unsupervised” means that there are no outcomes or labels. In the more familiar supervised learning, we aim to explain or predict an outcome. For example, we might aim to predict which patients achieve remission or which patients are likely to be adherent to their medication. In unsupervised learning, we instead aim to uncover patterns or structure in the data. For example, we may ask whether patterns exist in disease activity or medication adherence.
The application of clustering methods in rheumatology research is increasing. A PubMed search for common clustering methods (eg, K-means, latent class analysis) indicates an increasing trend in the use of these methods in studies evaluating conditions such as OA, systemic lupus erythematosus, and rheumatoid arthritis, as shown in Figure 1.
Number of OA, RA, and SLE publications using cluster analysis by year (2014-2023), as identified in a PubMed search for common clustering methods applied to OA, SLE, and RA. OA: osteoarthritis; RA: rheumatoid arthritis; SLE: systemic lupus erythematosus.
Given the rising use of these methods, this article aims to provide a high-level overview of clustering, highlighting examples from the rheumatology literature. This is meant to be an applied overview of clustering; a more technical overview is provided elsewhere,4,5 as are in-depth tutorials.6-10
Types of cluster analysis
There are several approaches to clustering, all of which aim to group similar observations together into subgroups. These approaches differ with respect to how the clustering is performed; different methods may be appropriate depending on the structure and nature of the data, the hypothesized cluster structure, and the goals of the analysis.
Partitional clustering. Partitional clustering approaches seek to separate, or partition, observations into a predefined number of groups.5,11 Datapoints are initially assigned to clusters at random and then iteratively reassigned until an optimal solution is obtained. Partitional clustering methods work well when the clusters are spherical or globular, that is, tightly packed together with uniform density and variance.5 These methods tend not to work well when features are correlated; in this case, data reduction or preprocessing is typically required, as described below. Partitional clustering methods are popular because the methods and results are generally easy to interpret, and the algorithms are usually efficient.
K-means is one of the most popular partitional clustering methods.12,13 The total number of clusters, K, is prespecified, and observations are partitioned into each of these K clusters. The algorithm seeks to minimize the squared Euclidean distance between each point in a cluster and the mean of the cluster, called the centroid. The steps in K-means clustering are shown in Figure 2.
Visualization of K-means clustering method. Step 1: Initialization (centroids are randomly initialized). Step 2: Assignment (each observation is assigned to the closest centroid). Step 3: Centroid update (the algorithm recalculates the mean value of each cluster based on assigned observations). Step 4: Reassignment (each observation is assigned to the new closest centroid). Steps 5-6: Centroids are updated, and observations are reassigned. This process repeats until the algorithm converges.
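As a minimal illustration of how K-means is applied in practice, the Python sketch below uses the scikit-learn library on simulated data; the feature matrix, the choice of K = 4, and all parameter settings are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Simulated data standing in for patient-level features (eg, disease activity measures)
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize before distance-based clustering

# K-means with a prespecified K = 4 and 25 random restarts of the initialization
kmeans = KMeans(n_clusters=4, n_init=25, random_state=0).fit(X)

print(kmeans.labels_[:10])      # cluster assignment for the first 10 observations
print(kmeans.cluster_centers_)  # centroid (mean) of each cluster
print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)
```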
K-medoids clustering, or partitioning around medoids, is similar to K-means, but instead of defining the cluster center as the centroid (mean), the medoid is used.10 A medoid is an observation within the cluster that has minimum dissimilarity with all other members of the cluster. Rather than computing distances between each observation and the mean of the cluster, the distance is computed between each observation and a representative observation (medoid). As with K-means, an iterative process updates the cluster medoids and cluster memberships until the algorithm converges.
K-medoids clustering is more robust to outliers and extreme observations compared to K-means and can be used for mixed data types (ie, continuous and categorical). Both K-means and K-medoids use random initialization of clusters and can therefore be sensitive to the starting point; different starting placement of the centroid/medoid may result in different clustering solutions.14 Running the algorithm multiple times with different random initial configurations is recommended.15 K-medoids can be computationally intensive for large datasets; extensions such as Clustering Large Applications Based Upon Randomized Search (CLARANS) use random sampling to reduce processing time.16 Recent extensions and updates allow for more flexibility in the requirements for partitional clustering. For example, the K-prototype algorithm relaxes some assumptions with respect to the data structure and cluster shape required by K-means.17
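If a K-medoids implementation is needed in Python, one option is the scikit-learn-extra package, which provides a KMedoids estimator; the sketch below assumes that package is installed, and the data, distance metric, and number of clusters are again illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # requires the scikit-learn-extra package

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# K-medoids with Manhattan distance; cluster centers are actual observations (medoids)
kmedoids = KMedoids(n_clusters=4, metric="manhattan", random_state=0).fit(X)

print(kmedoids.labels_[:10])
print(kmedoids.cluster_centers_)  # each row is a medoid, ie, an observed datapoint
```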
Partitional clustering: choosing the number of clusters. Partitional clustering methods require that the number of clusters (K) be prespecified. Typically, a range of possible values for K is considered,4 and measures of within-cluster dissimilarity are computed for each K and compared. Since within-cluster dissimilarity naturally decreases as the number of clusters increases, choosing the number of clusters requires weighing increased complexity (ie, more clusters) against improvements in dissimilarity.
The elbow method compares within-cluster dissimilarity between clustering solutions graphically (ie, between different values of K).6 Within-cluster dissimilarity is often estimated by the within-cluster sum of squares (WCSS), which is the sum of the squared distances of each point to the centroid. The WCSS is plotted for different values of K, and the “elbow” of the curve, or the point at which the curve bends, indicates the ideal number of clusters (ie, the point at which the added complexity of another cluster does not substantively improve WCSS). An example of the elbow method is shown in Figure 3A. The cluster dissimilarity score (ie, WCSS) is estimated for K ranging from 1 to 9. WCSS improves substantially from K = 1 to K = 2, K = 2 to K = 3, and K = 3 to K = 4, and then begins to level off. Here, the optimal number of clusters is 4. Although the elbow method is intuitive, it is also subjective13; there may be no clear elbow or multiple potential elbows to choose from.18 Figure 3B shows an example in which the ideal number of clusters is not obvious: there is no clear bend in the line, and WCSS appears to gradually improve as more clusters are added.
Elbow method for determining optimal K. The K-means score (eg, within-cluster sum of squares) is plotted against the number of clusters, K (range 1-9). (A) The K-means score improves substantially from K = 1 to K = 2, K = 2 to K = 3, K = 3 to K = 4, and then begins to level off. Here, the optimal number of clusters is 4. (B) The ideal number of clusters is not obvious because the K-means score improves gradually from K = 1 to K = 9 with no clear bend in the line.
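A minimal sketch of the elbow method follows; the simulated data and the candidate range of K (1 to 9) are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# Fit K-means for K = 1 to 9 and record the within-cluster sum of squares (WCSS)
ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters, K")
plt.ylabel("Within-cluster sum of squares (WCSS)")
plt.show()  # look for the elbow where adding clusters stops substantially improving WCSS
```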
There are a number of alternatives to the elbow method. Many of these methods evaluate both within-cluster similarity (eg, using WCSS, also called inertia) and between-cluster separation, and are therefore less prone to favoring ever-larger numbers of clusters simply because within-cluster dissimilarity decreases as K increases. A comprehensive review is beyond the scope of this paper but can be found elsewhere.19,20 We briefly review a few popular methods here.
The silhouette method measures how well each datapoint lies within its cluster compared to other clusters.21 Values close to 1 indicate that the datapoint is well matched to its own cluster and far from neighboring clusters, values close to 0 indicate that the datapoint lies on the boundary between clusters, and negative values suggest that the datapoint may be misassigned. The silhouette method can be used to assess the fit of datapoints within clusters, as shown in Figure 4. Here, each bar represents the silhouette score for 1 datapoint; silhouette scores are plotted in decreasing order. The value of K with the highest average silhouette score is chosen as the solution.
Silhouette plots for (A) K = 3 and (B) K = 4. Each bar represents the silhouette score for 1 datapoint; silhouette scores are plotted in decreasing order. Here, the average silhouette value is higher for the 4-cluster solution (0.67) than the 3-cluster solution (0.56), so the optimal K would be 4.
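A sketch of silhouette-based selection of K using scikit-learn; silhouette_samples returns the per-datapoint values that form the bars in a silhouette plot such as Figure 4. The data and candidate values of K are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

for k in range(2, 7):  # the silhouette requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # choose K with the highest average score

# Per-datapoint silhouette values (the individual bars in the silhouette plot)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
per_point = silhouette_samples(X, labels)
```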
The gap statistic computes intracluster variation for each clustering solution (1 to K) and compares this to the intracluster variation under a null reference distribution (ie, one in which there are no clusters).22 The value of K that maximizes this difference is selected as optimal. The Calinski-Harabasz index, Davies-Bouldin index, and Dunn index are other common clustering fit metrics that assess the within-cluster similarity compared to between-cluster separation. A detailed review of these can be found elsewhere.23
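The Calinski-Harabasz and Davies-Bouldin indices are available in scikit-learn (the gap statistic and Dunn index are not part of that library and are not shown); a brief illustrative sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher values indicate better-separated clusters
print(davies_bouldin_score(X, labels))     # lower values indicate better-separated clusters
```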
Partitional clustering: an example. Karmacharya et al aimed to identify multimorbidity phenotypes in patients with AS based on the presence or absence of 31 AS comorbidities.3 Comorbidities are associated with higher mortality and complicate AS disease assessment. The features to be clustered—presence or absence of comorbidity—are binary, so the authors used K-medians clustering, which has been shown to have superior performance over K-means when clustering binary data.24 The number of clusters was determined using the elbow method. The primary analysis included only baseline comorbidities; a sensitivity analysis examining the stability of the clusters reran the K-medians algorithm using longitudinal comorbidity data. Comorbidities missing in > 50% of patients were not included in the analysis, and multiple imputation was used to impute missing data for the remaining comorbidities prior to clustering. The authors identified 5 distinct multimorbidity clusters, and both the number and type of comorbidities appeared to be associated with long-term disease outcomes.3 In particular, the “depression” cluster was associated with worse disease activity and function. Most studies to date have examined individual comorbidities in AS; here, the clustering analysis was able to illuminate patterns of comorbidities that tend to present together, revealing whether and how such patterns are related to AS disease activity.3
Hierarchical methods. Hierarchical clustering does not rely on a predefined number of clusters and instead aims to group together similar datapoints in a hierarchical fashion.4 Hierarchical clustering is often used when the research question is focused on understanding the hierarchical, or nested, structure of the data and groupings, rather than simply dividing datapoints into clusters.4 It is often used in gene expression analysis or in clustering microarray data.25
Two general approaches are used. The more popular “bottom-up” approach initializes every datapoint in its own cluster and then merges datapoints and clusters based on a similarity index. This is referred to as agglomerative clustering, whereby datapoints are gathered together, or agglomerated, into clusters. “Top-down” hierarchical clustering is more computationally intensive and therefore less common. In this approach, all datapoints start in the same cluster and then are split into separate clusters based on a similarity index. This is referred to as divisive clustering, whereby groups of datapoints are divided into clusters.
Hierarchical clustering solutions are often visualized with a dendrogram, a tree-like figure that shows the clustering solution for each possible number of clusters from 1 to N (ie, each datapoint is in its own cluster). The datapoints are displayed along the bottom of the graph and are joined together in a tree diagram. The height of the branch signifies how similar datapoints or clusters are, with shorter branches indicating more similarity. The cluster size increases moving from the bottom to the top of the graph. A simple example of clustering countries is shown in Figure 5. Here, the United States and Canada were deemed the most similar and hence have the shortest branch. The dendrogram shows all possible clustering solutions. A 2-cluster solution (dashed line) consists of a Central and North American cluster and a European cluster. A 5-cluster solution (dotted line) groups together Germany and the Netherlands, England and Scotland, Italy and Spain, the US and Canada, and Belize. Dendrograms are often combined with heatmaps to visualize the similarity between datapoints, with cells colored based on the magnitude of dissimilarity.26 The goal of hierarchical clustering is often to provide insight into the structure/hierarchy of the data. Although clustering validation metrics can be used to find an optimal number of clusters, that is often not the goal of hierarchical clustering.
Dendrogram showing countries clustered by similarity. Here, the US and Canada were deemed the most similar and hence have the shortest branch. The dendrogram shows all possible clustering solutions. A 2-cluster solution (dashed line) consists of a Central and North American cluster and a European cluster. A 5-cluster solution (dotted line) groups together Germany and the Netherlands, England and Scotland, Italy and Spain, the US and Canada, and Belize.
The dendrogram is formed by grouping together similar datapoints. As with partitional clustering, we must define what we mean by “similar.” Typically, the Euclidean distance is used to measure the distance between datapoints or clusters. Alternate distance measures are sometimes used depending on the nature of the data, including data type, correlation structure, and variance.23,27 For example, Manhattan distance is the sum of the absolute differences along each dimension and can work well with high-dimensional datasets that may include irrelevant or redundant information. Mahalanobis distance accounts for the covariance structure of the data; it is useful in settings in which data are correlated or have unequal variance. Gower distance is often used in the setting of mixed data types (ie, continuous and categorical).28
In agglomerative clustering, another decision in defining distance is to choose the points between which distances are calculated so that the algorithm can determine which clusters should be joined.15 This is referred to as linkage. For example, with single link (nearest neighbor), the distance (eg, Euclidean, Manhattan) is calculated between the closest points in 2 clusters, whereas with average link, the distance is calculated as the average of all pairwise differences between all pairs in 2 clusters. Popular linkage methods also include complete link, centroid link, and Ward distance, as described in detail elsewhere.15,23
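A minimal agglomerative clustering sketch using SciPy follows; the simulated data, distance metric, linkage method, and the cut into 3 clusters are illustrative choices only.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, n_features=4, random_state=0)

# Agglomerative (bottom-up) clustering with Euclidean distance and average linkage;
# metric="cityblock" would use Manhattan distance, method="ward" would use Ward linkage
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)  # visualize the full hierarchy as a tree
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
```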
Model-based methods. Partitional and hierarchical clustering are both “hard” clustering methods, meaning that each observation belongs to 1 cluster only. Model-based clustering methods use a statistical model to represent the underlying data generating process, which is assumed to be a mixture of different subpopulations.29 Instead of grouping or dividing datapoints into clusters, model-based clustering uses a mathematical model to explain the heterogeneity in the data and uncover subgroups. The probability of belonging to each cluster is computed for each observation, and observations are assigned to the cluster with the highest probability. This allows for the investigation of overlap between clusters. Model-based clustering methods are often used when the clusters may not be well separated or when there is uncertainty in the underlying distribution of the data.
Latent class and latent profile analyses are popular model-based clustering methods that aim to identify presumed unobserved, or latent, subgroups.30,31 Latent class analysis is used for clustering categorical variables or a mix of categorical and continuous variables, and latent profile analysis is used for clustering continuous variables. Cluster membership is estimated through maximum likelihood estimation. As with K-means and other partitioning clustering methods, the number of clusters must be specified. In practice, a range (ie, 1 to K) of cluster numbers is considered, and model fit statistics such as the Bayesian information criterion (BIC) and Akaike information criterion are used to select the optimal number.32 Because model-based clustering methods use a statistical model, statistical tests such as the bootstrap likelihood ratio test can also be used to statistically test for the number of clusters.33,34 Maximum likelihood estimation can sometimes converge to a local rather than the global maximum of the likelihood; therefore, it is recommended that many sets (eg, 100) of starting values be used.35
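Latent profile analysis of continuous variables is closely related to a Gaussian (normal) mixture model; the sketch below uses scikit-learn's GaussianMixture with the BIC to compare candidate numbers of clusters. Dedicated latent class software would be needed for categorical indicators, and the data and settings here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Fit mixture models with 1 to 6 components and compare the BIC (lower is better)
bics = [GaussianMixture(n_components=k, n_init=10, random_state=0).fit(X).bic(X)
        for k in range(1, 7)]
best_k = int(np.argmin(bics)) + 1

gm = GaussianMixture(n_components=best_k, n_init=10, random_state=0).fit(X)
probs = gm.predict_proba(X)    # probability of belonging to each cluster ("soft" assignment)
labels = probs.argmax(axis=1)  # assign each observation to its most probable cluster
```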
Longitudinal clustering. Many of these methods can be extended to cluster longitudinal data. K-Means for Longitudinal Data (KmL) is an extension of K-means clustering for longitudinal data.36,37 Repeated measures latent class analysis is an extension of latent class analysis that clusters on the individual repeated measurements.38 Latent class growth analysis (also called group-based trajectory modeling) and latent class growth mixture models are model-based clustering approaches for longitudinal data; these methods cluster individuals based on their longitudinal trajectories.39 The same principles with respect to data preprocessing, assessing model fit, and choosing the number of clusters generally apply to these methods. Published guidelines provide a framework for building, interpreting, and reporting longitudinal clustering models.40,41
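KmL and the other longitudinal approaches cited above are implemented in R or SAS rather than Python; the sketch below only illustrates the basic idea behind KmL (applying K-means to each individual's vector of repeated measurements) on simulated trajectories, with K = 3 as an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_patients, n_visits = 200, 6

# Simulated trajectories: one row per patient, one column per visit
# (eg, a disease activity score measured repeatedly over follow-up)
slopes = rng.choice([-1.0, 0.0, 1.0], size=n_patients)
visits = np.arange(n_visits)
trajectories = 5 + slopes[:, None] * visits + rng.normal(scale=0.5, size=(n_patients, n_visits))

# Cluster patients on their entire trajectory (the idea underlying KmL)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(trajectories)
```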
Longitudinal clustering: an example. Hanberg et al sought to cluster patients with antineutrophil cytoplasmic antibody–associated vasculitis (AAV) based on longitudinal changes in estimated glomerular filtration rate (eGFR).42 Chronic kidney disease is a common and serious side effect of AAV. The authors used group-based trajectory modeling, which clusters participants based on the longitudinal trajectory of outcome (in this case, eGFR). The goal was to identify distinct longitudinal renal trajectory groups in an effort to inform the development of personalized care strategies for patients with AAV. Group-based trajectory modeling (Proc Traj in SAS [SAS Institute]43) was used to partition patients into clusters based on longitudinal renal function, measured by percent change in eGFR. Three to 5 groups were considered. Group-based trajectory modeling is a model-based method, and therefore model fit statistics such as BIC and Bayes factor were used to determine the final model, along with group size criteria (see below for guidance on sample size in clustering) and face validity (determined by the authors’ clinical expertise). Four renal trajectory groups were identified (rapid decline, impaired, preserved, and recovery), with the rapid decline and impaired groups having the greatest burden of clinically significant kidney disease and overall comorbidity. Whereas prior studies have examined the outcome of endstage renal disease in this population, this analysis sheds light on patterns of renal function over time.42 Although most patients had largely stable but clinically significant renal function impairment, the clustering analysis uncovered a group with rapid decline in renal function. The findings underscore the burden of chronic kidney disease in AAV, provide evidence that clinically distinct renal trajectory phenotypes may exist in AAV, and offer a framework for ongoing work in the realm of personalized care.42
Additional methods. Although a comprehensive review of all clustering methods is beyond the scope of this article, we briefly mention a few additional algorithms here and point readers to additional resources.5,11,23 Density-based clustering algorithms, such as density-based spatial clustering of applications with noise (DBSCAN), work by separating areas of high and low density instead of relying on cluster centroids; these methods can work well when clusters have arbitrary shapes and sizes.44
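A brief DBSCAN sketch on simulated nonspherical ("two moons") data; the eps and min_samples values are illustrative and would need tuning in practice.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two crescent-shaped clusters that centroid-based methods typically struggle to separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # a label of -1 marks noise points
```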
Fuzzy clustering is a “soft” clustering approach, meaning it allows datapoints to be assigned to multiple clusters. Unlike model-based methods, fuzzy clustering does not necessarily assume a specific underlying data distribution. Fuzzy C-means is an extension of K-means clustering that identifies overlapping clusters.45 For some research questions, we may be interested in finding patterns both in patients and in features. Biclustering is a method that simultaneously clusters the rows (eg, patients) and the columns (eg, features) of a dataset. It is often used to analyze gene expression data and, more recently, has been used to examine feature selection in the setting of high-dimensional data.46,47
Practical details
Data types and standardization. Some methods, such as K-means, can be used only on continuous data. Other methods, such as K-medoids, hierarchical clustering, and model-based clustering, can be used on mixed data types. Standardization of continuous variables is recommended, given that the distance measures used for clustering depend on the scale used to record measurements.6 Often, variables are scaled to have a mean of 0 and an SD of 1.48
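A sketch of standardizing continuous features to mean 0 and SD 1 before clustering, using simulated variables on very different scales (the variable choices are invented).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales (eg, age in years and a biomarker concentration)
X = np.column_stack([rng.normal(60, 10, 200), rng.normal(0.002, 0.0005, 200)])

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0 and SD 1
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))
```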
Missing data. Most partitional and hierarchical clustering algorithms require complete data; observations with missing data on any clustering variables must be omitted, or imputation must be conducted prior to clustering. New strategies for handling missing data are also being developed, such as the K-prototypes algorithm, which imputes missing data using values from the corresponding cluster center.49 Some model-based methods can handle missing data, such as full information maximum likelihood in latent class analysis.32,50
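A sketch of simple (single) imputation prior to clustering; in practice, more principled approaches such as multiple imputation (as in the Karmacharya et al example above) are often preferred, and the data here are simulated.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # introduce roughly 10% missing values

X_complete = SimpleImputer(strategy="median").fit_transform(X)  # impute, then cluster
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_complete)
```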
Dimension reduction and variable selection. Undertaking clustering with a large number of variables can be computationally intensive. Including highly correlated input variables can lead to a disproportionate weighting of domains. For example, clustering patients with OA into groups based on 13 features, including 10 measures of cartilage, 2 measures of bone, and 1 measure of inflammation, would lead to clusters largely defined by differences in cartilage. Data reduction or variable selection is therefore sometimes necessary prior to clustering; including superfluous information can lead to problems in model fitting, interpretation, and generalizability.50,51 Variable selection methods filter out both redundant and uninformative variables.4,51 Dimension reduction methods, such as principal component analysis (PCA), are used to reduce the number of input variables. Ultimately, there is no fixed rule for the number of variables to include in a clustering model.
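A sketch of PCA-based dimension reduction prior to K-means, similar in spirit to the workflow in the Angelini et al example described below; the number of retained components, the number of clusters, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))  # eg, 16 (possibly correlated) biomarkers

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(X_scaled)  # in practice, guided by explained variance
pcs = pca.transform(X_scaled)
print(pca.explained_variance_ratio_)     # proportion of variance explained per component

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
```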
Explaining clusters and linking to covariates. We are often interested not only in clustering datapoints into groups but also in understanding or describing the clusters. Spider plots, also called radar plots or star plots, are a way to visualize differences between clusters on multiple variables.52 Variables are standardized so that they are on the same scale, and typically the mean of each variable for each cluster is plotted in the figure’s spokes. Figure 6 is an example of a spider plot for a hypothetical clustering analysis to investigate phenotypes in knee OA using imaging-based biomarkers. The spider plot shows a 3-group solution, with cluster A on average having higher scores on inflammatory markers (effusion-synovitis, Hoffa-synovitis), cluster B having higher scores on meniscal/cartilage variables (meniscal extrusion/morphology, % of denuded bone area, cartilage thickness), and cluster C with higher scores on bone biomarkers (bone marrow lesions, osteophytes).
Spider chart for visualizing structural phenotype clusters. This chart displays the average cluster value for clusters/phenotypes A, B, and C across 8 variables. Each variable has its own axis, and all the axes are joined at the center of the figure. Here, we can see the distinct characteristics of each structural phenotype.
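A matplotlib sketch of a spider plot of standardized cluster means; the variable names and values below are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

variables = ["Effusion-synovitis", "Hoffa-synovitis", "Meniscal extrusion",
             "Cartilage thickness", "Bone marrow lesions", "Osteophytes"]
cluster_means = {  # standardized mean of each variable within each cluster (invented values)
    "Cluster A": [1.2, 1.0, 0.1, -0.2, 0.0, 0.1],
    "Cluster B": [0.0, 0.2, 1.1, 0.9, 0.1, -0.1],
    "Cluster C": [0.1, -0.1, 0.0, 0.2, 1.0, 1.2],
}

angles = np.linspace(0, 2 * np.pi, len(variables), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for name, means in cluster_means.items():
    values = means + means[:1]
    ax.plot(angles, values, label=name)   # one spoke per variable, one line per cluster
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(variables)
ax.legend()
plt.show()
```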
Statistical testing is sometimes used to compare clustering variables or other participant characteristics between clusters. Caution should be taken with this approach, as the statistical power to detect differences depends on the number and size of the clusters, which will not be known a priori.53,54 With model-based methods, a 3-step approach that accounts for uncertainty in group assignment is recommended.43,55
Cluster stability and validation. Clustering methods may find clusters even if none truly exist in the data56; therefore, it is essential to assess cluster stability and validation. Cluster stability refers to the amount of variation in the clustering solution over different samples from the same population.11 Assessing cluster stability can help to evaluate whether the clusters identified in the sample actually correspond to a meaningful clustering of the underlying distribution.57 This may include comparing optimal clustering solutions across different subsamples (eg, see Figure 2 in Liu et al58), or involve indices such as the variation of information criterion.57,59
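One simple, informal stability check (distinct from the indices cited above) is to refit the clustering on resampled data and compare the cluster assignments each solution gives to the full dataset using the adjusted Rand index; the sketch below is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
rng = np.random.default_rng(0)

scores = []
for _ in range(50):
    # Fit K-means on two independent bootstrap samples of the data
    km1 = KMeans(n_clusters=4, n_init=10).fit(X[rng.integers(0, len(X), len(X))])
    km2 = KMeans(n_clusters=4, n_init=10).fit(X[rng.integers(0, len(X), len(X))])
    # Compare the assignments the two solutions give to the full dataset
    scores.append(adjusted_rand_score(km1.predict(X), km2.predict(X)))

print(np.mean(scores))  # values near 1 suggest a stable clustering solution
```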
External cluster validation refers to comparing clustering results to an externally known result, such as a dataset with provided class labels, or to using an independent validation dataset.60,61 External validation with an independent dataset may involve assessing whether the number and size of clusters are similar, and whether the composition of clusters in terms of important features is similar (see Angelini et al2 below for an example).
Sample size. Clustering analysis does not lend itself to conventional sample size calculation and power analysis.62 Although some statistical tests are available to assess the number of clusters or whether any clusters exist, the power of these tests depends on factors that are difficult to know a priori (eg, the number and size of clusters or separation between clusters).33,63 Work has shown that a sample size of 100 provides > 80% power to detect 2 clusters if the smaller cluster has at least 10% prevalence.64 Other work suggests that investigators should aim for sample sizes of 20 to 30 per expected subgroup.62
Clustering from start to finish: an example. Angelini et al sought to identify OA endotypes using biochemical biomarker data.2 Disease heterogeneity has been a major hurdle in developing effective therapies in OA; a better understanding of disease endotypes may provide an opportunity to personalize medicine by targeting treatments to patients according to their disease drivers. Sixteen biochemical biomarkers reflecting joint tissue turnover were measured. First, the biomarkers were log-transformed due to their skewed distribution. The authors used PCA for dimension reduction (see Figure 1 in Angelini et al2) due to high correlation between biomarkers (Spearman ρ > 0.6). K-means clustering was then applied to the principal components. One to 9 clusters were considered (ie, K = 1-9), and the optimal number of clusters was selected based on the combined silhouette score, J-score, and adjusted mutual information score. The authors then sought to explain the clusters and link the cluster assignment to covariates. They used a supervised machine learning approach—random forest classifiers—to assess associations between biochemical biomarkers and clusters, aiming to determine which variables were most important for determining cluster membership. Three endotype clusters were identified: low tissue turnover, structural damage, and systemic inflammation. Finally, external validation was performed by replicating the cluster analysis in a separate cohort, which produced consistent results. The results suggest the existence of different OA phenotypes, and the authors posit that biomarker clustering methods could help inform stratification for OA clinical trials.
Summary
Clustering is a helpful tool for understanding heterogeneity in a population. Given the heterogeneity of rheumatic and musculoskeletal diseases, clustering provides an opportunity to identify distinct groups of patients with similar clinical, demographic, or disease characteristics. Identifying phenotypically distinct subgroups could help inform more specialized treatment plans for patients.
ACKNOWLEDGMENT
The authors would like to thank Dr. Lindsey MacFarlane for critical review of this article.
Footnotes
This work was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS; grant nos. NIAMS K01 AR075879 and NIAMS P30 AR072577).
JEC has received consulting fees from Boston Imaging Core Labs. SC declares no conflicts of interest relevant to this article.
Accepted for publication August 14, 2024.
Copyright © 2024 by the Journal of Rheumatology
This is an Open Access article, which permits use, distribution, and reproduction, without modification, provided the original article is correctly cited and is not used for commercial purposes.