Abstract
Objective. To develop a weighted summary score for the Medsger Disease Severity Scale (DSS) and to compare its measurement properties with those of a summed DSS score and a physician’s global assessment (PGA) of severity score in systemic sclerosis (SSc).
Methods. Data from 875 patients with SSc enrolled in a multisite observational research cohort were extracted from a central database. Item response theory was used to estimate weights for the DSS weighted score. Intraclass correlation coefficients (ICC) and convergent, discriminative, and predictive validity of the 3 summary measures in relation to patient-reported outcomes (PRO) and mortality were compared.
Results. Mean PGA was 2.69 (SD 2.16, range 0–10), mean DSS summed score was 8.60 (SD 4.02, range 0–36), and mean DSS weighted score was 8.11 (SD 4.05, range 0–36). ICC were similar for all 3 measures [PGA 6.9%, 95% credible interval (CrI) 2.1–16.2; DSS summed score 2.5%, 95% CrI 0.4–6.7; DSS weighted score 2.0%, 95% CrI 0.1–5.6]. Convergent and discriminative validity of the 3 measures for PRO were largely similar. In Cox proportional hazards models adjusting for age and sex, the 3 measures had similar predictive ability for mortality (adjusted R² 13.9% for PGA, 12.3% for DSS summed score, and 10.7% for DSS weighted score).
Conclusion. The 3 summary scores appear valid and perform similarly. However, there were some concerns with the weights computed for individual DSS scales, with unexpectedly low weights attributed to lung, heart, and kidney, making the PGA the preferred measure at this time. Further work refining the DSS could improve the measurement properties of the DSS summary scores.
Systemic sclerosis (SSc) is a chronic, heterogeneous multi-system disease. A barrier to the study of SSc has been the difficulty in measuring disease status1,2,3. Disease activity measures the potentially reversible aspects of disease that vary over time3,4,5,6. Disease damage measures the irreversible tissue injury3,4,5,6. Our study focused on measuring disease severity, the total effect of disease on organ function including both reversible and irreversible components5.
Two common measures for severity in SSc are the Scleroderma Disease Severity Scale (DSS) developed by Medsger, et al5,7, and the physician’s global assessment (PGA) of severity. The DSS rates the severity of SSc in 9 organ systems, each scored separately depending on the level of involvement (no, mild, moderate, severe, or endstage). The PGA reflects a physician’s judgment of the subject’s overall disease severity using the visual analog scale or the numerical rating scale (NRS) while considering all information available. In the absence of a gold standard, the DSS and PGA are commonly used to estimate disease status4, and despite not having been extensively validated, are nevertheless believed to be accurate8 and are widely used both in SSc9,10 and in other rheumatic diseases11.
Choosing between the SSc severity measures requires a careful examination of advantages and disadvantages, both practical and numerical. An important limitation of the DSS, specifically acknowledged by the authors of the scale, is that it results in 9 separate scores5. Nonetheless, a simple summed score of the original 9 or modified versions of the DSS scores have been used without validation12,13,14,15.
A summed score requires an assumption that each of the items, for example lung and joint/tendon severity, provide equal amounts of discrimination for disease severity. Weighted alternatives to the summed score drop this assumption while maintaining the simplicity of a single-number summary. Alternatively, the PGA is simple and highly feasible, but inherently incorporates subjective physician opinion, which may inject additional heterogeneity into the measure. We undertook this study to develop a weighted summary score (WSS) for the DSS and compare its measurement properties with those of a PGA and a summed DSS score.
MATERIALS AND METHODS
Study subjects
The Canadian Scleroderma Research Group (CSRG) includes subjects with SSc recruited from 16 centers. Ethics committee approval for the CSRG data collection and study protocols was obtained at McGill University (Montreal, Quebec, Canada) and at all participating study sites. All subjects provided informed written consent to participate. Our study did not require additional ethical approval.
All subjects in the registry must have a diagnosis of SSc confirmed by a rheumatologist, be ≥ 18 years of age, and be fluent in English, French, or Spanish. Over 98% of the cohort meets the 2013 American College of Rheumatology/European League Against Rheumatism classification criteria for SSc16. Subjects have been recruited since 2004 and were seen at baseline and yearly thereafter. The subjects in our study included those whose baseline visit was between September 2004 and February 2013, and who had complete data for both the DSS and PGA. Data were collected for all study instruments and variables at the baseline visit.
Study instruments
Measures of disease severity were the DSS5,7 and the PGA. The DSS assesses disease severity in 9 organ systems: general health, peripheral vascular, skin, joint/tendon, muscle, gastrointestinal (GI) tract, lungs, heart, and kidneys. Each organ is scored separately from 0 to 4 depending on whether there is no, mild, moderate, severe, or endstage involvement. For the purposes of our study, some adaptations were made. The results of any investigation not requested by the physician were considered “normal”5. For the skeletal muscle system, physicians assessed muscle strength in the neck flexors and the right and left, upper and lower proximal extremities using the British Medical Research Council scale17, and the muscle score was calculated as reported previously18. The Health Assessment Questionnaire (HAQ; described below) was used to assess the patient’s use of ambulation aids needed to assign endstage severity for the skeletal muscle system. To score the GI system, in addition to the standard tests (an abnormal esophagram, abnormal esophageal manometry, or abnormal small bowel series), subjects were also given a score of 1 for mild GI disease severity if they reported difficulty swallowing, acid taste in their mouth, choking at night, burning sensation, feeling of being full shortly after eating, or taking gastroprotective or promotility agents. If malabsorption, episodes of pseudo-obstruction, or an abnormal hydrogen breath test were present, a score of 3 for severe GI disease severity was given. To score the heart system, physicians also considered electrocardiogram results, left ventricular ejection fraction values, presence of conduction abnormalities, distended neck veins, and arrhythmias. Full details on these adaptations can be found elsewhere18,19. Study physicians recorded the PGA of disease severity using an 11-point NRS ranging from 0 (no disease) to 10 (very severe disease).
Study variables
Disease duration was measured from the onset of both the first Raynaud and first non-Raynaud disease manifestation to baseline study visit. Subjects were classified into limited (skin involvement of the arms and/or legs distal to elbows or knees, with or without facial involvement) and diffuse (skin involvement of the proximal limbs and/or trunk) cutaneous subsets (lcSSc and dcSSc, respectively)20 according to the maximum extent of skin involvement at any time during their participation in the cohort. SSc sine scleroderma was classified as lcSSc21.
Mortality was assessed at any point in the study period (2004–2013) based on information provided by the physicians, or notice of death.
Function was assessed using the HAQ Disability Index22, with scores ranging from 0 (no disability) to 3 (severe disability), and patient ratings of a series of SSc symptoms from the Scleroderma HAQ (SHAQ)23,24,25,26.
Health-related quality of life (HRQOL) was measured using the Medical Outcomes Study Short Form-36 (SF-36)27. The SF-36 is a self-administered, generic HRQOL questionnaire covering 8 domains. Each domain is scored separately and combined into physical (PCS) and mental component summary (MCS) scores and normalized based on a general population sample.
Similarly to the physicians, subjects were asked to rate the severity of their disease on a scale from 0 to 10, yielding the patient’s global assessment (PtGA) of disease severity.
Statistical analysis
Descriptive statistics summarized the baseline characteristics of the study subjects. Three composite measures of disease severity were compared: the DSS summed score, the DSS WSS, and the PGA. The summed score was calculated by adding the scores of the 9 organ systems for each subject, thus ranging from 0 (lower severity) to 36 (higher severity). Weights for the WSS were obtained using item response theory (IRT) to estimate organ-specific discrimination variables by fitting a generalized partial credit model (GPCM)28 to the 9 organ system subscales of the DSS. For each organ system, the GPCM estimates both the level of severity at which a patient becomes more likely to be placed in a given category than in the one below it, and a discrimination variable that measures the strength of the relationship between the organ system and severity. The WSS, which weights each organ system’s score by the organ system’s discrimination variable, was then calculated and scaled to range from 0 to 36, allowing for direct comparison with the summed score29,30. The score for the PGA was the number from 0 to 10 recorded by the study physician.
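The two DSS summary scores described above can be sketched as follows. This is a minimal illustration, not the authors' code: the organ labels, the example discrimination values, and the normalization (weights rescaled to sum to 9 so that the weighted score spans 0 to 36) are assumptions made for the example, and the numbers are not those of Table 2.

```python
# Organ labels are hypothetical identifiers for the 9 DSS organ systems.
ORGANS = ["general", "peripheral_vascular", "skin", "joint_tendon", "muscle",
          "gi", "lung", "heart", "kidney"]
MAX_ITEM_SCORE = 4  # each organ is scored 0 (no involvement) to 4 (endstage)


def summed_score(scores):
    """Unweighted sum of the 9 organ scores, ranging from 0 to 36."""
    return sum(scores[organ] for organ in ORGANS)


def rescale_weights(discrimination):
    """Rescale discrimination estimates so that the weights sum to 9.

    ASSUMPTION: this normalization is one way to make the weighted score
    range from 0 to 36; the paper does not spell out its exact scaling.
    """
    total = sum(discrimination[organ] for organ in ORGANS)
    return {organ: len(ORGANS) * discrimination[organ] / total for organ in ORGANS}


def weighted_score(scores, weights):
    """Weighted sum of the organ scores using the rescaled weights."""
    return sum(weights[organ] * scores[organ] for organ in ORGANS)


# Hypothetical discrimination estimates, for illustration only (not Table 2).
example_discrimination = {
    "general": 1.0, "peripheral_vascular": 0.6, "skin": 2.4,
    "joint_tendon": 1.0, "muscle": 1.0, "gi": 1.0,
    "lung": 0.7, "heart": 0.7, "kidney": 0.6,
}
example_weights = rescale_weights(example_discrimination)

# Hypothetical subject with moderate skin and mild GI and lung involvement.
example_patient = {organ: 0 for organ in ORGANS}
example_patient.update({"skin": 2, "gi": 1, "lung": 1})

print(summed_score(example_patient), round(weighted_score(example_patient, example_weights), 2))
```

With equal discrimination values the weighted score reduces to the summed score, which is the equal-discrimination assumption the summed score makes implicitly.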
Because the PGA incorporates the physician’s subjective opinion, inter-rater reliability was assessed by the intraclass correlation coefficient (ICC). The ICC, which here represents the proportion of the variability in a measure that is attributable to individual physicians, was computed for each composite measure. Because of the small number of physicians, 31 in total, we used a Bayesian hierarchical model to obtain 95% credible intervals (CrI) for the 3 ICC.
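To make the quantity concrete, the sketch below estimates an ICC as the between-physician variance divided by the total variance. It deliberately swaps in a simple one-way ANOVA (method of moments) estimator rather than the Bayesian hierarchical model used in the study, so it yields a point estimate only and no credible interval. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def icc_oneway(values, physicians):
    """One-way random-effects ICC: between-physician variance / total variance.

    NOTE: method-of-moments estimator, used here only for illustration;
    it is not the Bayesian model described in the text.
    """
    groups = defaultdict(list)
    for value, physician in zip(values, physicians):
        groups[physician].append(value)

    n_total = len(values)
    n_groups = len(groups)
    grand_mean = sum(values) / n_total
    group_means = {p: sum(g) / len(g) for p, g in groups.items()}

    ss_between = sum(len(g) * (group_means[p] - grand_mean) ** 2
                     for p, g in groups.items())
    ss_within = sum((v - group_means[p]) ** 2
                    for p, g in groups.items() for v in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective group size, which allows for unequal numbers of patients.
    n0 = (n_total - sum(len(g) ** 2 for g in groups.values()) / n_total) / (n_groups - 1)

    var_between = max(0.0, (ms_between - ms_within) / n0)  # truncate at zero
    total_var = var_between + ms_within
    return var_between / total_var if total_var > 0 else 0.0


# Hypothetical PGA scores recorded by 3 physicians.
toy_scores = [2, 3, 1, 4, 5, 3, 2, 2, 6]
toy_physicians = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
print(round(icc_oneway(toy_scores, toy_physicians), 3))
```

An ICC near 0 means physicians contribute little to the spread of scores; an ICC near 1 means most of the spread is between physicians.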
The convergent and discriminative construct validity of each composite measure was assessed. For convergent validity, correlations were computed to compare associations of the study instruments with patient-reported outcomes (PRO). Both nonparametric Kendall tau and Spearman rank correlation were used to account for their differing emphases.
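The two rank correlations can be sketched in a few lines. This is an illustrative, standard-library implementation (Spearman ρ as the Pearson correlation of average ranks, and Kendall τ in its tie-adjusted τ-b form); the choice of τ-b and the toy data are assumptions, since the text does not say which variant was computed.

```python
from bisect import bisect_left, bisect_right
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [(bisect_left(ordered, v) + bisect_right(ordered, v) + 1) / 2
            for v in values]


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed data.

    Assumes neither variable is constant (otherwise the denominator is 0).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall tau-b: (concordant - discordant) pairs with a tie adjustment."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + tied_x_only) * (untied + tied_y_only))


# Hypothetical severity scores and patient-reported outcome values.
severity = [2, 5, 5, 8, 11, 14]
pro = [0.3, 0.5, 1.1, 0.9, 1.8, 2.0]
print(round(spearman_rho(severity, pro), 3), round(kendall_tau_b(severity, pro), 3))
```

Spearman ρ weights large rank disagreements more heavily, whereas Kendall τ counts concordant and discordant pairs equally, which is why the two can differ on the same data.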
For discriminative validity, dichotomous subsets were constructed to identify subjects with less and more severe disease based on the median values of various PRO. The mean disease severity scores of each subset were computed and the differences in these means were tested using the Wilcoxon rank-sum test.
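The median split and rank-sum comparison can be sketched as below. The sketch assumes that subjects at the median are placed in the lower group and uses the normal approximation with a tie correction and no continuity correction; neither detail is specified in the text, and the function names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
from math import erfc, sqrt
from statistics import mean, median


def median_split(pro_values, severity_scores):
    """Split severity scores by whether the PRO is at/below or above its median.

    ASSUMPTION: ties at the median go to the lower group.
    """
    cut = median(pro_values)
    low = [s for p, s in zip(pro_values, severity_scores) if p <= cut]
    high = [s for p, s in zip(pro_values, severity_scores) if p > cut]
    return low, high


def rank_sum_test(group_a, group_b):
    """Wilcoxon rank-sum (Mann-Whitney U) test, two-sided normal approximation.

    Returns (U for group_a, z statistic, p value). Assumes the pooled data
    are not all identical (otherwise the variance is 0).
    """
    pooled = list(group_a) + list(group_b)
    ordered = sorted(pooled)
    # Average ranks: tied values share the mean of their 1-based positions.
    ranks = [(bisect_left(ordered, v) + bisect_right(ordered, v) + 1) / 2
             for v in pooled]
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    u_stat = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_stat - n_a * n_b / 2) / sqrt(variance)
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_stat, z, p_value


# Hypothetical PRO values and severity scores for 8 subjects.
pro = [0.2, 0.4, 0.5, 0.9, 1.1, 1.5, 1.8, 2.2]
severity = [3, 5, 4, 6, 9, 8, 12, 11]
low_group, high_group = median_split(pro, severity)
print(mean(low_group), mean(high_group), rank_sum_test(low_group, high_group)[2])
```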
Cox proportional hazards models were fit to assess the extent to which each composite measure was predictive of mortality by estimating the proportional change that can be expected in the hazard related to changes in the composite measure. First, a baseline model was fit, controlling for age and sex. Next, 3 Cox proportional hazards models, each additionally adjusting for 1 of the composite measures, were fit and compared. The relative predictive ability of each measure was assessed through a comparison of R² values. For each model, the proportional hazards assumption was tested using a chi-square test of the scaled Schoenfeld residuals. For each composite measure, subjects were split into 2 groups at the median value, and log-rank tests assessed whether there was a statistically significant difference in mortality between those with low and high values.
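As an illustration of how an R² value can be attached to a Cox model, the sketch below computes a likelihood-based generalized R² from the partial log-likelihoods of a null model and a fitted model. Whether this is the exact statistic reported in the study is an assumption, since the text does not define its R², and the log-likelihood values used are hypothetical.

```python
from math import exp


def generalized_r2(loglik_null, loglik_model, n_obs):
    """Likelihood-based generalized R^2: 1 - exp(-LR / n).

    LR is the likelihood ratio statistic 2 * (loglik_model - loglik_null).
    ASSUMPTION: this definition is illustrative; the study does not state
    which R^2 was used.
    """
    likelihood_ratio = 2.0 * (loglik_model - loglik_null)
    return 1.0 - exp(-likelihood_ratio / n_obs)


# Hypothetical partial log-likelihoods for a cohort of 875 subjects.
r2_baseline = generalized_r2(loglik_null=-700.0, loglik_model=-682.0, n_obs=875)
r2_with_measure = generalized_r2(loglik_null=-700.0, loglik_model=-650.0, n_obs=875)
print(round(r2_baseline, 3), round(r2_with_measure, 3))
```

Comparing the value for the age and sex model with the value for a model that also includes a composite measure shows how much explanatory power the measure adds.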
To assess statistical significance, we applied a posthoc Bonferroni correction factor for each of the 54 independent convergent validity comparisons (p < 0.0009) and each of the 27 independent discriminative validity comparisons (p < 0.002).
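The two thresholds follow from dividing a familywise significance level by the number of comparisons. The short check below assumes a familywise level of 0.05, which is not stated explicitly in the text but is consistent with the reported cutoffs.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


FAMILYWISE_ALPHA = 0.05  # ASSUMPTION: conventional level, not stated in the text

convergent_cutoff = bonferroni_threshold(FAMILYWISE_ALPHA, 54)
discriminative_cutoff = bonferroni_threshold(FAMILYWISE_ALPHA, 27)
print(convergent_cutoff, discriminative_cutoff)
```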
All analyses were done using R version 3.1.131. The GPCM was fit using the ltm package32. The Bayesian models were fit using JAGS and the R2jags packages33,34. The proportional hazards models were fit using the survival package35.
RESULTS
The study included 875 subjects (Table 1). About 86% were women with a mean age of about 55 years. Mean disease duration was 11.1 years since the first non-Raynaud symptom and 14.6 years since the first Raynaud symptom. About 37% of subjects had dcSSc. Disease severity, measured by organ system, was mild to moderate, with the GI tract (mean 1.95, SD 0.81), peripheral vascular system (1.58, SD 1.24), lungs (1.41, SD 1.11), and skin (1.24, SD 0.66) being the most severe.
The discrimination variables and the rescaled weights for the 9 DSS scores estimated using the GPCM are presented in Table 2. The skin scale was most discriminating among subjects, being weighted 2.47× higher in the weighted compared with the summed score. The general system, joint/tendon, GI, and muscle systems had weights about equal to 1, and the peripheral vascular, heart, lung, and kidney scales received weights below 1.
The mean summed score was 8.60 (SD 4.02) and the mean WSS was 8.11 (SD 4.05), both compared with a maximum score of 36. The mean PGA was 2.69 (SD 2.16), compared with a maximum of 10. There were 578 unique sets of scores on the 9 organ subscales. These yielded 26 distinct values of the summed score (0 to 24, and 26) out of a possible range of 0 to 36, and 578 of the possible 875 unique values of the WSS (range 0–25.85). All 11 possible values of the PGA were observed. Figure 1 shows summary plots of the 3 composite measures. As expected, the WSS and summed score were highly correlated (Figure 1A). Nevertheless, there was substantial variation in the center of the distribution (e.g., the WSS for subjects with a summed score of 10 ranged from 5.78 to 13.19), suggesting that the measures would not yield exactly the same ordering of subjects.
Assessing between-physician heterogeneity
The ICC for the WSS was 2.0% (95% Bayesian CrI 0.1–5.6), for the DSS summed score it was 2.5% (95% CrI 0.4–6.7), and for the PGA it was 6.9% (95% CrI 2.1–16.2). Although the ICC for the PGA was the largest, its absolute magnitude was small, indicating that between-physician differences did not account for a substantial part of the variability of the PGA. Therefore, there were no meaningful differences in the subjective contribution of the physician to the 3 measures.
Construct and discriminative validity
Kendall τ and Spearman ρ correlations of all 3 composite measures with the SF-36 PCS, the HAQ, and the PtGA of pain, GI problems, breathing, and severity were statistically significant and moderate in strength (Table 3). However, all correlations with the SF-36 MCS and the Kendall τ between the PGA and Raynaud phenomenon global assessments were weak and not statistically different from 0 under the Bonferroni correction. For each outcome and correlation considered, with the exception of the finger ulcer global assessment, the bootstrap CI for the 3 composite measures overlapped, indicating no difference in the strength of association among the 3 composite measures. All 3 composite measures were able to discriminate between subjects with better or worse scores on all PRO, with the exception of the PGA on the GI problem global assessment (Table 4).
Predictive validity for mortality
Death was observed for 120 patients (13.7%) in the study, with mean time to death of 2.89 years (SD 1.99 yrs). A reference Cox proportional hazards survival model that included age and sex as baseline covariates yielded an R² value of 4.0% and a concordance probability of 0.66. Three further Cox proportional hazards models, each adjusting for 1 composite measure and age and sex, were generated. The model with the PGA had an R² of 13.9%, followed by the model with the DSS summed score (12.3%) and that with the WSS (10.7%). While each composite measure provided some additional explanatory power over the reference model, the differences in explanatory power among the 3 were small. There was insufficient evidence to reject the proportional hazards assumption for any of the 3 models (p > 0.05), so the assumption appeared reasonable. Similarly, mortality differed significantly between subjects with low and high values on each of the measures (p < 0.05), indicating that all 3 composite measures were predictive of mortality. Thus, with respect to predictive validity, the 3 measures were again similar.
Posthoc analysis of the DSS organ scale weights
The unexpectedly low weights of the DSS lung, heart, and kidney scales in the WSS led to some posthoc analyses. First, box plots of the PGA at each DSS skin scale level indicated that the median score of the PGA was visibly different across levels (data not shown). However, for lung, heart, and kidney, the relationship between the DSS and PGA scores was not monotonically increasing (Figures 2A, 2B, and 2C), illustrating that, unlike the skin scale, the lung, heart, and kidney scales had poor discriminatory ability for the latent trait of severity. This provides an explanation, at least in part, for their weights.
Three variables compose the DSS lung scale: systolic pulmonary artery pressure (sPAP), forced vital capacity (FVC), and diffusing capacity of the lungs for carbon monoxide (DLCO). Tables cross-classifying patients indicated significant heterogeneity in these 3 measures for subjects at the same level of the DSS lung scale (data not shown). In addition, box plots of the PGA scores against each of these 3 variables showed that the bottom categories of both sPAP and DLCO and the top categories of the FVC did not provide meaningful discrimination for different values of PGA (Figures 2D, 2E, and 2F), further demonstrating the poor discriminatory ability of the DSS lung scale for severity, as measured by the PGA.
Eighty-seven percent of subjects had normal or mild scores for heart severity and over 96% were normal for kidney severity, indicating a lack of endorsement of the higher categories. Cross-tabulations between each of the DSS organ scales showed that subjects with high scores for kidney and heart did not generally have high scores on the other DSS organ scales (data not shown).
Finally, we performed a sensitivity analysis based on disease subset and disease duration since the first non-Raynaud symptom. When stratifying by disease subset, there were no statistically significant differences in the weights for lcSSc or dcSSc compared with those calculated on all subjects. When stratifying by short (≤ 3 yrs) versus long disease duration, only the weight for the GI system was statistically lower than for all subjects. In comparison, the PGA performed similarly among all subsets of patients.
DISCUSSION
We have shown that the DSS summed score, the WSS, and a PGA of severity each showed moderate levels of convergent and discriminative validity and predictive validity for mortality. Although the PGA had the potential to contain, and did contain, more between-physician heterogeneity than the other 2 measures, the amount of physician-specific heterogeneity relative to the total variability of the measure was small and did not impair the performance of the PGA in terms of construct or predictive validity.
To construct the WSS, a GPCM was used to obtain weights for disease severity, allowing for the weights of the 9 organ scales to be internal to the instrument. Thus, the WSS based on these weights can be used regardless of what other measures it may be compared to. While a multivariable linear regression procedure could have been used to generate weights, any weights obtained would be specifically tuned to a particular outcome and would not necessarily be generalizable. Principal components would not have been an appropriate alternative either because they require continuous outcomes and would not have respected the categorical design of the DSS scales.
The weights for the WSS are obtained from the GPCM using maximum likelihood, providing the best 1-dimensional summary of the 9 DSS organ subscales under the restriction that an increasing latent severity score cannot result in a decreasing expected organ subscale score for any organ30,36. Although disease severity is poorly represented by a unidimensional latent construct, we present the WSS as a more flexible, 1-dimensional alternative to the summed score that removes the naive assumption that all organ systems are equally discriminative of disease severity. Note that even though multidimensional summaries of disease severity might be found to be more useful in other situations, they require larger sample sizes for estimating variables and are more difficult to interpret.
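The monotonicity restriction can be illustrated with the category probabilities of a GPCM for a single organ scale. The sketch below uses the standard GPCM form, in which the probability of category k is proportional to the exponential of the cumulative sum of a(θ − b_v) over the first k thresholds; the discrimination value and thresholds are hypothetical and are not estimates from this study.

```python
from math import exp


def gpcm_probabilities(theta, discrimination, thresholds):
    """Category probabilities P(X = k | theta), k = 0..len(thresholds)."""
    # Cumulative sums of a * (theta - b_v); the term for k = 0 is defined as 0.
    cumulative = [0.0]
    for threshold in thresholds:
        cumulative.append(cumulative[-1] + discrimination * (theta - threshold))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    numerators = [exp(c - peak) for c in cumulative]
    total = sum(numerators)
    return [value / total for value in numerators]


def expected_score(theta, discrimination, thresholds):
    """Expected organ score at latent severity theta."""
    probabilities = gpcm_probabilities(theta, discrimination, thresholds)
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical parameters for one 5-category (0-4) organ scale.
EXAMPLE_DISCRIMINATION = 1.2
EXAMPLE_THRESHOLDS = [-1.0, 0.0, 1.0, 2.0]

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(expected_score(theta, EXAMPLE_DISCRIMINATION, EXAMPLE_THRESHOLDS), 3))
```

With a positive discrimination value the expected score rises with θ, and a larger discrimination value makes it rise more steeply, which is why an organ scale that tracks overall severity closely receives a larger weight. At θ equal to a threshold, the two adjacent categories are equally likely.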
The weights obtained from the IRT models were unexpectedly low for the lung, heart, and kidney systems. These 3 DSS scales did not discriminate well among subjects with higher and lower disease severity (Figure 2). The proposed cutoffs for FVC, DLCO, and sPAP, when examined separately, did not adequately discriminate between different degrees of severity, leading to considerable heterogeneity (Figure 2). The low weights for the heart and kidney scales may have occurred because of deviation from the assumption of unidimensionality of disease severity required by the GPCM. Subjects with extreme scores on these scales did not systematically have high scores on other organ scales. However, the summed score would be susceptible to the same problem, because it also assumes unidimensionality. Alternatively, the low weights may have resulted from low rates of endorsement across the spectrum of severity, from a possible interaction with time not captured in the DSS score, or from subjects with asymptomatic disease obtaining intermediate scores.
Because all 3 measures appear to be valid, it is of interest to consider whether 1 measure should be preferred. The PGA had slightly lower correlations with 3 of the PtGA on the SHAQ than did the DSS summed score and the WSS. This could be because of the way in which the DSS more directly accounts for these symptoms, rather than a shortcoming in the PGA’s ability to measure disease severity. Therefore, because the PGA is the simplest to record and its measurement properties were similar to those of the more complex DSS summed score and the WSS, it appears to be the preferred measure for global disease severity in SSc, particularly if some variables required in the DSS were not collected. However, while the DSS is a cross-sectional measure based on objective criteria, the PGA is inherently subjective and allows the physician to include information about the observed and potential disease trajectory. Because all CSRG investigators are experienced clinicians in SSc, the relative benefit of the PGA over the DSS summed score may be due to the high levels of familiarity with the disease; inexperienced physicians may benefit from using the more objective DSS summed score.
Further work refining the DSS, in particular scoring some rare but severe renal and cardiac manifestations differently, revising the DSS lung scale by using different cutoffs or separating variables that measure different aspects of cardiopulmonary disease (e.g., FVC, DLCO, sPAP), and determining the effect of disease duration on severity could potentially improve the measurement properties of a revised DSS summed score or WSS. Last, further work studying the 3 measures longitudinally could result in additional information regarding their relative use.
The DSS summed score, the WSS, and a PGA of severity using an NRS ranging from 0 to 10 had moderate levels of construct and predictive validity and low levels of between-physician heterogeneity. The PGA is the simplest measure for global disease severity to record and may be the preferred measure for experienced clinicians in SSc at this time.
Acknowledgment
We are grateful for a critical review of the manuscript and the helpful suggestions of Dr. Thomas Medsger.
APPENDIX 1
List of study collaborators. Investigators of the Canadian Scleroderma Research Group: J. Pope, London, Ontario; J. Markland, Saskatoon, Saskatchewan (deceased); D. Robinson, Winnipeg, Manitoba; N. Jones, Edmonton, Alberta; N. Khalidi, Hamilton, Ontario; P. Docherty, Moncton, New Brunswick; E. Kaminska, Calgary, Alberta; A. Masetto, Sherbrooke, Quebec; E. Sutton, Halifax, Nova Scotia; J.P. Mathieu, Montreal, Quebec; S. Ligier, Montreal, Quebec; T. Grodzicky, Montreal, Quebec; S. LeClercq, Calgary, Alberta; C. Thorne, Newmarket, Ontario; G. Gyger, Montreal, Quebec; D. Smith, Ottawa, Ontario; P.R. Fortin, Quebec City, Quebec; M. Larché, Hamilton, Ontario; M. Abu-Hakima, Calgary, Alberta; T.S. Rodriguez-Reyna, Mexico City, Mexico; A.R. Cabral, Mexico City, Mexico; M. Fritzler, Calgary, Alberta.
Footnotes
The Canadian Scleroderma Research Group (CSRG) is funded by the Canadian Institutes of Health Research (CIHR; grant #FRN 83518), the Scleroderma Society of Canada and its provincial Chapters, the Scleroderma Society of Ontario, the Scleroderma Society of Saskatchewan, Sclérodermie Québec, and the Cure Scleroderma Foundation, and receives support from INOVA Diagnostics Inc., Dr. Fooke Laboratorien GmbH, Euroimmun, and Mikrogen GmbH. The CSRG has also received educational grants from Pfizer and Actelion pharmaceuticals. Dr. Hudson is funded by the Fonds de la recherche en Santé du Québec. Alexandra Iliescu was funded by a CIHR Undergraduate research award. Dr. Harel’s work was funded by the CSRG. Dr. Steele’s work was funded in part by the Natural Sciences and Engineering Research Council of Canada.
- Accepted for publication April 21, 2016.