Abstract
Objective. To investigate the operating characteristics of the American College of Rheumatology (ACR) traditional format criteria for Wegener’s granulomatosis (WG), the Sørensen criteria for WG and microscopic polyangiitis (MPA), and the Chapel Hill nomenclature for WG and MPA. Further, to develop and validate improved criteria for distinguishing WG from MPA by an artificial neural network (ANN) and by traditional approaches [classification tree (CT), logistic regression (LR)].
Methods. All criteria were applied to 240 patients with WG and 78 patients with MPA recruited by a multicenter study. To generate new classification criteria (ANN, CT, LR), 23 clinical measurements were assessed. Validation was performed by applying the same approaches to an independent monocenter cohort of 46 patients with WG and 21 patients with MPA.
Results. A total of 70.8% of the patients with WG and 7.7% of the patients with MPA from the multicenter cohort fulfilled the ACR criteria for WG (accuracy 76.1%). The accuracy of the Chapel Hill criteria for WG and MPA was only 35.0% and 55.3% (Sørensen criteria: 67.2% and 92.4%). In contrast, the ANN and CT achieved an accuracy of 94.3%, based on 4 measurements (involvement of nose, sinus, ear, and pulmonary nodules), all associated with WG. LR led to an accuracy of 92.8%. Inclusion of antineutrophil cytoplasmic antibodies did not improve the allocation. Validation of methods resulted in accuracy of 91.0% (ANN and CT) and 88.1% (LR).
Conclusion. The ACR, Sørensen, and Chapel Hill criteria did not reliably separate WG from MPA. In contrast, an appropriately trained ANN and a CT differentiated between these disorders and performed better than LR.
Wegener’s granulomatosis (WG) and microscopic polyangiitis (MPA) are closely related systemic vasculitides (SV). Both are associated with antineutrophil cytoplasmic antibodies (ANCA) and are grouped as ANCA-associated vasculitides. ANCA directed against proteinase 3 (PR3) are more common in WG and those directed against myeloperoxidase (MPO) are associated with MPA, but these associations do not reliably separate WG from MPA1,2,3. The clinical presentation of both diseases may also overlap, as most organ systems can be affected by WG as well as MPA. Although involvement of the ears, nose, and throat (ENT) such as granulomatous rhinitis is a common finding in WG and is observed in only a minority of patients with MPA, it may also be more frequent in MPA than generally estimated: ENT involvement was described in up to 30% of MPA cases in some series4,5,6 and may even represent the first clinical symptom of vasculitis in up to 16% of cases6. Common clinical presentations in these patients with MPA include unspecific rhinitis, sinusitis, and deafness due to inner ear involvement. Bilateral nasal polyps with histological proof of vasculitis have also been described7. Granuloma formation, which is regarded as indicative of WG, is often difficult to demonstrate8,9. Therefore, differentiation between the 2 disorders is not always clear. Although the initial treatment of WG and MPA does not differ, the differentiation of the 2 disorders is important not only academically but also clinically, because granulomas do not respond as well to immunosuppressive agents as purely vasculitic lesions do, and are associated with high relapse rates. Consequently, a refractory course of ANCA-associated vasculitis is observed more commonly in WG than in MPA10,11,12, and new treatment options such as B-cell depletion using rituximab may be promising for the treatment of granulomas12.
Classification and nomenclature of SV are complicated, and various systems have been developed. The American College of Rheumatology (ACR) classification criteria13 and the 1994 Chapel Hill Consensus Conference (CHCC) nomenclature14 are the systems most widely used, although the ACR criteria do not include MPA. The CHCC nomenclature provided names and working definitions on 10 different types of primary SV including MPA, but these definitions were not intended to be used as classification or diagnostic criteria. Nevertheless, they are frequently applied for this purpose under routine clinical conditions and in clinical research6,9, especially in the case of MPA in the absence of alternative criteria15. Because genuine diagnostic criteria for SV are lacking, the CHCC nomenclature — supplemented with surrogate measurements for vasculitis and granuloma formation — has also been transformed into traditional format criteria sets9. However, application of these and modified criteria (Sørensen diagnostic criteria) to unselected cohorts of patients with vasculitis also resulted in the misclassification of both patients with WG and patients with MPA15. Therefore, criteria that reliably differentiate WG from MPA are still lacking.
Artificial neural networks (ANN) have been effectively applied to the classification of SV including giant cell arteritis16 and Churg-Strauss syndrome17. In the latter study, the ANN proved to be superior to the ACR classification criteria.
It was therefore our objective to investigate the operating characteristics of the ACR traditional format criteria for WG, the CHCC nomenclature for WG and MPA (transformed into classification criteria), and the Sørensen diagnostic criteria for WG and MPA, and then to generate and validate improved criteria for clinical differentiation between WG and MPA by using an ANN as a new approach.
MATERIALS AND METHODS
Patients from the multicenter cohort
Within a European collaboration for the standardization of ANCA assays (the EC/BCR Project for ANCA Assay Standardization), 13 referral centers in 10 European countries participated in the collection of patient data, biopsies, and sera1. Within that project, each center had been asked to retrospectively include the last 20 consecutive patients with idiopathic SV seen in that center before the start of the study and prospectively the first 15 consecutive patients with SV after that date. Among other cases, 240 patients with WG and 78 patients with MPA had been recruited. Patients had been selected on clinical and histological criteria only, not on ANCA serology.
Classification of patients within the EC/BCR project
After entry, all patients had been classified based on data retrieved from records and at site visits. As described1, a system for the classification of patients had been designed based on the diagnostic names and definitions adopted by the CHCC14. Patients were classified as having WG if they had histologically proven vasculitis with granuloma and/or giant cells or if they had clinical evidence of at least 1 airway symptom or sign typical for granulomatous lesions of WG such as pulmonary nodules, subglottic stenosis, chronic rhinitis with massive crusting and epistaxis, or proliferative mastoiditis. Demonstration of orbital granuloma by computed tomography or magnetic resonance imaging also led to a classification of WG. Patients were classified as having MPA if they had systemic (extrarenal) manifestations compatible with or histology demonstrating small vessel vasculitis in the absence of airway symptoms typical for granulomatous lesions of WG. Signs and symptoms in the ENT region that could also be attributed to solely vasculitic (and not granulomatous) lesions such as inner ear deafness or unspecific rhinitis were allowed in patients with MPA.
Selection of measurements for classification
Within the EC/BCR project, a total of 79 clinical, immunological, and histological characteristics present between the date of diagnosis and the date of entry into the study had been scored1. Clinical measurements were grouped by organ system. For example, if either nasal crusting or bloody discharge from the nose had been present during the patient’s course and had been regarded as due to WG/MPA, involvement of the nose was scored. Of those 79 measurements, 23 clinical ones, including chest radiograph, serum creatinine, and urinary findings, were assumed to be of possible importance for distinguishing WG from MPA and were considered for the development of new approaches to classification (Table 1). Missing data (0.3%) were replaced as recommended by Lee, et al18. It was further investigated whether the inclusion of ANCA test results would improve the classification. Therefore, the results of indirect immunofluorescence [cytoplasmic ANCA (cANCA) and perinuclear ANCA (pANCA)] and solid-phase assays (Copenhagen ELISA for antibodies against PR3 and MPO)1 were added to the analysis. Because it was our aim to develop instruments that separate WG from MPA on the basis of clinical data, histological measurements were not considered.
Existing models of classification
We investigated how many of the patients met the ACR traditional format criteria for WG13 and specified how often the individual criteria were met (Table 2). Accuracy was defined as the percentage of correctly classified patients, i.e., the number of patients with the disease in question who met the criteria plus the number of patients with the disease to be ruled out who did not meet them, divided by the number of all patients. Sensitivity was defined as the percentage of patients with the disease in question who were recognized by the set of criteria under investigation, and specificity as the percentage of patients in the other group who did not meet the criteria.
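The definitions above can be illustrated with a short calculation. The following is a minimal sketch and not part of the original analysis: the function name `operating_characteristics` is hypothetical, and the counts (170 of 240 patients with WG and 6 of 78 patients with MPA fulfilling the ACR criteria) are back-calculated from the percentages reported in the Results, so they are approximations.

```python
def operating_characteristics(true_pos, n_disease, false_pos, n_other):
    """Return (sensitivity, specificity, accuracy) as percentages.

    true_pos  : patients with the disease in question who meet the criteria
    n_disease : all patients with the disease in question
    false_pos : patients with the other disease who also meet the criteria
    n_other   : all patients with the other disease
    """
    true_neg = n_other - false_pos
    sensitivity = 100.0 * true_pos / n_disease
    specificity = 100.0 * true_neg / n_other
    accuracy = 100.0 * (true_pos + true_neg) / (n_disease + n_other)
    return sensitivity, specificity, accuracy


# Assumed counts, back-calculated from 70.8% of 240 and 7.7% of 78.
sens, spec, acc = operating_characteristics(170, 240, 6, 78)
print(f"sensitivity {sens:.1f}%, specificity {spec:.1f}%, accuracy {acc:.1f}%")
# prints: sensitivity 70.8%, specificity 92.3%, accuracy 76.1%
```

Because accuracy pools both patient groups, it is weighted toward the larger group (here WG).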
CHCC definitions of WG and MPA
In accordance with other investigators9, we tested the usefulness of the CHCC definitions (Table 3), supplemented with surrogate measurements for granuloma formation, glomerulonephritis, and small vessel vasculitis, for the classification of WG and MPA. These surrogate measurements9 had been adapted from the scoring system of the Birmingham Vasculitis Activity Score19. The association of WG and MPA with ANCA and certain clinical manifestations such as necrotizing glomerulonephritis, mentioned as “common” by the authors of the CHCC definitions, were listed as optional criteria. Their presence was not required for the assignment of a patient to the disease in question.
Sørensen diagnostic criteria for WG and MPA
Sørensen, et al proposed diagnostic criteria for WG and MPA (Table 4) based on the CHCC definitions9. They also allowed surrogate measurements for granulomatous inflammation and glomerulonephritis to replace histology9. Concerning WG, eosinophilia of the peripheral blood and tissues (of an undefined extent) was an exclusion criterion. However, a moderate elevation of the eosinophil count has been recognized repeatedly in WG17,20 and was shown to result in the misclassification of patients with WG using the Sørensen criteria15. Therefore, these criteria have been modified by Lane, et al15, allowing eosinophilia of up to 1500 eosinophils/μl to be present in WG. We examined the ability of the original criteria proposed by Sørensen, et al9 and the modified criteria15 to distinguish WG from MPA (Table 4).
Newly developed models of classification
A prototypical software tool called approximation and classification of medical data (ACMD)21 was used to train the network. ACMD uses various strategies to improve the training of self-learning ANN, e.g., early stopping to avoid overfitting the network22 and ensembles to improve robustness and generalization performance23. Adaptive propagation24 was used as the learning algorithm, a further development of the back-propagation algorithm25. To control for the so-called peaking phenomenon26, feature selection was performed by the neural net clamping technique27.
Figure 1 shows the ANN structure used in our study. During training, datasets with a known outcome were entered at the input neurons (input variables, either continuous or categorical) and at the output neuron (binary output variables, namely WG or MPA). After starting the network, the input data were processed in the hidden and output layers, resulting in a number between 0 and 1 at the output neuron, representing assignment to WG or MPA. The activity of the output neuron depended on the inputs and the weights at the connections. The key feature of ANN is that the weights at the connections are “learned” during training of the network. “Experience” in the trained network is stored in these interconnection weights28. The system begins with random weights at the connections between the neurons. The software compares the network output with the actual outcome and calculates an error value. The ANN attempts to minimize the error by adjusting the weights at the connections according to a learning algorithm28. This process is repeated a predefined number of times during the training phase. At the end of the learning process, the optimum weight factors are fixed. In the user phase, data from cases not previously interpreted by the network are entered, and an output is calculated based on the now-fixed weight factors29.
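The training cycle described above (random starting weights, comparison of output with the known outcome, weight adjustment, repetition for a predefined number of rounds) can be sketched in a few lines. This is an illustration only and makes several assumptions: it uses plain back-propagation rather than the adaptive propagation algorithm of ACMD, it omits early stopping and ensembles, the class `TinyNetwork`, the choice of 3 hidden neurons, and the learning rate are hypothetical, and the training data are synthetic toy cases, not patient data.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """One hidden layer, one sigmoid output neuron, plain back-propagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Training starts from random weights; the last entry of each row is a bias.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, inputs):
        x = list(inputs) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h + [1.0])))
        return h, y

    def predict(self, inputs):
        return self.forward(inputs)[1]

    def train_step(self, inputs, target, rate=0.5):
        x = list(inputs) + [1.0]
        h, y = self.forward(inputs)
        # Error signal at the output neuron (squared-error loss).
        delta_out = (y - target) * y * (1.0 - y)
        # Propagate the error back to the hidden neurons before changing weights.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, v in enumerate(h + [1.0]):
            self.output[j] -= rate * delta_out * v
        for j, row in enumerate(self.hidden):
            for i, v in enumerate(x):
                row[i] -= rate * delta_hidden[j] * v


def mean_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)


# Synthetic toy data (NOT patient data): 4 binary findings, target 1 if any is present.
data = [([a, b, c, d], 1.0 if (a or b or c or d) else 0.0)
        for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]

net = TinyNetwork(n_inputs=4, n_hidden=3)
error_before = mean_squared_error(net, data)
for _ in range(2000):  # predefined number of training rounds
    for x, t in data:
        net.train_step(x, t)
error_after = mean_squared_error(net, data)
print(f"error before {error_before:.3f}, after {error_after:.3f}")
```

After training, the weights are left unchanged and `net.predict` plays the role of the user phase: it returns a value between 0 and 1 for a case the network has not seen.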
Binary logistic regression (LR) was performed with SPSS V.11.5.1 (SPSS Inc., Chicago, IL, USA). Feature selection was done by backward search using the Wald test. Only default settings were used.
As a tool for generating the classification tree (CT) we used the chi-squared automatic interaction detector, a module of the SPSS analysis software (Answer Tree V.3.1, SPSS Inc.). While building up the CT, the number of levels below the root was restricted to 5, the minimum number of records per main node was set at 10, and the minimum number of records per end node was set at 5. Otherwise, default settings were chosen.
Validation of allocation methods
All 3 methods of allocation (ANN, LR, and CT) were validated by 2 approaches: using the leave-one-out method, and using an independent monocenter cohort of patients with WG and MPA.
In the leave-one-out method, the classification algorithm is established using n – 1 of the n cases, and the omitted case is used for validation. By performing n rounds of classification, every case is used once for validation.
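The leave-one-out procedure can be written generically. The sketch below is an illustration under stated assumptions: the helper names (`leave_one_out_accuracy`, `fit_majority`, `predict_majority`) are hypothetical, the stand-in classifier simply predicts the most frequent label in the training cases, and the 4 toy cases are made up and are not patient data.

```python
def leave_one_out_accuracy(cases, fit, predict):
    """Percentage of cases classified correctly when each case is held out once.

    cases   : list of (features, label) pairs
    fit     : function(training_cases) -> model
    predict : function(model, features) -> label
    """
    correct = 0
    for i, (features, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]  # the n - 1 remaining cases
        model = fit(training)
        if predict(model, features) == label:
            correct += 1
    return 100.0 * correct / len(cases)


# Hypothetical stand-in classifier: predicts the most frequent training label.
def fit_majority(training):
    labels = [label for _, label in training]
    return max(sorted(set(labels)), key=labels.count)


def predict_majority(model, features):
    return model


# Made-up toy cases, not patient data.
toy_cases = [({"nose": 1}, "WG"), ({"nose": 1}, "WG"),
             ({"nose": 0}, "WG"), ({"nose": 0}, "MPA")]
print(leave_one_out_accuracy(toy_cases, fit_majority, predict_majority))  # 75.0
```

Any of the 3 allocation methods could be passed in as `fit` and `predict`; the model is rebuilt in every round so that the held-out case never influences its own classification.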
The independent monocenter cohort consisted of 46 consecutive patients with WG and 21 consecutive patients with MPA from the Mannheim University hospital (academic referral center; nephrology/rheumatology unit). Independently of our study, patients had been assessed by an interdisciplinary team including nephrologists, rheumatologists, ENT, and eye specialists, and had undergone an extensive imaging procedure. The clinical diagnosis (WG or MPA) had been made according to the same guidelines as described for the multicenter cohort.
Statistical methods
The 2-tailed Wilcoxon test was used for numerical variables such as age and creatinine. Otherwise, the chi-squared test was used.
RESULTS
Histology
By definition, granuloma formation or giant cells could be present only in biopsy specimens of patients with WG and could be demonstrated in 27% of the 240 patients with WG from the multicenter cohort (monocenter cohort 33%). Pauciimmune crescentic glomerulonephritis in conjunction with airway symptoms compatible with WG was shown in 49% (54%). Nonrenal vasculitis was demonstrated by histology in 31% (22%). In 10% (24%) of WG cases, no histology was available. By renal biopsy, crescentic glomerulonephritis compatible with MPA was found in 77% of the 78 cases of the multicenter cohort (monocenter cohort 21%), although these findings of course were insufficient to separate these cases from WG. Extrarenal biopsy demonstrated MPA in 29% (0%) of cases. No histology was available in 5% (0%) of the patients with MPA.
The distribution of the affected organs is given in Table 1. Involvement of the nose, sinuses, and ears, as well as pulmonary nodules, was associated with WG in both cohorts (p < 0.0005 in multicenter cohort). In contrast, renal involvement was significantly associated with MPA in the multicenter cohort (p < 0.005) with a similar trend in the monocenter cohort (p = 0.08). Among patients with MPA, serum creatinine was significantly higher in both cohorts as compared with WG (p < 0.01). In the multicenter, but not in the monocenter cohort, cANCA and PR3 ANCA were significantly associated with WG (p < 0.0005). MPO ANCA and pANCA were associated with MPA in both cohorts (p < 0.0005).
Existing classifications
A total of 70.8% of patients with WG fulfilled the ACR traditional format criteria for WG (Table 2, accuracy 76.1%). The criteria “granulomatous inflammation on biopsy” and “abnormal chest radiograph” were least often met. The specificity for WG was 92.3%. False-positive cases (patients with MPA) mostly had oral ulcers in conjunction with a nephritic urinary sediment.
Traditional format criteria of the CHCC definition for WG were met by only 13.3% of the patients with WG and none of the MPA cases (Table 3, accuracy 35.0%). It is noteworthy that granulomatous inflammation of the respiratory tract was histologically proven in only 23.3% of cases. However, replacing the histological proof of both granulomatous inflammation and vasculitis with clinical surrogate measurements increased the sensitivity for WG to 85.8%, with a specificity of 89.7% (accuracy 86.8%). Traditional format criteria for MPA, from the CHCC definition for MPA, had a sensitivity of only 39.7% for MPA, with a specificity of 60.4% (Table 3, accuracy 55.3%). Substitution of histological findings by surrogate measurements increased sensitivity to 89.7%, at the cost of a reduced specificity (10.8%; accuracy 30.2%).
Applying the Sørensen criteria for WG to the cohort of WG cases led to a sensitivity of 59.4%. Because eosinophilia of the tissues or blood (eosinophil count > 500/μl) was present in 31.6% of WG cases, the criterion “lack of eosinophilia” was least often met (Table 4, accuracy 67.2%). Allowing the eosinophil count to be as high as 1500/μl increased the sensitivity to 86.9% (accuracy 86.6%). Positivity for PR3 ANCA, which is part of these criteria, was found in 61% of patients with WG and 33% of patients with MPA and did not help to separate these 2 disorders. Sørensen criteria for MPA had a specificity for MPA of 97.3% and a sensitivity of 74.4% (Table 4, accuracy 92.4%).
Newly developed models of allocation
The ANN was initially trained with 23 clinical measurements, excluding ANCA test results (Table 1). The backward search revealed 4 of these measurements to be relevant for distinguishing between WG and MPA (Figure 1): pulmonary nodules and involvement of nose, sinuses, and ears (all associated with WG). Using these 4 measurements as input neurons and 1 hidden layer, the ANN correctly assigned 230/240 patients with WG (95.8%) and 70/78 patients with MPA (89.7%; accuracy 94.3%, Table 5). Inclusion of the other 19 clinical measurements listed in Table 1 did not further improve the assignment of patients. Validation using the monocenter cohort resulted in the correct assignment of 91% of cases. Involvement of the nose (6 cases), ears (5 cases), and a pulmonary nodule (1 case) in patients with MPA was associated with the incorrect assignment to WG in both cohorts. Including ANCA test results during the training phase of the ANN did not improve the assignment of patients from the multicenter cohort. However, 1 additional patient with MPA (MPO ANCA-positive) with involvement of the ear (deafness of the inner ear due to suspected vasculitis in the absence of otitis media; reversed by glucocorticosteroids) was correctly classified as MPA when the ANCA status had been included during the training phase, with a resulting accuracy of 92.5%.
In logistic regression, feature selection considered 10 measurements significant: male sex, nose, sinus, ear, lungs/bronchi, hematuria, eyes, joints/muscles, large vessels, and pulmonary nodules. Use of these measurements resulted in the correct classification of 92.8% of cases from the multicenter and 88.1% from the monocenter cohort, using the leave-one-out methodology for validation (Table 5). Inclusion of ANCA test results slightly reduced the number of correctly assigned patients to 86.6%.
Allocation of patients by CT employed the same measurements that were also selected by the ANN, i.e., involvement of nose, sinuses, and ears, as well as pulmonary nodules. Presence of any one of these 4 measurements leads to the classification as WG; otherwise, the patient is classified as MPA. Thus, the CT could also be described in the traditional format as a table with 4 criteria. The accuracy obtained was equal to that of the ANN (94.3% for the multicenter and 91.0% for the monocenter cohort; Table 5). Adding ANCA test results to the measurements under consideration did not change the results.
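Expressed as a rule, the CT may be sketched as follows. This is a hedged reading rather than a reproduction of the tree itself: it assumes that the presence of any one of the 4 findings leads to WG, which is consistent with the reported misassignment of patients with MPA who had nose or ear involvement. The function name `classify_by_tree` and the boolean encoding of the findings are hypothetical.

```python
def classify_by_tree(nose, sinus, ear, pulmonary_nodules):
    """Return "WG" if any of the 4 findings is present, otherwise "MPA".

    Each argument is True if the finding was scored as present.
    Assumed reading of the tree: one positive finding is sufficient for WG.
    """
    if nose or sinus or ear or pulmonary_nodules:
        return "WG"
    return "MPA"


print(classify_by_tree(nose=False, sinus=True, ear=False, pulmonary_nodules=False))  # WG
print(classify_by_tree(nose=False, sinus=False, ear=False, pulmonary_nodules=False))  # MPA
```

Under this reading, a patient with MPA and inner ear deafness scored as ear involvement would be allocated to WG, which matches the type of misclassification described for both cohorts.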
DISCUSSION
Our study shows that an ANN can distinguish between WG and MPA with an accuracy of over 90% based solely on clinical data. An identical accuracy was achieved by a CT approach. These newly developed models were superior to established instruments (ACR traditional format criteria for WG, CHCC nomenclature for WG and MPA used as classification criteria, and Sørensen diagnostic criteria).
The differentiation between WG and MPA based on clinical data is not easy. First, ENT involvement is commonly regarded as a hallmark of WG. However, it was observed in 13.9% of the MPA cases described here. Moreover, studies addressing the frequency of ENT lesions in different patients with SV described ENT involvement in up to 30% of patients with MPA4,5,6. Second, formation of granulomata and giant cells is restricted to WG. But proof of these lesions by histology is often difficult to obtain and is commonly available in ≤ 50% of cases9,17 (28% of the patients described here). In a prospective study performed under routine conditions, only 6 of 25 patients with newly diagnosed WG had biopsy-proven disease8. This illustrates that classification criteria that depend solely on histology are difficult to work with in clinical practice. That is why the classification procedures presented here do without histology. Accordingly, criteria for WG/MPA delineated from the CHCC nomenclature were met by only 13.3% of the patients with WG and 39.7% of patients with MPA of the multicenter cohort, frequencies that are in line with previous studies9. Also, the ACR traditional format criteria for WG depend partially on histology, and when applied to the multicenter cohort of patients led to a sensitivity of only 70.8%. This is in line with Rao, et al, who found the positive predictive value of the ACR classification criteria to be as low as 29%30.
Surrogate measurements that replace the histological proof of granulomata, giant cells, and vasculitis by clinical measurements have been used by various investigators. We demonstrated that the use of surrogate measurements increased the accuracy of the CHCC criteria for WG from 35.0% to 86.8% and contributed to the accuracy of the Sørensen criteria. However, the performance of the CHCC criteria for MPA did not improve with surrogate measurements (accuracy 55.3% without and 30.2% with surrogate measurements). These criteria did not separate WG from MPA, which suggests that they are not of value in classification. On the other hand, it should be stressed that the use of surrogate measurements should not replace the search for appropriate histological confirmation, as biopsy material is essential to rule out other nonvasculitic conditions such as neoplasia and infection.
Our second objective was to develop new instruments that better distinguish between WG and MPA on clinical grounds. The rarity of ENT involvement in MPA, which was certainly influenced by the definition of MPA and might underestimate the true frequency of ENT involvement in this disease, made the ENT-associated measurements the most important discriminators, together with the presence or absence of pulmonary nodules on the chest radiograph. Both the ANN and CT relied on these measurements and led to a better separation between WG and MPA than all previously established methods of assignment (accuracy 94.3% in training and 91.0% in validation cohort). One may argue that accuracy is not the appropriate measurement when data are not balanced (WG: n = 240; MPA: n = 78). However, this imbalance reflects the real a priori distribution of WG:MPA of about 3:1 (ref. 31), with an even more pronounced predominance of WG in northern latitudes. The results demonstrate that WG and MPA can be distinguished solely by clinical measurements in the vast majority of cases. Only 18 of 318 patients from the multicenter cohort (monocenter validation cohort 6 of 67 patients) were misclassified. Besides ear and nose manifestations, there were further very selective measurements such as an orbital pseudotumor (seen only in WG) and tracheal involvement that strongly favored the diagnosis of WG. However, these measurements were associated with a low sensitivity and were used by neither the CT nor the ANN.
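The concern about unbalanced data can be made concrete with a small calculation. The sketch below is not part of the original analysis: balanced accuracy (the mean of sensitivity and specificity) is a standard measure that was not reported in this study, the function names are hypothetical, and the "always WG" rule is an artificial comparator introduced only for illustration. The ANN counts (230 of 240 WG and 70 of 78 MPA correctly assigned) are taken from the Results.

```python
def accuracy(sens, spec, n_pos, n_neg):
    """Overall accuracy (%) from sensitivity and specificity given as fractions."""
    return 100.0 * (sens * n_pos + spec * n_neg) / (n_pos + n_neg)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity (%), independent of group sizes."""
    return 100.0 * (sens + spec) / 2.0


N_WG, N_MPA = 240, 78
ann = (230 / 240, 70 / 78)   # sensitivity and specificity for WG, from the Results
always_wg = (1.0, 0.0)       # artificial rule: every patient is labeled WG

for name, (sens, spec) in (("ANN", ann), ("always WG", always_wg)):
    print(f"{name}: accuracy {accuracy(sens, spec, N_WG, N_MPA):.1f}%, "
          f"balanced accuracy {balanced_accuracy(sens, spec):.1f}%")
```

Under these assumptions the ANN retains a balanced accuracy above 90%, whereas a rule that labels every patient as WG reaches about 75% accuracy but only 50% balanced accuracy, which suggests that the reported accuracy is not merely a product of the 3:1 group sizes.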
Within the multicenter cohort, no differences could be detected in the sensitivity and specificity dependent on whether data on the patients were collected retrospectively or prospectively. This is probably because ENT involvement is usually present very early during the course of WG.
Although PR3 ANCA are more common in WG than in MPA1,2,3, the inclusion of ANCA test results did not improve allocation within the multicenter cohort, and only 1 additional case of MPA from the validation cohort was correctly allocated by the ANN. This made the ANN the most accurate model of classification for the validation cohort. The good performance of the ANN is in line with previous reports that demonstrated a superiority of appropriately trained ANN compared to conventional methods in the classification of SV16,17. Further, multiple other applications underline that ANN are promising tools to address clinical problems such as prognosis estimation and risk assessment. Examples include the prediction of survival and complications after percutaneous endoscopic gastrostomy32, the prediction of outcome in subdural hematoma33, and the classification of schizophrenic patients34, where the ANN proved to be superior to established models.
The advantage of ANN is that they reveal nonlinear relationships and have the ability to analyze the interaction between many variables at different levels. For example, there is evidence that the superiority of ANN is influenced by their ability to adjust the importance of certain measurements depending on the presence or absence of other variables35. Usually, the ANN software allows identification of the specific input variables that have most value in terms of predictive accuracy. However, these should not be viewed as independent predictive variables as perceived by a clinician. Because ANN process data in a nonlinear way, the network logic of classification cannot be broken down into simple elements of clinical reasoning35.
There are limitations to the ANN-based model and the CT/LR presented here. First, the number of possible outcomes was limited, as the approach was restricted to only 2 different SV. More meaningful and probably more complex models would have arisen if further vasculitic disorders had been incorporated into the analysis. But within the EC/BCR study, the number of datasets from these diseases was far too small to train an ANN or to develop other reliable models of allocation. Nevertheless, because explicit classification criteria for the differentiation of WG and MPA have so far not been developed, the results presented here may help to separate these 2 disorders. Because WG and MPA also differ from each other with respect to their further course, for example in terms of relapse rate36, a correct allocation will also have prognostic implications. Second, the ANN has not been tested against nonvasculitic disorders. This approach would help to develop diagnostic criteria that are urgently awaited to separate SV from other disorders that mimic vasculitis. In contrast to the separation of WG and MPA, the ANCA test might be very important in this setting, provided the pretest probability for the presence of vasculitis is rather high. To rule out or confirm vasculitis, histology will also remain important. Models that solve these diagnostic problems may be of higher value to the clinician than models for classification. However, datasets from vasculitis and control patients that can be used to develop diagnostic criteria are not currently available, to our knowledge.
Established criteria such as the ACR traditional format criteria for WG, the Sørensen criteria for WG and MPA, and criteria of the CHCC nomenclature did not reliably separate WG from MPA. In contrast, both a newly formulated and easy to use CT and an appropriately trained ANN — based on clinical data and not on histology — correctly assigned the majority of patients, and were associated with an accuracy of 91% when validated by application to an independent cohort. Both methods were superior to LR. The addition of ANCA test results to the criteria under consideration slightly improved the performance of the ANN, but not of the CT and LR. ANN offer exciting prospects for many applications in clinical medicine and warrant prospective testing in the classification of vasculitides.
Footnotes
- Dr. van der Woude is deceased.
- Accepted for publication December 22, 2010.