Abstract
Objective. Our objective was to calculate rheumatoid arthritis (RA) point prevalence estimates in the CARTaGENE cohort, as well as to estimate the sensitivity and specificity of our ascertainment approach, using physician billing data. We investigated the effects of using varying observation windows in the Régie de l’assurance maladie du Québec (RAMQ) health services administrative databases, alone or in combination with self-reported diagnoses and drugs.
Methods. We studied subjects enrolled in the CARTaGENE cohort, which recruited 19,995 participants from 4 metropolitan regions in Québec from August 2009 to October 2010. A series of Bayesian latent class models were developed to assess the effects of 3 factors: the number of years of billing data, the addition of self-reported information on RA diagnoses and drugs, and the adjustment for misclassification error.
Results. The 3-year 2010 point prevalence estimate among cohort members aged 40–69 years, using physician billing plus self-report, adjusting for misclassification error in each source, was 0.9% [95% credible interval (CrI) 0.7–1.2] with RAMQ sensitivity of 84.0% (95% CrI 74.0–93.7) and a specificity of 99.8% (95% CrI 99.6–100.0). Our results show variations in the prevalence point estimates related to all 3 factors investigated.
Conclusion. Our study illustrates that using multiple data sources identifies more RA cases and thus yields a higher prevalence estimate. RA point prevalence estimates using billing data are lower if fewer years of data are used.
- BAYESIAN LATENT CLASS MODELS
- PREVALENCE
- QUEBEC
- SELF-REPORT DATA
- CANADIAN PROVINCIAL HEALTH ADMINISTRATIVE DATA
- RHEUMATOID ARTHRITIS
Rheumatoid arthritis (RA) is a type of chronic autoimmune disease, and like most chronic diseases, it is caused by a constellation of potential factors, including environmental and genetic risk factors1. Surveillance data can provide insights into the epidemiology of RA. Additionally, prevalence data derived from surveillance can assist in making future projections and studying geographic variations2. Having unbiased prevalence estimates is essential to improving care and outcomes. In Canada, the provincial government health insurance is nearly universal and administrative databases such as those collected by the Régie de l’assurance maladie du Québec (RAMQ) have been an attractive resource for prevalence studies on RA3. Methods for estimating RA prevalence in these databases rely on physician billing and/or hospitalization International Classification of Diseases (ICD) codes4. Prevalence estimates of RA obtained from administrative health databases have varied depending on several factors such as case definitions5 and the size of the observation window available for analysis in the health administrative database6,7. Any ascertainment method within health administrative databases may miss some true cases and misclassify others.
An additional source of data for RA surveillance is self-reported data collected from large survey databases8,9,10,11. Ascertainment of RA based on the patient’s self-reported data should be done with caution because misclassification is a concern. Supplementing this ascertainment method with medication information such as disease-modifying antirheumatic drugs (DMARD) improved the accuracy of self-reported RA in some studies12. DMARD are the cornerstone of RA treatment and, according to national and international guidelines, all RA patients with active disease should be offered DMARD therapies. However, a small number of patients with RA do not take these drugs (because their RA is in remission, which is relatively rare, or for other reasons), so there could be false-positive and false-negative RA cases using this method as well. What makes the situation more challenging in large population-based surveillance studies is the absence of a gold standard to validate self-report or health administrative data sources.
Few RA prevalence estimates are available in Quebec or even in Canada. One prior study estimated RA prevalence for Quebec, using only physician billing and hospitalization diagnostic codes for the period 1992–2008; this accounted for misclassification error in administrative data3. However, additional studies may be helpful to elucidate the effects of the observation window within health administrative databases, the use of self-reported information, and the adjustment for misclassification error in all ascertainment methods on RA prevalence estimates. This study’s specific objective was to calculate, within 11 different observation windows in physician billing data, 2010 RA point prevalence estimates (unadjusted and adjusted for misclassification error) among the CARTaGENE cohort of adults aged 40–69 years, as well as to estimate the sensitivity and specificity of our ascertainment approach, using administrative data (alone or combined with self-reported data)13.
MATERIALS AND METHODS
Study setting, sources of data, ascertainment of RA cases, and time frame
This study took place in the context of a large established cohort entitled CARTaGENE, which recruited 19,995 participants (aged 40–69 yrs) from August 2009 to October 2010 from 4 metropolitan regions in Québec (Montreal, Sherbrooke, Québec City, and Saguenay, constituting 55.7% of the Quebec population). Participants were randomly selected from the provincial health insurance FIPA files (fichier administratif des inscriptions des personnes assurées), which include the entire population because health insurance coverage in Quebec is universal. Individuals were excluded if they were not registered in the FIPA files (such as the military), resided outside the selected regions in 2009, lived on First Nations reserves or in long-term healthcare facilities, or were in prison. Participants were invited to an interview and completed a self-administered sociodemographic and lifestyle questionnaire as well as an interviewer-administered health questionnaire. The participation rate was 25.6% and varied by region, with the Saguenay region having the highest participation rate (33.9%) and the Montreal northern suburbs having the lowest (21.8% for Laval and 21.2% for the North Shore). Data on demographic and socioeconomic factors, lifestyle habits, mental health, individual and family history of disease, medical care history such as visits to a doctor or a nurse, and current medications were collected8. Further details on the CARTaGENE cohort can be found elsewhere8. The CARTaGENE research cohort has been linked to RAMQ data using patients’ unique provincial health insurance numbers. The RAMQ medical service database has information on physician outpatient visits, including diagnoses coded according to the International Classification of Diseases, 9th revision (ICD-9) during the time interval of data collection.
Our study included all CARTaGENE participants who were interviewed between 2009 and 2010. Individuals with incomplete or missing information concerning RA diagnosis and current DMARD use were excluded. Therefore, our reported estimates may be considered as estimates of 2010 point prevalence for RA, in which the point represents the end of 2010 and the denominators are those individuals enrolled in CARTaGENE by the end of the data collection phase. Survey-based RA cases were defined using the self-reported information on RA diagnosis as well as current use of either conventional DMARD (hydroxychloroquine, sulfasalazine, methotrexate, leflunomide, azathioprine, cyclosporine, gold, and cyclophosphamide) and/or the biologic DMARD (infliximab, adalimumab, etanercept, abatacept, and rituximab). RAMQ-based RA cases were defined using physicians’ claims data according to an algorithm requiring 2 or more RA diagnoses by any physician at least 2 months apart but within a 2-year span, or at least 1 RA diagnosis by a rheumatologist.
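The RAMQ-based case definition can be sketched in code. This is a minimal illustration and not the authors' implementation: the `Claim` record, the function name `is_ramq_ra_case`, and the day thresholds (60 days for "2 months", 730 days for "2 years") are assumptions, and the claims passed in are assumed to be already restricted to RA diagnostic codes.

```python
from datetime import date
from typing import List, NamedTuple

# Assumed day equivalents of the "2 months apart" and "2-year span" rules.
MIN_GAP_DAYS = 60
MAX_SPAN_DAYS = 730


class Claim(NamedTuple):
    """One physician claim carrying an RA diagnosis (hypothetical record)."""
    service_date: date
    by_rheumatologist: bool


def is_ramq_ra_case(claims: List[Claim]) -> bool:
    """Apply the two-branch billing algorithm to one person's RA claims."""
    # Branch 1: at least 1 RA diagnosis by a rheumatologist.
    if any(c.by_rheumatologist for c in claims):
        return True
    # Branch 2: 2 or more RA diagnoses by any physician, at least
    # 2 months apart but within a 2-year span.
    dates = sorted(c.service_date for c in claims)
    for i, first in enumerate(dates):
        for later in dates[i + 1:]:
            gap = (later - first).days
            if MIN_GAP_DAYS <= gap <= MAX_SPAN_DAYS:
                return True
    return False
```

For example, two non-rheumatologist claims dated 2008-01-10 and 2008-06-01 satisfy the second branch, whereas two claims 15 days apart do not.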
RAMQ data were available for our study subjects from January 1, 1998, to December 31, 2010. Eleven successive nested observation windows that ranged from a minimum of 3 years (2008–2010) to a maximum of 13 years (1998–2010) were constructed by adding successively one earlier year to the years under observation (2008–2010; 2007–2010 … 1998–2010). Therefore, all time windows ended in December 2010 and were used to calculate the point prevalence of 2010.
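The construction of the nested windows can be written out explicitly. The sketch below only enumerates the calendar-year ranges described above; the function and constant names are illustrative.

```python
END_YEAR = 2010              # all windows end in December 2010
FIRST_AVAILABLE_YEAR = 1998  # earliest year of RAMQ data
SHORTEST_WINDOW_YEARS = 3    # the 2008-2010 window


def nested_windows(first_year=FIRST_AVAILABLE_YEAR,
                   end_year=END_YEAR,
                   shortest=SHORTEST_WINDOW_YEARS):
    """Return (start_year, end_year) pairs, shortest window first."""
    latest_start = end_year - shortest + 1
    # Step the start year back one year at a time down to first_year.
    return [(start, end_year)
            for start in range(latest_start, first_year - 1, -1)]


windows = nested_windows()
for start, end in windows:
    print(f"{end - start + 1:>2} years: {start}-{end}")
```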
Statistical methods
In our analyses, we considered both the self-reported and physician claims ascertainment methods to be imperfect. In such a case (i.e., in the absence of a gold standard), the true RA status can be thought of as “missing.” If the sensitivity and specificity of the imperfect ascertainment method are known, a latent class analysis can be used to adjust the prevalence for misclassification errors. We used a Bayesian latent class analysis to summarize the existing information about each variable (sensitivity, specificity, and prevalence) in the form of prior distributions. Then, the prior information was updated by the data through Bayes’ theorem to result in posterior distributions of these variables14,15,16,17,18.
More specifically, the number of subjects who are categorized as having RA according to each imperfect ascertainment method is a mix of true-positive and false-positive individuals. The Bayesian latent class model links the observed results of each method to the unobserved truth of RA status using the following formula for the expected number of positive results: (total sample size) × [(prevalence of RA × sensitivity of the ascertainment method) + (1 − prevalence of RA) × (1 − specificity of the ascertainment method)]18.
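As a small numerical illustration of this formula, the sketch below computes the expected number of positive results for given inputs. The function name is illustrative. The example values are the 3-year combined-model point estimates reported in the abstract (prevalence 0.9%, sensitivity 84.0%, specificity 99.8%) and the 19,704 subjects analyzed, used here only as plausible inputs.

```python
def expected_positives(n, prevalence, sensitivity, specificity):
    """Expected count classified as RA: true positives plus false positives."""
    true_positive_rate = prevalence * sensitivity
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    return n * (true_positive_rate + false_positive_rate)


print(expected_positives(19704, 0.009, 0.84, 0.998))
```

Even with a specificity of 99.8%, the false-positive term is not negligible in this example, because nearly all subjects are disease-free.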
Informative prior distributions were used over the sensitivity and specificity of RAMQ based on the subjective opinions of 8 experts in the field as well as on a published validation study of provincial administrative data, which used primary care records as reference standard19. We varied the prior distributions of the sensitivity and specificity of the physician claim ascertainment method ranging from 60% to 90% and 82% to 99%, respectively. Informative prior distributions over the prevalence ranging from 0% to 8% were chosen based on the literature. For the sensitivity and specificity of self-reported data, “uninformative” prior distributions [e.g., β (1,1)] were used. For all variables, a β prior distribution was used18.
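One common way to turn a plausible range into a β prior is moment matching. The sketch below is an assumption, since the text does not state how the authors derived their β parameters: it takes the midpoint of the range as the prior mean and one quarter of the range width as the prior standard deviation, then solves for the two β parameters.

```python
def beta_from_range(lower, upper):
    """Moment-matched Beta(alpha, beta) for a range treated as mean +/- 2 SD.

    Assumption: mean = midpoint, SD = (upper - lower) / 4.
    """
    mean = (lower + upper) / 2
    variance = ((upper - lower) / 4) ** 2
    # For Beta(alpha, beta): variance = mean * (1 - mean) / (alpha + beta + 1)
    total = mean * (1 - mean) / variance - 1
    if total <= 0:
        raise ValueError("range too wide for a Beta distribution with this mean")
    return mean * total, (1 - mean) * total


for label, (lo, hi) in {"sensitivity": (0.60, 0.90),
                        "specificity": (0.82, 0.99)}.items():
    alpha, beta = beta_from_range(lo, hi)
    print(f"{label}: Beta({alpha:.2f}, {beta:.2f})")
```

Under this assumed convention, a 60% to 90% range gives roughly β(24.3, 8.1). Other conventions (for example, matching quantiles exactly) would give somewhat different parameters.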
In a Bayesian latent class model, the likelihood function relating the observed and latent data to the unknown variables for one ascertainment method (i.e., RAMQ) is as follows:
$L(a, b, X, Y \mid \pi, Se, Sp) = [\pi Se]^{X} \, [\pi(1 - Se)]^{Y} \, [(1 - \pi)(1 - Sp)]^{a - X} \, [(1 - \pi) Sp]^{b - Y}$, where “a” and “b” are the observed numbers of individuals with positive and negative results on the ascertainment method (here RA diagnoses in RAMQ), respectively; X and Y are the latent numbers of truly positive individuals among those “a” and “b” individuals, respectively; π is the prevalence of RA; and Se and Sp are the sensitivity and specificity of the ascertainment method, respectively. In the case where RAMQ was combined with self-reported sources, the likelihood contributions of all possible combinations of observed and latent data are provided in Table 1. The likelihood is proportional to the product of each entry in the last column raised to the power of the corresponding entry in the first column of the table.
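The single-method likelihood can be evaluated on the log scale, which avoids numerical underflow with counts of this size. This is a minimal sketch; the function and argument names are illustrative, and X and Y are read as the latent numbers of true positives among the a test-positive and b test-negative individuals, as implied by the exponents.

```python
import math


def log_likelihood(a, b, x, y, prev, se, sp):
    """Log of the single-method latent class likelihood.

    a, b : observed numbers of positive and negative results
    x, y : latent true positives among the a positives and b negatives
    prev, se, sp : prevalence, sensitivity, specificity (all strictly in (0, 1))
    """
    if not (0 <= x <= a and 0 <= y <= b):
        raise ValueError("latent counts must lie within the observed counts")
    return (x * math.log(prev * se)                       # true positives
            + y * math.log(prev * (1 - se))               # false negatives
            + (a - x) * math.log((1 - prev) * (1 - sp))   # false positives
            + (b - y) * math.log((1 - prev) * sp))        # true negatives


print(log_likelihood(a=197, b=19507, x=150, y=30, prev=0.009, se=0.84, sp=0.998))
```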
In the model combining the 3 methods, self-reported RA diagnosis and DMARD use may be dependent even conditional on the true disease status. To address this potential issue, conditional correlations between the 2 CARTaGENE self-reported sources of information, among RA subjects and among non-RA subjects, were incorporated20.
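One widely used way to encode such conditional dependence is to add a covariance term to the joint probabilities of the 2 sources within each true disease class. Whether the authors used exactly this parameterization is an assumption; the sketch below only illustrates the idea, with illustrative names and example values.

```python
def joint_test_probabilities(p1, p2, cov=0.0):
    """Joint result probabilities for 2 sources within one disease class.

    p1, p2 : probability that each source is positive in this class
             (sensitivities among RA subjects, or 1 - specificity among
             non-RA subjects)
    cov    : covariance between the 2 sources within the class;
             cov = 0 recovers conditional independence.
    """
    lower = max(-p1 * p2, -(1 - p1) * (1 - p2))
    upper = min(p1, p2) - p1 * p2
    if not lower <= cov <= upper:
        raise ValueError(f"cov must lie in [{lower:.4f}, {upper:.4f}]")
    return {
        ("+", "+"): p1 * p2 + cov,
        ("+", "-"): p1 * (1 - p2) - cov,
        ("-", "+"): (1 - p1) * p2 - cov,
        ("-", "-"): (1 - p1) * (1 - p2) + cov,
    }


# Hypothetical values: 2 sources with sensitivities 0.8 and 0.6 in RA subjects.
print(joint_test_probabilities(0.8, 0.6, cov=0.05))
```

A positive covariance raises the probability that both sources agree, which is the pattern expected if respondents who report an RA diagnosis are also more likely to report DMARD use.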
The unadjusted (naive) estimates of RA prevalence were based on RAMQ billing codes for each time window in administrative data, i.e., on the number of those diagnosed with RA divided by the total sample size. These unadjusted prevalence estimates were calculated using the Bayesian method for single proportions. An uninformative β prior distribution [e.g., β (1,1)], under which all values are equally likely, was used over the unknown unadjusted prevalence variable. In this case, the posterior prevalence estimates (unadjusted for misclassification error) are expected to be numerically almost identical to those obtained using a frequentist method18 (i.e., dividing the number of those diagnosed with RA using billing codes by the total sample size).
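The Bayesian calculation for a single proportion has a closed form: with a β(1,1) prior and x cases among n subjects, the posterior is β(1 + x, 1 + n − x). The sketch below uses the 3-year RAMQ counts reported in the Results (197 of 19,704) and approximates the credible interval by Monte Carlo with Python's standard library; the function name, number of draws, and seed are arbitrary choices.

```python
import random


def beta_posterior_summary(cases, n, prior_a=1.0, prior_b=1.0,
                           draws=20000, seed=0):
    """Posterior mean and approximate 95% CrI for a single proportion."""
    post_a = prior_a + cases
    post_b = prior_b + (n - cases)
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(post_a, post_b) for _ in range(draws))
    lower = sample[int(0.025 * draws)]        # approximate 2.5th percentile
    upper = sample[int(0.975 * draws) - 1]    # approximate 97.5th percentile
    mean = post_a / (post_a + post_b)         # exact posterior mean
    return mean, lower, upper


mean, lower, upper = beta_posterior_summary(cases=197, n=19704)
print(f"posterior mean {100 * mean:.2f}%, "
      f"95% CrI {100 * lower:.2f}-{100 * upper:.2f}%")
print(f"frequentist proportion {100 * 197 / 19704:.2f}%")
```

With a sample this large, the β(1,1) prior contributes only 2 pseudo-observations, which is why the posterior mean and the simple proportion nearly coincide.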
Posterior estimates for each variable were determined based on a sample from the posterior distribution using Gibbs sampling with the WinBUGS statistical freeware (version 1.4.3, MRC Biostatistics Unit). For each model, a burn-in of 5000 iterations was discarded and a further 30,000 iterations were used for inference21. The mean and 2.5–97.5 percentile values (95% credible intervals; CrI) for each variable were extracted.
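To make the sampling scheme concrete, the following is a minimal Gibbs sampler for the single-method model, written with Python's standard library rather than WinBUGS. It is a sketch under stated assumptions and does not reproduce the authors' analysis: the β prior parameters are hypothetical values chosen only to fall roughly within the ranges described in the text, the starting values and iteration counts are arbitrary, and the helper names are illustrative. Each iteration draws the latent true-positive counts given the current parameters, then draws prevalence, sensitivity, and specificity from their β full conditionals.

```python
import math
import random


def draw_binomial(rng, n, p):
    """Exact Binomial(n, p) draw built from geometric gaps between successes."""
    if n <= 0 or p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    log_q = math.log1p(-p)
    successes, position = 0, 0
    while True:
        u = 1.0 - rng.random()                       # uniform on (0, 1]
        position += int(math.log(u) / log_q) + 1     # trial index of next success
        if position > n:
            return successes
        successes += 1


def gibbs_latent_class(a, b, prior_prev, prior_se, prior_sp,
                       burn_in=1000, draws=5000, seed=1):
    """Sample (prevalence, sensitivity, specificity) for one imperfect method.

    a, b    : observed numbers of positive and negative results
    prior_* : (alpha, beta) pairs for the Beta priors (hypothetical values)
    """
    rng = random.Random(seed)
    prev, se, sp = 0.005, 0.75, 0.995                # arbitrary starting values
    samples = []
    for iteration in range(burn_in + draws):
        # Latent true positives among observed positives (x) and negatives (y).
        p_pos = prev * se / (prev * se + (1 - prev) * (1 - sp))
        p_neg = prev * (1 - se) / (prev * (1 - se) + (1 - prev) * sp)
        x = draw_binomial(rng, a, p_pos)
        y = draw_binomial(rng, b, p_neg)
        # Beta full conditionals given the completed data.
        prev = rng.betavariate(x + y + prior_prev[0],
                               a + b - x - y + prior_prev[1])
        se = rng.betavariate(x + prior_se[0], y + prior_se[1])
        sp = rng.betavariate(b - y + prior_sp[0], a - x + prior_sp[1])
        if iteration >= burn_in:
            samples.append((prev, se, sp))
    return samples


if __name__ == "__main__":
    # 3-year RAMQ counts from the Results: 197 positive, 19,507 negative.
    out = gibbs_latent_class(197, 19507,
                             prior_prev=(4, 91),    # hypothetical prior
                             prior_se=(24, 8),      # hypothetical prior
                             prior_sp=(42, 4))      # hypothetical prior
    for name, idx in (("prevalence", 0), ("sensitivity", 1), ("specificity", 2)):
        values = sorted(s[idx] for s in out)
        mean = sum(values) / len(values)
        lo, hi = values[int(0.025 * len(values))], values[int(0.975 * len(values)) - 1]
        print(f"{name}: mean {mean:.4f}, 95% CrI {lo:.4f}-{hi:.4f}")
```

Because a single imperfect method does not identify all 3 parameters from the data alone, the output of such a sampler depends strongly on the priors, so numbers produced with these hypothetical priors should not be compared directly with the published estimates.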
Approval for the study was obtained from the McGill University Ethics Review Board (approval number: A04-M47-12B), from CARTaGENE, and from the Commission d’accès à l’information du Québec (approval number: 100 49 57). Additionally, participants provided written informed consent to publish the material.
RESULTS
The baseline characteristics of the study cohort were evaluated, including age, sex, geographical region, education, and current working status. Just over half of the sample was female, and the overwhelming majority lived in Montreal. The full profile of the participants is presented in Table 2.
Using only self-reported RA diagnosis, without any adjustment for misclassification, the RA prevalence estimate was 2.9% (564 out of 19,704) with 95% CrI 2.6–3.1. The naive estimate from DMARD use was lower at 0.9% (182 out of 19,704) with 95% CrI 0.8–1.1. Adjusting for misclassification error decreased the point prevalence estimate to 1.3% (95% CrI 0.07–3.2) for self-reported RA diagnosis and 0.4% (95% CrI 0.02–1.1) for current DMARD use.
We found 197 RA cases using only 3 years of physician billing, unadjusted for misclassification error. When more years were used, the number of RA cases continued to increase, up to 321 when looking back 13 years.
The unadjusted 2010 RA prevalence point estimate based on 3 years of RAMQ data alone was 1.0% (197 RA cases out of 19,704) with 95% CrI 0.9–1.2. Using 5 years of data, the prevalence point estimate increased by 20%. When using 13 years of RAMQ data, there was a 60% increase in the unadjusted prevalence point estimate (1.6%, 95% CrI 1.5–1.8) compared to the estimate from using 3 years of data (Table 3).
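The relative increases quoted above follow from simple arithmetic on the rounded estimates. The short sketch below reproduces that arithmetic and, for comparison, repeats it on the reported case counts (197 and 321); the function name is illustrative.

```python
def relative_increase(baseline, new):
    """Relative change from baseline to new, as a fraction of baseline."""
    return (new - baseline) / baseline


# Rounded unadjusted prevalence estimates (%), 3-year vs 13-year window.
print(f"from rounded estimates: {100 * relative_increase(1.0, 1.6):.0f}%")

# Same comparison using the reported numbers of RA cases (same denominator).
print(f"from case counts: {100 * relative_increase(197, 321):.0f}%")
```

The two figures differ slightly because the published percentages are rounded to 1 decimal place before the comparison is made.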
Adjusting for misclassification error using the Bayesian latent class model, the RA prevalence point estimate was 0.4% (95% CrI 0.03–1.1) for the shortest observation window. The adjusted estimates showed an increasing trend across time windows but remained lower than the RAMQ-based unadjusted estimates for all observation windows. The CrI around the adjusted point estimates using RAMQ alone were much wider than the CrI around the unadjusted estimates, which is expected because the adjustment accounts for uncertainty about misclassification.
As for the combined RAMQ and self-reported information, the different combinations of the observed data are presented in Supplementary Table 1 (available from the authors on request). For all observation windows, the adjusted point estimates derived from combining RAMQ with self-reported data were lower than the unadjusted estimates and higher than the adjusted estimates using RAMQ alone. When combining administrative and self-reported data, adding more years of administrative data increased the adjusted point estimates (Table 3) in a similar fashion to when administrative data were used alone. The CrI were all overlapping. Figure 1 shows the increasing trends in the point estimates (unadjusted and adjusted, with administrative data alone and then adding self-reported data).
The sensitivity estimates of case ascertainment across varying time windows (with administrative data alone and combined with self-reported data) are shown in Table 4. The sensitivity of case ascertainment using RAMQ data alone was unchanged (78%) for all observation windows. However, complementing the RAMQ billing codes case ascertainment method with self-reported data sources on RA diagnosis and current DMARD use increased the point estimate for sensitivity from 78.1% (95% CrI 58.3–92.6) to 84.0% (95% CrI 74.0–93.7) for the shortest time window. Our estimates of the sensitivity of RAMQ data versus the self-reported data remained relatively steady over time. The specificity of the RAMQ ascertainment method, alone or combined with self-reported data, was high (99%) and stable across all time windows.
DISCUSSION
In this study, a series of Bayesian latent class models were developed to assess the effects of 3 factors (i.e., the length of observation window within administrative data, the inclusion of self-reported information on RA, and adjustment for misclassification error in administrative data) on RA prevalence estimates in the CARTaGENE sample. Our results show variations in the prevalence point estimates related to all 3 factors. There was negligible change in the sensitivity estimates for case ascertainment using administrative data with more years of observation, but a noticeable gain in sensitivity when self-reported information on RA diagnosis and current DMARD use was added to the model. The 3-year 2010 point prevalence estimate among adults aged 40–69 years using the 3 ascertainment methods and adjusting for misclassification error in each method was 0.9% (95% CrI 0.7–1.2).
Previous studies of the effect of increasing years of administrative data on rheumatic diseases prevalence estimates found trends similar to ours (i.e., higher prevalence estimates with more years of data)6,7,22,23,24. However, ours is the only one that adjusted for the imperfect data sources. As evident from our study, the inclusion of self-reported RA data reduced the trend for incomplete ascertainment with few years of administrative data. RA is a dynamic chronic disease, characterized by unpredictable flares and remissions of disease activity25. During periods of remission, patients may not seek medical treatment, at least for RA. Extracting ICD codes for a short observation window in RAMQ may therefore miss some cases, specifically those patients in remission or with mild disease activity who happen not to use health services in the years under observation. Because only 1 diagnostic code is allowed per physician visit in Quebec, RA patients with comorbidities may escape detection based on ICD codes within short observation windows if the code reported by the physician is for a comorbidity and not RA.
Ng, et al studied the effect of the number of years of administrative data observed on estimates of systemic lupus erythematosus (SLE) prevalence and recommended the use of long time windows to avoid underascertainment6. However, using longer observation windows could lead to overestimation of RA prevalence if misclassification error is not accounted for. This highlights the importance of considering both sensitivity and specificity. Moreover, using longer time windows within health administrative databases has some drawbacks when the interest is in more recent prevalence estimates because temporal changes such as diagnostic drift have occurred over time26. For example, the American College of Rheumatology (ACR) criteria for RA diagnosis have changed 3 times in the last 50 years27. The most recent are the 2010 ACR/European League Against Rheumatism classification criteria28. These changes in diagnostic criteria could alter RA prevalence estimates when longer time windows are analyzed.
The sensitivity of case ascertainment using administrative data alone was about 78% and remained steady throughout all time windows in our study. Supplementing administrative data with patient self-reported RA diagnosis and current use of DMARD increased the point estimate for sensitivity to about 85% (although CrI overlapped). This finding may be important for investigators who may have access to only a few years of administrative data, if they have additional sources of information on RA status. The importance of using multiple data sources is corroborated by recommendations from other researchers working on chronic disease surveillance26,29,30. In the absence of other data sources, lengthening the number of years of RAMQ data increases RA prevalence point estimates, but with overlapping CrI across all observation windows.
One potential limitation of our study is the use of current DMARD consumption as an ascertainment method. Prior DMARD use was not available in the data. If ever-use of DMARD had been assessed instead, a better identification of RA cases (i.e., an increase in the sensitivity estimate) would have been likely with the 3 ascertainment methods. Current DMARD use identifies only those with active disease. Although the low sensitivity of this ascertainment method was accounted for in the prior distribution, it is possible that accounting for ever-use of DMARD would have improved the identification of RA cases and further reduced the misclassification error by identifying those who were in remission during the survey.
Additionally, our adjusted results using health administrative data alone were imprecise even with such a large sample size. The difficulty in getting accurate prior information on the sensitivity and specificity can affect the precision of the posterior intervals. However, the precision was improved with additional information on RA status from self-reported data.
In our study, we did not use hospitalization RA codes. The Canadian working group on rheumatic disease definitions for surveillance using administrative data has analyzed billing data with and without hospitalization data, and their consensus (based on analyses from each province) was that hospitalization data do not increase the sensitivity of RA ascertainment.
A strength of our study was the use of a very large cohort of individuals with both self-reported and administrative data on RA. Both data sources were adjusted for misclassification error in the absence of a gold standard, which reflects a real-life challenge because few RA ascertainment approaches are considered 100% accurate. To the authors’ knowledge, this is the first study to date to combine self-reported data and Canadian provincial health administrative data to estimate an adjusted RA prevalence.
Our study illustrates that when using administrative data, RA point prevalence estimates are lower if few years of data are observed, and that multiple data sources can help identify more RA cases.
- Accepted for publication February 13, 2019.