Abstract
Objective.Rheumatoid arthritis (RA) is a complex autoimmune rheumatic disease that is strongly influenced by genetic factors. Numerous genes are convincingly associated with RA, including genes in tumor necrosis factor signaling (TNF) and the nuclear factor-κB pathway. To date, except for genes within the HLA region, no data exist regarding potential copy number variations (CNV) involving RA-associated genes. We set out to identify genes affected by CNV that are associated with RA at a genome-wide level.
Methods.Data from the Wellcome Trust Case Control Consortium (WTCCC) were used in our analyses. The initial WTCCC cohort genotyped 3004 controls and 1999 RA cases using the GeneChip 500k Mapping Array Set. We performed a comparative intensity analysis using the PennCNV algorithm, which uses a hidden Markov model to detect CNV. A total of 2271 controls and 1572 RA samples passed quality control criteria and were included for association analysis. Association analysis was performed in 2 phases: (1) to identify CNV that are < 1 Mb with a population frequency < 5%; and (2) to identify large CNV that are > 1 Mb. Fishers’ exact test was performed to quantify significance of the CNV.
Results.We observed that the genome-wide CNV burden is 2-fold higher in patients with RA compared with controls. We identified 11 rare copy number variable regions with < 5% frequency that had an association with RA that reached a p < 1 × 10−4. These include TNFAIP3 and TNIP1, which has been implicated in association studies for RA, systemic lupus erythematosus, and psoriasis. We identified CNV involving IRF1, which functions as a transcription activator of genes induced by interferons; ALOX5AP and LCP2, involved in inflammatory mediation; B2M, an MHC-class I associated gene; and PRKCH, a gene involved in T cell signaling pathways. A 57 kb deletion with 1% frequency in RA cases at 7p21.3 was also observed. Six of these loci overlap with CNV catalogued in the Database of Genomic Variants.
Conclusion.This is the first study to identify non-HLA RA-associated CNV using genome-wide analyses. Validation and functional significance of these deletions/duplications in RA and other autoimmune diseases need to be further investigated.
Rheumatoid arthritis (RA) is the prototypic seropositive inflammatory arthritis. The heritability of RA is between 50% and 60%, with HLA-DRB1 accounting for 30% of the genetic risk of developing RA1. The RA-associated genes can be broadly characterized as being involved in T cell activation (PTPN22, STAT4, and CTLA4) and the nuclear factor-κB (NF-κB) signaling pathway (CD40, TRAF1, TNFSF14, and TNFAIP3)2.
Recently, structural variation of DNA, namely copy number variation (CNV), has been recognized as important for both normal genomic variability and in disease susceptibility. A CNV is a common type of genomic variation ranging in size from 1 kilobase (kb) to several megabases (Mb) and they cover at least 25% of the human genome3. Among different individuals, a single CNV can have different forms (i.e., duplication, deletion). The variable region that contains these different CNV is known as the copy number variation region (CNVR). CNV have the potential to disrupt genes and are associated with many complex diseases, including psoriasis, systemic lupus erythematosus (SLE), Crohn’s disease, autism, and osteoporosis4,5,6,7,8,9. Thus, it is important to assess the influence of CNV to obtain a thorough understanding of genetic susceptibility in complex disease.
Previous CNV studies in RA have noted association with chemokine ligand 3-like 1 (CCL3L1) and Fc fragment of IgG, low affinity IIIb, receptor (FCGR3B). CCL3L1 binds to several proinflammatory cytokine receptors, including chemokine receptor 5 (CCR5). In a recent New Zealand study, a copy number higher than 2 at the CCL3L1 locus was a risk factor for RA, but this was not replicated in a smaller UK cohort10. Two independent Dutch studies have reported that the 1q23 region, containing FCGR3B, has been associated with CNV that influence susceptibility to RA11,12. However, no such association was noted in the UK cohort1, nor in the recent WTCCC CNV analysis13.
We performed a comparative intensity analysis to detect CNV using the WTCCC single-nucleotide polymorphism (SNP) genotype array data. In addition, we performed a CNV genome-wide association analysis to identify disease-susceptible loci in the WTCCC RA cohort.
MATERIALS AND METHODS
Cohort
The initial cohort data consist of 3004 shared controls and 1999 individuals with RA obtained from the WTCCC data center. The controls were taken from 2 groups: 1504 samples from the 1958 British Birth Cohort (58C) and 1500 additional controls recruited from the UK Blood Services (UKBS). A set of 500,568 SNP were genotyped using Affymetrix GeneChip 500K Mapping Array Set (Affymetrix Inc., Santa Clara, CA, USA)14.
Quality control
We used PennCNV software to detect high-resolution copy number variation. PennCNV uses a hidden Markov model to detect kilobase resolution detection of CNV with low false positives15. Prior to the analysis, the sample inclusion/exclusion criteria were set based on WTCCC SNP analysis. All subjects that passed quality control criteria of the WTCCC original SNP association analysis were included for CNV analysis. Samples were excluded after a check for contamination, false identity, relatedness, and non-Caucasian ancestry. Affymetrix Power Tool, a BRLMM (Bayesian robust linear model with Mahalanobis distance classifier) algorithm, was used to make genotype calls for NSP and STY arrays separately16. The first array uses the NSP I restriction enzyme to genotype 250K SNP while the second array uses STY I restriction enzyme to genotype 250K SNP. To reduce signal-to-noise ratio we used the WTCCC data to model parameters for canonical genotype clustering, instead of using default canonical genotype clustering information provided by PennCNV, which is used to calculate log R ratio (LRR) and B allele frequency (BAF) values. After separate calculations of LRR and BAF, the 2 array sets, NSP and STY, were combined to make CNV detection calls in the autosomes. Many of the genomic samples can have below-optimum genomic wave quality control values; hence, in our sample inclusion criteria we used the default parameters of PennCNV to restrict case or control samples — LRR standard deviation < 0.25, BAF drift > 0.01, and wave factor > 0.05 or < −0.05. A total of 2271 controls (1123 58C and 1148 UKBS) and 1572 RA samples passed all the quality control criteria and were included for association analysis.
After exclusion of samples, we performed quality control on CNV calls. PennCNV tends to show false-positive CNV calls on centromeric and telomeric regions. Due to the complex nature of hemizygosity in sex chromosomes, we excluded all CNV detected from sex chromosomes. In addition, human immunoglobulin coding regions showed false-positive CNV calls using PennCNV, therefore we excluded CNV that overlapped at 50% or more of its length with the following immunoglobulin regions: chr2: 88.9–89.4 Mb, chr14: 21.1–22.0 Mb, chr15: 17.0–21.0 Mb, chr16: 31.8–36.8 Mb, and chr22: 20.7–21.5 Mb. Also, large CNV that overlap with centromeres and telomeres were excluded from the following regions: chr1: 12.0–14.1 Mb, chr10: 46.3–47.1 Mb, chr14: 19.2–19.4 Mb, chr15: 18.4–20.0 Mb, and chr19: 24.2–32.7 Mb (build 35). The boundaries for immunoglobulin, centromere, and telomere regions were obtained from a previous study that used PennCNV17.
Association analysis
Association analysis was performed on the detected CNV that were < 1 Mb, and Fisher’s exact test was performed to quantify the significance of the CNV with exact breakpoint match or within a CNVR. Association analysis was considered to be suggestively significant if (1) p < 1.0 × 10−4; (2) there were at least 15 SNP within the CNV; and (3) the population frequency was < 5%. To avoid possible plate/batch effects, we manually checked the significant CNV samples and ignored any association if a CNV was identified in > 50% of samples from the same plate/batch. We also excluded CNV with strong p values if the LRR standard deviation for that CNV call was between 0.20 and 0.25 to avoid borderline CNV calls. A second association analysis was performed on large CNV that were > 1 Mb. Large CNV are not frequent, but any CNV in fewer than 3 samples were ignored. Fisher’s exact test was performed to quantify the significance of the CNV. The cutoff value for significance at p < 1 × 10−4 was arbitrary, since we are not aware of the total number of CNV in the human genome. However, this level of significance was also used in the recent CNV analysis by WTCCC that introduced the population frequency threshold for rare CNV of < 5%13.
The Database of Genomic Variants (DGV) includes CNV that have been annotated. The detected CNV in our case and control samples were compared with the DGV (Build 35).
RESULTS
Genome-wide CNV distribution
High-resolution SNP obtained from the WTCCC data center had an average call rate of 99.63%, demonstrating high quality intensity data. The log R spread for the controls and cases showed that ∼90% of the samples were within mean standard deviation of 0.15 to 0.22 in controls and 0.13 to 0.20 for the RA cases. After quality control, we identified 5927 CNV in RA cases and 4333 in WTCCC controls that were at least 1 kb in length. There was almost a 2-fold increase in the total CNV burden of 1 kb or more in size among RA patients compared to controls. The excessive burden of CNV compared to controls has also been noted in patients with schizophrenia and autism18,19. However, without experimental validation of CNV it is difficult to exclude the possibility that this may be due to an experimental artefact. Most of these CNV were found in fewer than 3 samples and thus were omitted from the association analysis. Fifty-four percent of the CNV detected in this study overlapped with CNV in the DGV.
Association analysis
Our association analysis revealed significance for the HLA region chr6 30583394–30889981 (p < 5 × 10−7), which overlaps with DGV ID 3600. There were 11 CNV that achieved significance (p < 1 × 10−4), shown in Table 1. The association analysis showed mostly one-sided CNV occurrences (in cases more than in controls). The disease-associated CNV involved numerous genes, including the interferon regulatory factor 1 (IRF1) gene; the tumor necrosis factor (TNF)-induced gene TNFAIP3 and its interacting protein 1 gene TNIP1; 2 autoimmune related genes, lymphocyte cytosolic protein 2 (LCP2) and beta-2-micro-globulin (B2M); 2 genes, protein kinase C-eta (PRKCH) and phosphatidylinositol-3,4,5-trisphosphate-dependent rac exchange factor 1 (PREX1), involved in cell-signaling pathways; 2 known inflammatory mediator genes, arachidonate 5-lipoxygenase-activating protein (ALOX5AP) and lipopoly-saccharide-induced TNF factor (LITAF); and the cell proliferation gene serglycin (SRGN). One CNVR contained deletions located in an intergenic region on chromosome 7p21.3, with the nearest gene, thrombospondin type-1 domain-containing protein 7A (THSD7A), located approximately 50 kb away from the deletion.
The association analysis of large CNV with RA (Table 2) revealed a duplication on chromosome 16p11.2 that was 1.46 Mb (p < 0.001). Duplications and deletions on 16p11.2 have previously been reported and validated in 2 independent studies17,20. However, our data confirmation revealed there was very low probe intensity in this region and therefore this result should be interpreted with caution.
DISCUSSION
Copy number variation may play a significant role in genetic susceptibility and expression of complex autoimmune disease. An example of this is the role of CNV in beta-defensins in Crohn’s disease and psoriasis. Beta-defensins are small antimicrobial peptides that play an important role in the innate immune system. A cluster of beta-defensin genes contain an extensive array of CNV in humans. It has been demonstrated that individuals with Crohn’s disease have significantly lower beta-defensin copy number than that in controls. It is hypothesized that this results in the reduction of the antimicrobial barrier in the gut leading to Crohn’s disease21. Meanwhile, individuals with psoriasis have significantly higher copy numbers than controls. The association analysis suggests 6 or more copies of beta-defensin is correlated with susceptibility to psoriasis. Due to the cytokine-like properties of the beta-defensin this could lead to an inappropriate inflammatory response generating psoriatic lesions after minor injury, infection, or other environmental triggers4.
In our study assessing CNV in RA by analyzing a genome-wide association study, we first report the greater genome-wide burden of CNV in RA patients compared with controls. We found an approximately 2-fold increase in the number of CNV carried by an individual with RA. We also identified 11 rare CNVR, with < 5% frequency, that showed evidence of association with RA. Some of the loci identified encoded genes within the CNVR previously implicated in autoimmune diseases. The 2 most interesting candidates are TNFAIP3 and TNIP1. The products of the TNFAIP3 and TNIP1 genes are A20 and ABIN1/Nafla, respectively. These proteins physically interact with each other to influence the ubiquitin-mediated destruction of IKKg (IkB kinase-g)/NEMO, which is an essential nexus of NF-κB signaling22. A20 also regulates the degradation of several other components of the TNF signaling pathway. Polymorphisms of TNFAIP3 have been associated with RA, SLE, and psoriasis, although the specific variants differ among these autoimmune diseases23,24,25,26. TNFAIP3 knockout mice develop multiorgan inflammation and arthritis23. As well, gene-disruptive CNV in classical Hodgkin’s lymphoma are found in TNFAIP327. This may be of potential interest as there is increased incidence of lymphoma among patients with moderate to severe RA.
Another candidate of potential interest is interferon regulatory factor 1. Type I interferons are a family of cytokines typically produced during viral infection but their multiple immunomodulatory effects are increasingly being recognized, including upregulation of innate immune receptors, polarization of T cells towards a TH1 phenotype, and activation of B cells. Recent studies provide strong evidence for an association between IRF-5 gene variants and RA, particularly for patients with RA who are negative for anti-cyclic citrullinated peptide28. An insertion-deletion polymorphism in the IRF-5 gene has also been shown to confer risk to inflammatory bowel disease29.
The most significant CNV detected in our study overlaps with ALOX5AP, also known as FLAP. Expression of ALOX5AP is regulated by the binding of TNF-α to the promoter of ALOX5AP, thus modulating its gene expression30. In a functional study using collagen-induced arthritis in the DBA/1 mouse, it was shown that the severity of arthritis in FLAP-deficient mice was substantially reduced31. This implies the potential importance of ALOX5AP in arthritis. As well, LITAF showed upregulation in mice with glucose-6-phosphate isomerase (GPI)-induced arthritis32.
CNV involving 2 other genes reported here, PRKCH and PREX1, have also been observed in acute myeloid leukemia and gastric cancer cell line, respectively33,34. PRKCH was reported to be associated with RA in a Japanese population35. LCP2 and B2M are 2 additional autoimmune-related genes (Table 1). LCP2 is important for normal T cell development and B2M is an MHC class I associated gene that is associated with spondylitis36. The CNVR at locus 7p21.3 is of particular interest. First, it contains deletions rather than duplications, and, in general, deletions are known to have higher phenotypic consequences (or penetrance)5. Second, this deletion occurs in at least 1% of RA cases and it is a validated deletion annotated in DGV (ID 3665). Finally, in a recent study, it was shown that CNV can have long-range effects (up to 1 Mb) on gene expression37; therefore, validation of the detected CNVR in the intergenic region near THSD7A on 7p21.3 is warranted.
Variation in copy number of FCGR3B was shown to influence susceptibility to RA in 2 independent studies in a Dutch population35,36. However, in our analysis, no evidence of association was observed due to poor coverage of the regions [FCGR1A (chr1: 1146567361–146577146) and FCGR-A/2A/2B/2C/3A/3B (chr1: 158,288,275–158,382,415)]. Limitations of our study include the issue of multiple testing and possibly under-estimating the number of common CNV due to inaccurate measurement of CNV breakpoints using SNP microarray data. Technological developments in sequencing will allow efficient identification of these breakpoints. Although we do not know the number of CNV in the human genome, we have done an exploratory analysis detecting disease-associated CNV that are p < 1.0 × 10−4. The false-positive error in the detection of duplications is significantly higher in Affymetrix arrays than in Illumina arrays38. Hence, we used a more restricted LRR standard deviation in our analysis than that used in the previously reported Illumina array CNV analysis using PennCNV17. Default parameters were used for the BAF drift and wave factor adjustment. As in a previous study, for the SNP intensity parameter, we have used the criteria where intensities of at least 15 SNP are considered for CNV < 1 Mb and intensities of 50 SNP are considered for CNV > 1 Mb.
In a recent CNV analysis carried out by the WTCCC, there were no associations of 3432 common CNV with common diseases, including RA. That analysis used the same cohort as in our study and a custom-designed Agilent Comparative Genomic Hybridization (CGH) chip for CNV detection, which is much more accurate than using SNP arrays for identifying CNV. However, even with CGH they identified a 15% false-positive rate when detecting duplications. Among the true-positive CNV, 50% had a frequency ≥ 5%, all were > 500 bp in length, and most were tagged by nearby SNP13. Similarly, our analyses also showed no strong associations of common CNV with RA other than those within the HLA region. One possible reason for this observation is that the probes for Affymetrix 500k are not uniformly distributed across the genome (median 2.5 kb space between SNP), hence the poor coverage affects the detection capacity and the frequency of CNV38.
In summary, we are at the early stages of identification of CNV involved in inflammatory rheumatic diseases. Our study identified 11 rare CNVR associated with RA, and these now need to be verified by additional molecular studies. The functional significance of the validated CNV would then need to be elucidated. We suggest that CNV should be routinely analyzed during a genome-wide association study, which may reveal additional alleles with low to modest disease risk associated with RA.
Acknowledgment
We thank Dr. Kai Wang for guidance with the analysis of copy number variation and the Wellcome Trust Case Control Consortium for providing the data.
- Accepted for publication December 21, 2010.