A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn)

Li, Zhiqiang; Zhang, Zhao; He, Zangdong; Tang, Wei; Li, Tao; Zeng, Zhen; He, Lin; Shi, Yongyong

doi:10.1038/cr.2009.33

Download PDF

Letter to the Editor
Published: 17 March 2009

A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn)

Zhiqiang Li^1,2,3^na1,
Zhao Zhang^2,3^na1,
Zangdong He^1,2,3,
Wei Tang^2,3,
Tao Li^1,2,3,
Zhen Zeng^1,2,3,
Lin He^2,3,4 &
…
Yongyong Shi^1,2,3

Cell Research volume 19, pages 519–523 (2009)Cite this article

3915 Accesses
684 Citations
Metrics details

Dear Editor,

Haplotypic information in diploid organisms provides valuable information on human evolutionary history and plays an important role in identifying a candidate gene in the etiology of complex genetic diseases. However, haplotypes of diploid individuals cannot be acquired easily. Molecular haplotyping methods are very costly and have low throughput, and current genotyping and sequencing methods do not provide information on the linkage phase in diploid organisms. The application of statistical methods to infer the haplotype phase in samples of diploid sequences is a very cost-effective approach. Several computational and statistical methods have been developed for haplotype inference, including Clark's algorithm ¹, the Expectation Maximization (EM) algorithm ², and Gibbs sampler ³. Because of its interpretability and stability, the EM algorithm has become one of the most widely used statistical algorithms. However, the standard EM algorithm has several weaknesses, including the inability to handle a large number of markers and convergence to the local optimum. To overcome these problems, various derivative methods have been developed, such as the Partition-Ligation EM (PLEM) algorithm to handle many more linked loci ⁴, the Optimal Step Length EM (OSLEM) algorithm to accelerate the calculations ⁵, and the Stochastic EM (SEM) algorithm to deal with missing genotypic data and to avoid convergence to local maxima ⁶. However, most packages are intended for use with single-nucleotide polymorphism (SNP) data in a biallelic manner.

More and more researchers are analyzing both multiallelic and biallelic markers in the linkage and/or association studies of complex diseases. The analysis of linkage disequilibrium (LD) between multiallelic loci and haplotype inference of many loci (including bi- and multi-allelic markers) present a number of common problems. The major difficulty for the haplotype inference problem is exploring the large number of haplotype pairs that are consistent with the observed genotypes. One solution is to limit the number of candidate haplotypes. Ideally, a condensed haplotype set would make the process of haplotype inference quick and accurate. Our approach is to construct the haplotype space gradually rather than beginning with the set of all possible haplotypes.

SHEsis ⁷ (http://analysis.bio-x.cn) is a robust and user-friendly software platform for the analysis of linkage disequilibrium, haplotype construction, and genetic association at polymorphism loci ⁷. However, the haplotype construction element is based on the standard EM algorithm (the so-called Full-Precise-Iteration algorithm referred to in a previous article), and it therefore comes with the deficiencies mentioned above. In this study, we have developed a new improved algorithm, the PL-CSEM (Partition-Ligation Combination-Subdivision EM), designed for efficient estimation of haplotypes constructed from large numbers of biallelic or multiallelic loci in diploid individuals.

We applied the CS strategy, which deals with the number of alleles for each locus, to construct an optimum set of candidate haplotypes. The essential steps of the CS strategy are as follows: we first combine some alleles to reduce the total for each multiallelic locus (the combination step), and then use the EM algorithm to construct haplotypes with the new alleles, i.e. from the combination step, and subdivide the phase hierarchically, through a bottom-up approach (the subdivision step). For instance, given a pair of multiallelic loci with k and l alleles, respectively, there are k×l possible haplotypes (Figure 1). The standard EM algorithm has to consider all the possible haplotypes, but using the CS strategy, we first combine alleles to make both loci biallelic, and then use the EM algorithm to estimate haplotypes with the new alleles. In the EM step, only four possible haplotypes are taken into consideration. Then, only the haplotypes whose frequencies are greater than a threshold value (e.g. 10⁻⁵) are retained in the next step. In the subdivision step, every allele (except the initial allele) is subdivided into two parts, so the remaining haplotypes are broken up to form the set of candidate haplotypes for the next EM step. The subdivision and EM steps are repeated until all the alleles are returned to the initial state. In the program, the alleles are combined randomly in the combination step. To test whether the random combination is robust, we generated 10 simulated data sets (5 multiallelic loci of 500 individuals). For each data set, we estimated the haplotype frequencies 10 times (they can be combined in different ways), and calculated the performance indexes for comparison. The results show that despite the different combination possibilities, the groups of significant haplotypes are very close, and only those haplotypes with very low frequencies are different. We believe, therefore, that the random method of combination is robust (for details, see Supplementary information, Table S1).

Niu and Qin et al. first implemented the Partition-Ligation (PL) strategy together with Gibbs sampling and the EM algorithm (i.e. PLEM) to estimate haplotype phases for a large number of SNPs ^{4, 8}. The PL strategy is a divide-conquer-combine technique, and it is useful in dealing with a large number of linked loci. We also implemented the PL strategy in our program. It can be described as follows: first break down all of the marker loci into stretches of “atomistic” units, construct haplotypes for each unit and then rebuild the phase hierarchically, through a bottom-up approach. In each unit, haplotypes are generated through the EM algorithm or the CSEM algorithm (for the ones that include multiallelic loci). In addition, we have devised another method for use in the ligation step. Since every unit can be considered as a multiallelic locus, the CS strategy could be useful. If the set of candidate haplotypes space is too large, the CS strategy would be utilized in this step. More than two segments could be combined at each ligation step.

To avoid eliminating haplotypes too readily, in each EM/CSEM step the threshold value for eliminating haplotypes is adaptable. However, a backup-buffering strategy is also available. The user can set the threshold value and the size of a buffer for retaining some partial haplotypes whose estimated frequencies are below the threshold value. Thus, appropriate values can be set to retain more independent haplotypes.

To illustrate the use of our algorithm in practice, we analyzed real data sets of HLA (5 multiallelic loci (15, 32, 14, 13 and 7, respectively) of 420 unrelated individuals) and GH1 (14 biallelic loci and 1 triallelic locus of 154 unrelated individuals). We also derived real SNP data sets (100 replicates of 20 and 40 SNPs of 60 individuals) from the International HapMap Project and generated simulated data sets (100 replicates of 20 and 40 SNPs of 60 individuals, and 20, 40, 80, and 160 SNPs of 500 and 1000 individuals) using a coalescence model ⁹. We evaluated the performance of our method, compared with PLEM. The PLEM program was downloaded from J. S. Liu's web site. The parameters of the program were set up as recommended. We computed the I_H and I_F scores previously used by Excoffier and Slatkin ², and the average error rates based on either individual phase call (INDI) or the proportion of incorrectly inferred loci (LOCI) ⁴. I_H is an index of performance in terms of haplotype identification, and its value can vary between 1, when the identified haplotypes are exactly those present in the true sample, and 0, when none of the true haplotypes has been identified. I_F is defined as one minus half the sum of absolute differences between estimated and true haplotype frequencies, and it varies from 0 to 1. The more accurate the estimation, the closer to 1 the I_F value. In addition, we recorded the running time (RT), using Inter (R) Xeon(TM) CPU 2.80 GHz, 2.79 GHz, and 2.00 GB RAM on the Microsoft Windows XP operating system with 1 s as the time unit.

Tables 1 and 2 show the performance of our program and PLEM for the real data sets and the simulated data sets, with the better performance shown in bold. Because the real phase information of the HLA data were unknown, we could not calculate its I_H, I_F, and error rates. The results of GH1 and HLA for the PLEM program were also unavailable, as the PLEM program cannot handle multiallelic loci. For most of the comparisons, the PL-CSEM program performs better than the PLEM program, obtaining higher IH and IF scores, with lower error rates in terms of both the individual phase calls and the incorrectly inferred loci, as well as requiring less running time.

Table 1 Performance of programs for the real data sets

Full size table

Table 2 Performance of programs for the simulated data sets

Full size table

In summary, the most important improvement in our program, relative to PLEM, is that for large data sets, our system is much less time-consuming and has a lower, or at least comparable, error rate. In addition, our program can deal with loci with dozens of alleles, which is beyond the scope of the PLEM program. The PL-CSEM program, which is integrated into the SHEsis ⁷, an existing web-based platform, is freely available through the internet (http://analysis.bio-x.cn). (A test version is currently available. 〈http://analysis.bio-x.cn/myAnalysis0.php〉) However, reference 7 and this paper should be cited whenever it is used in publication.

( Supplementary Information is linked to the online version of the paper on the Cell Research website.)

References

Clark AG . Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 1990; 7:111–122.
CAS PubMed Google Scholar
Excoffier L, Slatkin M . Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995; 12:921–927.
CAS PubMed Google Scholar
Stephens M, Smith NJ, Donnelly P . A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001; 68:978–989.
Article CAS PubMed PubMed Central Google Scholar
Qin ZS, Niu T, Liu JS . Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 2002; 71:1242–1247.
Article CAS PubMed PubMed Central Google Scholar
Zhang P, Sheng H, Morabia A, Gilliam TC . Optimal step length EM algorithm (OSLEM) for the estimation of haplotype frequency and its application in lipoprotein lipase genotyping. BMC Bioinformatics 2003; 4:3.
Article PubMed PubMed Central Google Scholar
Tregouet DA, Escolano S, Tiret L, Mallet A, Golmard JL . A new algorithm for haplotype-based association analysis: the Stochastic-EM algorithm. Ann Hum Genet 2004; 68:165–177.
Article CAS PubMed Google Scholar
Shi YY, He L . SHEsis, a powerful software platform for analyses of linkage disequilibrium, haplotype construction, and genetic association at polymorphism loci. Cell Res 2005; 15:97–98.
Article CAS PubMed Google Scholar
Niu T, Qin ZS, Xu X, Liu JS . Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002; 70:157–169.
Article CAS PubMed Google Scholar
Hudson RR . Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002; 18:337–338.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported by grants from the Hi-Tech Research and Development Program of China (2006AA02A407), Shanghai Changning Health Bureau Program (2008406002), Shanghai Municipal Health Bureau Program (2008095), Shanghai Program (07DZ22917), Shanghai Leading Academic Discipline Project (B205), National S973 Program (2007CB947303), National Basic Research Program of China (2006CB910601), Shanghai-Unilever Research and Development Fund (06SU07007), and National Key Technology R&D Program (2006BAI05A05). We are grateful to Amy De'ath of the Welsh Transplantation and Immunogenetics Laboratory for providing the Sinhala Data, and Horan et al. of the Institute of Medical Genetics, Cardiff University, for providing the GH1 data set.

Author information

Zhiqiang Li and Zhao Zhang: These two authors contributed equally to this work.

Authors and Affiliations

Changning Mental Health Center, Affiliated Hospital of Bio-X Center, Shanghai Jiao Tong University, 299 Xiehe Road, Shanghai, 200042, China
Zhiqiang Li, Zangdong He, Tao Li, Zhen Zeng & Yongyong Shi
Bio-X Center, Shanghai Jiao Tong University, 1954 Huashan Road, Shanghai, 200030, China
Zhiqiang Li, Zhao Zhang, Zangdong He, Wei Tang, Tao Li, Zhen Zeng, Lin He & Yongyong Shi
Institute for Nutritional Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 294 Taiyuan Road, Shanghai, 200031, China
Zhiqiang Li, Zhao Zhang, Zangdong He, Wei Tang, Tao Li, Zhen Zeng, Lin He & Yongyong Shi
Institutes of Biomedical Sciences, Fudan University, Shanghai, 200032, China
Lin He

Authors

Zhiqiang Li
View author publications
You can also search for this author in PubMed Google Scholar
Zhao Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Zangdong He
View author publications
You can also search for this author in PubMed Google Scholar
Wei Tang
View author publications
You can also search for this author in PubMed Google Scholar
Tao Li
View author publications
You can also search for this author in PubMed Google Scholar
Zhen Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Lin He
View author publications
You can also search for this author in PubMed Google Scholar
Yongyong Shi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Lin He or Yongyong Shi.

Supplementary information

Supplementary information, Table S1

To prove the random combination is robust, we generated 10 simulated data sets (5 multiallelic loci of 500 individiduals). (DOC 88 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Li, Z., Zhang, Z., He, Z. et al. A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn). Cell Res 19, 519–523 (2009). https://doi.org/10.1038/cr.2009.33

Download citation

Published: 17 March 2009
Issue Date: April 2009
DOI: https://doi.org/10.1038/cr.2009.33

This article is cited by

The role of SOD2 and NOS2 genes in the molecular aspect of bladder cancer pathophysiology
- Radosław Grębowski
- Joanna Saluk
- Paulina Wigner-Jeziorska
Scientific Reports (2023)
Association of − 717 A > G (rs2794521) CRP polymorphism with high cardiovascular risk by C-reactive protein in systemic lupus erythematosus patients
- Karen Pesqueda-Cendejas
- Isela Parra-Rojas
- Ulises De la Cruz-Mosso
Clinical Rheumatology (2023)
PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data
- Roeland E. Voorrips
- Giorgio Tumino
BMC Bioinformatics (2022)
Analysis of protein kinase C (HcPKC) gene expression and single-nucleotide polymorphisms related to inner shell color traits in Hyriopsis cumingii
- Mengying Zhang
- Xiajun Chen
- Zhiyi Bai
BMC Genomic Data (2022)
The dopamine transporter gene SLC6A3: multidisease risks
- Maarten E. A. Reith
- Sandhya Kortagere
- Zhicheng Lin
Molecular Psychiatry (2022)

A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn)

Dear Editor,

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Supplementary information

Supplementary information, Table S1

Rights and permissions

About this article

Cite this article

This article is cited by

The role of SOD2 and NOS2 genes in the molecular aspect of bladder cancer pathophysiology

Association of − 717 A > G (rs2794521) CRP polymorphism with high cardiovascular risk by C-reactive protein in systemic lupus erythematosus patients

PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data

Analysis of protein kinase C (HcPKC) gene expression and single-nucleotide polymorphisms related to inner shell color traits in Hyriopsis cumingii

The dopamine transporter gene SLC6A3: multidisease risks

Search

Quick links

Dear Editor,

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Supplementary information

Supplementary information, Table S1

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

The role of SOD2 and NOS2 genes in the molecular aspect of bladder cancer pathophysiology

Association of − 717 A > G (rs2794521) CRP polymorphism with high cardiovascular risk by C-reactive protein in systemic lupus erythematosus patients

PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data

Analysis of protein kinase C (HcPKC) gene expression and single-nucleotide polymorphisms related to inner shell color traits in Hyriopsis cumingii

The dopamine transporter gene SLC6A3: multidisease risks

Search

Quick links