Potential regulatory SNPs in promoters of human genes: A systematic approach

https://doi.org/10.1016/j.mcp.2006.03.007

Abstract

Single nucleotide polymorphisms (SNPs) can significantly affect the cellular level of the mRNA transcripts encoded by human disease-related genes. DNA variations between individuals can indicate predisposition to disease or affect the response to treatment. An algorithm allowing in silico extraction of SNPs with a high probability of influencing the level of gene expression is highly desirable. We performed a whole-genome analysis of SNP markers in regulatory areas of human genes. Computational criteria were applied to predict the influence of a nucleotide replacement on the expression of an individual gene. We compiled a list of 14127 regulatory SNPs, corresponding to 8555 regulatory areas, suitable for future association studies. A catalogue of 1859 SNP entries, confirmed by analysis in populations and allocated to 1607 human regulatory areas, was created. We also found 13 cases of overlapping promoters that correspond to human genes transcribed from opposite DNA strands and that contain regulatory SNP markers validated in populations. The population-validated set of regulatory SNP markers is organized in an open-access database available as a Supplementary file and at ftp://194.67.85.195/.

Introduction

Variations in the human DNA sequence between individuals can be an indication of predisposition to disease or affect the response to treatment [1]. Variations represented by Single Nucleotide Polymorphisms (SNPs) are becoming increasingly important tools for genetic and biomedical research. Although the current genomic databases contain information on several million SNPs and are growing at a very fast rate, the true value of a given SNP is not always clear. Many of those SNPs correspond to mere sequencing errors, while others are true neutral nucleotide substitutions. Targeting the correct SNP for large-scale association studies represents a major bottleneck, especially when it comes to SNPs which are located outside of open reading frames of genes.

Nevertheless, many successful studies of SNP association with particular disease phenotypes have been performed, even for SNPs located in promoter areas not included in mRNA. For instance, some regulatory SNP associations have survived meta-analysis: such studies revealed an age-related pattern of risk of Alzheimer's disease associated with the IL-1A (−889) polymorphism [2] and a protective role of the myeloperoxidase (MPO) −463G→A polymorphism in lung cancer [3].

SNPs located in regulatory areas of human genes can significantly contribute to the cellular level of mRNA transcripts. For example, polymorphisms (−491A→T, −427C→T and −219G→T) in the promoter region of the apolipoprotein E (apoE) gene alter the level of its expression in an allele-specific manner [4], [5]. Similarly, the −330G allele of the IL2 gene showed two-fold higher levels of expression than the −330T allele and was associated with multiple sclerosis [6], [7]. The −232G allele of the PCK1 gene, which encodes phosphoenolpyruvate carboxykinase, showed significantly increased basal expression with no down-regulation by insulin and was associated with type 2 diabetes mellitus [8]. Regulatory polymorphisms in disease-related genes offer an obvious advantage, as the therapeutic manipulation of regulatory mutations should conceivably be easier than repairing or modulating the effects of an abnormal protein [9].

Unfortunately, a promoter region, broadly defined as a 2 kb sequence located upstream of the transcription start site, may contain a large number of SNPs, some of which may be non-polymorphic in the study population, while others may fail to demonstrate a regulatory role. The search for relevant polymorphic candidates faces significant obstacles, due both to the high number of potentially promising SNPs and to the intrinsic difficulties associated with identification of weak gene–disease interactions [10]. At present, extensive case-control studies can be applied only to a limited number of gene polymorphisms because of their high cost. Choosing, on the basis of gene regulation, the SNPs that deserve an exhaustive cohort analysis is therefore of primary importance.
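The broad promoter definition used above (2 kb upstream of the transcription start site) can be illustrated with a minimal Python sketch. It is not the authors' software: the Gene record, the function names and the 1-based inclusive coordinate convention are all assumptions made for the example. The only point it demonstrates is that "upstream" depends on the strand of the gene.

```python
from dataclasses import dataclass

# 2 kb upstream window, following the broad promoter definition in the text.
PROMOTER_LENGTH = 2000


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based (an assumption)."""
    name: str
    tss: int     # position of the transcription start site
    strand: str  # "+" or "-"


def promoter_window(gene, length=PROMOTER_LENGTH):
    """Return the inclusive (start, end) interval lying upstream of the TSS."""
    if gene.strand == "+":
        # Upstream means smaller coordinates on the forward strand.
        return max(1, gene.tss - length), gene.tss - 1
    if gene.strand == "-":
        # Upstream means larger coordinates on the reverse strand.
        return gene.tss + 1, gene.tss + length
    raise ValueError(f"unknown strand: {gene.strand!r}")


def snps_in_promoter(gene, snp_positions, length=PROMOTER_LENGTH):
    """Return the sorted SNP positions that fall inside the promoter window."""
    start, end = promoter_window(gene, length)
    return sorted(p for p in snp_positions if start <= p <= end)
```

Such a positional filter only narrows the candidate list; it says nothing about whether a retained SNP is polymorphic in a given population or has any regulatory effect.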

Lack of functional certainty is prominent even for very well studied disease-associated SNPs such as the TNF-alpha promoter polymorphism −308G/A. This is a major pitfall for case-control genetic studies [11]. On the other hand, association studies depend on linkage disequilibrium (LD) between a causative mutation and its linked marker loci. Thus, even non-regulatory but associated SNPs may serve as helpful leads to a SNP that is truly causative. An observed association of a non-regulatory SNP with disease is a prerequisite for detailed investigation of the variations in the genomic region within the same haplotype. Algorithms similar to those used for primary searches for potential regulatory SNPs can also be applied to secondary searches aimed at revealing truly regulatory SNPs located in the vicinity of a disease-associated SNP previously described experimentally.
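For readers less familiar with LD, one standard way to quantify it between two biallelic loci is the squared correlation r², computed from the haplotype frequency p_AB and the allele frequencies p_A and p_B as D = p_AB − p_A·p_B and r² = D² / (p_A(1 − p_A)·p_B(1 − p_B)). The short Python sketch below implements this textbook formula. The text above does not state which LD measure, if any, was used in this work, so the choice of r² and the function name are assumptions made for illustration.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Squared correlation r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            # r^2 is undefined when either locus is monomorphic.
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
```

A value close to 1 means that the marker is a near-perfect proxy for the other locus, which is the situation in which a non-regulatory marker can point to a nearby causative SNP; a value close to 0 means that the two loci are nearly independent.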

In this paper we describe an algorithm that allows in silico extraction of SNPs with a high probability of influencing the level of gene expression. To perform a whole-genome analysis of SNP markers in regulatory areas of human genes, we created the software SNP_TRAST (SNP Transcription Regulating Area Search Tool) and applied computational criteria for the involvement of a given SNP in gene regulation. This study revealed 14127 first-line candidates for future association studies. A significant subset of these SNP markers, experimentally confirmed to be polymorphic, was organized in an open-access database available at ftp://194.67.85.195/ and in the Supplementary Table.


Methods

We used version 2 of the NCBI assembly 34 available at ftp://ftp.ncbi.nih.gov/genebank/genomes/H_Sapiens (February 2004 data freeze). In this database the human genome was subdivided into contigs from 100 kb to 65 Mb in length. We also used MapView (build#3.2), which contains annotations for 473 contigs that include 3.0×10⁶ ESTs, representing 8.7×10⁶ coding exons (ftp://ftp.ncbi.nih.gov/genomes/H_Sapiens/maps/mapview). The location and verification status of SNP markers were retrieved from dbSNP build

Determination of the cut-off level of matrix change by SNP introduction

Determining how a single nucleotide substitution in a binding site changes the binding affinity of a particular transcription factor (TF) for that site is an important step in predicting whether a given uncharacterized SNP is responsible for an increase or decrease in expression of a gene of interest.
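The idea of a "matrix change" caused by a SNP can be sketched in Python as the difference between the best binding-site matrix score obtained with the alternative allele and the best score obtained with the reference allele. The sketch below is illustrative only. It uses a plain log-odds position weight matrix, which is not necessarily the scoring scheme used in this work (MatInspector-style matrix similarity, for example, is computed differently). The toy count matrix, the pseudocount, the uniform background and all function names are assumptions, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Hypothetical 4-position count matrix (rows: positions; columns: A, C, G, T).
# It is a toy example and is not taken from TRANSFAC or any other database.
TOY_COUNTS = [
    [8, 1, 1, 0],
    [0, 9, 1, 0],
    [1, 0, 9, 0],
    [0, 0, 1, 9],
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert base counts to log2-odds scores against a uniform background."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        matrix.append(
            [math.log2((c + pseudocount) / total / background) for c in row]
        )
    return matrix


def best_score(matrix, seq, snp_index):
    """Best score over all forward-strand windows that cover snp_index."""
    width = len(matrix)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    scores = [
        sum(matrix[i][BASES.index(seq[start + i])] for i in range(width))
        for start in range(first, last + 1)
    ]
    if not scores:
        raise ValueError("sequence is shorter than the matrix")
    return max(scores)


def matrix_change(matrix, seq, snp_index, alt_base):
    """Score with the alternative allele minus score with the reference allele."""
    ref_score = best_score(matrix, seq, snp_index)
    alt_seq = seq[:snp_index] + alt_base + seq[snp_index + 1:]
    return best_score(matrix, alt_seq, snp_index) - ref_score
```

For example, matrix_change(log_odds_matrix(TOY_COUNTS), "TTACGTTT", 3, "A") is negative, because the substitution disrupts the consensus ACGT. A SNP would then be flagged as potentially regulatory when the absolute change exceeds a chosen cut-off; the cut-off itself has to be calibrated on SNPs with experimentally known effects and is deliberately not given here.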

To provide a solid basis for the estimation of the cut-off level of the matrix change we performed computational analysis of 18 SNPs experimentally found to influence the level of gene expression resulting in

Conclusions

In conclusion, we describe computational criteria for prediction of SNPs likely to influence gene expression levels. The method identified 14,127 potentially regulatory SNPs suitable for future association studies. A population-validated set of regulatory SNP markers is organized in an open-access database available at ftp://194.67.85.195/ and in the Supplementary Table.

Acknowledgements

The authors are extremely grateful to Prof. Nick K. Yankovsky, the Head of the Genome Analysis Lab of the Vavilov Institute, for unwavering support and to the faculty of the MMB department, Dr. Christensen and Dr. Grant, for their help with everything. This work was supported by grants “Cancer Genomics and Development of Diagnostic Tools and Therapies” (Commonwealth Technology Research Fund, Virginia, USA), the FCNTP grant “Whole genome analysis of human polymorphisms” from the Russian Ministry of Science and

References (44)

  • O. Combarros et al.

    Age-dependent association between interleukin-1A (−889) genetic polymorphism and sporadic Alzheimer's disease. A meta-analysis

    J Neurol

    (2003)
  • A. Feyler et al.

    Point: myeloperoxidase −463G→A polymorphism and lung cancer risk

    Cancer Epidemiol Biomarkers Prev

    (2002)
  • M.J. Bullido et al.

    A polymorphism in the regulatory region of APOE associated with risk for Alzheimer's dementia

    Nat Genet

    (1998)
  • J.C. Lambert et al.

    Contribution of APOE promoter polymorphisms to Alzheimer's disease risk

    Neurology

    (2002)
  • H. Cao et al.

    Promoter polymorphism in PCK1 (phosphoenolpyruvate carboxykinase gene) associated with type 2 diabetes mellitus

    J Clin Endocrinol Metab

    (2004)
  • T.J. Hudson

    Wanted, regulatory SNPs

    Nat Genet

    (2003)
  • J.P. Bayley et al.

    Is there a future for TNF promoter polymorphisms?

    Genes Immun

    (2004)
  • K. Quandt et al.

    MatInd and MatInspector, new fast and versatile tools for detection of consensus matches in nucleotide sequence data

    Nucleic Acids Res

    (1995)
  • M. Stepanova et al.

    A comparative analysis of relative occurrence of transcription factor binding sites in vertebrate genomes and gene promoter areas

    Bioinformatics

    (2005)
  • V. Matys et al.

    TRANSFAC, transcriptional regulation, from patterns to profiles

    Nucleic Acids Res

    (2003)
  • M. Widenius et al.

    MySQL Reference Manual

    (2002)
  • E. Pennisi

    A low number wins the genesweep pool

    Science

    (2003)