Epistasis Blog

From the Computational Genetics Laboratory at Dartmouth Medical School (www.epistasis.org)

Wednesday, February 03, 2010

The GenEpi Toolbox

The GenEpi Toolbox looks useful. Has anyone tried it?

Here is a recent paper discussing this new bioinformatics resource for genetic epidemiology.

Coassin S, Brandstätter A, Kronenberg F. Lost in the space of bioinformatic tools: A constantly updated survival guide for genetic epidemiology. The GenEpi Toolbox. Atherosclerosis. 2009 Oct 29. [Epub ahead of print] [PubMed] PMID: 19963217.

Abstract

Genome-wide association studies (GWASs) led to impressive advances in the elucidation of genetic factors underlying complex phenotypes and diseases. However, the ability of GWAS to identify new susceptibility loci in a hypothesis-free approach requires tools to quickly retrieve comprehensive information about a genomic region and analyze the potential effects of coding and non-coding SNPs in a candidate gene region. Furthermore, once a candidate region is chosen for resequencing and fine-mapping studies, the identification of several rare mutations is likely and requires strong bioinformatic support to properly evaluate and prioritize the found mutations for further analysis. Due to the variety of regulatory layers that can be affected by a mutation, a comprehensive in-silico evaluation of candidate SNPs can be a demanding and very time-consuming task. Although many bioinformatic tools that significantly simplify this task were made available in the last years, their utility is often still unknown to researchers not intensively involved in bioinformatics. We present a comprehensive guide of 64 tools and databases to bioinformatically analyze gene regions of interest to predict SNP effects. In addition, we discuss tools to perform data mining of large genetic regions, predict the presence of regulatory elements, make in-silico evaluations of SNP effects and address issues ranging from interactome analysis to graphically annotated protein sequences. Finally, we exemplify the use of these tools by applying them to hits of a recently performed GWAS. Taken together, a combination of the discussed tools is summarized and constantly updated in the web-based "GenEpi Toolbox" (http://genepi_toolbox.i-med.ac.at) and can help to get a glimpse at the potential functional relevance of both large genetic regions and single nucleotide mutations which might help to prioritize the next steps.

Tuesday, February 02, 2010

Genetic Heterogeneity and Cancer

The following paper raises the important issue of genetic heterogeneity. This is a nice paper because it addresses the complexity of genetic architecture. However, its reference list is thin on the older literature: note how few of the cited papers were published before 2000. Genetic heterogeneity is not a new idea, and it would have been nice if the authors had given the reader a historical perspective on this important phenomenon.

Galvan A, Ioannidis JP, Dragani TA. Beyond genome-wide association studies: genetic heterogeneity and individual predisposition to cancer. Trends Genet. 2010 Jan 25. [Epub ahead of print] [PubMed] PMID: 20106545.

Abstract

Genome-wide association studies (GWAS) using population-based designs have identified many genetic loci associated with risk of a range of complex diseases including cancer; however, each locus exerts a very small effect and most heritability remains unexplained. Family-based pedigree studies have also suggested tentative loci linked to increased cancer risk, often characterized by pedigree-specificity. However, comparison between the results of population- and family-based studies shows little concordance. Explanations for this unidentified genetic 'dark matter' of cancer include phenotype ascertainment issues, limited power, gene-gene and gene-environment interactions, population heterogeneity, parent-of-origin-specific effects, and rare and unexplored variants. Many of these reasons converge towards the concept of genetic heterogeneity that might implicate hundreds of genetic variants in regulating cancer risk. Dissecting the dark matter is a challenging task. Further insights can be gained from both population association and pedigree studies.

Monday, February 01, 2010

An Open Access Database of Genome-wide Association Results

Ran across this paper today. Might be useful for those interested in reanalysis of GWAS data.

Johnson AD, O'Donnell CJ. An open access database of genome-wide association results. BMC Med Genet. 2009 Jan 22;10:6. [PubMed] PMID: 19161620; PubMed Central PMCID: PMC2639349.

BACKGROUND: The number of genome-wide association studies (GWAS) is growing rapidly leading to the discovery and replication of many new disease loci. Combining results from multiple GWAS datasets may potentially strengthen previous conclusions and suggest new disease loci, pathways or pleiotropic genes. However, no database or centralized resource currently exists that contains anywhere near the full scope of GWAS results. METHODS: We collected available results from 118 GWAS articles into a database of 56,411 significant SNP-phenotype associations and accompanying information, making this database freely available here. In doing so, we met and describe here a number of challenges to creating an open access database of GWAS results. Through preliminary analyses and characterization of available GWAS, we demonstrate the potential to gain new insights by querying a database across GWAS. RESULTS: Using a genomic bin-based density analysis to search for highly associated regions of the genome, positive control loci (e.g., MHC loci) were detected with high sensitivity. Likewise, an analysis of highly repeated SNPs across GWAS identified replicated loci (e.g., APOE, LPL). At the same time we identified novel, highly suggestive loci for a variety of traits that did not meet genome-wide significant thresholds in prior analyses, in some cases with strong support from the primary medical genetics literature (SLC16A7, CSMD1, OAS1), suggesting these genes merit further study. Additional adjustment for linkage disequilibrium within most regions with a high density of GWAS associations did not materially alter our findings. Having a centralized database with standardized gene annotation also allowed us to examine the representation of functional gene categories (gene ontologies) containing one or more associations among top GWAS results. Genes relating to cell adhesion functions were highly over-represented among significant associations (p < 4.6 x 10^-14), a finding which was not perturbed by a sensitivity analysis. CONCLUSION: We provide access to a full gene-annotated GWAS database which could be used for further querying, analyses or integration with other genomic information. We make a number of general observations. Of reported associated SNPs, 40% lie within the boundaries of a RefSeq gene and 68% are within 60 kb of one, indicating a bias toward gene-centricity in the findings. We found considerable heterogeneity in information available from GWAS suggesting the wider community could benefit from standardization and centralization of results reporting.
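For readers wondering what a "genomic bin-based density analysis" might look like in practice, here is a minimal sketch. It is not the authors' code: the 1 Mb bin size, the function names and the example positions are all assumptions made up for illustration. The idea is simply to chop each chromosome into fixed-width bins and count how many reported associations fall into each one.

```python
from collections import Counter

def bin_density(associations, bin_size=1_000_000):
    """Count reported associations per (chromosome, bin index).

    `associations` is an iterable of (chromosome, base-pair position) pairs.
    The 1 Mb default bin size is an assumption, not taken from the paper.
    """
    counts = Counter()
    for chrom, pos in associations:
        counts[(chrom, pos // bin_size)] += 1
    return counts

def densest_bins(counts, top=3):
    """Return the `top` bins with the most associations, densest first."""
    return counts.most_common(top)

if __name__ == "__main__":
    # Made-up positions, for illustration only.
    hits = [("6", 31_200_000), ("6", 31_750_000), ("6", 31_999_999),
            ("19", 45_400_000), ("8", 19_800_000)]
    for (chrom, index), n in densest_bins(bin_density(hits)):
        print(f"chr{chrom} bin {index}: {n} associations")
```

A real analysis would also have to deal with linkage disequilibrium and with the same SNP being reported by several studies, both of which the abstract mentions.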

Saturday, January 30, 2010

Whole Genome Association Study of Brain-Wide Imaging Phenotypes

We did a neat cluster analysis in this paper. Combining GWAS data with brain imaging phenotypes is a challenge.

Shen L, Kim S, Risacher SL, Nho K, Swaminathan S, West JD, Foroud T, Pankratz N, Moore JH, Sloan CD, Huentelman MJ, Craig DW, Dechairo BM, Potkin SG, Jack CR Jr, Weiner MW, Saykin AJ; Alzheimer’s Disease Neuroimaging Initiative. Whole Genome Association Study of Brain-Wide Imaging Phenotypes for Identifying Quantitative Trait Loci in MCI and AD: A Study of the ADNI Cohort. Neuroimage. 2010 Jan 22. [PubMed] PMID: 20100581.

Abstract

A genome-wide, whole brain approach to investigate genetic effects on neuroimaging phenotypes for identifying quantitative trait loci is described. The Alzheimer's Disease Neuroimaging Initiative 1.5T MRI and genetic dataset was investigated using voxel-based morphometry (VBM) and FreeSurfer parcellation followed by genome wide association studies (GWAS). 142 measures of grey matter (GM) density, volume, and cortical thickness were extracted from baseline scans. GWAS, using PLINK, were performed on each phenotype using quality controlled genotype and scan data including 530,992 of 620,903 single nucleotide polymorphisms (SNPs) and 733 of 818 participants (175 AD, 354 amnestic mild cognitive impairment, MCI, and 204 healthy controls, HC). Hierarchical clustering and heat maps were used to analyze the GWAS results and associations are reported at two significance thresholds (p<10^-7 and p<10^-6). As expected, SNPs in the APOE and TOMM40 genes were confirmed as markers strongly associated with multiple brain regions. Other top SNPs were proximal to the EPHA4, TP63 and NXPH1 genes. Detailed image analyses of rs6463843 (flanking NXPH1) revealed reduced global and regional GM density across diagnostic groups in TT relative to GG homozygotes. Interaction analysis indicated that AD patients homozygous for the T allele showed differential vulnerability to right hippocampal GM density loss. NXPH1 codes for a protein implicated in promotion of adhesion between dendrites and axons, a key factor in synaptic integrity, the loss of which is a hallmark of AD. A genome wide, whole brain search strategy has the potential to reveal novel candidate genes and loci warranting further investigation and replication.

Tuesday, January 19, 2010

Genetics of diabetes reveals biology but does not improve prediction

I very much enjoyed this blog posting on www.phgfoundation.org. They discuss a new paper published in the British Medical Journal (below) that shows traditional risk factors do a much better job of predicting type 2 diabetes than 20 published SNPs. A quote from the post: "By assessing the area under the receiver operator characteristic curve (a plot of sensitivity versus 1-specificity, where a value of 1.0 represents a perfect test and 0.5 represents a useless test), the traditional models significantly outperformed the genetic model (around 0.75 versus 0.54), and their performance was not substantially improved by the addition of genetic risk factors." This comes as no surprise to me because the genetic studies that led to this test were all based on single-locus analyses that completely ignore the underlying complexity of this common disease. It is my working hypothesis that we will not be able to use genetics to predict disease risk until we embrace, rather than ignore, the complexity of the genetic architecture of common human diseases. We commented on this in a 2007 letter to Science (also below).

Talmud PJ, Hingorani AD, Cooper JA, Marmot MG, Brunner EJ, Kumari M, Kivimäki M, Humphries SE. Utility of genetic and non-genetic risk factors in prediction of type 2 diabetes: Whitehall II prospective cohort study. BMJ. 2010 Jan 14;340:b4838. doi: 10.1136/bmj.b4838. [PubMed] PMID: 20075150.

Williams SM, Canter JA, Crawford DC, Moore JH, Ritchie MD, Haines JL. Problems with genome-wide association studies. Science. 2007 Jun 29;316(5833):1840-2. [PubMed] PMID: 17605173.
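For anyone who has not computed one by hand, the area under the ROC curve quoted above has a simple interpretation: it is the probability that a randomly chosen case gets a higher risk score than a randomly chosen control, with ties counting as one half. The sketch below uses that pairwise definition directly. The function name and the toy scores are mine, not from the Whitehall II paper, and the quadratic loop is only sensible for small examples.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    `scores` are predicted risks; `labels` are 1 for cases and 0 for controls.
    Returns the fraction of case/control pairs in which the case scores
    higher, counting ties as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

if __name__ == "__main__":
    # Toy data: two controls followed by two cases.
    print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A score that carries no information, such as the same value for everyone, gives 0.5 under this definition, which is why 0.54 for the genetic model is so unimpressive.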

Saturday, January 16, 2010

Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

This is a very nice paper. Hint for students: there might be a research project in there.

Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am J Hum Genet. 2010 Jan 8;86(1):6-22. [PubMed]

Abstract

Genome-wide association studies (GWAS) have rapidly become a standard method for disease gene discovery. A substantial number of recent GWAS indicate that for most disorders, only a few common variants are implicated and the associated SNPs explain only a small fraction of the genetic risk. This review is written from the viewpoint that findings from the GWAS provide preliminary genetic information that is available for additional analysis by statistical procedures that accumulate evidence, and that these secondary analyses are very likely to provide valuable information that will help prioritize the strongest constellations of results. We review and discuss three analytic methods to combine preliminary GWAS statistics to identify genes, alleles, and pathways for deeper investigations. Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable. Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact. Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases. Reviews of published methods with recommendations for their application are provided within the framework for each approach. Copyright © 2010 The American Society of Human Genetics.
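Students looking for a starting point might find it helpful to see how little is needed to pool results when only summary statistics are available. The sketch below is a generic fixed-effect, inverse-variance-weighted meta-analysis of one SNP across studies. It is one common textbook approach, not necessarily the one the review recommends, and the function name and inputs are assumptions for illustration.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    `betas` are per-study effect estimates (e.g. log odds ratios) and `ses`
    their standard errors. Returns (pooled beta, pooled SE, z, two-sided p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and equal length")
    weights = [1.0 / se ** 2 for se in ses]   # precise studies count more
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    return beta, se, z, p

if __name__ == "__main__":
    # Two hypothetical studies with the same modest effect.
    beta, se, z, p = fixed_effect_meta([0.2, 0.2], [0.1, 0.1])
    print(f"pooled beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

Two studies that are each only marginally significant on their own give a noticeably smaller pooled p-value, which is the whole point. A fixed-effect model assumes the true effect is the same in every study, so heterogeneity between studies calls for something more careful.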

Wednesday, January 13, 2010

Epistatic Interactions

I don't agree with everything in this paper but it does provide some useful information.

VanderWeele, Tyler J. (2010) "Epistatic Interactions," Statistical Applications in Genetics and Molecular Biology: Vol. 9 : Iss. 1, Article 1. [PDF]

Abstract

The term "epistasis" is sometimes used to describe some form of statistical interaction between genetic factors and is alternatively sometimes used to describe instances in which the effect of a particular genetic variant is masked by a variant at another locus. In general statistical tests for interaction are of limited use in detecting "epistasis" in the sense of masking. It is, however, shown that there are relations between empirical data patterns and epistasis that have not been previously noted. These relations can sometimes be exploited to empirically test for "epistatic interactions" in the sense of the masking of the effect of a particular genetic variant by a variant at another locus.

Tuesday, January 12, 2010

Multifactor Dimensionality Reduction for Graphics Processing Units Enables Genome-wide Testing of Epistasis in Sporadic ALS

Our new paper on using MDR on GPUs for GWAS analysis of epistasis has been accepted for publication in Bioinformatics. A preprint will be available soon. The GPU-MDR software is available from our website.

Casey S. Greene, Nicholas A. Sinnott-Armstrong, Daniel S. Himmelstein, Paul J. Park, Jason H. Moore, and Brent T. Harris. Multifactor Dimensionality Reduction for Graphics Processing Units Enables Genome-wide Testing of Epistasis in Sporadic ALS. Bioinformatics, in press (2010).

ABSTRACT

Motivation: Epistasis, the presence of gene-gene interactions, has been hypothesized to be at the root of many common human diseases, but current genome-wide association studies largely ignore its role. Multifactor dimensionality reduction (MDR) is a powerful model-free method for detecting epistatic relationships between genes but computational costs have made its application to genomewide data difficult. Graphics processing units (GPUs), the hardware responsible for rendering computer games, are powerful parallel processors. Using GPUs to run MDR on a genome-wide dataset allows for statistically rigorous testing of epistasis. Results: The implementation of MDR for GPUs (MDRGPU) includes core features of the widely used Java software package, MDR. This GPU implementation allows for large scale analysis of epistasis at a dramatically lower cost than the standard CPU based implementations. As a proof-of-concept, we applied this software to a genome-wide study of sporadic amyotrophic lateral sclerosis (ALS). We discovered a statistically significant two-SNP classifier and subsequently replicated the significance of these two SNPs in an independent study of ALS. MDRGPU makes the large scale analysis of epistasis tractable and opens the door to statistically rigorous testing of interactions in genome-wide datasets. Availability: MDRGPU is open source and available free of charge from http://www.sourceforge.net/projects/mdr.
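For readers new to MDR, here is a bare-bones sketch of the core idea for pairs of SNPs. It is not the MDRGPU or Java MDR code: it runs on the CPU, skips cross-validation and permutation testing, and all names are made up for illustration. Each two-locus genotype combination is labeled high-risk when its case:control ratio is at least the overall ratio in the data, and the resulting one-dimensional classifier is scored by balanced accuracy.

```python
from collections import defaultdict
from itertools import combinations

def mdr_pair_accuracy(genotypes, status, i, j):
    """Balanced accuracy of the MDR high/low-risk rule for SNPs i and j.

    `genotypes` is a list of rows of genotype codes; `status` holds 1 for
    cases and 0 for controls. No cross-validation is done in this sketch.
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    cases = defaultdict(int)
    controls = defaultdict(int)
    for row, y in zip(genotypes, status):
        cell = (row[i], row[j])
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    true_pos = true_neg = 0
    for cell in set(cases) | set(controls):
        # High risk if cases/controls >= overall ratio; cross-multiplied
        # so that cells with zero controls need no division.
        if cases[cell] * n_controls >= controls[cell] * n_cases:
            true_pos += cases[cell]
        else:
            true_neg += controls[cell]
    return 0.5 * (true_pos / n_cases + true_neg / n_controls)

def best_pair(genotypes, status):
    """Exhaustively score every SNP pair and return (accuracy, (i, j))."""
    n_snps = len(genotypes[0])
    return max((mdr_pair_accuracy(genotypes, status, i, j), (i, j))
               for i, j in combinations(range(n_snps), 2))

if __name__ == "__main__":
    # Toy data: disease status is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
    rows = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    labels = [a ^ b for a, b, c in rows]
    print(best_pair(rows, labels))
```

The exhaustive loop over all pairs is what makes genome-wide MDR expensive, since the number of pairs grows with the square of the number of SNPs. Each pair is scored independently of the others, which is why the problem maps so well onto GPUs.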

Friday, January 08, 2010

A novel approach to simulate gene-environment interactions in complex diseases

This looks interesting and perhaps useful. Let me know if you try it.

Amato R, Pinelli M, D'Andrea D, Miele G, Nicodemi M, Raiconi G, Cocozza S. A novel approach to simulate gene-environment interactions in complex diseases. BMC Bioinformatics. 2010 Jan 5;11(1):8. [PubMed]

Abstract

BACKGROUND: Complex diseases are multifactorial traits caused by both genetic and environmental factors. They represent the most part of human diseases and include those with largest prevalence and mortality (cancer, heart disease, obesity, etc.). Despite the large amount of information that has been collected about both genetic and environmental risk factors, there are relatively few examples of studies on their interactions in epidemiological literature. One reason can be the incomplete knowledge of the power of statistical methods designed to search for risk factors and their interactions in these data sets. An improvement in this direction would lead to a better understanding and description of gene-environment interaction. To this aim, a possible strategy is to challenge the different statistical methods against data sets where the underlying phenomenon is completely known and fully controllable, like for example simulated ones. RESULTS: We present a mathematical approach that models gene-environment interactions. By this method it is possible to generate simulated populations having gene-environment interactions of any form, involving any number of genetic and environmental factors and also allowing non-linear interactions as epistasis. In particular, we implemented a simple version of this model in a Gene-Environment iNteraction Simulator (GENS), a tool designed to simulate case-control data sets where a one gene-one environment interaction influences the disease risk. The main effort has been to allow the user to describe characteristics of the population by using standard epidemiological measures and to implement constraints to make the simulator behavior biologically meaningful. CONCLUSIONS: By the multi-logistic model implemented in GENS it is possible to simulate case-control samples of complex disease where gene-environment interactions influence the disease risk. The user has full control of the main characteristics of the simulated population and a Monte Carlo process allows random variability. A knowledge-based approach reduces the complexity of the mathematical model by using reasonable biological constraints and makes the simulation more understandable in biological terms. Simulated data sets can be used for the assessment of novel statistical methods or for the evaluation of the statistical power when designing a study.
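To give a feel for what simulating a one gene-one environment interaction involves, here is a minimal sketch. It is not GENS and does not use its parameterization: it assumes a single biallelic SNP in Hardy-Weinberg equilibrium, a standard normal environmental exposure, and an ordinary logistic model with an interaction term. All parameter names are hypothetical.

```python
import math
import random

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def simulate_population(n, maf, beta0, beta_g, beta_e, beta_ge, seed=0):
    """Simulate n individuals as (genotype, exposure, disease) tuples.

    Genotype is the count of minor alleles (0, 1 or 2) drawn under
    Hardy-Weinberg equilibrium with minor allele frequency `maf`.
    Disease risk follows
        logit(risk) = beta0 + beta_g*g + beta_e*e + beta_ge*g*e,
    and a Monte Carlo draw turns each risk into a disease status.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        g = int(rng.random() < maf) + int(rng.random() < maf)
        e = rng.gauss(0.0, 1.0)
        risk = logistic(beta0 + beta_g * g + beta_e * e + beta_ge * g * e)
        d = 1 if rng.random() < risk else 0
        people.append((g, e, d))
    return people

if __name__ == "__main__":
    # Pure interaction: neither factor matters unless both are present.
    pop = simulate_population(10_000, maf=0.3, beta0=-2.0,
                              beta_g=0.0, beta_e=0.0, beta_ge=0.8)
    print(sum(d for _, _, d in pop), "cases out of", len(pop))
```

A case-control data set can then be drawn by sampling equal numbers of affected and unaffected individuals from the simulated population. Fixing the seed makes a simulation reproducible, which matters when comparing the power of different methods on the same data.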

Thursday, January 07, 2010

Bioinformatics Strategies for Genome-Wide Association Studies (GWAS)

Our new review on bioinformatics strategies for GWAS analysis has been published in Bioinformatics. We focus in this paper on methods that are designed to embrace, rather than ignore, the complexity of common human diseases.

Moore, J.H., Asselbergs, F.W., Williams, S.M. Bioinformatics strategies for genome-wide association studies. Bioinformatics (2010). [PDF]

ABSTRACT

Motivation: The sequencing of the human genome has made it possible to identify an informative set of more than one million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWAS). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation, and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving healthcare through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype-phenotype relationship that is characterized by significant heterogeneity and gene-gene and gene-environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.