Epistasis Blog

From the Computational Genetics Laboratory at Dartmouth Medical School (www.epistasis.org)

Thursday, November 19, 2009

A Genetics Company Fails, Its Research Too Complex

According to the NY Times, deCode Genetics filed for bankruptcy on Tuesday. This, in my mind, is the last nail in the GWAS coffin. deCode and the many other people pushing technological solutions to complex problems in human genetics have failed because they assumed there was a simple relationship between genotype and phenotype. The optimism expressed by Altschuler at the end of this news article is the last gasp from those going down with the sinking ship.

Does this mean we will return to thinking about the problem and coming up with intelligent solutions? Unfortunately not. Altschuler and his many colleagues who influence research funding and publication are now pushing whole-genome sequencing and rare variants as the next solution. I predict this will be no more successful for diseases such as hypertension and bipolar depression than GWAS was. We need to be very careful as we go down this road. The public and Congress may not be able to stomach another grand failure to live up to the hype. One might predict that funding for genetics will dry up once deep sequencing fails to reveal the missing heritability. It might be a good time to think about retooling as a physiologist.

Tuesday, November 03, 2009

Program in Resource Bioinformatics

I have been awarded a subcontract to work with Harvard University, Jackson State University, Montana State University, Morehouse College, Oregon Health & Science University, the University of Alaska, the University of Hawaii, and the University of Puerto Rico to develop the national infrastructure for sharing biomedical research resources. Locally, we are establishing a Program in Resource Bioinformatics at Dartmouth College to identify and make available information about cell lines, mouse models, software, reagents, etc. that can be shared with the rest of the country. This new program is funded by an NIH/NCRR ARRA U24 grant. The official NCRR press release can be found here. The Dartmouth College press release can be found here. The Dartmouth newspaper article on the project can be found here. A brief description of our project can be found here.

Friday, October 16, 2009

A computational evolution system for open-ended automated learning of complex genetic relationships

I will be giving a talk on the following topic at IGES on Tuesday morning. I will also be presenting this as a poster at ASHG.

A computational evolution system for open-ended automated learning of complex genetic relationships.

Jason H. Moore, Doug Hill, Casey S. Greene

The failure of genome-wide association studies to reveal the genetic architecture of common diseases suggests that it is time that we embrace, rather than ignore, the complexity of the genotype-to-phenotype mapping relationship that is characterized by epistasis, plastic reaction norms, heterogeneity and other phenomena such as epigenetics. The extreme complexity of the problem suggests that simple linear models and other approaches that assume simplicity are unlikely to capture the full spectrum of genetic effects. To this end, we have developed an open-ended computational evolution system (CES) that makes no assumptions about the underlying genetic model and can learn through evolution by natural selection how to solve a particular genetic modeling problem. This is accomplished by providing the basic mathematical building blocks (e.g. +, -, *, /, LOG, <, >, =, AND, OR, NOT, etc.) for models that can take any shape or form and the basic building blocks for algorithmic functions (e.g. ADD, DELETE, COPY, etc.) that can manipulate genetic models in a manner that is dependent on expert statistical and biological knowledge or prior modeling experience. We have previously demonstrated that our CES approach has excellent power to detect epistatic relationships in genome-wide data across a wide range of heritabilities and sample sizes (Moore et al. 2008, 2009). We have also previously shown that this system can learn to utilize one of many sources of expert knowledge, thus providing an important clue as to how the system solves the problem (Greene et al. 2009). Here, we introduce an additional layer to our CES approach that introduces noise into the training data (5%, 10%, 15% and 20%) to drive the discovery process toward models that are more likely to generalize. We show using simulated epistatic relationships in genome-wide data that the CES leads to significantly smaller models (P<0.001), thus reducing false-positives and overfitting while maintaining a power of 97% to 100%. These results are important because they show how introduced noise in the data can yield more parsimonious models and reduce overfitting without the need for computationally expensive cross-validation. This study is an important step towards a paradigm of genetic analysis that makes few assumptions about a genetic architecture that is very complex.
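
For readers who want a feel for the two ideas in this abstract, here is a minimal Python sketch. It is not the CES code. It assumes a model can be written as a nested tuple built from a handful of the building blocks listed above, and it assumes the noise is injected by flipping a fixed fraction of case/control labels (the abstract does not say whether noise goes into labels or genotypes). The names `evaluate`, `add_label_noise`, `snp1` and `snp2` are made up for illustration.

```python
import operator
import random

# A few of the mathematical building blocks, mapped to Python functions.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    ">": lambda a, b: int(a > b),
    "AND": lambda a, b: int(bool(a) and bool(b)),
    "OR": lambda a, b: int(bool(a) or bool(b)),
}


def evaluate(node, genotypes):
    """Evaluate a model written as nested (op, left, right) tuples."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, genotypes), evaluate(right, genotypes))
    if isinstance(node, str):
        return genotypes[node]  # a SNP name: look up its genotype (0, 1 or 2)
    return node  # a numeric constant


def add_label_noise(labels, rate, seed=0):
    """Return a copy of binary labels with round(rate * n) entries flipped."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical model: predict "case" only when both SNPs carry a minor allele.
model = ("AND", (">", "snp1", 0), (">", "snp2", 0))
prediction = evaluate(model, {"snp1": 1, "snp2": 2})

# Hypothetical training labels: 100 controls followed by 100 cases.
labels = [0] * 100 + [1] * 100
for rate in (0.05, 0.10, 0.15, 0.20):
    noisy = add_label_noise(labels, rate)
    changed = sum(a != b for a, b in zip(labels, noisy))
    print(f"{rate:.0%} noise: {changed} of {len(labels)} labels flipped")
```

In a real system the ADD, DELETE and COPY operators would rewrite tuples like `model` during evolution, and fitness would be scored against the noisy labels rather than the clean ones.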

Thursday, October 15, 2009

Epistasis in a quantitative trait captured by a molecular model of transcription factor interactions

I haven't read this yet but the abstract looks very interesting!

Gertz J, Gerke JP, Cohen BA. Epistasis in a quantitative trait captured by a molecular model of transcription factor interactions. Theor Popul Biol. 2009 Oct 7. [PubMed]

Abstract

With technological advances in genetic mapping studies more of the genes and polymorphisms that underlie Quantitative Trait Loci (QTL) are now being identified. As the identities of these genes become known there is a growing need for an analysis framework that incorporates the molecular interactions affected by natural polymorphisms. As a step towards such a framework we present a molecular model of genetic variation in sporulation efficiency between natural isolates of the yeast, Saccharomyces cerevisiae. The model is based on the structure of the regulatory pathway that controls sporulation. The model captures the phenotypic variation between strains carrying different combinations of alleles at known QTL. Compared to a standard linear model the molecular model requires fewer free parameters, and has the advantage of generating quantitative hypotheses about the affinity of specific molecular interactions in different genetic backgrounds. Our analyses provide a concrete example of how the thermodynamic properties of protein-protein and protein-DNA interactions naturally give rise to epistasis, the non-linear relationship between genotype and phenotype. As more causative genes and polymorphisms underlying QTL are identified, thermodynamic analyses of quantitative traits may provide a useful framework for unraveling the complex relationship between genotype and phenotype.
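
The last point, that thermodynamics alone can produce epistasis, is easy to illustrate. The sketch below is a toy example, not the model from the paper: it assumes a single binding site whose occupancy is the phenotype, and two hypothetical alleles that each shift the binding free energy by a fixed amount. The energies add perfectly, but the occupancy does not. All numbers and names (`BASE`, `EFFECT_A`, `EFFECT_B`) are invented for illustration.

```python
import math

RT = 0.593  # kcal/mol, roughly room temperature


def occupancy(delta_g, conc=1.0):
    """Fraction of time a site is bound, given binding free energy delta_g.

    conc is the ligand concentration relative to the standard state, so
    conc * k is dimensionless.
    """
    k = math.exp(-delta_g / RT)  # equilibrium association constant
    return conc * k / (1.0 + conc * k)


BASE = 0.0       # hypothetical reference binding energy (kcal/mol)
EFFECT_A = -1.0  # hypothetical stabilising effect of allele A
EFFECT_B = -1.0  # hypothetical stabilising effect of allele B


def phenotype(has_a, has_b):
    """Occupancy for a genotype; allele effects are additive in energy only."""
    return occupancy(BASE + EFFECT_A * has_a + EFFECT_B * has_b)


f_ab = phenotype(0, 0)
f_Ab = phenotype(1, 0)
f_aB = phenotype(0, 1)
f_AB = phenotype(1, 1)

# Deviation of the double mutant from the sum of the single-allele effects.
epistasis = f_AB - f_Ab - f_aB + f_ab
print(f"ab={f_ab:.3f} Ab={f_Ab:.3f} aB={f_aB:.3f} AB={f_AB:.3f}")
print(f"epistasis term = {epistasis:.3f}")
```

Because occupancy saturates, the second stabilising allele buys less than the first, so the epistasis term is nonzero even though nothing in the setup mentions an interaction.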

Saturday, October 10, 2009

Transgenerational genetic effects on phenotypic variation and disease risk

Complexity, complexity, complexity. This is a nice paper by Joe Nadeau on a commonly overlooked source of complexity in the mapping between genotype and phenotype.

Nadeau JH. Transgenerational genetic effects on phenotypic variation and disease risk. Hum Mol Genet. 2009 Oct 15;18(R2):R202-10. [PubMed]

Abstract

Traditionally, we understand that individual phenotypes result primarily from inherited genetic variants together with environmental exposures. However, many studies showed that a remarkable variety of factors including environmental agents, parental behaviors, maternal physiology, xenobiotics, nutritional supplements and others lead to epigenetic changes that can be transmitted to subsequent generations without continued exposure. Recent discoveries show transgenerational epistasis and transgenerational genetic effects where genetic factors in one generation affect phenotypes in subsequent generations without inheritance of the genetic variant in the parents. Together these discoveries implicate a key signaling pathway, chromatin remodeling, methylation, RNA editing and microRNA biology. This exceptional mode of inheritance complicates the search for disease genes and represents perhaps an adaptation to transmit useful gene expression profiles from one generation to the next. In this review, I present evidence for these transgenerational genetic effects, identify their common features, propose a heuristic model to guide the search for mechanisms, discuss the implications, and pose questions whose answers will begin to reveal the underlying mechanisms.

Friday, October 09, 2009

Machine Learning Analysis of GWAS Data

Finally, someone thinking beyond the largely unsuccessful one-SNP-at-a-time approach.

Zhi Wei, Kai Wang, Hui-Qi Qu, Haitao Zhang, Jonathan Bradfield, Cecilia Kim, Edward Frackleton, Cuiping Hou, Joseph T. Glessner, Rosetta Chiavacci, Charles Stanley, Dimitri Monos, Struan F. A. Grant, Constantin Polychronakos, Hakon Hakonarson. From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes. PLoS Genetics, in press (2009). [PLoS]

Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm on a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in area under ROC curve (AUC) of ~0.84 in both datasets. In contrast, poor performance was achieved when limited to dozens of known susceptibility loci in the SVM model or logistic regression model. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
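
Since the headline number here is an AUC of ~0.84, it may help to recall what that statistic measures. The sketch below is plain Python and is not taken from the paper: it computes the AUC as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, counting ties as one half. The scores and labels are invented, and `roc_auc` is just an illustrative name.

```python
def roc_auc(scores, labels):
    """AUC as P(case score > control score), with ties counted as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores from some classifier (1 = case, 0 = control).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 means the scores rank cases no better than chance and 1.0 means every case outranks every control. The pairwise loop is quadratic, which is fine for a toy example but too slow for thousands of samples, where a rank-based formula is the usual choice.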

Tuesday, October 06, 2009

The Gene Wiki: community intelligence applied to human gene annotation

This looks like an interesting new paper. Useful?

Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI. The Gene Wiki: community intelligence applied to human gene annotation. Nucleic Acids Res. 2009 [PubMed]

Abstract

Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia, Wikipedia, the goal of this effort is to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki 'stubs' for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki.

Wednesday, September 30, 2009

Bioinformatics Strategies for Genome-Wide Association Studies - New NIH R01 Funded

My new NIH R01 (LM010098) on "Bioinformatics Strategies for Genome-Wide Association Studies" has been funded for four years by the National Library of Medicine. The abstract for this new grant is below.

Abstract

Genome-wide association studies (GWAS) are commonplace despite the lack of a comprehensive bioinformatics approach to the analysis of the data. The common method of analysis is to employ parametric statistics and then adjust for the large number of tests performed to limit false-positives (i.e. type I errors). This agnostic approach is preferred by some because no assumptions are made about which genes or genomic regions might be important. This logic suggests that the data should tell us where the important genetic variants are. The goal of our proposed research program is to specifically compare this agnostic approach with a bioinformatics approach that selects associated SNPs based on expert knowledge about biochemical pathways and gene function. We propose to develop a bioinformatics approach for selecting SNPs from a GWAS using knowledge about the biology of the genes being studied and the molecular pathology of disease (AIM 1). We will modify and extend the Exploratory Visual Analysis (EVA) database and software that was originally designed for microarray studies with pilot funding from the NLM BISTI program. We will then use this bioinformatics approach along with an agnostic statistical approach for detecting SNPs associated with plasma levels of tissue plasminogen activator (t-PA) and plasminogen activator inhibitor one (PAI-1) in a large population-based sample of Caucasians (n=2000) from the PREVEND study in Groningen, The Netherlands (AIM 2). Those SNPs identified by both methods in the PREVEND study will be evaluated first for replication in an independent population-based sample of Caucasians (n=2000) from the Rotterdam Study in the Netherlands and then for validation in a population-based sample of Blacks (n=2000) from the HeART Study in Ghana, Africa (AIM 3). Finally, we will specifically compare how many and which SNPs replicate and validate using the statistical approach and the bioinformatics approach (AIM 4). Our working hypothesis is that we will obtain more validated and hence more real SNPs using the bioinformatics approach.
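
To make the contrast between the two approaches concrete, here is a small Python sketch. It is only an illustration, not the analysis plan from the grant: it assumes the agnostic approach is a Bonferroni correction over every SNP on the array, and it stands in for the bioinformatics approach with a simple pre-selected pathway list that is tested under a much smaller correction. The SNP names, p-values and test counts are all invented.

```python
def bonferroni_hits(p_values, n_tests, alpha=0.05):
    """Return the SNPs whose p-value passes a Bonferroni threshold."""
    threshold = alpha / n_tests
    return {snp for snp, p in p_values.items() if p < threshold}


# Hypothetical association p-values for four SNPs out of a larger scan.
p_values = {"rs_a": 1e-9, "rs_b": 2e-4, "rs_c": 0.03, "rs_d": 0.6}

# Agnostic approach: correct for every SNP genotyped (assumed 500,000).
agnostic = bonferroni_hits(p_values, n_tests=500_000)

# Knowledge-driven approach: only test SNPs in a hypothetical pathway list
# (assumed to contain 200 SNPs in total, two of which appear above).
pathway_snps = {"rs_b", "rs_c"}
pathway_p = {snp: p for snp, p in p_values.items() if snp in pathway_snps}
informed = bonferroni_hits(pathway_p, n_tests=200)

print("agnostic hits:", sorted(agnostic))
print("informed hits:", sorted(informed))
```

With these made-up numbers the two approaches each find one SNP the other misses, which is the trade-off in miniature: prior knowledge lowers the multiple-testing burden but cannot find anything outside the genes it already points to.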

Tuesday, September 29, 2009

A View of the Parallel Computing Landscape

We rely very heavily on parallel computing to assist with our computational studies of epistasis. This article appeared in the October 2009 issue of Communications of the ACM.

Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, Katherine Yelick. A View of the Parallel Computing Landscape. Communications of the ACM Vol. 52 No. 10, Pages 56-67 [ACM]

Abstract

Industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Here, we review the issues and, as an example, describe an integrated approach we're developing at the Parallel Computing Laboratory, or Par Lab, to tackle the parallel challenge.

Friday, September 25, 2009

New NIH ARRA Grant Supplements

I have been awarded two NIH ARRA stimulus supplements to my National Library of Medicine R01 grant LM009012. The first will allow me to establish a Bioinformatics Visualization Laboratory at Dartmouth. The second will support six high school and undergraduate students to assist with our bioinformatics research during the summer.