Epistasis Blog

From the Computational Genetics Laboratory at Dartmouth Medical School (www.epistasis.org)

Thursday, April 30, 2009

Dartmouth Medical School Geneticist Tapped for National Academy of Sciences

Dr. Jay Dunlap, Chair of Genetics at Dartmouth College, was elected this week to the National Academy of Sciences. More information can be found here.

Sunday, April 26, 2009

Special Issue of PNAS on Complex Systems

The April 21st issue of PNAS is devoted to complex systems. The table of contents can be found here.

Saturday, April 25, 2009

Maximal extraction of biological information from genetic interaction data

This is an interesting new paper.

Carter GW, Galas DJ, Galitski T. Maximal extraction of biological information from genetic interaction data. PLoS Comput Biol. 2009 Apr;5(4):e1000347. [PubMed]

Abstract

Extraction of all the biological information inherent in large-scale genetic interaction datasets remains a significant challenge for systems biology. The core problem is essentially that of classification of the relationships among phenotypes of mutant strains into biologically informative "rules" of gene interaction. Geneticists have determined such classifications based on insights from biological examples, but it is not clear that there is a systematic, unsupervised way to extract this information. In this paper we describe such a method that depends on maximizing a previously described context-dependent information measure to obtain maximally informative biological networks. We have successfully validated this method on two examples from yeast by demonstrating that more biological information is obtained when analysis is guided by this information measure. The context-dependent information measure is a function only of phenotype data and a set of interaction rules, involving no prior biological knowledge. Analysis of the resulting networks reveals that the most biologically informative networks are those with the greatest context-dependent information scores. We propose that these high-complexity networks reveal genetic architecture at a modular level, in contrast to classical genetic interaction rules that order genes in pathways. We suggest that our analysis represents a powerful, data-driven, and general approach to genetic interaction analysis, with particular potential in the study of mammalian systems in which interactions are complex and gene annotation data are sparse.

Tuesday, April 21, 2009

Exploiting Expert Knowledge in Evolutionary Computing

Exploiting Expert Knowledge For Epistasis Analysis Using Evolutionary Computing - A Brief Summary of Work in this Area from the Computational Genetics Laboratory

We have put a lot of effort over the last several years into adapting evolutionary computing strategies to the detection and modeling of epistasis on a genome-wide scale. Our focus has been on the use of expert knowledge to help guide these stochastic search algorithms. The key to these studies is to know what your building blocks are and then exploit the hell out of them algorithmically. A good building block could be a SNP that is in a biologically important gene or for which there is good statistical evidence of association, for example. We have used expert knowledge to guide population initialization, mutation, recombination, selection, function definition, model tuning and in a multiobjective fitness function, for example. We have also developed a system that learns which source of expert knowledge to use from multiple different options. For background reading on exploiting building blocks see David Goldberg's book on The Design of Innovation. I chronologically summarize this work and its importance here for those of you interested in this area. Please let me know if you have any questions or would like one or more reprints.
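As a rough illustration of what "guiding" initialization and mutation with expert knowledge can mean, here is a minimal Python sketch in which SNPs are drawn with probability proportional to an expert-knowledge score. Everything in it is an assumption made for illustration: the function names, the score dictionary, and the weighting scheme are hypothetical and are not taken from our software or from the papers listed below.

```python
import random


def sample_snps(scores, k, rng):
    """Pick k distinct SNP names, favoring SNPs with higher scores.

    scores: dict mapping SNP name -> non-negative expert-knowledge score
            (hypothetical values, e.g. precomputed ReliefF-style weights).
    """
    if k > len(scores):
        raise ValueError("k cannot exceed the number of SNPs")
    remaining = dict(scores)
    chosen = []
    for _ in range(k):
        names = list(remaining)
        # A small constant keeps zero-scored SNPs selectable.
        weights = [remaining[n] + 1e-9 for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


def init_population(scores, pop_size, model_size, seed=0):
    """Build an initial population of models (lists of SNP names)."""
    rng = random.Random(seed)
    return [sample_snps(scores, model_size, rng) for _ in range(pop_size)]


def mutate(model, scores, rng):
    """Swap one randomly chosen SNP in the model for a score-weighted draw."""
    candidates = {s: w for s, w in scores.items() if s not in model}
    if not candidates:
        return list(model)
    new_snp = sample_snps(candidates, 1, rng)[0]
    child = list(model)
    child[rng.randrange(len(child))] = new_snp
    return child
```

With uniform scores this reduces to ordinary random initialization and mutation, which makes it a convenient baseline for comparison.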

1) White, B.C., Gilbert, J.C., Reif, D.M., Moore, J.H. A statistical comparison of grammatical evolution strategies in the domain of human genetics. Proceedings of the IEEE Congress on Evolutionary Computation, 676-682 (2005).

This paper was important because it provided the baseline experiments showing that various grammatical evolution strategies work no better than random search for attribute selection when the underlying model is epistatic. This motivated our work summarized below showing how expert knowledge can be used to significantly improve evolutionary computing algorithms in this domain. This was peer-reviewed, published and presented as part of the 2005 IEEE Congress on Evolutionary Computation.

2) Moore, J.H., White, B.C. Genome-wide genetic analysis using genetic programming: The critical need for expert knowledge. In: Riolo, R., Soule, T., Worzel, B. (Eds.) Genetic Programming Theory and Practice IV, Springer, pp. 11-28 (2007).

This was our first paper showing that expert knowledge is critical for success in this domain. Here, we developed a multiobjective fitness function that combines the accuracy of a classifier and a source of statistical expert knowledge from ReliefF scores. We show that accuracy alone is not enough to beat random search. This was peer reviewed, published and presented as part of the 2006 Genetic Programming Theory and Practice (GPTP) workshop at the University of Michigan.

3) Moore, J.H., White, B.C. Exploiting expert knowledge in genetic programming for genome-wide genetic analysis. In: Runarsson et al. (eds), Lecture Notes in Computer Science 4193, 969-977 (2006).

This was our second paper showing that expert knowledge is critical for success in this domain. Here, we developed an expert knowledge-guided recombination operator that ensures models with good building blocks (e.g. SNPs) are recombined and reproduced. We showed that using expert knowledge to select and recombine models performs as well as the multiobjective fitness function described above (Moore and White 2007) but requires only a tenth of the population size. This was peer-reviewed, published and presented as part of the 2006 Parallel Problem Solving from Nature (PPSN) conference.

4) Moore, J.H. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Zhu, X., Davidson, I. (Eds.), Knowledge Discovery and Data Mining: Challenges and Realities, IGI Global, pp 17-30 (2007).

This invited book chapter provides a nice review of our expert knowledge-driven evolutionary computing methods from the first three papers from above as applied to our multifactor dimensionality reduction (MDR) method.

5) Moore, J.H., Barney, N., Tsai, C.T., Chiang, F.T., Gui, J., White, B.C. Symbolic modeling of epistasis. Human Heredity 63, 120-133 (2007).

In this paper we demonstrate how knowledge gained from multiple evolutionary computing runs can be used to build a probability distribution function of model pieces that can in turn be used for fine-scale model generation. This is related to cascading and estimation of distribution algorithms. This was an important foundational paper for our computational evolution system that is described later below and is the inspiration for the archive that is built into that system and used as an important source of knowledge.

6) Greene, C.S., White, B.C., Moore, J.H. An expert knowledge-guided mutation operator for genome-wide genetic analysis using genetic programming. Lecture Notes in Bioinformatics 4774, 30-40 (2007).

Here we show how expert knowledge can be used to guide mutation in evolutionary computing. We show that using expert knowledge guided mutation performs as well as expert knowledge guided selection or multiobjective fitness functions.

7) Moore, J.H., Barney, N., White, B.C. Solving complex problems in human genetics using genetic programming: The importance of theorist-practitioner-computer interaction. In: Riolo, R., Soule, T., Worzel, B. (Eds.) Genetic Programming Theory and Practice V, pp 69-85, Springer (2008).

This paper provides an overview of our Symbolic Modeling (SyMod) software that uses genetic programming to build classification and regression models. Provided in SyMod is the ability to load and use expert knowledge for sensible initialization (presented below), multiobjective optimization and selection and recombination. This was peer reviewed, published and presented as part of the 2007 Genetic Programming Theory and Practice (GPTP) workshop at the University of Michigan.

8) Moore, J.H. The critical need for computational intelligence in human genetics. Computing Reviews, Hot Topics Essay 7, March 3 (2008).

This is a short essay that highlights the need for computational intelligence approaches such as evolutionary computing that can utilize expert knowledge.

9) Moore, J.H., Andrews, P.C., Barney, N., White, B.C. Development and evaluation of an open-ended computational evolution system for the genetic analysis of susceptibility to common human diseases. Lecture Notes in Computer Science 4973, 129-140 (2008).

The goal of this study was to develop a computational evolution system (CES) that is capable of discovering its own operators (e.g. recombination, mutation, etc.) of any size and complexity. We showed that this system is capable of discovering complex operators that can exploit expert knowledge for solving epistasis problems. This system was expanded in the next paper below.

10) Moore, J.H., Greene, C.S., Andrews, P., White, B.C. Does complexity matter? Artificial evolution, computational evolution and the genetic analysis of common human diseases. In: Riolo, R., Soule, T., Worzel, B. (Eds.) Genetic Programming Theory and Practice VI, pp. 125-143, Springer (2008).

The goal of this study was to expand our computational evolution system (CES) introduced above that is capable of discovering its own operators of any size or complexity. We again showed that this system is capable of discovering complex operators that can exploit expert knowledge for solving epistasis problems. We also introduced the concept of an archive that provides a memory for what the system has already learned. The archive provides an additional source of expert knowledge.

11) Urbanowicz, R., White, B.C., Barney, N., Moore, J.H. Mask functions for the symbolic modeling of epistasis using genetic programming. Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, pp. 339-346 (2008).

This study showed how the results of multifactor dimensionality reduction (MDR) modeling could be used for improving the ability of genetic programming to discover epistasis models. This is accomplished by providing new functions in the function set based on prior modeling results. This is an important type of statistical expert knowledge. This was peer reviewed, published and presented as part of the 2008 Genetic and Evolutionary Computation Conference (GECCO).

12) Greene, C.S., White, B.C., Moore, J.H. Ant colony optimization for genome-wide genetic analysis. Lecture Notes in Computer Science 5217, 37-47 (2008).

Ant colony optimization (ACO) is an attractive type of evolutionary computing because it inherently allows for expert knowledge (i.e. the heuristic) in the pheromone updating rule. We show that ACO works quite well in this domain and is able to exploit expert knowledge effectively. This was peer reviewed, published and presented as part of the 2008 International Conference on Ant Colony Optimization and Swarm Intelligence (ANTS).

13) Greene, C.S., Moore, J.H. Solving complex problems in human genetics using GP: Challenges and opportunities. SIGEVOlution 3, 1-7 (2008).

This is a review paper for the evolutionary computing community highlighting the important role of expert knowledge in this domain.

14) Greene, C.S., Moore, J.H. Solving complex problems in human genetics using nature-inspired algorithms requires strategies which exploit domain-specific knowledge. Nature Inspired Informatics, IGI Global, in press (2009).

This is an invited book chapter highlighting the important role of expert knowledge in this domain.

15) Moore, J.H. Mining patterns of epistasis in human genetics. Biological Data Mining, in press (2009).

This is an invited book chapter highlighting the important role of expert knowledge in this domain.

16) Greene, C.S., Gilmore, J., Kiralis, J., Andrews, P.C., Moore, J.H. Optimal use of expert knowledge in ant colony optimization for the analysis of epistasis in human disease. Lecture Notes in Computer Science, in press (2009).

As described above, ant colony optimization (ACO) is an attractive type of evolutionary computing because it inherently allows for expert knowledge (i.e. the heuristic) in the pheromone updating rule. We previously showed that ACO works quite well in this domain and is able to exploit expert knowledge effectively. The goal of this study was to optimize the use of expert knowledge. This was peer reviewed, published and presented as part of the 2009 Evolutionary Computing, Machine Learning and Data Mining in Bioinformatics (EvoBIO) conference.

17) Greene, C.S., Kiralis, J., Moore, J.H. Nature-inspired algorithms for the genetic analysis of epistasis in common human diseases: A theoretical assessment of wrapper vs. filter approaches. Proceedings of the IEEE Congress on Evolutionary Computation, in press (2009).

This is a theoretical assessment of wrapper and filter approaches. Under certain assumptions, filtering based on expert knowledge might be more powerful. If those assumptions are violated, a wrapper approach such as genetic programming might be better. This was peer reviewed, published and presented as part of the 2009 IEEE Congress on Evolutionary Computation.

18) Greene, C.S., White, B.C., Moore, J.H. Sensible initialization using expert knowledge for genome-wide analysis of epistasis using genetic programming. Proceedings of the IEEE Congress on Evolutionary Computation, in press (2009).

This paper shows how expert knowledge can be used to initialize a genetic programming population prior to the run. We show that this is an effective way to introduce expert knowledge into an evolutionary computing system. This was peer reviewed, published and presented as part of the 2009 IEEE Congress on Evolutionary Computation.

19) Greene, C.S., Hill, D., Moore, J.H. Environmental sensing of expert knowledge in a computational evolution system for complex problem solving in human genetics. In: Riolo, R., Soule, T., Worzel, B. (Eds.) Genetic Programming Theory and Practice VII, in press, Springer (2009).

The goal of this study was to expand our computational evolution system (CES) introduced above that is capable of discovering its own operators of any size or complexity. We show here that the CES is also capable of learning to exploit one of several sources of expert knowledge to solve the epistasis problem. This is not only important for the discovery of highly fit genetic models but it is also important because the particular source of expert knowledge used by evolved operators may tell us something important about the problem itself. This was peer reviewed, published and presented as part of the 2009 Genetic Programming Theory and Practice (GPTP) workshop at the University of Michigan.
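For readers who want something concrete, the multiobjective fitness idea from paper 2 in the list above (classifier accuracy combined with ReliefF-style expert knowledge) might be sketched as follows. This is only a hedged illustration: the linear blend, the `alpha` weight, the min-max rescaling, and the function name are all assumptions, and the published fitness function may well be formulated differently.

```python
def multiobjective_fitness(accuracy, model_snps, scores, alpha=0.5):
    """Blend classifier accuracy with the mean expert-knowledge score of a model.

    accuracy:   classification accuracy of the model, in [0, 1].
    model_snps: non-empty list of SNP names used by the model.
    scores:     dict mapping SNP name -> expert-knowledge score
                (hypothetical, e.g. ReliefF-style weights).
    alpha:      weight on accuracy; an assumed tuning parameter.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if not model_snps:
        raise ValueError("model_snps must not be empty")
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo

    def rescale(s):
        # Min-max rescaling so scores are comparable to accuracy.
        return (s - lo) / span if span > 0 else 0.0

    knowledge = sum(rescale(scores[s]) for s in model_snps) / len(model_snps)
    return alpha * accuracy + (1.0 - alpha) * knowledge
```

Setting `alpha=1.0` recovers an accuracy-only fitness, which is the setting that the papers above report does no better than random search.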

Sunday, April 19, 2009

Machine Learning: An Algorithmic Perspective - Python Examples

This looks like a really interesting and possibly useful new book. I look forward to reading it.

Machine Learning: An Algorithmic Perspective - Python Examples
by Stephen Marsland
Chapman & Hall/CRC; 1 edition (April 1, 2009)
[Amazon]

Written in an easily accessible style, this book provides the ideal blend of theory and practical, applicable knowledge. It covers neural networks, graphical models, reinforcement learning, evolutionary algorithms, dimensionality reduction methods, and the important area of optimization. It treads the fine line between adequate academic rigor and overwhelming students with equations and mathematical concepts. The author includes examples based on widely available datasets and practical and theoretical problems to test understanding and application of the material. The book describes algorithms with code examples backed up by a website that provides working implementations in Python.

Saturday, April 18, 2009

Human Genetics Conferences

The 2009 meetings of the International Genetic Epidemiology Society (IGES) and the American Society of Human Genetics (ASHG) will be held in October in Hawaii. Abstract deadlines will be coming around this summer (usually June-July).

Monday, April 13, 2009

Encyclopedia of Complexity and Systems Science

This looks interesting! Very expensive though.

Encyclopedia of Complexity and Systems Science [Springer]

Encyclopedia of Complexity and Systems Science provides an authoritative single source for understanding and applying the concepts of complexity theory together with the tools and measures for analyzing complex systems in all fields of science and engineering. The science and tools of complexity and systems science include theories of self-organization, complex systems, synergetics, dynamical systems, turbulence, catastrophes, instabilities, nonlinearity, stochastic processes, chaos, neural networks, cellular automata, adaptive systems, and genetic algorithms. Examples of near-term problems and major unknowns that can be approached through complexity and systems science include: The structure, history and future of the universe; the biological basis of consciousness; the integration of genomics, proteomics and bioinformatics as systems biology; human longevity limits; the limits of computing; sustainability of life on earth; predictability, dynamics and extent of earthquakes, hurricanes, tsunamis, and other natural disasters; the dynamics of turbulent flows; lasers or fluids in physics, microprocessor design; macromolecular assembly in chemistry and biophysics; brain functions in cognitive neuroscience; climate change; ecosystem management; traffic management; and business cycles. All these seemingly quite different kinds of structure formation have a number of important features and underlying structures in common. These deep structural similarities can be exploited to transfer analytical methods and understanding from one field to another. This unique work will extend the influence of complexity and system science to a much wider audience than has been possible to date.

Written for:
Undergraduate and graduate students in all fields, researchers in academia and government laboratories, technical professionals and managers in industries such as pharmaceuticals, computer hardware and software, aerospace, and telecommunications, financial analysts, and infrastructure managers.

Saturday, April 11, 2009

Identification of gene-gene interactions in the presence of missing data using the multifactor dimensionality reduction method.

A new paper in Genetic Epidemiology on data imputation for MDR analysis. It is not too surprising that a multivariate imputation is more powerful than other simpler approaches.

Namkung J, Elston RC, Yang JM, Park T. Identification of gene-gene interactions in the presence of missing data using the multifactor dimensionality reduction method. Genet Epidemiol. 2009 Feb 24. [Epub ahead of print] [PubMed]

Abstract

Gene-gene interaction is believed to play an important role in understanding complex traits. Multifactor dimensionality reduction (MDR) was proposed by Ritchie et al. [2001. Am J Hum Genet 69:138-147] to identify multiple loci that simultaneously affect disease susceptibility. Although the MDR method has been widely used to detect gene-gene interactions, few studies have been reported on MDR analysis when there are missing data. Currently, there are four approaches available in MDR analysis to handle missing data. The first approach uses only complete observations that have no missing data, which can cause a severe loss of data. The second approach is to treat missing values as an additional genotype category, but interpretation of the results may then be not clear and the conclusions may be misleading. Furthermore, it performs poorly when the missing rates are unbalanced between the case and control groups. The third approach is a simple imputation method that imputes missing genotypes as the most frequent genotype, which may also produce biased results. The fourth approach, Available, uses all data available for the given loci to increase power. In any real data analysis, it is not clear which MDR approach one should use when there are missing data. In this article, we consider a new EM Impute approach to handle missing data more appropriately. Through simulation studies, we compared the performance of the proposed EM Impute approach with the current approaches. Our results showed that Available and EM Impute approaches perform better than the three other current approaches in terms of power and precision.
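To make the simplest of these strategies concrete, here is a small Python sketch of the third approach in the abstract, imputing each missing genotype as the most frequent observed genotype at that locus. The function names and the use of `None` as the missing-value marker are assumptions for illustration; this is not the authors' code and it is not the EM Impute method, which the abstract reports performs better.

```python
from collections import Counter


def impute_most_frequent(genotypes, missing=None):
    """Replace missing genotypes at one locus with the most frequent observed one."""
    observed = [g for g in genotypes if g != missing]
    if not observed:
        # Nothing to learn from; leave the column unchanged.
        return list(genotypes)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g == missing else g for g in genotypes]


def impute_dataset(rows, missing=None):
    """Apply most-frequent imputation to every locus (column) of a dataset.

    rows: list of equal-length lists, one per subject, one entry per locus.
    """
    columns = list(zip(*rows))
    imputed = [impute_most_frequent(col, missing) for col in columns]
    return [list(r) for r in zip(*imputed)]
```

As the abstract notes, this kind of single-value imputation may produce biased results, so it is best read as a baseline rather than a recommendation.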

Thursday, April 09, 2009

Getting Genetics Done

Here is a very nice blog on Getting Genetics Done from Stephen Turner. Stephen is a graduate student with Dr. Marylyn Ritchie at the Vanderbilt University Center for Human Genetics Research. Marylyn was my first graduate student so that makes Stephen my academic grandchild.

Wednesday, April 08, 2009

Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis

A great new paper from Dr. Brett McKinney. Brett is a former postdoc in the Computational Genetics Laboratory.

McKinney BA, Crowe JE, Guo J, Tian D. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genet. 2009 Mar;5(3):e1000432 [PubMed] [PLoS Genetics]

Abstract

Evidence from human genetic studies of several disorders suggests that interactions between alleles at multiple genes play an important role in influencing phenotypic expression. Analytical methods for identifying Mendelian disease genes are not appropriate when applied to common multigenic diseases, because such methods investigate association with the phenotype only one genetic locus at a time. New strategies are needed that can capture the spectrum of genetic effects, from Mendelian to multifactorial epistasis. Random Forests (RF) and Relief-F are two powerful machine-learning methods that have been studied as filters for genetic case-control data due to their ability to account for the context of alleles at multiple genes when scoring the relevance of individual genetic variants to the phenotype. However, when variants interact strongly, the independence assumption of RF in the tree node-splitting criterion leads to diminished importance scores for relevant variants. Relief-F, on the other hand, was designed to detect strong interactions but is sensitive to large backgrounds of variants that are irrelevant to classification of the phenotype, which is an acute problem in genome-wide association studies. To overcome the weaknesses of these data mining approaches, we develop Evaporative Cooling (EC) feature selection, a flexible machine learning method that can integrate multiple importance scores while removing irrelevant genetic variants. To characterize detailed interactions, we construct a genetic-association interaction network (GAIN), whose edges quantify the synergy between variants with respect to the phenotype. We use simulation analysis to show that EC is able to identify a wide range of interaction effects in genetic association data. We apply the EC filter to a smallpox vaccine cohort study of single nucleotide polymorphisms (SNPs) and infer a GAIN for a collection of SNPs associated with adverse events. Our results suggest an important role for hubs in SNP disease susceptibility networks. The software is available at (http://sites.google.com/site/McKinneyLab/software).

Thursday, April 02, 2009

Decanalization and the origin of complex disease

Great new paper on canalization in Nature Reviews Genetics by Greg Gibson. If you are interested in epistasis you must read this.

Gibson G. Decanalization and the origin of complex disease. Nat Rev Genet. 2009 Feb;10(2):134-40. [PubMed]

Abstract

Complex genetic disease is caused by the interaction between genetic and environmental variables and is the predominant cause of mortality globally. Recognition that susceptibility arises through the combination of multiple genetic pathways that influence liability factors in a nonlinear manner suggests that a process of 'decanalization' contributes to the epidemic nature of common genetic diseases. The rapid evolution of the human genome combined with marked environmental and cultural perturbation in the past two generations might lead to the uncovering of cryptic genetic variation that is a major source of disease susceptibility.