Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Tuesday, February 24, 2015

Great feature selection method for detecting epistasis using random forests

This is a neat approach that is worth exploring for anyone using machine learning methods such as random forests to detect and model statistical epistasis in genetic studies of human health.

Holzinger ER, Szymczak S, Dasgupta A, Malley J, Li Q, Bailey-Wilson JE. Variable selection method for the identification of epistatic models. Pac Symp Biocomput. 2015;20:195-206. [PDF]


Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
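As a rough illustration of the recurrency idea, the sketch below assumes that a random forest has already been run several times and that each run produced one permutation importance score per variant. Following our reading of r2VIM, each run's scores are divided by the absolute value of the smallest (most negative) score in that run, and a variant is kept only if its relative importance meets a chosen factor in every run. The function name, the toy scores, and the factor of 3 are illustrative assumptions, not values taken from the paper.

```python
def select_recurrent(vim_runs, factor=3.0):
    """Return indices of variables whose relative importance is >= factor in every run.

    vim_runs: list of runs; each run is a list of importance scores, one per variable.
    factor:   hypothetical threshold on the relative importance.
    """
    n_vars = len(vim_runs[0])
    selected = set(range(n_vars))
    for run in vim_runs:
        min_vim = min(run)
        if min_vim >= 0:
            # Assumption: the most negative score is used as the noise estimate.
            raise ValueError("expected at least one negative score in each run")
        scale = abs(min_vim)
        passed = {j for j, v in enumerate(run) if v / scale >= factor}
        selected &= passed  # recurrency: a variable must pass in all runs
    return sorted(selected)


# Hypothetical importance scores: 3 runs x 5 variables.
runs = [
    [0.90, 0.80, 0.05, -0.10, 0.35],
    [1.10, 0.70, -0.20, 0.02, 0.10],
    [0.75, 0.95, 0.01, -0.15, 0.50],
]
selected = select_recurrent(runs, factor=3.0)
print(selected)
```

In this toy input the last variable passes in two runs but not in the third, so it is dropped. That is the point of requiring recurrency across runs.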


Friday, February 20, 2015

Is Big Data a 21st Century Maginot Line?

We have just published this open access editorial in BioData Mining on whether 'big data' is a 21st century Maginot line. This is relevant because we as scientists sometimes let the data define the research questions rather than the other way around. As the size and complexity of data grow, we may find ourselves asking simpler and simpler questions, only some of which are important to advancing our understanding of human health and disease.

Huang X, Jennings SF, Bruce B, Buchan A, Cai L, Chen P, Cramer CL, Guan W, Hilgert UK, Jiang H, Li Z, McClure G, McMullen DF, Nanduri B, Perkins A, Rekepalli B, Salem S, Specker J, Walker K, Wunsch D, Xiong D, Zhang S, Zhang Y, Zhao Z, Moore JH. Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm. BioData Min. 2015 Feb 6;8:7. [PDF]

See also our previous related essay on 'no boundary thinking' in bioinformatics.

Huang X, Bruce B, Buchan A, Congdon CB, Cramer CL, Jennings SF, Jiang H, Li Z, McClure G, McMullen R, Moore JH, Nanduri B, Peckham J, Perkins A, Polson SW, Rekepalli B, Salem S, Specker J, Wunsch D, Xiong D, Zhang S, Zhao Z. No-boundary thinking in bioinformatics research. BioData Min. 2013 Nov 6;6(1):19. [PDF]


Saturday, January 31, 2015

Epistasis: Methods and Protocols

Our new edited volume on epistasis.

This volume presents a valuable and readily reproducible collection of established and emerging techniques on modern genetic analyses. Chapters focus on statistical or data mining analyses, genetic architecture, the burden of multiple testing, genetic variance, measuring epistasis, multifactor dimensionality reduction, and ReliefF. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and key tips on troubleshooting and avoiding known pitfalls.

Saturday, January 03, 2015

Heuristic identification of biological architectures for simulating complex hierarchical genetic interactions

Moore JH, Amos R, Kiralis J, Andrews PC. Heuristic identification of biological architectures for simulating complex hierarchical genetic interactions. Genet Epidemiol. 2015 Jan;39(1):25-34. [PubMed]


Simulation plays an essential role in the development of new computational and statistical methods for the genetic analysis of complex traits. Most simulations start with a statistical model using methods such as linear or logistic regression that specify the relationship between genotype and phenotype. This is appealing due to its simplicity and because these statistical methods are commonly used in genetic analysis. It is our working hypothesis that simulations need to move beyond simple statistical models to more realistically represent the biological complexity of genetic architecture. The goal of the present study was to develop a prototype genotype-phenotype simulation method and software that are capable of simulating complex genetic effects within the context of a hierarchical biology-based framework. Specifically, our goal is to simulate multilocus epistasis or gene-gene interaction where the genetic variants are organized within the framework of one or more genes, their regulatory regions and other regulatory loci. We introduce here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating data in this manner. This approach combines a biological hierarchy, a flexible mathematical framework, a liability threshold model for defining disease endpoints, and a heuristic search strategy for identifying high-order epistatic models of disease susceptibility. We provide several simulation examples using genetic models exhibiting independent main effects and three-way epistatic effects.
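To make the liability threshold idea concrete, here is a minimal sketch and not the HIBACHI software itself. Each simulated individual gets a liability equal to a genetic score plus random noise, and individuals whose liability falls at or above a cutoff are labeled cases. The genetic score used here (the product of two genotypes, a simple interaction), the function name, and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate(n, maf=0.3, case_fraction=0.2, noise_sd=1.0, seed=42):
    """Simulate (genotype1, genotype2, case_status) tuples under a liability threshold."""
    rng = random.Random(seed)

    def genotype():
        # Number of minor alleles (0, 1, or 2), drawing two independent alleles.
        return (rng.random() < maf) + (rng.random() < maf)

    people = []
    for _ in range(n):
        g1, g2 = genotype(), genotype()
        # Hypothetical interaction score plus Gaussian noise.
        liability = g1 * g2 + rng.gauss(0.0, noise_sd)
        people.append((g1, g2, liability))

    # Cutoff chosen so that roughly the top case_fraction of liabilities are cases.
    ordered = sorted(p[2] for p in people)
    cutoff = ordered[n - int(n * case_fraction)]
    return [(g1, g2, int(liab >= cutoff)) for g1, g2, liab in people]


data = simulate(1000)
n_cases = sum(status for _, _, status in data)
print(n_cases, "cases out of", len(data))
```

Swapping the interaction score for a different function of the genotypes changes the genetic architecture without touching the threshold step.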

Tuesday, December 16, 2014

SNP characteristics predict replication success in association studies

Gorlov IP, Moore JH, Peng B, Jin JL, Gorlova OY, Amos CI. SNP characteristics predict replication success in association studies. Hum Genet. 2014 Dec;133(12):1477-86. [PubMed]


Successful independent replication is the most direct approach for distinguishing real genotype-disease associations from false discoveries in genome-wide association studies (GWAS). Selecting SNPs for replication has been primarily based on P values from the discovery stage, although additional characteristics of SNPs may be used to improve replication success. We used disease-associated SNPs from more than 2,000 published GWASs to identify predictors of SNP reproducibility. SNP reproducibility was defined as the proportion of successful replications among all replication attempts. The study reporting an association for the first time was considered to be the discovery study, and all subsequent studies targeting the same phenotype were considered replications. We found that -Log(P), where P is a P value from the discovery study, is the strongest predictor of SNP reproducibility. Other significant predictors include the type of SNP (e.g., missense vs intronic SNPs) and minor allele frequency. Features of the genes linked to the disease-associated SNP also predict SNP reproducibility. Based on empirically defined rules, we developed a reproducibility score (RS) to predict SNP reproducibility independently of -Log(P). We used data from two lung cancer GWASs as well as recently reported disease-associated SNPs to validate RS. Minus Log(P) outperforms RS when the very top SNPs are selected, while RS works better with relaxed selection criteria. In conclusion, we propose an empirical model to predict SNP reproducibility, which can be used to select SNPs for validation and prioritization.
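The two quantities at the center of the abstract are easy to compute. The sketch below uses made-up SNP records (the identifiers, P values, and counts are hypothetical) and assumes that Log means log base 10, which the abstract does not state. It does not implement the reproducibility score (RS), whose rules are not given here.

```python
import math


def reproducibility(successes, attempts):
    """Proportion of successful replications among all replication attempts."""
    if attempts == 0:
        return None  # no replication attempts, so reproducibility is undefined
    return successes / attempts


def minus_log10_p(p):
    """-log10 of the discovery-stage P value (base 10 is an assumption)."""
    return -math.log10(p)


# Hypothetical records for illustration only.
snps = [
    {"id": "snpA", "p": 5e-8, "successes": 3, "attempts": 4},
    {"id": "snpB", "p": 1e-12, "successes": 5, "attempts": 5},
    {"id": "snpC", "p": 2e-6, "successes": 0, "attempts": 2},
]

# Rank SNPs for replication by discovery-stage evidence, strongest first.
ranked = sorted(snps, key=lambda s: minus_log10_p(s["p"]), reverse=True)
for s in ranked:
    print(s["id"], round(minus_log10_p(s["p"]), 2),
          reproducibility(s["successes"], s["attempts"]))
```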

Tuesday, November 04, 2014

The effects of recombination on phenotypic exploration and robustness in evolution

Hu T, Banzhaf W, Moore JH. The effects of recombination on phenotypic exploration and robustness in evolution. Artif Life. 2014 Fall;20(4):457-70. [IEEE]


Recombination is a commonly used genetic operator in artificial and computational evolutionary systems. It has been empirically shown to be essential for evolutionary processes. However, little has been done to analyze the effects of recombination on quantitative genotypic and phenotypic properties. The majority of studies only consider mutation, mainly due to the more serious consequences of recombination in reorganizing entire genomes. Here we adopt methods from evolutionary biology to analyze a simple, yet representative, genetic programming method, linear genetic programming. We demonstrate that recombination has less disruptive effects on phenotype than mutation, that it accelerates novel phenotypic exploration, and that it particularly promotes robust phenotypes and evolves genotypic robustness and synergistic epistasis. Our results corroborate an explanation for the prevalence of recombination in complex living organisms, and help elucidate a better understanding of the evolutionary mechanisms involved in the design of complex artificial evolutionary systems and intelligent algorithms.

Wednesday, October 01, 2014

Phenotypic robustness and the assortativity signature of human transcription factor networks

Pechenick DA, Payne JL, Moore JH. Phenotypic robustness and the assortativity signature of human transcription factor networks. PLoS Comput Biol. 2014 Aug 14;10(8):e1003780. [PDF]


Many developmental, physiological, and behavioral processes depend on the precise expression of genes in space and time. Such spatiotemporal gene expression phenotypes arise from the binding of sequence-specific transcription factors (TFs) to DNA, and from the regulation of nearby genes that such binding causes. These nearby genes may themselves encode TFs, giving rise to a transcription factor network (TFN), wherein nodes represent TFs and directed edges denote regulatory interactions between TFs. Computational studies have linked several topological properties of TFNs - such as their degree distribution - with the robustness of a TFN's gene expression phenotype to genetic and environmental perturbation. Another important topological property is assortativity, which measures the tendency of nodes with similar numbers of edges to connect. In directed networks, assortativity comprises four distinct components that collectively form an assortativity signature. We know very little about how a TFN's assortativity signature affects the robustness of its gene expression phenotype to perturbation. While recent theoretical results suggest that increasing one specific component of a TFN's assortativity signature leads to increased phenotypic robustness, the biological context of this finding is currently limited because the assortativity signatures of real-world TFNs have not been characterized. It is therefore unclear whether these earlier theoretical findings are biologically relevant. Moreover, it is not known how the other three components of the assortativity signature contribute to the phenotypic robustness of TFNs. Here, we use publicly available DNaseI-seq data to measure the assortativity signatures of genome-wide TFNs in 41 distinct human cell and tissue types. We find that all TFNs share a common assortativity signature and that this signature confers phenotypic robustness to model TFNs. 
Lastly, we determine the extent to which each of the four components of the assortativity signature contributes to this robustness.
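For readers who want to see what the four components look like in practice, here is a minimal sketch. It assumes the common definition in which each component is the Pearson correlation, taken over all directed edges, between one kind of degree (out or in) at the source node and one kind of degree at the target node. The toy edge list and function names are hypothetical, and this is not the pipeline used in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def assortativity_signature(edges):
    """Return the four (source degree type, target degree type) correlations."""
    out_deg, in_deg = {}, {}
    for s, t in edges:
        out_deg[s] = out_deg.get(s, 0) + 1
        in_deg[t] = in_deg.get(t, 0) + 1
    degree = {
        "out": lambda v: out_deg.get(v, 0),
        "in": lambda v: in_deg.get(v, 0),
    }
    signature = {}
    for a in ("out", "in"):
        for b in ("out", "in"):
            xs = [degree[a](s) for s, _ in edges]  # degree of the regulating TF
            ys = [degree[b](t) for _, t in edges]  # degree of the regulated TF
            signature[(a, b)] = pearson(xs, ys)
    return signature


# Hypothetical toy network: each pair is (regulator, target).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D"), ("D", "A")]
sig = assortativity_signature(edges)
for key, value in sig.items():
    print(key, round(value, 3))
```

A negative component means that nodes with a high degree of the first kind tend to point at nodes with a low degree of the second kind, and a positive component means the opposite.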

Monday, September 01, 2014

Computational genetics analysis of grey matter density in Alzheimer's disease

Zieselman AL, Fisher JM, Hu T, Andrews PC, Greene CS, Shen L, Saykin AJ, Moore JH. Computational genetics analysis of grey matter density in Alzheimer's disease. BioData Min. 2014 Aug 22;7:17. [PDF]



Alzheimer's disease is the most common form of progressive dementia and there is currently no known cure. The cause of onset is not fully understood but genetic factors are expected to play a significant role. We present here a bioinformatics approach to the genetic analysis of grey matter density as an endophenotype for late onset Alzheimer's disease. Our approach combines machine learning analysis of gene-gene interactions with large-scale functional genomics data for assessing biological relationships.


We found a statistically significant synergistic interaction between two SNPs located in the intergenic region of an olfactory gene cluster. This model did not replicate in an independent dataset. However, genes in this region have high-confidence biological relationships and are consistent with previous findings implicating sensory processes in Alzheimer's disease.


Previous genetic studies of Alzheimer's disease have revealed only a small portion of the overall variability due to DNA sequence differences. Some of this missing heritability is likely due to complex gene-gene and gene-environment interactions. We have introduced here a novel bioinformatics analysis pipeline that embraces the complexity of the genetic architecture of Alzheimer's disease while at the same time harnessing the power of functional genomics. These findings represent novel hypotheses about the genetic basis of this complex disease and provide open-access methods that others can use in their own studies.

Tuesday, August 12, 2014

Diverse convergent evidence in the genetic analysis of complex disease

The genetic analysis of common human diseases should not rely solely on one piece of evidence (e.g. a p-value derived from a univariate test). In this paper, we explore integrating multiple sources of evidence in the search for valid genetic associations.

Ciesielski TH, Pendergrass SA, White MJ, Kodaman N, Sobota RS, Huang M, Bartlett J, Li J, Pan Q, Gui J, Selleck SB, Amos CI, Ritchie MD, Moore JH, Williams SM. Diverse convergent evidence in the genetic analysis of complex disease: coordinating omic, informatic, and experimental evidence to better identify and validate risk factors. BioData Min. 2014 Jun 30;7:10. [PDF]


In omic research, such as genome wide association studies, researchers seek to repeat their results in other datasets to reduce false positive findings and thus provide evidence for the existence of true associations. Unfortunately this standard validation approach cannot completely eliminate false positive conclusions, and it can also mask many true associations that might otherwise advance our understanding of pathology. These issues raise the question: How can we increase the amount of knowledge gained from high throughput genetic data? To address this challenge, we present an approach that complements standard statistical validation methods by drawing attention to both potential false negative and false positive conclusions, as well as providing broad information for directing future research. The Diverse Convergent Evidence approach (DiCE) we propose integrates information from multiple sources (omics, informatics, and laboratory experiments) to estimate the strength of the available corroborating evidence supporting a given association. This process is designed to yield an evidence metric that has utility when etiologic heterogeneity, variable risk factor frequencies, and a variety of observational data imperfections might lead to false conclusions. We provide proof of principle examples in which DiCE identified strong evidence for associations that have established biological importance, when standard validation methods alone did not provide support. If used as an adjunct to standard validation methods this approach can leverage multiple distinct data types to improve genetic risk factor discovery/validation, promote effective science communication, and guide future research directions.

Thursday, August 07, 2014

A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection

Our latest open-access paper on developing epistasis models that can be used to simulate data for evaluating machine learning methods. 

Urbanowicz RJ, Granizo-Mackenzie AL, Kiralis J, Moore JH. A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection. BioData Min. 2014 Jun 9;7:8. [PDF]


The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model 'architecture' on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.
In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by "shape". In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.
These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency, (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
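To show what "no marginal effects" means for a two-locus model, the sketch below checks whether a 3x3 penetrance table has the same marginal penetrance for every genotype at each locus. It assumes Hardy-Weinberg genotype frequencies and independent loci. The example table and function names are hypothetical and were not generated by GAMETES.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1, and 2 copies of the minor allele."""
    q = maf
    p = 1 - q
    return [p * p, 2 * p * q, q * q]


def marginal_penetrances(table, maf1, maf2):
    """Marginal penetrance of each genotype at locus 1 (rows) and locus 2 (columns)."""
    f1, f2 = genotype_freqs(maf1), genotype_freqs(maf2)
    rows = [sum(table[i][j] * f2[j] for j in range(3)) for i in range(3)]
    cols = [sum(table[i][j] * f1[i] for i in range(3)) for j in range(3)]
    return rows, cols


def is_pure(table, maf1, maf2, tol=1e-9):
    """True when all marginal penetrances are equal, i.e. neither locus has a main effect."""
    rows, cols = marginal_penetrances(table, maf1, maf2)
    values = rows + cols
    return max(values) - min(values) <= tol


# Hypothetical penetrance table: rows are locus 1 genotypes, columns are locus 2 genotypes.
epistatic = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
# A table where only locus 1 matters, for contrast.
additive = [
    [0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
]

rows, cols = marginal_penetrances(epistatic, 0.5, 0.5)
print(rows, cols)
print(is_pure(epistatic, 0.5, 0.5), is_pure(additive, 0.5, 0.5))
```

Note that purity depends on the allele frequencies. The same epistatic table fails the check when both minor allele frequencies are set to 0.3.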