Epistasis Blog

From the Computational Genetics Laboratory at Dartmouth Medical School (www.epistasis.org)

Tuesday, August 12, 2014

Diverse convergent evidence in the genetic analysis of complex disease

The genetic analysis of common human diseases should not rely solely on one piece of evidence (e.g. a p-value derived from a univariate test). In this paper, we explore integrating multiple sources of evidence in the search for valid genetic associations.

Ciesielski TH, Pendergrass SA, White MJ, Kodaman N, Sobota RS, Huang M, Bartlett J, Li J, Pan Q, Gui J, Selleck SB, Amos CI, Ritchie MD, Moore JH, Williams SM. Diverse convergent evidence in the genetic analysis of complex disease: coordinating omic, informatic, and experimental evidence to better identify and validate risk factors. BioData Min. 2014 Jun 30;7:10. [PDF]


In omic research, such as genome-wide association studies, researchers seek to repeat their results in other datasets to reduce false positive findings and thus provide evidence for the existence of true associations. Unfortunately, this standard validation approach cannot completely eliminate false positive conclusions, and it can also mask many true associations that might otherwise advance our understanding of pathology. These issues raise the question: How can we increase the amount of knowledge gained from high-throughput genetic data? To address this challenge, we present an approach that complements standard statistical validation methods by drawing attention to both potential false negative and false positive conclusions, as well as providing broad information for directing future research. The Diverse Convergent Evidence approach (DiCE) we propose integrates information from multiple sources (omics, informatics, and laboratory experiments) to estimate the strength of the available corroborating evidence supporting a given association. This process is designed to yield an evidence metric that has utility when etiologic heterogeneity, variable risk factor frequencies, and a variety of observational data imperfections might lead to false conclusions. We provide proof-of-principle examples in which DiCE identified strong evidence for associations that have established biological importance, when standard validation methods alone did not provide support. If used as an adjunct to standard validation methods, this approach can leverage multiple distinct data types to improve genetic risk factor discovery/validation, promote effective science communication, and guide future research directions.
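
The idea of convergent evidence in the abstract above can be illustrated with a toy tally. This is only a sketch and not the scoring system from the paper: the `Evidence` record, the `convergence_tally` function, and the example source labels are all hypothetical. The only thing taken from the abstract is the three kinds of evidence (omic, informatic, experimental). The sketch simply counts how many distinct kinds of evidence contain at least one item that supports an association.

```python
from dataclasses import dataclass

# The three evidence categories come from the abstract; everything else
# in this sketch (names, fields, scoring rule) is a hypothetical illustration.
CATEGORIES = ("omic", "informatic", "experimental")


@dataclass(frozen=True)
class Evidence:
    category: str   # one of CATEGORIES
    source: str     # free-text label for the dataset or assay (made up)
    supports: bool  # True if this item corroborates the association


def convergence_tally(items):
    """Return (count, names) of distinct categories with supporting evidence."""
    supported = set()
    for item in items:
        if item.category not in CATEGORIES:
            raise ValueError(f"unknown evidence category: {item.category}")
        if item.supports:
            supported.add(item.category)
    return len(supported), sorted(supported)


# Hypothetical evidence for a single candidate association.
example = [
    Evidence("omic", "discovery cohort", True),
    Evidence("omic", "replication cohort", False),
    Evidence("informatic", "pathway annotation", True),
]
count, kinds = convergence_tally(example)
print(count, kinds)
```

In this toy example the association fails statistical replication in one cohort yet still has support from two distinct kinds of evidence, which is the sort of case the abstract argues should not be discarded outright.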

Thursday, August 07, 2014

A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection

Our latest open-access paper develops epistasis models that can be used to simulate data for evaluating machine learning methods.

Urbanowicz RJ, Granizo-Mackenzie AL, Kiralis J, Moore JH. A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection. BioData Min. 2014 Jun 9;7:8. [PDF]


The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) classify and characterize pure, strict, two-locus epistatic models, (2) investigate the effect of model 'architecture' on detection difficulty, and (3) explore how adjusting GAMETES constraints influences diversity in the generated models.
In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by "shape". In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.
These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency, (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
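
To make the idea of a two-locus model in which neither locus is predictive on its own more concrete, here is a minimal sketch. It is not GAMETES code and does not use its file formats. The function names, the Hardy-Weinberg assumption, and the example penetrance table (an XOR-like pattern with minor allele frequency 0.5 at both loci) are assumptions chosen for illustration. The check is simply that every single-locus marginal penetrance equals the population prevalence.

```python
from itertools import product


def genotype_freqs(maf):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for minor allele frequency maf."""
    p = 1.0 - maf
    return [p * p, 2 * p * maf, maf * maf]


def prevalence(table, freqs_a, freqs_b):
    """Population prevalence: penetrance averaged over all nine genotype combinations."""
    return sum(
        freqs_a[i] * freqs_b[j] * table[i][j]
        for i, j in product(range(3), repeat=2)
    )


def marginal_penetrances(table, freqs_a, freqs_b):
    """Penetrance of each single-locus genotype, averaging over the other locus."""
    rows = [sum(freqs_b[j] * table[i][j] for j in range(3)) for i in range(3)]
    cols = [sum(freqs_a[i] * table[i][j] for i in range(3)) for j in range(3)]
    return rows, cols


def has_no_marginal_effects(table, freqs_a, freqs_b, tol=1e-9):
    """True if every single-locus marginal penetrance equals the prevalence."""
    k = prevalence(table, freqs_a, freqs_b)
    rows, cols = marginal_penetrances(table, freqs_a, freqs_b)
    return all(abs(m - k) <= tol for m in rows + cols)


# Hypothetical XOR-like penetrance table: rows are locus A genotypes,
# columns are locus B genotypes.
xor_like = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
freqs = genotype_freqs(0.5)
print(has_no_marginal_effects(xor_like, freqs, freqs))
```

A table like this carries all of its signal in the joint genotype: a univariate test at either locus sees the same disease risk for every genotype, which is why such models are a demanding benchmark for detection methods.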

Monday, August 04, 2014

Why epistasis is important for tackling complex human disease genetics

My new epistasis review / comment with Trudy Mackay from NC State. Unfortunately, it is not open access. Email me for a PDF.

Mackay TF, Moore JH. Why epistasis is important for tackling complex human disease genetics. Genome Med. 2014 Jun 9;6(6):42. [PubMed]

Thursday, July 31, 2014

First complex, then simple

My new BioData Mining editorial with Dr. James Malley from NIH on approaching machine learning and data science modeling from a complexity point of view.

Malley JD, Moore JH. First complex, then simple. BioData Min. 2014;7:13. [PDF]

Tuesday, July 29, 2014

Innovation is often unnerving: the door into summer

My latest BioData Mining editorial with Dr. James Malley from NIH. We discuss innovation and interestingness within the context of biological data mining.

Malley JD, Moore JH. Innovation is often unnerving: the door into summer. BioData Min. 2014 Jul 17;7:12. [PDF]

Monday, June 09, 2014

Validated context-dependent associations of coronary heart disease risk with genotype variation

An absolutely fabulous paper from Charlie Sing and Andy Clark on the context-dependent effects of GWAS hits. A must-read for anyone who believes, as I do, that the assumption of simplicity of genetic architecture is not realistic for most biological traits in humans. I have included below a paper on the PRIM method they used. Also worth a read.

Lusk CM, Dyson G, Clark AG, Ballantyne CM, Frikke-Schmidt R, Tybjærg-Hansen A, Boerwinkle E, Sing CF. Validated context-dependent associations of coronary heart disease risk with genotype variation in the chromosome 9p21 region: the Atherosclerosis Risk in Communities study. Hum Genet. 2014 Jun 3. [PubMed]

Dyson G, Sing CF. Efficient identification of context dependent subgroups of risk from genome-wide association studies. Stat Appl Genet Mol Biol. 2014 Apr 1;13(2):217-26. [PubMed]

Dyson G, Frikke-Schmidt R, Nordestgaard BG, Tybjaerg-Hansen A, Sing CF. Modifications to the Patient Rule-Induction Method that utilize non-additive combinations of genetic and environmental effects to define partitions that predict ischemic heart disease. Genet Epidemiol. 2009 May;33(4):317-24. [PubMed]

Friday, May 02, 2014

Models in Biology

This is by far one of the best papers I have read in the past year. Very relevant to our efforts to build mathematical, statistical, and computational models of the genotype-phenotype relationship in population-based studies of common diseases. Gunawardena BMC Biology 2014, 12:29 [PDF]

Sunday, April 27, 2014

Functional genomics annotation of a statistical epistasis network associated with bladder cancer susceptibility

We have previously published a statistical epistasis network associated with bladder cancer in a population-based study (Hu et al., BMC Bioinformatics). This network was highly non-random and contained many genes that were part of the aryl hydrocarbon receptor pathway. Below is a short paper reporting functional genomics annotation of the network.

Hu T, Pan Q, Andrew AS, Langer JM, Cole MD, Tomlinson CR, Karagas MR, Moore JH. Functional genomics annotation of a statistical epistasis network associated with bladder cancer susceptibility. BioData Min. 2014 Apr 11;7(1):5. [BioData Mining]


BACKGROUND: Several different genetic and environmental factors have been identified as independent risk factors for bladder cancer in population-based studies. Recent studies have turned to understanding the role of gene-gene and gene-environment interactions in determining risk. We previously developed the bioinformatics framework of statistical epistasis networks (SEN) to characterize the global structure of interacting genetic factors associated with a particular disease or clinical outcome. By applying SEN to a population-based study of bladder cancer among Caucasians in New Hampshire, we were able to identify a set of connected genetic factors with strong and significant interaction effects on bladder cancer susceptibility.

FINDINGS: To support our statistical findings using networks, in the present study, we performed pathway enrichment analyses on the set of genes identified using SEN, and found that they are associated with the carcinogen benzo[a]pyrene, a component of tobacco smoke. We further carried out an mRNA expression microarray experiment to validate statistical genetic interactions, and to determine if the set of genes identified in the SEN were differentially expressed in a normal bladder cell line and a bladder cancer cell line in the presence or absence of benzo[a]pyrene. Significant nonrandom sets of genes from the SEN were found to be differentially expressed in response to benzo[a]pyrene in both the normal bladder cells and the bladder cancer cells. In addition, the patterns of gene expression were significantly different between these two cell types.

CONCLUSIONS: The enrichment analyses and the gene expression microarray results support the idea that SEN analysis of bladder cancer in population-based studies is able to identify biologically meaningful statistical patterns. These results bring us a step closer to a systems genetic approach to understanding cancer susceptibility that integrates population and laboratory-based studies.
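
For readers unfamiliar with pathway enrichment analysis, one common form of the test is a hypergeometric (one-sided Fisher) tail probability. The sketch below shows that generic calculation only. It is not necessarily the method used in the paper, and the function name, parameter names, and example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, pathway, selected, overlap):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `overlap` pathway genes when `selected`
    genes are drawn without replacement from `universe` genes, of which
    `pathway` belong to the pathway of interest.
    """
    max_k = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 annotated genes, 150 in the pathway,
# 40 genes in the network, 5 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20000, 150, 40, 5)
print(f"{p_value:.3g}")
```

A small p-value says the network contains more pathway genes than expected by chance given the size of the gene universe, which is the statistical counterpart to the expression experiment described in the findings.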

Monday, April 14, 2014

To replicate or not to replicate? The case of pharmacogenetic studies

Statistical replication has always been the gold standard in genome-wide association studies (GWAS). However, as we have previously pointed out, there are many good reasons why true genetic associations might not replicate (Greene et al. 2009). This 2013 paper explores the issue with respect to pharmacogenetic studies. The mantra of GWAS is now focused on the identification of new drug targets using genetic association results. If this is true, should biological validation matter more than statistical replication? 

Aslibekyan S, Claas SA, Arnett DK. To replicate or not to replicate: the case of pharmacogenetic studies: Establishing validity of pharmacogenomic findings: from replication to triangulation. Circ Cardiovasc Genet. 2013 Aug;6(4):409-12 [PubMed]

Sunday, April 13, 2014

Why human disease-associated residues appear as the wild-type in other species: genome-scale structural evidence for the compensation hypothesis

Xu J, Zhang J. Why human disease-associated residues appear as the wild-type in other species: genome-scale structural evidence for the compensation hypothesis. Mol Biol Evol. 2014 [PubMed]


Many human disease-associated amino acid residues (DARs) appear as the wild-type in other species. This phenomenon is commonly explained by the presence of compensatory residues in these other species that alleviate the deleterious effects of the DARs. The general validity of this hypothesis, however, is unclear, because few compensatory residues have been identified. Here we test the compensation hypothesis by assembling and analyzing 1077 DARs located in 177 proteins of known crystal structures. Because destabilizing protein structures is a primary reason why DARs are deleterious, we focus on protein stability in this analysis. We discover that, in species where a DAR represents the wild-type, the destabilizing effect of the DAR is generally lessened by the observed amino acid substitutions in the spatial proximity of the DAR. This and other findings provide genome-scale evidence for the compensation hypothesis and have important implications for understanding epistasis in protein evolution and for using animal models of human diseases.