Epistasis Blog

From the Computational Genetics Laboratory at Dartmouth Medical School (www.epistasis.org)

Monday, January 29, 2007

BioSymphony Update

We are making good progress on our BioSymphony project. I have added an example to www.BioSymphony.org so you can see what the output will look like. Stay tuned!

Friday, January 26, 2007

GECCO Paper Deadline is Jan. 31st

The deadline for submitting papers to the Biological Applications track of the 2007 Genetic and Evolutionary Computing Conference (GECCO) is January 31st. I hope you will consider submitting a paper. Here is a description:

Biological Applications:

The scope of this track is any research that applies evolutionary (and related) techniques to solving biological and biomedical problems. All "flavors" of evolutionary techniques consistent with GECCO are included in this scope, including genetic algorithms, genetic programming, estimation of distribution algorithms, evolution strategies, evolutionary programming, ant colony optimization, swarm intelligence, artificial life and hybrid systems with any of these components. Papers that integrate evolutionary methods into bioinformatics, biomedical informatics, biostatistics, computational biology and systems biology, for example, are particularly welcome. Papers that provide experimental validation and/or biological interpretation of computational results are also particularly welcome.

Some specific examples of biological and biomedical issues that papers could address include:

• Data mining in biological or biomedical databases
• Diagnostic or predictive testing in epidemiology and genetics
• Functional diversification through gene duplication and exon shuffling
• Gene expression and regulation, alternative splicing
• Genetic association studies
• Haplotype and linkage disequilibrium analysis
• Image analysis and pattern recognition
• Metabolomics
• Microarray analysis
• Network reconstruction for development, expression, catalysis etc.
• Pharmacokinetic and pharmacodynamic analysis
• Phylogenetic reconstruction and analysis
• Relationships between evolved systems and their environment (e.g. phylogeography)
• Relationships within evolved communities (cooperation, coevolution, symbiosis, etc.)
• Sensitivity of speciation to variations in evolutionary processes
• Sequence alignment and analysis
• Simulation of cells, viruses, organisms and whole ecologies
• Structure prediction for biological molecules (structural biology)
• Systems biology

Thursday, January 25, 2007

New Version of MDR Available Soon

We are working on a new version of MDR that will have a faster analysis algorithm and a much faster ReliefF algorithm. The new MDR 1.1.0 version should be available in the next week or two. Stay tuned!

Tuesday, January 23, 2007

MDR Analysis of Bladder Cancer

A new paper by Huang et al. in CEBP reports results from an MDR analysis of gene-gene interactions in bladder cancer. This paper is a nice example of the use of interaction dendrograms to interpret MDR models.

Huang M, Dinney CP, Lin X, Lin J, Grossman HB, Wu X. High-Order Interactions among Genetic Variants in DNA Base Excision Repair Pathway Genes and Smoking in Bladder Cancer Susceptibility. Cancer Epidemiol Biomarkers Prev. 2007 Jan;16(1):84-91. [PubMed]

Cancer is a common multifactor human disease resulting from complex interactions between many genetic and environmental factors. In this study, we used a multifaceted analytic approach to explore the relationship between eight single nucleotide polymorphisms in base excision repair (BER) pathway genes, smoking, and bladder cancer susceptibility in a hospital-based case-control study. Overall, we did not find an association between any single BER gene single nucleotide polymorphism and bladder cancer risk. However, in stratified analysis, the OGG1 S326C variant genotypes in ever smokers (odds ratio, 0.74; 95% confidence interval, 0.56-0.99) and ADP-ribosyltransferase (ADPRT) V762A variant genotypes in never smokers (odds ratio, 0.58; 95% confidence interval, 0.37-0.91) conferred a significantly reduced risk. Using logistic regression, we observed that there was a two-way interaction between ADPRT V762A and smoking status. We next used classification and regression tree analysis to explore high-order gene-gene and gene-environment interactions. We found that smoking is the most important influential factor for bladder cancer risk. Consistent with the above findings, we found that the ADPRT V762A was only significantly involved in bladder cancer risk in never smokers and the OGG1 S326C was only significantly involved in ever smokers. We also observed gene-gene interactions among OGG1 S326C, XRCC1 R194W, and MUTYH H335Q in ever smokers. Using multifactor dimensionality reduction approach, the four-factor model, including smoking status, OGG1 S326C (rs1052133), APEX1 D148E (rs3136820), and ADPRT762 (rs1136410), had the best ability to predict bladder cancer risk with the highest cross-validation consistency (100%) and the lowest prediction error (37.02%; P < 0.001). These results support the hypothesis that genetic variants in BER genes contribute to bladder cancer risk through gene-gene and gene-environmental interactions. 
(Cancer Epidemiol Biomarkers Prev 2007;16(1):84-91).

Monday, January 22, 2007

Symbolic Modeler (SyMod)

Our Symbolic Modeler (SyMod) software package is now available for alpha testing. Please email me for a copy to evaluate. This is not a production version so please don't use it for analysis.

Thursday, January 18, 2007

Exploratory Visual Analysis

I have posted a few screenshots of our new Exploratory Visual Analysis (EVA) software package at www.exploratoryvisualanalysis.org. It will be ready for external testing in a few weeks. We are adding a few more bells and whistles before we send it out.

Friday, January 12, 2007

Genetic Architecture of Plasma t-PA and PAI-1

We have several papers recently published or in press that report results from our genetic studies of plasma levels of tissue plasminogen activator (t-PA) [OMIM 173370] and plasminogen activator inhibitor 1 (PAI-1) [OMIM 173360] in Caucasians from the PREVEND study in The Netherlands. The latest paper to be published in Genomics reports epistatic effects. Papers on gene-environment interaction and high-order epistatic effects are planned.

Asselbergs FW, Williams SM, Hebert PR, Coffey CS, Hillege HL, Navis G, Vaughan DE, van Gilst WH, Moore JH. The gender-specific role of polymorphisms from the fibrinolytic, renin-angiotensin, and bradykinin systems in determining plasma t-PA and PAI-1 levels. Thromb Haemost. 2006 Oct;96(4):471-7. [PubMed]

Asselbergs FW, Williams SM, Hebert PR, Coffey CS, Hillege HL, Navis G, Vaughan DE, van Gilst WH, Moore JH. Gender-specific correlations of PAI-1 and t-PA levels with cardiovascular disease-related traits. J Thromb Haemost 2007, in press. [PubMed]

Asselbergs FW, Williams SM, Hebert PR, Coffey CS, Hillege HL, Navis G, Vaughan DE, van Gilst WH, Moore JH. Epistatic effects of polymorphisms in genes from the renin-angiotensin, bradykinin, and fibrinolytic systems on plasma t-PA and PAI-1 levels. Genomics 2007, in press. [PubMed]

This work was supported by NIH grant HL65234 (PI - Moore)

Wednesday, January 10, 2007

MDR Analysis of Autism

A new paper by Coutinho et al. that will be published later this year in Human Genetics uses MDR to identify nonadditive interactions that are predictive of autism susceptibility. There are several nice aspects of this paper that make it worth reading. First, this is one of the first papers to use the interaction dendrogram feature of the MDR software to provide a statistical interpretation of the multilocus model. Second, the authors used the Restricted Partitioning Method (RPM) of Culverhouse et al. (Genetic Epidemiology 2004) to identify similar interaction effects on serotonin levels. Thus, the discrete MDR analysis is showing the same thing as the quantitative RPM analysis. This is the first time, to my knowledge, that epistasis has been documented at multiple levels of the hierarchy between genotype and phenotype in a human study of a complex disease.

Coutinho AM, Sousa I, Martins M, Correia C, Morgadinho T, Bento C, Marques C, Ataide A, Miguel TS, Moore JH, Oliveira G, Vicente AM. Evidence for epistasis between SLC6A4 and ITGB3 in autism etiology and in the determination of platelet serotonin levels. Human Genetics, in press (2007) [PubMed]


Autism is a neurodevelopmental disorder of unclear etiology. The consistent finding of platelet hyperserotonemia in a proportion of patients and its heritability within affected families suggest that genes involved in the serotonin system play a role in this disorder. The role in autism etiology of seven candidate genes in the serotonin metabolic and neurotransmission pathways and mapping to autism linkage regions (SLC6A4, HTR1A, HTR1D, HTR2A, HTR5A, TPH1 and ITGB3) was analyzed in a sample of 186 nuclear families. The impact of interactions among these genes in autism was assessed using the multifactor-dimensionality reduction (MDR) method in 186 patients and 181 controls. We further evaluated whether the effect of specific gene variants or gene interactions associated with autism etiology might be mediated by their influence on serotonin levels, using the quantitative transmission disequilibrium test (QTDT) and the restricted partition method (RPM), in a sample of 109 autistic children. We report a significant main effect of the HTR5A gene in autism (P = 0.0088), and a significant three-locus model comprising a synergistic interaction between the ITGB3 and SLC6A4 genes with an additive effect of HTR5A (P < 0.0010). In addition to the previously reported contribution of SLC6A4, we found significant associations of ITGB3 haplotypes with serotonin level distribution (P = 0.0163). The most significant models contributing to serotonin distribution were found for interactions between TPH1 rs4537731 and SLC6A4 haplotypes (P = 0.002) and between HTR1D rs6300 and SLC6A4 haplotypes (P = 0.013). In addition to the significant independent effects, evidence for interaction between SLC6A4 and ITGB3 markers was also found. 
The overall results implicate SLC6A4 and ITGB3 gene interactions in autism etiology and in serotonin level determination, providing evidence for a common underlying genetic mechanism and a molecular explanation for the association of platelet hyperserotonemia with autism.

Tuesday, January 09, 2007

Tuned ReliefF (TuRF)

Our paper on "Tuning ReliefF for Genome-Wide Genetic Analysis" has been accepted for publication in the Lecture Notes in Computer Science (LNCS) series from Springer. This paper will be presented at the Evolutionary Computing, Machine Learning, and Data Mining in Bioinformatics (EvoBIO'07) Conference in Valencia, Spain in April. Email me for a preprint.

Moore JH, White BC. Tuning ReliefF for Genome-Wide Genetic Analysis. Lecture Notes in Computer Science, in press (2007).


An important goal of human genetics is the identification of DNA sequence variations that are predictive of who is at risk for various common diseases. The focus of the present study is on the challenge of detecting and characterizing nonlinear attribute interactions or dependencies in the context of a genome-wide genetic study. The first question we address is whether the ReliefF algorithm is suitable for attribute selection in this domain. The second question we address is whether we can improve ReliefF for selecting important genetic attributes. Using simulated genetic datasets, we show that ReliefF is significantly better than a naive chi-square test of independence for selecting two interacting attributes out of 10^3 candidates. In addition, we show that ReliefF can be improved in this domain by systematically removing the worst attributes and re-estimating ReliefF weights. Our simulation studies demonstrate that this new Tuned ReliefF (TuRF) algorithm is significantly better than ReliefF. The ability to filter or select DNA sequence variations that are associated with disease class through complex nonlinear interactions will play an important role in the development of genetic models of disease risk.
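
For readers who want a feel for the "remove the worst attributes and re-estimate" idea, here is a minimal Python sketch. It is not the code from the paper or from the MDR software. The scorer is a simplified Relief (one nearest hit and one nearest miss per sample) standing in for ReliefF, and the function names, the drop_fraction parameter, and the toy dataset are all assumptions made for illustration.

```python
import random


def relief_weights(X, y, features):
    """Simplified Relief weights over the given feature indices.

    Stand-in for ReliefF: uses a single nearest hit and nearest miss
    per sample, with a mismatch count as the distance.
    """
    def dist(a, b):
        return sum(1 for f in features if X[a][f] != X[b][f])

    n = len(X)
    weights = {f: 0.0 for f in features}
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(i, j))
        miss = min(misses, key=lambda j: dist(i, j))
        for f in features:
            # Reward features that differ from the nearest miss,
            # penalize features that differ from the nearest hit.
            weights[f] += (X[i][f] != X[miss][f]) - (X[i][f] != X[hit][f])
    return {f: w / n for f, w in weights.items()}


def turf(X, y, n_keep, drop_fraction=0.1):
    """Iteratively drop the lowest-weighted features and re-estimate.

    drop_fraction is a hypothetical tuning knob, not a value from the paper.
    Returns the n_keep surviving feature indices, best first.
    """
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        w = relief_weights(X, y, features)
        ranked = sorted(features, key=lambda f: w[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_drop = min(n_drop, len(ranked) - n_keep)
        features = ranked[:len(ranked) - n_drop]
    w = relief_weights(X, y, features)
    return sorted(features, key=lambda f: w[f], reverse=True)


# Toy data (made up): features 0 and 1 interact as XOR, features 2-9 are noise.
random.seed(0)
X, y = [], []
for i in range(40):
    a, b = i % 2, (i // 2) % 2
    X.append([a, b] + [random.randint(0, 2) for _ in range(8)])
    y.append(a ^ b)

print(turf(X, y, n_keep=2))
```

The key design point is that the weights are recomputed after every removal, so noisy attributes that were distorting the nearest-neighbor distances stop influencing the scores of the attributes that remain.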

Monday, January 08, 2007

MDR Analysis in Imbalanced Datasets

Our paper on "A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction" has been accepted for publication in Genetic Epidemiology and will appear later this year. This paper describes a balanced accuracy function for estimating accuracy when the number of cases and controls is not equal. This new function has been implemented in the latest MDR software package.

Velez, D.R., White, B.C., Motsinger, A.A., Bush, W.S., Ritchie, M.D., Williams, S.M., Moore, J.H. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology, in press (2007).


Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well-known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: 1) over-sampling that resamples with replacement the smaller class until the data are balanced; 2) under-sampling that randomly removes subjects from the larger class until the data are balanced; 3) balanced accuracy [(sensitivity+specificity)/2] as the fitness function with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1:1, 1:2, 1:4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
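
To see why plain accuracy is misleading when cases and controls are unequal, here is a small Python sketch of the balanced accuracy formula quoted above, (sensitivity+specificity)/2. It is an illustration only, not the implementation in the MDR software; the function names and the 1:4 example counts are made up, and the adjusted threshold from the paper is not covered.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels: 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all subjects classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; requires both classes present."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical 1:4 dataset: 20 cases and 80 controls, with a useless
# model that labels every subject a control.
y_true = [1] * 20 + [0] * 80
y_pred = [0] * 100

print(f"accuracy          = {accuracy(y_true, y_pred):.2f}")           # 0.80
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.2f}")  # 0.50
```

The always-control model scores 80% on accuracy simply because controls dominate, while balanced accuracy reports 50%, which is what a model with no predictive value should get.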

This work was supported by National Institutes of Health grants AI59694, HD047447, LM009012, RR018787, GM62758 and AG20135.

Sunday, January 07, 2007

MDR Applications List Updated

I have updated the growing list of published papers that report results from the application of our Multifactor Dimensionality Reduction (MDR) method to real data. You can find the updated list here.

If you are new to MDR you can download the free software here.

You can find a five-part MDR tutorial in a series of posts on this blog in November and December of 2006. Here are the links:

Part 1 - Missing Data
Part 2 - Filtering
Part 3 - Analysis
Part 4 - Results
Part 5 - Interpretation

For more information about MDR visit www.epistasis.org or www.multifactordimensionalityreduction.org.

Questions? Comments? Email me.