MDR 101 - Part 2 - Filtering
MDR has traditionally carried out an exhaustive search of all possible 2-way, 3-way, up to n-way combinations of attributes (e.g. SNPs). This works well for candidate gene studies with 100 or fewer SNPs because the combinations can be completely enumerated in a reasonable amount of time on a desktop computer. As a side note, it is important to remember that the MDR software is designed to take advantage of PCs with multiple processors. Thus, if you have a PC with two processors that have threading turned on MDR will run approximately 4x faster that a single CPU with no threading. Make sure the next PC you buy has one or more dual-core or new quad-core chips in it!
When you have more than 100 SNPs an exhaustive search may not be practical unless you are willing to wait days or even weeks for a run to finish. When the number of SNPs exceeds 10,000 an exhaustive search of all 3-way and higher combinations becomes infeasible, even with a parallel computer. As SNP chip data becomes ubiquitous there will be an increasing interest in using MDR to carry out a genome-wide analysis of epistasis. As a side note, we have released a C library (libMDR) to help you write fast MDR software for high-performance computing.
We are actively developing two different strategies for genome-wide analysis using MDR. The first general strategy is to select a subset of 100 or 1000 SNPs, for example, from the list of 100,000 or 500,000 candidates. This is the filter approach. The second approach is to use a stochastic search algorithm such as simulated annealing or genetic programming to search for an optimal combination. This is the wrapper approach. In this essay we will cover the filter approach since it is more of a pre-processing step. Previous blog posts (e.g. March 14, June 12, Sept. 12) have talked about our wrapper approaches. We will discuss this in more detail in a later MDR 101 post.
There are many different filters that can be applied to SNP data to reduce the number of attributes that need to be systematically evaluated using MDR. The first approach is to use statistical and computational algorithms to select SNPs that are most likely to interact. We have been evaluating the Relief family of algorithms for this purpose. Relief was originally developed by Kira and Rendell (1992) and later extended in the ReliefF algorithm by Kononenko (1994). A good review of these algorithms can be found in Robnik-Sikonja and Kononenko (2003). The nice thing about ReliefF is that it can assign a single quality value to each SNPs in a way that captures dependencies or interactions between SNPs. We have developed several extensions of ReliefF that improve the power to filter interacting SNPs. Our first approach is called Tuned ReliefF or TuRF. A paper describing this approach is under review. A second approach is called Evaporative ReliefF. This approach was developed in collaboration Dr. Brett McKinney and is also under review. The key to find a fast algorithm that can assign high quality values to SNP involved in interactions without explicitly modeling the interactions which leads to the combinatorial problem that MDR faces. You don't want to use univariate approaches such as chi-square unless you explicitly want to condition on main effects. Interactions in the absence of statistically significant main effects are likely to be missed.
The MDR software currently allows you to select ReliefF, chi-square, and odds ratio as filters. For ReliefF, it is probably best to use the default settings as these are standard in data mining. Feel free to play with the settings if you want. Note that the odds ratio filter reports the 'best' odds ratio using several different references groups. It should thus be seen as a 'maximized' odds ratio. For the chi-square test, the results can be viewed as either the chi-square values or the p-values. For all the filters, you can select the size of the subset you want by specifiying the number you want, the percentage you want, or the subset that meets some threshold critera (e.g. p-value). As an exercise, compare the list of best SNPs returned by ReliefF and chi-square. Those SNPs that have high ReliefF scores but non-significant ch-square scores are candidates for nonlinear interactions. In the papers on atrial fibrillation by Tsai et al. (2004) and Moore et al. (2006), the ACE I/D and T174M polymorphisms detected by MDR have the highest ReliefF scores of any polymorphism in the dataset but have non-significant chi-square values and thus would have been missed by conditioning on main effects. The M235T polymorphism has a main effect and is detected by both ReliefF and chi-square. Note that we have not yet put TuRF or Tuned ReliefF in the MDR software yet. However, you can carry out a TuRF analysis manually by repeatedly running ReliefF on your data each time saving the best 90% or 99% of the polymorphisms, for example. The idea is to throw away the noisy SNPs to get better estimates of the interesting SNPs. This is a balance though! ReliefF scores will start to decrease as you throw out signal so don't remove too many.
In addition to the algorithmic approach, we think it will be very useful to reduce the number of SNPs examined by using domain-specific knowledge about biochemical pathways, Gene Ontology, chromosomal location, published relationships, etc.. Why not use the years of accumulated knowledge about the disease you are studying and likely genes that might be involved? We think an optimal filter will combine algorithms such as ReliefF with knowledge about biological systems. As a side note, we have previously used pathway information to help organize our MDR analyses. See Williams et al. (2004) for an example.
In summary, statistical and biological filters can help reduce the number of SNPs that need to be exhaustively analyzed by MDR. The advantage of the filter approach is that it is computationally tractible. The disadvantage is that your list of filtered SNPs is only as good as the filter that is being used. Surely some good SNPs will be excluded from the analysis.
For further reading about filtering using TuRF for MDR analysis see our new paper that appear in a new data mining book to be published by IGI in Feb, of 2007. More information about the book can be found at the published here and on Amazon here. I would be happy to send you a preprint. We have a more detailed paper under peer review.
Moore, J.H. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Zhu, Davidson (eds.)Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data, IGI, in press (2007).

