Does Complexity Matter?
Our paper on the development and evaluation of a computational evolution system for solving complex problem in human genetics will be published by Springer as a chapter in a new book that will be out later this year or early in 2009. I just received the page proofs. Please email me if you would like a preprint. The book (details below) is a set of contributed chapters by attendees of the Genetic Programming Theory and Practice (GPTP) Workshop held by the Center for the Study of Complex Systems at the University of Michigan this past spring. This work builds on the work discussed in my Dec. 23, 2007 blog post.
Moore, J.H., Greene, C.S., Andrews, P., White, B.C. Does complexity matter? Artificial evolution, computational evolution and the genetic analysis of common human diseases. In: Genetic Programming Theory and Practice VI, in press, Springer (2008).
Common human diseases are complex and likely the result of nonlinear interactions between multiple different DNA sequence variations. One goal of human genetics is to use data mining and machine learning methods to identify sets of discrete genetic attributes that are predictive of discrete measures of health in human population data. A variety of different computational intelligence methods based on artificial evolution have been developed and applied in this domain. While artificial evolution approaches such as genetic programming show promise, they
are only loosely based on real biological and evolutionary processes. It has recently been suggested that a new paradigm is needed where “artificial evolution” is transformed to “computational evolution” by incorporating more biological and evolutionary complexity into existing algorithms. Computational evolution systems have been proposed as more likely to solve problems of interest to biologists and biomedical researchers. To test this hypothesis, we developed a prototype computational evolution system for the analysis of human genetics
data capable of evolving operators of arbitrary complexity. Preliminary results suggest that more complex operators result in better solutions. Here we introduce modifications including a simpler stack-based solution representation, the ability to maintain and use an archive of solution building blocks, and a simpler set of solution operator building blocks capable of learning to use pre-processed expert knowledge. A parameter sweep suggests that operators that can use expert knowledge or archival information outperform those that cannot. This study supports the idea that complexity matters and thus the consideration of computational evolution for bioinformatics problem-solving in the domain of human genetics.
Figure 1. Visual overview of our computational evolution system for discovering symbolic discriminant functions that differentiate disease subjects fromhealthy subjects using information about single nucleotide polymorphisms (SNPs). The hierarchical structure is shown on the left while some specific examples at each level are shown in the middle. At the lowest level (D) is a grid of solutions. Each solution consists of a list of functions and their arguments (e.g. X1 is an attribute) that are evaluated using a stack (denoted by ST in the solution). The next level up (C) is a grid of solution operators that each consists of some combination of the ADD, DELETE and COPY functions each with their respective set of probabilities that define whether expert knowledge from ReliefF (denoted by E in the probability pie) or the archive (denoted by A in the probability pie) are used instead of a random generator (denoted by R in the probability pie). ReliefF scores are derived by pre-processing the data (E) and are stored for use by the system (F). The attribute archive (G) is derived from the frequency with which each attribute occurs among solutions in the population. The top two levels of the hierarchy (A and B) exist to generate variability in the operators that modify the solutions. This system allows operators of arbitrary complexity to modify solutions. Note that we used 18x18 and 36x36 grids of 324 and 1296 solutions, respectively, in the present study. A 12x12 grid is shown here as an example.