Evolutionary Computation and QSAR Research

Vanessa Aguiar-Pulido, Marcos Gestal, Maykel Cruz-Monteagudo, Juan Rabunal, Julian Dorado, Cristian Munteanu
2013 Current Computer-Aided Drug Design
The successful high-throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. Virtual molecular filtering and screening rely greatly on quantitative structure-activity relationship (QSAR) analysis, which builds a mathematical model that correlates the activity of a molecule with its molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering ... l libraries, eliminating candidates with a predicted toxic effect and poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning, or artificial intelligence. QSAR modeling relies on three main steps: encoding the molecular structure as molecular descriptors, selecting the variables relevant to the analyzed activity, and searching for the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, the selection methods for high-dimensional data in QSAR, the methods to build QSAR models, the current evolutionary feature selection methods and their applications in QSAR, and the future trend toward joint or multi-task feature selection methods.
... widely accepted assumption points toward human cognitive activity: humans have a higher intellectual capacity.
Genetic Algorithms and Genetic Programming
Evolutionary Computation (EC) is inspired by Darwin's theory of evolution. EC uses the concepts of evolution and genetics to provide solutions to problems, and it performs particularly well on optimization tasks [35-37]. However, the work of Arthur Samuel and Alan Turing in the 1950s on whether machines are capable of thinking, and whether computers are capable of learning to solve problems without being explicitly programmed, must also be taken into account.
Broadly speaking, EC methods can be defined as search and optimization techniques that apply heuristic rules based on the principles of natural evolution, that is, algorithms that look for solutions using properties of genetics and evolution. Among these properties, two become particularly relevant: the survival of the fittest individuals (which implies that the best solutions to a problem are maintained once they are found) and heterogeneity (a basic level of diversity, so that the algorithms have multiple types of information available when generating solutions).
Evolutionary Algorithm Performance
Evolutionary algorithms follow a relatively simple outline, as shown in Fig. (1). The algorithm iteratively refines solutions, progressively approaching a definitive solution to the problem to be solved.
Fig. (1). General outline of performance of an evolutionary algorithm.
Before implementing the evolutionary process itself, two important decisions must be made: one regarding the encoding, that is, how solutions will be represented, and the other regarding what is known as the fitness function, which determines how good a solution is. For the encoding, there are mainly two options (see Fig. (2)): using a list of values (integer, real, bits, etc.) of fixed or variable length, or using a tree-shaped representation (in which the leaves usually represent the values and the intermediate nodes represent the operators). The encoding strategy chosen determines the technique used: Genetic Algorithm (GA) [39] for lists of values or Genetic Programming (GP) [40, 41] for trees. Regardless of the technique used, the solutions obtained by the algorithms are known as genetic individuals. Each component (a gene for GA or a leaf node for GP) represents one of the variables or parameters involved in the problem.
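The outline and the two design decisions described above (encoding and fitness function) can be illustrated with a minimal sketch of a GA using a fixed-length bit-list encoding, in which each gene marks whether a descriptor is selected. This is not the authors' implementation: the names `fitness`, `run_ga`, and `relevant`, the parameter values, and the choice of tournament selection, one-point crossover, and bit-flip mutation are all assumptions made for illustration, and the toy fitness function merely stands in for a real measure of QSAR model quality.

```python
import random


def fitness(individual, relevant):
    """Toy fitness (assumption): reward relevant descriptors, penalise extras."""
    hits = sum(1 for i, bit in enumerate(individual) if bit and i in relevant)
    extras = sum(1 for i, bit in enumerate(individual) if bit and i not in relevant)
    return hits - 0.5 * extras


def tournament(population, scores, rng, k=3):
    """Pick the fittest of k randomly chosen individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of one parent, tail of the other."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def run_ga(n_genes=12, relevant=frozenset({1, 4, 7}), pop_size=30,
           generations=40, mutation_rate=0.05, seed=0):
    """Return the best individual and the best fitness per generation."""
    rng = random.Random(seed)
    # Encoding: each individual is a fixed-length list of bits (one per descriptor).
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind, relevant) for ind in population]
        best_idx = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best_idx])
        # Survival of the fittest: the best individual is copied unchanged.
        children = [population[best_idx][:]]
        # Heterogeneity: crossover and mutation keep the population diverse.
        while len(children) < pop_size:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            children.append(mutate(crossover(a, b, rng), rng, mutation_rate))
        population = children
    best = max(population, key=lambda ind: fitness(ind, relevant))
    history.append(fitness(best, relevant))
    return best, history


if __name__ == "__main__":
    best, history = run_ga()
    print("best individual:", best)
    print("best fitness per generation:", history)
```

Because the best individual is always carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next. A GP variant would replace the bit list with a tree whose leaves hold values and whose internal nodes hold operators, and would need crossover and mutation operators that act on subtrees.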
doi:10.2174/1573409911309020006 pmid:23700999 fatcat:zsipotcovzhlhg7wrgjfkc5wvu