FMCS: a novel algorithm for the multiple MCS problem

Andrew Dalke, Janna Hastings
2013 Journal of Cheminformatics  
Clustering and classification of large-scale chemical data are essential for navigation, analysis and knowledge discovery in a wide variety of chemical application domains. The maximum common structure (MCS) for a group of compounds is an important element of such classification, providing insight into activity patterns and enabling scaffold alignment for a more consistent 2D depiction. Most modern, exact MCS implementations use back-tracking [1] or clique detection [2], and handle the MCS problem by recursive reduction to successive pairwise maximal common substructure searches [3]. We present fmcs, which implements a novel multiple MCS algorithm based on subgraph enumeration and subgraph isomorphism testing [4, 5], with algorithmic improvements and heuristics that make it competitive with the standard methods. MCS performance evaluation is very sensitive to the test set, so we have developed several reference benchmarks based on ChEMBL-13, including randomly selected pairs of structures, and randomly selected structures with their k = 2, k = 10, and k = 100 nearest neighbors. We use these benchmarks to compare fmcs to SMSD [6] and Indigo's scaffold detector [7]. Most differences are due to chemistry perception and timeout errors. fmcs is written in Python using the RDKit C++ toolkit [8], and its performance is currently between 0.3x and 1.2x that of the Indigo implementation in C++. We also cross-validated the fmcs algorithm with the manually curated ChEBI structure ontology classification [9] and characterized the differences. We identified limitations with fmcs, such as with tautomer perception and structural classes that fmcs cannot handle, and problems with ChEBI, such as misclassifications and classifications that are not, structurally speaking, strictly hierarchical.
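The general idea of solving multiple MCS by subgraph enumeration plus subgraph isomorphism testing can be illustrated with a small brute-force sketch. This is not the fmcs implementation: it omits the improvements and heuristics the abstract refers to, it assumes "largest" means the most bonds, and it represents molecules as plain atom lists and bond dictionaries rather than RDKit objects. The names `make_mol`, `is_connected`, `embeds` and `multiple_mcs` are hypothetical, and the search is exponential, so it is only suitable for toy inputs.

```python
from itertools import combinations


def make_mol(atoms, bonds):
    """Build a toy molecule: atoms is a list of element symbols,
    bonds is an iterable of (i, j, order) tuples."""
    return atoms, {frozenset((i, j)): order for i, j, order in bonds}


def is_connected(bond_subset):
    """True if the bonds (frozensets of two atom indices) form one piece."""
    bond_list = list(bond_subset)
    seen = set(bond_list[0])
    changed = True
    while changed:
        changed = False
        for bond in bond_list:
            if bond & seen and not bond <= seen:
                seen |= bond
                changed = True
    return all(bond <= seen for bond in bond_list)


def embeds(q_atoms, sub_bonds, target):
    """Backtracking subgraph isomorphism test: can the bonds in sub_bonds
    (indexed by query atoms) be mapped onto the target molecule?"""
    t_atoms, t_bonds = target
    order = sorted(set().union(*sub_bonds))  # query atoms used by the pattern

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each pattern bond from q to an already-mapped atom must
            # exist in the target with the same bond order.
            ok = all(
                t_bonds.get(frozenset((t, mapping[other]))) == bond_order
                for bond, bond_order in sub_bonds.items()
                if q in bond
                for other in bond - {q}
                if other in mapping
            )
            if ok:
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def multiple_mcs(mols):
    """Enumerate connected bond subsets of the first molecule, largest
    first, and return the first one found in every other molecule."""
    q_atoms, q_bonds = mols[0]
    bonds = list(q_bonds)
    for size in range(len(bonds), 0, -1):
        for subset in combinations(bonds, size):
            if not is_connected(subset):
                continue
            sub = {bond: q_bonds[bond] for bond in subset}
            if all(embeds(q_atoms, sub, mol) for mol in mols[1:]):
                return sub
    return {}


# Toy heavy-atom graphs: ethanol (CCO), 1-propanol (CCCO), dimethyl ether (COC).
ethanol = make_mol(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
propanol = make_mol(["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
ether = make_mol(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
common = multiple_mcs([ethanol, propanol, ether])
```

Because every candidate is tested against all molecules at once, the sketch needs no recursive reduction to pairwise searches; the cost is the number of candidate subgraphs, which is why a practical implementation depends on pruning and heuristics.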
doi:10.1186/1758-2946-5-s1-o6 pmcid:PMC3606201 fatcat:xx3p45wc7nf6hfpht5lgvfzdvu