Parallel color-coding

George M. Slota, Kamesh Madduri
2015 Parallel Computing  
We present new parallelization and memory-reducing strategies for the graph-theoretic color-coding approximation technique, with applications to biological network analysis. Color-coding is a technique that gives fixed-parameter tractable algorithms for several well-known NP-hard optimization problems. In this work, by efficiently parallelizing steps in color-coding, we create two new biological protein interaction network analysis tools: Fascia for subgraph counting and motif finding and
... h for signaling pathway detection. We demonstrate considerable speedup over prior work, and the optimizations introduced in this paper can also be used for other problems where color-coding is applicable.

... of this strategy, using new data structures and optimizations to reduce peak memory utilization and inter-processor communication. We also create two new software tools, Fascia and FastPath, that use parallel color-coding to solve bioinformatics problems.

The problem of counting the number of occurrences of a template or subgraph within a large graph is commonly termed subgraph counting. This problem is very similar to the classical subgraph isomorphism problem. Related problems, such as subgraph enumeration, tree isomorphism, motif finding, and frequent subgraph identification, are all fundamental graph analysis methods for identifying latent structure in complex data sets. They have applications in bioinformatics [2, 3, 4], chemoinformatics [5], online social network analysis [6], network traffic analysis, and many other areas.

Subgraph counting and enumeration are compute-intensive problems. A naïve algorithm, which exhaustively enumerates all vertices reachable in k hops from a vertex, runs in O(n^k) time, where n is the number of vertices in the network and k is the number of vertices in the subgraph. For large networks, this running time constrains the size of the subgraph (the value of k). If k is larger than 2 or 3, exact counting becomes prohibitively expensive. Thus, there has been considerable recent work on approximation algorithms. Approaches are generally based on sampling or on exploiting network topology. Sampling-based methods analyze a subset of the network and extrapolate counts based on the observed occurrences and network properties. Some tools based on sampling are MFINDER [7], FANMOD [8], and GRAFT [9].
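To make the cost of the exhaustive approach concrete, the following is a minimal sketch of exact counting for the simplest template, a path on k vertices. It is an illustration only, not code from any of the tools named above: the function name `count_paths_exhaustive` and the adjacency-dictionary graph format are assumptions made for this example. The search extends a partial path one neighbor at a time, so the number of partial paths it visits grows exponentially in k and is bounded by O(n^k).

```python
def count_paths_exhaustive(adj, k):
    """Exactly count undirected simple paths on k vertices by brute force.

    adj maps each vertex to an iterable of its neighbors (undirected graph).
    This is a hypothetical helper for illustration only.
    """
    def extend(path, visited):
        # A partial path with k vertices is one complete occurrence.
        if len(path) == k:
            return 1
        total = 0
        for u in adj[path[-1]]:
            if u not in visited:  # keep the path simple
                visited.add(u)
                path.append(u)
                total += extend(path, visited)
                path.pop()
                visited.remove(u)
        return total

    directed = sum(extend([v], {v}) for v in adj)
    # Each undirected path is found once from each end when k >= 2.
    return directed // 2 if k > 1 else directed
```

For example, calling `count_paths_exhaustive({0: [1, 2], 1: [0, 2], 2: [0, 1]}, 3)` counts the 3-vertex paths in a triangle.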
The other class of methods imposes some constraint on the network, or transforms the network, so that the possible search space is restricted. Examples of tools imposing constraints on the network are NEMO [10] and SAHAD [11]. Tools based on the color-coding method belong to the second category, and this forms the basis of our current work.

The color-coding method for this problem uses a dynamic programming scheme to generate an approximate count of a given non-induced tree-structured subgraph/template (also referred to as a treelet) in O(m · 2^k) time, where m is the number of edges in the network. The algorithm can be informally stated as follows: every node in a network is randomly colored with one of at least k possible colors. The number of colorful embeddings of a given input template is then counted, where colorful in this context means that each node in the template embedding has a distinct color. The total embedding count is then scaled by the probability that the template is colorful, in order to generate an approximation for the total number of possible embeddings. This colorful embedding counting scheme avoids the prohibitive O(n^k) bound seen in exhaustive search.

Color-coding can also be applied in an entirely different context. Consider the NP-hard optimization problem [12] of finding the minimum-weight simple path of path length k in a weighted graph with positive edge weights. This problem is of considerable interest in bioinformatics, specifically in the analysis of paths in protein interaction networks. With an appropriately defined edge weight scheme, paths with the minimum weight, or in general close to the minimum weight, often have vertices that belong to biologically significant subgraphs such as signaling networks and metabolic pathways [13, 14]. As in the case of subgraph counting, color-coding can only offer an approximate solution to this NP-hard problem.
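The two uses of color-coding described above can be sketched with one small dynamic program over (vertex, color set) pairs. This is a sequential illustration under stated assumptions, not the Fascia or FastPath implementation: the template is restricted to a path on k vertices (the simplest treelet), exactly k colors are used so that a fixed path is colorful with probability k!/k^k, "path length k" is interpreted as k vertices, and all function names and the dictionary-based graph format are hypothetical.

```python
import math
from collections import defaultdict


def count_colorful_paths(adj, color, k):
    """Count undirected paths on k vertices whose vertex colors are all distinct.

    adj maps each vertex to an iterable of neighbors (undirected graph);
    color maps each vertex to an integer in range(k). Hypothetical helper.
    """
    # table[v][mask] = number of colorful paths ending at v that use exactly
    # the colors in the bitmask `mask`. Distinct colors imply a simple path.
    table = {v: {1 << color[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: defaultdict(int) for v in adj}
        for v in adj:
            bit = 1 << color[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # extending by v keeps the path colorful
                        nxt[v][mask | bit] += cnt
        table = nxt
    directed = sum(sum(row.values()) for row in table.values())
    # Each undirected path is found once from each end when k >= 2.
    return directed // 2 if k > 1 else directed


def estimate_path_count(adj, k, iterations, rng):
    """Average the scaled colorful counts over several random colorings."""
    p_colorful = math.factorial(k) / k**k  # assumes exactly k colors
    total = 0.0
    for _ in range(iterations):
        color = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, color, k) / p_colorful
    return total / iterations


def min_weight_colorful_path(wadj, color, k):
    """Return the minimum weight of a colorful path on k vertices, or None.

    wadj maps each vertex to a dict {neighbor: positive edge weight}.
    Same recurrence as the counting DP, with min in place of sum.
    """
    best = {v: {1 << color[v]: 0.0} for v in wadj}
    for _ in range(k - 1):
        nxt = {v: {} for v in wadj}
        for v in wadj:
            bit = 1 << color[v]
            for u, w in wadj[v].items():
                for mask, cost in best[u].items():
                    new = mask | bit
                    if not mask & bit and cost + w < nxt[v].get(new, math.inf):
                        nxt[v][new] = cost + w
        best = nxt
    weights = [c for row in best.values() for c in row.values()]
    return min(weights) if weights else None
```

A caller would pass a seeded generator such as `random.Random(0)` as `rng` and increase `iterations` to tighten the estimate. For the path problem, a single coloring can miss the true minimum-weight path if that path is not colorful, so in practice the search is repeated over many random colorings and the best result is kept.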
With some confidence and error bounds, color-coding is guaranteed to return simple paths with weight close to the minimum path weight. The low-weight paths returned through color-coding are shown to be good candidates for signaling pathways [12]. We present a shared-memory parallelization of the approximate low-weight path enumeration strategy.

In general, color-coding can be applied to find any subgraph with bounded tree-width in polynomial time, by executing the color-coding algorithm with the tree decomposition of the subgraph [1]. However, in this work, we only consider finding treelets, which are subgraphs with a tree-width of 1. Another application of color-coding that is not included in this work is finding cycles of length k. We also note that all algorithms using color-coding can be derandomized by using families of perfect hash functions. However, we ob-
doi:10.1016/j.parco.2015.02.004 fatcat:hbut56hljnalzmjdfzw5aof7eq