Binary Jumbled Pattern Matching on Trees and Tree-Like Structures

Travis Gagie, Danny Hermelin, Gad M. Landau, Oren Weimann
2014 Algorithmica  
Binary jumbled pattern matching asks to preprocess a binary string S in order to answer queries (i, j) which ask for a substring of S that is of length i and has exactly j 1-bits. This problem naturally generalizes to vertex-labeled trees and graphs by replacing "substring" with "connected subgraph". In this paper, we give an $O(n^2 / \log^2 n)$-time solution for trees, matching the currently best bound for (the simpler problem of) strings. We also give an $O(g^{2/3} n^{4/3} / (\log n)^{4/3})$-time solution for strings that are compressed by a context-free grammar of size g in Chomsky normal form. This solution improves the known bounds when the string is compressible under many popular compression schemes. Finally, we prove that on graphs the problem is fixed-parameter tractable with respect to the treewidth w of the graph, even for a constant number of different vertex-labels, thus improving the previous best $n^{O(w)}$ algorithm.

Algorithmica (2015) 73:571-588

Introduction

Jumbled pattern matching is an important variant of classical pattern matching with several applications in computational biology, ranging from alignment [5] and SNP discovery [7] to the interpretation of mass spectrometry data [10] and metabolic network analysis [26]. In the most basic case of strings, the problem asks to determine whether a given pattern P can be rearranged so that it appears in a given text T. That is, whether T contains a substring of length |P| where each letter of the alphabet occurs the same number of times as in P. Using a straightforward sliding window algorithm, such a jumbled occurrence can be found optimally in $O(n)$ time on a text of length n.

While jumbled pattern matching has a simple efficient solution, its indexing problem is much more challenging. In the indexing problem, we preprocess a given text T so that on queries P we can determine quickly whether T has a jumbled occurrence of P. Very little is known about this problem besides the naive solution.

Most of the interesting results on indexing for jumbled pattern matching relate to binary strings (where a query pattern (i, j) asks for a substring of T that is of length i and has j 1s). Given a binary string of length n, Cicalese, Fici and Lipták [14] showed how one can build in $O(n^2)$ time an $O(n)$-space index that answers jumbled pattern matching queries in $O(1)$ time.
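The two tasks described so far can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function names (`has_jumbled_occurrence`, `build_binary_index`, `query_binary_index`) are hypothetical, and the index construction shown is a plain quadratic one that records, for each length i, the minimum and maximum number of 1s over all substrings of that length. It is not claimed to match the construction of [14] in detail. The constant-time query test relies on the interval property of binary strings that the text turns to next.

```python
from collections import defaultdict


def has_jumbled_occurrence(text: str, pattern: str) -> bool:
    """Sliding window: does some substring of `text` have exactly the
    letter counts of `pattern`? Each step costs O(1) dictionary updates."""
    m, n = len(pattern), len(text)
    if m == 0:
        return True
    if m > n:
        return False

    # diff[c] = (occurrences of c in the window) - (occurrences of c in pattern)
    diff = defaultdict(int)
    nonzero = 0  # number of letters c with diff[c] != 0

    def bump(c: str, delta: int) -> None:
        nonlocal nonzero
        if diff[c] == 0:
            nonzero += 1
        diff[c] += delta
        if diff[c] == 0:
            nonzero -= 1

    for c in pattern:
        bump(c, -1)
    for c in text[:m]:
        bump(c, +1)
    if nonzero == 0:
        return True
    for k in range(m, n):
        bump(text[k], +1)      # letter entering the window
        bump(text[k - m], -1)  # letter leaving the window
        if nonzero == 0:
            return True
    return False


def build_binary_index(s: str):
    """Naive O(n^2)-time, O(n)-space index for a binary string `s`:
    lo[i] / hi[i] = min / max number of 1s over all substrings of length i."""
    n = len(s)
    bits = [int(c) for c in s]
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for i in range(1, n + 1):
        ones = sum(bits[:i])  # count of 1s in the first window of length i
        lo[i] = hi[i] = ones
        for k in range(i, n):
            ones += bits[k] - bits[k - i]  # slide the window by one position
            lo[i] = min(lo[i], ones)
            hi[i] = max(hi[i], ones)
    return lo, hi


def query_binary_index(index, i: int, j: int) -> bool:
    """O(1) query: is there a substring of length i with exactly j 1s?"""
    lo, hi = index
    return 0 <= i < len(lo) and lo[i] <= j <= hi[i]
```

For the string 10110, the table gives a minimum of 1 and a maximum of 2 ones among substrings of length 2, so the query (2, 2) succeeds while (2, 0) fails.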
The key observation of Cicalese, Fici and Lipták [14] was that if one substring of length i contains fewer than j 1s, and another substring of length i contains more than j 1s, then there must be a substring of length i with exactly j 1s. Using this observation, they construct an index that stores the maximum and minimum number of 1s in any i-length substring, for each possible i. Burcsi et al. [10] (see also [11, 12]) and Moosa and Rahman [27] independently improved the construction time to $O(n^2 / \log n)$, then Moosa and Rahman [28] further improved it to $O(n^2 / \log^2 n)$ in the word RAM model. Currently, algorithms faster than $O(n^2 / \log^2 n)$ exist only when the string compresses well using run-length encoding [4, 23] or when we are willing to settle for approximate indices [16].

Regarding non-binary alphabets, the recent solution of Kociumaka et al. [25] for constant alphabets requires $o(n^2)$ space and $o(n)$ query time. For general alphabets, expected sublinear query time was achieved by Burcsi et al. [11] for large query patterns, but in the worst case a query takes superlinear time. In fact, a recent result of Amir et al. [3] shows that under a 3SUM hardness assumption, jumbled indexing for alphabets of size $\omega(1)$ requires either $\Omega(n^{2-\varepsilon})$ preprocessing time or $\Omega(n^{1-\delta})$ query time for any $\varepsilon, \delta > 0$.

The natural extension of jumbled pattern matching from strings to trees is much harder. In this extension, we are asked to determine whether a vertex-labeled input tree has a connected subgraph where each label occurs the same number of times as specified by the input query. The difficulty here stems from the fact that a tree can have exponentially many connected subgraphs, whereas a string has only quadratically many substrings. Hence, a sliding
doi:10.1007/s00453-014-9957-6 fatcat:oamaqdg5rzewfgkzs3wzfckexm