Linear approximation of shortest superstrings

Avrim Blum, Tao Jiang, Ming Li, John Tromp, Mihalis Yannakakis
1991 Proceedings of the twenty-third annual ACM symposium on Theory of computing - STOC '91  
We consider the following problem: given a collection of strings s_1, ..., s_m, find the shortest string s such that each s_i appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of distinct strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAX SNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
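The greedy procedure described in the abstract (repeatedly merge the pair of distinct strings with maximum overlap) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the names `overlap` and `greedy_superstring` are hypothetical, ties are broken by scan order (an assumption, since the abstract does not specify a rule), and strings contained in other strings are discarded up front, which is a common simplifying assumption. It does not implement the modified 3n variant.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Greedy merge sketch: returns some superstring of all inputs."""
    # Assumption: drop duplicates and any string already contained in another,
    # since such strings are covered for free by their containing string.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(pool) > 1:
        # Find the ordered pair (i, j), i != j, with maximum overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties: first pair found wins (assumption)
                        best_k, best_i, best_j = k, i, j

        # Merge the pair by gluing b onto a, sharing the overlapping part.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` first merges "abc" and "bcd" (overlap 2) into "abcd", then merges "abcd" with "cde" (overlap 2), giving "abcde". This naive version recomputes all pairwise overlaps in every round; it is meant to show the merge rule, not to be efficient.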
doi:10.1145/103418.103455 dblp:conf/stoc/BlumJLTY91 fatcat:rrhuzscoezfe7o6vh42lte2bvy