Constructing Antidictionaries in Output-Sensitive Space

Lorraine A.K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Heliou, Solon P. Pissis
2019 2019 Data Compression Conference (DCC)  
A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y 1 , y 2 , . . . , y k over an alphabet Σ, we are asked to compute the set M y 1 #...#y k of minimal absent words of length at most of word y = y 1 #y 2 # . . . #y k , # / ∈ Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This
more » ... ation generally requires Ω(n) space for n = |y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when || M y 1 #...#y N || = o(n), for all N ∈ [1, k]. For instance, in the human genome, n ≈ 3 × 10 9 but || M 12 y 1 #...#y k || ≈ 10 6 . We consider a constant-sized alphabet for stating our results. We show that all M y 1 , . . . , M y 1 #...#y k can be computed in O(kn + k N =1 || M y 1 #...#y N ||) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y 1 , . . . , y k } and MaxOut = max{|| M y 1 #...#y N || : N ∈ [1, k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
doi:10.1109/dcc.2019.00062 dblp:conf/dcc/AyadBFHP19 fatcat:dx43qqxnbvbhvcuk5dbr64z5su