Efficient text fingerprinting via Parikh mapping

Amihood Amir, Alberto Apostolico, Gad M. Landau, Giorgio Satta
2003 Journal of Discrete Algorithms  
We consider the problem of fingerprinting text by sets of symbols. Specifically, if S is a string of length n over a finite, ordered alphabet Σ, and S' is a substring of S, then the fingerprint of S' is the subset φ of Σ consisting of precisely the symbols appearing in S'. In this paper we show efficient methods of answering various queries on fingerprint statistics. Our preprocessing is done in time O(n|Σ| log n log |Σ|) and enables answering the following queries: (1) Given an integer k, compute the number of distinct fingerprints of size k in time O(1). (2) Given a set φ ⊆ Σ, compute the total number of distinct occurrences in S of substrings with fingerprint φ in time O(|Σ| log n).
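
To make the definition and the two queries concrete, here is a minimal brute-force sketch in Python. It is not the paper's method: it enumerates all O(n^2) substrings directly and comes nowhere near the stated preprocessing or query bounds. The function names (fingerprint_table, count_fingerprints_of_size, count_occurrences) are hypothetical, and an "occurrence" is assumed to mean a (start, end) position pair in S.

```python
from collections import Counter


def fingerprint_table(s):
    """Map each fingerprint (a frozenset of symbols) to the number of
    (start, end) substring occurrences of s that have that fingerprint.

    Brute force for illustration only: O(n^2) substrings are visited.
    """
    table = Counter()
    for start in range(len(s)):
        seen = set()
        for end in range(start, len(s)):
            # Extending the substring by one symbol can only grow its fingerprint.
            seen.add(s[end])
            table[frozenset(seen)] += 1
    return table


def count_fingerprints_of_size(table, k):
    """Query (1): number of distinct fingerprints containing exactly k symbols."""
    return sum(1 for fp in table if len(fp) == k)


def count_occurrences(table, phi):
    """Query (2): number of substring occurrences whose fingerprint equals phi."""
    return table.get(frozenset(phi), 0)


table = fingerprint_table("abca")
print(count_fingerprints_of_size(table, 2))  # fingerprints {a,b}, {b,c}, {a,c}
print(count_occurrences(table, "abc"))       # substrings "abc", "abca", "bca"
```

For S = "abca", the sketch finds three distinct fingerprints of size 2 and three occurrences of substrings with fingerprint {a, b, c}.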
doi:10.1016/s1570-8667(03)00035-2 fatcat:6lf4wqifw5ernl4eiiw4mymy6e