Practical perfect hashing in nearly optimal space

Fabiano C. Botelho, Rasmus Pagh, Nivio Ziviani
2013 Information Systems  
A hash function is a mapping from a key universe U to a range of integers, i.e., h : U → {0, 1, ..., m − 1}, where m is the range's size. A perfect hash function for some set S ⊆ U is a hash function that is one-to-one on S, where m ≥ |S|. A minimal perfect hash function for some set S ⊆ U is a perfect hash function with a range of minimum size, i.e., m = |S|. This paper presents a construction for (minimal) perfect hash functions that combines theoretical analysis, practical performance, expected linear construction time and nearly optimal space consumption for the data structure. For n keys and m = n the space consumption ranges from 2.62n to 3.3n bits, and for m = 1.23n it ranges from 1.95n to 2.7n bits. This is within a small constant factor from the theoretical lower bounds of 1.44n bits for m = n and 0.89n bits for m = 1.23n. We combine several theoretical results into a practical solution that has turned perfect hashing into a very compact data structure to solve the membership problem when the key set S is static and known in advance. By taking into account the memory hierarchy we can construct (minimal) perfect hash functions for over a billion keys in 46 minutes using a commodity PC. An open source implementation of the algorithms is available at http://cmph.sf.net under the GNU Lesser General Public License (LGPL).
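The definitions and the lower bounds quoted in the abstract can be checked with a small sketch. It is not the paper's construction: `brute_force_mphf` and `seeded_hash` are hypothetical helpers that simply try SHA-256 seeds until one happens to be one-to-one, which is only feasible for a handful of keys. `lower_bound_bits_per_key` uses the standard asymptotic bound log2(e) · (1 + (m/n − 1) ln(1 − n/m)) bits per key, assuming |U| is much larger than m and n is large.

```python
import hashlib
import math


def is_perfect(h, keys, m):
    """True if h maps the keys one-to-one into {0, ..., m - 1}."""
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """True if h is perfect with the smallest possible range, m = |S|."""
    return is_perfect(h, keys, len(keys))


def seeded_hash(seed, m):
    """Hypothetical helper: a seeded hash into {0, ..., m - 1} via SHA-256."""
    def h(key):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % m
    return h


def brute_force_mphf(keys, max_seeds=100_000):
    """Toy search (NOT the paper's algorithm): try seeds until one is an MPHF."""
    for seed in range(max_seeds):
        h = seeded_hash(seed, len(keys))
        if is_minimal_perfect(h, keys):
            return seed, h
    raise RuntimeError("no seed found; brute force only suits tiny key sets")


def lower_bound_bits_per_key(load):
    """Asymptotic lower bound in bits per key for a range of size m = load * n."""
    if load < 1:
        raise ValueError("a perfect hash function needs m >= n")
    log2_e = math.log2(math.e)
    if load == 1:
        return log2_e  # the (load - 1) * ln(...) term vanishes in the limit
    return log2_e * (1 + (load - 1) * math.log(1 - 1 / load))


if __name__ == "__main__":
    keys = ["alpha", "beta", "gamma", "delta", "epsilon"]
    seed, h = brute_force_mphf(keys)
    print("seed:", seed, "values:", sorted(h(k) for k in keys))

    # Space figures (bits per key) as reported in the abstract.
    for load, low, high in [(1.0, 2.62, 3.3), (1.23, 1.95, 2.7)]:
        bound = lower_bound_bits_per_key(load)
        print(f"m = {load}n: bound {bound:.2f}, "
              f"reported range is {low / bound:.2f}x to {high / bound:.2f}x the bound")
```

Rounded to two decimals, the bound evaluates to 1.44 bits per key for m = n and 0.89 bits per key for m = 1.23n, matching the figures in the abstract.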
doi:10.1016/j.is.2012.06.002 fatcat:hn6si7ptnrgylfgvkgxlpsebai