Redesigning the string hash table, burst trie, and BST to exploit cache

Nikolas Askitis, Justin Zobel
2010 ACM Journal of Experimental Algorithmics  
A key decision when developing in-memory computing applications is the choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of the binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not an efficient use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.

ACM Reference Format: Askitis, N. and Zobel, J. 2011. Redesigning the string hash table, burst trie, and BST to exploit cache. ACM Journal of Experimental Algorithmics.

… be fetched simultaneously and subsequent fetches have high spatial locality. We illustrate the structural differences between standard, clustered, compact, and array-based linked lists in Figure 2. While our proposal, which consists of the elementary step of dumping every string in a list into a contiguous array, might be seen as simplistic, it is nonetheless attractive in the context of current architectures. A potential disadvantage of using arrays is that, whenever a string is inserted, the array must be dynamically resized. As a consequence, a dynamic array can be computationally expensive to access. However, it is also cache-efficient, which can make the array method dramatically faster to access in practice, while simultaneously reducing space through pointer elimination. We experimentally evaluate the effectiveness of our pointer-elimination techniques, using sets of up to 178 million strings. We apply our compact chains to the standard hash table, burst trie, BST, splay tree, and red-black tree, forming compact variants.
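The contiguous-array idea can be sketched in a few lines of C. This is a minimal illustration, not the authors' implementation: the layout (each string stored as a 1-byte length prefix followed by its characters, with a zero length byte terminating the array) and the names `bucket_contains` and `bucket_append` are hypothetical. Search becomes one linear scan with high spatial locality and no pointer chasing; insertion pays the resizing cost noted above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical bucket layout: a contiguous byte array in which each
 * string is stored as a 1-byte length followed by its characters, and a
 * zero length byte terminates the bucket (so keys are limited to 255
 * bytes in this sketch). */

/* Search: a single linear scan over contiguous memory. */
static int bucket_contains(const unsigned char *bucket, const char *key)
{
    size_t klen = strlen(key);
    while (*bucket) {                       /* zero length terminates */
        size_t len = *bucket++;
        if (len == klen && memcmp(bucket, key, len) == 0)
            return 1;
        bucket += len;                      /* skip to the next string */
    }
    return 0;
}

/* Insert: grow the array and append the new string; dynamic resizing is
 * the computational price paid for cache efficiency and the space saved
 * by eliminating per-node pointers. */
static unsigned char *bucket_append(unsigned char *bucket, size_t *size,
                                    const char *key)
{
    size_t klen = strlen(key);
    size_t old = *size ? *size : 1;         /* bytes incl. terminator */
    bucket = realloc(bucket, old + klen + 1);
    bucket[old - 1] = (unsigned char)klen;  /* overwrite old terminator */
    memcpy(bucket + old, key, klen);
    bucket[old + klen] = 0;                 /* new terminator */
    *size = old + klen + 1;
    return bucket;
}
```

Compared with a linked list of nodes, this removes one pointer (and one likely cache miss) per string, at the cost of copying on growth.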
We then replace the chains of the hash table, burst trie, and BST with dynamic arrays, creating new cache-conscious array representations called the array hash, array burst trie, and array BST, respectively. On practical machines with reasonable choices for parameters, our analysis shows that for all array-based data structures where the array size is bounded, as in the BST and the burst trie, the analytical costs are equivalent to those of their standard and compact-chain representations. For all the data structures, the expected cache costs of their array representations are superior to those of their standard and compact equivalents. This holds even for the array hash, where on update the asymptotic costs are greater than those of its chaining equivalents, yet in practice greater efficiency is observed.

Clustering is arguably one of the best methods available for improving the cache efficiency of pointer-intensive data structures. However, to the best of our knowledge, there has yet to be a practical evaluation of its effectiveness on string data structures. We experimentally compare our compact and array data structures against a clustered hash table, burst trie, and BST. Our experiments measure the time, space, and cache misses incurred by our compact, clustered, and array data structures for the task of inserting and searching large sets of strings. Our baseline consists of a set of current state-of-the-art (standard-chained) data structures: a BST, a TST, a splay tree, a red-black tree, a hash table, a burst trie, the adaptive trie [Acharya et al. 1999], and the Judy data structure. Judy is a trie-based hybrid data structure, composed from a set of existing data structures [Baskins 2004; Silverstein 2002; Hewlett-Packard 2001]. Our results show that, in an architecture with cache, our array data structures can yield startling improvements over their standard, compact, and clustered chained variants.
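To make the array hash concrete, here is a self-contained C sketch under stated assumptions: the table size, hash function, and names (`ah_insert`, `ah_search`) are illustrative and not taken from the paper. Each slot holds one contiguous, resizable array of length-prefixed strings rather than a linked chain of nodes, so a lookup touches a single slot pointer and then scans contiguous memory.

```c
#include <stdlib.h>
#include <string.h>

#define NSLOTS 1024u            /* hypothetical fixed table size */

static unsigned char *slots[NSLOTS];    /* NULL = empty slot; otherwise a
                                           zero-length-terminated array of
                                           length-prefixed strings */

static unsigned hash_str(const char *s) /* simple multiplicative hash,
                                           for illustration only */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NSLOTS;
}

static int ah_search(const char *key)
{
    const unsigned char *p = slots[hash_str(key)];
    size_t klen = strlen(key);
    if (!p)
        return 0;
    while (*p) {                        /* scan one cache-friendly array */
        size_t len = *p++;
        if (len == klen && memcmp(p, key, len) == 0)
            return 1;
        p += len;
    }
    return 0;
}

static void ah_insert(const char *key)
{
    unsigned h = hash_str(key);
    size_t klen = strlen(key), old = 1; /* 1 byte for the terminator */
    if (ah_search(key))
        return;                         /* keep strings unique */
    if (slots[h]) {                     /* count existing bytes in slot */
        const unsigned char *p = slots[h];
        while (*p) {
            old += 1 + *p;
            p += 1 + *p;
        }
    }
    /* Grow-on-update: asymptotically costlier than appending a chain
     * node, but cache-efficient and pointer-free in practice. */
    slots[h] = realloc(slots[h], old + klen + 1);
    slots[h][old - 1] = (unsigned char)klen;
    memcpy(slots[h] + old, key, klen);
    slots[h][old + klen] = 0;
}
```

A chained hash table would instead allocate a node per string and follow a pointer per comparison; here the per-string pointer overhead disappears, which is how the space savings reported below arise.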
A search for around 28 million unique strings on a standard hash table that contains these strings, for example, can require over 2,300 seconds (using 2^15 slots) to complete, while the table occupies almost 1GB of memory. The equivalent array hash table, however, required less than 80 seconds to search while using less than a third of the space, that is, simultaneously saving around 97% in time and around 70% in space. Although this is an artificial case (we would typically allocate more slots in practice), it highlights that random access to memory is highly inefficient, and that the array hash can scale well in situations where the number of keys is not known in advance. Similar savings were obtained for insertion. Hence, despite the high computational costs of growing arrays, our results demonstrate that cache efficiency more than compensates. The array burst trie demonstrated similar improvements, being up to 74% faster to search and insert strings, while maintaining a consistent and simultaneous reduction in space of up to 80%. The array BST also displayed similar behavior, being up to 30% faster to build and search using 28 million unique strings, while requiring less than 50% of the space used by the equivalent standard BST. The splay tree,
doi:10.1145/1671970.1921704