Lower bounds for processing data with few random accesses to external memory
Martin Grohe, André Hernich, Nicole Schweikardt
2009
Journal of the ACM
We consider a scenario where we want to query a large dataset that is stored in external memory and does not fit into main memory. The most constrained resources in such a situation are the size of the main memory and the number of random accesses to external memory. We note that sequentially streaming data from external memory through main memory is much less prohibitive. We propose an abstract model of this scenario in which we restrict the size of the main memory and the number of random accesses to external memory, but admit arbitrary sequential access. A distinguishing feature of our model is that it allows the usage of unlimited external memory for storing intermediate results, such as several hard disks that can be accessed in parallel. In this model, we prove lower bounds for the problem of sorting a sequence of strings (or numbers), the problem of deciding whether two given sets of strings are equal, and two closely related decision problems. Intuitively, our results say that there is no algorithm for the problems that uses internal memory space bounded by N^{1−ε} and at most o(log N) random accesses to external memory, but unlimited "streaming access", both for writing to and reading from external memory. (Here N denotes the size of the input and ε is an arbitrary constant greater than 0.) We even permit randomized algorithms with one-sided bounded error. We also consider the problem of evaluating database queries and prove similar lower bounds for evaluating relational algebra queries against relational databases and XQuery and XPath queries against XML databases.
We prove lower bounds for various algorithmic problems, including sorting and query answering, in a streaming model with auxiliary external memory. Our model is a natural extension of a model introduced in [Grohe et al. 2007] to the setting with auxiliary external memory devices. Recall that the two most significant cost measures in this setting are the number of random accesses to external memory and the size of the internal memory. The model is based on a standard multi-tape Turing machine. Some of the tapes of the machine, among them the input tape, represent the external memory. They are unrestricted in size, but access to these tapes is restricted by allowing only a certain number r(N) (where N denotes the input size) of reversals of the head directions.
This may be seen as a way of (a) restricting the number of sequential scans and (b) restricting random access to these tapes, because each random access can be simulated by moving the head to the desired
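The reversal-counting idea described above can be illustrated with a small sketch. It is not the paper's formal Turing machine model, only a toy tape whose head moves one cell at a time and which counts changes of head direction. The class name `ReversalCountingTape`, the method `seek`, and the choice that the head initially moves to the right are assumptions made for this illustration.

```python
class ReversalCountingTape:
    """Toy external-memory tape that counts head reversals (illustrative only)."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.head = 0
        self.direction = +1   # assumption: the head starts out moving right
        self.reversals = 0

    def _step(self, direction):
        # A reversal is charged whenever the head changes its direction.
        if direction != self.direction:
            self.reversals += 1
            self.direction = direction
        self.head += direction

    def seek(self, position):
        """Simulate a random access by walking the head to `position`."""
        if not 0 <= position < len(self.cells):
            raise IndexError("position outside the tape")
        while self.head < position:
            self._step(+1)
        while self.head > position:
            self._step(-1)
        return self.cells[self.head]


tape = ReversalCountingTape(range(10))

# One left-to-right sequential scan needs no reversal.
for i in range(10):
    tape.seek(i)
scan_cost = tape.reversals

# Jumping back to cell 2 and then forward to cell 7 costs one reversal each.
tape.seek(2)
value = tape.seek(7)
jump_cost = tape.reversals - scan_cost
```

In this sketch a full sequential scan is free in terms of reversals, while each jump against the current head direction is charged once, which is the sense in which a bound on reversals limits both the number of scans and the number of random accesses.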
doi:10.1145/1516512.1516514
fatcat:wye6s3ladfcidivumsk7vtnhpq