Frequent items in streaming data: An experimental evaluation of the state-of-the-art

Nishad Manerikar, Themis Palpanas
<span title="">2009</span> <i title="Elsevier BV"> <a target="_blank" rel="noopener" href="" style="color: black;">Data &amp; Knowledge Engineering</a> </i> &nbsp;
The problem of detecting frequent items in streaming data is relevant to many different applications across many domains. Several algorithms, diverse in nature, have been proposed in the literature for the solution of the above problem. In this paper, we review these algorithms, and we present the results of the first extensive comparative experimental study of the most prominent algorithms in the literature. The algorithms were comprehensively tested using a common test framework on several
... l and synthetic datasets. Their performance was studied with respect to different parameters, both those intrinsic to the algorithms and those related to the data. We report the results and the insights gained through these experiments.

Introduction

Over the past few years, there has been a substantial increase in both the volume of data generated by various applications and the rate at which these data are generated. These two factors render the traditional store-first, process-later approach to data analysis obsolete for several applications across many domains. Instead, a growing number of applications rely on the new paradigm of streaming data processing [24, 2, 25]. Consequently, the area of data stream mining has received considerable attention in recent years.

An important problem in data stream mining is that of finding frequent items in the stream. This problem finds applications across several domains [14, 16, 13], such as financial systems, web traffic monitoring, internet advertising, retail, and e-business. Furthermore, it serves as the basis for solving other relevant problems, such as identifying frequent itemsets [22] and recent frequent items [27]. A common requirement in these settings is to identify frequent items in real time with a limited amount of memory, usually orders of magnitude less than the size of the problem.

Several novel algorithms have been proposed in the literature to tackle this problem. There are generally two approaches: counter-based methods and sketch-based methods. Counter-based algorithms maintain counters for a fixed number of elements of the stream, and only this limited number of elements is monitored. If an arriving item is already monitored, its counter is incremented; otherwise, the algorithm decides whether to discard the item or to reassign an existing counter to it.
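To make the counter-based scheme concrete, the following is a minimal Python sketch in the style of Space-Saving, one well-known counter-based algorithm. It is an illustration under stated assumptions, not the implementation evaluated in this study: the function name `space_saving`, the parameter `k` (the number of counters), and the linear scan for the smallest counter are all choices made here for brevity.

```python
from typing import Dict, Hashable, Iterable


def space_saving(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Monitor at most k items and return item -> estimated count.

    Hypothetical helper for illustration only. Estimates never fall
    below the true count, but they may overestimate it.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            # The item is already monitored: increment its counter.
            counters[item] += 1
        elif len(counters) < k:
            # A free counter is available: start monitoring the item.
            counters[item] = 1
        else:
            # All counters are in use: reassign the smallest counter to
            # the new item, which inherits the evicted count plus one.
            # (A linear scan is used here for simplicity.)
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


if __name__ == "__main__":
    print(space_saving(["a", "b", "a", "c", "a", "b", "a", "d"], k=2))
```

On the eight-item stream above with `k = 2`, the function returns `{'a': 4, 'd': 4}`. The count for `a` is exact, whereas `d` (seen once) is overestimated because it inherited the count of the counter it replaced. A variant that discards unmonitored items instead of reassigning a counter would follow the other branch of the decision described in the text.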
The prominent counter-based algorithms include Sticky Sampling and Lossy Counting (LC) [22], Frequent (Freq) [19, 17], and Space-Saving (SS) [23].

The other approach is to maintain a sketch of the data stream, using techniques such as hashing to map items to a reduced set of counters. Sketch-based techniques maintain approximate frequency counts of all elements in the stream, and they can also support deletions. As such, these algorithms are much more flexible than the counter-based methods. The prominent sketch-based algorithms include CountSketch¹ (CCFC) [6], GroupTest (CGT) [10], Count Min-Sketch (CM) [9], and hCount (hC) [18].

Although similar in some aspects, each algorithm has its own characteristics and peculiarities. As far as we are aware, there has not been a comprehensive comparative study of all these algorithms². In this paper, we independently compare all approaches, using a common test framework and a common set of synthetic and real datasets; the real datasets come from domains as diverse as retail, web blogs, and space measurements. It is interesting to note that several of the previous studies did not report results on real datasets. This work represents a comprehensive set of experiments that provide statistically robust indicators of performance under a broad range of operating conditions. Moreover, we make sure that the results of our experiments are completely reproducible: we make publicly available the source code for all the algorithms used in our experiments, as well as the datasets on which we tested them [26].

In summary, in this work we make the following contributions.
• We evaluate the performance of the most prominent algorithms proposed in the literature for the problem of identifying frequent items in data streams. We compare the performance of these algorithms along several different dimensions, using a common and fair test framework.
• In our experimental framework, we use the most extensive and diverse set of synthetic and real datasets that has been employed in the related literature.

¹ We refer to the CountSketch algorithm as CCFC, after the authors' initials, to avoid confusion with the Count Min-Sketch algorithm.
² In parallel with (and independently of) our work, another study explored the performance of frequent items algorithms [8]. In our work, we use a much wider variety of real datasets for conducting the experiments. The overall results of both studies are similar.

[26] Source Code, Datasets, and Additional Experimental Results.
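As an illustration of the sketch-based approach described in the introduction, the following minimal Python example hashes every item into a small two-dimensional array of counters, in the spirit of Count Min-Sketch (CM) [9]. It is a sketch under stated assumptions rather than the code evaluated in this study: the class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2 hashing in place of pairwise-independent hash functions are all choices made here for simplicity.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Illustrative sketch: depth rows of width counters each."""

    def __init__(self, width: int, depth: int) -> None:
        if width <= 0 or depth <= 0:
            raise ValueError("width and depth must be positive")
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Assumption: a salted BLAKE2 digest stands in for one hash
        # function per row; the row number serves as the salt.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item: str, delta: int = 1) -> None:
        # Every row is updated; a negative delta models a deletion.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += delta

    def estimate(self, item: str) -> int:
        # The minimum over rows limits the inflation from collisions.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch(width=64, depth=4)
    for _ in range(5):
        cms.update("x")
    cms.update("y", 2)
    cms.update("x", -1)  # deletion of one occurrence of "x"
    print(cms.estimate("x"), cms.estimate("y"))
```

Unlike a counter-based summary, every item in the stream contributes to the table, so any item can be queried. As long as no item's net count is negative, the estimate is never below the true count, and collisions can only inflate it.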
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="">doi:10.1016/j.datak.2008.11.001</a> <a target="_blank" rel="external noopener" href="">fatcat:gm7ux5tmzfe7vo35ter33fneoy</a> </span>