RFIMiner: A regression-based algorithm for recently frequent patterns in multiple time granularity data streams

Lifeng Jia, Zhe Wang, Nan Lu, Xiujuan Xu, Dongbin Zhou, Yan Wang
2007 Applied Mathematics and Computation  
In this paper, we propose an algorithm for computing and maintaining recently frequent patterns, which are more stable and smaller than the data stream itself, and for dynamically updating them with incoming transactions. Our study makes two main contributions. First, a regression-based data stream model is proposed to differentiate new and old transactions. The novel model maps transactions onto multiple time granularities and can automatically adjust the transactional fading rate by defining a
more » ... ng factor. The factor defines a desired lifetime for the information of transactions in the data stream. Second, we develop RFIMiner, a single-scan algorithm for mining recently frequent patterns from data streams. Our algorithm exploits a special property of suffix-trees, so it is unnecessary to traverse the suffix-trees when patterns are discovered. To cater to suffix-trees, we also adopt a new method called Depth-first and Bottom-up Inside Itemset Growth to find more recently frequent patterns from known frequent ones. Moreover, it also avoids redundant computation and candidate pattern generation. We conduct detailed experiments to evaluate the performance of the algorithm in several respects. The results confirm that the new method scales well and meets the quality and efficiency requirements of mining recently frequent itemsets in data streams.

Applied Mathematics and Computation 185 (2007) 769-783, www.elsevier.com/locate/amc

of P. An itemset P is frequent if |T(P)| ≥ min_sup, where min_sup is the support threshold defined by users. The objective of frequent itemset mining is to discover the complete set of frequent itemsets in transactional databases. Most algorithms, including Apriori [2], FP-growth [9], H-mine [10], and OP [11], mine the complete set of frequent itemsets. Later studies addressed closed frequent itemsets [12], maximal frequent itemsets [15], and representative itemsets [22]. Their task is to find a succinct representation that describes the complete set of frequent itemsets accurately or approximately. Subsequently, several novel algorithms, including A-close [12], CLOSET [13], CHARM [14], MAFIA [15], and RPlocal and RPglobal [22], were proposed. All of the algorithms mentioned above perform well on sparse databases with a high min_sup threshold.
However, when min_sup drops or the database grows dynamically, their performance degrades. Unfortunately, recent emerging applications, such as network traffic analysis, web click stream mining, power consumption measurement, sensor network data analysis, and dynamic tracing of stock fluctuations, produce a new kind of dense and large database. Such data are called streaming data: they take the form of continuous, potentially infinite data streams, as opposed to finite, statically stored databases. Data stream management systems and continuous stream query processors are under active investigation and development. Besides querying data streams, another important task is to mine them for frequent itemsets. The problem of mining frequent itemsets in large databases was first proposed by Agrawal et al. [1] in 1993 and has been widely studied and applied since. In the data stream setting, however, mining frequent itemsets becomes a challenging problem, because the information in the streaming data is huge and rapidly changing. Consequently, infrequent items and itemsets can become frequent later on and hence cannot be ignored. In our view, a data stream is best described as follows: a data stream is a continuous, huge, fast-changing, rapid, infinite sequence of data elements. It follows from this definition that an algorithm for streaming data must scan the whole dataset only once and support aggregation queries on demand. In addition, such algorithms usually maintain a data structure far smaller than the whole dataset. Based on the discussion so far, the single-scan requirement of the streaming data model conflicts with the objective of frequent itemset mining, which is to discover the complete set of frequent itemsets.
To resolve this conflict, an estimation mechanism [16] is proposed in the Lossy Counting algorithm. Lossy Counting is a stream-based algorithm for mining frequent itemsets that uses the well-known Apriori property [2] to discover frequent itemsets in data streams: if any length-k pattern is not frequent in the database, its length-(k + 1) super-patterns can never be frequent. The estimation mechanism is defined as follows. Given a maximum allowable error threshold ε and a minimum support threshold θ, the information about the previous results up to the latest block operation is maintained in a data structure called a lattice. The lattice contains a set of entries of the form (e, f, Δ), where e is an itemset, f is the frequency of e, and Δ is the maximum possible error count of e. For each entry in the lattice, if e is one of the itemsets identified in the new transactions, its previous count f is incremented by its count in the new transactions. Subsequently, if its estimated count f + Δ is less than ε · N, where N is the number of transactions, the entry is pruned from the lattice. On the other hand, when the lattice has no entry for a new itemset identified in the new transactions, a new entry (e, f, Δ) is inserted into the lattice. Its maximum possible error count is set to ε · N′, where N′ denotes the number of transactions processed up to the previous block operation. Besides the estimation mechanism, several other studies present techniques for mining frequent itemsets in data streams. FP-stream, an FP-tree-based algorithm, employs a novel tilted-time windows technique and mines frequent itemsets at multiple time granularities. Jin and Agrawal developed StreamMining [18], a single-scan algorithm that mines frequent itemsets in data streams.
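
The block operation of the Lossy Counting estimation mechanism described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from [16]: the function name `process_block`, the dictionary layout of the lattice, and the `max_len` cap on itemset size are hypothetical choices made here to keep the example small. The pruning test and the error count for new entries follow the description in the text.

```python
from collections import Counter
from itertools import combinations


def process_block(lattice, block, n_prev, epsilon, max_len=2):
    """Apply one block operation and return the new transaction count N.

    lattice : dict mapping an itemset (sorted tuple) to (f, delta)
    block   : list of new transactions, each an iterable of items
    n_prev  : transactions processed before this block (N' in the text)
    epsilon : maximum allowable error threshold
    max_len : hypothetical cap on itemset size, to bound enumeration
    """
    # Count every itemset of up to max_len items found in the new block.
    block_counts = Counter()
    for transaction in block:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                block_counts[itemset] += 1

    n_total = n_prev + len(block)

    # Existing entries: add the count from the new transactions, then
    # prune the entry if its estimated count f + delta < epsilon * N.
    for itemset in list(lattice):
        f, delta = lattice[itemset]
        f += block_counts.pop(itemset, 0)
        if f + delta < epsilon * n_total:
            del lattice[itemset]
        else:
            lattice[itemset] = (f, delta)

    # Itemsets without an entry: insert them with a maximum possible
    # error count of epsilon * N'.
    for itemset, f in block_counts.items():
        lattice[itemset] = (f, epsilon * n_prev)

    return n_total
```

Calling `process_block` once per incoming block, and passing the returned count as `n_prev` for the next call, keeps the lattice far smaller than the stream, because entries whose estimated count falls below the error bound are dropped as the stream grows.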
StreamMining uses fixed-size sets and, whenever one overflows, decreases the frequencies of the candidate itemsets it contains. Furthermore, because data streams are infinite and fast-changing, new transactions may carry more valuable information than old ones. The importance and practicality of mining recently frequent itemsets have therefore attracted attention. Teng et al. proposed FTP-DS [19], a regression-based algorithm that was the first to mine recently frequent itemsets using sliding windows. Chang et al. developed an algorithm called estDec [20]
doi:10.1016/j.amc.2006.06.115 fatcat:45kiaxjat5dsxgm6wne6gzpa4u