Augmented Apriori by Simulating Map-Reduce

R. Akila, K. Mani
2017 International Journal of Mathematical Sciences and Computing  
Association rule mining is a data mining technique used to identify decision-making patterns by analyzing datasets. Many association rule mining techniques exist to find various relationships among itemsets. The techniques proposed in the literature run on non-distributed platforms in which the entire dataset is retained until all transactions are processed and the transactions are scanned sequentially. They require more space and consume more time when large amounts of data are considered. An efficient technique is needed to find association rules from big datasets while minimizing both space and time. Thus, this paper aims to enhance the efficiency of association rule mining over big transaction databases, both in memory and in speed, by processing the big transaction database as a distributed file system in the Map-Reduce framework. The proposed method organizes the transactions into clusters and distributes the clusters among many parallel processors in a distributed platform. This distribution allows the clusters to be processed simultaneously to find itemsets, which improves performance in both memory and speed. Then, frequent itemsets are discovered using a minimum support threshold. Associations are generated from the frequent itemsets, and finally interesting rules are selected using a minimum confidence threshold. The efficiency of the proposed method is noticeably enhanced in both memory and speed.

Data are analyzed to discover knowledge and to make important decisions. Association rule mining is one of the data mining tasks used to discover decision-making patterns. The decision-making patterns are generated in the form of human-understandable if-then rules, and the associations can be used for prediction [16]. Though Apriori is one of the most popular and simplest association rule mining techniques, its efficiency is degraded for several reasons: it performs too many scans of the database, requires too much processing time and memory, and generates too many candidate sets. These issues are exacerbated when big data is involved. Data repositories are long-term stores of big data, collected from multiple sources and organized to facilitate management decision-making. The data are stored under a unified schema and are summarized. Data repository systems provide data analysis capabilities, collectively referred to as On-Line Analytical Processing (OLAP).
OLAP operations include drill-down, roll-up, and pivot [2]. Data repositories may be big datasets. Big data is huge in volume and may be structured; most structured data in scientific domains are voluminous. Processing such big data requires state-of-the-art computing machines, and setting up such an infrastructure is expensive. A distributed platform using Map-Reduce is employed for such scenarios [15]. A distributed platform spreads work among multiprocessor systems so that many tasks can be processed simultaneously; it requires tasks to be partitioned and distributed among the processors, which interact using message passing. It provides responses at great speed and utilizes the system efficiently. Similar characteristics can also be achieved on a single-processor system using multithreading; the only difference is that the partitioned tasks are designed as threads instead of processes. The Map-Reduce framework handles big data on a distributed platform. It needs clustering of documents, and document clustering is an important part of mining [17]. Map-Reduce originated in Artificial Intelligence as a functional programming model [6]. It has received the spotlight since it was reintroduced by Google to solve problems by analyzing big data, defined as multiple bytes of data in distributed computing, and is inspired by Google's Map-Reduce and Google File System (GFS) [2]. This approach is especially suited to big datasets, and the same can be applied to association rule mining. The Apriori algorithm is a powerful and important association rule mining algorithm that mines frequent itemsets for generating Boolean association rules [10]. It is expensive because of frequent scans of the database [12], computational complexity, and costly comparisons for the generation of candidate itemsets [3].
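As a concrete illustration of the level-wise Apriori loop discussed above, the following sketch mines frequent itemsets from a toy transaction database. The items, transactions, and support threshold are illustrative assumptions, not values from the paper:

```python
# Toy transaction database (illustrative, not from the paper).
TDB = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]
MIN_SUPPORT = 3  # absolute support count threshold (assumed for this example)

def apriori(transactions, min_support):
    """Level-wise Apriori: build k-candidates from frequent (k-1)-itemsets."""
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
        items = list(frequent)
        candidates = {a | b for a in items for b in items if len(a | b) == k}
        # Count each candidate with one scan of the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

freq = apriori(TDB, MIN_SUPPORT)
```

Note that every level requires a full scan of the database; this repeated scanning is exactly the cost the paper's Map-Reduce distribution is meant to reduce.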
The Map-Reduce programming model can be used [11] [12] [13] to implement a scalable Apriori algorithm; it is a parallel data processing system. The input and output data are key-value pairs in a specific format. Users express an algorithm using two functions, map and reduce [6] [7] [8] [9]. The map function generates a set of intermediate key-value pairs, and the reduce function combines all intermediate values associated with the same key [4] [14] [15]. Cloud computing and grid computing are distributed environments which may be used for parallelism [5] [8] [9]. The Apache Hadoop distribution is one of the cluster frameworks on distributed platforms that helps distribute large amounts of data across a number of nodes [7] [14]. It is observed that the Map-Reduce programming model can analyze large amounts of data with less memory and less processing time, while the efficiency of existing association rule mining techniques lags behind when voluminous data are analyzed. In addition, 1, 2, 3, ..., n-candidate itemsets are generated for the generation of frequent itemsets, and all the transactions are scanned for each of the 1, 2, 3, ..., n candidate itemset levels. This leads to too many scans, more memory space, and more time consumption. So it is worthwhile to incorporate Map-Reduce into association rule mining to enhance its efficiency. In the proposed work, the transactional database (TDB) is distributed among multiple processors to deal with big data. The TDB is not distributed entirely: it is partitioned into clusters, and the number of clusters is determined by the number of processors. These clusters of transactions are distributed among parallel processors. After the clusters are assigned to the processors, they are stored and processed in parallel. Parallel processing begins with the map process.
The map process scans each transaction to find single itemsets with their occurrences, then continues by forming 2, 3, ..., n-itemset combinations with their occurrences from the single itemsets. The reduce process then sums the occurrences of the generated itemsets within each processor, and the itemsets generated by all clusters are accumulated to construct the itemsets for the whole database.
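The map and reduce stages described above can be simulated sequentially in a single process. The sketch below is a minimal interpretation of that flow under assumed data and thresholds: each cluster's map step emits (itemset, count) pairs, the reduce step merges counts for the same itemset key across clusters, frequent itemsets are filtered by minimum support, and rules are kept when confidence = support(X ∪ Y) / support(X) meets the minimum confidence:

```python
from collections import Counter
from itertools import combinations

# Hypothetical clustered database: each inner list plays the role of one
# cluster handed to a parallel processor (items and sizes are illustrative).
CLUSTERS = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],
    [{"b", "c"}, {"a", "b"}, {"a", "b", "c"}],
]
MIN_SUPPORT = 3       # assumed absolute support threshold
MIN_CONFIDENCE = 0.6  # assumed confidence threshold

def map_cluster(cluster):
    """Map: emit (itemset, count) pairs for every itemset in one cluster."""
    local = Counter()
    for t in cluster:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                local[frozenset(combo)] += 1
    return local

def reduce_counts(partials):
    """Reduce: sum the counts of identical itemset keys across all clusters."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

counts = reduce_counts(map_cluster(c) for c in CLUSTERS)
frequent = {s: n for s, n in counts.items() if n >= MIN_SUPPORT}

# Rule generation: for each frequent itemset s and each proper subset X,
# keep X -> (s - X) when support(s) / support(X) >= MIN_CONFIDENCE.
rules = []
for s, n in frequent.items():
    if len(s) < 2:
        continue
    for k in range(1, len(s)):
        for lhs in map(frozenset, combinations(sorted(s), k)):
            conf = n / counts[lhs]
            if conf >= MIN_CONFIDENCE:
                rules.append((set(lhs), set(s - lhs), conf))
```

In the paper's setting the `map_cluster` calls would run on separate processors and only the reduce step would aggregate across them; the sequential generator expression here merely simulates that distribution.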
doi:10.5815/ijmsc.2017.04.05