Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification [chapter]

Mark Tennant, Frederic Stahl, Giuseppe Di Fatta, João Bártolo Gomes
2014 Research and Development in Intelligent Systems XXXI  
Abstract: Advances in hardware technologies allow data to be captured and processed in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from
data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a fast rate. This is problematic if the application requires that there be no delay, or only a very small one, between changes in the patterns of the stream and the absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non-streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.

... Variety (possible unstructured format) of 'Big Data'. Traditional data mining methods for classification of static data take several passes through the training data in order to generate the classification model, which is then applied to previously unseen data instances. Streaming models replace this train-and-test procedure with a system that must be continuously evaluated and updated. As the data is often either too fast to process in depth or too vast to store, data stream classifiers must be naturally incremental, adaptive and responsive to single exposures to data instances. The continuous task of re-learning and adaptation aims to tackle the problem of concept drift [12] (changes over time in the patterns encoded in the streams).
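The idea of an incremental, single-pass classifier that forgets old patterns can be illustrated with a minimal sketch. It assumes a fixed-size sliding window over the most recent labelled instances and a KNN vote over that window; the class name `SlidingWindowKNN` and its parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Hypothetical bounded-memory KNN over a stream (illustrative only)."""

    def __init__(self, k=3, window_size=1000):
        self.k = k
        # deque(maxlen=...) drops the oldest instance automatically, so
        # memory stays bounded and outdated concepts are gradually forgotten.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        """Absorb one labelled instance; each instance is seen only once."""
        self.window.append((tuple(x), y))

    def predict_one(self, x):
        """Majority vote among the k nearest instances currently in the window."""
        if not self.window:
            return None
        x = tuple(x)
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]
```

Because prediction scans the whole window, its cost grows with the window size. Under this assumed design the window size is the knob that trades accuracy against per-instance processing time.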
An ideal data stream classifier should have certain features [18]: the classifier must limit its size (memory footprint), as streams are theoretically infinitely long; the time taken to process each instance must be short and constant so as not to create a bottleneck; each data instance is seen only once by the classifier; the classification model created incrementally should be equivalent to that of a 'batch' learner given the same training data; and the classifier must be able to handle concept drift. Data streams come in all forms as technologies merge and become more interconnected. Classic applications are: sensor networks; Internet traffic management and web log analysis [13]; TCP/IP packet monitoring [8]; and intrusion detection [15]. However, capturing, storing and processing these data streams in their entirety is not feasible, as the data stream is potentially infinite. Systems that can analyse these very fast and unbounded data streams in real time are of great importance to applications such as the detection of credit card fraud [6, 20] or network intrusion. For many data mining problems, parallelisation can be used to increase scalability. It allows classifiers to increase the speed of both model creation and usage; notable developments include the tree- and rule-based parallel classifiers [16, 22, 23]. Working with data streams limits the processing time available for classification (both testing and training) to the small window of time between the arrival of instances. Parallelisation of data stream mining algorithms offers a potential way to create faster solutions that can process a much larger number of data instances in this small time window and can thus scale these algorithms up to high-velocity data streams. VFDT (Very Fast Decision Tree) [11], one of the fastest streaming decision-tree-based classifiers currently available, is simple and incremental and performs well.
Unfortunately, such classifiers are not inherently scalable and cannot be parallelised efficiently. The problem with distributing complex streaming classifier models (such as decision trees) over a cluster is that it reduces their ability to adapt to concept drift and creates new problems such as load imbalance and time delays. KNN is typically not suited to data stream mining without adaptation (such as employing KD-Trees, P-Trees, L-Trees or MicroClusters) [26], as it incurs a relatively high real-time processing cost, proportional to the size of its training data. In this paper we propose KNN as a basis for the creation of a parallel data stream classifier. The motivation for using KNN is that it is inherently parallelisable; for example, [25] developed a parallel KNN using the MapReduce parallel programming paradigm [9]. This has been demonstrated in the past for KNN on static data [17], but not yet on data streams. Versions of KNN for data streams exist, such as [10], but to our knowledge there are no parallel approaches for KNN on data streams.
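One way to see why KNN parallelises naturally is a map/reduce-style sketch. It assumes the training instances are split into partitions; each partition returns its own k nearest candidates (the map step), and the candidates are merged into a global top k before voting (the reduce step). The function names are hypothetical, and a thread pool stands in for cluster nodes purely to show the structure. This is not the authors' implementation, nor that of [25].

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import heapq
import math


def local_knn(partition, query, k):
    """Map step: the k nearest (distance, label) pairs within one partition."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def parallel_knn_predict(partitions, query, k=3):
    """Reduce step: merge local candidates, keep the global k nearest, vote."""
    query = tuple(query)
    # Threads only illustrate the structure; a real deployment would place
    # each partition on its own process or machine (an assumption here).
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        local_results = pool.map(lambda part: local_knn(part, query, k), partitions)
        candidates = [pair for result in local_results for pair in result]
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Every partition contributes its own k best candidates, so the merged result matches what a single-machine KNN would find over the union of the partitions. Each worker scans only its share of the data.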
doi:10.1007/978-3-319-12069-0_4 dblp:conf/sgai/TennantSFG14 fatcat:65lwln5ddvdojc3idwvyhs3mgu