Analyzing Continuous Data Streams Using Improved Stratified Sampling and Ensemble Classification

Gayathiri Kathiresan, Krishna Mohanta, Khanaa Asari
2018 International Journal of Intelligent Engineering and Systems  
Streaming data technologies play a vital role in real-time applications. When analyzing such data, random sampling with replacement makes it difficult to draw inferences from a small random sample, while sampling without replacement is not well suited to sub-streams that correspond to different sources. Hence, to mine data streams from heterogeneous sources effectively, this work proposes Adaptive Reservoir sampling Of stream In a Time window (AdROIT), which partitions the streams in a window ... n time factor and determines the size of the historical data in the reference window according to the data changes in the observation window. By measuring the standard deviation of the partitioned window, we can identify whether changes in the statistical properties of a data stream are due to one or multiple sources. AdROIT allocates a reservoir sampling size to each source, ensures adaptability, updates the ensemble classifier with dynamically estimated weights, and judges the accuracy of each member by its weight. The experimental results show that AdROIT provides better classification and mining results over heterogeneous data streams. Under a high degree of heterogeneity, AdROIT increases precision by 16% and recall by 30% compared to Chain sampling. Under the same conditions, AdROIT uses 40 KB of storage, more than Chain sampling. Finally, a large window size reduces the execution time of AdROIT by 15 seconds and improves recall by 40% compared to Chain sampling.
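To illustrate the idea of giving each source its own reservoir, here is a minimal sketch of per-source (stratified) reservoir sampling using the classic Algorithm R. It is not the AdROIT method: the abstract does not specify how reservoir sizes are adapted, so a fixed size per source is assumed here, and the class name `StratifiedReservoir`, its parameters, and the source labels are hypothetical.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """Keeps one fixed-size uniform reservoir per source (stratum).

    Assumption: every source gets the same fixed reservoir size.
    AdROIT adapts this allocation; that logic is not reproduced here.
    """

    def __init__(self, size_per_source, seed=None):
        self.size = size_per_source
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)  # source -> sampled items
        self.seen = defaultdict(int)      # source -> items observed so far

    def add(self, source, item):
        self.seen[source] += 1
        reservoir = self.samples[source]
        if len(reservoir) < self.size:
            # Reservoir not yet full: keep every item.
            reservoir.append(item)
        else:
            # Algorithm R: the new item replaces a random slot
            # with probability size / seen[source].
            j = self.rng.randrange(self.seen[source])
            if j < self.size:
                reservoir[j] = item


# Usage: two interleaved hypothetical sources in one stream.
sampler = StratifiedReservoir(size_per_source=5, seed=0)
for i in range(1000):
    source = "sensor_b" if i % 3 == 0 else "sensor_a"
    sampler.add(source, i)
```

Because each source has its own reservoir, a high-volume source cannot crowd a low-volume one out of the sample, which is the motivation for stratifying by source in the first place.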
doi:10.22266/ijies2018.1031.20 fatcat:nuujuah5qfgfpkizl5zxoggseu