An Outlier Mining Algorithm Based on Dissimilarity

Ming-jian Zhou, Xue-jiao Chen
2012 Procedia Environmental Sciences  
Outlier mining is an active topic in data mining. After reviewing the commonly used outlier mining methods, this paper presents OMABD (Outlier Mining Algorithm Based on Dissimilarity), an outlier mining algorithm based on dissimilarity. The algorithm first constructs a dissimilarity matrix from the pairwise dissimilarities between the objects in the data set, then computes the dissimilarity degree of each object from that matrix, and finally detects outliers by comparing each dissimilarity degree with a dissimilarity threshold. The experimental results show that the algorithm can detect outliers efficiently.

1. Introduction

With the development of information technology and the growing popularity of the Internet, far more application data are available to people. However, these data contain much noise and many incomplete records. Data of this kind, which exhibit special behavior or follow a different model, are usually called outliers. An outlier, according to Hawkins [1], is "an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism". Outliers mainly arise from the following three causes: 1) Inherent variation in the data. This variation occurs naturally in the data sample and is uncontrollable. 2) Execution errors, such as manual operation errors, hacker intrusions and equipment failures. 3) Data that fall into the wrong classes.

In a valid data set, outliers make up only a small part and are regarded as a byproduct of clustering, so they are often simply removed or ignored. However, researchers have gradually realized that certain outliers may truly reflect normal data. Outlier mining has therefore become an important aspect of data mining. The task of outlier mining is to discover exceptional, interesting, sparse and isolated patterns hidden in massive data sets. It often leads people to real but unexpected knowledge. As a result, outlier mining has a wide range of real-life applications, such as detecting malicious credit card overdrafts, network intrusion detection, loan proof checking and so on [2].

At present, the classical outlier mining techniques can be divided into four categories: statistic-based methods [3], distance-based methods [4, 5], density-based methods [6, 7, 8] and deviation-based methods [9, 10, 11].
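
The three steps summarized in the abstract (build a dissimilarity matrix, derive a dissimilarity degree per object, compare it with a threshold) can be illustrated with a minimal Python sketch. This excerpt does not give the paper's actual definitions, so the choices below are assumptions made only for illustration: Euclidean distance stands in for object dissimilarity, the dissimilarity degree is taken as an object's mean dissimilarity to all other objects, and the threshold is supplied by the caller. The function names are hypothetical and are not taken from OMABD.

```python
import math


def dissimilarity_matrix(data):
    """Build a symmetric matrix of pairwise dissimilarities.

    Assumption: Euclidean distance is used as the object dissimilarity.
    """
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(data[i], data[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def dissimilarity_degrees(matrix):
    """Compute one degree per object from the matrix.

    Assumption: the degree is the mean dissimilarity to the other objects.
    """
    n = len(matrix)
    if n < 2:
        return [0.0] * n
    return [sum(row) / (n - 1) for row in matrix]


def find_outliers(data, threshold):
    """Return indices of objects whose degree exceeds the threshold."""
    degrees = dissimilarity_degrees(dissimilarity_matrix(data))
    return [i for i, degree in enumerate(degrees) if degree > threshold]


# Four points close together and one far away (made-up example data).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
outliers = find_outliers(points, threshold=5.0)
print(outliers)
```

With these illustrative points, only the isolated point at index 4 has a mean dissimilarity above 5.0. Building the full matrix takes quadratic time and memory in the number of objects, which is acceptable for a small sketch like this one.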
doi:10.1016/j.proenv.2012.01.352 fatcat:5jbg54bnyrau7fgd2dengthcxu