A New Network Traffic Classification Method Based on Classifier Integration

Zhang Luoshi, Xue Yibo, Bao Yuanyuan
2015 International Journal of Grid and Distributed Computing  
With the growth in scale, diversity and complexity of network traffic, the drawbacks of traditional machine learning methods for traffic classification have gradually been exposed; in particular, the false positive problem is serious in large-scale real network traffic classification. Aiming to reduce the false positive rate of network traffic classification, this paper proposes an effective network traffic classification method, the CMM method. The CMM method contains three steps: dividing the training set into clusters, forming sub-classifiers, and integrating the classifiers in accordance with the principle of minimization and maximization. We first demonstrate the effectiveness of this method in reducing the false positive rate. Second, we conduct experiments on a large-scale national backbone network, such as SSL protocol classification, and the experimental results verify the effectiveness of this method in large-scale actual network traffic classification.

(such as the number of all packets and the average payload length of packets) as the classification feature and use Naive Bayes, SVM, C4.5 and other machine learning algorithms for network traffic classification. Such methods do not depend on the packet payload contents or the port number. They can solve current problems of network traffic classification such as port multiplexing and traffic encryption, and have become a research focus. However, most existing machine-learning-based network traffic classification methods assume that the number of flows of the various protocols in the training set is roughly balanced, and a classifier trained this way classifies class-balanced test data well in a laboratory environment. In a real-world network environment, however, the classes of network traffic differ greatly in proportion: some protocol classes have far fewer flows than others, which is a serious imbalance. In this paper, classes with few samples are called little classes or rare classes, while those with many samples are called large classes. In such a real-world environment, if a model trained on class-balanced data is used to classify little-class network traffic, many large-class flows will be mistakenly identified as little class. That is, a serious false positive problem occurs and greatly reduces the accuracy of traffic classification.
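The false positive effect described above can be made concrete with a small calculation. The sketch below is illustrative only: the flow counts, the 95% recall and the 2% false positive rate are hypothetical values, not measurements from the paper, and `little_class_precision` is a helper name introduced here. It shows that a classifier whose per-class error rates stay fixed still produces mostly wrong little-class labels once large-class flows greatly outnumber little-class flows.

```python
def little_class_precision(n_little, n_large, recall, false_positive_rate):
    """Fraction of flows labelled 'little class' that truly belong to it."""
    true_positives = recall * n_little              # little-class flows found
    false_positives = false_positive_rate * n_large  # large-class flows mislabelled
    return true_positives / (true_positives + false_positives)


# Hypothetical laboratory setting: both classes have 1,000 flows.
balanced = little_class_precision(1_000, 1_000, recall=0.95, false_positive_rate=0.02)

# Hypothetical backbone setting: the large class has 1,000,000 flows.
imbalanced = little_class_precision(1_000, 1_000_000, recall=0.95, false_positive_rate=0.02)

print(f"balanced precision:   {balanced:.3f}")
print(f"imbalanced precision: {imbalanced:.3f}")
```

With these assumed numbers the same classifier goes from about 98% precision on the balanced data to under 5% on the imbalanced data, because 2% of one million large-class flows is far more than the 950 correctly identified little-class flows.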
Therefore, in the context of massive and complex real network traffic, designing a high-accuracy method for classifying little-class traffic has great academic significance and high practical value. This paper analyzes the causes of low classification accuracy for little-class traffic in a large-scale network environment, designs a new network traffic classification method that effectively increases the classification accuracy, and validates it experimentally on actual network traffic. The experimental results confirm the effectiveness of this method in classifying actual little-class network traffic. The main contributions of this paper are as follows:
• It analyzes the main causes of the decrease in classification accuracy for little-class traffic under traditional machine learning methods;
• It designs CMM (Cluster-Min-Max), a network traffic classification method based on classifier ensemble that effectively improves the classification accuracy for little-class traffic;
• It experimentally validates the CMM method on actual network traffic, and the experimental results verify its effectiveness in classifying massive network traffic.
This paper is organized as follows: Section 2 describes traditional network traffic classification technology and the state of research on machine-learning-based classification; Section 3 analyzes the reasons for the decrease in accuracy and proposes the CMM method, a classification method for large-scale network traffic; Section 4 experimentally evaluates the accuracy, recall and false positive rate of the CMM method on actual network traffic; Section 5 summarizes the paper and discusses future research.

Related Works

Traditional network traffic classification technologies mainly include port-based ones and payload-based ones.
The port-based network traffic classification technology classifies traffic by the port number that IANA assigns to each network application, such as port 443 for the HTTPS protocol. However, with the widespread use of techniques such as random ports, port multiplexing and port hopping, existing port-based network traffic classification technology can identify less than 30% of network traffic [1], with an identification accuracy of only 50-70% [2]. Payload-based network traffic classification technology builds signatures by analyzing in advance the packet payload characteristics of network applications, uses regular expressions or string matching to determine whether such signatures are present in the network traffic, and determines the protocol of a network flow on that basis. However, with the extensive use of encrypted protocols, protocol multiplexing and feature obfuscation, the payload characteristics are hidden, so payload-based network traffic classification technology has gradually lost its effectiveness [3]. To address the failures of port-based and payload-based network traffic classification technologies, machine learning methods have been introduced in the field. In 2003, Early et al. [4] used statistical features of network flows, such as average packet payload length, average interval time between packets and TCP header flags, to distinguish the applications HTTP, SMTP, FTP, SSH and TELNET. Since then, network traffic classification methods based on machine learning have gradually become a research focus, and a large number of research results have been achieved [5]. To further improve the accuracy of network traffic classification methods based on machine learning, Moore et al. summarized 248 features of network flows that could be used by machine learning methods [6]. Williams et al. 
evaluated the performance and accuracy of five different machine learning algorithms for network traffic classification [7]. Yang Baohua et al. [8] proposed SMILER, which uses a semi-supervised machine learning algorithm and the payload lengths of the first 5 packets to classify network traffic, and achieved high accuracy. Xi Liang et al. analyzed the research status of immunity-based intrusion detection systems (IIDS), promoting the conversion of theoretical results into applications and stimulating further development of artificial immune systems [9]. Based on active learning and SVM algorithms, Wang Yipeng et al. [10] classified unknown network protocol traffic using only the payload information in raw network traffic, reducing the number of learning samples while maintaining classification accuracy. Dong Hui et al. proposed a new method based on link homophily to classify application-layer traffic without payloads or properties, and achieved above 80% accuracy [11]. However, most existing classification algorithms assume a balanced data set, that is, the same number of samples is drawn from each protocol class to constitute the training set. Meanwhile, when training samples are collected, environment and time constraints mean that only part of the traffic can be obtained, by random under-sampling. Therefore, the flow features cannot all be covered effectively. When put into a real-world network environment, such classifiers show a sharp decline in identification performance, a high misclassification rate and significant false positives. To this end, Zhang Hongli et al. [12] compared the performance of the C4.5, SVM, NBK and Bagging algorithms in classifying network traffic with imbalanced classes. 
Although the Bagging algorithm and the C4.5 decision tree algorithm perform relatively better on little classes of network protocols, they still cannot solve the class imbalance problem. Nguyen et al. [13] and Dainotti et al. [14] also argued that existing machine-learning-based network traffic classification technology faces serious challenges from class imbalance, classifies little classes poorly, and is almost unusable for them [15] [16] [17].
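The three CMM steps named in the abstract (divide the training set into clusters, form sub-classifiers, integrate them by minimization and maximization) can be illustrated with a rough sketch. This excerpt does not specify the clustering algorithm, the sub-classifier, or the exact min/max rule, so everything below is an assumption: a tiny k-means stands in for the clustering step, a nearest-centroid rule stands in for each sub-classifier, and the integration follows a min-max modular arrangement (MIN over the large-class clusters, then MAX over the little-class clusters). The function names and the two-dimensional feature values are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Tiny k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def make_sub_classifier(little_centroid, large_centroid):
    """Stand-in sub-classifier: 1 if x is nearer the little-class centroid."""
    def predict(x):
        nearer_little = math.dist(x, little_centroid) < math.dist(x, large_centroid)
        return 1 if nearer_little else 0
    return predict


def train_cmm(little, large, k_little, k_large):
    """Steps 1 and 2: cluster each class, then build one sub-classifier per pair."""
    little_centroids = kmeans(little, k_little)
    large_centroids = kmeans(large, k_large)
    return [[make_sub_classifier(lc, gc) for gc in large_centroids]
            for lc in little_centroids]


def predict_cmm(model, x):
    """Step 3: MIN across large-class clusters, then MAX across little-class clusters."""
    return max(min(clf(x) for clf in row) for row in model)


# Hypothetical two-feature flows (for example, scaled packet count and payload length).
little_flows = [(1, 1), (9, 9), (1, 2), (9, 8), (2, 1), (8, 9)]
large_flows = [(5, 5), (1, 9), (5, 6), (2, 9), (6, 5), (1, 8)]
model = train_cmm(little_flows, large_flows, k_little=2, k_large=2)

for flow in [(1.5, 1.5), (5, 5.5), (0.5, 5.5)]:
    print(flow, "->", "little" if predict_cmm(model, flow) else "large")
```

Under this assumed arrangement, a flow is labelled little class only if, for at least one little-class cluster, every sub-classifier paired with that cluster accepts it. A single rejecting sub-classifier vetoes the row through the MIN step, which is one plausible way such an ensemble could lower the false positive rate.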
doi:10.14257/ijgdc.2015.8.3.29 fatcat:xbe7i36z65htvlmevqt5adnpgm