An Infinite Multivariate Categorical Mixture Model for Self-Diagnosis of Telecommunication Networks
2020 23rd Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN)
is as complex as the data processing required to construct expert rules. However, unlike text data, the data gathered from various devices and services in the network is often structured as a table, where each variable takes values in some range. Clustering such data, gathered from telecommunication networks and services, presents many challenges. The first and main challenge is the unknown number of fault clusters in the data. The second challenge is the types and
... e nature of the data. The data is multi-dimensional and can contain both categorical and continuous variables. Classical clustering algorithms, in which the number of clusters must be set a priori, therefore require some form of model selection. Furthermore, classical approaches such as k-means assume a specific probability distribution for each cluster. These modeling assumptions, including the assumption on the number of clusters, can hurt clustering performance when the data do not comply with them, which is often the case in real-world applications. In this paper, we propose an infinite multivariate categorical mixture model to identify patterns of faults in an unsupervised setting, without any prior expert knowledge and without needing to know the number of distinct fault patterns a priori. The model is based on the Dirichlet Process, which allows the number of clusters to be learned from the data. However, the Dirichlet Process assumes an infinite number of clusters, which translates into an intractable inference problem on the model. Our contributions are the following:
• We provide a theoretical formulation of the infinite multivariate categorical mixture model (section 2).
• We show how to perform approximate inference on the model using Variational Inference, in order to extract the clusters from the data (section 3).
• We demonstrate how the model is able to identify root causes of faults in a synthetic dataset generated from a real-world expert Bayesian Network (section 4).
• We also demonstrate the clustering performance of the model on real operational data acquired from the Fixed Access Network and the Local Area Network (section 5).
An implementation of the model and the synthetic data are available at: https://git.io/JejBQ

Abstract-The diagnosis of telecommunication networks remains a challenging task, mainly due to the large variety and volume of data from which the root causes have to be inferred.
Expert systems, supervised machine learning, and Bayesian networks require expensive and time-consuming data labeling or processing by experts. In this paper, we propose an Infinite Multivariate Categorical Mixture Model for clustering patterns of faults in data gathered from telecommunication networks. The model automatically identifies the number of clusters necessary to explain the data by using a Dirichlet process prior. We show how to use Variational Inference to derive an Expectation-Maximization (EM)-like algorithm to perform inference on the model. We apply our model to synthetic data generated from an expert Bayesian network of a Fiber-To-The-Home (FTTH) Gigabit-capable Passive Optical Network (GPON). We show that the model discovers the patterns linked to the root causes of the faults with up to 96% accuracy in an unsupervised manner. We also apply our method to real data gathered from the FTTH network and the local area network and demonstrate how the model is able to identify known faults.
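To make the role of the Dirichlet process prior more concrete, the following is a minimal generative sketch, not the authors' implementation. It draws mixture weights from a truncated stick-breaking construction and then samples tabular categorical data from the resulting mixture. The function names, the truncation level, the concentration parameter `alpha`, the symmetric Dirichlet prior on the categorical parameters, and the assumption that variables are independent given the cluster are all illustrative choices that do not come from the paper.

```python
import numpy as np


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative assumptions, not values from the paper.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    # Truncation: the last stick takes all remaining mass so the weights sum to 1.
    betas[-1] = 1.0
    # Mass left before each break: 1, (1 - b_1), (1 - b_1)(1 - b_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining


def sample_categorical_mixture(n_samples, cardinalities, alpha=1.0,
                               truncation=20, seed=0):
    """Sample rows of categorical variables from a truncated DP mixture.

    Assumes (hypothetically) that variables are independent given the cluster
    and that each categorical parameter has a symmetric Dirichlet prior.
    """
    rng = np.random.default_rng(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    # One (truncation x cardinality) parameter matrix per variable.
    params = [rng.dirichlet(np.ones(c), size=truncation) for c in cardinalities]
    # Cluster assignment for every row.
    z = rng.choice(truncation, size=n_samples, p=weights)
    data = np.empty((n_samples, len(cardinalities)), dtype=int)
    for d, theta in enumerate(params):
        for i in range(n_samples):
            data[i, d] = rng.choice(cardinalities[d], p=theta[z[i]])
    return data, z, weights


if __name__ == "__main__":
    data, z, weights = sample_categorical_mixture(500, [3, 4, 2])
    # Only a few of the available clusters typically receive data.
    print("clusters actually used:", len(np.unique(z)))
    print("first rows:\n", data[:5])
```

Although the truncation level allows up to 20 clusters here, most of the weight tends to concentrate on a few of them for small `alpha`, which gives an intuition for how such a prior lets the effective number of clusters be driven by the data rather than fixed in advance.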