An overview of inference methods in probabilistic classifier chains for multilabel classification

Deiner Mena, Elena Montañés, José Ramón Quevedo, Juan José del Coz
<span title="2016-08-03">2016</span> <i title="Wiley"> <a target="_blank" rel="noopener" href="" style="color: black;">Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery</a> </i> &nbsp;
This paper reviews recent advances in inference for probabilistic classifier chains in multi-label classification. Interest in such inference arises from attempts to improve on the performance of the approach based on greedy search (the well-known CC method) while reducing the computational cost of an exhaustive search (the well-known PCC method). Unlike PCC, and like CC, these inference techniques do not explore all possible solutions, but they increase the performance of CC, sometimes reaching the optimal solution in terms of subset 0/1 loss, as PCC does. The techniques are the ε-approximate algorithm, a method based on beam search, and Monte Carlo sampling. An exhaustive set of experiments over a wide range of datasets is performed to analyze not only to what extent these techniques tend to produce optimal solutions, but also their computational cost, both in terms of solutions explored and execution time. Only the ε-approximate algorithm with ε = 0 theoretically guarantees reaching an optimal solution in terms of subset 0/1 loss. However, the other algorithms provide solutions close to an optimal one, even though they do not guarantee reaching it. The ε-approximate algorithm is the most promising for balancing performance in terms of subset 0/1 loss against the number of solutions explored and execution time. The value of ε determines to what extent one prefers a guarantee of reaching an optimal solution at the price of a higher computational cost.

Introduction

Multi-label classification [1] (MLC) is a machine learning problem in which models are able to assign a subset of (class) labels to each instance, unlike conventional (single-class) classification, which involves predicting only a single class. Multi-label classification problems are ubiquitous and occur naturally, for instance, in assigning keywords to a paper, tags to resources in a social network, objects to images or emotional expressions to human faces. In general, the problem of multi-label learning comes with two fundamental challenges. The first concerns the computational complexity of the algorithms. A complex approach might not be applicable in practice when the number of labels is large. Therefore, the scalability of algorithms is a key issue in this field. The second problem is related to the very nature of multi-label data.
Not only is the number of labels typically large, but each instance also belongs to a variable-sized subset of labels simultaneously. Moreover, and perhaps even more importantly, the labels normally do not occur independently of each other; instead, there are statistical dependencies between them. From a learning and prediction point of view, these relationships constitute a promising source of information, in addition to the information coming from the mere description of the instances. Thus, it is hardly surprising that research on MLC has focused heavily on the design of new methods that are able to detect, and benefit from, interdependencies among labels.

Several approaches have been proposed in the literature to cope with MLC. First, researchers tried to adapt and extend different state-of-the-art binary or multi-class classification algorithms, including methods using decision trees [2], neural networks [3], support vector machines [4], naive Bayes [5], conditional random fields [6] and boosting [7]. Second, they analyzed label dependence in depth and attempted to design new approaches that exploit label correlations [8]. In this regard, two kinds of label dependence have been formally distinguished: conditional dependence [6, 9-13] and unconditional dependence [3, 14, 15]. In addition, pairwise relations [3, 4, 7, 16, 17], relations in sets of different sizes [12, 18, 19], and relations in the whole set of labels [10, 14, 15] have been exploited.

Regarding conditional label dependence, the approach called Probabilistic Classifier Chains (PCC) has aroused great interest among the multi-label community, since it offers the nice property of being able to estimate the conditional joint distribution of the labels. However, the original PCC algorithm [9] suffers from high computational cost, since it performs an exhaustive search as inference strategy to obtain optimal solutions in terms of a given loss function.
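The contrast between greedy inference (CC) and exhaustive inference (PCC) can be illustrated with a minimal sketch. It assumes a chain of two labels and a hypothetical table `COND_P1` of conditional probabilities for a single instance; the table values, the function names and the 0.5 threshold are illustrative assumptions, not taken from the paper. In a real chain, each entry would be produced by a trained probabilistic binary classifier.

```python
from itertools import product

# Hypothetical values of P(y_j = 1 | x, y_1..y_{j-1}) for one fixed instance x.
# Keys are the label prefixes already decided earlier in the chain.
COND_P1 = {
    (): 0.40,    # P(y_1 = 1 | x)
    (0,): 0.45,  # P(y_2 = 1 | x, y_1 = 0)
    (1,): 0.90,  # P(y_2 = 1 | x, y_1 = 1)
}
NUM_LABELS = 2


def joint_prob(labels):
    """Chain rule: P(y | x) is the product of the conditional probabilities."""
    prob = 1.0
    for j, y in enumerate(labels):
        p1 = COND_P1[tuple(labels[:j])]
        prob *= p1 if y == 1 else 1.0 - p1
    return prob


def greedy_inference():
    """CC-style inference: fix each label to its locally most probable value."""
    labels = []
    for _ in range(NUM_LABELS):
        p1 = COND_P1[tuple(labels)]
        labels.append(1 if p1 >= 0.5 else 0)
    return tuple(labels)


def exhaustive_inference():
    """PCC-style inference: score all 2^m label combinations, keep the mode."""
    return max(product((0, 1), repeat=NUM_LABELS), key=joint_prob)


greedy = greedy_inference()
exact = exhaustive_inference()
print("greedy:", greedy, joint_prob(greedy))
print("exhaustive:", exact, joint_prob(exact))
```

With these toy values the greedy pass commits to the first label being absent and ends on a combination that is less probable than the joint mode found by enumeration. The enumeration scores 2^m combinations for m labels, which is the cost that motivates the alternatives discussed in this review.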
Consequently, several efforts are currently under way to overcome this drawback using different search and sampling strategies. These include uniform-cost search [20], beam search [21, 22] and Monte Carlo sampling [20, 23, 24]. All of these algorithms successfully estimate the optimal solution reached by the original PCC [9], while reducing the computational cost in terms of both the number of candidate solutions explored and execution time.

This paper studies in depth the behavior and properties of all these algorithms, comparing their strategies and establishing their differences and similarities, paying special attention to the meaning of their parameters and the effect of the different values they can take. The methods are experimentally compared over a wide range of multi-label datasets, concluding that even those that do not theoretically guarantee optimal solutions reach good performance. Among them, the ε-approximate algorithm proves to be a promising choice, even for values of ε that do not guarantee reaching optimal solutions. For this algorithm, giving up the guarantee of optimality reduces the computational cost in terms of candidate solutions explored and execution time, and vice versa.

The rest of the paper is organized as follows. Section 2 formally describes the multi-label framework and the principles of PCC. Section 3 discusses the properties and behavior of the different existing approaches for inference in PCC. Exhaustive experiments are shown and discussed in Section 4. Finally, Section 5 draws some conclusions and outlines directions for future work.
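As an illustration of one of the search strategies named in the introduction, the following is a minimal sketch of beam search over a label chain. It is a generic beam search, not necessarily the exact variant evaluated in the paper, and the probability table `COND_P1`, the function name `beam_search` and the parameter `beam_width` are hypothetical names introduced only for this example.

```python
import heapq

# Hypothetical values of P(y_j = 1 | x, y_1..y_{j-1}) for one fixed instance x.
COND_P1 = {
    (): 0.40,    # P(y_1 = 1 | x)
    (0,): 0.45,  # P(y_2 = 1 | x, y_1 = 0)
    (1,): 0.90,  # P(y_2 = 1 | x, y_1 = 1)
}
NUM_LABELS = 2


def beam_search(beam_width):
    """Keep the beam_width most probable partial label vectors at each level."""
    beam = [(1.0, ())]  # (probability of the prefix, prefix)
    for _ in range(NUM_LABELS):
        candidates = []
        for prob, prefix in beam:
            p1 = COND_P1[prefix]
            # Extend the prefix with both possible values of the next label.
            candidates.append((prob * (1.0 - p1), prefix + (0,)))
            candidates.append((prob * p1, prefix + (1,)))
        beam = heapq.nlargest(beam_width, candidates)
    return max(beam)  # (joint probability, full label vector)


for width in (1, 2):
    prob, labels = beam_search(width)
    print(f"beam_width={width}: labels={labels}, probability={prob:.2f}")
```

In this sketch a beam width of 1 behaves like the greedy search of CC, while a wider beam keeps more partial solutions alive and, on this toy table, recovers the most probable label combination. The width therefore trades the number of candidate solutions explored against the chance of reaching the joint mode.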
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="">doi:10.1002/widm.1185</a> <a target="_blank" rel="external noopener" href="">fatcat:wf2hoo5zcrcbvptoktyl2tno54</a> </span>