Building acceptable classification models for financial engineering applications

David Martens
2008 SIGKDD Explorations  
Classification is a popular data mining task in which the value of a discrete (dependent) variable is predicted based on the values of several independent variables. In this research, we investigate how predictive classification models can be inferred from the available data. The classification models are required to make good predictions and to be comprehensible and intuitive. Humanly understandable and intuitive models are of crucial importance in any domain where the model needs to be validated before it can be implemented, such as medical diagnosis and credit scoring. A classification model that is accurate, comprehensible and intuitive is defined in this thesis as acceptable for implementation. Building such acceptable models is the goal of this text. We examine how rule-based classifiers can be built that satisfy these requirements. In a first approach, we use rule extraction from Support Vector Machines (SVMs) to extract rules that are accurate, comprehensible, and mimic the SVM model as closely as possible. Next, the use of artificial ant colonies for classification is studied, attempting to induce acceptable classification models from data. In a final part, we discuss the application of the investigated algorithms to real-life case studies, such as the prediction of defaults, going-concern opinions, software faults, and business/ICT alignment.

SVM Rule Extraction

We examine how rule-based classifiers can be built that satisfy the aforementioned prerequisites. An initial exploratory benchmarking study examines the current opportunities for SVM rule extraction [2]. With these lessons learnt, a new methodology is developed for SVM rule extraction: the Active Learning Based Approach (ALBA) [6]. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these typically lie close to the decision boundary. Active learning implies focusing on the apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary, where most of the noise is found. By generating extra data close to these support vectors and labeling it with the trained SVM model, rule induction techniques are better able to discover suitable discrimination rules.
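The ALBA idea can be illustrated with a minimal sketch. This is not the original implementation from the thesis: it assumes scikit-learn's SVC and a shallow DecisionTreeClassifier as a stand-in for the rule induction step, and the noise scale and sample counts are arbitrary illustrative choices.

```python
# Hypothetical ALBA-style workflow sketch (illustrative assumptions, not the thesis code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

# 1. Train the (black-box) SVM on the original data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# 2. The support vectors lie close to the decision boundary; generate extra
#    data points in their neighbourhood by adding small Gaussian noise.
rng = np.random.default_rng(0)
extra = np.vstack([sv + rng.normal(scale=0.1, size=(20, X.shape[1]))
                   for sv in svm.support_vectors_])

# 3. Label both the original and the synthetic points with the trained SVM,
#    so the rule learner mimics the SVM rather than the noisy original labels.
X_aug = np.vstack([X, extra])
y_aug = np.concatenate([svm.predict(X), svm.predict(extra)])

# 4. Induce comprehensible rules on the augmented data (a shallow decision
#    tree here, standing in for the rule induction techniques in the thesis).
tree = DecisionTreeClassifier(max_depth=3).fit(X_aug, y_aug)
print(export_text(tree))
```

Relabeling all points with the SVM's own predictions is what makes the extracted rules an approximation of the SVM model rather than of the raw data; the extra points near the support vectors give the rule learner more resolution exactly where the decision boundary lies.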
doi:10.1145/1540276.1540285