Knowledge Discovery in Databases

Roland Düsing
2000 Wirtschaftsinformatik  
This is a manuscript of a textbook evolving from research and three years of teaching at the Hong Kong University of Science and Technology. The textbook gives an introduction to the fascinating field of knowledge discovery in databases, sometimes called data mining. The manuscript is suited for beginners, who can leave out the more advanced sections, as well as for people who would like to do research in this area. The manuscript emphasizes our own discovery techniques. Statistical and neural
approaches are not discussed in depth since many excellent textbooks covering those topics are available.

A Case Study

We obtained data about 96 French firms. For each firm we got the information shown in figure 1.2. An expert has given two kinds of appraisals: premises and conclusions. The premises consist of 10 weighted criteria (see figure 1.2): dynamic (abbreviated by dyn), professionally competent (pro), audacious (aud), innovative (inn), appreciated (app), reputable (rep), enterprising (ent), powerful (puis), reliable (rel) and intelligently managed (int). The conclusions are the position of the firm on the McKinsey matrix. This matrix measures on one axis the competitiveness of the firm within its sector, and on the other axis whether the sector is financially interesting or not. The task is therefore to predict the competitiveness from the weighted premises. In fact, we only looked at the competitiveness of the firms because, as the people who provided us with the data assured us, there is no correlation between the premises and the axis denoted sector.

We store each premise of each firm as a weighted fact (i.e. weighted background information) and store the weighted competitiveness of the firms as training examples. We thus discover or learn rules with head comp(X) and with bodies consisting of literals built from the premise predicates. The learned rules will predict the competitiveness from the premises. The expert classified each premise of a company into one of seven columns (very bad to very good, see figure 1.2). We simply divide the closed interval [0, 1] equally by the seven numbers 0, 0.17, 0.33, 0.5, 0.67, 0.83, 1 and attach these numbers as weights to the premise facts. The McKinsey matrix has 3 columns, so that a firm can be seen as noncompetitive, average, or very competitive. The weights of the competitiveness are thus 0, 0.5 or 1.
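The seven-point scale can be mapped to weights as a small sketch. The intermediate column labels between "very bad" and "very good" are not given in the text, so the names used here are assumptions; only the seven weights 0, 0.17, 0.33, 0.5, 0.67, 0.83, 1 come from the source.

```python
# Sketch of the discretization: divide [0, 1] equally into seven weights.
# The column labels other than "very bad" and "very good" are assumed names.
SCALE = ["very bad", "bad", "rather bad", "average",
         "rather good", "good", "very good"]

# Seven equally spaced points in [0, 1], rounded to two decimals:
# 0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.
WEIGHTS = [round(i / 6, 2) for i in range(7)]

def weight(category: str) -> float:
    """Return the weight attached to a premise classified into `category`."""
    return WEIGHTS[SCALE.index(category)]
```

The competitiveness weights follow the same idea with three columns, giving 0, 0.5 and 1.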
For the firm shown in figure 1.2, for example, we store the facts dyn(firm_1): 0.5, pro(firm_1): 0.83 and the training example example(comp, firm_1, 0.5). The data from these 96 firms were stored in a database and knowledge discovery system.
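A minimal sketch of how the weighted facts and training examples for firm_1 might be held in memory, assuming a simple dictionary representation (the data structure is an illustration, not the system described in the text):

```python
# Weighted background facts: (predicate, firm) -> weight in [0, 1].
facts = {
    ("dyn", "firm_1"): 0.5,   # dynamic
    ("pro", "firm_1"): 0.83,  # professionally competent
}

# Training examples: weighted competitiveness of each firm (0, 0.5 or 1).
examples = {
    ("comp", "firm_1"): 0.5,  # firm_1 is of average competitiveness
}

def fact_weight(predicate: str, firm: str) -> float:
    """Weight of a premise fact; 0.0 if the fact is not stored."""
    return facts.get((predicate, firm), 0.0)
```

Learned rules with head comp(X) would then be evaluated against the weights returned by such lookups.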
doi:10.1007/bf03250720