On data mining, compression, and Kolmogorov complexity

Christos Faloutsos, Vasileios Megalooikonomou
2007, Data Mining and Knowledge Discovery
Will we ever have a theory of data mining analogous to the relational algebra in databases? Why do we have so many clearly different clustering algorithms? Could data mining be automated? We show that the answer to all these questions is negative, because data mining is closely related to compression and Kolmogorov complexity; and the latter is undecidable. Therefore, data mining will always be an art, where our goal will be to find better models (patterns) that fit our datasets as best as ...

Example 1 (Outlier) Find the outlier in a cloud of 2-d points, as shown in Fig. 1. Looking at the linear-linear scatter-plot of Fig. 1(a), we would argue that the point labeled X is the outlier, since it is distant from all the other points. Point X is at (1024, 1024). Figure 1(b) shows the same dataset in log-log scales. Now, point Y at (17, 17) seems like the outlier; all the other points are equispaced, because their coordinates are powers of 2: (1, 1), (2, 2), ..., (2^i, 2^i), for 1 ≤ i ≤ 10. Once we are told that almost all the data points are at powers of 2, most people would tend to consider point Y at (17, 17) as the outlier. Why?

Some people may justify their choice of Y by saying that the log-log scatter-plot of Fig. 1(b) reveals more structure than its linear-linear counterpart; since point Y violates this structure, Y is the outlier. This answer is still qualitative, but it can bring us one step closer to our final destination. How do humans measure 'structure' and 'order'? This is where compression and Kolmogorov complexity help: the log-log scatter-plot is easier to describe, since it consists of equispaced points, except for Y. Taking logarithms helped us discover a pattern that was not obvious in the linear-linear plot.

A real example is shown in Fig. 2, in which the area is plotted against the population of 235 countries of the world in 2001. Again, in linear-linear scales, we see no major pattern except for a large cluster near the origin and a few countries as outliers, that is, countries with large population or large area, or both:
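Returning to Example 1, the following Python sketch may help make the contrast between the two views concrete. It is only an illustration, not the authors' method: the function names are hypothetical, the exponents are assumed to run from 0 to 10 so that both (1, 1) and X = (1024, 1024) are included, and "distance from the nearest neighbor" and "misfit to a power-of-2 pattern" are simple stand-ins for the informal notions of distance and structure used in the text.

```python
import math

# Points from Example 1: powers of 2 on the diagonal, plus Y = (17, 17).
# Assumption: exponents 0..10, so (1, 1) and X = (1024, 1024) are both present.
points = [(2 ** i, 2 ** i) for i in range(11)] + [(17, 17)]


def nearest_neighbor_distance(p, pts):
    """Euclidean distance from p to its closest other point (linear scale)."""
    return min(math.dist(p, q) for q in pts if q != p)


def linear_outlier(pts):
    """Linear-linear view: the point most isolated from all the others."""
    return max(pts, key=lambda p: nearest_neighbor_distance(p, pts))


def pattern_misfit(p):
    """Log-log view: how far log2 of each coordinate is from an integer.

    Exact powers of 2 give 0; any other value breaks the equispaced pattern.
    """
    return sum(abs(math.log2(c) - round(math.log2(c))) for c in p)


def log_outlier(pts):
    """The point that fits the 'coordinates are powers of 2' pattern worst."""
    return max(pts, key=pattern_misfit)


def exceptions(pts, tol=1e-9):
    """Points that would have to be described separately from the pattern."""
    return [p for p in pts if pattern_misfit(p) > tol]


print("linear-linear outlier:", linear_outlier(points))
print("log-log outlier:", log_outlier(points))
print("exceptions to the pattern:", exceptions(points))
```

Under these assumptions, the linear view singles out X and the pattern-based view singles out Y. The list returned by `exceptions` hints at the compression argument: a description of the form "all powers of 2 up to 1024, plus the listed exceptions" is short exactly because only one point needs to be spelled out.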
doi:10.1007/s10618-006-0057-3 fatcat:czp3ylwbyfdudpr44fh2vjqfna