Data set preprocessing and transformation in a database system

Carlos Ordonez
2011 Intelligent Data Analysis  
In general, there is a significant amount of data mining analysis performed outside a database system, which creates many data management issues. This article presents a summary of our experience and recommendations to compute data set preprocessing and transformation inside a database system (i.e. data cleaning, record selection, summarization, denormalization, variable creation, coding), which is the most time-consuming task in data mining projects. This aspect is largely ignored in the
more » ... ture. We present practical issues, common solutions and lessons learned when preparing and transforming data sets with the SQL language, based on experience from real-life projects. We then provide specific guidelines to translate programs written in a traditional programming language into SQL statements. Based on successful real-life projects, we present time performance comparisons between SQL code running inside the database system and external data mining programs. We highlight which steps in data mining projects become faster when processed by the database system. More importantly, we identify advantages and disadvantages from a practical standpoint based on data mining users feedback.
doi:10.3233/ida-2011-0485 fatcat:7jifmmcbdbbmribvdylnx3eqri