Enhancing ER Diagrams to View Data Transformations Computed with Queries

Carlos Ordonez, Ladjel Bellatreche
2019 International Workshop on Data Warehousing and OLAP  
Transforming relational tables to build a data set takes most of the time in a machine learning (ML) project centered around a relational database. The explanation is simple: a relational database has a collection of tables that are joined and aggregated with complex relational queries, and whose columns are transformed with complex SQL expressions, in order to build the required data set. In general, such data is wide, gathering many ML variables together. Such complicated data pre-processing
more » ... esults in a large set of SQL queries that are independently developed from each other for different ML models. The database grows with important tables and views that are absent in the original ER diagram. More importantly, similar SQL queries tend to be written multiple times, creating problems in database evolution, disk space utilization and software maintenance. In this paper, we go in opposite direction from a physical level (tables) to a logical level (entities) representation, providing a unifying diagram of both levels. Specifically, we propose minimal, but powerful, extensions to an ER diagram in UML notation to represent most common database transformations. Our "transformation" ER diagram helps analytic users understanding complex transformations, consolidating columns representing analytic variables into fewer tables (i.e. eliminating redundant tables), reusing existing SQL queries (i.e. avoid forking new queries) and explaining data provenance (where data originated from).
dblp:conf/dolap/0001B19 fatcat:x7b253w2uvasdfhzgvah75sncm