Extended aggregations for databases with referential integrity issues

Javier García-García, Carlos Ordonez
2010 Data & Knowledge Engineering  
Querying databases with incomplete or inconsistent content remains a broad and difficult problem. In this work, we study how to improve aggregations computed on databases with referential errors in the context of database integration, where each source database has different tables, with columns of similar content across multiple databases but different referential integrity constraints. Thus, a query in an integrated database may involve tables and columns with referential integrity errors. In a data warehouse, even though the ETL processes fix referential integrity errors, this is generally done by inserting "dummy" records into the dimension tables corresponding to such invalid foreign keys, thereby artificially enforcing referential integrity. When two tables are joined and aggregations are computed, rows with an invalid or null foreign key value are skipped, effectively eliminating potentially valuable information. With that motivation in mind, we extend SQL aggregate functions computed over tables with referential integrity errors to return complete answer sets, in the sense that no row is excluded. We associate with each referenced key in the dimension table a probability that invalid or null foreign keys refer to it. Our main idea is to compute aggregations over joined tables, including rows with invalid or null references, by distributing their contribution to aggregation totals based on probabilities computed over correct foreign keys. Therefore, our extended aggregations can return improved answer sets in databases that violate referential integrity or have referential issues. Experiments with real and synthetic databases evaluate the usefulness, accuracy and performance of our extended aggregations. This is the author's version. Official version published in Elsevier DKE, 69(1): 2010
doi:10.1016/j.datak.2009.08.008 fatcat:jd2oiwy65bbpvd2v3wgaptkc6e
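
The main idea in the abstract, distributing the contribution of rows with invalid or null foreign keys across the referenced keys, can be illustrated with a minimal Python sketch. This is not the authors' SQL implementation. The function name `extended_sum` and the parameters `fact_rows` and `dimension_keys` are hypothetical, and the sketch assumes that the probability of each referenced key is its relative frequency among the rows with valid foreign keys, which is only one possible way to compute probabilities "over correct foreign keys".

```python
from collections import defaultdict


def extended_sum(fact_rows, dimension_keys):
    """Sum amounts per dimension key without dropping dangling rows.

    fact_rows: iterable of (foreign_key, amount); the key may be None
        or a value missing from dimension_keys (an invalid reference).
    dimension_keys: set of valid referenced keys.

    Assumption (not from the source): P(key) is the fraction of valid
    rows that reference that key.
    """
    valid_totals = defaultdict(float)
    valid_counts = defaultdict(int)
    dangling_total = 0.0

    for fk, amount in fact_rows:
        if fk in dimension_keys:
            valid_totals[fk] += amount
            valid_counts[fk] += 1
        else:
            # Null or invalid foreign key: keep the amount for redistribution.
            dangling_total += amount

    n_valid = sum(valid_counts.values())
    n_keys = len(dimension_keys)
    result = {}
    for key in dimension_keys:
        if n_valid > 0:
            share = dangling_total * valid_counts[key] / n_valid
        else:
            # No valid row to estimate from: fall back to a uniform split
            # (another assumption of this sketch).
            share = dangling_total / n_keys
        result[key] = valid_totals[key] + share
    return result


rows = [("A", 10.0), ("A", 30.0), ("B", 20.0), (None, 9.0), ("Z", 3.0)]
totals = extended_sum(rows, {"A", "B"})
for key in sorted(totals):
    print(key, totals[key])
```

In this example the two dangling rows (one null key, one invalid key `"Z"`) carry a total of 12, which is split 2:1 between `A` and `B` because `A` is referenced by two valid rows and `B` by one. A plain inner join followed by `SUM` would report 40 and 20, while the sketch reports 48 and 24, so the grand total of 72 is preserved.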