Letter from the Special Issue Editors

Shazia W. Sadiq, Divesh Srivastava
IEEE Data Engineering Bulletin, 2016
The prevalence of large volumes and varieties of accessible data is profoundly changing the way business, government and individuals approach decision making. Organizational big data investment strategies regarding what data to collect, clean, integrate, and analyze are typically driven by some notion of perceived value. However, the value of the data is inescapably tied to the underlying quality of the data. Although for big data, value and quality may be correlated, they are conceptually different. For example, a complete and accurate list of the books read on April 1, 2016 by the special editors of this issue may not have much value to anyone else, whereas even partially complete and somewhat noisy GPS data from public transport vehicles may have high perceived value for transport engineers and urban planners.

In spite of significant advances in storage and compute capabilities, the time to value in big data projects often remains unacceptable due to the quality of the underlying data. Poor data quality has been termed the dark side of big data, inhibiting the effective use of data to discover trusted insights and foresights. Finding the nexus of use and quality is a multifaceted problem encompassing organizational and computational challenges. These challenges are often specific to the type of data (e.g. structured/relational, text, spatial, time series, social/graph, multimedia, RDF/web), the dimension of data quality (e.g. completeness, consistency, timeliness), and the preparatory processes (e.g. data acquisition, profiling, curation, integration) that precede the actual use of the data. Designing a practical strategy for tackling quality issues in big data requires data scientists to bring together these multiple aspects of data type, quality dimension and process within the context of their application setting.

In this special issue we have endeavoured to present recent research of some of the leading experts in the field of data quality, with the aim of informing the design of such practical strategies. Of the eight papers, four are on relational/structured data, while the remaining four are on time series data, spatio-temporal data, micro-blog data and web data. The papers target a number of data quality dimensions through a range of innovative approaches, as outlined below.

The first two papers tackle the data quality dimensions of meta-data compliance and schema quality. Sebastian Kruse, Thorsten Papenbrock, Hazar Harmouch, and Felix Naumann present data anamnesis as a means of meta-data discovery, with the aim of assessing the quality and utility of the underlying relational datasets. In the second paper, Henning Köhler, Sebastian Link and Xiaofang Zhou present a method for discovering meaningful certain keys in the presence of incomplete and inconsistent data, with the aim of tackling redundancy and maintaining the integrity constraints of the underlying relational data.

The next two papers discuss data cleaning in the context of associated data transformation and curation activities. These works are instrumental in evaluating the effectiveness of data cleaning algorithms. A number of data quality dimensions are covered by these papers, including value, format and semantic consistency, and business rule compliance. The paper by Ihab Ilyas proposes decoupling the detection of data errors from their repair within a continuous data cleaning life-cycle with humans in the loop. The paper by