Report from the 4th Workshop on Extremely Large Databases

Jacek Becla, Kian-Tat Lim, Daniel Liwei Wang
2010 Data Science Journal  
Academic and industrial users increasingly face the challenge of petabytes of data, but managing and analyzing such large data sets remains a daunting task. The 4th Extremely Large Databases workshop was organized to examine the needs of communities that face these issues but were under-represented at past workshops. Approaches to big data statistical analytics and opportunities arising from emerging hardware technologies were also debated. Writable extreme scale databases and the science benchmark were discussed. This paper is the final report of the discussions and activities at this workshop.
The 4th XLDB workshop (XLDB4) focused on challenges and solutions in the oil/gas, finance, and medical/bioinformatics communities, as well as several cross-domain big data topics. The three domain-specific panels expressed similar concerns about an explosion of data and the limits of the current state of the art, despite having different applications and analyses. All three communities (and others present) were struggling with the same challenges: integrating disparate data sets, including unstructured or semi-structured data; noise and data cleansing; and building and deploying complex analytical models in rapidly changing environments.
The oil/gas exploration and production business analyzed petascale seismic and sensor data using both proprietary rendering algorithms and common scientific techniques like curve fitting, usually with highly summarized data. The refining and chemicals business had terabyte-scale, but growing, datasets. Most processing of historical financial transaction data was offline, highly parallelizable, and used relatively simple summarization algorithms, although the results often fed into more complex models. Those models may then be applied, especially by credit card processors, to real-time transactions using extremely low-latency stream processing systems. In the medical/bioinformatics field, the large datasets were produced by high-throughput sequencing and other laboratory techniques, as well as by increasingly electronic medical records (including images). Applications here included shape searching, similarity finding, disease modeling, and fault diagnosis in drug production. The medical community stood out for its non-technical issues, including strict regulation and minimal data sharing.
Progress was made on the science benchmark that was conceived at previous XLDB workshops.
This benchmark was created to provide concrete examples of science needs for database providers and to drive solutions for current and emerging needs. Its specifications and details have now been published. The next iteration will go beyond processing of images and time series of images to include use cases from additional science domains. Statistical analysis tools and techniques were reportedly insufficient for big, distributed data sets. First, statistical tools should be developed to scale efficiently to big data sizes. Second, approximation and sampling techniques should be used more often with large data sets, since they can reduce the computational cost dramatically. Finally, existing statistical tools should be made easier for non-specialists to use. New hardware developments have made big data computation more accessible, though uncertain in some ways. Power is the biggest issue and one that will drive the future of hardware as well as analysis. Regarding performance, more evidence of the potential speedup from GPU computing was shown through examples of
doi:10.2481/dsj.xldb10 fatcat:hpc7afhbwjadvahngm7utw2i4a