BIG DATA ANALYTICS IN STATIC AND STREAMING PROVENANCE

Peng Chen, David Crandall, Ryan Newton, Tom Evans, Yuan Luo, Yu Luo, Milinda Pathirage, Zong Peng, Guangchen Ruan, Isuru Suriarachchi, Gabriel Zhou, Jiaan (+4 others)
2016 unpublished
Copyright © 2016 Peng Chen

To my wife Shuya, my parents, and my grandparents.

Acknowledgements

First, I would like to express my sincere gratitude to my advisor, Professor Beth Plale, for her continuous guidance and support throughout my PhD study. I feel truly fortunate to be able to work with her, and what I have learned from her goes far beyond conducting research. I have been constantly surprised by her insightful thoughts and visionary suggestions. Her dedication, thrust, and energy ... will continue to inspire me.

I would also like to individually thank the rest of my thesis committee. Professor Tom Evans is my minor advisor in Geography, and I was enlightened by his crystal-clear illustration of geographic concepts. His consideration and special sense of humor ensured that my experience collaborating with him was truly enjoyable. Professor David Crandall has always been encouraging as well. He gave me many insightful comments from the very beginning and spent valuable time proofreading and commenting extensively on my thesis. Professor Ryan Newton has a very sharp mind. He showed me different perspectives from which to tackle my research question and helped me frame the solution.

I also want to thank the administrative staff, Jenny Olmes-Stevens and Jodi Stern, for making the lab an enjoyable and productive place to work. Last but not least, I would like to thank my family: my wife, Shuya Xu, and my parents, Yuliang Chen and Meizhen Fang, who have been supportive and encouraging throughout the journey.

Peng Chen
BIG DATA ANALYTICS IN STATIC AND STREAMING PROVENANCE

With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view of the behavior under study over time. However, provenance can be overwhelming in both volume and complexity; the now forecasting potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data that arrives at a high rate.
While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that reduces the high dimensionality while effectively supporting mining tasks such as clustering, classification, and association rule mining; the temporal representation can also be applied to streaming provenance. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.

Beth Plale, Ph.D. (Chairperson)
fatcat:ycntxdxurzhejl3tsyfbjze5y4