Efficiently Processing Workflow Provenance Queries on SPARK [article]

Rajmohan C, Pranay Lohia, Himanshu Gupta, Siddhartha Brahma, Mauricio Hernandez, Sameep Mehta
2018 arXiv   pre-print
In this paper, we investigate how we can leverage the Spark platform for efficiently processing provenance queries on large volumes of workflow provenance data.  ...  We focus on processing provenance queries at the attribute-value level, which is the finest granularity available.  ...  In comparison, our focus is on leveraging the Spark platform for efficiently processing provenance data obtained from a workflow management system and not on capturing provenance data in a Spark workflow.  ... 
arXiv:1808.08424v2 fatcat:lamtyq2x2ras5oi2jspysgt4fi

Big Provenance Stream Processing for Data Intensive Computations

Isuru Suriarachchi, Sachith Withana, Beth Plale
2018 2018 IEEE 14th International Conference on e-Science (e-Science)  
They use transitive closure tables on each node in a provenance graph to improve the efficiency of graph traversal queries.  ...  The workload is a DIC workflow composed of two DICs: DIC1 runs on Hadoop and DIC2 runs on Spark, see Figure 7.2. Each DIC produces a separate provenance stream and each is processed in isolation.  ...  Research Assistant Indiana University, Bloomington Aug 2012 -Dec 2017 • Designer and lead developer of the Komadu provenance repository. • Implemented a Map-Reduce based solution on Azure Cloud for executing  ... 
doi:10.1109/escience.2018.00039 dblp:conf/eScience/SuriarachchiWP18 fatcat:xrkgn66wkzhyxm2z6duattyqdy

Guest editorial: large-scale data curation and metadata management

Mohamed Eltabakh, Boris Glavic
2018 Distributed and parallel databases  
Web-scale provenance reconstruction of implicit information diffusion on social media: The authors present an efficient method for reconstructing the provenance of information diffusion in social media.  ...  ontology matching tool; COACT: a query interface language for collaborative databases; P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance  ... 
doi:10.1007/s10619-017-7217-x fatcat:4i2cs625yfcgnfrvuqmmy7zm5u

Supporting Data Provenance in Data-Intensive Scalable Computing Systems

Matteo Interlandi, Tyson Condie
2018 IEEE Data Engineering Bulletin  
In this paper we report our experience in building Titian: a data provenance system targeting the Apache Spark framework.  ...  Data provenance support is a key building block in libraries that aim to provide debugging support for data processing pipelines.  ...  The process repeats if other stages follow. The rampMapEnd in the final stage materializes all nested provenance IDs in HDFS. Querying.  ... 
dblp:journals/debu/InterlandiC17 fatcat:4m5p7lii5rb55cfxzxghoahsl4

Your notebook is not crumby enough, REPLace it

Mike Brachmann, William Spoth, Oliver Kennedy, Boris Glavic, Heiko Mueller, Sonia Castelo, Carlos Bautista, Juliana Freire
2020 Conference on Innovative Data Systems Research  
These shortcomings are particularly detrimental for data curation where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis.  ...  Mimir, in turn, is implemented as a query-rewriting front-end over Apache Spark, which handles query evaluation.  ...  Many systems capture database provenance by annotating data and propagating these annotations during query processing.  ... 
dblp:conf/cidr/BrachmannSKGMCB20 fatcat:vcck6uokrvef3ahgdr47gybaoq

Scientific Data Analysis Using Data-Intensive Scalable Computing: The SciDISC Project

Patrick Valduriez, Marta Mattoso, Reza Akbarinia, Heraldo Borges, Jose J. Camata, Alvaro L. G. A. Coutinho, Daniel Gaspar, Noel Moreno Lemus, Ji Liu, Hermano Lustosa, Florent Masseglia, Fabrício Nogueira da Silva (+8 others)
2018 Very Large Data Bases Conference  
This paper introduces the motivations and objectives of the project, and reports on the first results achieved so far.  ...  SciDISC is an ongoing project between Inria, several research institutions in Rio de Janeiro and NYU.  ...  The experiments in SciDISC are carried out using the Inria Grid'5000 testbed, NACAD/COPPE supercomputers and LNCC SINAPAD Santos Dumont supercomputer  ... 
dblp:conf/vldb/ValduriezMABCCG18 fatcat:fk2hp6vhxjebtpgpi7y6p3b24e

Interactive and automated debugging for big data analytics

Muhammad Ali Gulzar
2018 Proceedings of the 40th International Conference on Software Engineering Companion Proceedings - ICSE '18  
We showcase the data provenance and optimized incremental computation features to effectively and efficiently support interactive debugging, and investigate new research directions on how to automatically  ...  pinpoint and repair the root cause of errors in large-scale distributed data processing.  ...  These systems maintain the provenance metadata in external storage and support data provenance queries through a separate programming interface.  ... 
doi:10.1145/3183440.3190334 dblp:conf/icse/Gulzar18 fatcat:o36lxubmjzfqxmv6p2kfkufkia

ProvDB: A System for Lifecycle Management of Collaborative Analysis Workflows [article]

Hui Miao, Amit Chavan, Amol Deshpande
2016 arXiv   pre-print
In this paper, we describe our vision of a unified provenance and metadata management system to support lifecycle management of complex collaborative data science workflows.  ...  We argue that a large amount of information about the analysis processes and data artifacts can, and should be, captured in a semi-passive manner; and we show that querying and analyzing this information  ...  Queries over Version/Workflow Graph and Properties: In a collaborative workflow, provenance queries to identify what revision and which author last modified a line in an artifact are common (e.g., git  ... 
arXiv:1610.04963v1 fatcat:lgyzcdt2knfnfjgea7ufzxpx4y

Efficient Runtime Capture of Multiworkflow Data Using Provenance

Renan Souza, Marta Mattoso, Leonardo Azevedo, Raphael Thiago, Elton Soares, Marcelo Nery, Marco A. S. Netto, Emilio Vital, Renato Cerqueira, Patrick Valduriez
2019 2019 15th International Conference on eScience (eScience)  
We validated ProvLake in a real use case in the O&G industry encompassing four workflows that process 5 TB datasets for a deep learning classifier.  ...  A typical solution in scientific data analysis is to capture and relate the data in a provenance database while the workflows run, thus allowing for data analysis at runtime.  ...  Komadu captures provenance data generated by workflows running on multiple data processing systems.  ... 
doi:10.1109/escience.2019.00047 dblp:conf/eScience/SouzaMATSSNBCV19 fatcat:yab5uixsqzbixdsoaaz6rncoqq

ProvDB: Provenance-enabled Lifecycle Management of Collaborative Data Analysis Workflows

Hui Miao, Amol Deshpande
2018 IEEE Data Engineering Bulletin  
workflows.  ...  novel querying and analysis capabilities for simplifying bookkeeping and debugging tasks for data analysts; and enables a rich new set of capabilities like identifying flaws in the data science process  ...  R4: The query facility should be scalable to large graph and process queries efficiently.  ... 
dblp:journals/debu/0001D18 fatcat:ybx7j6hvnjanbjnmz7wyrmb2te

Adding data provenance support to Apache Spark

Matteo Interlandi, Ari Ekmekji, Kshitij Shah, Muhammad Ali Gulzar, Sai Deep Tetali, Miryung Kim, Todd Millstein, Tyson Condie
2017 The VLDB journal  
To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark.  ...  Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort.  ...  We would also like to thank our industry partners at IBM Research Almaden and Intel for their generous gifts in support of this research.  ... 
doi:10.1007/s00778-017-0474-5 pmid:31007500 pmcid:PMC6474385 fatcat:ppvk5na66zdjrgpilsl7cprsdq

Titian: Data Provenance Support in Spark

Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie
2015 Proceedings of the VLDB Endowment  
To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark.  ...  Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort.  ...  Acknowledgements We thank Mohan Yang, Massimo Mazzeo and Alexander Shkapsky for their discussions and suggestions on early stages of this work.  ... 
pmid:26726305 pmcid:PMC4697929 fatcat:hxqpqvg6i5d2zgm3vnvtjocvwi

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark

Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
2016 Proceedings of the 38th International Conference on Software Engineering - ICSE '16  
To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform.  ...  Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics.  ...  Participants in this project are in part supported through NSF CCF-1527923, CCF-1460325, IIS-1302698, CNS-1351047, and NIH U54EB020404.  ... 
doi:10.1145/2884781.2884813 pmid:27390389 pmcid:PMC4933307 dblp:conf/icse/GulzarIYTCMK16 fatcat:atfa4b4cczehrkslaojivhkosi

HEP Software Foundation Community White Paper Working Group - Data Analysis and Interpretation [article]

Lothar Bauerdick, Riccardo Maria Bianchi, Brian Bockelman, Nuno Castro, Kyle Cranmer, Peter Elmer, Robert Gardner, Maria Girone, Oliver Gutsche, Benedikt Hegner, José M. Hernández, Bodhitha Jayatilaka (+17 others)
2018 arXiv   pre-print
As part of the HEP Software Foundation Community White Paper process, a working group on Data Analysis and Interpretation was formed to assess the challenges and opportunities in HEP data analysis and  ...  develop a roadmap for activities in this area over the next decade.  ...  Removing the requirements on storing intermediate data in the analysis chain would help to "democratize" data analysis and streamline the overall analysis workflow. Ease of Provenance.  ... 
arXiv:1804.03983v1 fatcat:spygvugilvavnahtikugcln53y

Geoweaver: Advanced Cyberinfrastructure for Managing Hybrid Geoscientific AI Workflows

Ziheng Sun, Liping Di, Annie Burgess, Jason A. Tullis, Andrew B. Magill
2020 ISPRS International Journal of Geo-Information  
However, none of the existing workflow management software provides a satisfying solution on hybrid resources, full file access, data flow, code control, and provenance.  ...  This paper introduces a new system named Geoweaver to improve the efficiency of full-stack AI workflow management.  ...  Thanks to our colleagues in George Mason University and many other institutes who gave kind advice on the project development.  ... 
doi:10.3390/ijgi9020119 fatcat:foijkptf6bfdtlmhjpptsa2ywe