Retrospective Provenance Without a Runtime Provenance Recorder
Workshop on the Theory and Practice of Provenance
The YesWorkflow (YW) toolkit aims to provide users of scripting languages such as Python, Perl, and R with many of the benefits of scientific workflow automation. YW requires neither the use of a workflow engine nor the overhead of adapting or instrumenting code to run in such a system. Instead, YW enables scientists to annotate their scripts with special comments that reveal the main computational blocks and dataflow dependencies otherwise implicit in scripts. YW tools extract and analyze
... comments, represent scripts in terms of entities based on a typical scientific workflow model, and provide graphical workflow views (i.e., prospective provenance) of scripts. In this paper, we present a new extension of YW for inferring retrospective provenance from script executions without relying on a runtime provenance recorder. Instead, we exploit scientists' common practice of embedding important pieces of provenance in directory structures and file names. For such "provenance-friendly" data organizations, we offer a new annotation mechanism based on URI templates. YW uses these templates to link conceptual-level prospective provenance with data files created at runtime, resulting in a powerful, integrated model of prospective and retrospective provenance. We present scientifically meaningful retrospective provenance queries for investigating an execution of a data acquisition workflow implemented as a Python script, and show how these queries can be evaluated using the YW toolkit.
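To make the URI-template idea concrete, the following is a minimal sketch of how a template with named placeholders might be matched against files left on disk after a run, recovering the provenance values encoded in each path. It is an illustration only and not the YW toolkit's implementation: the function names, the template, and the placeholder names (sample_id, energy, frame) are hypothetical. It also assumes that each placeholder stands for a single path segment (or part of one) and appears only once in the template.

```python
import re


def template_to_regex(template):
    """Compile a template such as 'run/{sample_id}/image_{frame}.raw' into a regex.

    Assumption: every {name} placeholder matches one or more characters
    other than '/', and each name occurs only once in the template.
    """
    # re.split with a capturing group yields alternating literal text
    # and placeholder names: [literal, name, literal, name, ..., literal].
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)           # literal path text
        else:
            pattern += "(?P<%s>[^/]+)" % part    # named placeholder
    return re.compile("^" + pattern + "$")


def match_paths(template, paths):
    """Return one dict per matching path: the path plus its placeholder values."""
    regex = template_to_regex(template)
    rows = []
    for path in paths:
        match = regex.match(path)
        if match is not None:
            row = {"path": path}
            row.update(match.groupdict())
            rows.append(row)
    return rows


# Hypothetical template and file listing, for illustration only.
TEMPLATE = "run/raw/{sample_id}/e{energy}/image_{frame}.raw"
PATHS = [
    "run/raw/DRT240/e10000/image_001.raw",
    "run/raw/DRT322/e11000/image_030.raw",
    "run/rejected_samples.txt",
]

for row in match_paths(TEMPLATE, PATHS):
    print(row)
```

Each returned record pairs a concrete file with the values bound to the template's placeholders, which is the kind of information a retrospective query (for example, "which samples produced raw images at a given energy?") would need. Paths that do not fit the template are simply skipped.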