SPADE: Support for Provenance Auditing in Distributed Environments
Lecture Notes in Computer Science
SPADE is an open source software infrastructure for data provenance collection and management. The underlying data model used throughout the system is graph-based, consisting of vertices and directed edges that are modeled after the node and relationship types described in the Open Provenance Model. The system has been designed to decouple the collection, storage, and querying of provenance metadata. At its core is a novel provenance kernel that mediates between the producers and consumers of provenance information, and handles the persistent storage of records. It operates as a service, peering with remote instances to enable distributed provenance queries. The provenance kernel on each host handles the buffering, filtering, and multiplexing of incoming metadata from multiple sources, including the operating system, applications, and manual curation. Provenance elements can be located locally with queries that use wildcard, fuzzy, proximity, range, and Boolean operators. Ancestor and descendant queries are transparently propagated across hosts until a terminating expression is satisfied, while distributed path queries are accelerated with provenance sketches.
SPADEv2 is the second generation of our data provenance collection, management, and analysis software infrastructure. The underlying data model used throughout the system is graph-based, consisting of vertices and directed edges, each of which can be labeled with an arbitrary number of annotations (in the form of key-value pairs). These annotations can be used to embed the domain-specific semantics of the provenance. The system has been completely re-architected to decouple the production, storage, and utilization of provenance metadata, as illustrated in Figure 1. At its core is a novel provenance kernel that mediates between the producers and consumers of provenance information, and handles the persistent storage of records. The kernel handles buffering, filtering, and multiplexing incoming metadata from multiple provenance sources. It can be configured to commit the elements to multiple databases, and responds to concurrent queries from local and remote clients. The kernel also supports modules that operate on the stream of provenance graph elements, allowing the aggregation, fusion, and composition of provenance elements to be customized by a series of filters.
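The flow described above (sources report annotated graph elements, the kernel buffers them, passes them through a chain of filters, and commits the survivors to every configured store) can be illustrated with a minimal Python sketch. This is not SPADE's actual API: the names `Vertex`, `Edge`, `Kernel`, `MemoryStorage`, `drop_temp`, and `tag_host` are hypothetical and chosen only for illustration, and the in-memory list stands in for a real database.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Vertex:
    id: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    annotations: Dict[str, str] = field(default_factory=dict)


Element = Union[Vertex, Edge]
# A filter returns the (possibly rewritten) element, or None to drop it.
Filter = Callable[[Element], Optional[Element]]


class MemoryStorage:
    """Hypothetical stand-in for a persistent store."""

    def __init__(self) -> None:
        self.elements: List[Element] = []

    def put(self, element: Element) -> None:
        self.elements.append(element)


class Kernel:
    """Buffers incoming elements, filters them, and commits to all stores."""

    def __init__(self, filters: List[Filter], storages: List[MemoryStorage]) -> None:
        self.buffer: deque = deque()
        self.filters = filters
        self.storages = storages

    def report(self, element: Element) -> None:
        # Called by any provenance source; elements wait in the buffer.
        self.buffer.append(element)

    def flush(self) -> int:
        # Drain the buffer in arrival order; return how many elements were committed.
        committed = 0
        while self.buffer:
            element: Optional[Element] = self.buffer.popleft()
            for apply_filter in self.filters:
                element = apply_filter(element)
                if element is None:
                    break  # a filter dropped the element
            if element is None:
                continue
            for storage in self.storages:
                storage.put(element)  # multiplex to every configured store
            committed += 1
        return committed


def drop_temp(element: Element) -> Optional[Element]:
    # Illustrative filter: discard elements annotated with a path under /tmp/.
    if element.annotations.get("path", "").startswith("/tmp/"):
        return None
    return element


def tag_host(host: str) -> Filter:
    # Illustrative filter factory: annotate each element with its origin host.
    def _tag(element: Element) -> Optional[Element]:
        element.annotations["host"] = host
        return element
    return _tag
```

Because filters are plain callables applied in order, aggregation or rewriting steps can be added or removed without touching the sources or the stores, which is the decoupling the text describes.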
SPADEv2 supports the Open Provenance Model [42, 47] and includes controlling Agent, executing Process, and data Artifact node types, as well as dependency types that relate which process wasControlledBy which agent, which artifact wasGeneratedBy which process, which process used which artifact, which process wasTriggeredBy which other process, and which artifact wasDerivedFrom which other artifact. Table 1 illustrates how each of these nodes and dependencies is represented.
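The node and dependency types listed above constrain which vertex types an edge may connect. The following Python sketch encodes those constraints and uses them to validate edges and to compute the ancestors of a vertex. It is an illustration under assumptions, not SPADE's implementation: `OPMGraph`, `add_vertex`, `add_edge`, and `ancestors` are hypothetical names, and the representation in Table 1 may differ.

```python
from typing import Dict, List, Set, Tuple

# Allowed (source type, destination type) for each dependency, as listed in the text.
OPM_DEPENDENCIES: Dict[str, Tuple[str, str]] = {
    "wasControlledBy": ("Process", "Agent"),
    "wasGeneratedBy": ("Artifact", "Process"),
    "used": ("Process", "Artifact"),
    "wasTriggeredBy": ("Process", "Process"),
    "wasDerivedFrom": ("Artifact", "Artifact"),
}


class OPMGraph:
    """Hypothetical typed provenance graph with key-value annotations."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Dict[str, str]] = {}
        self.edges: List[Tuple[str, str, str]] = []

    def add_vertex(self, vertex_id: str, vertex_type: str, **annotations: str) -> None:
        if vertex_type not in ("Agent", "Process", "Artifact"):
            raise ValueError(f"unknown vertex type: {vertex_type}")
        self.vertices[vertex_id] = {"type": vertex_type, **annotations}

    def add_edge(self, src: str, dst: str, dependency: str) -> None:
        if dependency not in OPM_DEPENDENCIES:
            raise ValueError(f"unknown dependency: {dependency}")
        expected = OPM_DEPENDENCIES[dependency]
        actual = (self.vertices[src]["type"], self.vertices[dst]["type"])
        if actual != expected:
            raise ValueError(f"{dependency} must connect {expected}, got {actual}")
        self.edges.append((src, dst, dependency))

    def ancestors(self, vertex_id: str) -> Set[str]:
        # Edges point from an effect to its cause, so ancestors are
        # found by following outgoing edges transitively.
        seen: Set[str] = set()
        frontier = [vertex_id]
        while frontier:
            current = frontier.pop()
            for src, dst, _ in self.edges:
                if src == current and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
        return seen


graph = OPMGraph()
graph.add_vertex("alice", "Agent")
graph.add_vertex("sort", "Process", pid="4242")
graph.add_vertex("in.txt", "Artifact")
graph.add_vertex("out.txt", "Artifact")
graph.add_edge("sort", "alice", "wasControlledBy")
graph.add_edge("sort", "in.txt", "used")
graph.add_edge("out.txt", "sort", "wasGeneratedBy")
graph.add_edge("out.txt", "in.txt", "wasDerivedFrom")
```

In this example, `graph.ancestors("out.txt")` contains `sort`, `in.txt`, and `alice`, while an ill-typed edge such as `graph.add_edge("alice", "sort", "used")` is rejected with a `ValueError`.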