Integrity constraints for XML

Wenfei Fan, Jérôme Siméon
2003 Journal of Computer and System Sciences (Print)
Integrity constraints have proved fundamentally important in database management. The ID/IDREF mechanism provided by XML DTDs relies on a simple form of constraints to describe references. Yet, this mechanism is sufficient neither for specifying references in XML documents, nor for expressing semantic constraints commonly found in databases. In this paper, we extend XML DTDs with several classes of integrity constraints and investigate the complexity of reasoning about these constraints. The constraints range over keys, foreign keys, inverse constraints as well as ID constraints for capturing the semantics of object identities. They improve semantic specifications and provide a better reference mechanism for native XML applications. They are also useful in information exchange and data integration for preserving the semantics of data originating in relational and object-oriented databases. We establish complexity and axiomatization results for the (finite) implication problems associated with these constraints. In addition, we study implication of more general constraints, such as functional, inclusion and inverse constraints defined in terms of navigation paths.

0022-0000/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. PII: S0022-0000(02)00032-6

As this work is motivated by the need for integrity constraints in practical XML applications, we first illustrate several important application contexts and the limitations of the current ID/IDREF mechanism.

[...] corresponding attribute should uniquely identify an element in the entire document, i.e., it is unique among all ID attributes. An IDREF(S) annotation indicates a reference, i.e., it should contain a (set of) value(s) of some ID attribute(s) present in the document.

Observe that the ID/IDREF mechanism is similar both to the object-identity based notion of references from object-oriented databases [3] and to keys/foreign keys from relational databases. On the one hand, like object identifiers, ID attributes uniquely identify elements within the whole document. On the other hand, as XML has a textual format, the reference semantics is achieved with implicit constraints that must hold on attribute values, in the spirit of relational keys and foreign keys. Yet, the mechanism captures neither the complete semantics of relational keys and foreign keys nor that of object-style references. For instance, isbn should be a key for entry.
Its representation as an ID attribute indeed makes it unique, but among all the ID attributes in the document. This is too strong an assumption, preventing other elements, e.g., books, from using the same isbn number as a key. Worse still, the scope and type of an ID/IDREF attribute are not clear. The to attribute, for instance, could contain a reference to a section or an author element. One has no control over what an IDREF reference points to. Obviously, we would like to constrain such references to entry elements only.

We can resolve these problems by slightly changing the constraints on the attributes involved. More specifically, we can (i) treat the isbn (resp. sid) attribute as a key for entry (resp. section) elements, which uniquely identifies an element among the entry (resp. section) elements, as opposed to all elements in the entire document; (ii) add an inclusion constraint as part of a foreign key, asserting that ref.to is a subset of entry.isbn, where t.l stands for the set of l attribute values of all t elements in a document. That is, for any ref element x and each value v of the to attribute of x, there is an entry element y such that v matches the isbn attribute value of y. Observe that isbn is a key of entry and thus the to attribute is a foreign key of ref that references entry elements. These constraints can be expressed in our language L_u:

Capturing the semantics of legacy repositories: XML is mainly used for data exchange. As a consequence, a large amount of XML data originates in relational or object-oriented databases, for which keys, foreign keys and inverse relationships [2,15] convey a fundamental part of the information. Consider, for instance, the following object-oriented schema (in ODL syntax [15]):

On top of the structure specified by the schema, we have the following: (1) name and dname are keys for the Person and Dept classes, respectively, and (2) there is an inverse relationship between dept and Dept.has_staff.
That is, (1) no distinct Person objects can have the same
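The scoped key and inclusion constraints on entry.isbn and ref.to discussed in the excerpt can be illustrated with a small checker. This is only a sketch, not the paper's language L_u or its algorithms: the sample document, its attribute values, and the helper names `key_violations` and `inclusion_violations` are assumptions made for illustration, and the to attribute is assumed to be whitespace-separated in the manner of IDREFS.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: the names entry, isbn, ref and to follow the text;
# the nesting and the attribute values are invented for this example.
DOC = """
<book>
  <entry isbn="isbn-1"/>
  <entry isbn="isbn-2"/>
  <ref to="isbn-1 isbn-2"/>
  <ref to="isbn-9"/>
</book>
"""

def key_violations(root, tag, attr):
    """Return attr values shared by more than one `tag` element.

    Uniqueness is checked only among `tag` elements, not across the
    whole document as an ID attribute would require.
    """
    counts = Counter(
        e.get(attr) for e in root.iter(tag) if e.get(attr) is not None
    )
    return sorted(v for v, n in counts.items() if n > 1)

def inclusion_violations(root, src_tag, src_attr, dst_tag, dst_attr):
    """Return values of src_tag.src_attr that do not occur in dst_tag.dst_attr."""
    targets = {e.get(dst_attr) for e in root.iter(dst_tag)}
    missing = set()
    for e in root.iter(src_tag):
        # Assumption: the source attribute is set-valued, whitespace-separated.
        for v in (e.get(src_attr) or "").split():
            if v not in targets:
                missing.add(v)
    return sorted(missing)

root = ET.fromstring(DOC)
print(key_violations(root, "entry", "isbn"))                      # []
print(inclusion_violations(root, "ref", "to", "entry", "isbn"))   # ['isbn-9']
```

Here isbn is a key of entry (no duplicates are reported), while the second ref element violates the inclusion ref.to ⊆ entry.isbn because isbn-9 matches no entry. Together the two checks amount to testing that to is a foreign key of ref referencing entry.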
doi:10.1016/s0022-0000(02)00032-6 fatcat:mztjh3xsrveh7ngfvxtsksj34m