Deterministic regular expressions with back-references

Dominik D. Freydenberger, Markus L. Schmid
2019, Journal of Computer and System Sciences (Elsevier BV)
Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction.

... in actual applications:
Originally defined for the ISO standard for SGML (see ), they are a central part of the W3C recommendations on XML DTDs [7] and XML Schema [22] (see Murata et al. [32]). The goal of this paper is to find common ground between these two variants by introducing deterministic regex and an appropriate automaton model, the deterministic memory automata with trap-state (DTMFA).

To elaborate: We first introduce a new automaton model for regex, the memory automata with trap-state (TMFA). While the TMFA is based on the MFA that was proposed by Schmid [35], its deterministic variant, the DTMFA, is better suited for complementation than the deterministic MFA. We then generalize the notion of deterministic regular expressions to regex, and show that the Glushkov construction can also be generalized. This allows us to decide efficiently not only the membership problem for deterministic regex, but also whether a regex is deterministic. After this, we study the expressive power of these models. Although deterministic regex share many of the limitations of deterministic regular expressions (in particular, the inherent non-determinism of some regular languages persists), their expressive power offers some surprises. Finally, we examine a subclass of deterministic regex and DTMFA for which polynomial space minimization is possible, and we consider an alternative notion of determinism.

From the perspective of deterministic regular expressions, this paper proposes a natural extension that significantly increases the expressive power, while still having a tractable membership problem. From a regex point of view, we restrict regex to their deterministic core, thus obtaining a tractable subclass. Hence, the authors intend this paper as a starting point for further work, as it opens a new direction of research into making regex tractable.

For space reasons, detailed proofs are given in a full version of the paper [21].

Main contributions.
The main conceptual contributions of this paper are the notion of determinism in regex and an appropriate deterministic automaton model. The main challenge from this point of view was finding a natural extension of deterministic regular expressions that preserves the following properties: a natural definition of determinism that can be checked efficiently and also has an automata-theoretic characterization, and an efficient Glushkov-style conversion to automata that decide the membership problem efficiently. Regarding technical contributions, the authors would like to emphasize that, in addition to the effort that was needed to accomplish the aforementioned goals, the paper uses subtleties of the back-reference operator in novel ways. By using these, deterministic regex can define non-deterministic regular languages (in particular, all unary regular languages), as well as infinite languages that are not pumpable in the usual sense.

Related work. Regex were first examined from a theoretical point of view by Aho [2], but without fully defining the semantics. There were various proposals for semantics, of which we mention the first by Câmpeanu, Salomaa, and Yu [10], and the recent one by Schmid [35], which is the basis for this paper. Apart from defining the semantics, there was work on the expressive power [10, 11, 20], the static analysis [11, 18, 19], and the tractability of the membership problem (investigated in terms of a strongly restricted subclass of regex) [16, 17]. They have also been compared to related models in database theory, e.g., graph databases [4] and information extraction [15, 19]. Following the original paper by Brüggemann-Klein and Wood [9], deterministic regular expressions have been studied extensively. Aspects include computing the Glushkov automaton and deciding the membership problem (e.g., [8, 24, 34]), static analysis (cf. [31]), deciding whether a regular language is deterministic (e.g., [12, 24, 30]), closure properties and descriptional complexity [28], and learning (e.g., [5]). One noteworthy extension is the addition of counter operators (e.g., [23, 24, 27]), which we briefly address in Section 7.

2 Preliminaries

We use ε to denote the empty word. The subset and proper subset relations are denoted by ⊆ and ⊂, respectively. Let Σ be a finite terminal alphabet. Unless otherwise noted, we assume |Σ| ≥ 2. Let Ξ be an infinite variable alphabet with Ξ ∩ Σ = ∅. Let w ∈ Σ^*; then, for every i, w[i] denotes the symbol at position i of w. We define w^0 := ε and w^{i+1} := w^i · w for all i ≥ 0, and, for w = a_1 ⋯ a_n with a_i ∈ Σ, let w^{m + i/n} = w^m · a_1 ⋯ a_i for all m ≥ 0 and all i with 0 ≤ i ≤ n. A v ∈ Σ^* is a factor of w if there exist u_1, u_2 ∈ Σ^* with w = u_1 v u_2. If u_1 = ε, v is also a prefix of w.

We use the notions of deterministic and non-deterministic finite automata (DFA and NFA) as in [25]. If an NFA can have ε-transitions, we call it an ε-NFA. Given a class 𝒞 of language description mechanisms (e.g., a class of automata or regular expressions), we use ℒ(𝒞) to denote the class of all languages L(C) with C ∈ 𝒞. The membership problem for 𝒞 is defined as follows: Given a C ∈ 𝒞 and a w ∈ Σ^*, is w ∈ L(C)?

Regex

Definition 1 (Syntax of regex). We define RX, the set of regex over Σ and Ξ, recursively:
Terminals and ε: a ∈ RX and var(a) = ∅ for every a ∈ (Σ ∪ {ε}).
Variable reference: &x ∈ RX and var(&x) = {x} for every x ∈ Ξ.
Concatenation: (α · β) ∈ RX and var(α · β) = var(α) ∪ var(β) if α, β ∈ RX.
Disjunction: (α ∨ β) ∈ RX and var(α ∨ β) = var(α) ∪ var(β) if α, β ∈ RX.
Kleene plus: (α^+) ∈ RX and var(α^+) = var(α) if α ∈ RX.
Variable binding: ⟨x: α⟩ ∈ RX and var(⟨x: α⟩) = var(α) ∪ {x} if α ∈ RX with x ∈ Ξ \ var(α).
In addition, we allow ∅ as a regex (with var(∅) = ∅), but we do not allow ∅ to occur in any other regex. An α ∈ RX with var(α) = ∅ is called a proper regular expression, or just regular expression.
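The recursive syntax of Definition 1 can be mirrored by a small abstract syntax tree. The following Python code is only a minimal sketch under stated assumptions: the class names (`Term`, `Ref`, `Concat`, `Or`, `Plus`, `Bind`) and the helper `is_regular_expression` are hypothetical and do not appear in the paper, ε is encoded as the empty string, and the special regex ∅ is not modeled. The function `var` follows the six cases of the definition and rejects a variable binding of x whose body already mentions x.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Term:
    """A terminal a in Sigma, or epsilon (encoded here as the empty string)."""
    symbol: str


@dataclass(frozen=True)
class Ref:
    """A variable reference &x."""
    name: str


@dataclass(frozen=True)
class Concat:
    """Concatenation (alpha . beta)."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Or:
    """Disjunction (alpha v beta)."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Plus:
    """Kleene plus (alpha+)."""
    body: "Regex"


@dataclass(frozen=True)
class Bind:
    """Variable binding of the variable `name` to the regex `body`."""
    name: str
    body: "Regex"


Regex = Union[Term, Ref, Concat, Or, Plus, Bind]


def var(r: Regex) -> FrozenSet[str]:
    """Compute var(r) case by case, as in Definition 1."""
    if isinstance(r, Term):
        return frozenset()
    if isinstance(r, Ref):
        return frozenset({r.name})
    if isinstance(r, (Concat, Or)):
        return var(r.left) | var(r.right)
    if isinstance(r, Plus):
        return var(r.body)
    if isinstance(r, Bind):
        inner = var(r.body)
        # Side condition of the binding rule: x must not occur in var(alpha).
        if r.name in inner:
            raise ValueError(f"variable {r.name!r} occurs inside its own binding")
        return inner | frozenset({r.name})
    raise TypeError(f"not a regex node: {r!r}")


def is_regular_expression(r: Regex) -> bool:
    """A regex with an empty variable set is a proper regular expression."""
    return not var(r)


# Example: bind x to a+, then read b, then reference x again.
example = Concat(Bind("x", Plus(Term("a"))), Concat(Term("b"), Ref("x")))
print(sorted(var(example)))
print(is_regular_expression(example))
```

Because `var` visits every node, calling it once on the root also checks the side condition of every binding in the expression; a separate validation pass would be an equally reasonable design.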
We use REG to denote the set of all regular expressions. We add and omit parentheses freely, as long as the meaning remains clear. We use the Kleene star α^* as shorthand for ε ∨ α^+, and A as shorthand for ⋁_{a∈A} a for non-empty A ⊆ Σ. We define the semantics of regex using the ref-words (short for reference words) by Schmid [35]. A ref-word is a word over (
doi:10.1016/j.jcss.2019.04.001