A Unified Framework for Frequent Sequence Mining with Subsequence Constraints

Kaustubh Beedkar, Rainer Gemulla, Wim Martens
2019 ACM Transactions on Database Systems  
Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints, including and beyond those considered in the literature, can be unified in a single framework. A unified treatment allows researchers to study jointly many types of subsequence
constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite-state transducers, which we use as the computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms, although more general, are efficient and, when used for sequence mining with prior constraints studied in the literature, competitive with (and in some cases superior to) state-of-the-art specialized methods.

textual patterns such as "PERSON is married to PERSON" are indicative of typed relations between entities and useful for natural-language processing and information extraction tasks [21, 37]. In frequent sequence mining (FSM), we model the available data as a collection of sequences composed of items such as words (text processing), products (market-basket analysis), or actions and events (session analysis). Often items are arranged in an application-specific hierarchy; e.g., is→be→VERB (for words), Canon 5D→DSLR camera→electronics (for products), or Rakesh Agrawal→scientist→PERSON (for entities). The goal of FSM is to discover subsequences or generalized subsequences that occur in sufficiently many input sequences. Since the total number of such subsequences can potentially be very large and not all frequent subsequences may be of interest to a particular application, most FSM methods make use of subsequence constraints to control the set of subsequences to be mined. A large variety of subsequence constraints has been studied in prior work [9, 10, 23, 33, 39, 40, 43, 50].
Commonly proposed constraints include gap or span constraints, where items in the subsequences need to appear "close" in the input sequence, and length constraints, where the number of items in the subsequences is bounded. In n-gram mining [12], for example, the goal is to mine frequent consecutive subsequences of exactly n words. Hierarchy constraints allow controlled generalization according to the item hierarchy to find patterns that do not directly occur in the input data. Examples include shopping patterns such as "customers frequently buy some DSLR camera, then some tripod, then some flash" or textual patterns such as "PERSON be born in LOCATION." Regular expression (RE) constraints have also been studied in the context of FSM; here, subsequences must match a given RE. A number of specialized algorithms for various combinations of the above subsequence constraints have been proposed in the literature.

In this work, we focus on the questions of (1) how to model and express subsequence constraints in a suitable way and (2) how to mine efficiently all frequent sequences that satisfy the given constraints. We show that many subsequence constraints, including and beyond the constraints mentioned above, can be unified in a single framework. A unified framework offers advantages to both researchers and practitioners. In particular, it allows researchers to study algorithms and properties of subsequence constraints in general instead of focusing on certain special cases individually. It also helps to improve usability of pattern mining systems for practitioners: they only need to familiarize themselves with one framework and, perhaps more importantly, do not need to develop customized mining algorithms for a particular subsequence constraint of interest. In fact, we propose a number of general-purpose mining algorithms that operate within our framework.
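As an illustration of the gap and length constraints described in this passage, the following is a minimal brute-force sketch in Python. It is not one of the algorithms proposed in the article: the function names (`constrained_subsequences`, `mine_frequent`), the parameters (`max_gap`, `max_len`, `min_support`), and the toy database are hypothetical and chosen only for this example. The sketch enumerates every subsequence that satisfies the constraints and counts each one at most once per input sequence.

```python
from collections import Counter
from itertools import combinations


def constrained_subsequences(seq, max_gap, max_len):
    """Return the distinct subsequences of `seq` with 1..max_len items in which
    at most `max_gap` input items are skipped between consecutive picks."""
    found = set()
    for length in range(1, max_len + 1):
        for positions in combinations(range(len(seq)), length):
            # Gap constraint: number of skipped items between neighbouring positions.
            if all(q - p - 1 <= max_gap for p, q in zip(positions, positions[1:])):
                found.add(tuple(seq[i] for i in positions))
    return found


def mine_frequent(database, min_support, max_gap, max_len):
    """Count, for every constrained subsequence, how many input sequences
    contain it, and keep those reaching `min_support`."""
    counts = Counter()
    for seq in database:
        # A set is passed in, so each input sequence contributes at most 1.
        counts.update(constrained_subsequences(seq, max_gap, max_len))
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical toy database, not taken from the article.
toy_db = [["a", "b", "c"], ["a", "c"], ["a", "d", "c"], ["b", "c"]]
consecutive = mine_frequent(toy_db, min_support=3, max_gap=0, max_len=2)
gapped = mine_frequent(toy_db, min_support=3, max_gap=1, max_len=2)
```

With `max_gap=0` only consecutive subsequences are considered (the n-gram setting, here with at most two items), and only the single items `a` and `c` are frequent. Allowing one skipped item (`max_gap=1`) additionally makes the pattern `a c` frequent. The enumeration is exponential in the sequence length, which is the kind of cost that specialized and general-purpose mining algorithms aim to avoid.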
Our experimental study (Section 7) suggests that our methods are often competitive with (and sometimes exponentially more efficient than) state-of-the-art specialized algorithms for the above-mentioned subsequence constraints.

In more detail, we introduce subsequence predicates to model subsequence constraints in a general way, and we propose a simple and intuitive pattern expression language to concisely express subsequence predicates. Our pattern expressions are based on regular expressions but, in contrast to prior work on RE-constrained FSM [40, 47], target input sequences and support capture groups and item hierarchies. Capture groups are the key ingredient for expressing most prior subsequence constraints in a unified way; see Table 1 for examples. Direct support for item hierarchies allows us both to express subsequence constraints concisely and to mine generalized subsequences in a controlled way. Some example pattern expressions as well as anecdotal results are given in Table 4.

To mine frequent sequences, we propose to use finite-state transducers (FSTs) as the underlying computational model. To the best of our knowledge, FSTs have not been studied in the context of FSM before. We propose the DESQ system, which includes two efficient mining algorithms termed DESQ-COUNT and DESQ-DFS. Both algorithms translate a given pattern expression to a succinct FST (sFST), which is simulated in a way suitable for frequent sequence mining. DESQ-COUNT is a match-and-count algorithm that aims at highly selective constraints, whereas DESQ-DFS can handle more demanding pattern expressions and is inspired by PrefixSpan [39]. Both algorithms heavily rely on efficient sFST simulation.

Footnote 1: A preliminary version of this article appeared in

We discuss various optimizations for sFST simulation, which often improve mining performance substantially. First, we show how sFSTs can be partially determinized and minimized.
Second, we discuss methods that allow us to abort sFST simulation early whenever possible without affecting correctness. Third, we propose a pruning method that enables us to quickly prune irrelevant input sequences, i.e., input sequences that cannot affect the mining results. Finally, we propose a two-pass approach to sFST simulation that additionally avoids unnecessary backtracking and show that the two-pass approach can be exponentially more efficient than the one-pass approach for certain pattern expressions.

We conducted an experimental study on multiple real-world datasets to investigate the expressiveness of our pattern expression language, the efficiency of our mining algorithms, and the effectiveness of our proposed optimizations. We found that our pattern expressions are sufficiently powerful to express many subsequence constraints that arise in sequence mining applications. Our algorithms were generally efficient and, when used for pattern expressions that express prior subsequence constraints, competitive with, and sometimes more efficient than, state-of-the-art specialized methods. Our sFST optimizations were effective and significantly improved performance of our mining algorithms. Our results suggest that DESQ is an efficient general-purpose FSM framework for a wide range of sequence mining tasks.

The remainder of this article is organized as follows. In Section 2, we summarize basic concepts for FSM and establish the notation used throughout this work. In Section 3, we introduce subsequence predicates and formally define the problem of frequent sequence mining with general subsequence constraints. In Section 4, we propose our pattern expression language and finite-state transducers as the underlying computational model. Based on these transducers, we derive algorithms for frequent sequence mining in Section 5. In Section 6, we propose various optimizations for efficiently simulating finite-state transducers.
Section 7 reports on our experimental study and its results. Section 8 discusses additional related work, and Section 9 concludes the article.

PRELIMINARIES

Sequence Databases. A sequence database is a set of sequences, denoted $D = \{T_1, T_2, \ldots, T_{|D|}\}$. Each sequence $T = t_1 t_2 \ldots t_{|T|}$ is an ordered list of items from a finite set $\Sigma = \{w_1, w_2, \ldots, w_{|\Sigma|}\}$ that we call the vocabulary. We refer to $T$ as a sequence over $\Sigma$. We denote by $\varepsilon$ the empty sequence, by $|T|$ the length of sequence $T$, and by $\Sigma^*$ (resp., $\Sigma^+$) the set of all (resp., all non-empty) sequences that can be constructed from items in $\Sigma$. Figure 1(a) shows an example sequence database $D_{\mathrm{ex}}$ consisting of six sequences over $\Sigma = \{A, a_1, a_2, B, b_1, b_2, b_{11}, b_{12}, c, d, e\}$.

Item Hierarchy. The items in $\Sigma$ are arranged in an item hierarchy, which expresses how items can be generalized (or that they cannot be generalized). Figure 1(b) shows an example hierarchy in which, for example, item $a_1$ generalizes to item $A$. In general, we say that an item $u$ directly generalizes to an item $v$, denoted $u \Rightarrow v$, if $u$ is a child of $v$ in the hierarchy. We further denote by $\Rightarrow^*$ the reflexive transitive closure of $\Rightarrow$. For the example of Figure 1(b), we have $b_{11} \Rightarrow b_1$, $b_1 \Rightarrow B$,

Footnote 3: The restriction to sets is for expository reasons. In practice, sequence databases are more accurately abstracted as multisets, but we chose sets to make our definitions clearer. It is not difficult to generalize our approach from sets to multisets and, in fact, our implementation uses multisets.

Footnote 4: A more general variant of this setting is often considered in the literature, in which sequences are formed of itemsets rather than individual items. In this article, we focus on the special case of sequences composed of individual items (e.g., textual data, user sessions, event logs, protein sequences, etc.).
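The item hierarchy and the closure $\Rightarrow^*$ defined above can be made concrete with a small Python sketch. Only the edges $a_1 \Rightarrow A$, $b_{11} \Rightarrow b_1$, and $b_1 \Rightarrow B$ are stated in the text; the remaining edges, the single-parent representation, the toy database, and the notion of "generalized occurrence" used in `occurs_generalized` are assumptions made for illustration and may differ from the article's formal definitions.

```python
# Child -> parent map for the direct generalization relation u => v.
PARENT = {
    "a1": "A", "b11": "b1", "b1": "B",   # edges stated in the text
    "a2": "A", "b12": "b1", "b2": "B",   # ASSUMED edges, not confirmed by the text
}


def generalizations(item):
    """All v with item =>* v (reflexive transitive closure), most specific first.
    Assumes every item has at most one parent."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalizes_to(u, v):
    """True if u =>* v."""
    return v in generalizations(u)


def occurs_generalized(pattern, seq):
    """True if the items of `pattern` can be matched in order (gaps allowed)
    by items of `seq` that generalize to them."""
    remaining = iter(seq)
    # Each `any` consumes input items up to and including the first match.
    return all(any(generalizes_to(t, p) for t in remaining) for p in pattern)


def support(pattern, database):
    """Number of input sequences in which `pattern` occurs."""
    return sum(occurs_generalized(pattern, seq) for seq in database)


# Hypothetical database; this is NOT the D_ex of Figure 1(a).
toy_db = [["a1", "c", "b11"], ["a2", "b2"], ["c", "b12", "a1"]]
```

In this toy database, the generalized pattern `["A", "B"]` occurs in the first two sequences although neither `A` nor `B` appears literally in the data, which is the kind of pattern that hierarchy-aware mining is meant to surface.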
doi:10.1145/3321486 fatcat:qefxp3kaq5fohcupid36aqczam