Indexing weighted-sequences in large databases

H. Wang, C.-S. Perng, W. Fan, S. Park, P.S. Yu
Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405)  
We present an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure where each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence in that each event has a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed in this paper enables us to efficiently retrieve from the database all subsequences, possibly non-contiguous, that match a given query sequence both by events and by weights. The index method also takes into consideration the nonuniform frequency distribution of events in the sequence data. In addition, our method finds a broad range of applications in indexing scientific data consisting of multiple numerical columns for discovery of correlations among these columns. For instance, indexing a DNA micro-array that records expression levels of genes under different conditions enables us to search for genes whose responses to various experimental perturbations follow a given pattern. We demonstrate, using real-world data sets, that our method is effective and efficient.

[...]coDCDLinkUp, under the constraint that the interval between the first two events is about [...] seconds, and the interval between the first and third events is about [...] seconds. Answering such queries efficiently is important to understanding temporal causal relationships among events, which often provide actionable insights for determining problems in system management. A query can involve any number of events, and each event has an approximate weight, which, in this case, is the elapsed time between the occurrence of the event and [...]

Example 2. Query-by-pattern in DNA Micro-arrays. Find all genes whose expression level in sample CH1I is about [...] units higher than that in CH2B, [...]
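To make the query semantics concrete, the following is a minimal sketch of a brute-force scan that finds non-contiguous subsequences matching a query both by event name and by approximate elapsed time since the first matched event. It is not the index structure the paper proposes; it only illustrates what such a query returns. The function name `match_weighted_subsequences`, the `tolerance` parameter, and the event names `A`, `B`, `C` are assumptions made for illustration.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (event name, timestamp in seconds); hypothetical layout


def match_weighted_subsequences(
    sequence: List[Event],
    query: List[Event],
    tolerance: float,
) -> List[Tuple[int, ...]]:
    """Naive scan, NOT the paper's index.

    `sequence` is assumed to be sorted by timestamp. `query` lists
    (event name, offset from the first query event); the first offset
    is expected to be 0. Returns tuples of positions in `sequence`.
    """
    results: List[Tuple[int, ...]] = []

    def extend(q_pos: int, start: int, chosen: List[int], base: float) -> None:
        if q_pos == len(query):
            results.append(tuple(chosen))
            return
        name, offset = query[q_pos]
        for i in range(start, len(sequence)):
            ev_name, ts = sequence[i]
            if ev_name != name:
                continue
            if q_pos == 0:
                # The first matched event fixes the time origin.
                extend(1, i + 1, [i], ts)
            elif abs((ts - base) - offset) <= tolerance:
                # Elapsed time since the first event is close enough.
                extend(q_pos + 1, i + 1, chosen + [i], base)

    extend(0, 0, [], 0.0)
    return results


if __name__ == "__main__":
    seq = [("A", 0.0), ("B", 5.0), ("A", 7.0), ("B", 12.0), ("C", 20.0), ("C", 27.0)]
    # "A, then B about 5 s later, then C about 20 s after A"
    print(match_weighted_subsequences(seq, [("A", 0.0), ("B", 5.0), ("C", 20.0)], 1.0))
```

A scan like this revisits the data for every query and grows expensive on long sequences, which is the cost an index over weighted-sequences is meant to avoid.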
doi:10.1109/icde.2003.1260782 dblp:conf/icde/WangPFPY03 fatcat:kso3ntuyszfhngvwi74lcv3otq