An Algorithm for the Generalized k-Keyword Proximity Problem and Finding Longest Repetitive Substring in a Set of Strings [chapter]

Inbok Lee, Sung-Ryul Kim
2006 Lecture Notes in Computer Science  
Furthermore, we show that they can be used to find longest repetitive substring with constraints in a set of strings.  ...  Therefore, we need some method to measure the relevance of the documents to the query. In this paper we propose algorithms for computing k-keyword proximity score [3] in more realistic environments.  ...  All the problems in Section 2 can be solved in Repetitive Longest Substring in a Set of Strings Now we consider finding the longest substring in a set of strings.  ... 
doi:10.1007/11758549_43 fatcat:i55xsl6vazectlxdl2uvyjr2f4

Fast error-tolerant search on very large texts

Marjan Celikik, Holger Bast
2009 Proceedings of the 2009 ACM symposium on Applied Computing - SAC '09  
We combine various ideas from the large body of literature on approximate string searching and spelling correction techniques to a new algorithm for the spelling variants clustering problem that is both  ...  This problem naturally arises in the context of error-tolerant full-text search of the following kind: For a given query, return not only documents matching the query words exactly but also those matching  ...  : for a large set of strings (words) and a given query word find all similar words from the set.  ... 
doi:10.1145/1529282.1529669 dblp:conf/sac/CelikikB09 fatcat:gksr2hzdmjhfhe5ip56jaol6ji

Applying Pattern Mining to Web Information Extraction [chapter]

Chia-Hui Chang, Shao-Chen Lui, Yen-Chin Wu
2001 Lecture Notes in Computer Science  
In this paper, we propose a novel approach to IE, based on repeated pattern mining and multiple pattern alignment. The discovery of repeated patterns is realized through a data structure called a PAT tree.  ...  Previous work in wrapper induction aims to solve this problem by applying machine learning to automatically generate extractors, for example, WIEN, Stalker, Softmealy, etc.  ...  Also, we would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code for us.  ... 
doi:10.1007/3-540-45357-1_4 fatcat:a6xesyslfrc4fju6h7r56hseje

IEPAD: information extraction based on pattern discovery

Chia-Hui Chang, Shao-Chen Lui
2001 Proceedings of the tenth international conference on World Wide Web - WWW '01  
The research in information extraction (IE) regards the generation of wrappers that can extract particular information from semistructured Web documents.  ...  Similar to compiler generation, the extractor is actually a driver program, which is accompanied with the generated extraction rule.  ...  The problem is transformed to find the multiple alignment of the k strings S = P_1, P_2, ..., P_k, so that the generalized pattern can be used to extract all records we need.  ... 
doi:10.1145/371920.372182 dblp:conf/www/ChangL01 fatcat:6zlprr7jmzc5pckbx3qf6s2xsi

Automatic information extraction from semi-structured Web pages by pattern discovery

Chia-Hui Chang, Chun-Nan Hsu, Shao-Cheng Lui
2003 Decision Support Systems  
The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way.  ...  for 10 of the sample Web data sources.  ...  Acknowledgements The research reported here was supported in part by the National Science Council of Taiwan under Grant No.90-2213-E-008-042 and in part by DeepSpot Intelligent Systems, Taiwan under Contract  ... 
doi:10.1016/s0167-9236(02)00100-8 fatcat:x6kfrckenfcbjdyssnabz3xmka

Approximate Matching of Network Expressions with Spacers

1996 Journal of Computational Biology  
The algorithm is threshold-sensitive in that its performance depends on the threshold, k, of the number of differences allowed in an approximate match.  ...  This result generalizes the O(kn) expected-time algorithm of Ukkonen for approximately matching keywords.  ...  More precisely, one can assert that a sweep of the algorithm scans no more than Σ_k (set_k + mot_k + Δ_k) symbols, where mot_k is the length of the longest word matched by M_k, and set_k = max{ j: j∈  ... 
doi:10.1089/cmb.1996.3.33 pmid:8697238 fatcat:pkxkdfpy2jgghcbwkxo74rrirq

Inverted indexes for phrases and strings

Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran
2011 Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11  
In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching.  ...  In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size.  ...  The main idea of the algorithm is to maintain a list of candidate top-k documents in the set S_doc, and refine the candidate set by moving documents to the set S_ans from time to time.  ... 
doi:10.1145/2009916.2009992 dblp:conf/sigir/PatilTSHVC11 fatcat:wufdcy5mkvcgzm3p2tmrxhcrim

Reconfigurable Web wrapper agents for biological information integration

Chun-Nan Hsu, Chia-Hui Chang, Chang-Huain Hsieh, Jiann-Jyh Lu, Chien-Chi Chang
2005 Journal of the American Society for Information Science and Technology  
A variety of biological data is transferred and exchanged in overwhelming volumes on the World Wide Web.  ...  We define an XML-based language called WNDL, which provides a representation of a Web browsing session. A WNDL script describes how to locate the data, extract the data and combine the data.  ...  Acknowledgements We acknowledge the contribution of the members and alumni of AIIA Lab at the Institute of Information Science, Academia Sinica: Hung-Hsuan Huang, Siek Harianto and Elan Hung for implementing  ... 
doi:10.1002/asi.20139 fatcat:x7kfwwfbffhrtn7vtgc5d6d3fi

Polyphonic music retrieval

Shyamala Doraisamy
2005 SIGIR Forum  
For the retrieval of these 'overlaying' musical words, i.e., when more than one word can assume the same within-document position, a new proximity-based operator and a ranking function are proposed.  ...  In exploiting the time-dependent element of polyphonic music data, a method to index adjacent and concurrent musical words using a 'polyphonic musical word indexer' is proposed.  ...  In that case, k_1 and b would be set in the same way as they are set for a document. l_Q is usually set to 3 for title queries.  ... 
doi:10.1145/1067268.1067289 fatcat:6kh6aflmnjgvlnr2zwtpn32oxi

Unsupervised named-entity extraction from the Web: An experimental study

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates
2005 Artificial Intelligence  
List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list.  ...  The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable  ...  Acknowledgements This research was supported in part by NSF grants IIS-0312988 and IIS-0307906, DARPA contract NBCHD030010, ONR grants N00014-02-1-0324 and N00014-02-1-  ... 
doi:10.1016/j.artint.2005.03.001 fatcat:qmdekfdvjnf53lbpjmdno2k7uy

A Piggyback System for Joint Entity Mention Detection and Linking in Web Queries

Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Stefan Rüd, Hinrich Schütze
2016 Proceedings of the 25th International Conference on World Wide Web - WWW '16  
The key algorithmic idea underlying SMAPH-2 is to first discover a candidate set of entities and then link-back those entities to their mentions occurring in the input query.  ...  We also publish GERDAQ (General Entity Recognition, Disambiguation and Annotation in Queries), a novel, public dataset built specifically for web-query entity linking via a crowdsourcing effort.  ...  string x. • MinED(a, b) is an asymmetric measure of distance of string a towards string b, defined as follows: let a_t and b_t be the set of terms in strings a and b, then MinED(a, b) = avg t_a ∈ a_t  ... 
doi:10.1145/2872427.2883061 dblp:conf/www/CornoltiFCRS16 fatcat:llvnk3ivzrbjpbftuapbrwlzyi

Detecting visually similar Web pages

Teh-Chung Chen, Scott Dick, James Miller
2010 ACM Transactions on Internet Technology  
Once it cannot find a longer substring, the index (in the dictionary) for that specific string matched part is sent to output and this new string (including the last character) is added into the dictionary  ...  The only usage of a two-dimensional compressor in the NCD we could find was an image co-registration algorithm that compared JPEG and bzip2 [62], and this also examined sets of monochrome images (the  ...  Phishing mechanism is an absolute necessity in the trend of the Anti-Phishing tool development.  ... 
doi:10.1145/1754393.1754394 fatcat:3fsxye6xzrgvnd76ria3lozdze

The Book Review Column

William Gasarch
2000 ACM SIGACT News  
In this column we review the following books.  ...  Chapter 6 covers the topic of Union-Find and other structures for the partitions of a set of data elements.  ...  Chapter 7 The pattern matching problem is that of locating all occurrences of a pattern string in a larger string called the text.  ... 
doi:10.1145/348210.1042071 fatcat:3fnw5huw6vguhbdzac7sqzsqxi

Intrusion-Detection Systems [chapter]

Peng Ning, Sushil Jajodia
2012 Handbook of Computer Networks  
Series on ADVANCES IN INFORMATION SECURITY are, one, to establish the state of the art of, and set the course for future research in information security and, two, to serve as a central reference source  ...  and software assurance. of maturity to warrant a comprehensive textbook treatment. ideas for books under this series.  ...  Acknowledgment This work has been partially funded by the Ministero dell'Università e della Ricerca (MiUR) in the framework of the RECIPE Project, and by the EU as part of the IST Programme -within the  ... 
doi:10.1002/9781118256107.ch26 fatcat:aeidzkegvfc27dqqmztiayv3dm

A Survey on Honeypot Software and Data Analysis [article]

Marcin Nawrocki, Matthias Wählisch, Thomas C. Schmidt, Christian Keil, Jochen Schönfelder
2016 arXiv   pre-print
In this survey, we give an extensive overview of honeypots. This includes not only honeypot software but also methodologies to analyse honeypot data.  ...  Using suffix trees, the longest common substring of two strings is straightforward to find in linear time. Suffix trees can be generated, for example, with Ukkonen's algorithm [173].  ...  Those instances are collected centrally and a signature generation algorithm is initiated: First a substring extraction process takes place which creates the set of all possible substrings and determines  ... 
arXiv:1608.06249v1 fatcat:nlv2qdnmmvhxlmsfkyszl3owxq