75,963 Hits in 7.3 sec

Data Fusion for Japanese Term and Character N-gram Search

Michiko Yasukawa, J. Shane Culpepper, Falk Scholer
2015 Proceedings of the 20th Australasian Document Computing Symposium - ADCS '15  
In this study, we explore data fusion techniques to answer the following question: if there are multiple ranked lists of documents from both word and n-gram indexes, can we improve overall effectiveness  ...  The alternative approach to indexing a segmented collection is n-gram search, where every n-length sequence of symbols is indexed.  ...  The minimum and maximum length of index terms is 1 and 20 for both cases. The number of unique terms in the 1-gram, 2-gram and 3-gram indexes for NTCIR7 are 5,431, 860,181 and 8,225,898.  ... 
doi:10.1145/2838931.2838939 dblp:conf/adcs/YasukawaCS15 fatcat:a3bdwzns2rh45brzojnzvoiysu

RICOH at NTCIR-2

Yasushi Ogawa, Hiroko Mano
2001 NTCIR Conference on Evaluation of Information Access Technologies  
The system features (1) hybrid retrieval using a combination of n-gram indexing and word-based document ranking (2) word-based and n-gram-based query expansion (3) a modified version of the Okapi's probabilistic  ...  Of the eight runs, four runs use the title field only and the other four use the description field only.  ...  We tried both word-based expansion and n-gram-based expansion.  ... 
dblp:conf/ntcir/OgawaM01 fatcat:hxk4sdggizhalc5zuhykhv5auu

Character n-Gram Spotting in Document Images

M. Sudha Praveen, K. Pramod Sankar, C. V. Jawahar
2011 2011 International Conference on Document Analysis and Recognition  
In the retrieval phase, the query word is expanded to its constituent n-grams, which are used to query the previously built index.  ...  The character n-grams are represented in a visual-feature space and indexed for quick retrieval.  ...  Thus, character n-gram spotting encompasses both OCR and word-spotting approaches, and augments them by evidence from matching n-grams.  ... 
doi:10.1109/icdar.2011.191 dblp:conf/icdar/PraveenSJ11 fatcat:h546noqntzbwpl3fjncsoierym

Experiments in the Retrieval of Unsegmented Japanese Text at the NTCIR-2 Workshop

Paul McNamee
2001 NTCIR Conference on Evaluation of Information Access Technologies  
Our work with the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system has made use of overlapping character n-grams in the indexing and retrieval of text.  ...  We found that 6-grams performed comparably with English words and that 2-grams and 3-grams perform equally well in Japanese text.  ...  Routinely, overlapping character n-grams and simple words are used as indexing terms.  ... 
dblp:conf/ntcir/McNamee01 fatcat:c6qaboh64ncdfgc7phhvh4v3dm

Effective Translation, Tokenization and Combination for Cross-Lingual Retrieval [chapter]

Jaap Kamps, Sisay Fissaha Adafre, Maarten de Rijke
2005 Lecture Notes in Computer Science  
Second, effective combination methods allow us to combine the best of different strategies.  ...  Currently at Archives and Information Studies, Faculty of Humanities, University of Amsterdam.  ...  For Finnish, Split+stem indicates that compounds are split, where we stem the words and compound parts. n-Grams: both topic and document words are n-grammed, using the settings discussed in Section 2.  ... 
doi:10.1007/11519645_12 fatcat:ufe26nid4fg6fax52qim2fpc6a

The HAIRCUT System at TREC-9

Paul McNamee, James Mayfield, Christine D. Piatko
2000 Text Retrieval Conference  
The result is a stream of blank-separated words. When using n-grams we construct indexing terms from the same sequence of words.  ...  We did not segment the text, and instead elected to index the documents using both 2- and 3-grams.  ...  We remain open to the possibility that other techniques may be better still, for example, using both 2-grams and 3-grams, or 2-grams and segmented words.  ... 
dblp:conf/trec/McNameeMP00 fatcat:mkpayhdhknhgjejbq7ljkiv5tq

CoLesIR at CLEF 2007: from English to French via Character N-Grams

Jesús Vilares, Michael P. Oakes, Manuel Vilares Ferro
2007 Conference and Labs of the Evaluation Forum  
As in their original proposal, our work is based on the direct translation of character n-grams, avoiding in this way the need for word normalization during indexing or translation, and also dealing with  ...  Nevertheless, in contrast with the original approach, our proposal is much faster and transparent, making extensive use of freely available resources.  ...  Acknowledgments This research has been partially funded by the European Union (FP6-045389), Ministerio de Educación y Ciencia and FEDER (TIN2004-07246-C03 and HUM2007-66607-C04), and Xunta de Galicia (  ... 
dblp:conf/clef/VilaresOF07 fatcat:4y3hbk6iurbfng3czg5itspxie

Combination Approaches in Korean Information Retrieval: Words vs. n-grams, and Query Translation vs. Document Translation

In-Su Kang, Seung-Hoon Na, Jong-Hyeok Lee
2006 International Journal of Computer Processing Of Languages  
In combining words and n-grams, we concentrate on generating several ranked lists showing different retrieval characteristics on word and n-gram indexes by incorporating feedback schemes.  ...  For monolingual information retrieval, we use a combination strategy that integrates words and n-grams at the ranked list level.  ...  For example, both words and n-grams are collected from documents to create a single index.  ... 
doi:10.1142/s0219427906001463 fatcat:5h5ql5e6xbdw5ax7oukc4fy5g4

Ternary Tree Optimalization for n-gram Indexing

Daniel Robenek, Jan Platos, Václav Snásel
2014 Databases, Texts, Specifications, Objects  
N-gram indexing is used in many practical applications. Spam detection, plagiarism detection or comparison of DNA reads.  ...  Efficiency of ternary forest is tested and compared to ternary search tree and two-level indexing ternary search tree.  ...  The stored root index of n-gram tree is used to found node with index 3. Search is done again in the word tree with index 2 and the last node in the n-gram tree is found.  ... 
dblp:conf/dateso/RobenekPS14 fatcat:fqxmdkga3fetlp652bufdqrl3e

Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files

Justin Zobel, Alistair Moffat, Ron Sacks-Davis
1993 Very Large Data Bases Conference  
In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term.  ...  The pattern search method is based on text indexing techniques and is a successful adaptation of inverted files to main memory databases.  ...  Acknowledgments We would like to thank Abe Bookstein, Andrew ~IIIIII~, Alan Kent, and Ihmi Klrin for their advice and helpful discussion.  ... 
dblp:conf/vldb/ZobelMS93 fatcat:efaevb6opfaufmb3drzhhw335q

Scalable Multilingual Information Access [chapter]

Paul McNamee, James Mayfield
2003 Lecture Notes in Computer Science  
In particular, we investigate the use of character n-grams for monolingual retrieval, pre-translation expansion as a technique to mitigate errors due to limited translation resources, and translation of  ...  The third Cross-Language Evaluation Forum workshop (CLEF-2002) provides the unprecedented opportunity to evaluate retrieval in eight different languages using a uniform set of topics and assessment methodology  ...  Methodology For the monolingual tasks we created sixteen indexes, a word and an n-gram (n=6) index for each of the eight languages.  ... 
doi:10.1007/978-3-540-45237-9_17 fatcat:6wihlivr3jf4zkcx4u3puphjey

Comparison of Word and Subword Indexing Techniques for Mandarin Chinese Spoken Document Retrieval [chapter]

Hsin-min Wang, Berlin Chen
2001 Lecture Notes in Computer Science  
In this paper, we investigate the use of words and subwords (including both characters and syllables) in audio indexing for Mandarin Chinese spoken document retrieval.  ...  Two retrieval approaches, including the well-known vector space model approach and the newly proposed HMM/N-gram-based approach, are used in the present work.  ...  Subword-level Indexing Using The HMM/N-gram-based Model The retrieval results obtained when the HMM/N-gram-based retrieval approach was applied are shown in Table 3.  ... 
doi:10.1007/3-540-45453-5_78 fatcat:kzd5sdmzzbfxdc5s5ifpeztnqu

On the use of words and n-grams for Chinese information retrieval

Jian-Yun Nie, Jiangfeng Gao, Jian Zhang, Ming Zhou
2000 Proceedings of the fifth international workshop on Information retrieval with Asian languages - IRAL '00  
Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances.  ...  In this study, we carry out more experiments on different ways to segment documents and queries, and to combine words with n-grams.  ...  Instead of using words, n-grams may also be used as indexes. One may use only bi-grams.  ... 
doi:10.1145/355214.355235 dblp:conf/iral/NieGZZ00 fatcat:ln25xi4ngbdopo4iqsuvpqxf5u

Single n-gram stemming

James Mayfield, Paul McNamee
2003 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03  
We demonstrate that selection of a single n-gram as a pseudo-stem for a word can be an effective and efficient language-neutral approach for some languages.  ...  Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty.  ...  Automatic selection of n-gram length and the use of two or more n-grams for some words are other potentially fruitful directions.  ... 
doi:10.1145/860500.860528 fatcat:nyqybzoftrevpmadj3s23hd53a

Single n-gram stemming

James Mayfield, Paul McNamee
2003 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03  
We demonstrate that selection of a single n-gram as a pseudo-stem for a word can be an effective and efficient language-neutral approach for some languages.  ...  Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty.  ...  Automatic selection of n-gram length and the use of two or more n-grams for some words are other potentially fruitful directions.  ... 
doi:10.1145/860435.860528 dblp:conf/sigir/MayfieldM03 fatcat:kimj6rjgwbajhl5gk4aovy54aq