A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages [chapter]

Aki Loponen, Kalervo Järvelin
2010 Lecture Notes in Computer Science  
We present a dictionary- and corpus-independent statistical lemmatizer StaLe that deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any  ...  We show the performance of StaLe both in lemmatization tasks alone and as a component in an IR system using several datasets and query types in four high resource languages.  ...  With F2 score level near 70% for a morphologically complex language and around 90% for simpler ones in lemmatization tasks while also performing very well in retrieval tasks, StaLe is a light, robust  ... 
doi:10.1007/978-3-642-15998-5_3 fatcat:zbvev4uhzrbffm4sejcqlnpcl4

METIS-II: Low-Resource MT for German to English

Michael Carl
2009 Journal for Language Technology and Computational Linguistics  
The idea was to use 'basic' linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus.  ...  The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained. It emphasizes the German implementation.  ...  The former method requires a synchronization of the source- and target-language resources, while for the latter, in principle, SL and TL resources may be processed and prepared independently. SL vs.  ... 
dblp:journals/ldvf/Carl09 fatcat:ylp7dbqlkfe67anzwtx6brur5u

METIS-II: low resource machine translation

Michael Carl, Maite Melero, Toni Badia, Vincent Vandeghinste, Peter Dirix, Ineke Schuurman, Stella Markantonatou, Sokratis Sofianopoulos, Marina Vassiliou, Olga Yannoutsou
2008 Machine Translation  
The idea was to use 'basic' linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus.  ...  On the basis of the results and experiences obtained, we believe that the approach is promising and offers the potential for development in various directions.  ...  The former method requires a synchronization of the source- and target-language resources, while for the latter, in principle, SL and TL resources may be processed and prepared independently. SL vs.  ... 
doi:10.1007/s10590-008-9048-z fatcat:jo63wp23yreqxa25g2zys4hidm

Word normalization and decompounding in mono- and bilingual IR

Eija Airio
2006 Information retrieval (Boston)  
In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized  ...  The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds  ...  Acknowledgements The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts. Lingsoft plc. 1983-1992.  ... 
doi:10.1007/s10791-006-0884-2 fatcat:vsjslmor5vbrrixfrbu2obphaq

Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes

G De Pauw, G-M De Schryver
2009 Lexikos  
We particularly focus our discussion on its ability to retrieve lemmas for word forms and evaluate it as a tool for corpus-based dictionary compilation.  ...  Computational morphological analysis is an important first step in the automatic treatment of natural language and a useful lexicographic tool.  ...  It has become unimaginable to compile a wide-coverage dictionary for a Bantu language without the use of a large language corpus and a functional corpus query package (CQP).  ... 
doi:10.4314/lex.v18i1.47257 fatcat:45ldlrwrefaujexsf2kcdu3g6a

Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes

Guy De Pauw, Gilles-Maurice De Schryver
2011 Lexikos  
We particularly focus our discussion on its ability to retrieve lemmas for word forms and evaluate it as a tool for corpus-based dictionary compilation.  ...  Computational morphological analysis is an important first step in the automatic treatment of natural language and a useful lexicographic tool.  ...  It has become unimaginable to compile a wide-coverage dictionary for a Bantu language without the use of a large language corpus and a functional corpus query package (CQP).  ... 
doi:10.5788/18-0-488 fatcat:zwzjta2pnfbprhgzvdqft3wej4

Design and Development of Unsupervised Stemmer for Sindhi Language

Bharti Nathani, Nisheeth Joshi, G.N. Purohit
2020 Procedia Computer Science  
Results are compared with existing rule-based stemmer [32] and lemmatizer [33]; 1000 words are extracted from Sindhi Dictionary for evaluation.  ...  Related work in low resource language Saharia, N. et al. [43], in this research work focuses on stemming of resource poor Eastern Indian languages such as Bodo, Manipuri and Assamese.  ... 
doi:10.1016/j.procs.2020.03.212 fatcat:bs2mggcwh5bz3oeha25lehmu7u

An evaluation of conflation accuracy using finite‐state transducers

Carmen Galvez, Félix de Moya‐Anegón
2006 Journal of Documentation  
The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms.  ...  Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval.  ...  Essentially, three analytical resources are needed - a dictionary of canonical forms, a dictionary of inflectional forms, and a dictionary of frozen expressions and compound lemmas - to recognize and group  ... 
doi:10.1108/00220410610666493 fatcat:rcf2r7vxqbbvlcuyvuscy2wopq

Cross-Lingual Text Categorization [chapter]

Nuria Bel, Cornelis H. A. Koster, Marta Villegas
2003 Lecture Notes in Computer Science  
We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is available for each new language and in the case  ...  Experimental results of the bi-lingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bi-lingual training, terminology translation and profile-based translation  ...  In a practical classification system, the above techniques can be combined, by using terminology translation or profile-based translation to generate examples for poly-lingual training and then bootstrap  ... 
doi:10.1007/978-3-540-45175-4_13 fatcat:pzwxoszblfda5d6abnpe4if6eu

D4.1 Business Pilot Specification

Christian Blaschke, Maria Khvalchik, Artem Revenok, Guilherme Rodrigues, Roser Saurí, Meritxell González, Khalil Ahmed, Eva Theodoridou, Deirdre Lee, Katharine Cooney, Mario Romera, Matthias Orlikowski (+5 others)
2020 Zenodo  
Business Pilot Specification for Prêt-à-LLOD project  ...  Use Cases/User Stories User stories for domain specific lemmatization: • Semantic Web Company uses lemmatization and large corpus for a specific language and provides a base lemmatization model for a certain  ...  language. • User can extend lemmatization model with a domain specific corpus that is added to the base model for a language.  ... 
doi:10.5281/zenodo.5744866 fatcat:65ghyqyuirdtrfyqsdpzb4p4ju

Lexical paraphrasing for document retrieval and node identification

Ingrid Zukerman, Sarah George, Yingying Wen
2003 Proceedings of the second international workshop on Paraphrasing -  
Lexical paraphrases are generated using syntactic, semantic and corpus-based information. Our evaluation shows that lexical paraphrasing improves retrieval performance for both applications.  ...  Node identification - performed in the context of a Bayesian argumentation system - matches users' Natural Language sentences to nodes in a Bayesian network.  ...  Acknowledgments This research was supported in part by grants A49927212 and DP0209565 from the Australian Research Council.  ... 
doi:10.3115/1118984.1118997 dblp:conf/acl-iwp/ZukermanGW03 fatcat:ur5424nb2rcehjkoqyzvjeytoy

Creating a Persian-English Comparable Corpus [chapter]

Homa Baradaran Hashemi, Azadeh Shakery, Heshaam Faili
2010 Lecture Notes in Computer Science  
Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs.  ...  In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian.  ...  For example, Hamshahri corpus [1] is a monolingual corpus for evaluating Persian information retrieval systems and Bijankhan corpus [3] is a Persian tagged corpus for natural language processing.  ... 
doi:10.1007/978-3-642-15998-5_5 fatcat:dpbwztz4z5axtimn77bqtktghy

The Application of NLTK Library for Python Natural Language Processing in Corpus Research

Meng Wang, Fanghui Hu
2021 Theory and Practice in Language Studies  
In terms of the main links in corpus research, such as text cleaning, word form restoration, part of speech tagging and text retrieval statistics, this paper takes the US presidential inaugural speech  ...  in the corpus as an example to show how to use this tool to process the language data, and introduces the application of Python NLTK library in corpus research.  ...  of speech tagging 1044 THEORY AND PRACTICE IN LANGUAGE STUDIES Fig 5 . 5 Lemmatization results of the first 30 words Fig 6 . 6 Retrieval results Figure 7 . 7 .  ... 
doi:10.17507/tpls.1109.09 fatcat:3m4de2wio5a77iujlt3tuyiveu

Restricted inflectional form generation in management of morphological keyword variation

Kimmo Kettunen, Eija Airio, Kalervo Järvelin
2007 Information retrieval (Boston)  
Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically  ...  Lemmatization is effective but often requires expensive resources.  ...  In Kettunen and Airio (2006) we first sought for corpus statistics of Finnish nominal word forms. Then we verified these statistics with two independent automatic analyses of larger corpuses.  ... 
doi:10.1007/s10791-007-9030-z fatcat:z7osdhwmabckdixrofhdl6xjpy

Monolingual Document Retrieval for European Languages

Vera Hollink, Jaap Kamps, Christof Monz, Maarten de Rijke
2004 Information retrieval (Boston)  
Recent years have witnessed considerable advances in information retrieval for European languages other than English.  ...  Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language independent techniques.  ...  Acknowledgments We are extremely grateful to three anonymous referees for their extensive and insightful comments. We want to thank Carol Peters for editorial help and patience.  ... 
doi:10.1023/b:inrt.0000009439.19151.4c fatcat:iagstorgsbgqbh2nnlueivckei