A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages
[chapter]
2010
Lecture Notes in Computer Science
We present a dictionary- and corpus-independent statistical lemmatizer, StaLe, that deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any ...
We show the performance of StaLe both in lemmatization tasks alone and as a component in an IR system using several datasets and query types in four high resource languages. ...
With an F2 score near 70% for a morphologically complex language and around 90% for simpler ones in lemmatization tasks, while also performing very well in retrieval tasks, StaLe is a light, robust ...
doi:10.1007/978-3-642-15998-5_3
fatcat:zbvev4uhzrbffm4sejcqlnpcl4
METIS-II: Low-Resource MT for German to English
2009
Journal for Language Technology and Computational Linguistics
The idea was to use 'basic' linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. ...
The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained. It emphasizes the German implementation. ...
The former method requires a synchronization of the source- and target-language resources, while for the latter, in principle, SL and TL resources may be processed and prepared independently.
SL vs. ...
dblp:journals/ldvf/Carl09
fatcat:ylp7dbqlkfe67anzwtx6brur5u
METIS-II: low resource machine translation
2008
Machine Translation
The idea was to use 'basic' linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. ...
On the basis of the results and experiences obtained, we believe that the approach is promising and offers the potential for development in various directions. ...
The former method requires a synchronization of the source- and target-language resources, while for the latter, in principle, SL and TL resources may be processed and prepared independently.
SL vs. ...
doi:10.1007/s10590-008-9048-z
fatcat:jo63wp23yreqxa25g2zys4hidm
Word normalization and decompounding in mono- and bilingual IR
2006
Information retrieval (Boston)
In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized ...
The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds ...
Acknowledgements The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts. Lingsoft plc. 1983-1992. ...
doi:10.1007/s10791-006-0884-2
fatcat:vsjslmor5vbrrixfrbu2obphaq
Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes
2009
Lexikos
We particularly focus our discussion on its ability to retrieve lemmas for word forms and evaluate it as a tool for corpus-based dictionary compilation. ...
Computational morphological analysis is an important first step in the automatic treatment of natural language and a useful lexicographic tool. ...
It has become unimaginable to compile a wide-coverage dictionary for a Bantu language without the use of a large language corpus and a functional corpus query package (CQP). ...
doi:10.4314/lex.v18i1.47257
fatcat:45ldlrwrefaujexsf2kcdu3g6a
Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes
2011
Lexikos
We particularly focus our discussion on its ability to retrieve lemmas for word forms and evaluate it as a tool for corpus-based dictionary compilation. ...
Computational morphological analysis is an important first step in the automatic treatment of natural language and a useful lexicographic tool. ...
It has become unimaginable to compile a wide-coverage dictionary for a Bantu language without the use of a large language corpus and a functional corpus query package (CQP). ...
doi:10.5788/18-0-488
fatcat:zwzjta2pnfbprhgzvdqft3wej4
Design and Development of Unsupervised Stemmer for Sindhi Language
2020
Procedia Computer Science
Results are compared with an existing rule-based stemmer [32] and lemmatizer [33]; 1000 words are extracted from a Sindhi dictionary for evaluation. ...
Related work in low-resource languages: Saharia, N. et al. [43] focus on stemming of resource-poor Eastern Indian languages such as Bodo, Manipuri, and Assamese. ...
doi:10.1016/j.procs.2020.03.212
fatcat:bs2mggcwh5bz3oeha25lehmu7u
An evaluation of conflation accuracy using finite‐state transducers
2006
Journal of Documentation
The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. ...
Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. ...
Essentially, three analytical resources are needed (a dictionary of canonical forms, a dictionary of inflectional forms, and a dictionary of frozen expressions and compound lemmas) to recognize and group ...
doi:10.1108/00220410610666493
fatcat:rcf2r7vxqbbvlcuyvuscy2wopq
Cross-Lingual Text Categorization
[chapter]
2003
Lecture Notes in Computer Science
We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is available for each new language and in the case ...
Experimental results of the bi-lingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bi-lingual training, terminology translation and profile-based translation ...
In a practical classification system, the above techniques can be combined, by using terminology translation or profile-based translation to generate examples for poly-lingual training and then bootstrap ...
doi:10.1007/978-3-540-45175-4_13
fatcat:pzwxoszblfda5d6abnpe4if6eu
D4.1 Business Pilot Specification
2020
Zenodo
Business Pilot Specification for Prêt-à-LLOD project ...
Use Cases/User Stories. User stories for domain-specific lemmatization:
• Semantic Web Company uses lemmatization and a large corpus for a specific language and provides a base lemmatization model for a certain language.
• User can extend the lemmatization model with a domain-specific corpus that is added to the base model for a language. ...
doi:10.5281/zenodo.5744866
fatcat:65ghyqyuirdtrfyqsdpzb4p4ju
Lexical paraphrasing for document retrieval and node identification
2003
Proceedings of the second international workshop on Paraphrasing
Lexical paraphrases are generated using syntactic, semantic and corpus-based information. Our evaluation shows that lexical paraphrasing improves retrieval performance for both applications. ...
Node identification, performed in the context of a Bayesian argumentation system, matches users' natural-language sentences to nodes in a Bayesian network. ...
Acknowledgments This research was supported in part by grants A49927212 and DP0209565 from the Australian Research Council. ...
doi:10.3115/1118984.1118997
dblp:conf/acl-iwp/ZukermanGW03
fatcat:ur5424nb2rcehjkoqyzvjeytoy
Creating a Persian-English Comparable Corpus
[chapter]
2010
Lecture Notes in Computer Science
Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. ...
In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. ...
For example, Hamshahri corpus [1] is a monolingual corpus for evaluating Persian information retrieval systems and Bijankhan corpus [3] is a Persian tagged corpus for natural language processing. ...
doi:10.1007/978-3-642-15998-5_5
fatcat:dpbwztz4z5axtimn77bqtktghy
The Application of NLTK Library for Python Natural Language Processing in Corpus Research
2021
Theory and Practice in Language Studies
In terms of the main links in corpus research, such as text cleaning, word form restoration, part of speech tagging and text retrieval statistics, this paper takes the US presidential inaugural speech ...
in the corpus as an example to show how to use this tool to process the language data, and introduces the application of Python NLTK library in corpus research. ...
Fig. 5. Lemmatization results of the first 30 words
Fig. 6. Retrieval results ...
doi:10.17507/tpls.1109.09
fatcat:3m4de2wio5a77iujlt3tuyiveu
Restricted inflectional form generation in management of morphological keyword variation
2007
Information retrieval (Boston)
Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically ...
Lemmatization is effective but often requires expensive resources. ...
In Kettunen and Airio (2006) we first sought corpus statistics of Finnish nominal word forms. Then we verified these statistics with two independent automatic analyses of larger corpora. ...
doi:10.1007/s10791-007-9030-z
fatcat:z7osdhwmabckdixrofhdl6xjpy
Monolingual Document Retrieval for European Languages
2004
Information retrieval (Boston)
Recent years have witnessed considerable advances in information retrieval for European languages other than English. ...
Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language independent techniques. ...
Acknowledgments We are extremely grateful to three anonymous referees for their extensive and insightful comments. We want to thank Carol Peters for editorial help and patience. ...
doi:10.1023/b:inrt.0000009439.19151.4c
fatcat:iagstorgsbgqbh2nnlueivckei