The Wikipedia XML corpus

Ludovic Denoyer, Patrick Gallinari
2006 SIGIR Forum  
, Spanish, Chinese, Arabian and Japanese.  ...  ) main-english English 659,388 ≈ 4,600 20060130 french French 110,838 ≈ 730 20060123 german German 305,099 ≈ 2,079 20060227 dutch Dutch 125,004 ≈ 607 20060130 spanish Spanish 79,236 ≈ 504 20060303 chinese  ...  Conclusion This technical report describes XML collections based on Wikipedia and developed for Structured Information Retrieval, Structured Machine Learning and Natural Language processing.  ... 
doi:10.1145/1147197.1147210 fatcat:yawgcuzx6rgl5csrav57ldosle

WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia [chapter]

Dong Nguyen, Arnold Overwijk, Claudia Hauff, Dolf R. B. Trieschnigg, Djoerd Hiemstra, Franciska de Jong
2009 Lecture Notes in Computer Science  
This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations.  ...  WikiTranslate is evaluated by searching with topics in Dutch, French and Spanish in an English data collection. The systems achieved a performance of 67% compared to the monolingual baseline.  ...  Acknowledgements This paper is based on research partly funded by IST project MESH and by bsik program MultimediaN  ... 
doi:10.1007/978-3-642-04447-2_6 fatcat:7u6z45uywffdxakv7whsumxlaa

POLYGLOT-NER: Massive Multilingual Named Entity Recognition [article]

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steven Skiena
2014 arXiv   pre-print
We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase.  ...  Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes.  ...  We trained new word embeddings with extended vocabulary (300K words) using English, Spanish and Dutch Wikipedia.  ... 
arXiv:1410.3791v1 fatcat:kqkxgidkgzf2lp4iuh4twuicnm

Analysing Entity Context in Multilingual Wikipedia to Support Entity-Centric Retrieval Applications [chapter]

Yiwei Zhou, Elena Demidova, Alexandra I. Cristea
2015 Lecture Notes in Computer Science  
A systematic analysis of language-specific entity contexts can provide a better overview of the existing aspects and support entity-centric retrieval applications over multilingual Web data.  ...  Furthermore, we analyse the similarities and the differences in these contexts in a case study including 80 entities and five Wikipedia language editions.  ...  This work was partially funded by the COST Action IC1302 (KEYSTONE) and the European Research Council under ALEXANDRIA (ERC 339233).  ... 
doi:10.1007/978-3-319-27932-9_17 fatcat:vshzpjqpmvdalmgmscjhgrjsii

Token Level Code-Switching Detection Using Wikipedia as a Lexical Resource [chapter]

Daniel Claeser, Dennis Felske, Samantha Kent
2018 Lecture Notes in Computer Science  
We evaluate the classifier using three different language pairs: Spanish-English, Dutch-English, and German-Turkish.  ...  The main aim is to develop a simple lexical look-up classifier based on frequency information retrieved from Wikipedia.  ...  This means that named entities correctly identified as Spanish are for example 'san antonio', 'gloria trevi' and 'san marcos'. Results Turkish-German was the second best performing language pair.  ... 
doi:10.1007/978-3-319-73706-5_16 fatcat:toom7gz435f2bhkynu4v6plyge

What's New? Analysing Language-Specific Wikipedia Entity Contexts to Support Entity-Centric News Retrieval [chapter]

Yiwei Zhou, Elena Demidova, Alexandra I. Cristea
2017 Lecture Notes in Computer Science  
Second, we analyse the similarities and the differences in these contexts in a case study including 220 entities and five Wikipedia language editions.  ...  Such language-specific information could be applied in entity-centric information retrieval applications, in which users utilise very simple queries, mostly just the entity names, for the relevant documents  ...  Acknowledgments This work was partially funded by the COST Action IC1302 (KEYSTONE), the ERC under ALEXANDRIA (ERC 339233) and H2020-MSCA-ITN-2014 WDAqua (64279).  ... 
doi:10.1007/978-3-319-59268-8_10 fatcat:gkeuf6fhm5ctzglpsh4x4qjnjm

Cross-Lingual Named Entity Recognition via Wikification

Chen-Tse Tsai, Stephen Mayhew, Dan Roth
2016 Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning  
When trained on English, our model outperforms comparable approaches on the standard CoNLL datasets (Spanish, German, and Dutch) and also performs very well on low-resource languages (e.g., Turkish, Tagalog  ...  Named Entity Recognition (NER) models for language L are typically trained using annotated data in that language.  ...  Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S.  ... 
doi:10.18653/v1/k16-1022 dblp:conf/conll/TsaiMR16 fatcat:pk6qbejp2jhzvmrr6mfsacvvl4

Learning multilingual named entity recognition from Wikipedia

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, James R. Curran
2013 Artificial Intelligence  
We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia.  ...  We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora.  ...  Acknowledgements We would like to thank members of Schwa Lab and the anonymous reviewers for their helpful feedback on all of the research described here.  ... 
doi:10.1016/j.artint.2012.03.006 fatcat:7agjkau5wfhqbeyit3sddv2ggy

Overview of the WiQA Task at CLEF 2006 [chapter]

Valentin Jijkoun, Maarten de Rijke
2007 Lecture Notes in Computer Science  
from the language of the source page, that add new and important information to the source page, and that do so without repetition.  ...  We describe WiQA 2006, a pilot task aimed at studying question answering using Wikipedia.  ...  .-000.106, 612.066.302, 612.069.006, 640.001.501, 640.002.501, and by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.  ... 
doi:10.1007/978-3-540-74999-8_33 fatcat:ppqh5s2p7ffi5g5vizutr25wyi

Who likes me more?

Yiwei Zhou, Elena Demidova, Alexandra I. Cristea
2016 Proceedings of the 31st Annual ACM Symposium on Applied Computing - SAC '16  
in Dutch Wikipedia.  ...  with a case study of five Wikipedia language editions and a set of target entities from four categories.  ...  Acknowledgments This work was partially funded by the COST Action IC1302 (KEYSTONE) and the European Research Council under ALEXANDRIA (ERC 339233).  ... 
doi:10.1145/2851613.2851858 dblp:conf/sac/ZhouDC16 fatcat:6bersj2le5aj3ggswnmxvjfawq

DAMO-NLP at SemEval-2022 Task 11: A Knowledge-based System for Multilingual Named Entity Recognition [article]

Xinyu Wang, Yongliang Shen, Jiong Cai, Tao Wang, Xiaobin Wang, Pengjun Xie, Fei Huang, Weiming Lu, Yueting Zhuang, Kewei Tu, Wei Lu, Yong Jiang
2022 arXiv   pre-print
The MultiCoNER shared task aims at detecting semantically ambiguous and complex named entities in short and low-context settings for multiple languages.  ...  To alleviate this issue, our team DAMO-NLP proposes a knowledge-based system, where we build a multilingual knowledge base based on Wikipedia to provide related context information to the named entity  ...  ., 2022b) aims at building Named Entity Recognition (NER) systems for 11 languages, including English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla.  ... 
arXiv:2203.00545v2 fatcat:egbupibg6nbibjf7l77iaddmke

Spanish Legislation as Linked Data

Víctor Rodríguez-Doncel, María Navas-Loro, Elena Montiel-Ponsoda, Pompeu Casanovas
2018 Zenodo  
In the published dataset, text is structured in articles; key terms are related to external terminological databases, named entities are identified, and links between internal and external documents have  ...  The work presented here is an independent effort to publish Spanish consolidated legislation strongly linked to other external resources.  ...  Acknowledgements This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780602.  ... 
doi:10.5281/zenodo.4756452 fatcat:mlf4c4mlojaq7mophnkyfowxju

Using Syntactic Knowledge for QA [chapter]

Gosse Bouma, Ismail Fahmi, Jori Mur, Gertjan van Noord, Lonneke van der Plas, Jörg Tiedemann
2007 Lecture Notes in Computer Science  
to Dutch) QA system, which uses a combination of Systran and Wikipedia (for term recognition and translation) for question translation.  ...  We describe our system for the monolingual Dutch and multilingual English to Dutch QA tasks.  ...  To improve on the treatment of named entities and terms, we extracted from English Wikipedia all pairs of lemma titles and their cross-links to the corresponding link in Dutch Wikipedia.  ... 
doi:10.1007/978-3-540-74999-8_39 fatcat:zluo7aoifrfllhpph7uyub7xpm

GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies [article]

Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet
2019 arXiv   pre-print
We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies.  ...  While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by  ...  Acknowledgments The authors want to thank Jordi Armengol, Magdalena Biesialska, Casimiro Carrino, Noe Casas, Guillem Cortès, Carlos Escolano, Gerard Gallego and Bardia Rafieian for  ... 
arXiv:1912.04778v1 fatcat:mf27cv7shzalta4yy432tcnfdy

Complex Factoid Question Answering with a Free-Text Knowledge Graph

Chen Zhao, Chenyan Xiong, Xin Qian, Jordan Boyd-Graber
2020 Proceedings of The Web Conference 2020  
DELFT builds a free-text knowledge graph from Wikipedia, with entities as nodes and sentences in which entities co-occur as edges.  ...  For each question, DELFT finds the subgraph linking question entity nodes to candidates using text sentences as edges, creating a dense and high coverage semantic graph.  ...  For example, Delft occurs in the Wikipedia page associated with Question Entity Node Vermeer and thus becomes a Candidate Entity Node.  ... 
doi:10.1145/3366423.3380197 dblp:conf/www/ZhaoXQB20 fatcat:ahvzga5qdjdfph36gpkuikn4oe