Inferring multilingual domain-specific word embeddings from large document corpora

Luca Cagliero, Moreno La Quatra
2021 IEEE Access  
It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source  ...  However, in several cross-lingual NLP domains both large enough domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English.  ...  [40] ) relied on bilingual lexicons.  ... 
doi:10.1109/access.2021.3118093 fatcat:pyxp6lre5naktagtgua4ucyyi4

A Survey of Code-switched Speech and Language Processing [article]

Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W Black
2020 arXiv   pre-print
This survey reviews computational approaches for code-switched Speech and Natural Language Processing.  ...  We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities.  ...  lexicon is used for decoding.  ... 
arXiv:1904.00784v3 fatcat:r5tsg4kdnfbtnndae523c32pta

TwiSE at SemEval-2016 Task 4: Twitter Sentiment Classification

Georgios Balikas, Massih-Reza Amini
2016 Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)  
I am grateful for his support and advice. Our discussions and brainstorming during all these years have deeply influenced me.  ...  Bilingual Topics Models for Comparable Corpora In Chapter 5 we proposed to better adapt bilingual topic models for comparable corpora with explicit alignments.  ...  Apart from that, the motivations for adapting the bilingual topic models to comparable corpora lie on two facts: on one hand, comparable corpora are more common and easy to obtain or to construct than  ... 
doi:10.18653/v1/s16-1010 dblp:conf/semeval/BalikasA16 fatcat:w7o56n5ny5hkjgtnqghp2sdeua

Survey of Low-Resource Machine Translation [article]

Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch
2022 arXiv   pre-print
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.  ...  Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language.  ...  Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data.  ... 
arXiv:2109.00486v3 fatcat:5wof74vjy5gptcl5ornkd5j4ku

Survey of Low-Resource Machine Translation

Haddow, Bawden, Miceli Barone, Helcl, Birch
2022 Zenodo  
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.  ...  Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language.  ...  Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data.  ... 
doi:10.5281/zenodo.6672725 fatcat:ydiog4mdknglxjayk4rlpot5p4

Survey of Low-Resource Machine Translation

Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch
2022 Computational Linguistics  
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.  ...  Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language.  ...  Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data.  ... 
doi:10.1162/coli_a_00446 fatcat:mvpv6awfl5d3phudp5nd2cz2cy

Designing an Extensible Domain-Specific Web Corpus for "Layfication" [chapter]

Marina Santini, Arne Jönsson, Wiktor Strandqvist, Gustav Cederblad, Mikael Nyström, Marjan Alirezaie, Leili Lind, Eva Blomqvist, Maria Lindén, Annica Kristoffersson
2019 Advances in Systems Analysis, Software Engineering, and High Performance Computing  
The main purpose of the corpus is to be used for building and training language technology applications for the "layfication" of the specialized medical jargon.  ...  In this chapter, the authors describe the design and the development of an extensible domain-specific web corpus to be used in a distributed social application for the care of the elderly at home.  ...  These results are promising if compared with the state of the art of keyword extraction methods, but are moderate if compared with term-extractor based on large corpora.  ... 
doi:10.4018/978-1-5225-7879-6.ch006 fatcat:tgaorpe5fvepnhl7j66mkp2taa

ATLAS: A flexible and extensible architecture for linguistic annotation [article]

Steven Bird, David Day, John Garofolo, John Henderson, Christophe Laprun, Mark Liberman
2000 arXiv   pre-print
We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations.  ...  .), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.).  ...  It will facilitate the definition of consistent logical and physical formats for meta data.  ... 
arXiv:cs/0007022v1 fatcat:ap2cfa4iyfg3xco4xjilkgi4nu

A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs [article]

Alexander Kalinowski, Yuan An
2020 arXiv   pre-print
Given the pervasive nature of these algorithms, the natural question becomes how to exploit the embedding spaces to map, or align, embeddings of different data sources.  ...  To this end, we survey the current research landscape on word, sentence and knowledge graph embedding algorithms.  ...  This dataset contains monolingual embeddings and seed dictionaries for 30 languages, as well as bilingual seed dictionary pairs for 110 languages.  ... 
arXiv:2010.13688v1 fatcat:npkzwukih5gwnkvng2fxy7ls5y

A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave

Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro
2021 IEEE Access  
Second, we present the event extraction task, the related challenges, and the available annotated corpora.  ...  After being studied for years, automatic event extraction is on the road to significantly impact biology in a wide range of applications, from knowledge base enrichment to the formulation of new research  ...  ACKNOWLEDGMENT The authors thank Giulio Carlassare for his contributions during productive discussions and practical experiments on biomedical corpora.  ... 
doi:10.1109/access.2021.3130956 fatcat:wlr7zeikdva77ojuppqx3vmocy

SYSTRAN's Pure Neural Machine Translation Systems [article]

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson (+14 others)
2016 arXiv   pre-print
Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems as demonstrated by several entities announcing  ...  Our ultimate goal is to share our expertise to build competitive production systems for "generic" translation.  ...  Table 11: Human error analysis done for 50 sentences of the corpus defined in the section 6.2 for English-French on NMT, SMT (Google) and RBMT outputs.  ... 
arXiv:1610.05540v1 fatcat:3ciscrxcbre63obuwyucf3y62i

D3.6 Research Challenge Report v3

Christian Chiarcos, Christian Fäth, Jorge Gracia, Bernardo Stearns, Mohammad Fazleh Elahi, Patricia Martín-Chozas, Maxim Ionov, Julia Bosque-Gil, Fernando Bobillo, Marta Lanau-Coronas, John P McCrae, Mariano Rico (+5 others)
2022 Zenodo  
Note that this is a cumulative report designed to be self-contained, so it builds on and substitutes previous versions of this report (D3.1 and D3.2), and it incorporates core information from the software  ...  However, each word embedding has been trained with monolingual corpora.  ...  : The lens examines the data around the sense pair to be linked and extracts text that can be compared for similarity.  ... 
doi:10.5281/zenodo.6759391 fatcat:qef4okfw4ffxjazow6m25oda34

D3.2 Research Challenge Report v2

Christian Fäth, Christian Chiarcos, Jorge Gracia, Julia Bosque-Gil, Bernardo Stearns, John P. McCrae, Fernando Bobillo, Philipp Cimiano, Thierry Declerck, Mohammad Fazleh Elahi, Basil Ell, Julian Grosse (+4 others)
2020 Zenodo  
data for language services".  ...  It allows to integrate RDF converters for various input formats and combine them with stream-based graph transformation for building complex transformation pipelines.  ...  However, each word embedding has been trained with monolingual corpora.  ... 
doi:10.5281/zenodo.5744508 fatcat:zukppmtuebcrhevwpzz2u3gdoy

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, Anna Korhonen
2019 Computational Linguistics  
A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources.  ...  We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP.  ...  Cross-lingual training jointly learns embeddings from parallel corpora and enforces cross-lingual constraints.  ... 
doi:10.1162/coli_a_00357 fatcat:cfekqbpmwzegdf6j6atez2rsbe

Message from the general chair

Benjamin C. Lee
2015 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
Compared with the best system from CoNLL-2011, which employs a rule-based method, our system shows competitive performance.  ...  Our system gives a better performance than all the learning-based systems from the CoNLL-2011 shared task on the same dataset.  ...  for lexicon extraction that extracts translation pairs from comparable corpora by using graph-based label propagation.  ... 
doi:10.1109/ispass.2015.7095776 dblp:conf/ispass/Lee15 fatcat:ehbed6nl6barfgs6pzwcvwxria