Inferring multilingual domain-specific word embeddings from large document corpora
2021
IEEE Access
It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source ...
However, in several cross-lingual NLP domains both large enough domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. ...
[40]) relied on bilingual lexicons. ...
doi:10.1109/access.2021.3118093
fatcat:pyxp6lre5naktagtgua4ucyyi4
A Survey of Code-switched Speech and Language Processing
[article]
2020
arXiv
pre-print
This survey reviews computational approaches for code-switched Speech and Natural Language Processing. ...
We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. ...
lexicon is used for decoding. ...
arXiv:1904.00784v3
fatcat:r5tsg4kdnfbtnndae523c32pta
TwiSE at SemEval-2016 Task 4: Twitter Sentiment Classification
2016
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
I am grateful for his support and advice. Our discussions and brainstorming during all these years have deeply influenced me. ...
Bilingual Topics Models for Comparable Corpora In Chapter 5 we proposed to better adapt bilingual topic models for comparable corpora with explicit alignments. ...
Apart from that, the motivations for adapting the bilingual topic models to comparable corpora lie in two facts: on the one hand, comparable corpora are more common and easier to obtain or construct than ...
doi:10.18653/v1/s16-1010
dblp:conf/semeval/BalikasA16
fatcat:w7o56n5ny5hkjgtnqghp2sdeua
Survey of Low-Resource Machine Translation
[article]
2022
arXiv
pre-print
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. ...
Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language. ...
Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data. ...
arXiv:2109.00486v3
fatcat:5wof74vjy5gptcl5ornkd5j4ku
Survey of Low-Resource Machine Translation
2022
Zenodo
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. ...
Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language. ...
Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data. ...
doi:10.5281/zenodo.6672725
fatcat:ydiog4mdknglxjayk4rlpot5p4
Survey of Low-Resource Machine Translation
2022
Computational Linguistics
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. ...
Bilingual lexicons Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language. ...
Identifying, extracting and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from web data. ...
doi:10.1162/coli_a_00446
fatcat:mvpv6awfl5d3phudp5nd2cz2cy
Designing an Extensible Domain-Specific Web Corpus for "Layfication"
[chapter]
2019
Advances in Systems Analysis, Software Engineering, and High Performance Computing
The main purpose of the corpus is to be used for building and training language technology applications for the "layfication" of the specialized medical jargon. ...
In this chapter, the authors describe the design and the development of an extensible domain-specific web corpus to be used in a distributed social application for the care of the elderly at home. ...
These results are promising compared with state-of-the-art keyword extraction methods, but moderate compared with term extractors based on large corpora. ...
doi:10.4018/978-1-5225-7879-6.ch006
fatcat:tgaorpe5fvepnhl7j66mkp2taa
ATLAS: A flexible and extensible architecture for linguistic annotation
[article]
2000
arXiv
pre-print
We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. ...
.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). ...
It will facilitate the definition of consistent logical and physical formats for meta data. ...
arXiv:cs/0007022v1
fatcat:ap2cfa4iyfg3xco4xjilkgi4nu
A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs
[article]
2020
arXiv
pre-print
Given the pervasive nature of these algorithms, the natural question becomes how to exploit the embedding spaces to map, or align, embeddings of different data sources. ...
To this end, we survey the current research landscape on word, sentence and knowledge graph embedding algorithms. ...
This dataset contains monolingual embeddings and seed dictionaries for 30 languages, as well as bilingual seed dictionary pairs for 110 languages. ...
arXiv:2010.13688v1
fatcat:npkzwukih5gwnkvng2fxy7ls5y
A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave
2021
IEEE Access
Second, we present the event extraction task, the related challenges, and the available annotated corpora. ...
After being studied for years, automatic event extraction is on the road to significantly impact biology in a wide range of applications, from knowledge base enrichment to the formulation of new research ...
ACKNOWLEDGMENT The authors thank Giulio Carlassare for his contributions during productive discussions and practical experiments on biomedical corpora. ...
doi:10.1109/access.2021.3130956
fatcat:wlr7zeikdva77ojuppqx3vmocy
SYSTRAN's Pure Neural Machine Translation Systems
[article]
2016
arXiv
pre-print
Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems as demonstrated by several entities announcing ...
Our ultimate goal is to share our expertise to build competitive production systems for "generic" translation. ...
Table 11: Human error analysis done for 50 sentences of the corpus defined in Section 6.2 for English-French on NMT, SMT (Google) and RBMT outputs. ...
arXiv:1610.05540v1
fatcat:3ciscrxcbre63obuwyucf3y62i
D3.6 Research Challenge Report v3
2022
Zenodo
Note that this is a cumulative report designed to be self-contained, so it builds on and substitutes previous versions of this report (D3.1 and D3.2), and it incorporates core information from the software ...
However, each word embedding has been trained with monolingual corpora. ...
The lens examines the data around the sense pair to be linked and extracts text that can be compared for similarity. ...
doi:10.5281/zenodo.6759391
fatcat:qef4okfw4ffxjazow6m25oda34
D3.2 Research Challenge Report v2
2020
Zenodo
data for language services". ...
It allows integrating RDF converters for various input formats and combining them with stream-based graph transformation to build complex transformation pipelines. ...
However, each word embedding has been trained with monolingual corpora. ...
doi:10.5281/zenodo.5744508
fatcat:zukppmtuebcrhevwpzz2u3gdoy
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
2019
Computational Linguistics
A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. ...
We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. ...
Cross-lingual training jointly learns embeddings from parallel corpora and enforces cross-lingual constraints. ...
doi:10.1162/coli_a_00357
fatcat:cfekqbpmwzegdf6j6atez2rsbe
Message from the general chair
2015
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Compared with the best system from CoNLL-2011, which employs a rule-based method, our system shows competitive performance. ...
Our system gives a better performance than all the learning-based systems from the CoNLL-2011 shared task on the same dataset. ...
for lexicon extraction that extracts translation pairs from comparable corpora by using graph-based label propagation. ...
doi:10.1109/ispass.2015.7095776
dblp:conf/ispass/Lee15
fatcat:ehbed6nl6barfgs6pzwcvwxria