1,820 Hits in 7.2 sec

Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification

Aditya Mogadala, Achim Rettinger
2016 Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  
non-parallel document corpora to support cross-language text classification.  ...  We introduce BRAVE (Bilingual paRAgraph VEctors), a model to learn bilingual distributed representations (i.e. embeddings) of words without word alignments either from sentence-aligned parallel or label-aligned  ...  Acknowledgments The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.  ... 
doi:10.18653/v1/n16-1083 dblp:conf/naacl/MogadalaR16 fatcat:z2qri3ovsvhwrhjkqfkw4plxhi

Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings [article]

Sosuke Nishikawa, Ryokan Ri, Yoshimasa Tsuruoka
2021 arXiv   pre-print
Unsupervised cross-lingual word embedding (CLWE) methods learn a linear transformation matrix that maps two monolingual embedding spaces that are separately trained with monolingual corpora.  ...  language that helps to learn similar embedding spaces between the source and target languages.  ...  We concatenate the pseudo corpora with the original corpora, and learn monolingual word embeddings for each language.  ... 
arXiv:2006.00262v3 fatcat:u56cphwbkvccvmrngl56lhiwoq
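The "linear transformation matrix that maps two monolingual embedding spaces" described in the snippet above is typically found with the orthogonal Procrustes solution over a seed dictionary of translation pairs. A minimal sketch of that step, using random toy matrices as stand-ins for real embeddings (the sizes and data here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Toy stand-ins for monolingual embeddings of n seed translation pairs:
# row i of X (source language) should map close to row i of Y (target).
rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))

# Orthogonal Procrustes: W* = argmin over orthogonal W of ||X W - Y||_F,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# W is orthogonal, so mapping preserves distances in the source space.
assert np.allclose(W @ W.T, np.eye(d))

# The learned map fits the seed pairs at least as well as the identity
# (the identity is itself orthogonal, so it cannot beat the minimizer).
err_mapped = np.linalg.norm(X @ W - Y)
err_identity = np.linalg.norm(X - Y)
```

Restricting W to be orthogonal is a common design choice in mapping-based CLWE because it keeps monolingual geometry intact while aligning the two spaces.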

When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer [article]

Ameet Deshpande, Partha Talukdar, Karthik Narasimhan
2022 arXiv   pre-print
transfer performance and word embedding alignment between languages (e.g., R=0.94 on the task of NLI).  ...  Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence.  ...  Acknowledgments This work was funded through a grant from the Chadha Center for Global India at Princeton University.  ... 
arXiv:2110.14782v3 fatcat:ctoreddmera27knjjzszrvycq4

Exploring Distributional Representations and Machine Translation for Aspect-based Cross-lingual Sentiment Classification

Jeremy Barnes, Patrik Lambert, Toni Badia
2016 International Conference on Computational Linguistics  
Cross-lingual sentiment classification (CLSC) seeks to use resources from a source language in order to detect sentiment and classify text in a target language.  ...  We compare zero-shot learning, bilingual word embeddings, stacked denoising autoencoder representations and machine translation techniques for aspect-based CLSC.  ...  Bilingual Word Embeddings The next set of experiments required the use of parallel sentences to create bilingual word embeddings (BWEs).  ... 
dblp:conf/coling/BarnesLB16 fatcat:oupstotnjvha7duov2k52nedgy

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher [article]

Giannis Karamanolakis, Daniel Hsu, Luis Gravano
2020 arXiv   pre-print
Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages.  ...  Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches  ...  Acknowledgments We thank the anonymous reviewers for their constructive feedback. This material is based upon work supported by the National Science Foundation under Grant No. IIS-15-63785.  ... 
arXiv:2010.02562v1 fatcat:rggtsno3i5fcnnhzobzl6vevmq

Structural Correspondence Learning for Cross-lingual Sentiment Classification with One-to-many Mappings [article]

Nana Li, Shuangfei Zhai, Zhongfei Zhang, Boying Liu
2016 arXiv   pre-print
Our method does not rely on the parallel corpora and the experimental results show that our approach is more competitive than the state-of-the-art methods in cross-lingual sentiment classification.  ...  Structural correspondence learning (SCL) is an effective method for cross-lingual sentiment classification.  ...  Acknowledgments This work is supported in part by Tianjin National Natural Science Foundation for Young Scholars (13JCQNJC00200).  ... 
arXiv:1611.08737v1 fatcat:gty36v2qgbaypmapm7rfmm5tem

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages [article]

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
2020 arXiv   pre-print
We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings.  ...  We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.  ...  We train bilingual word embeddings from English to Indian languages and vice versa using GeoMM (Jawanpuria et al., 2019) , a state-of-the-art supervised method for learning bilingual embeddings.  ... 
arXiv:2005.00085v1 fatcat:eiyrelngcbhxpmzyelcfj62qua

Robust Cross-lingual Embeddings from Parallel Sentences [article]

Ali Sabet, Prakhar Gupta, Jean-Baptiste Cordonnier, Robert West, Martin Jaggi
2020 arXiv   pre-print
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations.  ...  Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation  ...  Acknowledgments We acknowledge funding from the Innosuisse ADA grant.  ... 
arXiv:1912.12481v2 fatcat:onah22qti5gmrghnyi7o6h4pua
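Cross-lingual word representations like the ones above are commonly evaluated by bilingual lexicon induction: retrieve, for each source word, its nearest neighbour in the target vocabulary within the shared space. A toy sketch of that retrieval step, with hand-made vectors standing in for trained embeddings (the word lists and noise level are illustrative assumptions):

```python
import numpy as np

# Toy shared-space embeddings; in practice these would come from a
# bilingual CBOW model or a mapped pair of monolingual spaces.
rng = np.random.default_rng(2)
d = 6
en = {"dog": rng.standard_normal(d), "cat": rng.standard_normal(d)}
# Target-language vectors: close to their translations, plus small noise.
fr = {"chien": en["dog"] + 0.05 * rng.standard_normal(d),
      "chat":  en["cat"] + 0.05 * rng.standard_normal(d)}

def translate(word):
    # Nearest neighbour by cosine similarity in the shared space.
    q = en[word]
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(fr, key=lambda w: cos(fr[w], q))
```

Plain cosine retrieval like this suffers from the hubness problem in high dimensions, which is why refinements such as CSLS are often used instead.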

Evaluating Word Embeddings for Indonesian-English Code-Mixed Text Based on Synthetic Data

Arra'di Nur Rizal, Sara Stymne
2020 Workshop on Computational Approaches to Code Switching  
In this paper, we explore and evaluate different types of word embeddings for Indonesian-English code-mixed text.  ...  Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey.  ...  This study uses parallel text data aligned at the phrase-level, as a basis for inserting phrases from the embedded language into the matrix language.  ... 
dblp:conf/acl-codeswitch/RizalS20 fatcat:sxqriddovnhktetgzqguhkdlqq

A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles

Maha Jarallah Althobaiti
2021 IEEE Access  
The word semantic similarity models trained on our parallel corpus outperformed other models trained on other corpora in the task of English non-similar word identification.  ...  In this paper, we present a novel method to automatically create parallel sentences from comparable corpora.  ...  Parallel corpora are crucial resources for many Natural Language Processing (NLP) applications, such as machine translation, and cross-lingual information retrieval [2] - [4] .  ... 
doi:10.1109/access.2021.3137830 fatcat:m2r345y5xnbujof2jegybkiz4e

Cross-Lingual Word Embeddings

Eneko Agirre
2020 Computational Linguistics  
Cross-lingual word embeddings (CLWE for short) extend the idea, and represent translation-equivalent words from two (or more) languages close to each other in a common, cross-lingual space.  ...  For instance, given training data for a text-classification task in English, a model using CLWE can classify foreign language documents.  ...  The models that require parallel data in the form of bilingual dictionaries (or word alignments induced from parallel corpora) are further classified into those that learn separate monolingual spaces for  ... 
doi:10.1162/coli_r_00372 fatcat:ei2demw3efbkrftadp6l6kl6qe
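The zero-shot scenario the review mentions (train a classifier on English, then classify foreign-language documents) reduces to ordinary classification once both languages live in one space. A minimal nearest-centroid sketch over synthetic vectors; every embedding below is a random stand-in for real CLWE document vectors, and the labels are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Invented class prototypes in a shared cross-lingual space.
proto = {"sports": rng.standard_normal(d), "politics": rng.standard_normal(d)}

def embed(vec, noise=0.1):
    # Stand-in for averaging CLWE word vectors over a document:
    # a noisy copy of the document's class prototype.
    return vec + noise * rng.standard_normal(d)

# "Train" on English documents only: one centroid per class.
train = {label: np.mean([embed(v) for _ in range(20)], axis=0)
         for label, v in proto.items()}

def classify(doc_vec):
    # Nearest centroid by cosine similarity.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(train, key=lambda label: cos(train[label], doc_vec))

# A "foreign-language" document embedded in the same space is classified
# without any labeled data in that language.
foreign_doc = embed(proto["sports"])
```

The point of the sketch is that no target-language labels appear anywhere: only the shared embedding space carries the supervision across languages.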

Simple task-specific bilingual word embeddings

Stephan Gouws, Anders Søgaard
2015 Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  
We show how our method outperforms off-the-shelf bilingual embeddings on the task of unsupervised cross-language part-of-speech (POS) tagging, as well as on the task of semi-supervised cross-language super  ...  We introduce a simple wrapper method that uses off-the-shelf word embedding algorithms to learn task-specific bilingual word embeddings.  ... 
doi:10.3115/v1/n15-1157 dblp:conf/naacl/GouwsS15 fatcat:htznsziwwbfmzmcavgbqiozneq

Word Embeddings for Code-Mixed Language Processing

Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
Thus, this study demonstrates that existing bilingual embedding techniques are not ideal for code-mixed text processing and there is a need for learning multilingual word embedding from the code-mixed  ...  We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks  ...  Acknowledgements We would like to thank Kalika Bali, Gayatri Bhat and Sandipan Dandapat for their valuable suggestions and help in the creation of the synthetic CM corpus.  ... 
doi:10.18653/v1/d18-1344 dblp:conf/emnlp/PratapaCS18 fatcat:ofcgc5y65vcrdahyf7zpdepkoy

Evaluating the Impact of Bilingual Lexical Resources on Cross-lingual Sentiment Projection in the Pharmaceutical Domain

Matthias Hartung, Matthias Orlikowski, Susana Veríssimo
2020 Zenodo  
the target language, capitalizing on monolingual embeddings and a bilingual translation dictionary only (Barnes et al., 2018).  ...  For the language pair English/Spanish, our findings corroborate the strength of cross-lingual projection approaches such as BLSE in technical scenarios, given the availability of bilingual resources that  ...  Acknowledgements For this work we have received funding from the H2020 project Prêt-à-LLOD under Grant Agreement number 825182. Bibliographical References  ... 
doi:10.5281/zenodo.3707940 fatcat:uttn5gyzjvhxnjruinlb6hm7vm

Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings

Pranaydeep Singh, Els Lefever
2020 Workshop on Computational Approaches to Code Switching  
This paper investigates the use of unsupervised cross-lingual embeddings for solving the problem of code-mixed social media text understanding.  ...  model in one of the source languages and evaluating on the other language projected in the same space.  ...  Most past work building cross-lingual sentiment models does so using translation systems (Zhou et al., 2016) or cross-lingual signals in another form, such as parallel corpora or bilingual dictionaries  ... 
dblp:conf/acl-codeswitch/SinghL20 fatcat:3efm5wr3yrdutbzimuo74atvbm
Showing results 1 — 15 out of 1,820 results