
Extracting Multilingual Topics from Unaligned Comparable Corpora [chapter]

Jagadeesh Jagarlamudi, Hal Daumé
2010 Lecture Notes in Computer Science  
In this paper we present a generative model called JointLDA which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus.  ...  Though there are some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments.  ... 
doi:10.1007/978-3-642-12275-0_39 fatcat:rzqvbd4kjva3bnzp6slasgsx6q

Qualitative Comparison of Native and Machine-Translated Parliamentary Debates

Ajda Pretnar Zagar
2022 Digital Humanities in the Nordic Countries Conference  
It qualitatively compares three steps in topic interpretation: topic description, topic significance in subcorpora, and marginal topic distribution.  ...  They can potentially lift the barriers to applying NLP tools and methods to previously unsupported languages and boost comparative cross-lingual research in digital humanities.  ...  PTM can be extended to unaligned documents, but not all corpora contain comparable documents.  ... 
dblp:conf/dhn/Zagar22 fatcat:lav36rgjmnhgxcvyzhums5yfay

Pseudo-Aligned Multilingual Corpora

Fernando Diaz, Donald Metzler
2007 International Joint Conference on Artificial Intelligence  
We apply semisupervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language.  ...  Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound.  ...  This would allow one to leverage topic information from different languages when defining the lower-dimensional topic space. Second, we adopted parallel corpora for evaluation reasons.  ... 
dblp:conf/ijcai/DiazM07 fatcat:3fdipgp2wveqhpfddgohhznecu

Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Jordan L. Boyd-Graber, Philip Resnik
2010 Conference on Empirical Methods in Natural Language Processing  
MLSLDA provides a method for extracting topical and sentiment-correlated word lists from multilingual corpora.  ...  Figure 4 shows extracted topics from German-English and German-Chinese corpora. MLSLDA is able to distinguish sentiment-bearing topics from content-bearing topics.  ... 
dblp:conf/emnlp/Boyd-GraberR10 fatcat:rh3xsvsf6fch3d36uy3b4hlgyu

Cross-Lingual Latent Topic Extraction

Duo Zhang, Qiaozhu Mei, ChengXiang Zhai
2010 Annual Meeting of the Association for Computational Linguistics  
Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.  ...  Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way.  ...  Besides all the multilingual topic modeling work discussed above, comparable corpora have also been studied extensively (e.g.  ... 
dblp:conf/acl/ZhangMZ10 fatcat:icpzu6wsrnd2riv2caamvqpvzy

Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced from Comparable Corpora

Sree Harsha Ramesh, Krishna Prasad Sankaranarayanan
2018 Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop  
Resources for the non-English languages are scarce and this paper addresses this problem in the context of machine translation, by automatically extracting parallel sentence pairs from the multilingual  ...  In this paper, we have used an end-to-end Siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual articles in Wikipedia.  ...  Comparable corpora such as Wikipedia, are collections of topic-aligned but non-sentence-aligned multilingual documents which are rich resources for extracting parallel sentences from.  ... 
doi:10.18653/v1/n18-4016 dblp:conf/naacl/RameshS18 fatcat:emmblyhcwbahvczghp4tm4cy3a

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics [chapter]

Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek
2015 Lecture Notes in Computer Science  
This research explores our new methodologies for mining such data from previously obtained comparable corpora.  ...  Here we propose a web crawling method for building subject-aligned comparable corpora from e.g. Wikipedia dumps and Euronews web page.  ...  We seek to obtain parallel corpora from unaligned data. Solution proposed by our team is based on sequential analogy detection.  ... 
doi:10.1007/978-3-319-25252-0_46 fatcat:l6y2i4om3jddhilmfoeozdjn6e

Adversarial Alignment of Multilingual Models for Extracting Temporal Expressions from Text [article]

Lukas Lange, Anastasiia Iurshina, Heike Adel, Jannik Strötgen
2020 arXiv pre-print
In this paper, we explore multilingual methods for the extraction of temporal expressions from text and investigate adversarial training for aligning embedding spaces to one common space.  ...  With this, we create a single multilingual model that can also be transferred to unseen languages and set the new state of the art in those cross-lingual transfer experiments.  ...  Introduction: The extraction of temporal expressions from text is an important processing step for many applications, such as topic detection and question answering (Strötgen and Gertz, 2016).  ... 
arXiv:2005.09392v1 fatcat:rs5brn7oevegth3gxbsqfusa4e

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Krzysztof Wołk, Krzysztof Marasek
2014 Procedia Technology - Elsevier  
We also introduce a method for extracting truly parallel sentences that are filtered out from noisy or just comparable sentence pairs.  ...  This research explores our methodology for mining such data from previously obtained comparable corpora.  ...  Acknowledgements This work was supported by the European Community from the European Social Fund within the Interkadra project UDA-POKL-04.01.01-00-014/10-00 and Eu-Bridge 7th FR EU project (Grant Agreement  ... 
doi:10.1016/j.protcy.2014.11.024 fatcat:fijsnjenrnarhnqt5kgw6wewta

The Role of Sketch Engine in Multiple Types of Corpora

This paper sheds light on the significant role Sketch Engine plays in relation to different types of corpora.  ...  The software's features that support the creation of multilingual dictionaries and lexicography are also discussed.  ...  MULTIPLE TYPES OF CORPORA This paper uses different Sketch Engine terms which are explained as follows: A. Comparable Corpora This type of corpora are unaligned to each other.  ... 
doi:10.35940/ijitee.k1307.0981119 fatcat:s7e6pl6nprfivkaj5uazmqssty

Controlling Target Features in Neural Machine Translation via Prefix Constraints

Shunsuke Takeno, Masaaki Nagata, Kazuhide Yamamoto
2017 Workshop on Asian Translation  
Prefix constraints can be predicted from source sentence jointly with target sentence, while side constraints must be provided by the user or predicted by some other methods.  ...  prefix constraints are more flexible than side constraints and can be used to control the behavior of neural machine translation, in terms of output length, bidirectional decoding, domain adaptation, and unaligned  ...  Tatoeba is a collection of multilingual translated example sentences from Tatoeba website.  ... 
dblp:conf/aclwat/TakenoNY17 fatcat:eap7tu5gszfgncszwml3s4zdiq

ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application

Hetong Ma, Feihong Yang, Jiansong Ren, Ni Li, Min Dai, Xuwen Wang, An Fang, Jiao Li, Qing Qian, Jie He
2020 BMC Medical Informatics and Decision Making  
However, the scarcity of multilingual cancer corpus limits the intelligent processing, such as machine translation in medical scenarios.  ...  application as a preparatory data foundation e.g. cancer-related machine translation, cancer system development towards medical education, and disease-oriented knowledge extraction.  ...  MulTed is a multilingual parallel corpus collected from TED talks containing general topics [8] etc.  ... 
doi:10.1186/s12911-020-1116-1 pmid:32646415 fatcat:vg7hmi2levewxk4dpcjl5m4pt4

LIMSI: Translations as Source of Indirect Supervision for Multilingual All-Words Sense Disambiguation and Entity Linking

Marianna Apidianaki, Li Gong
2015 Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)  
We present the LIMSI submission to the Multilingual Word Sense Disambiguation and Entity Linking task of SemEval-2015.  ...  The system exploits the parallelism of the multilingual test data and uses translations as source of indirect supervision for sense selection.  ...  Task Description The SemEval-2015 Multilingual WSD and EL task (Moro and Navigli, 2015) aims to promote joint research in these two closely-related topics.  ... 
doi:10.18653/v1/s15-2050 dblp:conf/semeval/ApidianakiG15 fatcat:azui7vg2cbfxnkv5sebyhoagim

Multilingual and code-switching ASR challenges for low resource Indian languages [article]

Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh (+10 others)
2021 arXiv pre-print
of labeled corpora in multiple languages.  ...  With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well.  ...  Dataset Description The Hindi-English and Bengali-English datasets are extracted from spoken tutorials.  ... 
arXiv:2104.00235v1 fatcat:eevwpnji2fdtdk7ltatn7hkkua

Coarse-grained Cross-lingual Alignment of Comparable Texts with Topic Models and Encyclopedic Knowledge [article]

Vivi Nastase, Angela Fahrni
2014 arXiv pre-print
induced multilingual topics.  ...  We present a method for coarse-grained cross-lingual alignment of comparable texts: segments consisting of contiguous paragraphs that discuss the same theme (e.g. history, economy) are aligned based on  ...  Multilingual topic modeling Jagarlamudi and Daumé III (2010) use a bilingual dictionary to obtain multilingual topics from an unaligned multilingual corpus.  ... 
arXiv:1411.7820v1 fatcat:il5zvspwajfdjpf5yuwiv4fdim