2,488 Hits in 3.9 sec

Multilingual Language Processing From Bytes [article]

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya
2016 arXiv   pre-print
Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion  ...  Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model.  ...  Acknowledgments Many thanks to Fernando Pereira and Dan Ramage for their insights about this project from the outset. Thanks also to Cree Howard for creating Figure 1.  ... 
arXiv:1512.00103v2 fatcat:7ui6b6ptszbqfdnwoehhmx2x6i

Multilingual Language Processing From Bytes

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya
2016 Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  
Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion  ...  Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model.  ...  Acknowledgments Many thanks to Fernando Pereira and Dan Ramage for their insights about this project from the outset. Thanks also to Cree Howard for creating Figure 1.  ... 
doi:10.18653/v1/n16-1155 dblp:conf/naacl/GillickBVS16 fatcat:4eeavkmadngmdn72na4wlzge5i
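
The byte-level idea in the two entries above is easy to see in a few lines: every script reduces to the same 256-value UTF-8 byte inventory, so one model can share a single small input vocabulary across languages. A minimal illustrative sketch (not the authors' code; the sample strings are arbitrary):

```python
# Map text in several scripts to UTF-8 byte IDs: whatever the language,
# the model's input vocabulary never grows beyond 256 symbols.

samples = {
    "en": "language",
    "ru": "язык",
    "zh": "语言",
    "ar": "لغة",
}

for lang, text in samples.items():
    byte_ids = list(text.encode("utf-8"))   # values in 0..255, any script
    print(f"{lang}: {len(text)} chars -> {len(byte_ids)} bytes: {byte_ids}")
```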

Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes [article]

Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan
2018 arXiv   pre-print
These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing.  ...  Additionally, our multilingual byte model outperforms each respective single-language baseline by 4.4% relative on average.  ...  Benefiting from the language-independent representation of Unicode bytes, we find it is possible to progressively add support for new languages when building a multilingual A2B (audio-to-byte) model.  ... 
arXiv:1811.09021v1 fatcat:axsm5xwqrva3bgiasy3ttnn5rq

Multilingual Grapheme-To-Phoneme Conversion with Byte Representation

Mingzhi Yu, Hieu Duy Nguyen, Alex Sokolov, Jack Lepird, Kanthashree Mysore Sathyendra, Samridhi Choudhary, Athanasios Mouchtaris, Siegfried Kunzmann
2020 ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
However, most multilingual G2P studies focus on sets of languages that share similar graphemes, such as European languages. Multilingual G2P for languages from different writing systems, e.g.  ...  In addition, byte-level models are 15.0%-20.1% smaller in size. Our results show that bytes are an efficient representation for multilingual G2P across languages with large grapheme vocabularies.  ...  Multilingual Models We trained multilingual G2P models using data from European and East Asian languages. Table 3 presents the results under different setups.  ... 
doi:10.1109/icassp40776.2020.9054696 dblp:conf/icassp/YuNSLSCMK20 fatcat:b4gdodfl3nbgvn3z6xwckmtyl4

ByT5 model for massively multilingual grapheme-to-phoneme conversion [article]

Jian Zhu, Cong Zhang, David Jurgens
2022 arXiv   pre-print
We have curated a G2P dataset from various sources that covers around 100 languages and trained large-scale multilingual G2P models based on ByT5.  ...  Pairwise comparison with monolingual models in these languages suggests that multilingual ByT5 models generally lower the phone error rate by jointly learning from a variety of languages.  ...  In contrast, token-free models [32] operating on raw bytes can process any type of string, making multilingual processing easier and reducing the complexity of the text processing pipeline.  ... 
arXiv:2204.03067v2 fatcat:idsd6d64vfeulifmuqswmzqxfa
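
The phone error rate mentioned in the entry above is conventionally computed as the Levenshtein distance between predicted and reference phone sequences, normalized by the reference length. A hedged sketch of that standard metric (not the authors' evaluation code; the phone sequences are illustrative):

```python
# Phone error rate (PER): edit distance over phone tokens / reference length.

def edit_distance(ref, hyp):
    """Standard dynamic-programming Levenshtein distance over phone tokens."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (r != h))    # substitution or match
            prev, dp[j] = dp[j], cur
    return dp[-1]

ref = ["HH", "AH", "L", "OW"]   # reference phones (illustrative)
hyp = ["HH", "EH", "L", "OW"]   # model prediction with one substitution
per = edit_distance(ref, hyp) / len(ref)
print(f"PER = {per:.2%}")       # 25.00%
```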

Byte-based Neural Machine Translation

Marta R. Costa-jussà, Carlos Escolano, José A. R. Fonollosa
2017 Proceedings of the First Workshop on Subword and Character Level Models in NLP  
The main motivation of the byte-based neural machine translation system is to build multilingual neural machine translation systems that can share the same vocabulary.  ...  byte-based neural machine translation.  ...  Related work can be found in the area of natural language processing.  ... 
doi:10.18653/v1/w17-4123 dblp:conf/emnlp/Costa-JussaEF17 fatcat:ibujxv2mnzagbh2v7zv6jenl3y

Language Set Identification in Noisy Synthetic Multilingual Documents [chapter]

Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen
2015 Lecture Notes in Computer Science  
We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study.  ...  for monolingual text to detect the language set of a multilingual document.  ...  Acknowledgments This work was supported by the Kone Foundation from its language programme. We also thank Timothy Baldwin and Marco Lui for their help with the WikipediaMulti dataset.  ... 
doi:10.1007/978-3-319-18111-0_48 fatcat:lvf32fj2sjbfvaytxxai7ontay
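
The entry above applies a monolingual language identifier to multilingual documents. A toy sketch of the general idea, chunking a document and collecting the per-chunk predictions, with a hypothetical script-counting stand-in for the real identifier (not the authors' method):

```python
# Detect the set of languages in a document by running a monolingual
# identifier over fixed-size chunks and taking the union of its predictions.

def identify_chunk(chunk: str) -> str:
    """Toy stand-in for a real monolingual language identifier (majority script)."""
    cyr = sum("\u0400" <= c <= "\u04ff" for c in chunk)
    han = sum("\u4e00" <= c <= "\u9fff" for c in chunk)
    lat = sum(c.isascii() and c.isalpha() for c in chunk)
    return max([("ru", cyr), ("zh", han), ("en", lat)], key=lambda t: t[1])[0]

def language_set(document: str, chunk_size: int = 40) -> set[str]:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return {identify_chunk(c) for c in chunks if c.strip()}

doc = "This paragraph is in English. " + "Это предложение написано по-русски. " * 2
print(language_set(doc))   # e.g. {'en', 'ru'}
```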

Training Multilingual Pre-trained Language Model with Byte-level Subwords [article]

Junqiu Wei, Qun Liu, Yinpeng Guo, Xin Jiang
2021 arXiv   pre-print
In this technical report, we present our practice of training multilingual pre-trained language models with BBPE: Byte-Level BPE (i.e., Byte Pair Encoding).  ...  We release the source code of our byte-level vocabulary building tools and the multilingual pre-trained language models.  ...  text, we call them BBPE and BUnigram, respectively. It is worth mentioning that BUnigram is space-consuming and could not afford to process all multilingual corpora. Thus, for each of the 11 languages we  ... 
arXiv:2101.09469v2 fatcat:gmngisv2hvefno7mbcbty5tzau
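
Byte-level BPE, as referenced in the entry above, runs ordinary BPE merges over UTF-8 byte sequences instead of characters, so the base vocabulary is always the 256 byte values. A minimal sketch under that assumption (not the released vocabulary-building tool; the corpus is arbitrary):

```python
# Byte-level BPE: start from UTF-8 bytes (IDs 0..255) and repeatedly merge
# the most frequent adjacent pair into a new token ID.

from collections import Counter

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = ["язык", "языки", "language", "languages"]
seqs = [list(w.encode("utf-8")) for w in corpus]   # base vocab: bytes 0..255
next_id = 256
for _ in range(10):                                # learn 10 merges
    pair = most_frequent_pair(seqs)
    if pair is None:
        break
    seqs = [merge_pair(s, pair, next_id) for s in seqs]
    next_id += 1
print(seqs[0])   # "язык" as a mix of raw byte IDs and merged-token IDs
```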

Empowering OLAC Extension using Anusaaraka and Effective text processing using Double Byte coding [article]

B Prabhulla Chandran Pillai
2009 arXiv   pre-print
In this context, the Chinese system of text processing and the anusaaraka system are scrutinised.  ...  The paper reviews the hurdles encountered while trying to implement the OLAC extension for Dravidian / Indian languages. The paper further explores the possibilities which could minimise or solve these problems.  ...  Also, in order to get the best results, we can categorize Indian languages in the following way:  ... 
arXiv:0909.1147v1 fatcat:hl5ftbnxn5hh3j32g6ebhmnrk4

Automatic Detection and Language Identification of Multilingual Documents

Marco Lui, Jey Han Lau, Timothy Baldwin
2014 Transactions of the Association for Computational Linguistics  
In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents).  ...  We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.  ...  This work was substantially improved as a result of the insightful feedback received from the reviewers.  ... 
doi:10.1162/tacl_a_00163 fatcat:tww6ikg7lzatra4du3bsrkbqaq

Bilingual End-to-End ASR with Byte-Level Subwords [article]

Liuhui Deng, Roger Hsiao, Arnab Ghoshal
2022 arXiv   pre-print
We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses.  ...  We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances  ...  We use this post-processing approach to recover characters from byte sequences as much as possible.  ... 
arXiv:2205.00485v1 fatcat:ngwrynxj6rhy5ab4wfelgib7gi
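
A byte-level decoder can emit sequences that are not valid UTF-8, which is why the entry above mentions post-processing to recover characters from byte sequences. A hedged illustration of one simple recovery strategy (not necessarily the authors' exact approach):

```python
# Decode a possibly invalid byte sequence leniently, keeping every character
# that can still be recovered and dropping malformed byte runs.

def recover_text(byte_ids: list[int]) -> str:
    raw = bytes(byte_ids)
    # 'ignore' drops malformed byte runs instead of raising an error.
    return raw.decode("utf-8", errors="ignore")

good = list("日本語".encode("utf-8"))   # a valid multi-byte sequence
bad = good[:-1] + [0xFF]                # simulate a corrupted final byte
print(recover_text(good))               # 日本語
print(recover_text(bad))                # 日本 (the damaged character is dropped)
```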

Local Byte Fusion for Neural Machine Translation [article]

Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu
2022 arXiv   pre-print
It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages leading to a drop in translation performance.  ...  Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques  ...  Introduction Multilingual NMT has proven effective to transfer knowledge learned from a high-resource language to a low-resource language.  ... 
arXiv:2205.11490v1 fatcat:e3rhxzyq3fdabealrye47zzc3y
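
Part of the motivation for fusing nearby bytes, as in the entry above, is that UTF-8 inflates sequence length unevenly across scripts, so byte-level models see much longer inputs for some languages than for others. A small illustration (the sentences are arbitrary examples):

```python
# Compare character length with UTF-8 byte length across scripts.

sentences = {
    "English":  "good morning",
    "Hindi":    "सुप्रभात",
    "Japanese": "おはよう",
}

for lang, s in sentences.items():
    n_chars = len(s)
    n_bytes = len(s.encode("utf-8"))
    print(f"{lang:9s}: {n_chars} chars -> {n_bytes} bytes "
          f"({n_bytes / n_chars:.1f} bytes/char)")

# Devanagari and kana expand to about 3 bytes per character, which is the kind
# of length inflation that pooling nearby bytes into local blocks can offset.
```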

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [article]

Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
2022 arXiv   pre-print
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings.  ...  Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and  ...  Introduction Neural networks have achieved tremendous success in natural language processing (NLP) by replacing feature-engineered models with stacks of functions that are learned end-to-end from vast  ... 
arXiv:2106.12672v3 fatcat:vv75qaicyrglpj444mfxtcv44q
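
Charformer's gradient-based subword tokenization pools byte embeddings into candidate blocks of several sizes, scores the candidates, and mixes them with a softmax so block selection stays differentiable. A simplified numpy sketch of that idea (all dimensions and the scoring vector are placeholder assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, block_sizes = 12, 8, (1, 2, 4)

X = rng.normal(size=(seq_len, d_model))   # byte embeddings (stand-in)
w_score = rng.normal(size=d_model)        # learned scoring vector (stand-in)

candidates = []
for b in block_sizes:
    # Mean-pool non-overlapping blocks of size b, then broadcast each pooled
    # vector back to the positions it covers so candidates align per position.
    pooled = X[: seq_len // b * b].reshape(-1, b, d_model).mean(axis=1)
    candidates.append(np.repeat(pooled, b, axis=0)[:seq_len])

C = np.stack(candidates, axis=1)          # (seq_len, n_sizes, d_model)
scores = C @ w_score                      # one score per position and block size
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
latent = (weights[..., None] * C).sum(axis=1)   # soft mix over block sizes
print(latent.shape)                       # (12, 8)
```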

Sources of Transfer in Multilingual Named Entity Recognition [article]

David Mueller and Nicholas Andrews and Mark Dredze
2020 arXiv   pre-print
Named entities are inherently multilingual, and annotations in any given language may be limited.  ...  This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language.  ...  Introduction Multilingual learning, using data from multiple languages to train a single model, can take many forms, such as adapting a model from a high-resource to a low-resource language (Xie et al., 2018  ... 
arXiv:2005.00847v1 fatcat:wdzp2dwp6jghrhlzrioullteee

Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis [article]

Mutian He, Jingzhou Yang, Lei He, Frank K. Soong
2021 arXiv   pre-print
To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts.  ...  Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.  ...  Hence multilingual byte models may produce natural speech on rich-resource source languages.  ... 
arXiv:2103.03541v2 fatcat:z7xtjh723rey3gyrzhrrepj3ea
Showing results 1 — 15 out of 2,488 results