Evaluation Of Word Embeddings From Large-Scale French Web Content [article]

Hadi Abdine
2022 arXiv   pre-print
We also evaluate the quality of our proposed word vectors and the existing French word vectors on the French word analogy task.  ...  Finally, we created a demo web application to test and visualize the obtained word embeddings.  ...  This work presents French word embeddings trained on a large corpus collected/crawled from the French web with more than 1M domains.  ... 
arXiv:2105.01990v2 fatcat:4dpi7ra74bcvvixtrfxgyq5rgy

NLP Research and Resources at DaSciM, Ecole Polytechnique [article]

Hadi Abdine, Yanzhu Guo, Moussa Kamal Eddine, Giannis Nikolentzos, Stamatis Outsios, Guokan Shang, Christos Xypolopoulos, Michalis Vazirgiannis
2021 arXiv   pre-print
DaSciM (Data Science and Mining) part of LIX at Ecole Polytechnique, established in 2013 and since then producing research results in the area of large scale data analysis via methods of machine and deep  ...  Here follow our different contributions of interest to the AFIA community.  ...  We also evaluate the quality of our proposed word vectors and the existing French word vectors on the French word analogy task.  ... 
arXiv:2112.00566v1 fatcat:dcmwpwdwc5emti6jcflqq5ib4a

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases [article]

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot
2021 arXiv   pre-print
We further present a method to mine such paraphrase data in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data.  ...  We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data  ...  Acknowledgements This work was partly supported by Benoît Sagot's chair in the PRAIRIE institute, funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference  ... 
arXiv:2005.00352v2 fatcat:m2dyquni35d7fi37q3rygoyzza

ProSOUL: A Framework to Identify Propaganda from Online Urdu Content

Soufia Kausar, Bilal Tahir, Muhammad Amir Mehmood
2020 IEEE Access  
Moreover, we develop and classify large scale Urdu content repositories to identify web sources spreading propaganda.  ...  We evaluate the performance of different classifiers by varying n-gram, News Landscape (NELA), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) features.  ...  ACKNOWLEDGEMENT This research work was funded by Higher Education Commission (HEC) Pakistan and Ministry of Planning Development and Reforms under National Center in Big Data and Cloud Computing.  ... 
doi:10.1109/access.2020.3028131 fatcat:g3wk5dqkc5cfjmqrrg5wd3ed5y

Exploiting Sentence Order in Document Alignment [article]

Brian Thompson, Philipp Koehn
2020 arXiv   pre-print
Our method improves downstream MT performance on web-scraped Sinhala--English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release.  ...  It also outperforms a comparable corpora method which uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.  ...  Introduction Document alignment is the task of finding parallel document pairs (i.e., documents that are translations of each other) in a large collection of documents, often crawled from the web.  ... 
arXiv:2004.14523v2 fatcat:c6i7b4oqmrdudngplw4p7jez4u

Emoji-Powered Representation Learning for Cross-Lingual Sentiment Classification [article]

Zhenpeng Chen, Sheng Shen, Ziniu Hu, Xuan Lu, Qiaozhu Mei, Xuanzhe Liu
2019 arXiv   pre-print
Sentiment classification typically relies on a large amount of labeled data.  ...  Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages.  ...  EVALUATION In this section, we evaluate the effectiveness and efficiency of ELSA using standard benchmark datasets for cross-lingual sentiment classification as well as a large-scale corpus of Tweets.  ... 
arXiv:1806.02557v2 fatcat:52k23nlb4vdgxksjupmhyxajd4

CamemBERT: a Tasty French Language Model [article]

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot
2020 arXiv   pre-print
In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech  ...  We show that the use of web crawled data is preferable to the use of Wikipedia data.  ...  -15-CE38-0011) and BASNUM (ANR-18-CE38-0003), as well as by the last author's chair in the PRAIRIE institute funded by the French national agency ANR as part of the "Investissements d'avenir" programme  ... 
arXiv:1911.03894v3 fatcat:emj6pfw4gbcvxbckp6xnqrmmhy

Detecting Cyber Threats in Non-English Hacker Forums: An Adversarial Cross-Lingual Knowledge Transfer Approach

Mohammadreza Ebrahimi, Sagar Samtani, Yidong Chai, Hsinchun Chen
2020 2020 IEEE Security and Privacy Workshops (SPW)  
Despite its potential, the Dark Web contains hundreds of thousands of non-English posts.  ...  Three experiments demonstrate how A-CLKT outperforms state-of-the-art machine learning, deep learning, and CLKT algorithms in identifying cyber-threats in French and Russian forums.  ...  Finally, we unify all tokens to UTF-8 across training and evaluation datasets before constructing trainable word embeddings for each token. B.  ... 
doi:10.1109/spw50608.2020.00021 fatcat:xuxm4epnwnd5zlwcyfmy6kkyn4

PAGnol: An Extra-Large French Generative Model [article]

Julien Launay, E.L. Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah
2021 arXiv   pre-print
We plan to train increasingly large and performing versions of PAGnol, exploring the capabilities of French extreme-scale models.  ...  We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization  ...  Djamé Seddah was partly funded by the French Research National Agency via the ANR project ParSiTi (ANR-16-CE33-0021).  ... 
arXiv:2110.08554v1 fatcat:gokxh64ae5av7mm2v6twsuf6pa

Web Image Context Extraction with Graph Neural Networks and Sentence Embeddings on the DOM tree [article]

Chen Dang
2021 arXiv   pre-print
Web Image Context Extraction (WICE) consists in obtaining the textual information describing an image using the content of the surrounding webpage.  ...  A common preprocessing step before performing WICE is to render the content of the webpage.  ...  On a large scale, visual rendering and content extraction from a webpage is not tractable. We investigate how the HTML data structure may help in extracting images' contexts.  ... 
arXiv:2108.11629v1 fatcat:a2toogjg4ffkllmgitlxbttmv4

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [article]

Ahmed El-Kishky, Francisco Guzmán
2020 arXiv   pre-print
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.  ...  Such aligned data can be used for a variety of NLP tasks from training cross-lingual representations to mining parallel data for machine translation.  ...  metric for large-scale alignment efforts.  ... 
arXiv:2002.00761v2 fatcat:pmk7vuokdrebtmquxcwbil35wu

Language Resources for Historical Newspapers: the Impresso Collection

Maud Ehrmann, Matteo Romanello, Simon Clematide, Philipp Ströbel, Raphaël Barman
2020 Zenodo  
, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'.  ...  If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge -- and real promise of digitization -- is to exploit the contents of these digital assets  ...  Authors also gratefully acknowledge the financial support of the Swiss National Science Foundation (SNSF) for the project impresso - Media Monitoring of the Past under grant number CR-SII5 173719.  ... 
doi:10.5281/zenodo.4641901 fatcat:glusmzr2nfg3zbc7lxe2mkvxzq

Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models [article]

Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut
2021 arXiv   pre-print
Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data.  ...  Specifically, we show that selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text--based uncurated datasets.  ...  Our approach denoises learning from such large and diverse web-scaled data with alt-text annotations by sub-selecting content in a dual staged model.  ... 
arXiv:2009.05175v2 fatcat:pfbns3crszdi3lp4ikf74rb6qe

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB [article]

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin
2020 arXiv   pre-print
To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets.  ...  Chinese, as well as German/French.  ...  The English Web content is abundant and we used only one snapshot.  ... 
arXiv:1911.04944v2 fatcat:pebxb3fh5namncmpdnzbeufeka

YODA System for WMT16 Shared Task: Bilingual Document Alignment

Aswarth Abhilash Dara, Yiu-Chang Lin
2016 Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers  
In this paper, we address the task of automatically aligning/detecting the bilingual documents that are translations of each other from a single web-domain as part of WMT 2016. Given the large amounts  ...  We also outline an IR-based approach that uses both content and the meta data of each web page url, thereby obtaining a recall of 56.31%.  ...  For large scale document level alignment, Uszkoreit et al. (2010) proposed a distributed system that reliably mines parallel text from large corpora.  ... 
doi:10.18653/v1/w16-2366 dblp:conf/wmt/DaraL16 fatcat:yafjcsud2nbb7oa4g4f67f3djq