66 Hits in 0.83 sec

Unsupervised and Distributional Detection of Machine-Generated Text [article]

Matthias Gallé, Jos Rozen, Germán Kruszewski, Hady Elsahar
2021 arXiv   pre-print
The power of natural language generation models has provoked a flurry of interest in automatic methods to detect whether a piece of text is human- or machine-authored. The problem has so far been framed in a standard supervised way: a classifier is trained on annotated data to predict the origin of a given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those machine-generated documents by leveraging repeated higher-order n-grams, which we show appear more often in machine-generated text than in human-written text. That weak signal is the starting point of a self-training setting in which pseudo-labelled documents are used to train an ensemble of classifiers. Our experiments show that leveraging that signal allows us to rank suspicious documents accurately. Precision at 5000 is over 90% for top-k sampling strategies, and over 80% for nucleus sampling, for the largest model we used (GPT2-large). The drop with increased model size is small, which could indicate that the results hold for other current and future large language models.
arXiv:2111.02878v1 fatcat:swqvuzl5gjexlbjjanuqf6qjbq
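The repeated higher-order n-gram signal described in the abstract can be sketched in a few lines. The function name and the choice of n=4 below are illustrative, not taken from the paper:

```python
from collections import Counter

def repeated_ngram_fraction(tokens, n=4):
    """Fraction of a document's n-grams that occur more than once.

    The paper reports that machine-generated text repeats higher-order
    n-grams more often than human-written text, so a high value is a
    weak signal of machine authorship.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # count every occurrence of an n-gram that appears at least twice
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A looping, repetitive document scores higher than a varied one.
loopy = "the model said the model said the model said".split()
varied = "the quick brown fox jumps over the lazy sleeping dog".split()
assert repeated_ngram_fraction(loopy) > repeated_ngram_fraction(varied)
```

In the paper this weak signal only seeds a self-training loop; on its own it merely ranks documents by suspiciousness.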

To Annotate or Not? Predicting Performance Drop under Domain Shift

Matthias Gallé, Hady Elsahar
2019 Zenodo  
Performance drop due to domain shift is an endemic problem for NLP models in production. It creates an urge to continuously annotate evaluation datasets to measure the expected drop in model performance, which can be prohibitively expensive and slow. In this paper we study the problem of predicting the performance drop of modern NLP models under domain shift, in the absence of any target-domain labels. We investigate three families of methods (H-divergence, reverse classification accuracy and confidence measures), show how they can be used to predict the performance drop, and study their robustness to adversarial domain shifts. Our results on sentiment classification and sequence labeling show that our method is able to predict performance drops with an error rate as low as 2.15% for sentiment analysis and 0.89% for POS tagging.
doi:10.5281/zenodo.3446733 fatcat:pbnn4mjoijhpxekdso6etcqkwe
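A minimal sketch of the confidence-based family of methods mentioned above: estimate the drop as the fall in average model confidence between the labelled source domain and the unlabelled target domain. This is a simplified stand-in, not the authors' exact estimator; all names are illustrative:

```python
def average_confidence(prob_rows):
    """Mean maximum class probability over a batch of predictions."""
    return sum(max(row) for row in prob_rows) / len(prob_rows)

def predicted_drop(source_probs, target_probs):
    """Estimate the accuracy drop on an unlabelled target domain as the
    fall in average confidence relative to the source domain."""
    return average_confidence(source_probs) - average_confidence(target_probs)

# Confident in-domain predictions vs. uncertain out-of-domain ones.
src = [[0.95, 0.05], [0.90, 0.10], [0.05, 0.95]]
tgt = [[0.60, 0.40], [0.55, 0.45], [0.50, 0.50]]
assert predicted_drop(src, tgt) > 0  # confidence fell: a drop is predicted
```

Confidence measures require only the model's output probabilities on unlabelled target data, which is what makes this family attractive in production.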

The Case for a GDPR-Specific Annotated Dataset of Privacy Policies

Matthias Gallé, Athena Christof, Hady Elsahar
2019 Zenodo  
In this position paper we analyse the pros and cons of creating a dataset of privacy policies annotated with GDPR-specific elements. We review existing related datasets and analyse how they could be augmented in order to facilitate machine-learning techniques for assessing whether privacy policies comply with GDPR.
doi:10.5281/zenodo.3446665 fatcat:b2qglziqunbw3alpajbr7533e4

Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs [article]

Bryan Eikema, Germán Kruszewski, Hady Elsahar, Marc Dymetman
2021 arXiv   pre-print
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often difficult or impossible to apply due to the need to find a proposal distribution that upper-bounds the target distribution everywhere. Approximate Markov chain Monte Carlo sampling techniques like Metropolis-Hastings are usually easier to design, exploiting a local proposal distribution that performs local edits on an evolving sample. However, these techniques can be inefficient due to the local nature of the proposal distribution and do not provide an estimate of the quality of their samples. In this work, we propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling for discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that we can sample from such EBMs with arbitrary precision at the cost of sampling efficiency.
arXiv:2112.05702v1 fatcat:ug7g3njbx5eenkodsvgareixky
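The quality/efficiency trade-off can be illustrated on a toy discrete space: draw from a global proposal q and accept x with probability min(1, P(x) / (β·q(x))). Raising β moves the sampler toward exact rejection sampling (higher quality) at the cost of a lower acceptance rate. This is a sketch of the accept/reject core only; the function names are illustrative, and the paper's convergence bounds and diagnostics are not reproduced here:

```python
import random

def qrs_sample(score, proposal_sample, proposal_prob, beta, n_draws, seed=0):
    """Quasi-rejection-sampling sketch over a discrete space.

    score(x)                    -- unnormalised target P(x)
    proposal_sample(rng)        -- draw one x from the global proposal q
    proposal_prob(x)            -- q(x)
    beta                        -- quality/efficiency knob: larger beta,
                                   better samples, fewer acceptances
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        x = proposal_sample(rng)
        accept_prob = min(1.0, score(x) / (beta * proposal_prob(x)))
        if rng.random() < accept_prob:
            accepted.append(x)
    return accepted
```

With a β that upper-bounds P/q everywhere this reduces to exact rejection sampling; QRS's point is that useful guarantees survive even when β does not.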

A Distributional Approach to Controlled Text Generation [article]

Muhammad Khalifa, Hady Elsahar, Marc Dymetman
2021 arXiv   pre-print
We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach makes it possible to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM -- to our knowledge, the first model with such generality -- while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments on pointwise constraints, showing the advantages of our approach over a set of baselines in terms of obtaining a controlled LM that balances constraint satisfaction with divergence from the initial LM. We then perform experiments on distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique in obtaining faster convergence. (Code available at
arXiv:2012.11635v2 fatcat:4rzxdngi4fadpm3z5am2hde3de
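The EBM at the heart of this approach has the exponential-family form P(x) ∝ a(x)·exp(Σᵢ λᵢ·φᵢ(x)), where a is the initial LM and the φᵢ are constraint features. A toy tabular illustration of that reweighting (in the paper the λ coefficients are fitted so that the constraints' expectations hit their targets; here they are just given):

```python
import math

def ebm_reweight(base_probs, features, lambdas):
    """Pointwise form of the controlled-generation EBM:
        P(x) proportional to a(x) * exp(sum_i lambda_i * phi_i(x)).
    base_probs : dict x -> a(x), the initial distribution
    features   : dict x -> list of feature values phi_i(x)
    lambdas    : list of coefficients lambda_i
    """
    unnorm = {
        x: p * math.exp(sum(l * f for l, f in zip(lambdas, features[x])))
        for x, p in base_probs.items()
    }
    z = sum(unnorm.values())          # partition function
    return {x: u / z for x, u in unnorm.items()}

# A positive lambda on a binary feature boosts the items that carry it.
base = {"a": 0.5, "b": 0.5}
feats = {"a": [1.0], "b": [0.0]}
controlled = ebm_reweight(base, feats, [1.0])
assert controlled["a"] > controlled["b"]
```

The EBM itself cannot be sampled from directly; the paper's second step distils it into an autoregressive LM via an adaptive policy-gradient variant.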

Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples [article]

Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christoph Gravier, Frederique Laforest, Jonathon Hare, Elena Simperl
2017 arXiv   pre-print
In our second scenario, following [Elsahar et al. 2017], we align Wikipedia summaries with the community-curated triples of Wikidata. Inspired by Lebret et al., we chose a corpus about biographies.  ... 
arXiv:1711.00155v1 fatcat:5ilyiixpq5fw3eyec3ewby3g44

Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types [article]

Hady Elsahar, Christophe Gravier, Frederique Laforest
2018 arXiv   pre-print
We present a neural model for question generation from knowledge base triples in a "Zero-Shot" setup, that is, generating questions for triples containing predicates, subject types or object types that were not seen at training time. Our model leverages triple occurrences in a natural language corpus in an encoder-decoder architecture, paired with an original part-of-speech copy-action mechanism to generate questions. Benchmark and human evaluation show that our model sets a new state of the art for zero-shot QG.
arXiv:1802.06842v1 fatcat:fnb7e33pcjcc5p4nnw7qc6blzi

Self-Supervised and Controlled Multi-Document Opinion Summarization [article]

Hady Elsahar, Maximin Coavoux, Matthias Gallé, Jos Rozen
2020 arXiv   pre-print
(Lewis et al., 2019) and parsing (Drozdov et al., 2019).  ...  ElSahar, H. and El-Beltagy, S. R. Building large arabic multi-domain resources for sentiment analysis.  ... 
arXiv:2004.14754v2 fatcat:a7dh2rdzpnbrxkjf3ghv3z7hi4

Controlling Conditional Language Models without Catastrophic Forgetting [article]

Tomasz Korbak and Hady Elsahar and German Kruszewski and Marc Dymetman
2022 arXiv   pre-print
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucinations in abstractive summarization or style violations in code generation). This raises the important question of how to adapt pre-trained generative models to meet all requirements without destroying their general capabilities ("catastrophic forgetting"). Recent work has proposed to solve this problem by representing task-specific requirements through energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Despite its effectiveness, this approach is limited to unconditional distributions. In this paper, we extend DPG to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that fine-tuning with CDPG robustly moves these pretrained models closer to meeting the control objectives and, in contrast with baseline approaches, does not result in catastrophic forgetting.
arXiv:2112.00791v2 fatcat:g4vyl3dcrjerzgqg637ohora2y
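DPG, which CDPG extends, moves a policy toward an EBM target by weighting log-probability gradients with the importance ratio P(x)/π(x). A tabular sketch of one such step over a softmax policy (exact expectations replace sampling, and this is a toy stand-in for, not the authors', training loop):

```python
import math

def dpg_step(theta, target, lr=0.5):
    """One tabular distributional-policy-gradient step.

    Ascends E_{x~pi}[ (P(x)/pi(x)) * grad log pi(x) ], which equals
    sum_x P(x) * grad log pi(x), so the softmax policy pi_theta drifts
    toward the normalised target P.
    """
    z = sum(math.exp(t) for t in theta.values())
    pi = {x: math.exp(t) / z for x, t in theta.items()}
    new_theta = {}
    for x in theta:
        # d/d theta_x of log pi(y) is 1[y == x] - pi(x)
        grad = sum(target[y] * ((1.0 if y == x else 0.0) - pi[x])
                   for y in theta)
        new_theta[x] = theta[x] + lr * grad
    return new_theta
```

Iterating this step drives π toward P; CDPG's contribution is making the same machinery work when both the EBM and the model are conditioned on an input.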

Unsupervised Open Relation Extraction [chapter]

Hady Elsahar, Elena Demidova, Simon Gottschalk, Christophe Gravier, Frederique Laforest
2017 Lecture Notes in Computer Science  
We explore methods to extract relations between named entities from free text in an unsupervised setting. In addition to standard feature extraction, we develop a novel method to re-weight word embeddings. We alleviate the problem of feature sparsity through individual feature reduction. Our approach improves on the state of the art in relation clustering by 5.8%, scoring an F1 of 0.416 on the NYT-FB dataset.
doi:10.1007/978-3-319-70407-4_3 fatcat:2pcimacxgbbhtdmw4bjaht2me4
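The abstract mentions re-weighting word embeddings before clustering relation mentions. As a generic stand-in (the paper's actual weighting scheme differs, and all names here are illustrative), a relation context can be represented by a weighted average of its word vectors:

```python
def weighted_context_embedding(word_vectors, word_weights):
    """Weighted average of context-word embeddings as a relation
    representation: words with higher weight pull the vector harder.

    word_vectors : dict word -> embedding (list of floats, same length)
    word_weights : dict word -> weight (missing words default to 1.0)
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, vec in word_vectors.items():
        w = word_weights.get(word, 1.0)
        weight_sum += w
        for i, v in enumerate(vec):
            total[i] += w * v
    return [t / weight_sum for t in total]
```

Clustering such vectors (e.g. with k-means) then groups mentions that express the same predicate, which is the unsupervised relation-discovery step.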

Building Large Arabic Multi-domain Resources for Sentiment Analysis [chapter]

Hady ElSahar, Samhaa R. El-Beltagy
2015 Lecture Notes in Computer Science  
While there has been recent progress in the area of Arabic Sentiment Analysis, most of the resources in this area are either of limited size, domain-specific or not publicly available. In this paper, we address this problem by generating large multi-domain datasets for Sentiment Analysis in Arabic. The datasets were scraped from different reviewing websites and consist of a total of 33K annotated reviews for movies, hotels, restaurants and products. Moreover, we build multi-domain lexicons from the generated datasets. Different experiments have been carried out to validate the usefulness of the datasets and the generated lexicons for the task of sentiment classification. From the experimental results, we highlight some useful insights addressing: the best performing classifiers and feature representation methods, the effect of introducing lexicon-based features, and factors affecting the accuracy of sentiment classification in general. All the datasets, experiment code and results have been made publicly available for scientific purposes.
doi:10.1007/978-3-319-18117-2_2 fatcat:w4onmfhdvvd4zlpn3wfhsiyobm

High Recall Open IE for Relation Discovery

Hady ElSahar, Christophe Gravier, Frédérique Laforest
2017 International Joint Conference on Natural Language Processing  
Relation Discovery extracts predicates (relation types) from a text corpus by relying on the co-occurrence of two named entities in the same sentence. This is a very restrictive constraint: such co-occurrences represent only a small fraction of all relation mentions in practice. In this paper we propose a high-recall approach to predicate extraction which covers up to 16 times more sentences in a large corpus. Comparison against OpenIE systems shows that our proposed approach achieves a 28% improvement in recall over the highest-recall OpenIE system and a 6% improvement in precision over the same system.
dblp:conf/ijcnlp/ElSaharGL17 fatcat:mospccqjhbfo3p23t6ka4slr6y

On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting [article]

Tomasz Korbak and Hady Elsahar and Germán Kruszewski and Marc Dymetman
2022 arXiv   pre-print
Tomasz Korbak, Hady Elsahar, Marc Dymetman, and Germán Kruszewski. Energy-based models for code generation under compilability constraints. CoRR, abs/2106.04985, 2021.  ...  Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations, 2021.  ... 
arXiv:2206.00761v1 fatcat:dfmf6jgpzzasxnpmyorus2xsrm

Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata [article]

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl
2018 arXiv   pre-print
While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open-domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence, comprehensible textual summaries from Wikidata triples. We demonstrate the effectiveness of the proposed approach by evaluating it against a set of baselines on two languages of different natures: Arabic, a morphologically rich language with a larger vocabulary than English, and Esperanto, a constructed language known for its ease of acquisition.
arXiv:1803.07116v2 fatcat:3cczq3atibbb3pkecj6jgyhtj4

Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples

Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl
2018 Journal of Web Semantics  
Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore a set of different approaches that enable our models to verbalise entities from the input set of triples in the generated text. Our systems are trained and evaluated on two corpora of loosely aligned Wikipedia snippets with triples from DBpedia and Wikidata, with promising results.
doi:10.1016/j.websem.2018.07.002 fatcat:a6zfzboqqvbehboshrfosuu6ie
Showing results 1 — 15 out of 66 results