Towards Zero-Shot Knowledge Distillation for Natural Language Processing

Ahmad Rashid, Vasileios Lioutas, Abbas Ghaddar, Mehdi Rezagholizadeh
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
Knowledge distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep learning based natural language processing (NLP) solutions.  ...  We present, to the best of our knowledge, the first work on Zero-shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task specific data.  ...  Acknowledgments We thank Mindspore, which is a new deep learning computing framework, for the partial support of this work.  ... 
doi:10.18653/v1/2021.emnlp-main.526 fatcat:545b3cqrmjahxaibulcs7xxtxa

Sentence Embeddings by Ensemble Distillation [article]

Fredrik Carlsson, Magnus Sahlgren
2021 arXiv   pre-print
We compare and combine a number of recently proposed sentence embedding methods for STS, and propose a novel and simple ensemble knowledge distillation scheme that improves on previous approaches.  ...  This paper contributes a new State Of The Art (SOTA) for Semantic Textual Similarity (STS).  ...  Knowledge Distillation Knowledge distillation is a method where instead of training a model directly towards a specific task, it is trained towards the information of either a strong teacher model, or  ... 
arXiv:2104.06719v1 fatcat:qut2ewkxhjdbdfrhfsw6yjohby

Probing Multilingual Language Models for Discourse [article]

Murathan Kurfalı, Robert Östling
2021 arXiv   pre-print
Pre-trained multilingual language models have become an important building block in multilingual natural language processing.  ...  We find that the XLM-RoBERTa family of models consistently show the best performance, by simultaneously being good monolingual models and degrading relatively little in a zero-shot setting.  ...  These observations provide several starting points for future work: investigating why knowledge distillation seems to hurt zero-shot performance to a much greater extent than same-language sentence encoding  ... 
arXiv:2106.04832v1 fatcat:xmmmig243rguxmhfwi6t33aphy

Unsupervised Neural Machine Translation with Generative Language Models Only [article]

Jesse Michael Han, Igor Babuschkin, Harrison Edwards, Arvind Neelakantan, Tao Xu, Stanislas Polu, Alex Ray, Pranav Shyam, Aditya Ramesh, Alec Radford, Ilya Sutskever
2021 arXiv   pre-print
We first use the zero-shot translation ability of large pre-trained language models to generate translations for a small set of unlabeled sentences.  ...  We then amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset.  ...  We do this in a two-stage process. We first sample a small number of zero-shot translations from GPT-3.  ... 
arXiv:2110.05448v1 fatcat:gtssrmksmvf6fhmvwad6qe3v2q

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [article]

Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen (+17 others)
2021 arXiv   pre-print
Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks.  ...  A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters.  ...  ERNIE 3.0 can handle both natural language understanding tasks and natural language generation tasks through zero-shot learning, few-shot learning, or fine-tuning.  ... 
arXiv:2112.12731v1 fatcat:hact2hlojrdydhxcnzozmb7kee

Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [article]

Prasetya Ajie Utama, Nafise Sadat Moosavi, Victor Sanh, Iryna Gurevych
2021 arXiv   pre-print
Recent prompt-based approaches allow pretrained language models to achieve strong performances on few-shot finetuning by reformulating downstream tasks as a language modeling problem.  ...  knowledge learned during the pretraining.  ...  Acknowledgement We thank Michael Bugert, Tim Baumgärtner, Jan Buchman, and the anonymous reviewers for their constructive feedback.  ... 
arXiv:2109.04144v1 fatcat:jsozuwbm5vf5vdc2mfytvh5npe

Learning Compact Metrics for MT [article]

Amy Pu, Hyung Won Chung, Ankur P. Parikh, Sebastian Gehrmann, Thibault Sellam
2021 arXiv   pre-print
data generation and transferring knowledge from one teacher to multiple students trained on related languages.  ...  We present a series of experiments which show that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help addressing this bottleneck, by leveraging synthetic  ...  Acknowledgments We thank Vitaly Nikolaev, who provided guidance on language families and created groups for the multiple-students setup.  ... 
arXiv:2110.06341v1 fatcat:soizlnjjtjccvaynejghxwh3ky

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation [article]

Humair Raj Khan, Deepak Gupta, Asif Ekbal
2021 arXiv   pre-print
intermediate layers (language and vision encoders) with appropriately designed distillation objectives for incremental knowledge extraction.  ...  Unlike the existing knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple  ...  Acknowledgement Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY  ... 
arXiv:2109.04653v1 fatcat:6oac4nogpbbctpoavh64j6iqhy

Black-Box Ripper: Copying black-box models using generative evolutionary algorithms [article]

Antonio Barbalau, Adrian Cosma, Radu Tudor Ionescu, Marius Popescu
2020 arXiv   pre-print
In this context, we present a teacher-student framework that can distill the black-box (teacher) model into a student model with minimal accuracy loss.  ...  We study the task of replicating the functionality of black-box neural models, for which we only know the output class probabilities provided for a set of input images.  ...  Zero-shot knowledge distillation.  ... 
arXiv:2010.11158v1 fatcat:pvpx4bi2nffjzfbjxi2kgq4lbi

Colors of Artificial Intelligence

Hsiao-Ying Lin
2021 Computer  
Conversely, according to a 2019 estimation, in the field of artificial intelligence (AI), the carbon footprint of training a state-of-the-art language model equates to that of five U.S. cars' entire  ...  Knowledge distillation transfers information from an original model to a target one through a training process applied to the latter.  ...  Typical techniques include knowledge distillation, quantization, and pruning.  ... 
doi:10.1109/mc.2021.3102359 fatcat:4iwxm5ukpvgc7kgzkn6ast7xni

Efficient Large Scale Language Modeling with Mixtures of Experts [article]

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li (+12 others)
2021 arXiv   pre-print
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero  ...  - and few-shot priming, and full fine-tuning.  ...  natural language tasks.  ... 
arXiv:2112.10684v1 fatcat:xb2swrhivnec7nso7q4gfx3wha

Probing and Fine-tuning Reading Comprehension Models for Few-shot Event Extraction [article]

Rui Feng, Jie Yuan, Chao Zhang
2020 arXiv   pre-print
Our experiment results show that our method performs strongly for zero-shot and few-shot event extraction, and it achieves state-of-the-art performance on the ACE 2005 benchmark when trained with full  ...  By constructing proper query templates, our approach can effectively distill rich knowledge about tasks and label semantics from pretrained reading comprehension models.  ...  First, the model can distill knowledge from pretrained language models and be appealing for few-shot or even zero-shot settings.  ... 
arXiv:2010.11325v1 fatcat:4v564b3y5rhwjkekppt5gsyxxm

Flight of the PEGASUS? Comparing Transformers on Few-Shot and Zero-Shot Multi-document Abstractive Summarization

Travis R Goodwin, Max E Savery, Dina Demner-Fushman
2020 International Conference on Computational Linguistics (COLING). Proceedings  
Recent work has shown that pre-trained Transformers obtain remarkable performance on many natural language processing tasks, including automatic summarization.  ...  We report the performance on four challenging summarization datasets: three from the general domain and one from consumer health in both zero-shot and few-shot learning settings.  ...  Although zero-shot learning (ZSL) has received considerable attention in the image processing community, there has been comparatively little work on zero-shot learning specifically for summarization: Duan  ... 
pmid:33293900 pmcid:PMC7720861 fatcat:7xvhedbe5zglhe5xlxedeamwsq

Self-supervised Knowledge Distillation for Few-shot Learning [article]

Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Mubarak Shah
2020 arXiv   pre-print
In this paper, we propose a simple approach to improve the representation capacity of deep neural networks for few-shot learning tasks.  ...  Our experiments show that, even in the first stage, self-supervision can outperform current state-of-the-art methods, with further gains achieved by our second stage distillation process.  ...  Figure 1: Self-supervised Knowledge Distillation operates in two phases.  ... 
arXiv:2006.09785v2 fatcat:72yevn3afvcnto4lompc6j6ugu

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models [article]

Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi
2021 arXiv   pre-print
Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity  ...  Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models.  ...  -19-2-4031), and the Allen Institute for AI.  ... 
arXiv:2110.07178v1 fatcat:5vubwnf6ybh3dbp7oefflcq7t4