14 Hits in 3.1 sec

The Pile: An 800GB Dataset of Diverse Text for Language Modeling [article]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
2020 arXiv   pre-print
With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.  ...  Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.  ...  Benchmarking Language Models with the Pile. While the Pile was conceived as a training dataset for large-scale language models, its coverage of multiple disparate domains also makes it suitable as an evaluation  ... 
arXiv:2101.00027v1 fatcat:74dgmcl55rdupks3kzygosjlca

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs [article]

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki
2021 arXiv   pre-print
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g.  ...  Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch.  ...  Gao et al. recently released The Pile, an openly available 800GB text dataset [10], in an attempt to loosely mimic the dataset used for GPT-3.  ... 
arXiv:2111.02114v1 fatcat:2nc3daeajrdexi675kjmbhjr44

Considerations for Multilingual Wikipedia Research [article]

Isaac Johnson, Emily Lescak
2022 arXiv   pre-print
language editions of Wikipedia in datasets and models.  ...  The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more  ...  We also would like to thank the Wikimedia Foundation Research Team for their input and discussions as well as the many researchers and Wikimedians whose work and discussions we are building on.  ... 
arXiv:2204.02483v1 fatcat:tjylbmqvanc4xha5zpz2akxlom

Efficient Large Scale Language Modeling with Mixtures of Experts [article]

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li (+12 others)
2021 arXiv   pre-print
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero  ...  Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.  ...  The Pile: An 800GB dataset of diverse text for language modeling. 2020.  ...  BART: Denoising sequence-to-sequence pre-  ... 
arXiv:2112.10684v1 fatcat:xb2swrhivnec7nso7q4gfx3wha

Intersectional Bias in Causal Language Models [article]

Liam Magee, Lida Ghahremanlou, Karen Soldatic, Shanthi Robertson
2021 arXiv   pre-print
Our results confirm earlier tests conducted with auto-regressive causal models, including the GPT family of models.  ...  We conduct an experiment combining up to three social categories - gender, religion and disability - into unconditional or zero-shot prompts used to generate sentences that are then analysed for sentiment  ...  It also has received support, in the form of researcher time and cloud computing credit, from Microsoft Corporation.  ... 
arXiv:2107.07691v1 fatcat:yt3ijvb6ena4hft5ffr7xkwapq

Text Data Augmentation for Deep Learning

Connor Shorten, Taghi M. Khoshgoftaar, Borko Furht
2021 Journal of Big Data  
Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning.  ...  We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data.  ...  Additionally, we acknowledge partial support by the NSF (IIS-2027890). Opinions, findings, conclusions, or recommendations in this paper are the authors' and do not reflect the views of the NSF.  ... 
doi:10.1186/s40537-021-00492-0 fatcat:bcbaqkpicnd6dcwc34pdijosby

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [article]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin (+40 others)
2021 arXiv   pre-print
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of  ...  We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4).  ...  Acknowledgements We would like to thank the AfricaNLP and Google reviewers who have helped us shape this paper.  ... 
arXiv:2103.12028v3 fatcat:gdkre73knnf6xbleosbzqwff6m

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [article]

Rowan Zellers and Jiasen Lu and Ximing Lu and Youngjae Yu and Yanpeng Zhao and Mohammadreza Salehi and Aditya Kusupati and Jack Hessel and Ali Farhadi and Yejin Choi
2022 arXiv   pre-print
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.  ...  We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research.  ...  Thanks also to Zak Stone and the Google Cloud TPU team for providing access to the TPU machines used for conducting experiments.  ... 
arXiv:2201.02639v4 fatcat:deywuxyj45eqvacjwwns7kmbh4

Teaching Autoregressive Language Models Complex Tasks By Demonstration [article]

Gabriel Recchia
2021 arXiv   pre-print
This is achieved by constructing an appropriate dataset for fine-tuning, with no changes to the learning algorithm.  ...  These results suggest that fine-tuning autoregressive language models on small sets of well-crafted demonstrations may be a useful paradigm for enabling individuals without training in machine learning  ...  Nabeshima and others, "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," arXiv preprint arXiv:2101.00027, 2020. [2] S. Black, L. Gao, P. Wang, C. Leahy and S.  ... 
arXiv:2109.02102v3 fatcat:qghrdzghxrc5famarflsa46cn4

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus [article]

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
2022 arXiv   pre-print
The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing.  ...  And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling.  ...  The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv e-prints, page  ...  Caswell, I., Kreutzer, J., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani,  ... 
arXiv:2201.06642v1 fatcat:n7xdk22ibngztnrgnque2625re

Evaluating Large Language Models Trained on Code [article]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri (+46 others)
2021 arXiv   pre-print
On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves  ...  Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts.  ...  Finally, we thank GitHub for partnering to build GitHub Copilot and Microsoft Azure for supporting model training with infrastructure management.  ... 
arXiv:2107.03374v2 fatcat:tnan6rhwq5fsfek2jydeesgmmy

Controlling Conditional Language Models with Distributional Policy Gradients [article]

Tomasz Korbak and Hady Elsahar and German Kruszewski and Marc Dymetman
2021 arXiv   pre-print
However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g. hallucination in abstractive summarization or wrong format in automatic code  ...  This raises an important question on how to adapt pre-trained generative models to a new task without destroying its capabilities.  ...  , Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv  ... 
arXiv:2112.00791v1 fatcat:dcjheonc2vecjn5qunu5nmoli4

Causal Inference Principles for Reasoning about Commonsense Causality [article]

Jiayao Zhang, Hongming Zhang, Dan Roth, Weijie J. Su
2022 arXiv   pre-print
Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly  ...  framework, which is the first such attempt for commonsense tasks.  ...  Acknowledgements This work was supported in part by ONR Contract N00014-19-1-2620, NSF through CCF-1934876, an Alfred Sloan Research Fellowship, and the Wharton Dean's Research Fund.  ... 
arXiv:2202.00436v1 fatcat:oavft5weard2jndwxt5vo6aal4

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
The Pile: An 800GB dataset of diverse text for language modeling.  ...  2006. The second PASCAL recognising textual entailment challenge.  ...  Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the  ...  2020. Scaling laws for neural language models. arXiv:2001.08361.  ... 
doi:10.18653/v1/2021.emnlp-main.98 fatcat:okmbgm5f3nhbrajymb5x6uqn2e