The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[article]
2020
arXiv
pre-print
With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. ...
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. ...
Benchmarking Language Models with the Pile: While the Pile was conceived as a training dataset for large-scale language models, its coverage of multiple disparate domains also makes it suitable as an evaluation ...
arXiv:2101.00027v1
fatcat:74dgmcl55rdupks3kzygosjlca
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
[article]
2021
arXiv
pre-print
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. ...
Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. ...
Gao et al. recently released The Pile, an openly available 800GB text dataset [10], in an attempt to loosely mimic the dataset used for GPT-3. ...
arXiv:2111.02114v1
fatcat:2nc3daeajrdexi675kjmbhjr44
Considerations for Multilingual Wikipedia Research
[article]
2022
arXiv
pre-print
The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more language editions of Wikipedia in datasets and models. ...
We also would like to thank the Wikimedia Foundation Research Team for their input and discussions as well as the many researchers and Wikimedians whose work and discussions we are building on. ...
arXiv:2204.02483v1
fatcat:tjylbmqvanc4xha5zpz2akxlom
Efficient Large Scale Language Modeling with Mixtures of Experts
[article]
2021
arXiv
pre-print
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. ...
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero ...
The Pile: An 800GB dataset of diverse text for language modeling. ... 2020. BART: Denoising sequence-to-sequence pre-training ...
arXiv:2112.10684v1
fatcat:xb2swrhivnec7nso7q4gfx3wha
Intersectional Bias in Causal Language Models
[article]
2021
arXiv
pre-print
Our results confirm earlier tests conducted with auto-regressive causal models, including the GPT family of models. ...
We conduct an experiment combining up to three social categories - gender, religion and disability - into unconditional or zero-shot prompts used to generate sentences that are then analysed for sentiment ...
It also has received support, in the form of researcher time and cloud computing credit, from Microsoft Corporation. ...
arXiv:2107.07691v1
fatcat:yt3ijvb6ena4hft5ffr7xkwapq
Text Data Augmentation for Deep Learning
2021
Journal of Big Data
Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. ...
We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. ...
Additionally, we acknowledge partial support by the NSF (IIS-2027890). Opinions, findings, conclusions, or recommendations in this paper are the authors' and do not reflect the views of the NSF. ...
doi:10.1186/s40537-021-00492-0
fatcat:bcbaqkpicnd6dcwc34pdijosby
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
[article]
2021
arXiv
pre-print
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of ...
We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). ...
Acknowledgements We would like to thank the AfricaNLP and Google reviewers who have helped us shape this paper. ...
arXiv:2103.12028v3
fatcat:gdkre73knnf6xbleosbzqwff6m
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
[article]
2022
arXiv
pre-print
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. ...
We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. ...
Thanks also to Zak Stone and the Google Cloud TPU team for providing access to the TPU machines used for conducting experiments. ...
arXiv:2201.02639v4
fatcat:deywuxyj45eqvacjwwns7kmbh4
Teaching Autoregressive Language Models Complex Tasks By Demonstration
[article]
2021
arXiv
pre-print
This is achieved by constructing an appropriate dataset for fine-tuning, with no changes to the learning algorithm. ...
These results suggest that fine-tuning autoregressive language models on small sets of well-crafted demonstrations may be a useful paradigm for enabling individuals without training in machine learning ...
Nabeshima and others, "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," arXiv preprint arXiv:2101.00027, 2020.
[2] S. Black, L. Gao, P. Wang, C. Leahy and S. ...
arXiv:2109.02102v3
fatcat:qghrdzghxrc5famarflsa46cn4
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
[article]
2022
arXiv
pre-print
The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. ...
And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. ...
The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv e-prints, page ... Caswell, I., Kreutzer, J., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, ...
arXiv:2201.06642v1
fatcat:n7xdk22ibngztnrgnque2625re
Evaluating Large Language Models Trained on Code
[article]
2021
arXiv
pre-print
On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves ...
Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. ...
Finally, we thank GitHub for partnering to build GitHub Copilot and Microsoft Azure for supporting model training with infrastructure management. ...
arXiv:2107.03374v2
fatcat:tnan6rhwq5fsfek2jydeesgmmy
Controlling Conditional Language Models with Distributional Policy Gradients
[article]
2021
arXiv
pre-print
However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g. hallucination in abstractive summarization or wrong format in automatic code ...
This raises an important question of how to adapt pre-trained generative models to a new task without destroying their capabilities. ...
..., Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv ...
arXiv:2112.00791v1
fatcat:dcjheonc2vecjn5qunu5nmoli4
Causal Inference Principles for Reasoning about Commonsense Causality
[article]
2022
arXiv
pre-print
Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly ...
... framework, which is the first such attempt for commonsense tasks. ...
Acknowledgements This work was supported in part by ONR Contract N00014-19-1-2620, NSF through CCF-1934876, an Alfred Sloan Research Fellowship, and the Wharton Dean's Research Fund. ...
arXiv:2202.00436v1
fatcat:oavft5weard2jndwxt5vo6aal4
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
2021
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
unpublished
The Pile: An 800GB dataset of diverse text for language modeling. ... 2006. The second PASCAL recognising textual entailment challenge. ... Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the ... 2020. Scaling laws for neural language models. arXiv:2001.08361. ...
doi:10.18653/v1/2021.emnlp-main.98
fatcat:okmbgm5f3nhbrajymb5x6uqn2e