Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
doi:10.18653/v1/2021.emnlp-main.98 fatcat:okmbgm5f3nhbrajymb5x6uqn2e