241 Hits in 5.7 sec

What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus [article]

Alexandra Sasha Luccioni, Joseph D. Viviano
2021 arXiv   pre-print
In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models.  ...  We discuss the potential impacts of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.  ...  In the current article, we present an initial analysis of the Common Crawl, highlighting the presence of several types of explicit and abusive content even after filtering.  ... 
arXiv:2105.02732v3 fatcat:mkioygh2ujc23awckot3uku4vi

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

Alexandra Luccioni, Joseph Viviano
2021 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)   unpublished
In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models.  ...  We discuss the potential impacts of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.  ...  In the current article, we present an initial analysis of the Common Crawl, highlighting the presence of several types of explicit and abusive content even after filtering.  ... 
doi:10.18653/v1/2021.acl-short.24 fatcat:pcnwtctyn5enrefddruufdyeu4

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus [article]

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
2022 arXiv   pre-print
In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant that extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and  ...  And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling.  ...  What’s in the box? an analysis of undesirable content in the Com-  ...  Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019).  ... 
arXiv:2201.06642v1 fatcat:n7xdk22ibngztnrgnque2625re

Finding Viable Seed URLs for Web Corpora: A Scouting Approach and Comparative Study of Available Sources

Adrien Barbaresi
2014 Proceedings of the 9th Web as Corpus Workshop (WaC-9)  
The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines.  ...  To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia.  ...  Acknowledgments This work has been partially supported by an internal grant of the FU Berlin as well as machine power provided by the COW (COrpora from the Web) project at the German Grammar Department  ... 
doi:10.3115/v1/w14-0401 dblp:conf/aclwac/Barbaresi14 fatcat:enm7oaiahbcaxfwomklemzv6fu

The Pile: An 800GB Dataset of Diverse Text for Language Modeling [article]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
2020 arXiv   pre-print
Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.  ...  With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.  ...  E Investigating data E.1 13-Gram Analysis As part of our exploratory analysis, we calculated the counts of all 13-grams across Common Crawl.  ... 
arXiv:2101.00027v1 fatcat:74dgmcl55rdupks3kzygosjlca

Cloak and Swagger: Understanding Data Sensitivity through the Lens of User Anonymity

Sai Teja Peddinti, Aleksandra Korolova, Elie Bursztein, Geetanjali Sampemane
2014 2014 IEEE Symposium on Security and Privacy  
Our findings validate the viability of the proposed approach towards an automatic assessment of data sensitivity, show that data sensitivity is a nuanced measure that should be viewed on a continuum rather  ...  Most of what we understand about data sensitivity is through user self-report (e.g., surveys); this paper is the first to use behavioral data to determine content sensitivity, via the clues that users  ...  We thank Pern Hui Chia, Dorothy Chou, and Jessica Staddon for useful feedback on the paper drafts, and Úlfar Erlingsson for the title suggestion.  ... 
doi:10.1109/sp.2014.38 dblp:conf/sp/PeddintiKBS14 fatcat:cdyrrkrnjnfx7p3eewbqzedfye

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [article]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
2020 arXiv   pre-print
of offensive, factually unreliable, and otherwise toxic content.  ...  We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used  ...  ) , as well as an in-depth analysis of its open-source replica, OPENWEBTEXT CORPUS (OWTC; Gokaslan and Cohen, 2019, §6).  ... 
arXiv:2009.11462v2 fatcat:sdzqn6oumjgwvheetr2jrgggqq

SkillExplorer: Understanding the Behavior of Skills in Large Scale

Zhixiu Guo, Zijin Lin, Pan Li, Kai Chen
2020 USENIX Security Symposium  
are in the form of natural languages.  ...  However, to the best of our knowledge, there is no prior research that systematically explores the interaction behaviors of skills, mainly due to the challenges in handling skills' inputs/outputs which  ...  Acknowledgments The authors would like to thank anonymous reviewers for their insightful comments that have helped improve this paper substantially.  ... 
dblp:conf/uss/GuoLL020 fatcat:ikvbilvdobhixkdgpp4ytx5wgu

Building Machine Translation Systems for the Next Thousand Languages [article]

Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod (+12 others)
2022 arXiv   pre-print
additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent  ...  of massively multilingual models in data-sparse settings.  ...  First, we collected a list of the 8000 most common tokens in a large, web-crawled, monolingual English corpus.  ... 
arXiv:2205.03983v3 fatcat:65fva7qvpbaapemrrmovgkpac4

Pretrained Transformers for Text Ranking: BERT and Beyond

Andrew Yates, Rodrigo Nogueira, Jimmy Lin
2021 Proceedings of the 14th ACM International Conference on Web Search and Data Mining  
The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task.  ...  Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications.  ...  Acknowledgements This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada.  ... 
doi:10.1145/3437963.3441667 fatcat:6teqmlndtrgfvk5mneq5l7ecvq

Deep Latent-Variable Models for Text Generation [article]

Xiaoyu Shen
2022 arXiv   pre-print
As a result, it is difficult to trust the output from them in real-life applications.  ...  Deep latent-variable models, by specifying the probabilistic distribution over an intermediate latent process, provide a potential way of addressing these problems while maintaining the expressive power  ...  . 7.a error analysis We analyze common errors below.  ... 
arXiv:2203.02055v1 fatcat:sq3upxl7xvfnhigoc7apszomwu

MERLOT: Multimodal Neural Script Knowledge Models [article]

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
2021 arXiv   pre-print
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner.  ...  Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives  ...  An exploration of the data in our corpus (Section B) c. Qualitative analysis of model representations (Section C) d.  ... 
arXiv:2106.02636v3 fatcat:mrj2t3yuanbdzhsujshtky4enq

Sentiment Analysis and Opinion Mining [chapter]

Lei Zhang, Bing Liu
2017 Encyclopedia of Machine Learning and Data Mining  
the fear of undesirable consequences.  ...  CHAPTER 2 The Problem of Sentiment Analysis In this chapter, we define an abstraction of the sentiment analysis or opinion mining problem.  ... 
doi:10.1007/978-1-4899-7687-1_907 fatcat:iy5ty44cyzbrtodxfo7osy3iu4

A scholarly divide: Social media, Big Data, and unattainable scholarship

Asta Zelenkauskaite, Erik P. Bucy
2016 First Monday  
to the early stages of online research when it was common to examine the small data of Web logs and surface content that could be manually scraped from Web pages.  ...  In some cases, third party providers grant access to a portion of the data, while companies maintain control over the complete corpus.  ... 
doi:10.5210/fm.v21i5.6358 fatcat:2ovxmdcoa5chxlkl4gzfvk2tqu

Creating an Intellectual Commons through Open Access [chapter]

2006 Understanding Knowledge as a Commons  
commons that have the flavor of a tragedy of the commons.  ...  In this paper I discuss the peculiarities of royalty-free literature, the conditions that lead authors to consent to OA (including authors of royalty-producing literature), and some obstacles to an OA  ...  OA commons Anything as large and complicated as the OA commons will inspire analysis from different points of view.  ... 
doi:10.7551/mitpress/6980.003.0011 fatcat:ilqj3u3znfhsbpjz3ia3xdpu54