The Problem of Zombie Datasets: A Framework For Deprecating Datasets [article]

Frances Corry, Hamsini Sridharan, Alexandra Sasha Luccioni, Mike Ananny, Jason Schultz, Kate Crawford
2021 arXiv   pre-print
in order to inform a framework for more consistent, ethical, and accountable dataset deprecation.  ...  What happens when a machine learning dataset is deprecated for legal, ethical, or technical reasons, but continues to be widely used?  ...  ACKNOWLEDGMENTS The authors wish to thank the industry practitioners and academic researchers who gave feedback on the framework outlined here, with particular gratitude to Adam Harvey for his detailed  ... 
arXiv:2111.04424v1 fatcat:ndmydwfnqff6fikaakldpe5w2y

A Software Assurance Reference Dataset: Thousands of Programs With Known Bugs

Paul E. Black
2018 Journal of Research of the National Institute of Standards and Technology  
The Software Assurance Reference Dataset (SARD) is a growing collection of over 170 000 programs with precisely located bugs.  ...  The programs are in C, C++, Java, PHP, and C# and cover more than 150 classes of weaknesses, such as SQL injection, cross-site scripting (XSS), buffer overflow, and use of a broken cryptographic algorithm  ...  Acknowledgments The author thanks David Flater and Gabriel Sarmanho for help with the chart in Fig. 1 and thanks John Henry Scott for valuable suggestions. References  ... 
doi:10.6028/jres.123.005 pmid:34877127 pmcid:PMC7339570 fatcat:i6hqylb7rzbrbnozmcspgmbqn4

Cell-level metadata are indispensable for documenting single-cell sequencing datasets

Sidhant Puntambekar, Jay R Hesselberth, Kent A Riemondy, Rui Fu
2021 PLoS Biology  
We encourage investigators, reviewers, journals, and data repositories to improve their standards and ensure proper documentation of these valuable datasets.  ...  cell types and related findings of the published dataset.  ...  to the development of best practices for documenting single-cell datasets in the wider community.  ... 
doi:10.1371/journal.pbio.3001077 pmid:33945522 pmcid:PMC8121533 fatcat:y4ndrqnb7zf2xjlls5ix2v2xvi

Enabling reusability of plant phenomic datasets with MIAPPE 1.1

Evangelia A. Papoutsoglou, Daniel Faria, Daniel Arend, Elizabeth Arnaud, Ioannis N. Athanasiadis, Inês Chaves, Frederik Coppens, Guillaume Cornut, Bruno V. Costa, Hanna Ćwiek‐Kupczyńska, Bert Droesbeke, Richard Finkers (+25 others)
2020 New Phytologist  
Community feedback has been critical to this development, and will be a key part of ensuring adoption of the standard.  ...  Enabling data reuse and knowledge discovery is increasingly critical in modern science, and requires an effort towards standardizing data publication practices.  ...  Acknowledgements This work was based on extensive reviews from and interactions with the broader MIAPPE community. We are grateful for all  ... 
doi:10.1111/nph.16544 pmid:32171029 fatcat:a2a5chl6xzcazllkhhtcoqjpku

The Darwin Core extension for genebanks opens up new opportunities for sharing genebank datasets

Dag Terje Filip Endresen, Helmut Knüpffer
2012 Biodiversity Informatics  
The Darwin Core extension for genebanks is a key component that provides access for the genebanks and the plant genetic resources community to the GBIF informatics infrastructure including the new toolkits  ...  The new Darwin Core extension for genebanks declares the additional terms required for describing genebank data sets, and is based on established standards from the plant genetic resources community.  ...  The GBIF data publishing framework task group recommends the publication of biodiversity data sets as citable "data papers" and that each dataset is identified by a PID for consistent data citation (Moritz  ... 
doi:10.17161/bi.v8i1.4095 fatcat:e7m3gktvqzfgdo45ift5ujphna

Collecting Vulnerable Source Code from Open-Source Repositories for Dataset Generation

Razvan Raducu, Gonzalo Esteban, Francisco J. Rodríguez Lera, Camino Fernández
2020 Applied Sciences  
The tool provides a set of tagged files suitable for extracting features and creating training datasets for Machine Learning algorithms.  ...  This study presents a descriptive analysis of these files and overviews current status of C vulnerabilities, specifically buffer overflow, in the reviewed public repositories.  ...  and unknown patterns", by the Consejería de Educación de la Junta de Castilla y León through the Project LE028P17 on the "Development of reusable software components based on machine learning for the  ... 
doi:10.3390/app10041270 fatcat:67r7mhtbkvdhfksukdi4mazmhq

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [article]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin (+40 others)
2021 arXiv   pre-print
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of  ...  We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses.  ...  Furthermore, we are grateful for Ahmed El-Kishky's support and help with CCAligned and WikiMatrix size statistics.  ... 
arXiv:2103.12028v3 fatcat:gdkre73knnf6xbleosbzqwff6m

Online Extremism Detection: A Systematic Literature Review with Emphasis on Datasets, Classification Techniques, Validation Methods and Tools

Mayur Gaikwad, Swati Ahirrao, Shraddha Phansalkar, Ketan Kotecha
2021 IEEE Access  
A comprehensive and comparative survey of datasets, classification techniques, validation methods with online extremism detection tool is essential.  ...  The review concludes lack of publicly available, class-balanced, and unbiased datasets for better detection and classification of social-media extremism.  ...  It can be concluded that there is a need for publicly available and verified standard datasets in online extremism research.  ... 
doi:10.1109/access.2021.3068313 fatcat:56xuyhtuxvdsxf7s7s5i3jvvbe

Ethical issues in research using datasets of illicit origin

Daniel R. Thomas, Sergio Pastrana, Alice Hutchings, Richard Clayton, Alastair R. Beresford
2017 Proceedings of the 2017 Internet Measurement Conference on - IMC '17  
We extract ethical principles from existing advice and guidance and analyse how they have been applied within more than recent peer reviewed papers that deal with illicitly obtained datasets.  ...  We evaluate the use of data obtained by illicit means against a broad set of ethical and legal issues.  ...  Keegan and Matias developed a multi-party risk benefit framework for use in analysing ethical considerations for online community research [ ].  ... 
doi:10.1145/3131365.3131389 dblp:conf/imc/ThomasPHCB17 fatcat:4imvace5p5dwjfqrqli2cnu4qa

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin (+40 others)
2022 Transactions of the Association for Computational Linguistics  
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of  ...  We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses.  ...  Furthermore, we are grateful for Ahmed El-Kishky's support and help with CCAligned and WikiMatrix size statistics.  ... 
doi:10.1162/tacl_a_00447 fatcat:dfprrgaj7vgn3obd6sxk4hrjty

RDF dataset profiling – a survey of features, methods, vocabularies and applications

Mohamed Ben Ellefi, Zohra Bellahsene, John G. Breslin, Elena Demidova, Stefan Dietze, Julian Szymański, Konstantin Todorov, Lora Aroyo
2018 Semantic Web Journal  
Ultimately, this work is intended to facilitate the reader to identify the relevant features for building a dataset profile for intended applications together with the methods and tools capable of extracting  ...  Even though there exists a wealth of works contributing to the task of dataset profiling in general, these works are spread across a wide range of communities.  ...  However, the work appears to be deprecated and not maintained.  ... 
doi:10.3233/sw-180294 fatcat:6ihya5zgpfgp5f6xcjpapy6ocy

Automated dataset generation for image recognition using the example of taxonomy [article]

Jaro Milan Zink
2018 arXiv   pre-print
In order to automate the dataset creation, a prototype was conceptualized and implemented after working out knowledge basics and analyzing requirements for it.  ...  The results were more than satisfactory and showed that automatically generating a dataset for image recognition is not only possible, but also might be a decent alternative to spending time and money  ...  For more detailed information, please see Docker's documentation 62,63 . which containers can communicate with each other and many other things 65 .  ... 
arXiv:1802.02207v1 fatcat:wdmdqlv5erdjvoseldhubd47tu

An in‐depth study of the effects of methods on the dataset selection of public development projects

Can Cheng, Bing Li, Zengyang Li, Peng Liang, Xu Yang
2021 IET Software  
To address this problem, a standard dataset was labelled and the base line methods (i.e. selecting projects according to a single feature like star number) under 60 configurations and the machine learning  ...  However, it is hard for researchers to effectively select PDPs and DPDPs due to the lack of specific project selection methods for these two types of projects.  ...  FIGURE 2 The framework of calculating features in the Standard Dataset. FIGURE 3 The framework of our experiments. Experiment 1 for testing the base line methods  ... 
doi:10.1049/sfw2.12050 fatcat:yu2uw6rrerdmpemqtxnypjzgw4

Dataset search in biodiversity research: Do metadata in data repositories reflect scholarly information needs?

Felicitas Löffler, Valentin Wesp, Birgitta König-Ries, Friederike Klan, Hussein Suleman
2021 PLoS ONE  
In particular, we focus on scholarly search interests and metadata, the primary source of data in a dataset retrieval system.  ...  In this study, we explore what hampers dataset retrieval in biodiversity research, a field that produces a large amount of heterogeneous data.  ...  Acknowledgments The authors would like to thank the annotators for their time and valuable comments. Author Contributions Conceptualization: Felicitas Löffler, Friederike Klan.  ... 
doi:10.1371/journal.pone.0246099 pmid:33760822 fatcat:75vgcuhzibgbxeqrabb76uruje

Is the LOD cloud at risk of becoming a museum for datasets? Looking ahead towards a fully collaborative and sustainable LOD cloud

Jeremy Debattista, Judie Attard, Rob Brennan, Declan O'Sullivan
2019 Companion Proceedings of The 2019 World Wide Web Conference on - WWW '19  
Based on our findings, we therefore propose a strategy and architecture for a potential collaborative and sustainable LOD cloud.  ...  Throughout the years, this prominent depiction served as the epitome for Linked Data and acted as a starting point for many.  ...  Challenge(s) to be tackled: C2 In order to have a uniform view of the datasets in the LOD cloud, the critical element is the identification of a metadata standard, and a glossary or taxonomy for non-descriptive  ... 
doi:10.1145/3308560.3317075 dblp:conf/www/DebattistaABO19 fatcat:6j47xteykzamhhfzirmpo5vogi