Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources [article]

Angelina McMillan-Major and Zaid Alyafeai and Stella Biderman and Kimbo Chen and Francesco De Toni and Gérard Dupont and Hady Elsahar and Chris Emezue and Alham Fikri Aji and Suzana Ilić and Nurulaqilla Khamis and Colin Leong and Maraim Masoud and Aitor Soroa and Pedro Ortiz Suarez and Zeerak Talat and Daniel van Strien and Yacine Jernite
2022 arXiv pre-print
Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. ... We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese ... The Catalogue: The main goal of the catalogue is to support the creation of the BigScience dataset while adhering to the values laid out by the various data working groups: collecting diverse resources ...
arXiv:2201.10066v1 fatcat:tcel3byw5be7fablkrx4sx3icq