Discovering data reuse with the Throughput annotation database
The open data movement has the potential to democratize science, promote discovery, enhance reproducibility, and encourage collaboration. However, researchers, especially those new to a discipline's practices, often struggle to work with open data and to understand the potentially complex analysis workflows involved, even when the data are available. We are working to make data and code resources easier to find through the Throughput database, an NSF EarthCube-funded graph database designed to improve the discoverability of open data by linking data resources, through user-created annotations, to associated scholarly products (publications, grants, code repositories, outreach activities). Throughput records links between scientific objects accessible through the web, including research grants, research databases and datasets, and public code repositories. Code repositories in Throughput were populated using open APIs to discover repositories that reference the names or URLs of data archives listed on re3data.org. The xDeepDive (geoDeepDive.org) platform was used to discover code repositories linked to publications among the 14 million articles indexed within xDeepDive, and the Throughput Code Cookbook has given individuals a way to manually link code repositories. In all, Throughput indexes 74,000 code repositories, linked to 1,400 data resources and 19,000 journal articles. Here, we describe our development of use-based annotations for code resources in Throughput, which tag code according to the way it uses Earth science data. These annotations make it easier to distinguish analysis code from tutorials, and they give data archive managers important metrics on how their repositories are used. In the future, we hope to leverage these classification results to predict code repository type or to identify useful code repositories for machine learning.
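As a rough illustration of how use-based annotations could support both code discovery and archive usage metrics, the sketch below models annotations as simple records linking a code repository to a data resource. It is not Throughput's actual schema or API: the `Annotation` class, the `use_type` labels, the helper functions, and all repository and resource names are hypothetical and exist only for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One link between a code repository and a data resource (hypothetical model)."""
    repo: str       # hypothetical code repository identifier
    resource: str   # hypothetical data resource name
    use_type: str   # hypothetical use-based tag, e.g. "analysis" or "tutorial"


def usage_by_resource(annotations):
    """Count annotations per use type for each data resource."""
    counts = defaultdict(Counter)
    for a in annotations:
        counts[a.resource][a.use_type] += 1
    return counts


def repos_for(annotations, resource, use_type):
    """Return the sorted repositories that use a resource in a given way."""
    return sorted(
        {a.repo for a in annotations
         if a.resource == resource and a.use_type == use_type}
    )


# Made-up example records; none of these are real Throughput entries.
ANNOTATIONS = [
    Annotation("example-org/pollen-analysis", "Example Archive A", "analysis"),
    Annotation("example-org/intro-workshop", "Example Archive A", "tutorial"),
    Annotation("example-lab/core-models", "Example Archive A", "analysis"),
    Annotation("example-lab/core-models", "Example Archive B", "analysis"),
]

if __name__ == "__main__":
    for resource, counter in usage_by_resource(ANNOTATIONS).items():
        print(resource, dict(counter))
    print(repos_for(ANNOTATIONS, "Example Archive A", "analysis"))
```

In this toy model, the per-resource counts stand in for the kind of usage metric an archive manager might want, and `repos_for` stands in for a user filtering for analysis code rather than tutorials.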