Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets

Bill McKinney, Peter A. Meyer, Mercè Crosas, Piotr Sliz
2016 Annals of the New York Academy of Sciences  
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension (functionality supporting preservation of filesystem structure within Dataverse), which is essential for both in-place computation and support of non-HTTP data transfers.
doi:10.1111/nyas.13272 pmid:27862010 pmcid:PMC5546227 fatcat:zbcxvm465zecrbyei5p7musare