Sparse Data Management in HDF5

John Mainzer, Neil Fortner, Gerd Heber, Elena Pourmal, Quincey Koziol, Suren Byna, Marc Paterno
2019 2019 IEEE/ACM 1st Annual Workshop on Large-scale Experiment-in-the-Loop Computing (XLOOP)  
In this document, we explore different design options to support sparse data in HDF5, one of the most popular high-performance I/O libraries and file formats used for scientific data.  ...  The importance of sparse data management is growing with data produced by large-scale experimental and observational facilities that contain small amounts of non-zero values.  ...  In this paper, we explore the options for and feasibility of sparse data management in HDF5 without changes to the existing API. We begin with the description of a model problem (§II).  ... 
doi:10.1109/xloop49562.2019.00009 dblp:conf/sc/MainzerFHPKBP19 fatcat:zlh7rc4kajgq7ilpnptco6tf6u

The TileDB array data storage manager

Stavros Papadopoulos, Kushal Datta, Samuel Madden, Timothy Mattson
2016 Proceedings of the VLDB Endowment  
We present a novel storage manager for multi-dimensional arrays that arise in scientific applications, which is part of a larger scientific data management system called TileDB.  ...  Each fragment is dense or sparse, and groups contiguous array elements into data tiles of fixed capacity.  ...  columnar database on sparse arrays, while offering a programmer-friendly API-based interface similar to HDF5.  ... 
doi:10.14778/3025111.3025117 fatcat:4pmtzwjbsrexliyrsx65xbdozi

Scalable, Sparse IO with larcv

Corey Adams
2022 Zenodo  
In this lightning talk we present larcv, an open source tool built on HDF5 that enables parallel, scalable IO for irregular datasets with simple python access.  ...  At the intersection of high energy physics, deep learning, and high performance computing there is a challenge: how to efficiently handle data I/O of sparse and irregular datasets from high energy physics  ...  -Serialization/Deserialization looks very similar to SQL: tabular data turns very sparse and irregular data into manageable sequential HDF5 datatype reads. • (You don't have to know that!  ... 
doi:10.5281/zenodo.7140051 fatcat:eqdbsjn5pbdp5fjjdaqujcwhqy

The evolution of an open source file format: a version control story

Benjamin Savitzky, Steven Zeltmann, Luis Rangel DaCosta, Peter Ercius, Mary Scott, Andrew Minor, Colin Ophus
2021 Microscopy and Microanalysis  
The HDF5 format is used widely in scientific computing, enables cross-platform and high performance data access, and allows flexible definition of directory hierarchies containing both data and metadata  ...  It uses an HDF5-based format initially derived from the EMD 0.1 format to store N-D arrays, which also contains several additional data structures that are useful in 4D STEM data analysis.  ... 
doi:10.1017/s1431927621004116 fatcat:5chijpa4ybcazjm6sabnlpyrei

Improving I/O Throughput of Scientific Applications Using Transparent Parallel Compression

Tekin Bicer, Jian Yin, Gagan Agrawal
2014 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing  
a data analysis application to read compressed data.  ...  However, this also implies that increasing larger-sized datasets need to be output, stored, managed, and then visualized and/or analyzed using a variety of methods.  ...  PnetCDF [13] , [4] and Parallel HDF5 [3] are both widely used scientific data management libraries.  ... 
doi:10.1109/ccgrid.2014.112 dblp:conf/ccgrid/BicerYA14 fatcat:k4ev2weytrfvdglet2lha47mji

beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types

Aaron T. L. Lun, Hervé Pagès, Mike L. Smith, Mihaela Pertea
2018 PLoS Computational Biology  
For sparse matrices, times were also recorded with respect to the density of non-zero entries. In all cases, standard errors were negligible and not plotted.  ...  In practice, the magnitude of differences in access speed between representations depends on many factors such as the amount of microprocessor cache memory, speed of reading data from files on different  ...  Compression of data in the HDF5 file also ensures that the disk space requirements are manageable, even if multiple large matrices need to be generated to hold intermediate results throughout an analysis  ... 
doi:10.1371/journal.pcbi.1006135 pmid:29723188 pmcid:PMC5953501 fatcat:v4rk3wtiwnaennosvarlaugvae

scDIOR: single cell RNA-seq data IO software

Huijian Feng, Lihui Lin, Jiekai Chen
2022 BMC Bioinformatics  
Results We developed scDIOR for single-cell data transformation between platforms of R and Python based on Hierarchical Data Format Version 5 (HDF5).  ...  Conclusions scDIOR contains two modules, dior in R and diopy in Python. scDIOR is a versatile and user-friendly tool that implements single-cell data transformation between R and Python rapidly and stably  ...  Here we used Hierarchical Data Format (HDF5), a high-performance data management and storage suite (https://www.hdfgroup.org/solutions/hdf5) to store this information.  ... 
doi:10.1186/s12859-021-04528-3 pmid:34991457 pmcid:PMC8734364 fatcat:urwm3va3xfcedpagwc3jh34uxi

Cooler: scalable storage for Hi-C data and other genomically-labeled arrays [article]

Nezar Abdennur, Leonid Mirny
2019 bioRxiv   pre-print
Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while  ...  Storage and computational costs mount sharply with data resolution when such maps are stored in dense form.  ...  Cooler package We provide a Python-based convenience library to manage cooler data collections.  ... 
doi:10.1101/557660 fatcat:vqjwc7or3bgy5b546vje3rqsfu

Parallel data analysis directly on scientific file formats

Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, Arie Shoshani
2014 Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14  
In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format.  ...  Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows.  ...  In addition, we thank Avrilia Floratou and Yushu Yao for assisting with the Hive setup, and Peter Nugent for answering questions about the PTF workload.  ... 
doi:10.1145/2588555.2612185 dblp:conf/sigmod/BlanasWBDS14 fatcat:tfpgk6x25vefdh6ayqrtkyblly

The Impact of the Data Archiving File Format on Scientific Computing and Performance of Image Processing Algorithms in MATLAB Using Large HDF5 and XML Multimodal and Hyperspectral Data Sets [chapter]

Kelly Bennett, James Robertso
2011 MATLAB - A Ubiquitous Tool for the Practical Engineer  
HDF5 is a data model, library, and file format for storing and managing data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5 (HDF Group).  ...  The Sparse test converts a matrix  ... 
doi:10.5772/19410 fatcat:eqkomn5coff75iqy4ufkzi63ae

Blocks and Fuel: Frameworks for deep learning [article]

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, Yoshua Bengio
2015 arXiv   pre-print
Automated data management Fuel offers built-in scripts that automate the task of downloading datasets, (similar to e.g. skdata 1 ) and converting them to Fuel's HDF5 specification.  ...  All of the data is stored in a single HDF5 file, with the following metadata attached: • What are the data sources available (e.g. features, targets, etc.)?  ... 
arXiv:1506.00619v1 fatcat:xp6wxgav4rcg5pgzityagso63i

beachmat: a Bioconductor C++ API for accessing single-cell genomics data from a variety of R matrix types [article]

Aaron T. L. Lun, Hervé Pagès, Mike L. Smith
2017 bioRxiv   pre-print
This allows package developers to write efficient C++ code that is interoperable with simple, sparse and HDF5-backed matrices, amongst others.  ...  In particular, large matrices holding expression values for each gene in each cell require sparse or file-backed representations for manipulation with the popular R programming language.  ...  Compression of data in the HDF5 file also ensures that the on-disk footprint remains manageable throughout the course of the analysis. The beachmat API supports row- and column-level access from  ... 
doi:10.1101/167445 fatcat:xiagqu7zkvgizhvtb3b6ssvune

A Storage Scheme for Multi-dimensional Databases Using Extendible Array Files

Ekow J. Otoo, Doron Rotem
2006 International Workshop on Spatio-Temporal Database Management  
In recent years, organizations have adopted the use of on-line analytical processing (OLAP), methods and statistical analyses to make strategic business decisions using enterprise data that are modeled  ...  In both of these domains, the datasets have the propensity to gradually grow, reaching orders of terabytes.  ...  In section 3 we describe how an array file is implemented. We describe how sparse multidimensional array files are managed in section 4.  ... 
dblp:conf/stdbm/OtooR06 fatcat:hcaefytulnd6rbg435ixbnucky

ArrayBridge: Interweaving declarative array processing with high-performance computing [article]

Haoyuan Xing, Suren Byna The Ohio State University
2017 arXiv   pre-print
This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats.  ...  In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it.  ...  TileDB proposes using flexible tiling to address the problem of accessing sparse data for efficient writes [27] .  ... 
arXiv:1702.08327v1 fatcat:mj3taabp5vcgdlcficikjno6li

Distributed Caching for Complex Querying of Raw Arrays [article]

Weijie Zhao, Florin Rusu, Bin Dong, Kesheng Wu, Anna Y. Q. Ho, Peter Nugent
2018 arXiv   pre-print
In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined.  ...  In-situ mechanisms provide direct access to raw data in the original format---without loading and partitioning. Parallel processing scales to the largest datasets.  ...  Since HDF5 does not support sparse arrays natively, we store the PTF objects as an HDF5 table, i.e., relation. In FITS, data are stored as a binary table.  ... 
arXiv:1803.06089v1 fatcat:fnzuiwxywzahdlynqyxa4lto7i