Aird: A computation-oriented mass spectrometry data format enables higher compression ratio and less decoding time [article]

MiaoShan Lu, Shaowei An, Ruimin Wang, Jinyin Wang, Changbin Yu
2020 bioRxiv pre-print
As mass spectrometer precision increases and data-independent acquisition (DIA) becomes common, file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The required data precision often depends on the instrument and on the downstream processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as on compression rate. Here we describe "Aird", an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, and multiple indexing and reordered storage strategies for faster random reads. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, m/z data in Aird are about 65% smaller and take only 33% of the decoding time.
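To illustrate why differencing before Zlib can help with m/z data, here is a minimal Python sketch. It is not Aird's ZDPD implementation: the PforDelta stage is omitted entirely, and the function names, the `decimals` parameter, and the 32-bit little-endian layout are assumptions made for this example. It quantizes m/z values to integers at a chosen precision, stores successive differences, and compresses the result with Zlib.

```python
import struct
import zlib


def encode_mz(mz_values, decimals=3):
    """Quantize m/z floats, delta-encode them, then zlib-compress.

    Hypothetical helper: `decimals` stands in for a controllable
    precision setting and is not a parameter taken from Aird.
    """
    scale = 10 ** decimals
    ints = [round(v * scale) for v in mz_values]
    if not ints:
        return zlib.compress(b"")
    # Delta step: keep the first value, then store successive differences.
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    # Assumed layout: signed 32-bit little-endian integers.
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return zlib.compress(raw)


def decode_mz(blob, decimals=3):
    """Invert encode_mz: decompress, prefix-sum the deltas, rescale."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    scale = 10 ** decimals
    values, running = [], 0
    for d in deltas:
        running += d
        values.append(running / scale)
    return values


def zlib_only(mz_values, decimals=3):
    """Baseline: zlib over the quantized integers without the delta step."""
    scale = 10 ** decimals
    ints = [round(v * scale) for v in mz_values]
    return zlib.compress(struct.pack(f"<{len(ints)}i", *ints))
```

Within a spectrum, m/z values are sorted, so the differences are small, repetitive integers that Zlib handles far better than the raw values. The round trip is lossy only up to the chosen precision (half a unit in the last retained decimal place).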
doi:10.1101/2020.10.14.338921 fatcat:xekkgdjuljeczavm23hamqiddu