FASTAFS: file system virtualisation of random access compressed FASTA files [article]

Youri Hoogstrate, Guido Jenster, Harmen JG van de Werken
2020 bioRxiv   pre-print
The FASTA file format, used to store polymeric sequence data, has been a bioinformatics file standard for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and provide random access. Multiple compressors have been developed to archive and restore FASTA files, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration. Results: We designed a Linux-based toolkit using Filesystem in Userspace (FUSE) that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem. This guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstandard (zstd). The toolkit, FASTAFS, can track all system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards-compatible, general-purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers the possibility to design format conversion without the need to rewrite code into different languages while preserving compatibility. Code Availability: .
doi:10.1101/2020.11.11.377689 fatcat:4es44ocbanhhnko3nt4ll4o4ya
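
The abstract notes that plain FASTA files need an additional metadata file to provide random access. As an illustration only, the following minimal Python sketch builds an index in the style of a samtools faidx (.fai) file (sequence length, byte offset, bases per line, bytes per line) and uses it to fetch a subsequence with a single seek. It is not FASTAFS code; the function names `build_index` and `fetch` and the sample data are hypothetical, and it assumes every sequence uses a uniform line length except for its last line.

```python
import io


def build_index(handle):
    """Map each sequence name to (length, offset, line_bases, line_width).

    `handle` must be a binary file-like object holding FASTA text.
    Assumption: all lines of a sequence except the last share one length.
    """
    index = {}
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position of the start of the current line
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                index[name] = (length, offset, line_bases, line_width)
            # The sequence name is the first word after '>'.
            name = raw[1:].split()[0].decode()
            length = line_bases = line_width = 0
            offset = pos + len(raw)  # first byte of the sequence data
        elif name is not None:
            stripped = raw.rstrip(b"\r\n")
            if line_bases == 0 and stripped:
                line_bases, line_width = len(stripped), len(raw)
            length += len(stripped)
        pos += len(raw)
    if name is not None:
        index[name] = (length, offset, line_bases, line_width)
    return index


def fetch(handle, index, name, start, end):
    """Return residues in the 0-based half-open range [start, end)."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Translate residue coordinates into byte positions in the file.
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_width + (end - 1) % line_bases
    handle.seek(first)
    chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Hypothetical sample data: two short sequences wrapped at 8 bases per line.
SAMPLE = b">chr1 test\nACGTACGT\nACGTAC\n>chr2\nTTTTGGGG\nCC\n"
sample_index = build_index(io.BytesIO(SAMPLE))
```

With the sample above, `fetch(io.BytesIO(SAMPLE), sample_index, "chr1", 6, 10)` reads only the bytes spanning the requested residues. An index like this goes stale whenever the FASTA file changes, which is the kind of synchronisation problem the abstract says virtualised metadata files avoid.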