Sketching and Sublinear Data Structures in Genomics

Guillaume Marçais, Brad Solomon, Rob Patro, Carl Kingsford
2019 Annual Review of Biomedical Data Science  
Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.
Expected final online publication date for the Annual Review of Biomedical Data Science Volume 2 is July 22, 2019. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
doi:10.1146/annurev-biodatasci-072018-021156 fatcat:zlqdv6ke4vdmvgaaqwvvd53iae