
Compression and fast retrieval of SNP data

F. Sambo, B. Di Camillo, G. Toffolo, C. Cobelli
2014 Bioinformatics  
This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data.  ...  Results: We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies.  ...  Conflict of interest: none declared.  ... 
doi:10.1093/bioinformatics/btu495 pmid:25064564 pmcid:PMC4609015 fatcat:x2rzzvaf3ze25hidwqml5oqn7a

A database for efficient storage and management of multi panel SNP data

E. Groeneveld, C. V. C. Truong
2013 Archives Animal Breeding  
Due to its vector based database storage, data imports and exports are much faster than those of other SNP databases.  ...  A new strategy using SNP and individual selection vectors allows us to view SNP data as matrices or sets.  ...  Secondly, for each individual the compressed genotype vector is retrieved by one SQL select and shrunk on the basis of the snp_sel_vec which can be implemented as fast shifts.  ... 
doi:10.7482/0003-9438-56-103 fatcat:ivj56rlnlrfcfccczksxdqkp2e

Indexing k-mers in Linear-space for Quality Value Compression

Yoshihiro Shibuya, Matteo Comin
2019 Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies  
Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress.  ...  We show how a dictionary of significant k-mers, obtained from SNPs databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality value.  ...  The next step usually involves some sort of sorting and indexing for allowing fast retrieval of some particular k-mer (or its neighbors) and link this information to the position in the original sequence  ... 
doi:10.5220/0007369100210029 dblp:conf/biostec/ShibuyaC19 fatcat:lcqqjj3ffvf2hdeywhrcjm5lmu

Fast randomized approximate string matching with succinct hash data structures

Alberto Policriti, Nicola Prezza
2015 BMC Bioinformatics  
We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to be un-compressible.  ...  In this work we show that, combining hashing and succinct indexing techniques, we can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes  ...  Tests on both simulated and real data, using the most popular short reads aligners, allowed us to validate also in practice the efficiency of our algorithm, which proved to be extremely accurate and fast  ... 
doi:10.1186/1471-2105-16-s9-s4 pmid:26051265 pmcid:PMC4464037 fatcat:4yccrahs2jd5vd6condhlvsceu

Tabix: fast retrieval of sequence features from generic TAB-delimited files

H. Li
2011 Bioinformatics  
Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access.  ...  Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions.  ...  of direct FTP/HTTP access and Jim Kent, James Bonfield and Richard Durbin for their helpful discussions on general indexing techniques.  ... 
doi:10.1093/bioinformatics/btq671 pmid:21208982 pmcid:PMC3042176 fatcat:5pshpfozwnb75piffwhkpd7agq

The variant call format and VCFtools

P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin
2011 Bioinformatics  
VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome.  ...  The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.  ...  Conflict of Interest: none declared.  ... 
doi:10.1093/bioinformatics/btr330 pmid:21653522 pmcid:PMC3137218 fatcat:bu6imoalw5hypbfua45gzlsnpy

TheSNPpit—A High Performance Database System for Managing Large Scale SNP Data

Eildert Groeneveld, Helmut Lichtenberg, Tesfaye B Mersha
2016 PLoS ONE  
TheSNPpit has implemented three ideas to also accommodate such large scale experiments: highly compressed vector storage in a relational database, set based data manipulation, and a very fast export written  ...  The fast development of high throughput genotyping has opened up new possibilities in genetics while at the same time producing considerable data handling issues.  ...  Discussion and Conclusions TheSNPpit is a fast database system for storage and management of large volumes of SNP data. It can handle panels of any size, even those derived from whole genome scans.  ... 
doi:10.1371/journal.pone.0164043 pmid:27780248 pmcid:PMC5079601 fatcat:johokbs5vnddphcxw3qptptz3y

Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval [chapter]

Shanika Kuruppu, Simon J. Puglisi, Justin Zobel
2010 Lecture Notes in Computer Science  
Self-indexes, data structures that simultaneously provide fast search of and access to compressed text, are promising for genomic data but in their usual form are not able to exploit the high level of  ...  Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base.  ...  However, large resources are required to be shared among users of the compressed data (4.2 GB of reference and SNPs in Christley et al.'s software).  ... 
doi:10.1007/978-3-642-16321-0_20 fatcat:uo2k4572obdlfat5cx6qo7sasm
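
Illustrative note on the entry above: the snippet describes compressing each sequence as an LZ77-style encoding relative to a base sequence. The following is a minimal sketch of that idea, not the authors' implementation. The function names `rlz_compress` and `rlz_decompress`, the `(position, length, literal)` factor layout, and the naive quadratic match search are all assumptions made for illustration; the paper's approach relies on a self-index over the base sequence instead.

```python
from typing import List, Tuple

# A factor is (position in base, match length, literal).
# length == 0 means "emit the literal character" (hypothetical layout).
Factor = Tuple[int, int, str]


def rlz_compress(base: str, target: str) -> List[Factor]:
    """Greedily factorize `target` into longest matches against `base`."""
    factors: List[Factor] = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive longest-match search over every start position in base.
        # This is for clarity only; an index over base would replace it.
        for start in range(len(base)):
            length = 0
            while (start + length < len(base)
                   and i + length < len(target)
                   and base[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            # Character absent from base: store it as a literal.
            factors.append((0, 0, target[i]))
            i += 1
        else:
            factors.append((best_pos, best_len, ""))
            i += best_len
    return factors


def rlz_decompress(base: str, factors: List[Factor]) -> str:
    """Rebuild the target by copying slices of base or emitting literals."""
    parts = []
    for pos, length, literal in factors:
        parts.append(base[pos:pos + length] if length else literal)
    return "".join(parts)


if __name__ == "__main__":
    base = "ACGTACGTTGCA"
    target = "ACGTTGCAACGN"
    factors = rlz_compress(base, target)
    print(factors)
    print(rlz_decompress(base, factors) == target)
```

A sequence that closely resembles the base collapses into a handful of long factors, which is why this kind of scheme suits collections of similar genomes.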

De Novo NGS Data Compression [chapter]

Gaetan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, Dominique Lavenier
2017 Algorithms for Next-Generation Sequencing Data  
Compression and decompression is performed as a standalone process independently of external knowledge.  ...  The chapter explains the main NGS compression techniques, including lossless and lossy compression.  ...  Introduction During the last decade, the fast evolution of the sequencing technologies has led to an explosion of DNA data. Every field of life science is now concerned.  ... 
doi:10.1007/978-3-319-59826-0_4 fatcat:pjctkfpul5cqvejdr5mtymfjm4

PanTools: representation, storage and exploration of pan-genomic data

Siavash Sheikhizadeh, M. Eric Schranz, Mehmet Akdel, Dick de Ridder, Sandra Smit
2016 Bioinformatics  
We define the pan-genome as a comprehensive representation of multiple annotated genomes, facilitating analyses on the similarity and divergence of the constituent genomes at the nucleotide, gene and genome  ...  We demonstrate the performance of the tool using datasets of 62 E. coli genomes, 93 yeast genomes and 19 Arabidopsis thaliana genomes.  ...  Acknowledgements We thank Maria-Anna Misiakou and Salvador Casani Galdon for valuable input.  ... 
doi:10.1093/bioinformatics/btw455 pmid:27587666 fatcat:sa6pinj46rgitb7ofrlum4p2jq

A SUPER Powerful Method for Genome Wide Association Study

Qishan Wang, Feng Tian, Yuchun Pan, Edward S. Buckler, Zhiwu Zhang, Yun Li
2014 PLoS ONE  
This restriction potentially leads to less statistical power when compared to using all SNPs. We developed a method to extract a small subset of SNPs and use them in FaST-LMM.  ...  This method not only retains the computational advantage of FaST-LMM, but also remarkably increases statistical power even when compared to using the entire set of SNPs.  ...  Miller and Linda R. Klein for editing the manuscript. New Powerful Method for GWAS Conceived and designed the experiments: ZZ YP ESB. Performed the experiments: QW FT. Analyzed the data: QW ZZ.  ... 
doi:10.1371/journal.pone.0107684 pmid:25247812 pmcid:PMC4172578 fatcat:eddmwp3f6raw5kqhhzm6m3zkde

SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access

Jorge Amigo, Antonio Salas, Christopher Phillips, Ángel Carracedo
2008 BMC Bioinformatics  
A fast pipeline creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data is mined, summarized into the standard statistical reference  ...  Results: We have developed a novel tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics: SPSmart (SNPs  ...  Thanks to Albert Vernon Smith, Lalitha Krishnan and Marcela K Tello-Ruiz of HapMap for their long-standing interest and support, and to Juan Villasuso and Natalia Costas of Centro de Supercomputación de  ... 
doi:10.1186/1471-2105-9-428 pmid:18847484 pmcid:PMC2576268 fatcat:nd3ijfze4bbjvkczoq3tkpd24i

Better quality score compression through sequence-based quality smoothing

Yoshihiro Shibuya, Matteo Comin
2019 BMC Bioinformatics  
We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers and an effective smoothing algorithm to maintain high precision for SNP calling pipelines  ...  As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression.  ...  of YALFF's inner structure widely used for benchmarking in other papers, because the list of known SNPs is available and it can retrieved from .P.  ... 
doi:10.1186/s12859-019-2883-5 pmid:31757199 pmcid:PMC6873394 fatcat:ec6a5zsokrfbpdjcsfynxd4ysy

SNPchiMp: a database to disentangle the SNPchip jungle in bovine livestock

Ezequiel Nicolazzi, Matteo Picciolini, Francesco Strozzi, Robert Schnabel, Cindy Lawley, Ali Pirani, Fiona Brew, Alessandra Stella
2014 BMC Genomics  
In addition, SNPchiMp can retrieve this information on subsets of SNPs, accessing such data either via physical position on a supported assembly, or by a list of SNP IDs, rs or ss identifiers.  ...  Most researchers and breed associations manage SNP data in real-time and thus require tools to standardise data in a user-friendly manner.  ...  Williams for his important feedback on the early version of this tool.  ... 
doi:10.1186/1471-2164-15-123 pmid:24517501 pmcid:PMC3923093 fatcat:pokwyijpyjebrepjr27wr3scfy

EnsMart: A Generic System for Fast and Flexible Access to Biological Data

A. Kasprzyk
2003 Genome Research  
The EnsMart system provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools.  ...  Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats.  ...  We thank the following for providing data sets: South African National Bioinformatics Institute (SANBI) and Electric Genetics, Genomics Institute of the Novartis Research Foundation (GNF), Affymetrix,  ... 
doi:10.1101/gr.1645104 pmid:14707178 pmcid:PMC314293 fatcat:u5twj5oxfncwjn2mkpicfx4hga