Halvade: scalable sequence analysis with MapReduce

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
2015 Bioinformatics  
Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling.  ...  Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.  ...  The scalability of Halvade was assessed by running the analysis pipeline with an increasing number of 1-15 nodes.  ... 
doi:10.1093/bioinformatics/btv179 pmid:25819078 pmcid:PMC4514927 fatcat:2nmppry6nnfqrjhdljpukm5r3a

Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

Hamid Mushtaq, Zaid Al-Ars
2015 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)  
Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, addressed this issue of scalability in the GATK DNA analysis pipeline.  ...  However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools to perform the needed analysis.  ...  However, since sequencing produces a large amount of data, post-sequencing DNA analysis requires effective and scalable solutions to ensure high computational performance.  ... 
doi:10.1109/bibm.2015.7359893 dblp:conf/bibm/MushtaqA15 fatcat:g3jlt7nz6fe3ffy7b3wd3wg62q

SparkGA

Hamid Mushtaq, Frank Liu, Carlos Costa, Gang Liu, Peter Hofstee, Zaid Al-Ars
2017 Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - ACM-BCB '17  
In order to reduce the analysis cost, SparkGA can run on nodes with as little memory as 16GB.  ...  For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%.  ...  Our previous work [Mushtaq15] addressed the problems with MapReduce based solutions like Halvade by having a Spark based implementation of the DNA analysis pipeline.  ... 
doi:10.1145/3107411.3107438 dblp:conf/bcb/MushtaqLCLHA17 fatcat:6pvahsud7jckfgz7ihbszhkaxi

Performance Analysis of a Parallel, Multi-node Pipeline for DNA Sequencing [chapter]

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
2016 Lecture Notes in Computer Science  
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling and is very time-consuming, even on a multi-core machine.  ...  Recently, we proposed Halvade, a parallel, multi-node implementation of a DNA sequencing pipeline according to the GATK Best Practices recommendations.  ...  Special thanks goes to Stijn De Weirdt for his assistance with the Java wrappers to improve NUMA locality. Benchmarks on Lustre were run at the Intel Big Data Lab, Swindon, UK.  ... 
doi:10.1007/978-3-319-32152-3_22 fatcat:sq2tl5dlwrg35mcxy5fjzcir24

Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud

Andrian Yang, Michael Troup, Peijie Lin, Joshua W. K. Ho
2016 Bioinformatics  
Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability.  ...  The result shows Falco performs at least 2.6x faster against a highly optimized single node analysis and a reduction in runtime with increasing number of computing nodes.  ...  Summary Falco is a cloud-based framework designed for multi-sample analysis of transcriptomic data in an efficient and scalable manner.  ... 
doi:10.1093/bioinformatics/btw732 pmid:28025200 fatcat:sdazrnj3dnc33hqyosil44hsza

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

José M. Abuín, Juan C. Pichel, Tomás F. Pena, Jorge Amigo, Ruslan Kalendar
2016 PLoS ONE  
In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow.  ...  To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation.  ...  Mapping these data onto a reference genome is often the first step in the sequence analysis workflow.  ... 
doi:10.1371/journal.pone.0155461 pmid:27182962 pmcid:PMC4868289 fatcat:i3ziuweua5hfvix3dh2wda2frm

Parallel computing for genome sequence processing

You Zou, Yuejie Zhu, Yaohang Li, Fang-Xiang Wu, Jianxin Wang
2021 Briefings in Bioinformatics  
Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing,  ...  Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features.  ...  Zou et al. review the MapReduce-based software and projects in next-generation sequencing (NGS) data processing in the aspects including sequence alignment, mapping, assembly, gene expression analysis,  ... 
doi:10.1093/bib/bbab070 pmid:33822883 fatcat:a4hj2fhybrc6zlsq6xyiu6snmy

HSRA: Hadoop-based spliced read aligner for RNA sequencing data

Roberto R. Expósito, Jorge González-Domínguez, Juan Touriño, Ruslan Kalendar
2018 PLoS ONE  
With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis  ...  Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression.  ...  Finally, Halvade-RNA is an extension of Halvade that provides a whole analysis pipeline for RNA-seq data using STAR as the underlying aligner. Therefore, the same limitations arise as for Halvade.  ... 
doi:10.1371/journal.pone.0201483 pmid:30063721 pmcid:PMC6067734 fatcat:njconu4bajaulkl2zmephovjsm

SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

Zaid Al-Ars, Saiyi Wang, Hamid Mushtaq
2020 Genes  
Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.  ...  On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node.  ...  Acknowledgments: The experiments in this paper have been performed on the Dutch national e-infrastructure with support from the SURF cooperative.  ... 
doi:10.3390/genes11010053 pmid:31947774 pmcid:PMC7016739 fatcat:qfuonctfnvdkxf5ckadqq26qoy

StreamAligner: a streaming based sequence aligner on Apache Spark

Sanjay Rathee, Arti Kashyap
2018 Journal of Big Data  
Many MapReduce-based sequence alignment tools, such as CloudBurst, CloudAligner, Halvade, and SparkBWA, have been proposed by various researchers in the past few years.  ...  We tested the effectiveness, efficiency, and scalability of our aligner for various standard and real-life datasets.  ...  After sequencing, mapping these read sequences onto a reference genome is the most important task in a sequence analysis work-flow.  ... 
doi:10.1186/s40537-018-0114-y fatcat:ooj6fe62dza4ncpsxt5kgbcnu4

Scalability and Validation of Big Data Bioinformatics Software

Andrian Yang, Michael Troup, Joshua W.K. Ho
2017 Computational and Structural Biotechnology Journal  
(scalability) and multiple executions (validation).  ...  We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment.  ...  In particular, we will discuss how modern cloud computing technology and big data analysis frameworks, such as MapReduce and Spark, can be effectively used to deal with the scalability problem in the big  ... 
doi:10.1016/j.csbj.2017.07.002 pmid:28794828 pmcid:PMC5537105 fatcat:nnkrlwg35fd3hkpbg2jtosdicq

Gene Sequences Parallel Alignment Model Based on Multiple Inputs and Outputs

Xiaolong Feng, Jing Gao
2019 International Journal of Computers Communications & Control  
This model not only simplifies the process flow of gene sequence alignment, but also improves the performance compared with other methods.  ...  This paper describes in detail the method of manipulating gene sequences with multiple inputs and outputs modes on Hadoop platform and the design of a computing model based on this method, and proves the  ...  Halvade is a Hadoop-based gene sequence alignment framework developed with Java.  ... 
doi:10.15837/ijccc.2019.2.3539 fatcat:vsxvnxp7yvd5helxejiwltpowe

Gene Sequence Input Formatting and MapReduce Computing

Xiaolong Feng, Jing Gao (both: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China)
2019 International Journal Bioautomation  
On this basis, a MapReduce computing model was designed for distributed parallel computing of gene sequence alignment tasks.  ...  Specifically, the HDFS is a distributed file system that stores files with data blocks in distributed cluster, and ensures the data validity with a good fault-tolerance mechanism.  ...  of gene sequence analysis.  ... 
doi:10.7546/ijba.2019.23.2.000675 fatcat:b6ldrj2s4bb35nnspelcmuqxqm

A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Tahir, Sardaraz
2020 Genes  
Experimental results analysis shows that the proposed workflow outperforms existing frameworks e.g., GATK, FaSD, Heap integrated with BWA and Bowtie aligners, SparkGA, and Halvade.  ...  In this research, we propose a fast and scalable workflow that integrates Bowtie aligner with Hadoop based Heap SNP caller to improve the SNPs detection in genome sequences.  ...  Halvade uses a Hadoop MapReduce based approach for genome analysis, where the variant calling is carried out via chromosome divisions.  ... 
doi:10.3390/genes11020166 pmid:32033366 pmcid:PMC7074349 fatcat:nuh3dedypvfgfn52czvwg6dss4

Falco: A quick and flexible single-cell RNA-seq processing framework on the cloud [article]

Andrian Yang, Michael Troup, Peijie Lin, Joshua W. K. Ho
2016 bioRxiv   pre-print
Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability.  ...  Falco also allows users to utilise the low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis.  ...  Summary Falco is a cloud-based framework that enables massively parallelised sequence alignment, quality control, and feature quantification of single-cell transcriptomic data in AWS cloud-computing environment  ... 
doi:10.1101/064006 fatcat:qwnl5j6klnf7ji5likbx3r55pe