
A parallel computational framework for ultra-large-scale sequence clustering analysis

Wei Zheng, Qi Mao, Robert J Genco, Jean Wactawski-Wende, Michael Buck, Yunpeng Cai, Yijun Sun, Inanc Birol
2018 Bioinformatics  
Apache Spark is a fast and general engine for large-scale data processing, which provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.  ...  Most existing parallel de novo OTU picking methods utilized message passing interface (MPI) for speed-up in a distributed computing environment [1, 5, 8].  ...  Spark MLlib Another advantage of using Apache Spark is that it is equipped with a bunch of built-in libraries, which can significantly simplify the construction of large-scale computational pipelines.  ... 
doi:10.1093/bioinformatics/bty617 pmid:30010718 fatcat:xtc22y4jrreavjvzwovu244nmy

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing [article]

Shixiang Wan, Quan Zou
2017 arXiv   pre-print
Distributed and parallel computing represents a crucial technique for accelerating ultra-large sequence analyses.  ...  Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types.  ...  Ultra-large biological sequence analysis can be efficiently addressed by assembling distributed and parallel computing systems with numerous cheap devices [14] [15] [16] .  ... 
arXiv:1704.00878v1 fatcat:ojszon3mzfetjiauzafvrb52qy

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing

Shixiang Wan, Quan Zou
2017 Algorithms for Molecular Biology  
Methods: Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses.  ...  Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types.  ...  These improvements facilitate running of sequence analysis on clusters comprising cheap large-scale and low-end machines.  ... 
doi:10.1186/s13015-017-0116-x pmid:29026435 pmcid:PMC5622559 fatcat:bbmuyxddnfemxjgsilb3uap54u

An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

Ronald C Taylor
2010 BMC Bioinformatics  
Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years.  ...  Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant  ...  Acknowledgements RCT thanks the Bioinformatics Open Source Conference (BOSC) for the opportunity to present a talk on this subject at the July 2010 BOSC meeting, of which this article is an expansion.  ... 
doi:10.1186/1471-2105-11-s12-s1 pmid:21210976 pmcid:PMC3040523 fatcat:r74rokyv6fc45dfid2fd6qgk64

Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends

Emad A Mohammed, Behrouz H Far, Christopher Naugler
2014 BioData Mining  
MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters.  ...  data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework  ...  /[68] 2013 MapReduce algorithms Enhanced algorithm Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing/[69] 2013 Cloud Whole-genome sequencing Study  ... 
doi:10.1186/1756-0381-7-22 pmid:25383096 pmcid:PMC4224309 fatcat:zpis7kklerh2vna5le2gtxc5vi

Reconstructing evolutionary trees in parallel for massive sequences

Quan Zou, Shixiang Wan, Xiangxiang Zeng, Zhanshan Sam Ma
2017 BMC Systems Biology  
Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard.  ...  Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building.  ...  Moreover, evolutionary networks are superior to trees for large-scale and complex evolutionary analysis. Our parallel strategy also suits network reconstruction.  ... 
doi:10.1186/s12918-017-0476-3 pmid:29297337 pmcid:PMC5751538 fatcat:czh7v2xwdnbsxotug5hgrxgpfq

Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster

Bincy P Andrews
2013 IOSR Journal of Computer Engineering  
Researchers are now facing problems with the analysis of such ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years.  ...  Rocks cluster is a viable solution in such scenarios. Rocks Cluster Distribution originally called NPACI Rocks is a Linux distribution intended for high-performance computing clusters.  ...  Acknowledgements We are greatly indebted to the college management and the faculty members for providing necessary facilities and hardware along with timely guidance and suggestions for implementing this  ... 
doi:10.9790/0661-1468993 fatcat:jimeeynycbdg5fs4iruvfkmpsm

Challenges and approaches for distributed workflow-driven analysis of large-scale biological data

Ilkay Altintas, Jianwu Wang, Daniel Crawl, Weizhong Li
2012 Proceedings of the 2012 Joint EDBT/ICDT Workshops on - EDBT-ICDT '12  
Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data.  ...  Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor  ...  To date, there have been a number of studies for data-intensive analysis of large-scale bioinformatics datasets on Cloud computing platforms.  ... 
doi:10.1145/2320765.2320791 dblp:conf/edbt/AltintasWCL12 fatcat:lot2dlhp4fh45izbyqdiw3ta2y

A user-friendly tool to transform large scale administrative data into wide table format using a mapreduce program with a pig latin based script

Hiromasa Horiguchi, Hideo Yasunaga, Hideki Hashimoto, Kazuhiko Ohe
2012 BMC Medical Informatics and Decision Making  
MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large scale data into wide table format  ...  Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand.  ...  For such purposes, we need a more user-friendly framework that allows iterative, easy, and quick transformation of ultra-large scale administrative data into an analytic dataset.  ... 
doi:10.1186/1472-6947-12-151 pmid:23259862 pmcid:PMC3545829 fatcat:qu6zz4vrwjdsjcm5mw7sxj3dqq

SDM center technologies for accelerating scientific discoveries

Arie Shoshani, Ilkay Altintas, Alok Choudhary, Terence Critchlow, Chandrika Kamath, Bertram Ludäscher, Jarek Nieplocha, Steve Parker, Rob Ross, Nagiza Samatova, Mladen Vouk
2007 Journal of Physics, Conference Series  
Our future focus is on improving the SDM framework to address the needs of ultra-scale science during SciDAC-2.  ...  With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive, end-to-end  ...  Two major levels of parallelism are supported: data parallelism (k-means clustering, Principal Component Analysis, Hierarchical Clustering, Distance matrix, Histogram) and task parallelism (Likelihood  ... 
doi:10.1088/1742-6596/78/1/012068 fatcat:yjckghdvgzb5toc7v2zllnj7xe

Boa: A language and infrastructure for analyzing ultra-large-scale software repositories

Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, Tien N. Nguyen
2013 2013 35th International Conference on Software Engineering (ICSE)  
In today's software-centric world, ultra-large-scale software repositories, e.g.  ...  However, systematic extraction of relevant data from these repositories and analysis of such data for testing hypotheses is hard, and best left for mining software repository (MSR) experts!  ...  INTRODUCTION Ultra-large-scale software repositories, e.g.  ... 
doi:10.1109/icse.2013.6606588 dblp:conf/icse/0001NRN13 fatcat:jhx5nyqlxbekbfbz4ooiwmdhni

Research on Intrusion Detection Algorithm of User Data based on Cloud Computing

Hongdong Zhang, Yuli Song
2015 International Journal of Security and Its Applications  
Cloud computing is a new computing model, it will be large-scale computing resource interconnection were effectively integrated, and the computing resources available to users in the form of services.  ...  for parallel implementation of map reduce, to solve the clustering problem of the magnanimity data.  ...  But cloud computing can greatly compensate for these shortcomings, it provides ultra large scale computing capacity and large storage capacity, can be in the behavioral event collection, correlation analysis  ... 
doi:10.14257/ijsia.2015.9.9.24 fatcat:bostcsvxyba27pioeenp5twvta

Cloud Computing for Next-Generation Sequencing Data Analysis [chapter]

Shanrong Zhao, Kirk Watrous, Chi Zhang, Baohong Zhang
2017 Cloud Computing - Architecture and Applications  
share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis.  ...  Fortunately, cloud computing has recently emerged as a viable option to quickly and easily acquire the computational resources for large-scale NGS data analyses.  ...  The use of large datasets, the demanding analysis algorithms, and the urgent need for computational resources, make large-scale sequencing projects an attractive test-case for cloud computing.  ... 
doi:10.5772/66732 fatcat:2ewdbtp2bjhx7j7tj4e3auwqke

SDAFT: A novel scalable data access framework for parallel BLAST

Jiangling Yin, Junyao Zhang, Jun Wang, Wu-chun Feng
2014 Parallel Computing  
In this paper, we develop a scalable data access framework to solve the data movement problem for scientific applications that are dominated by "read" operation for data analysis.  ...  SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches.  ...  Acknowledgments This material is based upon work supported by the National Science Foundation under the following NSF program: Parallel Reconfigurable Observational Environment for Data Intensive Super-Computing  ... 
doi:10.1016/j.parco.2014.08.001 fatcat:kwyeqyahbbfifcg4twh7aovtcq

SDAFT

Jiangling Yin, Junyao Zhang, Jun Wang, Wu-chun Feng
2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems - DISCS-2013  
In this paper, we develop a scalable data access framework to solve the data movement problem for scientific applications that are dominated by "read" operation for data analysis.  ...  SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches.  ...  Acknowledgments This material is based upon work supported by the National Science Foundation under the following NSF program: Parallel Reconfigurable Observational Environment for Data Intensive Super-Computing  ... 
doi:10.1145/2534645.2534647 dblp:conf/sc/YinZWF13 fatcat:nih6b4v2nrfpnhyuw6egisdeq4