A parallel computational framework for ultra-large-scale sequence clustering analysis
2018
Bioinformatics
Apache Spark is a fast and general engine for large-scale data processing, which provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. ...
Most existing parallel de novo OTU picking methods utilize the Message Passing Interface (MPI) for speed-up in distributed computing environments [1, 5, 8]. ...
Spark MLlib: Another advantage of Apache Spark is that it ships with a set of built-in libraries, which can significantly simplify the construction of large-scale computational pipelines. ...
doi:10.1093/bioinformatics/bty617
pmid:30010718
fatcat:xtc22y4jrreavjvzwovu244nmy
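The abstract above mentions both de novo OTU clustering and Spark MLlib. As a rough illustration of how those pieces can fit together, here is a minimal PySpark sketch that turns reads into bag-of-k-mer feature vectors and clusters them with MLlib's KMeans. The input path, k-mer length, and cluster count are assumptions for illustration; this is not the framework described in the paper.

```python
# Minimal sketch (not the paper's pipeline): cluster DNA reads with Spark MLlib
# by mapping each read to a k-mer count vector and running KMeans.
# The input path, k-mer size, and cluster count are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("seq-clustering-sketch").getOrCreate()

K = 6  # k-mer length (assumption)

def kmers(seq, k=K):
    """Decompose a read into overlapping k-mers, treated like 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Assume a plain text file with one sequence per line (no FASTA parsing here).
reads = spark.read.text("hdfs:///data/reads.txt") \
             .rdd.map(lambda row: (kmers(row.value),)) \
             .toDF(["kmers"])

# Bag-of-k-mers features, then KMeans clustering with MLlib.
cv_model = CountVectorizer(inputCol="kmers", outputCol="features").fit(reads)
features = cv_model.transform(reads)

kmeans = KMeans(k=100, seed=42, featuresCol="features")  # 100 clusters (assumption)
clusters = kmeans.fit(features).transform(features)
clusters.groupBy("prediction").count().show()
```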
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing
[article]
2017
arXiv
pre-print
Distributed and parallel computing represents a crucial technique for accelerating ultra-large sequence analyses. ...
The extreme increase in next-generation sequencing data has created a shortage of efficient alignment approaches that can cope with different types of ultra-large biological sequences. ...
Ultra-large biological sequence analysis can be addressed efficiently by assembling distributed and parallel computing systems from numerous inexpensive machines [14] [15] [16]. ...
arXiv:1704.00878v1
fatcat:ojszon3mzfetjiauzafvrb52qy
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing
2017
Algorithms for Molecular Biology
Methods: Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. ...
The extreme increase in next-generation sequencing data has created a shortage of efficient alignment approaches that can cope with different types of ultra-large biological sequences. ...
These improvements facilitate running sequence analyses on large clusters of inexpensive, low-end machines. ...
doi:10.1186/s13015-017-0116-x
pmid:29026435
pmcid:PMC5622559
fatcat:bbmuyxddnfemxjgsilb3uap54u
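HAlign-II distributes ultra-large multiple sequence alignment across parallel workers. The sketch below shows one generic pattern for parallelising the dominant cost, pairwise alignment of every sequence against a chosen centre sequence, using Biopython and local worker processes. The centre-sequence choice, scoring, and file name are assumptions; this is not HAlign-II's implementation.

```python
# A minimal sketch of one common pattern for parallelising large alignment jobs:
# align every sequence against a chosen centre sequence, distributing the pairwise
# alignments over local worker processes. This is NOT HAlign-II's code; the centre
# choice, scoring, and file name are illustrative assumptions.
from functools import partial
from multiprocessing import Pool
from Bio import SeqIO
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()  # default global alignment scoring

def align_pair(centre, seq):
    """Pairwise alignment score of one sequence against the centre sequence."""
    return aligner.score(centre, seq)

if __name__ == "__main__":
    seqs = [str(r.seq) for r in SeqIO.parse("reads.fasta", "fasta")]
    centre = max(seqs, key=len)  # naive centre choice (assumption)
    with Pool() as pool:
        scores = pool.map(partial(align_pair, centre), seqs)
    print(f"aligned {len(seqs)} sequences against the centre sequence")
```

On a real cluster the same map step would be expressed over distributed workers rather than a local process pool, but the shape of the computation is the same.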
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
2010
BMC Bioinformatics
Bioinformatics researchers are now confronted with the analysis of ultra-large-scale data sets, a problem that will only grow at an alarming rate in coming years. ...
Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters, offering fault-tolerant ...
Acknowledgements RCT thanks the Bioinformatics Open Source Conference (BOSC) for the opportunity to present a talk on this subject at the July 2010 BOSC meeting, of which this article is an expansion. ...
doi:10.1186/1471-2105-11-s12-s1
pmid:21210976
pmcid:PMC3040523
fatcat:r74rokyv6fc45dfid2fd6qgk64
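For readers unfamiliar with the MapReduce model this overview surveys, a minimal Hadoop Streaming example in Python is sketched below: the mapper emits k-mer counts from reads and the reducer sums them. The streaming jar path, input layout, and k-mer size are assumptions, not anything prescribed by the article.

```python
# Two separate scripts, mapper.py and reducer.py -- a minimal Hadoop Streaming
# sketch (not from the paper): count 6-mer occurrences across reads supplied one
# sequence per input line. Run with something like:
#   hadoop jar hadoop-streaming.jar -input reads.txt -output kmer_counts \
#       -mapper mapper.py -reducer reducer.py
# (jar path, input layout, and k-mer size are illustrative assumptions)

# ---- mapper.py ----
import sys

K = 6
for line in sys.stdin:
    seq = line.strip().upper()
    for i in range(len(seq) - K + 1):
        print(f"{seq[i:i + K]}\t1")   # emit (k-mer, 1) pairs

# ---- reducer.py ----
import sys

current, total = None, 0
for line in sys.stdin:                # input arrives sorted by key
    kmer, count = line.rstrip("\n").split("\t")
    if kmer != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = kmer, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```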
Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends
2014
BioData Mining
MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. ...
data processing by replicating the computing tasks and cloning the data chunks onto different computing nodes across the cluster; 2) high-throughput data processing via a batch processing framework ...
Table excerpt: ... [68] (2013; MapReduce algorithms; enhanced algorithm); Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing [69] (2013; cloud; whole-genome sequencing); study ...
doi:10.1186/1756-0381-7-22
pmid:25383096
pmcid:PMC4224309
fatcat:zpis7kklerh2vna5le2gtxc5vi
Reconstructing evolutionary trees in parallel for massive sequences
2017
BMC Systems Biology
Building evolutionary trees for massive sets of unaligned DNA sequences is crucial but challenging, and reconstructing a tree for ultra-large sequence sets is especially hard. ...
Clustering and multiple sequence alignment are performed in parallel, and the neighbour-joining model is employed for evolutionary tree building. ...
Moreover, evolutionary networks are superior to trees for large-scale and complex evolutionary analysis. Our parallel strategy also suits network reconstruction. ...
doi:10.1186/s12918-017-0476-3
pmid:29297337
pmcid:PMC5751538
fatcat:czh7v2xwdnbsxotug5hgrxgpfq
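The abstract describes clustering and multiple sequence alignment performed in parallel, with neighbour-joining used for tree building. A minimal sketch of that pattern, assuming each cluster already has an aligned FASTA file, is given below using Biopython's distance-based neighbour-joining; it is a generic illustration, not the paper's tool, and the file names and distance model are assumptions.

```python
# Minimal sketch (not the paper's implementation): build a neighbour-joining tree
# for each pre-computed cluster of aligned sequences, processing clusters in
# parallel. Alignment file names and the 'identity' distance model are assumptions.
from multiprocessing import Pool
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

def nj_tree_for_cluster(alignment_path):
    """Read one cluster's multiple sequence alignment and return its NJ tree."""
    alignment = AlignIO.read(alignment_path, "fasta")
    distances = DistanceCalculator("identity").get_distance(alignment)
    return DistanceTreeConstructor().nj(distances)

if __name__ == "__main__":
    cluster_alignments = ["cluster_0.aln.fasta", "cluster_1.aln.fasta"]  # assumed inputs
    with Pool() as pool:
        trees = pool.map(nj_tree_for_cluster, cluster_alignments)
    for path, tree in zip(cluster_alignments, trees):
        print(path, tree.count_terminals(), "leaves")
```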
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
2013
IOSR Journal of Computer Engineering
Researchers now face the analysis of such ultra-large-scale data sets, a problem that will only grow at an alarming rate in coming years. ...
A Rocks cluster is a viable solution in such scenarios. Rocks Cluster Distribution, originally called NPACI Rocks, is a Linux distribution intended for high-performance computing clusters. ...
Acknowledgements We are greatly indebted to the college management and the faculty members for providing necessary facilities and hardware along with timely guidance and suggestions for implementing this ...
doi:10.9790/0661-1468993
fatcat:jimeeynycbdg5fs4iruvfkmpsm
Challenges and approaches for distributed workflow-driven analysis of large-scale biological data
2012
Proceedings of the 2012 Joint EDBT/ICDT Workshops on - EDBT-ICDT '12
Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. ...
Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor ...
To date, there have been a number of studies on data-intensive analysis of large-scale bioinformatics datasets on cloud computing platforms. ...
doi:10.1145/2320765.2320791
dblp:conf/edbt/AltintasWCL12
fatcat:lot2dlhp4fh45izbyqdiw3ta2y
A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script
2012
BMC Medical Informatics and Decision Making
MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large-scale data into wide-table format ...
Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. ...
For such purposes, we need a more user-friendly framework that allows iterative, easy, and quick transformation of ultra-large scale administrative data into an analytic dataset. ...
doi:10.1186/1472-6947-12-151
pmid:23259862
pmcid:PMC3545829
fatcat:qu6zz4vrwjdsjcm5mw7sxj3dqq
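The paper's contribution is a Pig Latin based MapReduce script for turning long-format administrative records into a wide table. The same long-to-wide transformation can be sketched with a PySpark DataFrame pivot, shown below purely as a conceptual illustration; the column names and aggregation are assumptions, and this is not the paper's script.

```python
# Conceptual sketch of the long-to-wide transformation the paper performs with
# Pig Latin, expressed here with a PySpark DataFrame pivot instead. Column names
# (patient_id, drug_code, dose) and the aggregation are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide-table-sketch").getOrCreate()

# Long format: one row per (patient, drug) administration record.
long_df = spark.createDataFrame(
    [("p1", "drugA", 10.0), ("p1", "drugB", 5.0), ("p2", "drugA", 20.0)],
    ["patient_id", "drug_code", "dose"],
)

# Wide format: one row per patient, one column per drug code.
wide_df = (long_df.groupBy("patient_id")
                  .pivot("drug_code")
                  .agg(F.sum("dose")))
wide_df.show()
```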
SDM center technologies for accelerating scientific discoveries
2007
Journal of Physics, Conference Series
Our future focus is on improving the SDM framework to address the needs of ultra-scale science during SciDAC-2. ...
With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive, end-to-end ...
Two major levels of parallelism are supported: data parallelism (k-means clustering, Principal Component Analysis, Hierarchical Clustering, Distance matrix, Histogram) and task parallelism (Likelihood ...
doi:10.1088/1742-6596/78/1/012068
fatcat:yjckghdvgzb5toc7v2zllnj7xe
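The abstract distinguishes data parallelism (the same operation applied to partitions of one dataset, as in k-means or PCA) from task parallelism (independent analyses run concurrently). A toy sketch of the distinction, using only the Python standard library and unrelated to the SDM framework itself, follows.

```python
# Toy sketch of the two levels of parallelism named in the abstract (not the SDM
# framework): data parallelism applies one operation to chunks of a dataset;
# task parallelism runs a different, independent analysis concurrently.
from concurrent.futures import ProcessPoolExecutor
import statistics

data = list(range(1_000_000))
# Equal-sized chunks, so the mean of chunk means equals the overall mean.
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

def chunk_mean(chunk):
    """Same operation on different data partitions -> data parallelism."""
    return statistics.fmean(chunk)

def last_digit_histogram(values):
    """A second, independent analysis running alongside -> task parallelism."""
    counts = [0] * 10
    for v in values:
        counts[v % 10] += 1
    return counts

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partial_means = list(pool.map(chunk_mean, chunks))   # data-parallel step
        hist_future = pool.submit(last_digit_histogram, data)  # task-parallel step
        print(statistics.fmean(partial_means), hist_future.result())
```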
Boa: A language and infrastructure for analyzing ultra-large-scale software repositories
2013
2013 35th International Conference on Software Engineering (ICSE)
In today's software-centric world, ultra-large-scale software repositories, e.g. ...
However, systematic extraction of relevant data from these repositories and analysis of such data for testing hypotheses is hard, and best left for mining software repository (MSR) experts! ...
Introduction: Ultra-large-scale software repositories, e.g. ...
doi:10.1109/icse.2013.6606588
dblp:conf/icse/0001NRN13
fatcat:jhx5nyqlxbekbfbz4ooiwmdhni
Research on Intrusion Detection Algorithm of User Data based on Cloud Computing
2015
International Journal of Security and Its Applications
Cloud computing is a new computing model in which large-scale computing resources are effectively interconnected and integrated, and the resulting resources are made available to users in the form of services. ...
for a parallel implementation of MapReduce, to solve the clustering problem for massive data. ...
But cloud computing can greatly compensate for these shortcomings: it provides ultra-large-scale computing capacity and large storage capacity, and can be applied to behavioral event collection, correlation analysis ...
doi:10.14257/ijsia.2015.9.9.24
fatcat:bostcsvxyba27pioeenp5twvta
Cloud Computing for Next-Generation Sequencing Data Analysis
[chapter]
2017
Cloud Computing - Architecture and Applications
share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis. ...
Fortunately, cloud computing has recently emerged as a viable option to quickly and easily acquire the computational resources for large-scale NGS data analyses. ...
The use of large datasets, the demanding analysis algorithms, and the urgent need for computational resources, make large-scale sequencing projects an attractive test-case for cloud computing. ...
doi:10.5772/66732
fatcat:2ewdbtp2bjhx7j7tj4e3auwqke
SDAFT: A novel scalable data access framework for parallel BLAST
2014
Parallel Computing
In this paper, we develop a scalable data access framework to solve the data movement problem for scientific applications that are dominated by "read" operations for data analysis. ...
SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. ...
Acknowledgments This material is based upon work supported by the National Science Foundation under the following NSF program: Parallel Reconfigurable Observational Environment for Data Intensive Super-Computing ...
doi:10.1016/j.parco.2014.08.001
fatcat:kwyeqyahbbfifcg4twh7aovtcq
[Another record of the same work; abstract text identical to the entry above.]
doi:10.1145/2534645.2534647
dblp:conf/sc/YinZWF13
fatcat:nih6b4v2nrfpnhyuw6egisdeq4
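SDAFT's contribution is scalable, locality-aware data access for parallel BLAST over a distributed file system. As a much simpler point of comparison, the sketch below parallelises BLAST the naive way: split the query FASTA into chunks and run independent blastn processes concurrently. The database name, file paths, and chunk size are assumptions, NCBI BLAST+ is assumed to be installed, and this does not model SDAFT's data placement.

```python
# A simple, generic parallel-BLAST sketch (NOT SDAFT): split the query FASTA into
# chunks and run independent blastn processes concurrently. Database name, paths,
# and chunk size are illustrative assumptions; requires NCBI BLAST+ on the PATH.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from Bio import SeqIO

def split_queries(fasta_path, chunk_size=1000):
    """Write query chunks to separate FASTA files and return their paths."""
    records = list(SeqIO.parse(fasta_path, "fasta"))
    paths = []
    for i in range(0, len(records), chunk_size):
        path = f"chunk_{i // chunk_size}.fasta"
        SeqIO.write(records[i:i + chunk_size], path, "fasta")
        paths.append(path)
    return paths

def run_blast(chunk_path):
    """Run one blastn process on a query chunk against a pre-built database."""
    out_path = chunk_path + ".tsv"
    subprocess.run(
        ["blastn", "-query", chunk_path, "-db", "reference_db",
         "-outfmt", "6", "-out", out_path],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    chunks = split_queries("queries.fasta")
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_blast, chunks))
    print("result files:", results)
```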
Showing results 1–15 of 12,617