A parallel computational framework for ultra-large-scale sequence clustering analysis

Wei Zheng, Qi Mao, Robert J Genco, Jean Wactawski-Wende, Michael Buck, Yunpeng Cai, Yijun Sun, Inanc Birol
2018 Bioinformatics  
We implemented the proposed method in Scala V2.11.8 on Apache Spark V2.0.2. Apache Spark is a fast, general-purpose engine for large-scale data processing that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can run on Hadoop, Mesos, in standalone mode, or in the cloud, and can access diverse data sources, including HDFS, Cassandra, HBase and S3. Most existing parallel de novo OTU picking methods use the message passing interface (MPI) for speed-up in a distributed computing environment [1, 5, 8]. While MPI enables message communication between computational nodes over a network, it lacks job scheduling and fault recovery. Since our method fits easily into the MapReduce model, the low-level flexibility offered by MPI becomes less appealing. By using the high-level and portable Apache Spark, our method is scalable, fault-tolerant, and compatible with different file systems.

In addition, Apache Spark supports several programming languages, including Python, R and Scala. We chose Scala because Apache Spark is built around data transformation and mapping concepts, which are naturally supported by functional programming languages such as Scala. Moreover, Scala is a JVM-native language, running on the same platform as Spark itself, and is thus much more efficient than Python and R in Spark.

Apache Spark also provides a programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. The RDD addresses a limitation of the MapReduce cluster computing paradigm, which always forces a program to read input data from disk. In our method, landmarks are selected in an iterative fashion. Frequent access to data held in memory rather than on disk saves a large amount of computational time by avoiding unnecessary I/O operations.
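
To make the point about iterative access to in-memory data concrete, the following is a minimal sketch of an iterative selection loop over a cached RDD. It is not the authors' algorithm: the greedy rule (take the first remaining sequence as a landmark and drop everything within a threshold of it), the `mismatchFraction` distance, the names `LandmarkSketch` and `selectLandmarks`, the toy sequences, and the threshold of 0.2 are all assumptions chosen for illustration. It also assumes the Spark SQL library is on the classpath and runs in local mode.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LandmarkSketch {
  // Hypothetical distance: fraction of mismatched positions between two
  // equal-length sequences. A stand-in, not the paper's distance measure.
  def mismatchFraction(a: String, b: String): Double =
    a.zip(b).count { case (x, y) => x != y }.toDouble / a.length

  // Hypothetical greedy selection: each pass picks one landmark and filters
  // out the sequences it covers. Every pass rescans the remaining data, so
  // keeping that data cached in memory avoids rereading it from disk.
  def selectLandmarks(seqs: RDD[String], threshold: Double): List[String] = {
    var remaining = seqs.cache()
    var landmarks = List.empty[String]
    while (!remaining.isEmpty()) {
      val landmark = remaining.first()
      landmarks = landmark :: landmarks
      // Keep only sequences farther than the threshold from the new landmark.
      val next = remaining
        .filter(s => mismatchFraction(s, landmark) > threshold)
        .cache()
      next.count()          // force evaluation so the filtered data is cached
      remaining.unpersist() // release the superseded dataset (including the input)
      remaining = next
    }
    landmarks.reverse
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("landmark-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Toy input; a real run would load sequences from HDFS, S3, etc.
    val seqs = spark.sparkContext.parallelize(
      Seq("ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG", "GGGGAAAA"))

    println(selectLandmarks(seqs, threshold = 0.2))
    spark.stop()
  }
}
```

The `filter` step is the map-style transformation that Spark distributes across the cluster, while `first`, `isEmpty` and `count` are actions that return small results to the driver. Without the calls to `cache`, each pass would recompute the chain of filters from the original input.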
doi:10.1093/bioinformatics/bty617 pmid:30010718 fatcat:xtc22y4jrreavjvzwovu244nmy