SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

Roberto R. Exposito, Roi Galego-Torreiro, Jorge Gonzalez-Dominguez
2020 IEEE Access  
., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve subsequent downstream analyses, while providing a simple and user-friendly graphical interface  ...  Furthermore, SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory systems such as clusters by relying on the open-source Apache Spark cluster computing  ...  In terms of functionality, FastQC does not have trimming and filtering features, whereas Trimmomatic is focused on just one operation type (trimming), and PEAT provides very few filter options to the users  ... 
doi:10.1109/access.2020.3015016 fatcat:rjk3db3fxvgztf3njwy73pb4ea

Parallel bi-objective evolutionary algorithms for scalable feature subset selection via migration strategy under Spark [article]

Yelleti Vivek, Vadlamani Ravi, P. Radha Krishna
2022 arXiv   pre-print
In the first-of-its-kind study, we propose and develop an iterative MapReduce-based framework for bi-objective evolutionary algorithms (EAs) based wrappers under Apache Spark with the migration strategy  ...  Feature subset selection (FSS) for classification is inherently a bi-objective optimization problem, where the task is to obtain a feature subset which yields the maximum possible area under the receiver  ...  ., (i) filter, (ii) wrapper, and (iii) embedded approaches. Filter approaches select the features based on the performance of a statistical measure, regardless of the employed model.  ... 
arXiv:2205.09465v1 fatcat:qvdqfzyikjhxdgyfhy6tvsmfsi

Parallelized Classification of Cancer Sub-types from Gene Expression Profiles Using Recursive Gene Selection

Lokeswari VENKATARAMANA, Shomona Gracia JACOB, Rajavel RAMADOSS
2019 Studies in Informatics and Control  
A comparison was drawn between the non-parallelized classification model on Weka and the parallelized classification model on Spark.  ...  The Recursive Feature Selection (RFS) method is proposed as it repeatedly performs the gene selection process until the best gene subset is found.  ...  The algorithm could read data in a distributed form and performed parallel feature selection in both symmetric multiprocessing modes via multithreading and massively parallel processing.  ... 
doi:10.24846/v27i2y201809 fatcat:pgn5dov43vbnxptc36ejqu2v7e

Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

Thomas R. Devine, Katerina Goseva-Popstojanova, Di Pang
2018 Proceedings of the 47th International Conference on Parallel Processing - ICPP 2018  
Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54%  ...  Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation  ...  We also thank the reviewers of this paper for their helpful comments.  ... 
doi:10.1145/3225058.3225101 dblp:conf/icpp/DevineGP18 fatcat:md4eymo4rngs3oqea5grovmsxa

Scalable Feature Subset Selection for Big Data using Parallel Hybrid Evolutionary Algorithm based Wrapper in Apache Spark [article]

Yelleti Vivek, Vadlamani Ravi, Pisipati Radhakrishna
2022 arXiv   pre-print
This limitation motivated us to propose a wrapper for feature subset selection (FSS) based on parallel and distributed hybrid evolutionary algorithms (EAs) under the Apache Spark environment.  ...  Owing to the emergence of large datasets, applying current sequential wrapper-based feature subset selection (FSS) algorithms increases the complexity.  ...  (iv) To achieve scalability and algorithm parallelization, we proposed a novel MapReduce-multithread based framework.  ... 
arXiv:2106.14007v3 fatcat:lbx5gmkvq5cihftoit5237zpem

AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing [article]

Vinu E. Venugopal, Martin Theobald, Samira Chaychi, Amal Tawakuli
2020 arXiv   pre-print
Our experiments over a variety of benchmark settings confirm that AIR outperforms Spark and Flink in terms of latency and throughput by a factor of up to 15; moreover, we demonstrate that AIR scales out  ...  In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI), pthreads for multithreading, and is directly deployed on  ...  We also thank the HPC team of the University of Luxembourg for their timely help and support.  ... 
arXiv:2001.00164v2 fatcat:yyi4l2qx4jgcdla5wphpbxzmb4

Comparative Analysis of Collaborative Filtering on GraphLab, MLlib and Mahout

Abdul Samad, Dr. Syed Saif-ur-Rahman
2015 Journal of Independent Studies and Research - Computing  
In this study, the data loading, model generation, recommendation implementation and accuracy of same algorithm on some major tools and libraries (GraphLab, Mahout-Hadoop, Mahout-Spark and MLLib) has been  ...  Recommendation systems are used in various online shops (E-Commerce application) and decision making systems. Recommendation is a particular form of information filtering.  ...  , Multithreading and disk IO usage.  ... 
doi:10.31645/jisrc/(2015).13.1.0001 fatcat:bj76wgqerna7favzorey4hbasm

Collaborative Filtering Recommendation Using Nonnegative Matrix Factorization in GPU-Accelerated Spark Platform

Bing Tang, Linyao Kang, Li Zhang, Feiyan Guo, Haiwu He, Shah Nazir
2021 Scientific Programming  
Furthermore, a GPU-accelerated NMF-based parallel collaborative filtering (CF) algorithm is also proposed, utilizing the advantages of data dimensionality reduction and feature extraction of NMF, as well  ...  Using real MovieLens data sets, experimental results have shown that the parallelization of NMF-based collaborative filtering on Spark platform effectively outperforms traditional user-based and item-based  ...  Parallel and Distributed Collaborative Filtering.  ... 
doi:10.1155/2021/8841133 fatcat:6tf7qm7zwzce3ebcas7r6dvum4

Novel functional and distributed approaches to data analysis available in ROOT

G. Amadio, J. Blomer, P. Canal, G. Ganis, E. Guiraud, P. Mato Vila, L. Moneta, D. Piparo, E. Tejedor, X. Valls Pla
2018 Journal of Physics, Conference Series  
and size of the datasets.  ...  The design choices behind this new interface are described also comparing with other widely adopted tools such as Pandas and Apache Spark.  ...  Examples of transformations are the application of a filter to select entries, the creation of a new column, also based on the content of other existing columns, the caching of the dataset in memory or  ... 
doi:10.1088/1742-6596/1085/4/042008 fatcat:bmiaqp46l5bqvbshopdloaknsm

Parallel computing for genome sequence processing

You Zou, Yuejie Zhu, Yaohang Li, Fang-Xiang Wu, Jianxin Wang
2021 Briefings in Bioinformatics  
Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and further analyzed with respect to its features.  ...  Finally, we discuss the limitations and future trends of parallel computing technologies.  ...  Funding This work is supported in part by the National Natural Science Foundation of China under grants (Nos U1909208, 61732009, 61772557), Hunan Provincial Science and Technology Program (No. 2018WK4001  ... 
doi:10.1093/bib/bbab070 pmid:33822883 fatcat:a4hj2fhybrc6zlsq6xyiu6snmy

Weld: Rethinking the Interface Between Data-Intensive Applications [article]

Shoumik Palkar, James Thomas, Deepak Narayanan, Anil Shanbhag, Rahul Palamuttam, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Samuel Madden, Matei Zaharia
2017 arXiv   pre-print
Weld can be integrated into existing frameworks such as Spark, TensorFlow, Pandas and NumPy without changing their user-facing APIs.  ...  Even when each function is optimized in isolation, the performance of the combined application can be an order of magnitude below hardware limits due to extensive data movement across these functions.  ...  This research was supported in part by affiliate members and other supporters of the Stanford DAWN  ... 
arXiv:1709.06416v2 fatcat:nda4d32uafctpcql743lx6qofy

Evaluating end-to-end optimization for data analytics applications in weld

Shoumik Palkar, Saman Amarasinghe, Samuel Madden, Matei Zaharia, James Thomas, Deepak Narayanan, Pratiksha Thaker, Rahul Palamuttam, Parimajan Negi, Anil Shanbhag, Malte Schwarzkopf, Holger Pirk
2018 Proceedings of the VLDB Endowment  
Modern analytics applications use a diverse mix of libraries and functions.  ...  Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at runtime in  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.  ... 
doi:10.14778/3213880.3213890 fatcat:oesslpgfy5awlb32xnylmjlnoa

The ALICE Analysis Framework for LHC Run 3

Dario Berzano, Roel Deckers, Costin Grigoraș, Michele Floris, Peter Hristov, Mikolaj Krzewicki, Markus Zimmermann, A. Forti, L. Betev, M. Litmaath, O. Smirnova, P. Hristov
2019 EPJ Web of Conferences  
Analysis Facilities and the development of the Analysis Framework.  ...  We present the prototype of a new Analysis Object Data format based on timeframes and optimized for continuous readout. Such format is designed to be extensible and transported efficiently over the network.  ...  The current framework is considered successful from the performance and usability perspectives because it factors out critical parts and hides their complexity and optimization to the user: the Run 3 analysis  ... 
doi:10.1051/epjconf/201921405045 fatcat:gaxaya3xafhabnevxgko7fxiz4

An Experimental Analysis on Scalable Implementations of the Alternating Least Squares Algorithm

Dânia Meira, José Viterbo, Flavia Bernardini
2018 Proceedings of the 2018 Federated Conference on Computer Science and Information Systems  
The use of the latent factor models technique overcomes two major problems of most collaborative filtering approaches: scalability and sparseness of the user's profile matrix.  ...  In this work we propose a methodology for comparing the performance of two parallel implementations of the ALS algorithm, one executed with MapReduce in Apache Hadoop framework and another executed in  ...  ALS Spark implementation predicted rating between user and item is a dot product of the user's feature vector and the item's feature vector. C.  ... 
doi:10.15439/2018f166 dblp:conf/fedcsis/MeiraVB18 fatcat:n3s2rh64sfg6niyz77q7ns56ty

High-performance Overlay Analysis of Massive Geographic Polygons That Considers Shape Complexity in a Cloud Environment

Zhao, Jin, Fan, Song, Zhou, Jiang
2019 ISPRS International Journal of Geo-Information  
Considering the influence of the shape complexity of polygons on the performance of overlay analysis, we design and implement a parallel processing algorithm based on the Spark paradigm in this paper.  ...  Based on the analysis of the shape complexity of polygons, the overlay analysis speed is improved via reasonable data partition, distributed spatial index, a minimum boundary rectangular filter and other  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/ijgi8070290 fatcat:oeaupgmv3fe4rghmiuolkxw6yy