SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets
2020
IEEE Access
., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve subsequent downstream analyses, while providing a simple and user-friendly graphical interface ...
Furthermore, SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory systems such as clusters by relying on the open-source Apache Spark cluster computing ...
In terms of functionality, FastQC does not have trimming and filtering features, whereas Trimmomatic is focused on just one operation type (trimming), and PEAT provides very few filter options to the users ...
doi:10.1109/access.2020.3015016
fatcat:rjk3db3fxvgztf3njwy73pb4ea
Parallel bi-objective evolutionary algorithms for scalable feature subset selection via migration strategy under Spark
[article]
2022
arXiv
pre-print
In the first-of-its-kind study, we propose and develop an iterative MapReduce-based framework for bi-objective evolutionary algorithms (EAs) based wrappers under Apache spark with the migration strategy ...
Feature subset selection (FSS) for classification is inherently a bi-objective optimization problem, where the task is to obtain a feature subset which yields the maximum possible area under the receiver ...
., (i) filter, (ii) wrapper, and (iii) embedded approaches. Filter approaches select features based on the performance of a statistical measure, regardless of the employed model. ...
arXiv:2205.09465v1
fatcat:qvdqfzyikjhxdgyfhy6tvsmfsi
Parallelized Classification of Cancer Sub-types from Gene Expression Profiles Using Recursive Gene Selection
2019
Studies in Informatics and Control
A comparison was drawn between the non-parallelized classification model on Weka and the parallelized classification model on Spark. ...
The Recursive Feature Selection (RFS) method is proposed as it repeatedly performs the gene selection process until the best gene subset is found. ...
The algorithm could read data in a distributed form and performed parallel feature selection in both symmetric multiprocessing modes via multithreading and massively parallel processing. ...
doi:10.24846/v27i2y201809
fatcat:pgn5dov43vbnxptc36ejqu2v7e
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
2018
Proceedings of the 47th International Conference on Parallel Processing - ICPP 2018
Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% ...
Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation ...
We also thank the reviewers of this paper for their helpful comments. ...
doi:10.1145/3225058.3225101
dblp:conf/icpp/DevineGP18
fatcat:md4eymo4rngs3oqea5grovmsxa
Scalable Feature Subset Selection for Big Data using Parallel Hybrid Evolutionary Algorithm based Wrapper in Apache Spark
[article]
2022
arXiv
pre-print
This limitation motivated us to propose a wrapper for feature subset selection (FSS) based on parallel and distributed hybrid evolutionary algorithms (EAs) under the Apache Spark environment. ...
Owing to the emergence of large datasets, applying current sequential wrapper-based feature subset selection (FSS) algorithms increases the complexity. ...
(iv) To achieve scalability and algorithm parallelization, we proposed a novel MapReduce-multithread based framework. ...
arXiv:2106.14007v3
fatcat:lbx5gmkvq5cihftoit5237zpem
AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing
[article]
2020
arXiv
pre-print
Our experiments over a variety of benchmark settings confirm that AIR outperforms Spark and Flink in terms of latency and throughput by a factor of up to 15; moreover, we demonstrate that AIR scales out ...
In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI), pthreads for multithreading, and is directly deployed on ...
We also thank the HPC team of the University of Luxembourg for their timely help and support. ...
arXiv:2001.00164v2
fatcat:yyi4l2qx4jgcdla5wphpbxzmb4
Comparative Analysis of Collaborative Filtering on GraphLab, MLlib and Mahout
2015
Journal of Independent Studies and Research - Computing
In this study, the data loading, model generation, recommendation implementation and accuracy of the same algorithm on some major tools and libraries (GraphLab, Mahout-Hadoop, Mahout-Spark and MLlib) has been
Recommendation systems are used in various online shops (E-Commerce application) and decision making systems. Recommendation is a particular form of information filtering. ...
, Multithreading and disk IO usage. ...
doi:10.31645/jisrc/(2015).13.1.0001
fatcat:bj76wgqerna7favzorey4hbasm
Collaborative Filtering Recommendation Using Nonnegative Matrix Factorization in GPU-Accelerated Spark Platform
2021
Scientific Programming
Furthermore, a GPU-accelerated NMF-based parallel collaborative filtering (CF) algorithm is also proposed, utilizing the advantages of data dimensionality reduction and feature extraction of NMF, as well ...
Using real MovieLens data sets, experimental results have shown that the parallelization of NMF-based collaborative filtering on Spark platform effectively outperforms traditional user-based and item-based ...
Parallel and Distributed Collaborative Filtering. ...
doi:10.1155/2021/8841133
fatcat:6tf7qm7zwzce3ebcas7r6dvum4
Novel functional and distributed approaches to data analysis available in ROOT
2018
Journal of Physics, Conference Series
and size of the datasets. ...
The design choices behind this new interface are described also comparing with other widely adopted tools such as Pandas and Apache Spark. ...
Examples of transformations are the application of a filter to select entries, the creation of a new column, also based on the content of other existing columns, the caching of the dataset in memory or ...
doi:10.1088/1742-6596/1085/4/042008
fatcat:bmiaqp46l5bqvbshopdloaknsm
Parallel computing for genome sequence processing
2021
Briefings in Bioinformatics
Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and further analyzed in terms of its features. ...
Finally, we discuss the limitations and future trends of parallel computing technologies. ...
Funding This work is supported in part by the National Natural Science Foundation of China under grants (Nos U1909208, 61732009, 61772557), Hunan Provincial Science and Technology Program (No. 2018WK4001 ...
doi:10.1093/bib/bbab070
pmid:33822883
fatcat:a4hj2fhybrc6zlsq6xyiu6snmy
Weld: Rethinking the Interface Between Data-Intensive Applications
[article]
2017
arXiv
pre-print
Weld can be integrated into existing frameworks such as Spark, TensorFlow, Pandas and NumPy without changing their user-facing APIs. ...
Even when each function is optimized in isolation, the performance of the combined application can be an order of magnitude below hardware limits due to extensive data movement across these functions. ...
This research was supported in part by affiliate members and other supporters of the Stanford DAWN ...
arXiv:1709.06416v2
fatcat:nda4d32uafctpcql743lx6qofy
Evaluating end-to-end optimization for data analytics applications in weld
2018
Proceedings of the VLDB Endowment
Modern analytics applications use a diverse mix of libraries and functions. ...
Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at runtime in ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. ...
doi:10.14778/3213880.3213890
fatcat:oesslpgfy5awlb32xnylmjlnoa
The ALICE Analysis Framework for LHC Run 3
2019
EPJ Web of Conferences
Analysis Facilities and the development of the Analysis Framework. ...
We present the prototype of a new Analysis Object Data format based on timeframes and optimized for continuous readout. Such format is designed to be extensible and transported efficiently over the network. ...
The current framework is considered successful from the performance and usability perspectives because it factors out critical parts and hides their complexity and optimization to the user: the Run 3 analysis ...
doi:10.1051/epjconf/201921405045
fatcat:gaxaya3xafhabnevxgko7fxiz4
An Experimental Analysis on Scalable Implementations of the Alternating Least Squares Algorithm
2018
Proceedings of the 2018 Federated Conference on Computer Science and Information Systems
The use of the latent factor models technique overcomes two major problems of most collaborative filtering approaches: scalability and sparseness of the user's profile matrix. ...
In this work we propose a methodology for comparing the performance of two parallel implementations of the ALS algorithm, one executed with MapReduce in Apache Hadoop framework and another executed in ...
In the ALS Spark implementation, the predicted rating between a user and an item is the dot product of the user's feature vector and the item's feature vector.
C. ...
doi:10.15439/2018f166
dblp:conf/fedcsis/MeiraVB18
fatcat:n3s2rh64sfg6niyz77q7ns56ty
High-performance Overlay Analysis of Massive Geographic Polygons That Considers Shape Complexity in a Cloud Environment
2019
ISPRS International Journal of Geo-Information
Considering the influence of the shape complexity of polygons on the performance of overlay analysis, we design and implement a parallel processing algorithm based on the Spark paradigm in this paper. ...
Based on the analysis of the shape complexity of polygons, the overlay analysis speed is improved via reasonable data partition, distributed spatial index, a minimum boundary rectangular filter and other ...
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/ijgi8070290
fatcat:oeaupgmv3fe4rghmiuolkxw6yy