
Mutation Operators for Large Scale Data Processing Programs in Spark [chapter]

João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin Alejandro Musicante
2020 Lecture Notes in Computer Science  
We propose a set of mutation operators designed for Spark programs characterized by a data flow and data processing operations.  ...  This paper proposes a mutation testing approach for big data processing programs that follow a data flow model, such as those implemented on top of Apache Spark.  ...  Testing Apache Spark Programs: Apache Spark is a general-purpose analytics engine for large-scale data processing on cluster systems [28].  ... 
doi:10.1007/978-3-030-49435-3_30 fatcat:o3sokdt6kncytou4jdclm555pe
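The snippet above describes mutation operators that target the transformations of a Spark data flow. As a hedged illustration only (the operators actually defined in the chapter may differ), the Scala sketch below shows one plausible transformation-level mutation: negating the predicate of a filter, which a good test suite should detect.

```scala
import org.apache.spark.rdd.RDD

object FilterNegationMutant {
  // Original pipeline fragment under test: keep non-empty words.
  def original(words: RDD[String]): RDD[String] =
    words.filter(w => w.nonEmpty)

  // Mutant: the filter predicate is negated. A test that compares this
  // pipeline's output against an expected dataset should "kill" the mutant.
  def mutant(words: RDD[String]): RDD[String] =
    words.filter(w => !w.nonEmpty)
}
```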

TRANSMUT-SPARK: Transformation Mutation for Apache Spark [article]

Joao Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante
2021 arXiv   pre-print
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data Processing.  ...  The paper introduces the TRANSMUT-Spark solution for testing Spark programs. TRANSMUT-Spark automates the most laborious steps and fully executes the mutation testing process.  ...  CONCLUSIONS AND FUTURE WORK: This paper introduced TRANSMUT-SPARK, a transformation mutation tool for Spark large-scale data processing programs.  ... 
arXiv:2108.02589v1 fatcat:2d37srroy5d2flctin4hqiafym
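TRANSMUT-Spark automates mutant generation, test execution, and result analysis. The minimal sketch below is not the tool's actual API; the `MutantResult` record is hypothetical. It only illustrates the bookkeeping such a pipeline ends with: a mutation score computed over the non-equivalent mutants.

```scala
// Hypothetical result record; TRANSMUT-Spark's real data structures may differ.
final case class MutantResult(id: String, killed: Boolean, equivalent: Boolean)

object MutationScore {
  // Mutation score = killed mutants / (all mutants - equivalent mutants).
  def score(results: Seq[MutantResult]): Double = {
    val considered = results.filterNot(_.equivalent)
    if (considered.isEmpty) 0.0
    else considered.count(_.killed).toDouble / considered.size
  }

  def main(args: Array[String]): Unit = {
    val results = Seq(
      MutantResult("m1", killed = true,  equivalent = false),
      MutantResult("m2", killed = false, equivalent = false),
      MutantResult("m3", killed = false, equivalent = true)
    )
    println(f"Mutation score: ${score(results)}%.2f") // 0.50
  }
}
```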

Scalable Analysis of Multi-Modal Biomedical Data [article]

Jaclyn M Smith, Yao Shi, Michael Benedikt, Milos Nikolic
2020 bioRxiv   pre-print
Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes.  ...  The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis.  ...  Acknowledgements: The authors would like to thank Omics Data Automation, Inc. for supplying hardware, compute time, and contributing to use case discussions.  ... 
doi:10.1101/2020.12.14.422781 fatcat:wscxoume7zeutbhbpxlwo5npgm

Scalable analysis of multi-modal biomedical data

Jaclyn Smith, Yao Shi, Michael Benedikt, Milos Nikolic
2021 GigaScience  
Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes.  ...  The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis.  ...  Acknowledgements: The authors thank Omics Data Automation, Inc., for supplying hardware and compute time and contributing to use case discussions.  ... 
doi:10.1093/gigascience/giab058 pmid:34508579 pmcid:PMC8434767 fatcat:fsrmqqfgcvag5dgmiqbu2t6fsy
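Both versions of this paper (preprint and journal) describe building feature vectors from multiple datasets as a typical complex transformation step. As a hedged sketch with a hypothetical schema (not the pipeline evaluated in the paper), the Spark code below joins two toy datasets and nests the per-gene records under each sample, producing one row per sample with a nested collection of features.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object FeatureVectors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("feature-vectors").getOrCreate()
    import spark.implicits._

    // Toy inputs; column names and contents are illustrative assumptions.
    val samples   = Seq(("s1", "tumor"), ("s2", "normal")).toDF("sample_id", "label")
    val mutations = Seq(("s1", "TP53", 0.8), ("s1", "KRAS", 0.4), ("s2", "TP53", 0.1))
      .toDF("sample_id", "gene", "score")

    // Nest the per-gene scores under each sample: a nested collection per row.
    val features = samples
      .join(mutations, Seq("sample_id"), "left")
      .groupBy($"sample_id", $"label")
      .agg(collect_list(struct($"gene", $"score")).as("gene_scores"))

    features.show(truncate = false)
    spark.stop()
  }
}
```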

Cloud-Based Distributed Mutation Analysis [article]

Robert Merkel, James Georgeson
2016 arXiv   pre-print
In this paper, we describe an architecture, and a prototype implementation, of such a cloud-based distributed mutation testing system.  ...  Mutation Testing is a fault-based software testing technique which is too computationally expensive for industrial use.  ...  Note that while Spark avoids resending shared data to each individual core on the node, for scheduling purposes Spark treats each processor core allocated to it as an independent processing node.  ... 
arXiv:1601.07157v2 fatcat:4phqf6swgvgcddvpnfa2o5y35i
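The snippet notes that, for scheduling, Spark treats each allocated core as an independent processing node, which makes independent mutant executions an easy fit for task-level parallelism. The sketch below is illustrative only and does not reproduce the prototype described in the paper: each element of an RDD stands for one mutant, and the test run is a placeholder function.

```scala
import org.apache.spark.sql.SparkSession

object DistributedMutation {
  // Hypothetical stand-in for "compile the mutant and run the test suite".
  def runTestsOnMutant(mutantId: Int): (Int, Boolean) = {
    val killed = mutantId % 2 == 0 // placeholder outcome
    (mutantId, killed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("distributed-mutation").getOrCreate()
    val sc = spark.sparkContext

    // One task per mutant; Spark schedules them across the available cores.
    val mutantIds = sc.parallelize(1 to 1000, numSlices = 100)
    val killedCount = mutantIds.map(runTestsOnMutant).filter(_._2).count()

    println(s"Killed $killedCount of 1000 mutants")
    spark.stop()
  }
}
```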

Implementing Parallel Differential Evolution on Spark [chapter]

Diego Teijeiro, Xoán C. Pardo, Patricia González, Julio R. Banga, Ramón Doallo
2016 Lecture Notes in Computer Science  
However, with the emergence of Cloud Computing, new programming models, like Spark, have appeared that are suited to large-scale data processing on clouds.  ...  In this paper we investigate the applicability of Spark to develop parallel DE schemes to be executed in a distributed environment.  ...  New programming models are being proposed to deal with large scale computations on commodity clusters and Cloud resources.  ... 
doi:10.1007/978-3-319-31153-1_6 fatcat:h2v7lg7sg5hv3jzbesovn74eae
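The chapter investigates ways to parallelise differential evolution with Spark. Purely as a hedged sketch of one common organisation (an island-style split, not necessarily the schemes evaluated there), the code below assigns one sub-population per partition and evolves each partition independently with mapPartitions; the local evolution step is a trivial placeholder.

```scala
import org.apache.spark.sql.SparkSession

object IslandSketch {
  type Individual = Array[Double]

  // Sphere function as a toy fitness (lower is better).
  def fitness(x: Individual): Double = x.map(v => v * v).sum

  // Placeholder for a few local DE generations evolving one island;
  // here it just ranks the island by fitness.
  def evolveIsland(island: Iterator[Individual]): Iterator[Individual] =
    island.toSeq.sortBy(fitness).iterator

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("islands").getOrCreate()
    val sc = spark.sparkContext
    val rnd = new scala.util.Random(1)

    val population = Seq.fill(400)(Array.fill(10)(rnd.nextDouble() * 10 - 5))
    val islands = sc.parallelize(population, numSlices = 8)

    // Each partition evolves independently; migration would reshuffle partitions.
    val evolved = islands.mapPartitions(evolveIsland)
    println(s"Best fitness: ${evolved.map(fitness).min()}")
    spark.stop()
  }
}
```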

BigDL: A Distributed Deep Learning Framework for Big Data [article]

Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang (+6 others)
2018 arXiv   pre-print
It is implemented on top of Apache Spark, and allows users to write their deep learning applications as standard Spark programs (running directly on large-scale big data clusters in a distributed fashion  ...  In this paper, we present BigDL, a distributed deep learning framework for Big Data platforms and workflows.  ...  To automatically parallelize the large-scale data processing across the cluster in a fault-tolerant fashion, Spark provides a functional compute model where immutable RDDs are transformed through coarse-grained  ... 
arXiv:1804.05839v3 fatcat:u5afdn37l5c7lalqxqmlj5se6e
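The BigDL snippet mentions Spark's functional model of immutable RDDs transformed by coarse-grained operations, which is what data-parallel training builds on. The sketch below is a generic illustration of that pattern, not BigDL's API: a toy one-parameter model whose weight is broadcast, whose partial gradients are computed over the data RDD, and whose update is applied on the driver.

```scala
import org.apache.spark.sql.SparkSession

object DataParallelStep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("data-parallel-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy linear model y = w * x fitted by gradient steps on squared error.
    val data = sc.parallelize(Seq((1.0, 2.0), (2.0, 4.1), (3.0, 5.9)), numSlices = 3)
    var w = 0.0
    val lr = 0.05

    for (_ <- 1 to 10) {
      val wBc = sc.broadcast(w)
      // Coarse-grained transformation: each record contributes a partial gradient.
      val (gradSum, n) = data
        .map { case (x, y) => (2.0 * (wBc.value * x - y) * x, 1L) }
        .reduce { case ((g1, n1), (g2, n2)) => (g1 + g2, n1 + n2) }
      w -= lr * gradSum / n
    }
    println(s"Fitted weight: $w")
    spark.stop()
  }
}
```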

Research on financial network big data processing technology based on fireworks algorithm

Tao Luo
2019 EURASIP Journal on Wireless Communications and Networking  
Focusing on the concept and characteristics of big data, this paper proposes an internal control system for venture capital information systems based on big data processing technology.  ...  The paper identifies critical risk control points, one by one, for the hardware, software, personnel, information, and operating rules of the venture capital targets, and examines the main risks of different  ...  assessment of large-scale data.  ... 
doi:10.1186/s13638-019-1443-z fatcat:qhossuab6nh6jmvzvff32mble4

Spark-Based Parallel Genetic Algorithm for Simulating a Solution of Optimal Deployment of an Underwater Sensor Network

Peng Liu, Shuai Ye, Can Wang, Zongwei Zhu
2019 Sensors  
dealing with large-scale data.  ...  parallel crossover, mutation, and other operations on each computing node.  ...  The funders had no role in the design of the study.  ... 
doi:10.3390/s19122717 fatcat:ydbljzdlxvf7fmxkui4flma5jq

A Hybrid Mechanism of Particle Swarm Optimization and Differential Evolution Algorithms based on Spark

2019 KSII Transactions on Internet and Information Systems  
With the onset of the big data age, data is growing exponentially, and the issue of how to optimize large-scale data processing is especially significant.  ...  Large-scale global optimization (LSGO) is a research topic of great interest in academia and industry.  ...  RDD greatly speeds up program processing, allowing Spark to be used in a variety of large-scale processing scenarios.  ... 
doi:10.3837/tiis.2019.12.010 fatcat:r7s2xplblraqrj6jdnboeg33c4
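The snippet credits RDDs for the speed-up of iterative methods on Spark. The hedged sketch below illustrates the underlying mechanism rather than the hybrid PSO/DE algorithm itself: an RDD persisted in memory is reused across iterations instead of being recomputed or re-read from storage each time.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedIterations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cached-iterations").getOrCreate()
    val sc = spark.sparkContext

    // Working set kept in memory and reused by every iteration below.
    val points = sc.parallelize(1 to 1000000).map(i => i.toDouble / 1000000)
    points.persist(StorageLevel.MEMORY_ONLY)

    var threshold = 0.5
    for (_ <- 1 to 5) {
      val kept = points.filter(_ > threshold).count()
      println(s"threshold=$threshold kept=$kept")
      threshold += 0.05
    }
    spark.stop()
  }
}
```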

An Abstract View of Big Data Processing Programs [article]

Joao Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante
2021 arXiv   pre-print
We extend the model for data processing programs proposed in [1], to enable the use of iterative programs.  ...  This paper proposes a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks.  ...  Large-scale data processing frameworks have implemented these programming models to provide execution infrastructures giving transparent access to large scale computing and memory resources.  ... 
arXiv:2108.02582v1 fatcat:fkqmm5h3gnbp7bg65oagsxcujq
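The paper proposes a framework-agnostic, data-flow based model of data processing programs. As a toy illustration only, and not the formal model defined in the paper, the sketch below represents a program as a small graph of source, transformation, and sink nodes, which is the general "data flow" view the abstract refers to.

```scala
// Toy node types for a data-flow view of a program; names are illustrative.
sealed trait Node
case class Source(name: String) extends Node
case class Transformation(name: String, inputs: List[Node]) extends Node
case class Sink(name: String, input: Node) extends Node

object DataFlowExample {
  def main(args: Array[String]): Unit = {
    val logs    = Source("logs")
    val parsed  = Transformation("map:parse", List(logs))
    val errors  = Transformation("filter:isError", List(parsed))
    val counted = Transformation("reduceByKey:count", List(errors))
    val out     = Sink("report", counted)
    println(out) // prints the nested graph structure
  }
}
```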

A Spark-based genetic algorithm for sensor placement in large scale drinking water distribution systems

Chengyu Hu, Guo Ren, Chao Liu, Ming Li, Wei Jie
2017 Cluster Computing  
Existing studies have mainly focused on sensor placement in water distribution systems (WDSs). However, the problem is still not adequately addressed, especially for large scale WSNs.  ...  In this paper, we investigate the sensor placement problem in large scale WDSs with the objective of minimizing the impact of contamination events.  ...  MapReduce and Spark: As two very popular open source cluster and cloud computing frameworks for large scale data processing, MapReduce and Spark expose a simple programming API to users.  ... 
doi:10.1007/s10586-017-0838-z fatcat:643trepexbckrn65cy2djnekdu
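The snippet points to the simple programming API that MapReduce and Spark expose. The canonical example below (Spark's RDD API in Scala, unrelated to the sensor placement algorithm itself) shows what that looks like in practice: a word count written as three chained transformations.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("word-count").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("to be or not to be", "to see or not to see"))
    val counts = lines
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))     // pair each word with a count of one
      .reduceByKey(_ + _)         // sum counts per word across the cluster

    counts.collect().foreach { case (w, n) => println(s"$w: $n") }
    spark.stop()
  }
}
```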

Scalable Querying of Nested Data [article]

Jaclyn Smith, Michael Benedikt, Milos Nikolic, Amir Shaikhha
2020 arXiv   pre-print
While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections.  ...  We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.  ...  INTRODUCTION: Large-scale, distributed data processing platforms such as Spark [59], Flink [9], and Hadoop [21] have become indispensable tools for modern data analysis.  ... 
arXiv:2011.06381v1 fatcat:7fulntcavrdgrl2zmmmyiqc52q
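The abstract concerns query processing over nested collections on distributed platforms. As a hedged sketch with a hypothetical schema, and not the compilation framework the paper proposes, the Spark code below shows one typical way such a query is written directly today: flattening a nested array with explode before aggregating.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, sum}

object NestedQuery {
  // Illustrative nested schema: each order carries an array of items.
  case class Item(product: String, qty: Int)
  case class Order(orderId: String, items: Seq[Item])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("nested-query").getOrCreate()
    import spark.implicits._

    val orders = Seq(
      Order("o1", Seq(Item("apple", 2), Item("pear", 1))),
      Order("o2", Seq(Item("apple", 5)))
    ).toDS()

    // Flatten the nested collection, then aggregate per product.
    val perProduct = orders
      .select($"orderId", explode($"items").as("item"))
      .groupBy($"item.product")
      .agg(sum($"item.qty").as("total_qty"))

    perProduct.show()
    spark.stop()
  }
}
```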

Evaluation of Parallel Differential Evolution Implementations on MapReduce and Spark [chapter]

Diego Teijeiro, Xoán C. Pardo, David R. Penas, Patricia González, Julio R. Banga, Ramón Doallo
2017 Lecture Notes in Computer Science  
Recently, new programming models are being proposed to deal with large scale computations on commodity clusters and Cloud resources.  ...  The results obtained can be particularly useful for those interested in the potential of new Cloud programming models for parallel metaheuristic methods in general and Differential Evolution in particular.  ...  Background and Related Work: Since its appearance, MapReduce [3] (MR from now on) has been the distributed programming model for processing large scale computations that has attracted the most attention.  ... 
doi:10.1007/978-3-319-58943-5_32 fatcat:sglkpkjwdfbmplx2kuvrkmwjji

[Demo] Low-latency Spark Queries on Updatable Data

Alexandru Uta, Bogdan Ghit, Ankur Dave, Peter Boncz
2019 Proceedings of the 2019 International Conference on Management of Data - SIGMOD '19  
As data science gets deployed more and more into operational applications, it becomes important for data science frameworks to be able to perform computations in interactive, sub-second time.  ...  Indexing and caching are two key techniques that can make interactive query processing on large datasets possible.  ...  Processing dynamically changing graph structures and filtering large data volumes are very attractive applications for large audiences and organizations, with many practical implications.  ... 
doi:10.1145/3299869.3320227 dblp:conf/sigmod/UtaGDB19 fatcat:fti376bzczcolibfj543jbptye
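The demo snippet names indexing and caching as the keys to interactive, sub-second query processing. Stock Spark has no secondary indexes, so the hedged sketch below only illustrates the caching half (and not the demo system itself): a DataFrame materialised in Spark's in-memory columnar cache answers repeated selective filters without re-scanning the source data.

```scala
import org.apache.spark.sql.SparkSession

object InteractiveQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("interactive").getOrCreate()
    import spark.implicits._

    // Synthetic table; columns are illustrative assumptions.
    val events = spark.range(0, 10000000)
      .selectExpr("id", "id % 100 AS user_id", "rand() AS score")
      .cache() // materialised on first use, then served from memory

    events.count() // force the cache to fill

    // Subsequent selective queries hit the in-memory columnar cache.
    println(events.filter($"user_id" === 42).count())
    println(events.filter($"user_id" === 7 && $"score" > 0.9).count())
    spark.stop()
  }
}
```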
Showing results 1 — 15 out of 6,209 results