Towards machine learning-based self-tuning of Hadoop-Spark system

Md. Armanur Rahman, Abid Hossen, J. Hossen, Venkataseshaiah C, Thangavel Bhuvaneswari, Aziza Sultana
2019 Indonesian Journal of Electrical Engineering and Computer Science  
Apache Spark is an open source distributed platform which uses the concept of distributed memory for processing big data. Spark has more than 180 predominant configuration parameters.  ...  Configuration settings directly control the efficiency of Apache Spark while processing big data, so getting the best outcome is a challenging task, as Spark has many configuration parameters.  ...  Towards machine learning-based self-tuning of hadoop-spark system (Md. Armanur Rahman) Flowchart of Model Making Figure 3.  ... 
doi:10.11591/ijeecs.v15.i2.pp1076-1085 fatcat:f3svmltkcrfzfhjucmk5mjiz5e

A Survey on Automatic Parameter Tuning for Big Data Processing Systems

Herodotos Herodotou, Yuxing Chen, Jiaheng Lu
2020 ACM Computing Surveys  
., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression.  ...  We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.  ...  resource management and load balancing [88].  ...  Parameter selection: Even though there exist over 200 configuration parameters in Apache Spark [6], not all of them significantly influence the performance of Spark jobs.  ... 
doi:10.1145/3381027 fatcat:7aglimtuwze25boptuano4ufdy

Effectively Testing System Configurations of Critical IoT Analytics Pipelines [article]

Morgan Geldenhuys, Lauritz Thamsen, Kain Kordian Gontarska, Felix Lorenz, Odej Kao
2021 arXiv   pre-print
We demonstrate the usefulness of our approach by investigating different configurations of an exemplary geographically-based traffic monitoring application implemented in Apache Flink.  ...  However, optimizing these systems towards specific Quality of Service targets is a difficult and time-consuming task, due to the large-scale distributed systems involved, the existence of so many configuration  ...  Secondly, we want to research flexible methods for automatic parameter tuning and selection of optimal performing configurations.  ... 
arXiv:2102.06094v2 fatcat:cknw4pyxkjcdlbbjqknzvpx6km

Apache Spark usage and deployment models for scientific computing

Diogo Castro, Prasanth Kothuri, Piotr Mrowczynski, Danilo Piparo, Enric Tejedor, A. Forti, L. Betev, M. Litmaath, O. Smirnova, P. Hristov
2019 EPJ Web of Conferences  
The second part of the talk touches upon the evolution of the Apache Spark data analytics platform, particularly sharing the recent work done to run Spark on Kubernetes on the virtualized and container-based  ...  Among many frameworks, Apache Spark is currently getting the most traction from various user communities and new ways to deploy Spark such as Apache Mesos or Spark on Kubernetes have started to evolve  ...  In the future, there are plans to develop functionality for the creation and usage of disposable Spark on Kubernetes clusters from the SWAN platform.  ... 
doi:10.1051/epjconf/201921407020 fatcat:nd3s4cqnjzc3babezy5xjsghum

Towards Automatic Memory Tuning for In-Memory Big Data Analytics in Clusters

Aris-Kyriakos Koliopoulos, Paraskevas Yiapanis, Firat Tekiner, Goran Nenadic, John Keane
2016 2016 IEEE International Congress on Big Data (BigData Congress)  
Apache Spark is an innovative distributed computing framework that supports in-memory computations.  ...  Spark offers various choices for memory tuning but this requires in-depth systems-level knowledge and the choices will be different across various workloads and cluster settings.  ...  The authors wish to thank Dr Mark Hall at the University of Waikato for his advice and encouragement.  ... 
doi:10.1109/bigdatacongress.2016.56 dblp:conf/bigdata/KoliopoulosYTNK16 fatcat:ovavputhrnatfbcikwmq6hipki

Cloud-agnostic architectures for machine learning based on Apache Spark

Enikő Nagy, Róbert Lovas, István Pintye, Ákos Hajnal, Péter Kacsuk
2021 Advances in Engineering Software  
These pre-configured reference architectures can be automatically deployed even by the data scientist on-demand, using a multi-cloud approach for a wide range of cloud systems like Amazon AWS, Microsoft  ...  The paper focuses particularly on the widespread Apache Spark Big Data platform as the baseline and the Occopus cloud-agnostic orchestrator tool.  ...  Lovas was also supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.  ... 
doi:10.1016/j.advengsoft.2021.103029 fatcat:vxl4axbvn5apvah5mgpt3kkqpa

Intelligent system for auto-tuning of big data analytics deployment properties

Profeta Davide, Gaglio Salvatore, Finazzo Rosolino
2018 Zenodo  
The proposed deployment optimizer then aims to solve these problems automatically, transparently, and with faster times than the manual tuning acted by domain experts and data engineers.  ...  These frameworks share the same problem: the tuning of deployment properties to optimize application performance.  ...  of parameters through trials and errors, and focusing the research towards automatic solutions.  ... 
doi:10.5281/zenodo.2019443 fatcat:wo5dq36hinbwlgpt5thscx2hhi

Towards automatic parameter tuning of stream processing systems

Muhammad Bilal, Marco Canini
2017 Proceedings of the 2017 Symposium on Cloud Computing - SoCC '17  
We demonstrate the multiple benefits of automated parameter tuning in optimizing three benchmark applications in Apache Storm.  ...  Optimizing the performance of big-data streaming applications has become a daunting and time-consuming task: parameters may be tuned from a space of hundreds or even thousands of possible configurations  ...  Popular stream-processing systems such as Apache Storm [3], Heron [32], Apache Flink [1] and Spark Streaming [2] have dozens of available configuration parameters.  ... 
doi:10.1145/3127479.3127492 dblp:conf/cloud/BilalC17 fatcat:lhpdfc7xfzbibljltgdgf4spjm

S2CE: A Hybrid Cloud and Edge Orchestrator for Mining Exascale Distributed Streams [article]

Nicolas Kourtellis, Herodotos Herodotou, Maciej Grzenda, Piotr Wawrzyniak, Albert Bifet
2020 arXiv   pre-print
To address this need, this paper proposes Stream to Cloud & Edge (S2CE), a first of its kind, optimized, multi-cloud and edge orchestrator, easily configurable, scalable, and extensible.  ...  The explosive increase in volume, velocity, variety, and veracity of data generated by distributed and heterogeneous nodes such as IoT and other devices continuously challenges the state of the art in big  ...  Innovation Exchange with Apache Ecosystem's Open Source The big data interest was followed by a rapid development of DSPEs, such as Apache Storm, Samza, and more recently Apache Spark and Apache Flink,  ... 
arXiv:2007.01260v1 fatcat:hfavtqtpmnd2xo5uh7tzcomm4u

Distributed Training of Deep Neural Networks with Spark: The MareNostrum Experience

Leonel Cruz, Ruben Tous, Beatriz Otero
2019 Pattern Recognition Letters  
The components of a layered architecture, based on the usage of Apache Spark, are described and the performance and scalability of the resulting system is evaluated.  ...  Deployment of a distributed deep learning technology stack on a large parallel system is a very complex process, involving the integration and configuration of several layers of both, general-purpose and  ...  Acknowledgements This work is partially supported by the Spanish Ministry of Economy and Competitivity under contract TIN2015-65316-P and by the SGR programme (2014-SGR-1051) of the Catalan Government.  ... 
doi:10.1016/j.patrec.2019.01.020 fatcat:47iumueflfdvbdgnml6wux3qyi

Introduction to Big Data Technology [chapter]

Bilal Abu-Salih, Pornpit Wongthongtham, Dengya Zhu, Kit Yan Chan, Amit Rudra
2021 Social Big Data Analytics  
This chapter first gives a historical review of big data, followed by a discussion of the characteristics of big data, i.e., from the 3V's up to the 10V's of big data.  ...  Big data is no longer "all just hype" but is widely applied in nearly all aspects of our business, governments, and organizations with the technology stack of AI.  ...  Also, an array of online resources is provided to help researchers and professionals obtain further and deeper insights into big data technology.  ... 
doi:10.1007/978-981-33-6652-7_2 fatcat:dog5ym666famdedniwyewwswiq

Tuneful: An Online Significance-Aware Configuration Tuner for Big Data Analytics [article]

Ayat Fekry, Lucian Carata, Thomas Pasquier, Andrew Rice, Andy Hopper
2020 arXiv   pre-print
We propose Tuneful, an approach that efficiently tunes the configuration of in-memory cluster computing systems.  ...  This means that the amortization of the tuning cost happens significantly faster, enabling practical tuning for new classes of workloads.  ...  We illustrated how Tuneful is designed to be integrated into Spark to provide an efficient configuration tuning with negligible overhead.  ... 
arXiv:2001.08002v1 fatcat:ah7wrhrbxvbs7lzzbogzs477ri

On the Scalability of Big Data Cyber Security Analytics Systems [article]

Faheem Ullah, Muhammad Ali Babar
2021 arXiv   pre-print
., Apache Spark) to collect, store, and analyze a large volume of security event data for detecting cyber-attacks.  ...  We have found that (i) a BDCA system with default Spark configuration parameters deviates from ideal scalability by 59.5% (ii) 9 out of 11 studied Spark configuration parameters significantly impact scalability  ...  Our proposed adaptation approach is the first step towards facilitating practitioners to automatically tune Spark parameters for achieving optimal scalability.  ... 
arXiv:2112.00853v1 fatcat:4twnzni64fdpnh5cdl4yslz64e

Interactive Big Data Analytics Platform for Healthcare and Clinical Services

Dillon Chrimes
2018 Global Journal of Engineering Sciences  
The next step of the testing of the BDA platform will be to distribute and index the data to ten billion patient data rows across the database nodes, and then test the performance using the established  ...  The design of the implemented BDA platform (utilizing WestGrid's supercomputing clusters) is available to researchers and sponsored members.  ...  The XML configuration file hbase-site.xml and the hbase-env.sh script were adjusted to configure and fine-tune HBase.  ... 
doi:10.33552/gjes.2018.01.000502 fatcat:biiz2qnx4vckrcrmv4gcn5ydru

Self-adaptive Executors for Big Data Processing

Sobhan Omranian Khorasani, Jan S. Rellermeyer, Dick Epema
2019 Proceedings of the 20th International Middleware Conference on - Middleware '19  
Unfortunately, in practice this leads to a substantial manual tuning effort. In this work, we focus on one of the most impactful tuning decisions in big data systems: the number of executor threads.  ...  In response, systems offer pre-determined behaviors based on heuristics and then expose a large number of configuration parameters for operators to adjust them to their particular infrastructure.  ...  One example in the big-data domain is MRonline [13] for automatic performance tuning of Hadoop.  ... 
doi:10.1145/3361525.3361545 dblp:conf/middleware/KhorasaniRE19 fatcat:udde2hnpp5bx3cluwi2mehyhui