Building Graphs at a Large Scale: Union Find Shuffle [article]

Saigopal Thota, Mridul Jain, Nishad Kamat, Saikiran Malikireddy, Pruthvi Raj Eranti, Albin Kuruvilla
2021 arXiv   pre-print
In this work, we present a highly scalable and configurable distributed algorithm for building connected components, called Union Find Shuffle (UFS) with Path Compression.  ...  The scale and complexity of the algorithm are a function of the number of partitions into which the data is initially partitioned, and the size of the connected components.  ...  UNION FIND SHUFFLE WITH PATH COMPRESSION Union Find Shuffle (UFS) with Path Compression algorithm is a distributed algorithm that generates connected components in three phases.  ... 
arXiv:2012.05430v2 fatcat:bsrbxvhzg5akjjhzvt26uo4jle
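
The snippet above names union-find with path compression as the core of UFS. The sketch below is not the distributed UFS algorithm from the paper; it only illustrates, on a single machine, the classic union-find structure with path compression that the name refers to. The function names (`find`, `union`, `connected_components`) and the choice of the smaller node id as the component root are assumptions made for this illustration.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    # First pass: walk up to the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the components containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        # Assumption for this sketch: the smaller id becomes the root.
        parent[max(root_a, root_b)] = min(root_a, root_b)


def connected_components(edges):
    """Group nodes into connected components from a list of (a, b) edges."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        union(parent, a, b)
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())
```

For example, `connected_components([(1, 2), (2, 3), (4, 5)])` groups nodes 1, 2, 3 into one component and 4, 5 into another.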

The Berlin Big Data Center (BBDC)

Christoph Boden, Tilmann Rabl, Volker Markl
2018 it - Information Technology  
In order to process and analyze this data deluge, novel distributed data processing systems resting on the paradigm of data flow such as Apache Hadoop, Apache Spark, or Apache Flink were built and have  ...  However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, prohibiting large groups of data scientists and analysts from  ...  Spark provides a special reduceByKey() operator for these parallel aggregates in order to avoid a global shuffle of all tuples and rather to combine outputs with a common key on each partition before shuffling  ... 
doi:10.1515/itit-2018-0016 fatcat:mnxe772elba5rekvxds5j4xgpm
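
The snippet above describes how `reduceByKey()` combines values that share a key on each partition before the shuffle. The sketch below imitates that idea in plain Python without using Spark at all; the helper names (`combine_within_partition`, `reduce_by_key`) and the use of summation as the combining function are assumptions for illustration, not Spark's API or implementation.

```python
from collections import defaultdict


def combine_within_partition(partition):
    """Pre-aggregate (key, value) pairs locally, before any shuffle."""
    partial = defaultdict(int)
    for key, value in partition:
        partial[key] += value
    return dict(partial)


def reduce_by_key(partitions):
    """Sum values per key; also report how many records cross the 'shuffle'."""
    # Step 1: local combine, one partial result per partition.
    combined = [combine_within_partition(p) for p in partitions]
    # Only one record per (partition, key) needs to be shuffled.
    shuffled_records = sum(len(partial) for partial in combined)
    # Step 2: merge the partial results that share a key.
    totals = defaultdict(int)
    for partial in combined:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals), shuffled_records
```

With two partitions holding five raw records in total over the keys `a` and `b`, at most four partial records (two keys per partition) would need to move, and the saving grows with the number of repeated keys per partition.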

Deca

Xuanhua Shi, Zhixiang Ke, Yongluan Zhou, Hai Jin, Lu Lu, Xiong Zhang, Ligang He, Zhenyu Hu, Fei Wang
2019 ACM Transactions on Computer Systems  
When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency.  ...  In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in big data processing systems  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable comments on earlier versions of this paper, and we thank Alibaba Computing Platform team members for their support and collaboration.  ... 
doi:10.1145/3310361 fatcat:d5z767ar4rd6xdp4z4sxnpkefi

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae, Carlos H. A. Costa, Claudia Misale, Kostas Katrinis, Yoonho Park
2017 IEEE Transactions on Parallel and Distributed Systems  
We implemented this novel strategy in Spark, a popular in-memory data analytics framework.  ...  Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.  ...  This elastic in-flight limit replaces the hard in-flight limit in all decisions. We summarize this process in Algorithm 2.  ... 
doi:10.1109/tpds.2016.2627558 fatcat:ksejutgfbvet3g23mr7ru7hmxq

Shark

Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
2013 Proceedings of the 2013 international conference on Management of data - SIGMOD '13  
Shark is a new data analysis system that marries query processing with complex analytics on large clusters.  ...  It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently  ...  This is enabled by the design decision to choose Spark as the execution engine and RDD as the main data structure for operators.  ... 
doi:10.1145/2463676.2465288 dblp:conf/sigmod/XinRZFSS13 fatcat:qs4bvu7habd77g42mtm3m5sgoy

Shark: SQL and Rich Analytics at Scale [article]

Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
2012 arXiv   pre-print
Shark is a new data analysis system that marries query processing with complex analytics on large clusters.  ...  It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently  ...  This is enabled by the design decision to choose Spark as the execution engine and RDD as the main data structure for operators.  ... 
arXiv:1211.6176v1 fatcat:cdpyu3sp3bd7rcdzaaci4juayi

Distributed query-aware quantization for high-dimensional similarity searches

Gheorghi Guzun, Guadalupe Canahuate
2018 International Conference on Extending Database Technology  
We propose a distributed indexing and query algorithm to efficiently compute QED.  ...  In this paper we propose a Query dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches.  ...  Supervised methods use the class label and a training dataset to make an informed decision about the optimal split points. Unsupervised methods rely solely on the statistics collected about the data.  ... 
doi:10.5441/002/edbt.2018.33 pmid:29756125 pmcid:PMC5946695 dblp:conf/edbt/GuzunC18 fatcat:u46wcr4bzjdnnef3k7gwibph6e

ATCS: Auto-Tuning Configurations of Big Data Frameworks Based on Generative Adversarial Nets

Mingyu Li, Zhiqiang Liu, Xuanhua Shi, Hai Jin
2020 IEEE Access  
Big data processing frameworks (e.g., Spark, Storm) have been extensively used for massive data processing in the industry.  ...  INDEX TERMS Big data, generative adversarial nets, spark, genetic algorithm, automatic tune parameters.  ...  For example, PageRank is a memory-and CPU-intensive operation, with many shuffle operations while running on Spark.  ... 
doi:10.1109/access.2020.2979812 fatcat:aownx2kmxvcjlp5gx5otahigz4

A Survey on Spark Ecosystem for Big Data Processing [article]

Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li
2018 arXiv   pre-print
Finally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark.  ...  Moreover, we also introduce various data management and processing systems, machine learning algorithms and applications supported by Spark.  ...  Decision CEP engine [3] is a Complex Event Processing platform built on Spark Streaming.  ... 
arXiv:1811.08834v1 fatcat:6fxvg6me7rayzm4suoabyg7fii

DCODE: A Distributed Column-Oriented Database Engine for Big Data Analytics [chapter]

Yanchen Liu, Fang Cao, Masood Mortazavi, Mengmeng Chen, Ning Yan, Chi Ku, Aniket Adnaik, Stephen Morgan, Guangyu Shi, Yuhu Wang, Fan Fang
2015 Lecture Notes in Computer Science  
To achieve distributed query processing capability for a column database, we have added additional data structures and optimization algorithms in many core components of the execution engine of MonetDB  ...  Introduction With data collection methods continuously evolving, the demand for analytic results from the data we collect also increases.  ... 
doi:10.1007/978-3-319-24315-3_30 fatcat:sh55tv45ivdazjm2q3p3yz3knq

An Efficient Task-based All-Reduce for Machine Learning Applications

Zhenyu Li, James Davis, Stephen Jarvis
2017 Proceedings of the Machine Learning on HPC Environments - MLHPC'17  
ACKNOWLEDGMENT This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1).  ...  in object-serialization and computation by 80-90%; • A novel application of the butterfly all-reduce algorithm for the Apache Spark framework that is efficient for very large vector reduction, exhibiting  ...  Understanding the design decisions behind the usage of each is key to incorporating new algorithmic approaches to existing processes.  ... 
doi:10.1145/3146347.3146350 dblp:conf/sc/LiDJ17 fatcat:jpkx7xqlbffg7llcngt3odvfoy

A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix

Yin Huang, Yelena Yesha, Shujia Zhou
2015 2015 IEEE International Conference on Big Data (Big Data)  
The Hadoop based approach does not natively support iterative algorithms due to data shuffling during each iteration.  ...  This paper presents a novel database-based distributed computation architecture bridging the gap between Hadoop and HPC.  ...  ACKNOWLEDGMENT The authors would like to thank IBM/CAS Toronto for supporting Yin Huang with a CAS fellowship.  ... 
doi:10.1109/bigdata.2015.7364045 dblp:conf/bigdataconf/HuangYZ15 fatcat:qiabkjtshrdohiwzl6mnztu7se

Distributed Tree-based Machine Learning for Short-Term Load Forecasting with Apache Spark

Ameema Zainab, Ali Ghrayeb, Haitham Abu-Rub, Shady S. Refaat, Othmane Bouhali
2021 IEEE Access  
Compression of RDDs in shuffle operations has a great advantage due to their random reads/writes and repeated reads/writes. Compression of Spark RDDs is achieved with the help of a codec.  ...  Algorithms available in Spark ML are used for performance comparison, which includes the Spark decision trees (Spark DT) and tree ensembles, i.e., Spark parallelized random forests (Spark RF), and spark  ... 
doi:10.1109/access.2021.3072609 fatcat:napatzqw2zchdpo7uhhvqjmqiq

HRDBMS: Combining the Best of Modern and Traditional Relational Databases [article]

Jason Arnold, Boris Glavic, Ioan Raicu
2019 arXiv   pre-print
The system uses an execution framework that is tailored for relational processing, thus addressing some of the performance challenges of running SQL on top of platforms such as MapReduce and Spark.  ...  HRDBMS is a novel distributed relational database that uses a hybrid model combining the best of traditional distributed relational databases and Big Data analytics platforms such as Hive.  ...  Spark will still perform a sort during the shuffle if the shuffle will be followed by an aggregation operation.  ... 
arXiv:1901.08666v1 fatcat:i42ylznyp5bbrozubhmihgfm54

Self-adaptive Executors for Big Data Processing

Sobhan Omranian Khorasani, Jan S. Rellermeyer, Dick Epema
2019 Proceedings of the 20th International Middleware Conference on - Middleware '19  
Unfortunately, in practice this leads to a substantial manual tuning effort. In this work, we focus on one of the most impactful tuning decisions in big data systems: the number of executor threads.  ...  We first show the impact of I/O contention on the runtime of workloads and a simple static solution to reduce the number of threads for I/O-bound phases.  ...  Despite being shown for Spark, we envision this approach to be highly applicable to a broad range of different big data processing frameworks and even consider it a blueprint for the design of novel frameworks  ... 
doi:10.1145/3361525.3361545 dblp:conf/middleware/KhorasaniRE19 fatcat:udde2hnpp5bx3cluwi2mehyhui