Cuttlefish: A Lightweight Primitive for Adaptive Query Processing [article]

Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke
2018 arXiv   pre-print
We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins.  ...  Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.  ...  We prototype Cuttlefish in Apache Spark [71] and tune operators for image convolution, regular expression matching, and relational joins.  ... 
arXiv:1802.09180v1 fatcat:mw2l75gknfejxl6wh2v4wceijq

Accelerating raw data analysis with the ACCORDA software and hardware architecture

Yuanwei Fang, Chen Zou, Andrew A. Chien
2019 Proceedings of the VLDB Endowment  
Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture.  ...  In doing so, ACCORDA robustly matches or outperforms prior systems that depend on caching loaded data, while computing on raw, unloaded data.  ...  Apache SparkSQL [15] is a widely-used analytics system for processing raw data built on top of Apache Spark [59], a popular in-memory map-reduce framework.  ... 
doi:10.14778/3342263.3342634 fatcat:ab5unl24tbeaxiic2jdu52oyee

SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink

Oscar Ceballos, Carlos Alberto Ramírez Restrepo, María Constanza Pabón, Andres M. Castillo, Oscar Corcho
2021 Applied Sciences  
., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance.  ...  In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API.  ...  For example, the applications derived from the Internet of Things (IoT) that need to store, process, and analyze data in real or near real-time.  ... 
doi:10.3390/app11157033 fatcat:kqtyvqp645bctbpriwhwb5qgxu

Going big: a large-scale study on what big data developers ask

Mehdi Bagherzadeh, Raffi Khatchadourian
2019 Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2019  
popularity and difficulty of topics and their correlations; and discuss implications of our findings for practice, research and education of big data software development and investigate their coincidence with  ...  spark-dataframe spark-streaming spark-structured-streaming yarn 14 (0.2, 0.005) amazon-emr apache-spark apache-spark-2.0 apache-spark-dataset apache-spark-ml apache-spark-mllib apache-spark-sql apache-zeppelin  ...  , hue, impala, mahout, hortonworks-data-platform, mapreduce, oozie, sqoop, yarn} T Spark = {amazon-emr, apache-spark, apache-spark-2.0, apache-spark-dataset, apache-spark-ml, apache-spark-mllib, apache-spark-sql  ... 
doi:10.1145/3338906.3338939 dblp:conf/sigsoft/BagherzadehK19 fatcat:fjo23bl5tncczhrfbylr5rhi4m

Survey on Online Log Parsers

Tejaswini S, Azra Nasreen
2021 International Journal of Engineering and Advanced Technology  
As a result, software applications are required to be up and running at all times without fail.  ...  This paper focuses on surveying and creating a comparative study on online log parsers by analysing the type of technique used, efficiency and accuracy of the parser on a given dataset, time complexity,  ...  , BGL, HDFS, Windows and Spark.  ... 
doi:10.35940/ijeat.e2816.0610521 fatcat:gecfe3ile5aovlioyigt6khw6a

Tools and Benchmarks for Automated Log Parsing [article]

Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu
2019 arXiv   pre-print
Even more, some log parsers can parse the HDFS and Apache datasets with 100% accuracy. This is because HDFS and Apache error logs have relatively simple event templates and are easy to identify.  ...  In particular, we extend Drain with Spark and naturally exploit the above log data partitioning for quick parallelization.  ... 
arXiv:1811.03509v2 fatcat:q6ffnv7nsrhc3exnwc3piisdpi

A comprehensive social media data processing and analytics architecture by using big data platforms: a case study of twitter flood-risk messages

Michal Podhoranyi
2021 Earth Science Informatics  
The main objective of the article is to propose an advanced architecture and workflow based on Apache Hadoop and Apache Spark big data platforms.  ...  Results confirmed the advantages of many well-known features of Spark and Hadoop in social media data processing.  ...  The second contribution of the paper is in the Spark application that enables to process data in memory with Spark engine and at the same time, it takes advantage of Apache Hadoop Yarn cluster.  ... 
doi:10.1007/s12145-021-00601-w pmid:33727982 pmcid:PMC7951942 fatcat:bxsqn2dbb5ee3crzdsmhifuuyy

MapReduce program synthesis

Calvin Smith, Aws Albarghouthi
2016 SIGPLAN notices  
In this paper, we ask whether we can raise the level of abstraction even higher than what state-of-the-art platforms provide, but this time with the goal of unleashing the power of cloud computing for  ...  We evaluate our tool on a range of real-world big-data analysis tasks and general computations.  ...  Synthesized programs are converted into Apache Spark code and are ready to be executed on an appropriate platform.  ... 
doi:10.1145/2980983.2908102 fatcat:v32qb2hemrdvhbicqcsjgq67ne

MapReduce program synthesis

Calvin Smith, Aws Albarghouthi
2016 Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2016  
In this paper, we ask whether we can raise the level of abstraction even higher than what state-of-the-art platforms provide, but this time with the goal of unleashing the power of cloud computing for  ...  We evaluate our tool on a range of real-world big-data analysis tasks and general computations.  ...  Synthesized programs are converted into Apache Spark code and are ready to be executed on an appropriate platform.  ... 
doi:10.1145/2908080.2908102 dblp:conf/pldi/SmithA16 fatcat:cxb2uah3xzhpnalb5wqyi5negm

A System Architecture for the Detection of Insider Attacks in Big Data Systems [article]

Santosh Aditham, Nagarajan Ranganathan
2016 arXiv   pre-print
Initial experiments on real-world Hadoop and Spark tests show that the proposed system needs to consider only 20% of the code to analyze a program and incurs 3.28% time overhead.  ...  The second step involves the matching of these instruction sequences among the replica nodes.  ...  A model of the proposed system is tested in real-time on Amazon's EC2 clusters using different sets of Hadoop and Spark programs.  ... 
arXiv:1612.01587v1 fatcat:32vdnfossbfwplywpcq5odkduu

Self-Supervised Log Parsing [article]

Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, Odej Kao
2020 arXiv   pre-print
We evaluate the parsing performance of NuLog on 10 real-world log datasets and compare the results with 12 parsing techniques.  ...  This allows the coupling of the MLM as pre-training with a downstream anomaly detection task.  ...  SHISO [9] is creating a structured tree using the nodes generated from log messages which enables a real-time update of new log messages if a match with previously existing log templates fails.  ... 
arXiv:2003.07905v1 fatcat:ihng5d57ubhetoraseyxstl3s4

Split-Correctness in Information Extraction

Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, Frank Neven
2019 Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems - PODS '19  
We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with "split constraints".  ...  An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating  ...  Another motivation comes from programming over distribution frameworks such as Apache Hadoop [14] and Apache Spark [35].  ... 
doi:10.1145/3294052.3319684 dblp:conf/pods/DoleschalKMNN19 fatcat:rogzrk5jzvfexk4y64hsnvz3re

Split-Correctness in Information Extraction [article]

Johannes Doleschal and Benny Kimelfeld and Wim Martens and Frank Neven and Matthias Niewerth
2021 arXiv   pre-print
We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.  ...  An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating  ...  One open problem is the exact complexity of Splittability, as we do not have matching upper and lower bounds in the general case.  ... 
arXiv:1810.03367v2 fatcat:64whlmyx6jam5ikxtqa567thz4

Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval [chapter]

Philippe Tamla, Florian Freund, Matthias Hemmje
2021 The Role of Gamification in Software Development Lifecycle [Working Title]  
We present real-world use case scenarios and derive features for training and managing NER models with the Stanford NLP machine learning API.  ...  Then, the integration of our developed NER system with an expert rule-based system is presented, which allows an automatic classification of text documents into different taxonomy categories available  ...  Spark NLP 9 is one of the most recent NLP tools that was released in 2017. It is a library built on top of Apache Spark and TensorFlow.  ... 
doi:10.5772/intechopen.95076 fatcat:57vpkmg53zfpthadta2fua6d2i

Fusing effectful comprehensions

Olli Saarikivi, Margus Veanes, Todd Mytkowicz, Madan Musuvathi
2017 Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2017  
regular expressions, processing XML with XPath, and running queries over encoded data.  ...  Using background theory reasoning with an SMT solver our fusion and subsequent reachability based branch elimination algorithms can significantly reduce the complexity of the fused pipelines.  ...  These pipelines exhibit common real-world scenarios of extracting data with regexes, querying XML files with XPath, and working with (Base64) encoded data.  ... 
doi:10.1145/3062341.3062362 dblp:conf/pldi/SaarikiviVMM17 fatcat:567q5l43zrcwjpwwaftuvhqsti