Building a replicated logging system with Apache Kafka

Guozhang Wang, Joel Koshy, Sriram Subramanian, Kartik Paramasivam, Mammad Zadeh, Neha Narkhede, Jun Rao, Jay Kreps, Joe Stein
2015 Proceedings of the VLDB Endowment  
In this abstract, we will talk about our design and engineering experience to replicate Kafka logs for various distributed data-driven systems at LinkedIn, including source-of-truth data storage and stream  ...  processing.  ...  For instance, Espresso is a scalable document store built at LinkedIn to serve as its online data storage platform [8]. It depends on the underlying storage engine for its data replication (e.g.  ... 
doi:10.14778/2824032.2824063 fatcat:e3ul3mxtcjeyjjikhix4mvo4bq

Data Infrastructure at LinkedIn

Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave De Maagd, Alex Feinberg, Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Joel Koshy, Kevin Krawez (+24 others)
2012 2012 IEEE 28th International Conference on Data Engineering  
In this paper, we describe a few selected data infrastructure projects at LinkedIn that have helped us accommodate this increasing scale.  ...  LinkedIn is among the largest social networking sites in the world. As the company has grown, our core data sets and request processing requirements have grown as well.  ...  Espresso Deployment at LinkedIn Espresso was first deployed at LinkedIn in September 2011 to serve read traffic for company profiles, products and reviews.  ... 
doi:10.1109/icde.2012.147 dblp:conf/icde/AuradkarBDMFGGGGHKKKLNNPPQQRSSSSSSSSTTVWWZZ12 fatcat:4paiys2xcvf3dpgp34gp3skygy

Living in the present

David Eyers, Tobias Freudenreich, Alessandro Margara, Sebastian Frischbier, Peter Pietzuch, Patrick Eugster
2012 Proceedings of the 2nd International Workshop on Cloud Computing Platforms - CloudCP '12  
Today's social web platforms, such as Facebook, Twitter, Google+, and LinkedIn, increasingly have to process large volumes of user-generated data on the fly.  ...  processing systems.  ...  is based on Kafka [13], a distributed messaging system aimed at providing a scalable, low-latency solution for log aggregation and data stream processing.  ... 
doi:10.1145/2168697.2168703 fatcat:6tvn5njmybd7xfsderihf5mxai

Identifying Requirements for Big Data Analytics and Mapping to Hadoop Tools

2019 International journal of recent technology and engineering  
Big data is being generated in a wide variety of formats at an exponential rate.  ...  Big data analytics deals with processing and analyzing voluminous data to provide useful insight for guided decision making.  ...  So, batch processing is treated as a subset of stream data processing. It can emulate batch processing, however at its core it is a native streaming processing engine.  ... 
doi:10.35940/ijrte.c5524.098319 fatcat:zgw5y6nucve3jo36sqeio3wukq

Samza

Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, Roy H. Campbell
2017 Proceedings of the VLDB Endowment  
Samza is currently in use at LinkedIn by hundreds of production applications with more than 10,000 containers.  ...  Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly.  ...  Figure 2: Stream processing pipeline at LinkedIn. Figure 3: Example Samza job to find trending tags. Figure 4: The internal architecture of a job.  ... 
doi:10.14778/3137765.3137770 fatcat:ameij3w5m5a6bcq62a2k5mwolm

The big data ecosystem at LinkedIn

Roshan Sumbaly, Jay Kreps, Sam Shah
2013 Proceedings of the 2013 international conference on Management of data - SIGMOD '13  
This includes easy ingress from and egress to online systems, and managing workflows as production processes.  ...  Acknowledgements The authors are indebted to the numerous engineers from the LinkedIn data team that have contributed to the work presented in this paper, our grid operations team for their exemplary management  ...  Among Hadoop's advantages are its horizontal scalability, fault tolerance, and multitenancy: the ability to reliably process petabytes of data on thousands of commodity machines.  ... 
doi:10.1145/2463676.2463707 dblp:conf/sigmod/SumbalyKS13 fatcat:a4v36gvravhnnkwgtyeoeeat6u

Gobblin

Lin Qiao, Shirshanka Das, Chavdar Botev, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker
2015 Proceedings of the VLDB Endowment  
At LinkedIn we need to ingest data from various sources such as relational stores, NoSQL stores, streaming systems, REST endpoints, filesystems, etc. into our Hadoop clusters.  ...  ACKNOWLEDGEMENT We'd like to extend our appreciation to our partner teams for their strong support and valuable help during the development and deployment of Gobblin at LinkedIn.  ...  -Processing paradigm: Gobblin supports both standalone and scalable platforms, including Hadoop and Yarn.  ... 
doi:10.14778/2824032.2824073 fatcat:iqfcjsm5wbhf3cmkfhsk6n6lbq

On brewing fresh espresso

Lin Qiao, Aditya Auradkar, Chris Beaver, Gregory Brandt, Mihir Gandhi, Kishore Gopalakrishna, Wai Ip, Swaroop Jagadish, Shi Lu, Alexander Pachev, Aditya Ramesh, Kapil Surlaker (+15 others)
2013 Proceedings of the 2013 international conference on Management of data - SIGMOD '13  
Espresso is a document-oriented distributed data serving platform that has been built to address LinkedIn's requirements for a scalable, performant, source-of-truth primary store.  ...  a hierarchical document model, transactional support for modifications to related documents, realtime secondary indexing, on-the-fly schema evolution and provides a timeline consistent change capture stream  ...  Acknowledgement Many other members of the LinkedIn Data Infrastructure team helped significantly in the development and deployment of Espresso.  ... 
doi:10.1145/2463676.2465298 dblp:conf/sigmod/QiaoSDQSGCSZABBGGIJLPRSSSSTTWZ13 fatcat:ljireze66zc7rllciq4nznq6pa

Efficiency of Stream Processing Engines for Processing BIGDATA Streams

B. V. S. Srikanth, V. Krishna Reddy
2016 Indian Journal of Science and Technology  
Flink can process the stream data in Batch processing & Resilient Distributed datasets at same state.  ...  the data stream. • At large volume datasets maintain the scalability for stream processing data due to Ad-hoc queries. • Live analytics for streaming the conference media at real-time data discovery,  ... 
doi:10.17485/ijst/2016/v9i14/84797 fatcat:dji434lx6nb4jdfyvwt63kn5qm

State of Big Data Analysis in the Cloud

Sanjay P. Ahuja, Bryan Moore
2013 Network and Communication Technologies  
With the emergence of cloud computing services, big data processing has become a less costly task.  ...  LinkedIn At LinkedIn, Hadoop is used to support features such as People You May Know and Endorsements using predictive analytics and querying.  ...  Billions of LinkedIn relationships are processed each day to compute People You May Know.  ... 
doi:10.5539/nct.v2n1p62 fatcat:iaykl7wr4nejra5lvq4bfl27jq

A Comparative Analysis of Big Data Frameworks: An Adoption Perspective

Madiha Khalid, Muhammad Murtaza Yousaf
2021 Applied Sciences  
These limitations have led to the development of new technologies to process and store very large datasets. As a result, several execution frameworks emerged for big data processing.  ...  The rapid growth of digital data generated from diverse sources makes it inapt to use traditional storage, processing, and analysis methods.  ...  continuous flow streaming continuous flow streaming, batched, micro-batched Stream Primitives Dstream Tuple message datastream State Management stateful stateless stateful operators stateful operators  ... 
doi:10.3390/app112211033 fatcat:mfh3thwe5ngdnolkhc264rkbdi

Real-time stream processing for Big Data

Wolfram Wingerath, Felix Gessert, Steffen Friedrich, Norbert Ritter
2016 it - Information Technology  
importance of timeliness and velocity in Big Data analytics. In this article, we give an overview over the state of the art of stream processors for low-latency Big Data analytics and conduct a qualitative  ...  comparison of the most popular contenders, namely Storm and its abstraction layer Trident, Samza and Spark Streaming.  ...  It was initially created at LinkedIn, submitted to the Apache Incubator in July 2013 and was granted top-level status in 2015.  ... 
doi:10.1515/itit-2016-0002 fatcat:zhgbaeb4afdybpejseqimwbjim

A Scalable and Robust Framework for Data Stream Ingestion [article]

Haruna Isah, Farhana Zulkernine
2018 arXiv   pre-print
This paper investigates the fundamental requirements and the state of the art of existing data stream ingestion systems, propose a scalable and fault-tolerant data stream ingestion and integration framework  ...  The ever-increasing volume and highly irregular nature of data rates pose new challenges to data stream processing systems.  ...  Qiao et al. [19] developed Gobblin, a generic data ingestion framework at LinkedIn. Gobblin was mainly driven by the fact that LinkedIn's data sources have become increasingly heterogeneous.  ... 
arXiv:1812.04197v1 fatcat:freh5fgeu5ezhbfi5lutmc6smy

Elastic and Scalable Processing of Linked Stream Data in the Cloud [chapter]

Danh Le-Phuoc, Hoan Nguyen Mau Quoc, Chan Le Van, Manfred Hauswirth
2013 Lecture Notes in Computer Science  
Several Linked Stream Data processing engines exist but their scalability still needs to be improved in terms of (static and dynamic) data sizes, number of concurrent queries, stream update frequencies  ...  It enables the integration and joint processing of heterogeneous stream data with quasi-static data from the Linked Data Cloud in near-real-time.  ...  For instance, Kafka and Scribe are used to programmatically create reliable and scalable processing pipelines for stream logs in LinkedIn and Facebook, respectively.  ... 
doi:10.1007/978-3-642-41335-3_18 fatcat:w4ogqbov6ffctm3lp6hrsdfwja

Scalable and Fault-tolerant Stateful Stream Processing

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch, Marc Herbstritt
2013 Imperial College Computing Student Workshop  
As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines.  ...  At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples.  ...  Therefore stream processing systems (SPSs) have evolved from cluster-based systems, deployed on a few dozen machines [1], to extremely scalable architectures for big data processing, spanning hundreds  ... 
doi:10.4230/oasics.iccsw.2013.11 dblp:conf/iccsw/FernandezMKP13 fatcat:fdrqftllybeixasr3ff5manw3a