Ad-hoc data processing in the cloud

Dionysios Logothetis, Kenneth Yocum
2008 Proceedings of the VLDB Endowment  
Ad-hoc data processing has proven to be a critical paradigm for Internet companies processing large volumes of unstructured data. However, the emergence of cloud-based computing, where storage and CPU are outsourced to multiple third parties across the globe, implies large collections of highly distributed and continuously evolving data. Our demonstration combines the power and simplicity of the MapReduce abstraction with a wide-scale distributed stream processor, Mortar. While our incremental MapReduce operators avoid data re-processing, the stream processor manages the placement and physical data flow of the operators across the wide area. We demonstrate a distributed web indexing engine against which users can submit and deploy continuous MapReduce jobs. A visualization component illustrates both the incremental indexing and index searches in real time.
doi:10.14778/1454159.1454204 fatcat:73nyswrxp5e7pd2u76xbnfxekm