Continuous Matrix Approximation on Distributed Data [article]

Mina Ghashami, Jeff M. Phillips, Feifei Li
2014 arXiv pre-print
Tracking and approximating data matrices in a streaming fashion is a fundamental challenge. The problem requires more care when the data comes from multiple distributed sites, each receiving its own stream. This paper considers the problem of tracking approximations to a matrix in the distributed streaming model. In this model, there are m distributed sites, each of which observes a distinct stream of data (where each element is a row of a distributed matrix) and has a communication channel with a coordinator. The goal is to track an eps-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: | ||A x||^2 - ||B x||^2 | <= eps ||A||_F^2. Our algorithms work in a streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/eps) log(beta N)) communication, where N is the size of the stream (at the time of the query) and beta is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with large real datasets demonstrate the efficiency of these protocols.
arXiv:1404.7571v1 fatcat:u3uqbbbnxzh2xaniwdznm2qmhi
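
The guarantee stated in the abstract, | ||A x||^2 - ||B x||^2 | <= eps ||A||_F^2 for every unit vector x, can be spot-checked numerically. The following is a minimal sketch in pure Python, not the protocol from the paper: all function names are hypothetical, and `drop_light_rows` is a deliberately naive stand-in for a sketch B. It meets the bound only because the squared norm it discards never exceeds eps ||A||_F^2, and by Cauchy-Schwarz each dropped row a contributes at most ||a||^2 along any unit direction. It is centralized and involves no communication at all.

```python
import math
import random


def sq_norm_along(M, x):
    """Return ||M x||^2 for a matrix M (list of rows) and a vector x."""
    return sum(sum(m_ij * x_j for m_ij, x_j in zip(row, x)) ** 2 for row in M)


def frobenius_sq(M):
    """Return ||M||_F^2, the sum of all squared entries."""
    return sum(v * v for row in M for v in row)


def random_unit_vector(d, rng):
    """Draw a direction uniformly at random from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def max_observed_error(A, B, trials=1000, seed=0):
    """Largest | ||A x||^2 - ||B x||^2 | seen over sampled unit vectors x.

    Sampling only gives a lower bound on the true worst-case error,
    so this is a spot check and not a proof.
    """
    rng = random.Random(seed)
    d = len(A[0])
    worst = 0.0
    for _ in range(trials):
        x = random_unit_vector(d, rng)
        worst = max(worst, abs(sq_norm_along(A, x) - sq_norm_along(B, x)))
    return worst


def drop_light_rows(A, eps):
    """Toy sketch (an assumption for illustration, NOT the paper's algorithm):
    drop the lightest rows while the total squared norm dropped stays
    within eps * ||A||_F^2."""
    budget = eps * frobenius_sq(A)
    order = sorted(range(len(A)), key=lambda i: frobenius_sq([A[i]]))
    dropped, spent = set(), 0.0
    for i in order:
        w = frobenius_sq([A[i]])
        if spent + w > budget:
            break
        dropped.add(i)
        spent += w
    return [row for i, row in enumerate(A) if i not in dropped]
```

To use it, build A as a list of equal-length rows, compute `B = drop_light_rows(A, eps)`, and compare `max_observed_error(A, B)` against `eps * frobenius_sq(A)`. The same check applies unchanged to any other candidate B with the same number of columns as A.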