Distinct Random Sampling from a Distributed Stream

Srikanta Tirthapura
2015 IEEE International Parallel and Distributed Processing Symposium
We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling from a distributed stream. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message optimal to within a factor of four. We present extensions to sliding windows, and experimental results showing the performance of our algorithm on real-world data sets.
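To make the problem setting concrete, the following is a minimal sketch of one common way to keep a distinct sample across sites: hash every element to a pseudo-random value and keep the s elements with the smallest hash values at the coordinator. The abstract does not describe the paper's algorithm, so this is an illustration under assumptions, not the author's method. All names here (`h`, `Coordinator`, `Site`, `run_demo`) are hypothetical, the sketch assumes no hash collisions, and it makes no attempt to reach the message bound stated above.

```python
import hashlib


def h(x: str) -> float:
    """Map an element to a pseudo-random value in [0, 1).

    Duplicates of the same element always get the same value, which is
    what makes the sample depend only on the set of distinct elements.
    """
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Coordinator:
    """Keeps the s distinct elements with the smallest hash values."""

    def __init__(self, s: int):
        self.s = s
        self.sample = {}   # element -> hash value
        self.messages = 0  # number of site-to-coordinator messages

    def threshold(self) -> float:
        # Until the sample is full, every element is a candidate.
        if len(self.sample) < self.s:
            return 1.0
        return max(self.sample.values())

    def receive(self, x: str, hx: float) -> float:
        self.messages += 1
        if x not in self.sample:
            self.sample[x] = hx
            if len(self.sample) > self.s:
                # Evict the element with the largest hash value.
                worst = max(self.sample, key=self.sample.get)
                del self.sample[worst]
        # Reply with the current threshold so the site can filter locally.
        return self.threshold()


class Site:
    """Forwards an element only if its hash beats the last known threshold."""

    def __init__(self, coordinator: Coordinator):
        self.coordinator = coordinator
        self.u = 1.0  # local, possibly stale, copy of the threshold

    def observe(self, x: str) -> None:
        hx = h(x)
        if hx < self.u:
            self.u = self.coordinator.receive(x, hx)


def run_demo(s: int = 5, num_sites: int = 3, num_items: int = 1000):
    """Feed a synthetic stream with many duplicates to the sites in turn."""
    coordinator = Coordinator(s)
    sites = [Site(coordinator) for _ in range(num_sites)]
    seen = set()
    for i in range(num_items):
        x = f"item{(i * 7) % 200}"  # only 200 distinct elements
        seen.add(x)
        sites[i % num_sites].observe(x)
    return coordinator, seen
```

Because a site's threshold copy is never smaller than the coordinator's true threshold, a stale copy can only cause extra messages, never a missed sample element. A query at the coordinator simply returns `coordinator.sample`.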
doi:10.1109/ipdps.2015.97 dblp:conf/ipps/Tirthapura15 fatcat:bws5fvwzovfurpes2pcs62czjq