Novel document detection for massive data streams using distributed dictionary learning

S. P. Kasiviswanathan, G. Cong, P. Melville, R. D. Lawrence
2013 IBM Journal of Research and Development  
Given the high volume of content being generated online, it becomes necessary to employ automated techniques to separate out the documents belonging to novel topics from the background discussion, in a robust and scalable manner (with respect to the size of the document set). We present a solution to this challenge based on sparse coding, in which a stream of documents (where each document is modeled as an m-dimensional vector y) can be used to learn a dictionary matrix A of dimension m × k, such that the documents can be approximately represented by a linear combination of a few columns of A. If a new document cannot be represented with low error as a sparse linear combination of these columns, then this is a strong indicator of novelty of the document. We scale up this approach to handle millions of documents by parallelizing sparse coding and dictionary learning, and by using the alternating-directions method to solve the resulting optimization problems. We conduct our experiments on high-performance computing clusters with differing architectures and evaluate our approach on news streams and streaming data from Twitter. Based on the analysis, we share our insights on the distributed optimization and machine architecture that can help the design of exascale systems supporting data analytics.
doi:10.1147/jrd.2013.2247232 fatcat:gb533tvb6nfnxab4v3vr65frry
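
The novelty test described in the abstract can be illustrated with a minimal single-machine sketch. Several things here are assumptions made for illustration and are not taken from the paper: the dictionary A is random rather than learned, the sparse code is computed with plain ISTA (iterative soft-thresholding) on a squared-error objective rather than the alternating-directions method the authors use, and the names sparse_code, novelty_score, lam, and n_iter are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(A, y, lam=0.1, n_iter=500):
    """Approximately solve min_x 0.5*||y - A x||_2^2 + lam*||x||_1 with ISTA.

    Assumption: ISTA stands in for the alternating-directions solver
    mentioned in the abstract; it is used here only because it is short.
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, lam * step)
    return x

def novelty_score(A, y, lam=0.1):
    """Sparse-coding objective value at the computed code; higher means more novel."""
    x = sparse_code(A, y, lam)
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))

# Toy data (assumption): a random dictionary with unit-norm columns.
rng = np.random.default_rng(0)
m, k = 50, 10
A = rng.standard_normal((m, k))
A /= np.linalg.norm(A, axis=0)

# A "known" document built from two dictionary columns.
x_true = np.zeros(k)
x_true[[1, 4]] = [1.0, -0.5]
y_known = A @ x_true

# A "novel" document: a random direction rescaled to the same norm.
y_novel = rng.standard_normal(m)
y_novel *= np.linalg.norm(y_known) / np.linalg.norm(y_novel)

score_known = novelty_score(A, y_known)
score_novel = novelty_score(A, y_novel)
print(f"known: {score_known:.3f}  novel: {score_novel:.3f}")
```

In this toy setup the document that lies in the span of a few dictionary columns receives a lower score than the random one, so thresholding the score would flag the latter as novel. Choosing the threshold, learning A from the stream, and distributing the computation are outside the scope of this sketch.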