Masumi Shirakawa, Takahiro Hara, Shojiro Nishio
2014 Proceedings of the VLDB Endowment  
In this demonstration, we introduce MLJ (MultiLingual Journalism), the first Web-based system that lets users search the latest tweets on any topic posted by media outlets and journalists, across languages. Handling multilingual tweets in real time involves several technical challenges: the language barrier, the sparsity of words, and the real-time data stream. To overcome the language barrier and the sparsity of words, MLJ harnesses CL-ESA, a Wikipedia-based, language-independent method to generate a vector of Wikipedia pages (entities) from an input text. To deal continuously with the tweet stream, we propose one-pass DP-means, an online clustering method based on DP-means. Given a new tweet as input, MLJ generates a vector using CL-ESA and assigns it to one of the clusters using one-pass DP-means. Because a search query is also interpreted as a vector, users can instantly find clusters containing the latest tweets related to the query without being aware of language differences. As of March 2014, MLJ supports nine languages (English, Japanese, Korean, Spanish, Portuguese, German, French, Italian, and Arabic) and covers 24 countries.
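The pipeline described in the abstract (vectorize a tweet, assign it to a cluster in one pass, then match a query vector against the clusters) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the CL-ESA vectors are already available as sparse dictionaries keyed by Wikipedia page title, and it replaces the DP-means distance penalty with a cosine-similarity threshold. The names `OnePassClusterer`, `threshold`, `add`, and `search` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnePassClusterer:
    """Toy online clusterer in the spirit of DP-means (assumption, not the paper's code).

    Each incoming vector is seen exactly once. It joins the most similar
    cluster, or starts a new cluster when no existing one is similar enough.
    """

    def __init__(self, threshold):
        # Hypothetical parameter: minimum cosine similarity needed to join a cluster.
        self.threshold = threshold
        self.sums = []      # per-cluster sum of member vectors
        self.members = []   # per-cluster list of item ids

    def add(self, item_id, vec):
        """Assign one vector to a cluster and return the cluster index."""
        best, best_sim = None, -1.0
        for i, total in enumerate(self.sums):
            sim = cosine(vec, total)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No cluster is close enough: open a new one.
            self.sums.append(dict(vec))
            self.members.append([item_id])
            return len(self.sums) - 1
        # Update the running sum. Cosine similarity ignores scale, so the sum
        # behaves like the centroid without needing a division.
        for key, weight in vec.items():
            self.sums[best][key] = self.sums[best].get(key, 0.0) + weight
        self.members[best].append(item_id)
        return best

    def search(self, query_vec, top_k=3):
        """Return up to top_k (cluster index, similarity) pairs for a query vector."""
        scored = sorted(
            ((cosine(query_vec, total), i) for i, total in enumerate(self.sums)),
            reverse=True,
        )
        return [(i, sim) for sim, i in scored[:top_k] if sim > 0.0]
```

Because both tweets and queries live in the same space of Wikipedia pages, a query in one language can retrieve a cluster whose tweets were written in another, provided the vectorizer maps both to the same entities. In this sketch, feeding each tweet's vector to `add` and a query's vector to `search` mirrors that flow.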
doi:10.14778/2733004.2733041 fatcat:u2y2ylo7xngfrioapjedjzzxby