Text mining and rating prediction with topical user models

Yanir Seroussi
Recent years have seen an abundance of user-generated texts published online. Mining these texts for useful information is a growing research area with many aspects that are yet to be fully explored. Two such aspects, which are investigated in this thesis, are the extraction of implicit information about users to create user models, and the application of these models to tasks that require user information. Our main approach to extracting user information is via topical user models, which
represent each author and document with low-dimensional distributions over topics (a topic is a distribution over words). We develop methods that utilise these topical user models to address the following tasks: (1) authorship attribution: identifying which user wrote a given anonymous text; (2) polarity inference: detecting the level of sentiment expressed in a given text; and (3) rating prediction: determining a given user's expected sentiment towards a given item. The first task we consider is authorship attribution, where the goal is to identify the authors of anonymous texts. Authorship attribution is one of the most commonly attempted tasks in the authorship analysis field, which -- in addition to authorship attribution -- also deals with profiling authors by inferring demographic information and personality traits from their texts. Traditionally, research in this field has focused on formal texts, such as essays and novels, but recently more attention has been given to online user-generated texts, such as emails and blogs. Authorship attribution of online user-generated texts is a more challenging task than traditional authorship attribution, because such texts tend to be short and informal, and the number of candidate authors is often larger than in traditional settings. We address this challenge by employing topical user models. In addition to exploring novel ways of applying two popular topic models to this task, we develop a new model that projects users and documents to two disjoint topic spaces. Employing our m [...]
doi:10.4225/03/58a675be72e78 fatcat:mbkymd2qzbeaxeabyz4ydaztq4