The Impacts of Structural Difference and Temporality of Tweets on Retrieval Effectiveness

Lifeng Jia, Clement Yu, Weiyi Meng
2013 ACM Transactions on Information Systems  
To explore information-seeking behaviors in the microblogosphere, the Microblog track at TREC 2011 introduced a real-time ad-hoc retrieval task that aims at ranking relevant tweets in reverse-chronological order. We study this problem via a two-phase approach: 1) retrieving tweets in an ad-hoc way; 2) utilizing the temporal information of tweets to enhance retrieval effectiveness. Tweets can be categorized into two types. One type consists of short messages not containing any URL
of a Web page. The other type has at least one URL of a Web page in addition to a short message. These two types of tweets have different structures. In the first phase, to address the structural difference of tweets, we propose a method to rank tweets using a divide-and-conquer strategy. Specifically, we first rank the two types of tweets separately, which produces two rankings, one for each type. Then we merge these two rankings into one. In the second phase, we first categorize queries into several types by exploring the temporal distributions of their top-retrieved tweets from the first phase; then we calculate the time-related relevance scores of tweets according to the classified query types; finally, we combine the time scores with the IR scores from the first phase to produce a ranking of tweets. Experimental results on the TREC 2011 and TREC 2012 queries over the TREC Tweets2011 collection show that: (i) ranking the two types of tweets separately and then merging them yields better retrieval effectiveness than ranking them simultaneously; (ii) incorporating temporal information into the retrieval process yields further improvements; and (iii) our method compares favorably with state-of-the-art methods in retrieval effectiveness.

relevant information to a query within Twitter. To respond to a query with a timestamp t, the retrieved tweets should satisfy the following three conditions: (1) relevant to the query, (2) published on or before time t, and (3) ranked in reverse-chronological order of their publishing times.

Several studies have addressed the retrieval of tweets. They fall into two major classes. The techniques in the first class [Duan et al. 2010; Han et al. 2012; Metzler and Cai 2011; Zhang et al. 2012] rank tweets by measuring the lexical similarities between tweets and queries. The methods in the second class [Amati et al.
2012; Dong et al. 2010b; Efron and Golovchinsky 2011] rank tweets by exploring temporal information (the publishing times of tweets and the timestamps of queries). Some studies [Efron et al. 2012; Liang et al. 2012; Massoudi et al. 2011] employ both lexical similarity and temporality in ranking tweets.

However, two important issues are not well addressed by these existing works. The first issue is the impact of the structural difference of tweets on retrieval effectiveness. Specifically, there are two types of tweets with different structures. The first type (to be defined as T-tweet in Section 3.2) is just a short text message of no more than 140 characters. The second type (to be defined as TU-tweet in Section 3.2) contains at least one URL of a Web page in addition to a short text message. All existing studies rank both types of tweets simultaneously. However, we believe it is important to utilize the structural difference of tweets in retrieval. The following example illustrates the motivation.

Example 1. Consider a query q = "phone hacking British politicians", a tweet d1 = "@jamesrae andy Gray is suing the NOTW... just got fired from Sky for footage that should never have been seen. I smell Murdoch!", a second tweet d2 = "Tensions simmer as 'frustrated' Rupert Murdoch flies in to face phone-hacking affair http://t.co/b3kOppY via @guardian", and a third tweet d3 = "Windows Phone 7 gets USB Tethering Hack http://tinyurl.com/4lafss6". d1 is a T-tweet that has only a short message. It is relevant to q but contains no query terms. d2 and d3 are TU-tweets: each has both a message and a URL. d2 is relevant to q. It contains two query terms, "phone" and "hacking", in its message, and all four query terms in the Web page of its URL. d3 is irrelevant to q. It contains two query terms, "Phone" and "hack", in its message, while the Web page of its URL has no query terms.
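The type split and the term counts in Example 1 can be reproduced with a small sketch. This is only an illustration under stated assumptions, not the authors' feature extraction: the names `tweet_type`, `crude_stem`, `stems`, and `query_term_coverage` are hypothetical, the URL pattern is a simple approximation, and the suffix-stripping stemmer is a stand-in for whatever stemming lets "hack" match "hacking". The optional `page_text` argument represents the content of a linked Web page, which is not available here and is therefore left empty.

```python
import re

# Simple approximation of a URL matcher (assumption, not from the paper).
URL_PATTERN = re.compile(r"https?://\S+")


def tweet_type(text):
    """Return 'TU' if the tweet contains at least one URL, else 'T'."""
    return "TU" if URL_PATTERN.search(text) else "T"


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip 'ing' or a final 's'."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def stems(text):
    """Lowercase, drop URLs, tokenize on letters and digits, stem each token."""
    text = URL_PATTERN.sub(" ", text.lower())
    return {crude_stem(tok) for tok in re.findall(r"[a-z0-9]+", text)}


def query_term_coverage(query, message, page_text=""):
    """Fraction of distinct query terms found in the message or the page text."""
    query_stems = stems(query)
    content_stems = stems(message) | stems(page_text)
    return len(query_stems & content_stems) / len(query_stems)


q = "phone hacking British politicians"
d1 = ("@jamesrae andy Gray is suing the NOTW... just got fired from Sky "
      "for footage that should never have been seen. I smell Murdoch!")
d2 = ("Tensions simmer as 'frustrated' Rupert Murdoch flies in to face "
      "phone-hacking affair http://t.co/b3kOppY via @guardian")
d3 = "Windows Phone 7 gets USB Tethering Hack http://tinyurl.com/4lafss6"

for name, tweet in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, tweet_type(tweet), query_term_coverage(q, tweet))
```

On the messages alone, d1 is classified as a T-tweet with coverage 0.0, while d2 and d3 are both TU-tweets with coverage 0.5. The message text by itself therefore cannot separate the relevant d2 from the irrelevant d3; in the example it is the linked Web page (all four query terms for d2, none for d3) that does so, which is what passing that page's text as `page_text` would capture.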
The content of a TU-tweet is the union of its short message and the contents of the Web pages of the URLs in it. Intuitively, the higher the percentage of query terms appearing in a TU-tweet, the more likely the tweet is relevant: the relevant d2 has more query terms than the irrelevant d3. However, this intuition does not apply to T-tweets. d1 has no query terms but is relevant to q, because T-tweets are so short that some relevant T-tweets may not contain any query terms. In addition, we find (see Section 6.1.2) that the sets of the most important features for learning to rank the two types of tweets are very different.

Motivated by this observation, we propose to use a divide-and-conquer strategy to address the structural difference of tweets. Specifically, we learn two type-specific rankers, one dedicated to ranking T-tweets and the other to ranking TU-tweets. We then learn a classifier that determines a preference between any T-tweet and any TU-tweet with respect to a given query. The details of the two type-specific rankers and the classifier are discussed in Sections 3.2 and 3.3, respectively. Given a query q, we first obtain a ranking of T-tweets, R1, and a ranking of TU-tweets, R2, by using the two type-specific rankers, respectively. Then we apply the classifier to determine the preference between each T-tweet from R1 and each TU-tweet from R2. Finally, we merge the tweets from R1 and R2 into a single
doi:10.1145/2500751 fatcat:nvevttmmwreytk4hg25yjoyhui