An analysis of 14 Million tweets on hashtag-oriented spamming*

Surendra Sedhai, Aixin Sun
2017 Journal of the Association for Information Science and Technology  
Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, the popularity of Twitter has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., non-spam) labels, to understand spamming activities on Twitter. The primary focus of
more » ... his paper is to analyze various aspects of spam on Twitter based on hashtags, tweet contents, and user profiles, which are useful for both tweet-level and user-level spam detection. First, we compare the usage of hashtags in spam and ham tweets based on frequency, position, orthography, and co-occurrence. Second, for content-based analysis, we analyze the variations in word usage, metadata, and near-duplicate tweets. Third, for user-based analysis, we investigate user profile information. In our study, we validate that spammers use popular hashtags to promote their tweets. We also observe differences in the usage of words in spam and ham tweets. Spam tweets are more likely to be emphasized using exclamation points and capitalized words. Further, we observe that spammers use multiple accounts to post near-duplicate tweets to promote their services and products. Unlike spammers, legitimate users are likely to provide more information such as their locations and personal descriptions in their profiles. In summary, this study presents a comprehensive analysis of hashtags, tweet contents, and user profiles in Twitter spamming. * This paper is an extended version of a SIGIR'15 paper (Sedhai & Sun, 2015) [37], which describes the construction of the HSpam14 dataset.
doi:10.1002/asi.23836 fatcat:ihcrmf4onjhwvpnimfg2vsglr4