Large-Scale Analysis of Domain Blacklists

Tran Thao, Tokunbo Makanju, Jumpei Urakawa, Akira Yamada, Kosuke Murakami, Ayumu Kubota
unpublished
Malicious content has grown along with the explosion of the Internet. Many organizations therefore construct and maintain blacklists to help web users protect their computers. Among the many kinds of blacklists, domain blacklists are the most popular. Existing empirical analyses of domain blacklists have several limitations, such as using only outdated blacklists, omitting important blacklists, or focusing only on simple aspects of blacklists. In this paper, we analyze the top 14 blacklists, including popular and regularly updated blacklists such as Safe Browsing from Google and urlblacklist.com. We are the first to filter out old entries in the blacklists using an enormous dataset of user browsing history. Besides analyzing the intersections between blacklists and the registration information from Whois (such as top-level domain, domain age, and country), we also build two machine-learning classification models, one for web content categories (e.g., education and business) and one for malicious categories (i.e., landing and distribution). Our analysis yields several findings. First, Safe Browsing versions 3 and 4 are deployed separately and have independent databases with different entries, although they belong to the same organization. Second, the blacklist dsi.ut-capitole.fr is almost a subset of urlblacklist.com, which contains 98% of its entries. Third, the largest share of blacklisted domains were created in 2000 (6.08%), and the largest share are registered in the United States (24.28%). Fourth, Safe Browsing version 4 detects younger domains than the other blacklists do. Fifth, Tech & Computing is the dominant web content category in all the blacklists, and the blacklists within each group (i.e., small public blacklists, large public blacklists, private blacklists) correlate more strongly with one another in web content than with blacklists in other groups. Finally, the number of landing domains is larger than that of distribution domains, by at least 75% in large public blacklists and by at least 60% in the other blacklists.
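The "almost a subset" finding can be read as a containment ratio: the fraction of one blacklist's entries that also appear in another. The following is a minimal sketch of that computation, not the authors' code. The helper names `normalize` and `containment` and the sample domains are hypothetical placeholders, and the normalization step (lowercasing, stripping a trailing dot) is an assumption about how entries might be made comparable.

```python
def normalize(domain):
    """Lowercase a domain and drop surrounding whitespace and a trailing dot.
    (Assumed normalization; the paper's actual preprocessing is not stated here.)"""
    return domain.strip().lower().rstrip(".")


def containment(a, b):
    """Return the fraction of entries in `a` that also appear in `b`."""
    a, b = set(a), set(b)
    if not a:
        return 0.0  # avoid division by zero for an empty list
    return len(a & b) / len(a)


# Hypothetical sample entries, for illustration only.
small = {normalize(d) for d in
         ["Example.com", "bad.test.", "evil.example", "ads.invalid"]}
large = {normalize(d) for d in
         ["example.com", "bad.test", "evil.example", "other.test", "more.invalid"]}

ratio = containment(small, large)
print(f"{ratio:.0%} of the smaller list is contained in the larger one")
```

The ratio is asymmetric by design: a small list can be almost entirely contained in a large one while the reverse ratio stays low, which is the situation the abstract describes for the two named blacklists.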
fatcat:grgb664qarhlviatlbdp7unogy