PhishStorm: Detecting Phishing With Streaming Analytics

Samuel Marchal, Jerome Francois, Radu State, Thomas Engel
2014 IEEE Transactions on Network and Service Management  
Despite the growth of prevention techniques, phishing remains an important threat, since the principal countermeasures in use are still based on reactive URL blacklisting. This technique is inefficient due to the short lifetime of phishing Web sites, which makes recent approaches relying on real-time or proactive phishing URL detection more appropriate. In this paper we introduce PhishStorm, an automated phishing detection system that can analyse any URL in real time in order to identify potential phishing sites. PhishStorm can interface with any email server or HTTP proxy. We argue that phishing URLs usually have few relationships between the part of the URL that must be registered (low level domain) and the remaining part of the URL (upper level domain, path, query). We show that experimental evidence supports this observation and can be used to detect phishing sites. For this purpose, we define the new concept of intra-URL relatedness and evaluate it using features extracted from the words that compose a URL, based on query data from the Google and Yahoo search engines. These features are then used in machine-learning-based classification to detect phishing URLs from a real dataset. Our technique is assessed on 96,018 phishing and legitimate URLs, yielding a correct classification rate of 94.91% with only 1.44% false positives. An extension for a URL phishingness rating system exhibiting a high confidence rate (> 99%) is proposed. We also discuss efficient implementation patterns that allow real-time analytics using Big Data architectures such as STORM and advanced data structures based on Bloom filters.

Features and descriptions:
1. $J_{RR} = \frac{|REL_{rd}(url) \cap REL_{rem}(url)|}{|REL_{rd}(url) \cup REL_{rem}(url)|}$: Jaccard index between $REL_{rd}(url)$ and $REL_{rem}(url)$
2. $J_{RA} = \frac{|REL_{rd}(url) \cap AS_{rem}(url)|}{|REL_{rd}(url) \cup AS_{rem}(url)|}$: Jaccard index between $REL_{rd}(url)$ and $AS_{rem}(url)$
3. $J_{AA} = \frac{|AS_{rd}(url) \cap AS_{rem}(url)|}{|AS_{rd}(url) \cup AS_{rem}(url)|}$: Jaccard index between $AS_{rd}(url)$ and $AS_{rem}(url)$
4. $J_{AR} = \frac{|AS_{rd}(url) \cap REL_{rem}(url)|}{|AS_{rd}(url) \cup REL_{rem}(url)|}$: Jaccard index between $AS_{rd}(url)$ and $REL_{rem}(url)$
5. $J_{ARrd} = \frac{|AS_{rd}(url) \cap REL_{rd}(url)|}{|AS_{rd}(url) \cup REL_{rd}(url)|}$: Jaccard index between $AS_{rd}(url)$ and $REL_{rd}(url)$
6. $J_{ARrem} = \frac{|AS_{rem}(url) \cap REL_{rem}(url)|}{|AS_{rem}(url) \cup REL_{rem}(url)|}$
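The six features listed above are all Jaccard indices between pairs of word sets. As a minimal sketch of that arithmetic only, the Python below computes them from four sets given as plain inputs. The function names, the toy word sets, and the choice to return 0.0 when both sets are empty are assumptions made for illustration; the abstract does not say how the REL and AS sets are built from the search engine query data, so that step is left out.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|.

    Returning 0.0 when both sets are empty is an assumption of this sketch.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def intra_url_features(
    rel_rd: Set[str],
    rel_rem: Set[str],
    as_rd: Set[str],
    as_rem: Set[str],
) -> Dict[str, float]:
    """Compute the six Jaccard features from the four word sets of one URL."""
    return {
        "J_RR": jaccard(rel_rd, rel_rem),
        "J_RA": jaccard(rel_rd, as_rem),
        "J_AA": jaccard(as_rd, as_rem),
        "J_AR": jaccard(as_rd, rel_rem),
        "J_ARrd": jaccard(as_rd, rel_rd),
        "J_ARrem": jaccard(as_rem, rel_rem),
    }


# Hypothetical toy word sets, not taken from the paper's dataset.
rel_rd = {"bank", "login", "account"}
rel_rem = {"login", "secure", "account", "update"}
as_rd = {"bank", "finance"}
as_rem = {"paypal", "finance", "update"}

features = intra_url_features(rel_rd, rel_rem, as_rd, as_rem)
for name, value in features.items():
    print(f"{name} = {value:.3f}")
```

With these toy sets, J_RR shares two words out of five distinct ones, while J_RA and J_AR have no words in common. Low overlap of this kind is what the abstract associates with phishing URLs, and each URL's feature values would then be passed to a classifier.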
doi:10.1109/tnsm.2014.2377295 fatcat:wer2f6njkzbbpgef64ricbcmiy