SPWalk: Similar Property Oriented Feature Learning for Phishing Detection

Xiuwen Liu, Jianming Fu
2020 IEEE Access  
Detecting phishing webpages is an essential task that protects legitimate websites and their users from various malicious activities. To classify a suspect webpage as phishing or legitimate, robust and effective classification features are needed. However, recent phishing attacks usually make phishing webpages resemble legitimate webpages in both visual and functional aspects, which makes feature extraction more difficult. We herein propose SPWalk, an unsupervised feature learning algorithm for phishing detection. In SPWalk, similar property nodes refer to a collection of phishing webpages or a collection of legitimate webpages. We first construct a weblink network whose nodes represent webpages. The edges between nodes represent the reference relationships that connect webpages through hyperlinks or similar textual content. SPWalk then applies a network embedding technique to map nodes into a low-dimensional vector space. A biased random walk procedure efficiently integrates both the structural information between nodes and the URL information of each node. The effectiveness and robustness of SPWalk come from three points: (1) phishing attackers do not have full control over reference relationships; (2) the structural regularities generated by diverse reference relationships can be exploited to discriminate between phishing and legitimate webpages; (3) node URL information makes the learned node representations better suited for phishing detection. Using the learned node representations as numeric features, we conduct experiments to classify webpages as legitimate or phishing. We demonstrate the superiority of SPWalk over state-of-the-art techniques on phishing detection, especially in terms of precision (over 95%). Even when phishing webpages are well camouflaged by attackers to evade detection, SPWalk consistently exhibits better classification efficacy.
INDEX TERMS Feature learning, network embedding, phishing detection, similar property.
doi:10.1109/access.2020.2992381 fatcat:2ktsawu7rbheteco7epbeenhsa
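
The abstract describes a biased random walk that combines link structure with per-node URL information. The Python sketch below illustrates that general idea only; it is not the authors' procedure. The bias used here (trigram Jaccard similarity between URLs, added to a base weight of 1) is an assumption chosen for illustration, and the names `url_similarity`, `biased_walk`, `graph`, and `urls` are hypothetical, since the abstract does not give the actual transition probabilities.

```python
import random


def url_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams (an assumed, illustrative measure)."""
    grams_a = {a[i:i + 3] for i in range(len(a) - 2)}
    grams_b = {b[i:i + 3] for i in range(len(b) - 2)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def biased_walk(graph, urls, start, length, rng):
    """Walk up to `length` nodes, preferring neighbors whose URL resembles the current one."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = graph.get(current, [])
        if not neighbors:
            break  # dead end: stop the walk early
        # Base weight 1.0 keeps every edge reachable; URL similarity adds the bias.
        weights = [1.0 + url_similarity(urls[current], urls[n]) for n in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy weblink network: adjacency lists plus one URL per node.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
urls = {
    "a": "http://example.com/login",
    "b": "http://example.com/account",
    "c": "http://other.test/home",
}
walk = biased_walk(graph, urls, start="a", length=5, rng=random.Random(0))
```

Walks of this kind are commonly fed to a skip-gram style model to obtain node vectors, as in DeepWalk or node2vec; the abstract does not say which embedding model SPWalk uses.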