Graph visualization of dark web hyperlinks and their feature analysis
Taichi Aoki, Atsuhiro Goto
International Journal of Networking and Computing
Content regarding various illegal activities, such as weapon and drug trafficking, is shared on the dark web. Most of the illegal content is distributed on anonymous networks that cannot be directly accessed from the World Wide Web. A number of studies have been conducted on the network structure of the World Wide Web since its advent. Similar to the World Wide Web, the dark web is connected by the Hypertext Transfer Protocol (HTTP); this makes it possible to use the methods developed for the web
... the dark web. Many studies have investigated the dark web and its network structure. However, few studies have focused on the visualization of the dark web network structure, and no studies have investigated the temporal changes in the network structure. In this study, to understand the Hypertext Markup Language (HTML) network structure of the dark web, we created and visualized a graph of the HTML hyperlink relations of the Tor network, which is popular on the dark web. We then compared the insights gained from graph centrality metrics with those gained from visualizations. The analyzed dataset comprised 25,270,157 pages of HTML text files crawled from the Tor network by breadth-first search from June 1, 2018, to January 30, 2021. We acquired half-yearly snapshots from the collected data and investigated the change in the dark web network over time using a time-series graph. We then derived the centrality metrics from the created graph data and confirmed the differences between the centrality metrics and visualizations. The results obtained in this study provide new insights into the dark web. First, we found that the dark web fluctuated significantly; the structure of the dark web network was more strongly interconnected. Second, most of the nodes that had been added in the preceding two years may have disappeared rapidly after May 2020. Third, analysis of each snapshot revealed that the proportion of highly volatile domains increased from 40% to 75% during the observation period. Fourth, after calculating the network centrality metrics from each snapshot and comparing the transition of hub nodes in chronological order, we observed that the importance of link-collection sites as the main information retrieval method used in the dark web decreased. Finally, we estimated the size of the dark web based on our observed dark web measurements using the mark-recapture method.
To the best of our knowledge, this is the first study to use the mark-recapture method to estimate the size of the dark web network.
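As an illustration of the mark-recapture idea mentioned above, the following is a minimal sketch and not the procedure used in the study. It assumes the classical Lincoln-Petersen estimator and its Chapman correction, which the abstract does not name, and it treats two crawl snapshots as the "mark" and "recapture" samples. The function names and the example `.onion` labels are hypothetical. Both estimators assume a closed population between the two samples, which a highly volatile network would only approximately satisfy.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Classical estimate N = M * C / R (assumed estimator, see lead-in)."""
    if recaptured <= 0:
        raise ValueError("at least one recaptured item is required")
    return marked_first * caught_second / recaptured


def chapman(marked_first, caught_second, recaptured):
    """Chapman's bias-corrected variant; defined even when recaptured == 0."""
    return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1


def estimate_from_crawls(first_crawl, second_crawl):
    """Estimate population size from two sets of observed domains.

    first_crawl  : domains seen in the first sample (the "marked" set)
    second_crawl : domains seen in the second sample
    Domains present in both samples count as recaptures.
    """
    first, second = set(first_crawl), set(second_crawl)
    recaptured = len(first & second)
    return (
        lincoln_petersen(len(first), len(second), recaptured),
        chapman(len(first), len(second), recaptured),
    )


if __name__ == "__main__":
    # Hypothetical domain labels, for illustration only.
    crawl_a = {"a.onion", "b.onion", "c.onion", "d.onion"}
    crawl_b = {"c.onion", "d.onion", "e.onion", "f.onion", "g.onion", "h.onion"}
    lp, ch = estimate_from_crawls(crawl_a, crawl_b)
    print(f"Lincoln-Petersen: {lp:.2f}, Chapman: {ch:.2f}")
```

With 4 domains in the first sample, 6 in the second, and 2 seen in both, the Lincoln-Petersen estimate is 4 × 6 / 2 = 12 domains; the Chapman variant gives a slightly smaller value because it corrects for small-sample bias.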