SpiderTrap - an Innovative Approach to Analyze Activity of Internet Bots on a Website
The main idea behind creating SpiderTrap was to build a website that can track how Internet bots crawl it. To track bots, the honeypot dynamically generates different types of hyperlinks on its web pages, leading from one article to another, and logs the information passed by web clients in HTTP requests when they visit these links. By analyzing the sequences of visited links and the accompanying HTTP requests, it is possible to detect bots, reveal their crawling or scanning algorithms, and uncover other characteristic features of the traffic they generate. In our research we focused on identifying and describing whole bot operations rather than classifying single HTTP requests. This novel approach has given us insight into what different types of Internet bots are looking for and how they work. This information can be used to optimize websites for search engines' bots to achieve a better position on a search results page, or to prepare rule sets for tools that filter traffic to web pages, minimizing the impact of bad and unwanted bots on websites' availability and security. We present the results of five months of SpiderTrap's activity, during which the honeypot was accessible via two domains (.pl and .eu) as well as by an IP address. The results show examples of activity of well-known Internet bots, such as Googlebot or Bingbot, of unknown crawlers, and of scanners trying to exploit vulnerabilities in the most popular web frameworks or looking for active webshells (i.e., access points to control a web server left by other attackers).

INDEX TERMS Cyber threat intelligence, honeypot, HTTP, search engines, situational awareness, web crawlers, web search, web spiders.
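The core tracking mechanism described above can be illustrated with a minimal sketch: each generated hyperlink carries a unique token, so the sequence of tokens seen in the request log reconstructs a client's crawl path. This is a hypothetical illustration, not the paper's actual implementation; the class and method names (`LinkTracker`, `make_link`, `log_request`, `crawl_path`) are invented for clarity.

```python
import uuid

class LinkTracker:
    """Hypothetical sketch of SpiderTrap-style link tracking:
    every generated hyperlink gets a unique token, and visits are
    logged so a client's crawl sequence can be reconstructed."""

    def __init__(self):
        self.token_to_article = {}   # token -> target article id
        self.visits = []             # ordered log of (client, token)

    def make_link(self, article_id):
        """Generate a unique URL for a hyperlink leading to article_id."""
        token = uuid.uuid4().hex
        self.token_to_article[token] = article_id
        return f"/article/{token}"

    def log_request(self, client, path):
        """Record a visit; return the article the link's token points to."""
        token = path.rsplit("/", 1)[-1]
        self.visits.append((client, token))
        return self.token_to_article.get(token)

    def crawl_path(self, client):
        """Reconstruct the sequence of articles one client visited."""
        return [self.token_to_article[t]
                for c, t in self.visits if c == client]
```

Because each link is unique per page generation, two clients following "the same" article chain produce distinguishable token sequences, which is what makes per-bot crawl analysis possible.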