CRAWLING AJAX-BASED WEB APPLICATIONS: EVOLUTION AND STATE-OF-THE-ART

Shah Khalid, Shah Khusro, Irfan Ullah
2018 Malaysian Journal of Computer Science  
The innovation of AJAX resulted in more responsive, interactive and faster web applications through the clever amalgamation of JavaScript, HTML, and Cascading Style Sheets (CSS). While this is an achievement from the user's perspective, it poses many challenges for web search engines. One major challenge is the complexity of crawling such web applications: multiple states are associated with one uniform resource locator (URL), which causes a mismatch with the search model of web search engines, where a web document is uniquely identified by a single URL with a single state. Crawling AJAX-based web applications means giving web search engines the capability to download and index the information produced in these highly interactive web applications. To improve the capabilities of web search engines, the technicalities of AJAX that shatter the metaphor of a web page, on which current web search engines rely during crawling, need to be investigated. Although some academic tools have been developed, they produce false positives that greatly affect the performance of web search engines. We aim to investigate AJAX and AJAX-based web applications as well as the state-of-the-art in crawling these applications, along with some prominent issues, challenges and recommendations.
The World Wide Web is a giant source of information in which the ways of storing, retrieving and displaying information change continuously [1]. Simple HTML pages are now being replaced with AJAX-embedded web pages, making information retrieval (IR) challenging because of the complexities in executing JavaScript, constructing the navigation model and analyzing the Document Object Model (DOM) [2]. AJAX, which is short for Asynchronous JavaScript and XML (Extensible Markup Language), is one of the prominent new techniques used to develop rich and more interactive web applications such as Facebook, YouTube and Google Maps. Rather than a new programming language, scripting language or technology, AJAX is a new way to think about, design and develop web applications [3, 4]. As content is dynamically and asynchronously produced in AJAX-based web applications, web crawlers are unable to detect AJAX events and execute calls the way humans do using a web browser. Furthermore, many applications on the Web are AJAX-based and are among the least searchable.
A methodology is needed to present the content produced by these applications to crawlers for indexing purposes, just as in traditional Web IR. A lot of research has already explored the technical aspects of AJAX, its challenges and the benefits it provides to web application developers. However, these articles are limited in evaluating and comparing the relative performance of AJAX web crawlers. In this review article, we critically and analytically review the available literature in order to report the state-of-the-art in crawling AJAX-based web applications along with some prominent issues and challenges. We also briefly discuss the nature of AJAX-based web applications and how they differ from traditional web applications. In addition, we cover the similarities and differences between traditional web crawlers and existing AJAX crawlers and highlight the limitations of AJAX crawlers. For this purpose, we searched the major Computer Science digital libraries and other related databases to collect relevant articles that best describe this new technology. We carefully reviewed and analyzed all the selected articles and reported the state-of-the-art accordingly. We hope that this research paper will open new research avenues for researchers interested in this domain. The rest of the paper is organized as follows: Section 2 presents the journey
doi:10.22452/mjcs.vol31no1.3 fatcat:k7nnsuki3vhlrlu7wbuthps6ra