Mining Query Logs: Turning Search Usage Data into Knowledge

F. Silvestri
2010 Foundations and Trends in Information Retrieval  
Web search engines have stored information about their users in their logs since they started to operate. This information often serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular
applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging current open problems in this field.

Even if this quote dates back to 2005, it is very likely that those survey results are still valid (if not even more positive for search engines). On the other hand, search engine users are satisfied with their search experience [189]. In a paper overviewing the challenges in modern web search engine design, Baeza-Yates et al. [14] state: "The main challenge is hence to design large-scale distributed systems that satisfy the user expectations, in which queries use resources efficiently, thereby reducing the cost per query."

Fig. 1.1 A fragment of the AOL query log [160].

How query logs interact with search engines has been studied in many papers. For a general overview, [12, 20] are good starting points. In this paper, we review some of the most recent techniques dealing with query logs and how they can be used to enhance web search engine operations. We summarize the basic results concerning query logs: analyses, techniques used to extract knowledge, the most remarkable results, the most useful applications, and open issues and possibilities that remain to be studied. The purpose is thus to present ideas and results as comprehensively as possible. We review fundamental and state-of-the-art techniques. In each section, even if not explicitly stated, we review and analyze the algorithms used, not only their results. This paper is intended for readers with a basic knowledge of computer science. We also expect readers to have a basic knowledge of Information Retrieval. Everything beyond a basic level is analyzed and detailed. Before going on, it is important to make clear that none of the analyses and results reported were reproduced by the author. We only report ...

Fig. 1.5 A cloud of the 250 most frequently queried terms in the AOL query log [160]. Picture has been generated using ...

... and independently from a partition of the whole collection. The second phase collects global statistics computed over the whole inverted index. One of the most valuable advantages of document partitioning is the possibility of easily performing updates. In fact, new documents may simply be inserted into a new partition that is indexed independently of the others [169]. Since the advent of web search engines, a large number of papers have been published describing different architectures for search engines and search engine components [10, 25, 47, 33, 96, 97, 147, 150, 153, 204]. Many other papers [13, 14, 100, 101] enumerate the major challenges search engine developers must address in order to improve their ability to help users find the information they need. Interested readers will find many interesting insights in the papers referenced above. Needless to say, readers will not find any particular details about the real structure of a search engine in this survey. Usually, this kind of information is highly confidential and it is very unlikely that search companies will ever disclose it.
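
As a rough illustration of the document-partitioning idea mentioned above, the following Python sketch indexes each partition on its own, gathers global statistics in a second pass, and handles an update by adding a new partition. It is a minimal sketch under stated assumptions, not the architecture of any real search engine: the function names (`build_local_index`, `collect_global_stats`, `search`), the whitespace tokenizer, the toy documents, and the choice of document frequency as the "global statistic" are all hypothetical, and document identifiers are assumed to be unique across partitions.

```python
from collections import Counter, defaultdict


def build_local_index(docs):
    """Phase 1 (assumed): index one partition independently.

    docs maps a document id to its text; the result maps each term
    to the set of document ids in this partition that contain it.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def collect_global_stats(local_indexes):
    """Phase 2 (assumed): sum per-partition document frequencies."""
    doc_freq = Counter()
    for index in local_indexes:
        for term, postings in index.items():
            doc_freq[term] += len(postings)
    return doc_freq


def search(local_indexes, term):
    """Query every partition and union the matching document ids."""
    hits = set()
    for index in local_indexes:
        hits |= index.get(term, set())
    return hits


# Two toy partitions of a hypothetical collection.
partitions = [
    {"d1": "query log mining", "d2": "web search engines"},
    {"d3": "search engine query logs"},
]
indexes = [build_local_index(p) for p in partitions]
stats = collect_global_stats(indexes)

# Update: new documents go into a fresh partition; the existing
# partitions are left untouched, only the global statistics are refreshed.
indexes.append(build_local_index({"d4": "mining web query logs"}))
stats = collect_global_stats(indexes)
```

In this sketch the cost of an update is limited to indexing the new partition and recomputing the global statistics, which is the property the text attributes to document partitioning.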
doi:10.1561/1500000013 fatcat:t4rvgk4igbe6nni5l3khdywmpi