SearchGen

Huajing Li, Wang-Chien Lee, Anand Sivasubramaniam, Lee Giles
2007 Proceedings of the 2007 conference on Digital libraries - JCDL '07  
Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve the performance of search services. Existing work has typically focused on generic web workloads without emphasizing specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to
... workloads are proposed. Among them, we find that access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be captured well with probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
doi:10.1145/1255175.1255203 dblp:conf/jcdl/LiLSG07 fatcat:mhkejxtkjbazrdxfaregdq5ssq