Counting YouTube videos via random prefix sampling

Jia Zhou, Yanhua Li, Vijay Kumar Adhikari, Zhi-Li Zhang
2011 Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference - IMC '11  
Leveraging the characteristics of the YouTube video id space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video ids (namely, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were a total of roughly 500 million YouTube videos as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (< 1000); we also shed light on the bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.
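
The abstract does not spell out the estimator, so the following is only a generic sketch of the idea behind prefix sampling, run against a synthetic id collection rather than YouTube. The 64-symbol alphabet, the id length, and the names `make_ids` and `estimate_total` are all assumptions made for illustration, and a local prefix index stands in for the search API. The sketch draws prefixes uniformly at random, counts the ids that start with each one, and scales the mean count by the number of possible prefixes.

```python
import random
import string

# Assumed id alphabet of 64 symbols; not taken from the abstract.
ALPHABET = string.ascii_letters + string.digits + "-_"


def make_ids(n, length, rng):
    """Build a synthetic collection of n distinct random ids."""
    ids = set()
    while len(ids) < n:
        ids.add("".join(rng.choice(ALPHABET) for _ in range(length)))
    return ids


def estimate_total(ids, prefix_len, num_samples, rng):
    """Estimate len(ids) from counts under uniformly sampled prefixes."""
    # Local prefix index; a stand-in for querying a search API by prefix.
    counts = {}
    for vid in ids:
        prefix = vid[:prefix_len]
        counts[prefix] = counts.get(prefix, 0) + 1

    total_prefixes = len(ALPHABET) ** prefix_len
    matched = 0
    for _ in range(num_samples):
        prefix = "".join(rng.choice(ALPHABET) for _ in range(prefix_len))
        matched += counts.get(prefix, 0)

    # Mean count per sampled prefix, scaled to the whole prefix space.
    return total_prefixes * matched / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    ids = make_ids(20000, 6, rng)
    print(len(ids), estimate_total(ids, 2, 2000, rng))
```

Because every prefix is equally likely to be drawn, the expected count per sampled prefix equals the true total divided by the number of prefixes, which is why the scaled mean is unbiased in this toy setting. Longer prefixes lower the cost per query but raise the variance, so more samples are needed for the same error.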
doi:10.1145/2068816.2068851 dblp:conf/imc/ZhouLAZ11 fatcat:rkgdk3nrkfbwngglv2uetxalae