Balancing clusters to reduce response time variability in large scale image search [article]

Romain Tavenard, Hervé Jégou
2010 arXiv pre-print
Many algorithms for approximate nearest neighbor search in high-dimensional spaces partition the data into clusters. At query time, in order to avoid exhaustive search, an index selects the few (or a single) clusters nearest to the query point. Clusters are often produced by the well-known k-means approach since it has several desirable properties. On the downside, it tends to produce clusters having quite different cardinalities. Imbalanced clusters negatively impact both the variance and the expectation of query response times. This paper proposes to modify k-means centroids to produce clusters with more comparable sizes without sacrificing the desirable properties. Experiments with a large scale collection of image descriptors show that our algorithm significantly reduces the variance of response times without seriously impacting the search quality.
arXiv:1009.4739v1 fatcat:hdod6pwlgbbwnpwhfvhiel45vm
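
The abstract's claim that imbalanced clusters hurt both the expectation and the variance of response time can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that a query visits exactly one cluster, that it lands in cluster i with probability n_i / n (proportional to the cluster size), and that response time is proportional to the number of vectors scanned. The function names and the example cluster sizes are hypothetical.

```python
def expected_scan_length(sizes):
    """Expected number of vectors scanned when a single cluster is visited.

    Assumption: the query falls in cluster i with probability n_i / n,
    so the expectation is sum(n_i^2) / n.
    """
    n = sum(sizes)
    return sum(s * s for s in sizes) / n


def scan_length_variance(sizes):
    """Variance of the scan length under the same assumption."""
    n = sum(sizes)
    second_moment = sum(s ** 3 for s in sizes) / n  # E[length^2]
    return second_moment - expected_scan_length(sizes) ** 2


def imbalance_ratio(sizes):
    """Expected scan length divided by its balanced value n / k (1.0 if balanced)."""
    n = sum(sizes)
    k = len(sizes)
    return expected_scan_length(sizes) * k / n


# Illustrative cluster sizes only: 100 vectors split into 4 clusters.
balanced = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

for name, sizes in (("balanced", balanced), ("skewed", skewed)):
    print(name,
          expected_scan_length(sizes),
          scan_length_variance(sizes),
          imbalance_ratio(sizes))
```

Under these assumptions the balanced partition scans 25 vectors per query with zero variance, while the skewed partition scans 52 on average with a variance of 756, even though both hold the same 100 vectors in 4 clusters.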