Clustering web images using association rules, interestingness measures, and hypergraph partitions

Hassan H. Malik, John R. Kender
2006 Proceedings of the 6th international conference on Web engineering - ICWE '06  
This paper presents a new approach to clustering web images. Images are first processed to extract signal features such as color in HSV format and quantized orientation. Web pages referring to these images are processed to extract textual features (keywords), and feature reduction techniques such as stemming, stop word elimination, and Zipf's law are applied. All visual and textual features are used to generate association rules. Hypergraphs are generated from these rules, with features used as vertices and discovered associations as hyperedges. Twenty-two objective "interestingness" measures are evaluated on their ability to prune non-interesting rules and to assign weights to hyperedges. Then a hypergraph partitioning algorithm is used to generate clusters of features, and a simple scoring function is used to assign images to clusters. A tree-distance-based evaluation measure is used to evaluate the quality of image clustering with respect to manually generated ground truth. Our experiments indicate that combining textual and content-based features results in better clustering than signal-only or text-only approaches. The online steps run in real time, which makes this approach practical for web images. Furthermore, we demonstrate that statistical interestingness measures such as Correlation Coefficient, Laplace, Kappa and J-Measure result in better clustering than traditional association rule interestingness measures such as Support and Confidence.
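To make the measures named in the abstract concrete, the following is a minimal sketch that computes Support, Confidence, Laplace, Correlation Coefficient, and Kappa for a single rule A -> B, then turns the rule into a weighted hyperedge. It uses the standard textbook definitions of these measures, not code from the paper. The function names `rule_measures` and `hyperedge`, the feature labels, and the toy data are assumptions made for illustration only.

```python
from math import sqrt


def rule_measures(transactions, antecedent, consequent):
    """Interestingness measures for the rule antecedent -> consequent.

    `transactions` is an iterable of feature collections, one per image
    (hypothetical layout; the paper does not prescribe one).
    """
    sets = [set(t) for t in transactions]
    a, b = set(antecedent), set(consequent)
    n = len(sets)
    n_a = sum(1 for t in sets if a <= t)               # images containing A
    n_b = sum(1 for t in sets if b <= t)               # images containing B
    n_ab = sum(1 for t in sets if a <= t and b <= t)   # images containing both

    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    p_not_a_not_b = 1.0 - p_a - p_b + p_ab

    support = p_ab
    confidence = n_ab / n_a if n_a else 0.0
    laplace = (n_ab + 1) / (n_a + 2)

    # Correlation (phi) coefficient; defined as 0 here when degenerate.
    denom = sqrt(p_a * p_b * (1 - p_a) * (1 - p_b))
    correlation = (p_ab - p_a * p_b) / denom if denom else 0.0

    # Cohen's kappa: observed agreement versus agreement expected by chance.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    observed = p_ab + p_not_a_not_b
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "support": support,
        "confidence": confidence,
        "laplace": laplace,
        "correlation": correlation,
        "kappa": kappa,
    }


def hyperedge(transactions, antecedent, consequent, measure="correlation"):
    """Return (vertex set, weight): the rule's features form one hyperedge."""
    weight = rule_measures(transactions, antecedent, consequent)[measure]
    return frozenset(antecedent) | frozenset(consequent), weight


# Toy data: four images, each described by visual and textual features.
images = [
    {"hsv:red", "kw:beach"},
    {"hsv:red", "kw:beach"},
    {"hsv:red"},
    {"hsv:blue"},
]
measures = rule_measures(images, {"hsv:red"}, {"kw:beach"})
edge = hyperedge(images, {"hsv:red"}, {"kw:beach"}, measure="kappa")
```

In this sketch, pruning would amount to dropping rules whose chosen measure falls below a threshold before building the hypergraph. The threshold and the choice of measure are left open, since the abstract reports only which measures performed better.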
doi:10.1145/1145581.1145591 dblp:conf/icwe/MalikK06 fatcat:7j66wracvrepfefcprmnona7zq