XML Document Probabilistic Clustering Based on Structure and Content

Hassan Naderi, Mojtaba Rashidi
2016 International Journal of Information Technology Control and Automation  
A large volume of information on the Web is stored in XML format, and clustering is one way to manage these documents. Most current methods for clustering XML documents consider only one of two aspects, structure or content. In this paper, we propose SCEM (Expectation Maximization Structure and Content), a method that clusters XML documents effectively by combining content and structural features. The other contribution of this paper is that we used probabilistic distributions in such a way that they have probability parameters corresponding to one cluster. In this way, we obtained better effectiveness compared to other clustering methods due to generality. Experimental results on real datasets show the effectiveness of the proposed method, particularly when it is applied to large XML documents without a schema. It can also be used to improve the accuracy and effectiveness of XML information retrieval.
doi:10.5121/ijitca.2016.6101 fatcat:jzj37tafrbfmnjtatlhny57rwq
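
The abstract does not give the SCEM model itself, so the following is only a minimal sketch of the general idea it describes: EM clustering over features that mix structure and content. It assumes a plain mixture of multinomials, with root-to-node tag paths as structural features and lowercased text tokens as content features. The names `extract_features` and `em_cluster`, the smoothing constant `alpha`, and the sample documents are all hypothetical and are not taken from the paper.

```python
import math
import random
import xml.etree.ElementTree as ET
from collections import Counter


def extract_features(xml_text):
    """Bag of features: tag paths (structure) plus text tokens (content)."""
    feats = Counter()

    def walk(node, path):
        path = path + "/" + node.tag
        feats["path:" + path] += 1
        for tok in (node.text or "").lower().split():
            feats["term:" + tok] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return feats


def em_cluster(docs, k, iters=30, seed=0, alpha=0.1):
    """EM for a mixture of multinomials (an assumed stand-in, not SCEM)."""
    rng = random.Random(seed)
    vocab = sorted({f for d in docs for f in d})
    # Random soft assignment of each document to the k clusters.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iters):
        # M-step: smoothed cluster priors and per-cluster feature distributions.
        priors = [(sum(r[c] for r in resp) + alpha) / (len(docs) + k * alpha)
                  for c in range(k)]
        log_theta = []
        for c in range(k):
            counts = {f: alpha for f in vocab}
            for d, r in zip(docs, resp):
                for f, n in d.items():
                    counts[f] += r[c] * n
            total = sum(counts.values())
            log_theta.append({f: math.log(v / total) for f, v in counts.items()})
        # E-step: posterior cluster probabilities, computed in log space.
        for i, d in enumerate(docs):
            logp = [math.log(priors[c])
                    + sum(n * log_theta[c][f] for f, n in d.items())
                    for c in range(k)]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp[i] = [w / total for w in weights]
    # Hard label: the most probable cluster for each document.
    return [max(range(k), key=lambda c: r[c]) for r in resp]


# Hypothetical schema-less documents of two kinds.
book = "<book><title>graph theory basics</title><author>ann lee</author></book>"
song = "<song><name>blue river night</name><artist>sam ray</artist></song>"
docs = [extract_features(x) for x in (book, book, song, song)]
labels = em_cluster(docs, k=2)
```

Putting both feature kinds in one bag is the simplest way to let a single set of per-cluster parameters cover structure and content together; a model closer to the paper might weight or model the two kinds separately.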