Automatic Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models [chapter]

Alexander Ypma, Tom Heskes
2003 Lecture Notes in Computer Science  
We propose mixtures of hidden Markov models for modelling clickstreams of web surfers. With this approach, the page categorization is learned from the data without the need for a (possibly cumbersome) manual categorization. We provide an EM algorithm for training a mixture of HMMs and show that additional static user data can be incorporated easily to possibly enhance the labelling of users. Furthermore, we use prior knowledge to enhance generalization and avoid numerical problems. We use parameter tying to decrease the danger of overfitting and to reduce computational overhead. We put a flat prior on the parameters so that transitions between page categories that occur very rarely or not at all still keep a nonzero probability. In applications to artificial data and real-world web logs we demonstrate the usefulness of our approach. We train a mixture of HMMs on artificial navigation patterns, and show that the correct model is being learned. Moreover, we show that the use of static 'satellite data' may enhance the labelling of shorter navigation patterns. When applying a mixture of HMMs to real-world web logs from a large Dutch commercial web site, we demonstrate that sensible page categorizations are being learned.
doi:10.1007/978-3-540-39663-5_3 fatcat:omiyl2senjeydjhxp5j2s5dhyi
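
Two ideas from the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes discrete page observations, realises the flat prior as a pseudo-count added to expected transition counts (one common choice; the record does not give the exact form used in the chapter), and computes the E-step user-cluster responsibilities of a mixture of HMMs with a scaled forward pass. All function names and the `pseudo_count` parameter are hypothetical.

```python
import numpy as np

def smoothed_transitions(expected_counts, pseudo_count=1.0):
    """M-step style update: add a pseudo-count (assumed form of the flat
    prior) so that unseen transitions keep a nonzero probability."""
    counts = np.asarray(expected_counts, dtype=float) + pseudo_count
    return counts / counts.sum(axis=1, keepdims=True)

def sequence_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of one clickstream under one HMM (scaled forward pass).
    start: (K,), trans: (K, K), emit: (K, n_pages), obs: list of page indices."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale
    for page in obs[1:]:
        alpha = (alpha @ trans) * emit[:, page]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik

def responsibilities(obs, weights, components):
    """Posterior probability that a clickstream belongs to each mixture
    component; components is a list of (start, trans, emit) tuples."""
    log_post = np.array([
        np.log(w) + sequence_log_likelihood(obs, *params)
        for w, params in zip(weights, components)
    ])
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: two page categories, three pages, two user clusters.
emit = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.1, 0.8]])  # shared by both clusters (assumed tying)
start = np.array([0.5, 0.5])
sticky = smoothed_transitions([[9, 0], [0, 9]])   # users staying in a category
jumpy = smoothed_transitions([[0, 9], [9, 0]])    # users switching categories
mixture = [(start, sticky, emit), (start, jumpy, emit)]
post = responsibilities([0, 0, 0, 0], [0.5, 0.5], mixture)
```

In this toy setup the emission matrix is shared across both clusters, which is one possible form of parameter tying assumed for illustration; a clickstream that stays on pages of one category receives most of its posterior mass from the "sticky" cluster.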