Instance Selection Techniques for Memory-Based Collaborative Filtering [chapter]

Kai Yu, Xiaowei Xu, Jianhua Tao, Martin Ester, Hans-Peter Kriegel
2002 Proceedings of the 2002 SIAM International Conference on Data Mining  
Collaborative filtering (CF) has become an important data mining technique for making personalized recommendations for books, web pages, movies, and similar items. One popular algorithm is memory-based collaborative filtering, which predicts a user's preference based on his or her similarity to other users (instances) in the database. However, with the tremendous growth of users and the large number of products, memory-based CF algorithms face the problem of deciding the right instances to use during
prediction, in order to reduce execution cost and excessive storage, and possibly to improve the generalization accuracy by avoiding noise and overfitting. In this paper, we focus our work on a typical user preference database that contains many missing values, and propose four novel instance reduction techniques called TURF1-TURF4 as a preprocessing step to improve the efficiency and accuracy of the memory-based CF algorithm. The key idea is to generate predictions from a carefully selected set of relevant instances. We evaluate the techniques on the well-known EachMovie data set. Our experiments showed that the proposed algorithms not only dramatically sped up prediction but also improved accuracy.
to predict the preference of a particular user (the active user) for the target items such as music CDs, books, web pages, or movies. The intuition behind the algorithm is that the active user will prefer those items that like-minded people prefer. So far, two general classes of CF algorithms have been widely investigated [Breese et al., 1998]. Memory-based CF, which is the most prevalent approach, operates over the entire user preference database to make predictions. In contrast, model-based algorithms use the preference database to infer a model, which is then applied for predictions. CF algorithms have been very successful in both research and practice. However, important research questions remain in overcoming two fundamental challenges for CF [Sarwar et al., 2000]. The first challenge is to improve the scalability and efficiency of CF algorithms. Existing CF algorithms can deal with thousands of consumers within a reasonable time, but modern E-Commerce systems are required to scale to millions of users. Efficiency is another intimately related issue. The prediction time of a request must be less than 1 second, and prediction engines must often support a throughput of several hundred requests per second [Herlocker et al., 1999].
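The memory-based prediction described above can be illustrated with a minimal sketch. It assumes a common Pearson-weighted formulation in the style of Resnick et al. (1994), with each user's mean taken over his or her full profile; this excerpt does not state the exact formula the authors use. The names `pearson`, `predict`, `db`, and `instances`, as well as the toy ratings, are hypothetical. The optional `instances` argument shows where a preselected subset of users would replace the entire database; how that subset is chosen (TURF1-TURF4) is not sketched here.

```python
from math import sqrt


def mean_rating(profile):
    """Mean of all ratings in one user's profile (dict: item -> rating)."""
    return sum(profile.values()) / len(profile)


def pearson(a, b):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma, mb = mean_rating(a), mean_rating(b)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (sqrt(sum((a[i] - ma) ** 2 for i in common))
           * sqrt(sum((b[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(db, active, item, instances=None):
    """Predict the active user's rating for `item`.

    By default every stored user is consulted (plain memory-based CF).
    Passing `instances` restricts the computation to a selected subset.
    """
    pool = db if instances is None else {u: db[u] for u in instances}
    base = mean_rating(active)
    num = den = 0.0
    for profile in pool.values():
        if item not in profile:  # missing value: this user cannot vote
            continue
        w = pearson(active, profile)
        num += w * (profile[item] - mean_rating(profile))
        den += abs(w)
    # Fall back to the active user's own mean when no neighbor contributes.
    return base + num / den if den else base


# Toy data (assumed for illustration only).
db = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 1, "b": 5, "c": 2},
    "u3": {"a": 4, "b": 2, "c": 5},
}
active = {"a": 5, "b": 2}

full = predict(db, active, "c")                      # uses all users
subset = predict(db, active, "c", instances=["u3"])  # uses a selected subset
```

The cost of `predict` grows with the number of stored profiles it scans, which is why restricting the pool to a small set of relevant instances can reduce prediction time.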
The second challenge is to improve the quality of the recommendations for the users. Users need recommendations they can trust to help them find products they will like. If a user trusts a recommender system, purchases a product, and then finds that he does not like it, the user will be unlikely to use the recommender system again. Much work has been conducted on this issue. In this paper, we focus our work on memory-based CF algorithms and address the problem of deciding which instances to use during prediction, in order to reduce time complexity and to improve the accuracy by avoiding noise and overfitting. Two cases of instance removal are considered: the first removes redundant instances whose preference pattern is already carried by other instances, and the second removes irrelevant instances whose preference profile is hard to generalize for prediction. Four novel instance selection techniques called TURF1-TURF4 are proposed as a preprocessing step for memory-based CF. The key idea is to speed up predictions by generating them over relevant and informative instances, instead of operating over the entire database. Our first algorithm works in an incremental manner: it starts with a few training instances and adds instances with novel profiles to the training set. The second algorithm works in a filtering manner: instances with a strongly rational profile are included in the training set. In the third algorithm, we combine the two former algorithms to pick instances with a novel and rational profile. In the fourth algorithm, we explore the potential for storage reduction. Our experiment on a real-world preference database confirms the efficiency and accuracy of our approaches. The rest of this paper is organized as follows. Section 2 introduces related work, covering CF algorithms and instance selection methods in memory-based learning.
In Section 3, we review memory-based CF algorithms over a preference database and then describe the four proposed instance selection algorithms one by one. Section 4 gives empirical results that indicate how well our proposed algorithms work in practice. Finally, Section 5 concludes the paper and outlines some interesting future work.
Related Work
The task in CF is to predict the preference of a particular user (the active user) based on a database of users' preferences. There are two general classes of CF algorithms: memory-based methods and model-based methods [Breese et al., 1998]. Memory-based algorithm [Breese et al., 1998, Resnick et al., 1994, Shardanand and
doi:10.1137/1.9781611972726.4 dblp:conf/sdm/YuXTEK02 fatcat:zhy5gl53prbvjhobfqwvjvdmce