Web-collaborative filtering: recommending music by crawling the Web
We show that it is possible to collect data that is useful for collaborative filtering (CF) using an autonomous Web spider. In CF, entities are recommended to a new user based on the stated preferences of other, similar users. We describe a CF spider that collects from the Web lists of semantically related entities. These lists can then be used by existing CF algorithms by encoding them as "pseudo-users". Importantly, the spider can collect useful data without pre-programmed knowledge about the
format of particular pages or particular sites. Instead, the CF spider uses commercial Web-search engines to find pages likely to contain lists in the domain of interest, and then applies previously proposed heuristics [Cohen, 1999] to extract lists from these pages. We show that data collected by this spider is nearly as effective for CF as data collected from real users, and more effective than data collected by two plausible hand-programmed spiders. In some cases, autonomously spidered data can also be combined with actual user data to improve performance.

The baseline datasets are fairly large: there are 5,095 downloads from 353 IP addresses in the test set, 23,438 downloads from 1,028 IP addresses in the training set, and a total of 981 different artists associated with these downloads. In this dataset, almost all ratings (nearly 98%) are negative; thus, one would expect a positive rating to convey more information about a user's preferences than a negative rating.

The baseline training and test datasets are only an approximate reflection of real user preferences. One problem is the assumption that each IP address corresponds to a distinct user. While many IP addresses are static and correspond to a single-user workstation, some are dynamic, and hence correspond to a session by some user, or worse, to a set of distinct sessions by different, unrelated users. Further, some users access music from several fixed IP addresses (e.g., a home PC and a work PC), and conversely some fixed IP addresses might be used by several distinct users (e.g., a home PC that is used by several family members). Another issue is that many users download only a few files; in this case, it is certainly wrong to assume that all artists not downloaded are disliked.

T associated with the simulated interaction as LEN(U).
Note that LEN(U) is bounded by the number of artists rated positively by U. For most users, LEN(U) is relatively small: the median value of LEN(U) is only 10. This is shown in the figure below, which plots, for each t, the number of users U in the test set such that LEN(U) >= t. We will typically truncate all of our result graphs at t=50, as beyond this point there are only a handful of distinct users.
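
The curve described above can be sketched in a few lines. This is only an illustration of the counting involved, not the code used in the experiments: the function name `users_with_len_at_least` and the six LEN(U) values are hypothetical, chosen so that the median is 10 as in the text.

```python
from statistics import median


def users_with_len_at_least(lengths, max_t):
    """Map each t in 1..max_t to the number of users U with LEN(U) >= t."""
    return {t: sum(1 for n in lengths if n >= t) for t in range(1, max_t + 1)}


# Hypothetical LEN(U) values for six test users (not taken from the real logs).
example_lengths = [2, 3, 10, 10, 25, 60]

# Truncate at t = 50, mirroring the cutoff used for the result graphs.
curve = users_with_len_at_least(example_lengths, max_t=50)

print("median LEN(U):", median(example_lengths))
for t in (1, 10, 50):
    print(f"users with LEN(U) >= {t}: {curve[t]}")
```

Because the count can only stay the same or fall as t grows, the resulting curve is non-increasing, which is why only a handful of users remain beyond the cutoff.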