Mining Music from Large-Scale, Peer-to-Peer Networks

Y Shavitt, E Weinsberg, U Weinsberg
2011 IEEE Multimedia  
Millions of users worldwide use peer-to-peer (P2P) networks for sharing content, and a high percentage of this content is multimedia, such as songs and movies [1]. As such, P2P networks are an invaluable resource for various multimedia information retrieval (MIR) tasks because of the large data set, the ability to capture data without the collaboration of P2P network operators, and a large and diverse population.

However, P2P networks are quite complex, exhibit high user churn, and contain high amounts of noise in the user-generated content. This makes it difficult to collect a complete snapshot of the network content. Additionally, the network often holds slightly different duplicates of the same file, which might have different file hashes, file names, and metadata tags. Duplication in metadata tags is typically caused by spelling mistakes, missing data, and different variations of the correct values. Finally, P2P user-to-item mappings are extremely sparse because of the vast amount of content, most of which is quite scarce, making user preferences hard to deduce.

These complexities make it difficult to mine meaningful data from P2P networks. For example, even though improved search schemes [2] and recommender systems [3] have been proposed to help users find content, current P2P networks mostly employ simple string-matching algorithms against file names and metadata, either distributed or centralized, usually using a Web-based search engine. In the Gnutella network, this method results in only 7 to 10 percent of queries successfully returning useful content [4].

Although these approaches clearly need improvement, recommender systems require meaningful data sets. Current recommender systems mostly rely on the willingness of users to rank their preferences in order to provide better recommendations. However, the absence of explicit rankings in P2P networks and the previously mentioned complexities make it difficult to create efficient recommender systems, thus contributing to the increasing frustration of users. The main objective of the work described in this article is to overcome these difficulties and improve the ability to efficiently mine music content in data sets originating from P2P networks.
For this project, we studied the musical content shared by users in Gnutella [5], then built a song-similarity graph, in which the similarity between two songs is based on the number of users who share both songs. We accounted for missing metadata by clustering the similarity graph and finding groups of similar songs. This article describes how the resulting clusters hold songs of varying popularity with a high prevalence of dominant genres and artists, properties that are especially useful for recommender systems.

Song-similarity graph

Collecting the shared songs from a P2P network requires crawling the network, that is, traversing it in a way similar to how Web crawlers behave. Using the Gnutella protocol, we can discover, for each crawled user, the set of peered users and the files they are sharing. We collected the data set used in this article in a 24-hour Gnutella crawl on 25 November 2007. At that time, Gnutella was the most popular file-sharing network [6]. The crawler reached more than 1.2 million Gnutella users and recorded more than 373 million files. Because this article focuses on MIR data, we identified music-related files by indexing only the files with a music-related suffix (MP3, WMA, FLAC, M4P, and M4A). We found that music-related content accounted for more than 75 percent of the files on the network, that is, more than 281 million files. These figures strengthen the notion that P2P
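The suffix-based filtering and the co-occurrence similarity described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `crawl` dictionary, the user identifiers, and the function names are hypothetical, and the sketch assumes that a file name identifies a song, which the article notes is not reliable in practice because duplicates carry different names, hashes, and tags.

```python
from collections import Counter
from itertools import combinations
from pathlib import PurePosixPath

# Suffixes the article lists as music-related.
MUSIC_SUFFIXES = {".mp3", ".wma", ".flac", ".m4p", ".m4a"}


def is_music_file(file_name: str) -> bool:
    """Return True if the file name ends with a music-related suffix."""
    return PurePosixPath(file_name).suffix.lower() in MUSIC_SUFFIXES


def build_similarity_graph(shared_files: dict[str, list[str]]) -> Counter:
    """Count, for every pair of songs, how many users share both.

    shared_files maps a (hypothetical) user id to the file names that
    user shares. The result maps a sorted (song_a, song_b) pair to the
    number of users sharing both songs, used here as the edge weight.
    """
    edge_weights: Counter = Counter()
    for files in shared_files.values():
        # Assumption: the file name stands in for the song identity.
        # Deduplicate so each user counts at most once per pair.
        songs = sorted({f for f in files if is_music_file(f)})
        for pair in combinations(songs, 2):
            edge_weights[pair] += 1
    return edge_weights


# Invented toy data standing in for crawl output.
crawl = {
    "user1": ["a.mp3", "b.mp3", "notes.txt"],
    "user2": ["a.mp3", "b.mp3", "c.flac"],
    "user3": ["b.mp3", "c.flac", "movie.avi"],
}
graph = build_similarity_graph(crawl)
```

Enumerating all pairs per user is quadratic in the size of each user's shared folder, so a crawl of the scale reported here would likely need thresholds or sampling; the sketch ignores that concern.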
doi:10.1109/mmul.2011.13 fatcat:pmzvyu3sanfb7h2wt5txpaux3a