Scalable mining and link analysis across multiple database relations

Xiaoxin Yin
2008 SIGKDD Explorations  
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Unfortunately, most existing data mining approaches can only handle data stored in single tables, and cannot be applied to relational databases. Therefore, it is an urgent task to design data mining approaches that can discover knowledge from multi-relational
... In this thesis we study three of the most important data mining tasks in multi-relational environments: classification, clustering, and duplicate detection. Since information is widely spread across multiple relations, the most crucial and common challenge in multi-relational data mining is how to utilize the relational information linked with each object. We rely on two types of information, neighbor tuples and linkages between objects, to analyze the properties of objects and the relationships among them. Because of the complexity of multi-relational data, efficiency and scalability are two major concerns in multi-relational data mining. In this thesis we propose scalable and accurate approaches for each data mining task studied. To achieve high efficiency and scalability, the approaches use novel techniques for virtually joining different relations, single-scan algorithms, and multi-resolutional data structures to dramatically reduce computational costs. Our experiments show that our approaches are highly efficient and scalable, and also achieve high accuracy in multi-relational data mining.

To my dear wife Wen, for her love, encouragement, and support.

Acknowledgments

It is amazing to look back and see how things have changed in the past several years. A new graduate student with very little research experience, as I was five years ago, has become a Ph.D. candidate capable of performing research independently on newly emerging data mining issues. I wish to express my deepest gratitude to my advisor, Dr. Jiawei Han, who made these changes possible. As an advisor, he has given me endless encouragement and support by sharing his knowledge and experience. More importantly, he taught me to choose worthwhile topics and to think deeply, from which I will benefit for the rest of my career. I am very thankful to Dr. Philip S. Yu for the insightful advice he gave me in our many discussions. He has provided great help in improving the quality of my research and this thesis.
doi:10.1145/1540276.1540283 fatcat:xr3vnpi33vfybiecexzjhz66ua