Local Higher-Order Graph Clustering

Hao Yin, Austin R. Benson, Jure Leskovec, David F. Gleich
2017 Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '17  
Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology.
[...] targeted graph clustering, is a specific case of this problem that takes an additional input in the form of a seed set of vertices. The idea is to identify a single cluster near the seed set without ever exploring the entire graph, which makes local clustering methods much faster than their global counterparts. Because of its speed and scalability, this approach is frequently used in applications including ranking and community detection on the Web [13, 16], social networks [22], and bioinformatics [24]. Furthermore, the seed-based targeting is also critical to many applications. For example, in the analysis of protein-protein interaction networks, local clustering aids in determining additional members of a protein complex [45]. The theory and algorithms for local approaches are best developed when using conductance as the cluster quality measure [2, 50].
Conductance, however, is only defined for simple undirected networks. Using principled local clustering methods for networks involving signed edges, multiple edge types, and directed interactions has remained an open challenge. Moreover, current cluster quality measures simply count individual edges and do not consider how these edges connect to form small network substructures, called network motifs. Such higher-order connectivity structures are crucial to the organization of complex networks [5, 35, 48], and it remains an open question how network motifs can be incorporated into local clustering frameworks.
Designing new algorithms for local higher-order graph clustering that incorporate higher-order connectivity patterns has the potential to lead to improved clustering and knowledge discovery in networks. There are two main advantages to local higher-order clustering. First, it provides new types of heretofore unexplored local information based on higher-order structures. Second, it provides new avenues for higher-order structures to guide seeded graph clustering. In our recent work, we established a framework that generalizes global conductance-based clustering algorithms to cluster networks based on higher-order structures [5]. However, multiple issues arise when this framework is applied to local graph clustering methodologies, and we address them here.
Present work: Local higher-order clustering. In this paper we develop local algorithms for finding clusters of nodes based on higher-order network structures (also called network motifs, Figure 1). Our local methods search for a cluster (a set of nodes) S with minimal motif conductance, a cluster quality score designed to incorporate the higher-order structure and handle edge directions [5]. More precisely, given a graph G and a motif M, the algorithm aims to find a set of nodes S that has good motif conductance (for motif M) such that S contains a given set of seed nodes.
Cluster S has good (low) motif conductance for some motif M if the nodes in S participate in many instances of M and there are few instances of M that cross the set boundary defined by S. Figure 2 illustrates the concept of motif conductance, where the idea is that we do not count the number of edges that are cut, but the number of times a given network motif M gets cut. This way edges that do not participate in a given motif (say, a triangle) do not contribute to the conductance. Motif conductance has the benefit that it allows us to focus the clustering on particular network substructures that are important for networks of a given domain. For example, triangles are important higher-order structures of social networks [19] and thus focusing the clustering on such substructures can lead to improved results.
Before deriving our algorithms, we first go over the basic notation and cluster quality scores that we use throughout the paper. Our datasets will be simple, unweighted, possibly directed graphs G = (V, E) with adjacency matrix A. We denote n = |V| as the number of nodes and m = |E| as the number of edges. Our algorithms will sometimes use a weighted graph G_w = (V, E, W).
Cut, volume, and conductance. The cut of a set of nodes S ⊂ V, denoted by cut(S), is the number of edges with one end point in S and the other end point in the complement set S̄ = V \ S. The volume of a set of nodes S, denoted by vol(S), is the number of edge end points
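To make the difference between counting cut edges and counting cut motifs concrete, here is a minimal Python sketch for an undirected graph with the triangle as the motif M. It is an illustration only, not the MAPPR algorithm. The function names, the demo graph, and the quotient form cut / min(vol(S), vol(S̄)) are assumptions of this sketch; the motif volume is assumed to count the triangle end points that fall in S, following the usual definition of motif conductance, and both sides of the partition are assumed to have nonzero volume.

```python
def edge_conductance(edges, S):
    """Edge conductance of node set S in an undirected simple graph.

    Assumes the form cut(S) / min(vol(S), vol(complement of S)).
    """
    S = set(S)
    # An edge is cut when exactly one of its end points lies in S.
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    # Volume: number of edge end points that lie in S.
    vol_S = sum((u in S) + (v in S) for u, v in edges)
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)


def triangles(edges):
    """Return every triangle of the graph as a frozenset of three nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        # Any common neighbor of u and v closes a triangle on edge (u, v).
        for w in adj[u] & adj[v]:
            found.add(frozenset((u, v, w)))
    return found


def triangle_conductance(edges, S):
    """Motif conductance of S for the triangle motif (illustrative sketch)."""
    S = set(S)
    tris = triangles(edges)
    # A triangle is cut when it has nodes on both sides of the boundary.
    cut = sum(1 for t in tris if 0 < len(t & S) < 3)
    # Motif volume: number of triangle end points that lie in S.
    vol_S = sum(len(t & S) for t in tris)
    vol_rest = 3 * len(tris) - vol_S
    return cut / min(vol_S, vol_rest)


# Hypothetical demo graph: two triangles joined by the single edge (2, 3).
demo_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(edge_conductance(demo_edges, {0, 1, 2}))
print(triangle_conductance(demo_edges, {0, 1, 2}))
```

For S = {0, 1, 2} the bridging edge (2, 3) is cut, so the edge conductance is positive, but that edge belongs to no triangle, so no motif instance is cut and the triangle conductance is zero. This is the sense in which edges that do not participate in the motif do not contribute to the score.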
doi:10.1145/3097983.3098069 pmid:29770258 pmcid:PMC5951164 dblp:conf/kdd/YinBLG17 fatcat:logzwli35rfi5koy7r5puxiao4