Use of Local Group Information to Identify Communities in Networks
Sucheta Soundarajan, John E. Hopcroft
ACM Transactions on Knowledge Discovery from Data
The recent interest in networks has inspired a broad range of work on algorithms and techniques to characterize, identify, and extract communities from networks. Such efforts are complicated by a lack of consensus on what a "community" truly is, and these disagreements have led to a wide variety of mathematical formulations for describing communities. Often, these mathematical formulations, such as modularity and conductance, have been founded in the general principle that communities, like a
... n, p) graph, are "round," with connections throughout the entire community, and so algorithms were developed to optimize such mathematical measures. More recently, a variety of algorithms have been developed that, rather than expecting connectivity through the entire community, seek out very small groups of well-connected nodes and then connect these groups into larger communities. In this article, we examine seven real networks, each containing external annotation that allows us to identify "annotated communities." A study of these annotated communities gives insight into why the second category of community detection algorithms may be more successful than the first category. We then present a flexible algorithm template that is based on the idea of joining together small sets of nodes. In this template, we first identify very small, tightly connected "subcommunities" of nodes, each corresponding to a single node's "perception" of the network around it. We then create a new network in which each node represents such a subcommunity, and then identify communities in this new network. Because each node can appear in multiple subcommunities, this method allows us to detect overlapping communities. When evaluated on real data, our template outperforms many other state-of-the-art algorithms.
... advantages, it is not always clear whether a particular mathematical definition of "community" is correct. Many such mathematical definitions are based on the principle that a community ought to be "round," or well connected throughout: for example, a good community might resemble a G(n, p) graph within a larger network (with p large relative to the edge density of the rest of the network). Classic examples of such definitions include modularity and conductance, both of which reward a set of nodes for having high connectivity throughout the entire set.
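To make the "round" criterion concrete, the following minimal sketch computes conductance under one commonly used definition: the number of edges leaving a node set divided by the smaller of the total degree inside and outside the set. The text above does not state a formula, so this definition, the adjacency-dictionary representation, and the toy graph are all assumptions made for illustration. Under this definition, lower values indicate a more self-contained set.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    `adj` maps each node to the set of its neighbors (an assumed
    representation, not one prescribed by the article).
    """
    inside = set(community)
    # Edges with exactly one endpoint inside the set.
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    # Total degree (volume) on each side of the cut.
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj if u not in inside)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0

# Toy graph: two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

print(conductance(adj, {0, 1, 2}))  # one triangle: 1/7, about 0.143
print(conductance(adj, {0, 1, 3}))  # a poorly chosen set: 5/7, about 0.714
```

The triangle scores far better than the arbitrary set because almost all of its edges stay inside it, which is the behavior a "round" definition rewards.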
Another type of mathematical definition is founded in the belief that communities are "long," or formed of many small groups that are individually well connected, and while these small groups may be well connected to one another, each individual node in a group may not be well connected to the rest of the community. For example, the popular Clique Percolation algorithm [Palla et al. 2005] first identifies cliques of a certain size, and then "rolls" together adjacent cliques (those sharing all but one node) to find larger communities. While portions of such a community are certainly well connected (as they are cliques), an individual node need not have any connection to more distant cliques.
Because there is little consensus about what real communities are, rather than creating and evaluating our method based on some mathematical criterion, we choose to design it and test it using real data. We use a collection of seven network datasets from varied domains, including social, product, and biological. Each of these networks contains some sort of external annotation that allows us to identify "annotated communities." For example, in a social network of students at a university, all students in the same department constitute one annotated community.
First, to gain insight into which of the two concepts of "community" is more realistic, we examine the annotated communities in detail. We demonstrate that annotated communities tend to be much "longer" than random graphs of the same size, and so conclude that the "long" model of communities as sets of small groups may better characterize annotated communities. We then decompose each annotated community into several constituent parts, and show that these parts tend to fit the "round" model much better than do the complete annotated communities.
Working with these principles, we create the Node Perception algorithm template for finding overlapping communities in networks.
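The Clique Percolation procedure described above (find cliques of a fixed size, then merge cliques that share all but one node) can be sketched as follows. This is a brute-force illustration for tiny graphs, not the implementation of Palla et al.; the helper names `build_adj`, `k_cliques`, and `clique_percolation` and the example graph are hypothetical.

```python
from itertools import combinations

def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_cliques(adj, k):
    """All k-node cliques, found by testing every k-subset of nodes."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def clique_percolation(adj, k):
    """Merge k-cliques that share k - 1 nodes into communities."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:  # adjacent cliques
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

# A chain of three triangles (0-1-2, 1-2-3, 2-3-4) plus a triangle
# (4-5-6) that touches the chain only at node 4.
adj = build_adj([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
                 (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)])

print(clique_percolation(adj, 3))  # [[0, 1, 2, 3, 4], [4, 5, 6]]
```

In this example the first community is "long" in the sense used above: nodes 0 and 4 belong to the same community but share no edge. Node 4 also appears in both communities, so the output overlaps.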
Our method is founded partly in the intuition that while individuals may belong to many different communities, a relationship between two individuals will generally fall solidly into one community. Given this, individuals in a network should be able to partition their neighbors into disjoint "subcommunities" that are portions of larger communities. For example, an individual person may be in many communities, such as her workplace, a university department at her school, an extended family, and so on. While she cannot name every individual in these communities, she can probably identify which of her acquaintances fall into each of these communities, and so can group her neighbors into subcommunities (e.g., "my coworkers," "my classmates," etc.). These subcommunities can be identified through use of a simple graph partitioning algorithm, or, in some cases, may be more accurately identified with available metadata (e.g., if several people frequently appear together in photographs). We then identify communities in a new network in which each node represents a subcommunity. Each node from the original network can be represented by multiple subcommunities, so a node can appear in many different communities. Because a practitioner may choose how to identify subcommunities, how to create a network of subcommunities, and how to identify communities in that new network, this template is highly flexible and can easily be tuned to meet the user's needs. To evaluate our method, we test how well it recovers the set of annotated communities. Because it is a flexible template, we consider several specific instances, and show that all of these instances outperform several other popular methods for identifying communities.
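The three steps of the template can be illustrated with a minimal sketch. Every concrete choice here is an assumption made for illustration and is not taken from the article: subcommunities are the connected components of the subgraph induced by a node's neighbors (with the node itself added), two subcommunities are linked when their Jaccard similarity reaches a hypothetical threshold `min_jaccard`, and communities in the new network are its connected components. A practitioner would substitute their own choice at each step.

```python
from itertools import combinations

def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def components(nodes, adj):
    """Connected components of the subgraph of `adj` induced by `nodes`."""
    nodes, seen, out = set(nodes), set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w in nodes and w not in seen:
                    seen.add(w)
                    stack.append(w)
        out.append(comp)
    return out

def subcommunities(adj):
    """Step 1 (assumed variant): each node partitions its neighbors."""
    subs = []
    for v in sorted(adj):
        for comp in components(adj[v], adj):
            subs.append(frozenset(comp | {v}))
    return subs

def node_perception(adj, min_jaccard=0.5):
    subs = subcommunities(adj)
    # Step 2 (assumed variant): link sufficiently similar subcommunities.
    sub_adj = {i: set() for i in range(len(subs))}
    for i, j in combinations(range(len(subs)), 2):
        a, b = subs[i], subs[j]
        if len(a & b) / len(a | b) >= min_jaccard:
            sub_adj[i].add(j)
            sub_adj[j].add(i)
    # Step 3 (assumed variant): components of the subcommunity network,
    # mapped back to the original nodes they contain.
    found = set()
    for comp in components(sub_adj, sub_adj):
        found.add(frozenset().union(*(subs[i] for i in comp)))
    return sorted(sorted(c) for c in found)

# Two 4-cliques, {0, 1, 2, 3} and {3, 4, 5, 6}, sharing only node 3.
edges = list(combinations([0, 1, 2, 3], 2)) + list(combinations([3, 4, 5, 6], 2))
adj = build_adj(edges)

print(node_perception(adj))  # [[0, 1, 2, 3], [3, 4, 5, 6]]
```

Node 3 splits its neighbors into two subcommunities, one per clique, and so ends up in both detected communities. This is how the template produces overlapping output even though each node's partition of its own neighbors is disjoint.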