Protein Sequence Annotation by Means of Community Detection
Giuseppe Profiti, Damiano Piovesan, Pier Martelli, Piero Fariselli, Rita Casadio
2015
Current Bioinformatics
The improvement of sequencing technologies is increasing the volume of biosequences in databases. Experimental validation of genomes and proteomes is, however, far too slow compared to the pace at which data are being produced, and electronic annotation is the current solution to this problem. The annotation of a new sequence is inferred from experimentally validated reference proteins using different algorithms. Recently we developed BAR+ [1], an annotation system that is cluster-centric: a
... n enters an annotated cluster provided that it shares at least 40% sequence identity over at least 90% of the alignment length with a protein of the cluster. From the cluster the protein inherits all the statistically validated features that characterize the cluster. These can include GO terms, Pfam domains and protein structure. Clusters in BAR+ were generated by splitting the components of graphs in which two nodes (two proteins) are linked when they share at least 40% sequence identity over at least 90% of the pairwise sequence alignment [2]. BAR+ clusters are therefore graphs in which protein sequences are the nodes and similarity relationships are the edges, with weight equal to the evaluated sequence identity between the pair of nodes. Over 13 million protein sequences have been clustered into 913962 clusters, with cluster size up to 87893 nodes. Here we enhance the level of detail within BAR+ clusters by applying algorithms used to identify communities in graphs. The aim is to subcluster sequences that, within the same cluster, share more specific functional and structural features. A community is defined as a subset of nodes having more edges leading to members of the same community than to other nodes in the graph. The term community comes from the original application of this concept to social networks; however, community detection is now also used to assess the robustness of network infrastructures and to analyze interaction networks [3], [4]. This definition of community is somewhat vague, so a mathematical measure is needed in order to compare different assignments of nodes to communities in a graph. Different approaches to community detection have been developed [5], mostly relying on the maximization of a target function. Other clustering techniques, such as spectral methods and k-means, require a priori knowledge of the number of communities.
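As an illustration of the graph construction and of what a "target function" can look like, the following is a minimal Python sketch, not the BAR+ implementation. It links two proteins when they meet the 40% identity and 90% coverage thresholds quoted above, extracts connected components, and scores a candidate assignment of nodes to communities with Newman's weighted modularity. The choice of modularity is an assumption made for the example, since the text here does not name the target function used; the function names, the tuple layout of the input and the toy protein identifiers are likewise hypothetical.

```python
from collections import defaultdict

# Thresholds quoted in the text: >= 40% identity over >= 90% of the alignment.
MIN_IDENTITY = 40.0
MIN_COVERAGE = 90.0


def build_graph(alignments):
    """Build a weighted undirected graph from (a, b, identity, coverage) tuples.

    The tuple layout is an assumption for this sketch. Edge weight is the
    sequence identity of the pair, as described in the text.
    """
    graph = defaultdict(dict)
    for a, b, identity, coverage in alignments:
        if a != b and identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            graph[a][b] = identity
            graph[b][a] = identity
    return graph


def connected_components(graph):
    """Return the connected components of the graph as a list of node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def modularity(graph, communities):
    """Weighted modularity: sum over communities of L_c/m - (d_c/2m)^2.

    Assumed target function for this example; L_c is the internal edge
    weight of community c, d_c its total weighted degree, m the total
    edge weight of the graph.
    """
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        # Each internal edge is visited from both endpoints, so this is 2 * L_c.
        internal = sum(
            w
            for u in community
            for v, w in graph.get(u, {}).items()
            if v in community
        )
        degree = sum(w for u in community for w in graph.get(u, {}).values())
        q += internal / two_m - (degree / two_m) ** 2
    return q


# Toy data (hypothetical): two tight triangles joined by one weak edge.
alignments = [
    ("A", "B", 80.0, 95.0), ("A", "C", 80.0, 95.0), ("B", "C", 80.0, 95.0),
    ("D", "E", 80.0, 95.0), ("D", "F", 80.0, 95.0), ("E", "F", 80.0, 95.0),
    ("C", "D", 45.0, 92.0),   # weak bridge, still above both thresholds
    ("X", "Y", 35.0, 99.0),   # rejected: identity below 40%
    ("A", "D", 95.0, 50.0),   # rejected: coverage below 90%
]
graph = build_graph(alignments)
components = connected_components(graph)
q_single = modularity(graph, [{"A", "B", "C", "D", "E", "F"}])
q_split = modularity(graph, [{"A", "B", "C"}, {"D", "E", "F"}])
```

In this toy graph all six accepted proteins fall into one connected component, yet splitting it into the two triangles scores higher than keeping it whole, which is the kind of comparison between assignments that a target function makes possible.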
For our purpose, however, an algorithm capable of automatically detecting the communities without the need to set a parameter is
doi:10.2174/157489361002150518122954
fatcat:4t2ehykt45af5geprhrjrhjkiy