Protein Clustering and Classification
Cellular Origin and Life in Extreme Habitats and Astrobiology
Introduction
Proteins are the building blocks of all organisms. The name protein is derived from the Greek word 'protos', which means first or primal. Indeed, proteins are the most fundamental substance of life, as they are the key component of the protoplasm of all cells. In addition to their role as the building blocks of cells and tissues, proteins also execute and regulate most biological processes. Enzymes, hormones, transcription factors, pumps and antibodies are examples
of the diverse functions fulfilled by proteins in a living organism. Proteins are macromolecules consisting of amino acids joined by peptide linkages; they contain carbon, hydrogen, oxygen, nitrogen, and sulfur atoms. There are only 20 different types of amino acids, yet they can be combined into an effectively unlimited number of sequences. In reality, only a small subset of all possible sequences appears in nature.
Three attributes of proteins are central to their study: sequence, structure, and function. The sequence is the string of amino acids that makes up the protein. The structure is the way the protein is laid out in three-dimensional space. Perhaps most important, yet most elusive, is the function: the actual role of the protein in the specific organism in which it exists. Understanding protein function is critical for most applications, such as drug design, genetic engineering, or pure biological research.
The advent of advanced techniques for sequencing proteins over the last two decades has spurred the explosive growth witnessed today in protein databases. Because of this rate of growth, the biological function of a large fraction (between one third and one half, depending on the organism) of sequenced proteins remains unknown. The difficulty of assigning a function to a particular protein stems from the fact that the function of many proteins is defined by their context, in terms of protein partners and even localization. In addition, a protein may be subjected to a large number of modifications that may affect its function and fate. In this survey we do not address any of the dynamic properties of proteins; we treat a protein only as a predetermined string of amino acids. A common way to tackle the complexity of protein function prediction is to search databases for proteins similar to a new protein and to infer its function from them.
This method is generalized by protein clustering or classification, where databases of proteins are organized into groups or families in a manner that attempts to capture protein similarity. We survey the field of protein clustering and classification systems. Such systems use the protein sequence, and at times the structure, to classify proteins into families. The classification may then be leveraged for function inference.
The structure of this survey is as follows. We start by describing the most commonly used algorithms for sequence similarity, and the way they can be used directly for protein classification. The following sections describe classification systems grouped by the methodology used: motif-based classifications, full-sequence analysis classifications, phylogenetic classifications, structure-based classifications, and aggregated classifications that make use of the results of other classification systems. The last section provides a summary.
It is important to note that a vast amount of research related to the topic of this survey has been carried out in the field of proteomics. Herein, we focus primarily on publicly available software systems for protein classification. Furthermore, due to space constraints, we describe only a selected subset of the available systems and methods. An attempt was made to represent the full spectrum of methods and directions.
Sequence similarity
One of the most common approaches to classifying proteins is to use sequence similarity. Sequence similarity is a well-studied subject, and numerous software packages suited to biological sequences are available. Such packages (e.g. BLAST) are probably the most widely used software in the fields of biology and bioinformatics. Sequence similarity algorithms (and software) take two sequences as input and provide a measure of the distance or similarity between them. Note that these quantities are related in the sense that the higher the distance, the lower the similarity, and vice versa.
The notion of distance between sequences was formalized by Levenshtein (1965) as the 'edit distance', which can be computed with a dynamic programming algorithm. The edit distance between two strings is defined as the minimum number of insertions, deletions, and replacements of characters required to turn the first string into the second. Some variants of the edit distance also allow reversals of sub-strings. The edit distance problem is closely related to the problem of string alignment. The problems are essentially equivalent, as an alignment can easily be produced from the set of insertions, deletions, and replacements. In the context of biological sequences, similarity and alignment were first studied by Needleman and Wunsch (1970). Needleman-Wunsch sequence similarity and alignment are usually referred to as global sequence alignment; in other words, alignment of full-length sequences. In practice, only extremely similar sequences can be aligned well globally. However, many proteins exhibit strong local similarity. The local alignment problem was studied by Smith and Waterman (1981).
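To make the contrast between a global distance and a local similarity concrete, the following is a minimal Python sketch, not the implementation used by any of the packages mentioned here. The function names and the toy scoring scheme (match = 2, mismatch = -1, linear gap = -1) are assumptions chosen for illustration; real protein comparison typically uses substitution matrices and more elaborate gap penalties.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a: j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] versus the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # replace ca by cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -1) -> int:
    """Best local alignment score in the style of Smith and Waterman.

    The scoring parameters are illustrative assumptions, not values
    taken from the text.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamping at 0 lets an alignment start anywhere, which is
            # what makes the score local rather than global.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

For instance, the made-up strings "AAAWGHEAAA" and "CCCWGHECCC" are far apart globally (six replacements are needed), yet they share the segment "WGHE", which the local score picks up while ignoring the dissimilar flanks. This mirrors the observation above that many proteins align poorly over their full length but exhibit strong local similarity.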