Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction
Compensatory mutations between protein residues that are in physical contact with each other can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, high coupling coefficients predict residues contacts. Methods for de-novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation
... f coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on boosting contact prediction quality by adding external information, little progress has been made to improve the statistical procedure at the core. In that regard, our lack of understanding of the sources of noise poses a major obstacle. We have developed CCMgen, the first method for simulating protein evolution by providing full control over the generation of realistic synthetic MSAs with pairwise statistical couplings between residue positions. This procedure requires an exact statistical model that reliably reproduces observed alignment statistics. We also provide CCMpredPy, an implementation of persistent contrastive divergence (PCD), a precise inference technique that enables us to learn the required high-quality models. We demonstrate how CCMgen can facilitate the development and testing of contact prediction methods by analyzing the systematic noise contributions from phylogeny and entropy. For that purpose, we propose a simple entropy correction (EC) strategy which disentangles the correction for both sources of noise. We find that entropy contributes typically roughly twice as much noise as phylogeny.