Mean Dimension of Generative Models for Protein Sequences
Abstract

Generative models for protein sequences are important for protein design, mutational-effect prediction, and structure prediction. In all of these tasks, the introduction of models that include interactions between pairs of positions has had a major impact over the last decade. More recently, many methods going beyond pairwise models have been developed, for example neural networks that can in principle capture interactions among more than two positions in multiple sequence alignments. However, little is known about the inter-dependency patterns between positions in these models, or about how important higher-order interactions involving more than two positions are for their performance. In this work, we introduce the notion of mean dimension for generative models of protein sequences: the average number of positions involved in interactions, weighted by each interaction's contribution to the total variance of the model's log probability. We estimate the mean dimension for different model classes trained on different protein families, relate it to model performance on mutational-effect prediction tasks, and trace its evolution during training. We find that the mean dimension correlates with performance on biological prediction tasks and can highlight differences between model classes even when their performance on the prediction task is similar. The overall low mean dimension indicates that well-performing models are not necessarily of high complexity, and it encourages further work on interpreting their performance in biological terms.
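To illustrate the concept, the mean dimension of a function of independent inputs can be written, in the language of Sobol (ANOVA) sensitivity analysis, as the sum over variable subsets of the subset size weighted by its share of the total variance; a standard identity says this equals the sum of the normalized total Sobol indices. The sketch below is not the authors' method for sequence models; it is a hedged toy estimate for real-valued functions of Gaussian inputs, using Jansen's pick-and-freeze estimator of the total indices. All names (`mean_dimension`, the example functions) are illustrative choices, not from the paper.

```python
import numpy as np

def mean_dimension(f, d, n=20000, seed=0):
    """Monte Carlo estimate of the mean dimension of f on d i.i.d.
    standard-normal inputs.

    Uses Jansen's estimator of the total Sobol index for each coordinate:
        T_i = E[(f(X) - f(X with x_i resampled))^2] / (2 Var f(X)),
    and the identity: mean dimension = sum_i T_i (Owen).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # base sample
    Z = rng.standard_normal((n, d))   # independent resample source
    fX = f(X)
    var = fX.var()
    md = 0.0
    for i in range(d):
        Xi = X.copy()
        Xi[:, i] = Z[:, i]            # resample only coordinate i
        md += np.mean((fX - f(Xi)) ** 2) / (2.0 * var)
    return md

# Purely additive function: only single-position "interactions",
# so the mean dimension is 1 analytically.
md_additive = mean_dimension(lambda X: X.sum(axis=1), d=5)

# One pairwise term plus one additive term: x1*x2 + x3.
# Analytically the mean dimension is (2*1 + 1*1) / 2 = 1.5.
md_pairwise = mean_dimension(lambda X: X[:, 0] * X[:, 1] + X[:, 2], d=3)
```

A low mean dimension for a complex model, as the abstract reports, would mean that most of the variance of its log probability is carried by terms involving few positions, even if the architecture could in principle express higher-order interactions.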