Deep Generative Models of Protein Domain Structures Can Uncover Distant Relationships: Evidence for an Urfold
Recent advances in protein structure determination and prediction offer new opportunities to decipher relationships amongst proteins, a task that entails 3D structure comparison and classification. Historically, protein domain classification has been somewhat manual and heuristic. While CATH and related resources represent significant steps towards a more systematic and automatable approach, more scalable and objective classification methods will enable a fuller exploration of protein structure or 'fold' space. Comparative analyses of protein structure latent spaces may uncover distant relationships and could entail a large-scale restructuring of traditional classification schemes. We have developed 3D convolutional variational autoencoders to 'define' the ideal geometries and biophysical properties of proteins at CATH's homologous superfamily (SF) level. To quantitatively evaluate pairwise 'distances' between SFs, we built one model per SF and compared the evidence lower bound (ELBO) loss functions of the models when evaluated with representative structures from different SFs. Clustering on these distance matrices provides a new view of protein interrelationships, one that extends beyond simple structural/geometric similarity towards the realm of structure/function properties, and that is consistent with a recently proposed 'Urfold' concept.
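
To make the comparison step concrete, the following is a minimal sketch of how a matrix of ELBO-based losses could be turned into pairwise SF distances and then clustered. It is an illustration under stated assumptions, not the method used in this work: the abstract does not say how the losses are converted into distances or which clustering algorithm is applied. The function names `elbo_to_distance` and `average_linkage`, the symmetrization rule (the mean excess loss of each model on the other SF relative to its own SF), the choice of average linkage, and the toy numbers are all hypothetical.

```python
from itertools import combinations


def elbo_to_distance(loss):
    """Convert a square loss matrix into a symmetric distance matrix.

    loss[i][j] is assumed to be the ELBO-based loss of the model trained
    on SF i when evaluated on a representative structure of SF j.
    The symmetrization below is an illustrative assumption.
    """
    n = len(loss)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Excess loss of each model on the "foreign" SF relative to its own SF.
        excess_ij = loss[i][j] - loss[i][i]
        excess_ji = loss[j][i] - loss[j][j]
        # Average both directions and clip at zero so distances are non-negative.
        dist[i][j] = dist[j][i] = max(0.0, 0.5 * (excess_ij + excess_ji))
    return dist


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance until n_clusters remain. Returns lists of SF indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            mean = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or mean < best[0]:
                best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Toy loss values for four hypothetical SFs (made up, not results from the paper).
toy_loss = [
    [1.0, 1.2, 5.0, 5.5],
    [1.3, 1.1, 4.8, 5.2],
    [5.1, 4.9, 0.9, 1.4],
    [5.4, 5.0, 1.2, 1.0],
]

toy_dist = elbo_to_distance(toy_loss)
toy_clusters = average_linkage(toy_dist, n_clusters=2)
print(toy_clusters)
```

In this toy case, the models for the first two SFs reconstruct each other's representatives almost as well as their own, so those SFs end up close together and are grouped before either is joined with the other pair. For a realistic number of superfamilies, a library routine such as SciPy's hierarchical clustering would be a more practical choice than this quadratic-per-merge loop.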