Peptide partitions and protein identification: a computational analysis [article]

G Sampath
2016 bioRxiv   pre-print
Peptide sequences from a proteome can be partitioned into N mutually exclusive sets and used to identify their parent proteins in a sequence database. This is illustrated with the human proteome (; id UP000005640), which is partitioned into eight subsets KZ*R, KZ*D, KZ*E, KZ*, Z*R, Z*D, Z*E, and Z*, where Z ϵ {A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V} and Z* ≡ 0 or more occurrences of Z. If the full peptide sequence is known then over 98% of the proteins in the
more » ... me can be identified from such sequences. The rate exceeds 78% if the positions of four internal residue types are known. When the standard set of 20 amino acids is replaced with an alphabet of size four based on residue volume the identification rate exceeds 96%. In an information-theoretic sense this last result suggests that protein sequences effectively carry nearly the same amount of information as the exon sequences in the genome that code for them using an alphabet of size four. An appendix discusses possible in vitro methods to create peptide partitions and potential ways to sequence partitioned peptides.
doi:10.1101/069526 fatcat:o34qqxqi6bbfzoxzupfq7ihoi4