Statistical methods for identifying sequence motifs affecting point mutations
Mutation processes differ between types of point mutation, genomic locations, cells, and biological species. For some point mutations, specific neighbouring bases are known to be mechanistically influential. Beyond these cases, numerous questions remain unresolved, including: what are the sequence motifs that affect point mutations, how large are the motifs, and do they vary between samples? We present new log-linear models that allow explicit examination of these questions, along with sequence logo style visualisation to enable identification of specific motifs. We demonstrate the utility of these methods by analysing human germline and malignant melanoma mutations. We recapitulate the known CpG effect and identify numerous novel motifs, including a highly significant motif associated with A→G mutations. We show that major effects of neighbourhood on germline mutation lie within ±2 bases of the mutating base. We also present models for contrasting entire mutation spectra (the distribution of the different point mutations) and apply them to these data. We show the spectra vary significantly between autosomes and X-chromosome, with a difference in T→C transition dominating. Analyses of malignant melanoma confirmed reported characteristic features of this cancer, including strand asymmetry and markedly different neighbouring influences. The methods reported are made freely available as a Python library at https://bitbucket.org/pycogent3/mutationmotif.
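To make the idea of a log-linear test for neighbour influence concrete, the following is a minimal sketch, not the API of the library above and not necessarily the exact models of the paper. It assumes a simple two-way table of base counts at one flanking position, split by whether the sequence was sampled around a mutation or from a reference set, and computes the deviance (G statistic) of the independence log-linear model. The function name `g_test_independence` and the counts are hypothetical and chosen only for illustration.

```python
import math


def g_test_independence(table):
    """Return (G, df) for the independence log-linear model on a 2-D count table.

    G is the deviance 2 * sum(observed * ln(observed / expected)), where the
    expected counts come from the product of the row and column margins.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the deviance
            expected = row_totals[i] * col_totals[j] / total
            g += 2.0 * observed * math.log(observed / expected)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df


# Hypothetical counts of A, C, G, T at one flanking position (illustration only).
counts = [
    [120, 310, 150, 140],  # sequences centred on a mutation
    [180, 170, 185, 185],  # reference sequences
]
g, df = g_test_independence(counts)
print(f"G = {g:.2f} on {df} degrees of freedom")
```

Under the null hypothesis that the base at this position is independent of mutation status, G is approximately chi-square distributed with the reported degrees of freedom, so a large value suggests that the position influences mutation.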