Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction
by
Dean Sumner,
Jiazhen He,
Amol Thakkar,
Ola Engkvist,
Esben Bjerrum
2020
Abstract
SMILES
randomization, a form of data augmentation, has previously been shown to
increase the performance of deep-learning models compared to non-augmented
baselines. Here, we propose a novel data augmentation method we call "Levenshtein
augmentation", which considers local SMILES sub-sequence similarity between
reactants and their respective products when creating training pairs. The performance
of Levenshtein augmentation was tested using two state-of-the-art models: a
transformer and a sequence-to-sequence recurrent neural network with
attention. Levenshtein augmentation demonstrated increased performance over both
non-augmented data and data augmented by conventional SMILES randomization when
used to train baseline models. Furthermore, Levenshtein augmentation seemingly
results in what we define as <i>attentional gain</i>: an
enhancement in the underlying network's ability to recognize molecular motifs.
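The abstract does not give the augmentation procedure in detail, but its core ingredient is the Levenshtein (edit) distance between reactant and product SMILES strings. A minimal sketch of that ingredient is below; `pick_closest_variant` is a hypothetical helper (not from the paper) illustrating one plausible use: among randomized SMILES for the same reactant, keep the variant most similar to the product string.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, computed character by
    # character over the SMILES strings, keeping only one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def pick_closest_variant(variants, product_smiles):
    # Hypothetical helper (an assumption, not the paper's exact method):
    # select the randomized reactant SMILES with the smallest edit
    # distance to the product SMILES, maximizing local sub-sequence overlap.
    return min(variants, key=lambda s: levenshtein(s, product_smiles))


# Example: "CCO" is closer to the product string "CCOC" than "OCC" is.
best = pick_closest_variant(["OCC", "CCO"], "CCOC")
```

Minimizing edit distance in this way favors training pairs whose input and output share long common sub-sequences, which is the intuition the abstract attributes to the method.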
Archived Files and Locations
application/pdf, 689.1 kB (s3-eu-west-1.amazonaws.com, web.archive.org)
Stage: unknown
Date: 2020-07-06