Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction release_hmvwfwtbabaafdcxlt3w2dzjem

by Dean Sumner, Jiazhen He, Amol Thakkar, Ola Engkvist, Esben Bjerrum

Released as a post by American Chemical Society (ACS).

2020  

Abstract

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep learning models compared to non-augmented baselines. Here, we propose a novel data augmentation method we call "Levenshtein augmentation", which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: a transformer and a sequence-to-sequence recurrent neural network with attention. Levenshtein augmentation demonstrated increased performance over both non-augmented data and data augmented by conventional SMILES randomization when used for training of baseline models. Furthermore, Levenshtein augmentation seemingly results in what we define as "attentional gain" – an enhancement in the pattern recognition capabilities of the underlying network with respect to molecular motifs.
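The record does not detail the augmentation procedure itself, but the Levenshtein (edit) distance it is named after can be sketched with a standard dynamic-programming implementation. The function below and the example SMILES pair are illustrative assumptions, not taken from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A reactant/product SMILES pair with shared sub-sequences scores a
# small distance (here, ethanol vs. acetaldehyde):
print(levenshtein("CCO", "CC=O"))  # → 1
```

Presumably, such a distance lets reactant and product SMILES be compared so that training pairs with high local sub-sequence similarity can be selected or aligned.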

Archived Files and Locations

application/pdf   689.1 kB
file_liojg7naovhwbenu7ecsxr67tu
s3-eu-west-1.amazonaws.com (publisher)
web.archive.org (webarchive)
Type  post
Stage   unknown
Date   2020-07-06
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints).
Catalog Record
Revision: d382460e-cdd6-4519-9dd1-f317ddedae7a