Controlled exploration of chemical space by machine learning of coarse-grained representations [article]

Christian Hoffmann, Roberto Menichetti, Kiran H. Kanekal, Tristan Bereau
2019 arXiv pre-print
Chemical compound space is too large to be probed exhaustively. This forces high-throughput protocols to subsample drastically, resulting in sparse and non-uniform datasets. Rather than arbitrarily selecting compounds, we systematically explore chemical space according to the target property of interest. We first perform importance sampling by introducing a Markov chain Monte Carlo scheme across compounds. We then train a machine learning (ML) model on the sampled data to expand the region of chemical space probed. Our boosting procedure increases the number of compounds by a factor of 2 to 10, enabled by the ML model's coarse-grained representation, which both simplifies the structure-property relationship and reduces the size of chemical space. The ML model correctly recovers linear relationships between transfer free energies. These linear relationships correspond to features that are global to the dataset, and they mark the region of chemical space up to which predictions are reliable, a more robust alternative to the predictive variance. Bridging coarse-grained simulations with ML gives rise to an unprecedented database of drug-membrane insertion free energies for 1.3 million compounds.
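The importance-sampling step mentioned in the abstract can be illustrated with a minimal Metropolis sketch over a discrete compound space. This is not the authors' implementation: the bead alphabet `BEAD_TYPES`, the per-bead scores in `TOY_SCORE`, the function `toy_property`, and the inverse temperature `beta` are all hypothetical stand-ins chosen only to make the example runnable.

```python
import math
import random

# Hypothetical bead alphabet and made-up per-bead contributions (illustration only).
BEAD_TYPES = ["A", "B", "C", "D"]
TOY_SCORE = {"A": -1.0, "B": -0.5, "C": 0.5, "D": 1.0}


def toy_property(compound):
    """Stand-in for the target property; a real study would compute it by simulation."""
    return sum(TOY_SCORE[bead] for bead in compound)


def propose(compound, rng):
    """Symmetric move: replace one randomly chosen bead with a different bead type."""
    i = rng.randrange(len(compound))
    choices = [b for b in BEAD_TYPES if b != compound[i]]
    return compound[:i] + (rng.choice(choices),) + compound[i + 1:]


def metropolis_walk(start, n_steps, beta=1.0, seed=0):
    """Sample compounds with weight exp(-beta * toy_property) via Metropolis moves."""
    rng = random.Random(seed)
    current, f_cur = start, toy_property(start)
    visited = {current: f_cur}  # every compound seen so far, with its property value
    for _ in range(n_steps):
        cand = propose(current, rng)
        f_new = toy_property(cand)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if f_new <= f_cur or rng.random() < math.exp(-beta * (f_new - f_cur)):
            current, f_cur = cand, f_new
        visited[current] = f_cur
    return visited


sampled = metropolis_walk(("C", "D", "C"), n_steps=500)
```

The `sampled` dictionary plays the role of the non-uniform, property-focused dataset on which a model could subsequently be trained; larger `beta` concentrates the walk more tightly on compounds with low property values.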
arXiv:1905.01897v1 fatcat:2tyfrtrtifgufejlphvswwvp2e