Machine learning made easy for optimizing chemical reactions

Jason E. Hein
2021 Nature  
The optimization of reactions used to synthesize target compounds is pivotal to chemical research and discovery, whether in developing a route for manufacturing a life-saving medicine 1 or unlocking the potential of a new material 2. But reaction optimization requires iterative experiments to balance the often conflicting effects of numerous coupled variables, and frequently involves finding the sweet spot among thousands of possible sets of experimental conditions. Expert synthetic chemists currently navigate this expansive experimental void using simplified model reactions, heuristic approaches and intuition derived from observation of experimental data 3. On page 89, Shields et al. 4 report machine-learning software that can optimize diverse classes of reaction with fewer iterations, on average, than are needed by humans.
Machine learning has emerged as a useful tool for various aspects of chemical synthesis, because it is ideally suited to extrapolating predictive models that are used to solve synthetic problems by recognizing patterns in multidimensional data sets 5. However, chemists need to learn new skills to correctly deploy machine learning in their research, which limits the widespread adoption of this approach. Shields et al. address this problem by reporting an open-source software toolkit that can be easily adopted by chemists.
A range of machine-learning methods is now available, and the first task when developing any new application is to choose the most appropriate method. The choice depends on the type of data (numbers, pictures and so on), the number of data points available to train the system, and the desired output 6. Wrong choices can lead to false correlations being made during training and to ineffective predictive models.
To train their model, Shields and colleagues selected a method that uses a machine-learning approach called Bayesian optimization. Bayesian-optimization algorithms have proved exceptionally effective in other applications, but the authors are among the first to develop a reaction-optimization toolkit that uses this approach. Their open-source software contains all the components necessary for researchers to carry out Bayesian reaction optimization for systems that have any number of experimental variables.
The toolkit first uses a simple workflow to carry out a quantum-mechanical calculation that encodes the reaction of interest in a machine-readable format involving what are known as chemical descriptors 7. Reaction parameters that can be represented as a continuous series of numbers, such as temperature and concentration, are already in a form that can be interpreted by the algorithm. However, categorized reaction parameters, such as the identity of the solvent or catalyst, need to be provided by the chemist using one of several commonly applied molecular notations. Each molecule in the reaction is then decomposed by the toolkit into a subset of numerical values that describe the molecule's inherent chemical properties (molecular weight, charge density, bond strengths and so on), which can be interpreted by the algorithm 8. Some of the biggest pitfalls in the application of machine-learning methods to chemical systems arise in the execution of this decomposition process. After multiple trials, Shields and co-workers arrived at a balanced approach that can be generalized for a variety of reactions involving many diverse chemicals.
The second part of the workflow is the Bayesian-optimization step. As the authors' work highlights, Bayesian algorithms are well suited for reaction optimization because they excel at handling relatively small data sets 9. Starting from sparse data, the algorithm creates a surrogate model in an attempt to mathematically define how the input variables (reaction parameters) will affect the output target (the reaction yield or another measure of performance). At first, the model provides a poor approximation of the reaction system, but the algorithm also evaluates what is learnt when new reaction data are acquired to test the effects of the variables. The algorithm therefore suggests a new experiment for chemists to run, providing specific values for the reaction variables.
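The two-step workflow described here (encode the conditions as numbers, then let a surrogate model propose the next experiment) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' toolkit. The solvent labels, descriptor values, temperature grid and the synthetic run_experiment function are hypothetical placeholders, and the Gaussian-process surrogate with an expected-improvement rule is one common way to do Bayesian optimization, assumed here for illustration.

```python
import math
import random

# Hypothetical placeholder descriptors (NOT real chemical data): each solvent
# label maps to two made-up numeric features already scaled to [0, 1].
SOLVENT_DESCRIPTORS = {
    "solvent_A": (0.1, 0.8),
    "solvent_B": (0.5, 0.4),
    "solvent_C": (0.9, 0.2),
}
TEMPERATURES_C = [60, 70, 80, 90, 100, 110, 120]


def encode(solvent, temp_c):
    """Turn one set of conditions into a numeric feature vector."""
    t_scaled = (temp_c - 60) / 60  # continuous variable rescaled to [0, 1]
    return SOLVENT_DESCRIPTORS[solvent] + (t_scaled,)


def run_experiment(solvent, temp_c):
    """Synthetic stand-in for a real reaction; returns a yield in percent."""
    d1, _d2, t = encode(solvent, temp_c)
    return 100 * math.exp(-8 * ((d1 - 0.5) ** 2 + (t - 0.67) ** 2))


def kernel(x, y, length=0.3):
    """Squared-exponential covariance between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * length ** 2))


def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    low = [[0.0] * len(a) for _ in a]
    for i in range(len(a)):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Solve low @ x = b by forward substitution."""
    x = []
    for i, row in enumerate(low):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def solve_upper_t(low, b):
    """Solve low.T @ x = b by back substitution."""
    x = [0.0] * len(b)
    for i in reversed(range(len(b))):
        x[i] = (b[i] - sum(low[k][i] * x[k] for k in range(i + 1, len(b)))) / low[i][i]
    return x


def gp_posterior(xs, ys, queries, noise=1e-4):
    """Zero-mean Gaussian-process posterior (mean, std) at each query point."""
    gram = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    low = cholesky(gram)
    alpha = solve_upper_t(low, solve_lower(low, ys))
    results = []
    for q in queries:
        k_star = [kernel(x, q) for x in xs]
        v = solve_lower(low, k_star)
        mean = sum(k * a for k, a in zip(k_star, alpha))
        var = max(kernel(q, q) - sum(e * e for e in v), 1e-12)
        results.append((mean, math.sqrt(var)))
    return results


def expected_improvement(mean, std, best):
    """Expected gain over the best observed value under a normal prediction."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def optimize(n_initial=3, n_iterations=9, seed=0):
    """Start from random experiments, then let the surrogate pick each next one."""
    rng = random.Random(seed)
    candidates = [(s, t) for s in SOLVENT_DESCRIPTORS for t in TEMPERATURES_C]
    tested = rng.sample(candidates, n_initial)
    yields = [run_experiment(*c) for c in tested]
    for _ in range(n_iterations):
        untested = [c for c in candidates if c not in tested]
        xs, ys = [encode(*c) for c in tested], [y / 100 for y in yields]
        preds = gp_posterior(xs, ys, [encode(*c) for c in untested])
        scores = [expected_improvement(m, s, max(ys)) for m, s in preds]
        nxt = untested[scores.index(max(scores))]  # most promising untested conditions
        tested.append(nxt)
        yields.append(run_experiment(*nxt))
    return tested, yields
```

Calling optimize() returns the conditions tried, in order, and the corresponding yields. In a real campaign, run_experiment would be replaced by the chemist running the suggested reaction and reporting the measured outcome, and the descriptors would come from calculation rather than from placeholder numbers.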
An accessible machine-learning tool has been developed that can accelerate the optimization of a wide range of synthetic reactions, and reveals how cognitive bias might have undermined optimization by humans. See p.89
Figure 1 | Humans versus machine learning for reaction optimization. Shields et al. 4 have developed a machine-learning algorithm that optimizes the outcome of chemical reactions, and tested it in an optimization game. The authors selected a reaction, and defined five reaction variables that could be altered. They limited players to a fixed set of possibilities for each variable, and measured the reaction outcomes for all 1,728 possible combinations of variables. They then asked 50 expert chemists to carry out a virtual optimization: participants selected five combinations of variables and were shown the experimental outcomes, and could then select a new batch of five combinations to try to achieve the best possible reaction yield, up to a maximum of 20 batches (thin dashed lines indicate best yields per batch for each player; thick solid line indicates the mean average of the best yields). The algorithm also played the game 50 times, but started with random batches of variables. The experts made better initial choices, but the algorithm outperformed the players, on average, after the third batch of experiments (vertical grey dashed line). Moreover, the experts often did not achieve the optimal yield because they gave up too soon (blue dots indicate the end of each player's game), whereas the algorithm always achieved greater than 99% yields using its full allotment of batches.
[Figure 1 plot: two panels, "Expert chemists" and "Algorithm"; y-axis, Yield (%), 0 to 100; x-axis, Experiment batch, 0 to 20.]
40 | Nature | Vol 590 | 4 February 2021 | News & views
© 2021 Springer Nature Limited. All rights reserved.
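The experimental budget implied by the optimization game in the Figure 1 caption can be worked out directly. The short Python sketch below uses only numbers stated in the caption; the variable names are illustrative.

```python
TOTAL_COMBINATIONS = 1728  # all measured combinations of the five variables
BATCH_SIZE = 5             # combinations selected per batch
MAX_BATCHES = 20           # maximum number of batches per game
CROSSOVER_BATCH = 3        # batch after which the algorithm led, on average

max_experiments = BATCH_SIZE * MAX_BATCHES
fraction_explored = max_experiments / TOTAL_COMBINATIONS
experiments_at_crossover = CROSSOVER_BATCH * BATCH_SIZE

print(f"Maximum experiments per game: {max_experiments}")             # 100
print(f"Share of all combinations sampled: {fraction_explored:.1%}")  # 5.8%
print(f"Experiments run by the third batch: {experiments_at_crossover}")  # 15
```

Even a full game therefore samples fewer than 6% of the possible conditions, so choosing well within a small budget matters more than covering the whole space.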
doi:10.1038/d41586-021-00209-6 pmid:33536642 fatcat:5opnhiuchnaebi6q3264hyd5ye