A machine learning approach to predict metabolic pathway dynamics from time-series multiomics data

Zak Costello, Hector Garcia Martin
2018 npj Systems Biology and Applications  
10 New synthetic biology capabilities hold the promise of dramatically improving our ability to engineer biological systems. However, a fundamental hurdle in realizing this potential is our inability to accurately predict biological behavior after modifying the corresponding genotype. Kinetic models have traditionally been used to predict pathway dynamics in bioengineered systems, but they take significant time to develop, and rely heavily on domain expertise. Here, we show that the combination
more » ... of machine learning and abundant multiomics data (proteomics and metabolomics) can be used to effectively predict pathway dynamics in an automated fashion. The new method outperforms a classical kinetic model, and produces qualitative and quantitative predictions that can be used to productively guide bioengineering efforts. This method systematically leverages arbitrary amounts of new data to improve predictions, and does not assume any particular interactions, but rather implicitly chooses the most predictive ones. 11 12 Biology has been transformed in the second half of the 20th century from a descriptive science to a design science. This 13 transformation has been produced by a combination of the discovery of DNA as the repository of genetic information 1 , and 14 of recombinant DNA as an effective way to alter this instruction set 2 . The subsequent advent of genetic engineering and 15 synthetic biology as effective tools to engineer biological cells has produced numerous beneficial applications ranging from the 16 production of renewable biofuels and other bioproducts 3-6 to applications in human health 7-9 , creating the expectation of an 17 industrialized biology affecting almost every facet of human activity 10 . 18 However, effective design of biological systems is precluded by our inability to predict their behavior. We can engineer 19 changes faster than ever, enabled by DNA synthesis productivity that improves as fast as Moore's law 11 , and new tools like 20 CRISPR-enabled genetic editing, which have revolutionized our ability to modify the DNA in vivo 12 . In general, we can make 21 the DNA changes we intend (in model systems), but the end result on cell behavior is usually unpredictable 13 . At the same time, 22 there is an exponentially increasing amount of functional genomics data available to the experimenter in order to phenotype 23 the resulting bioengineered organism: transcriptomics data volume has a doubling rate of 7 months 14 , and high-throughput 24 workflows for proteomics 15 and metabolomics 16 are becoming increasingly available. Furthermore, the miniaturization of 25 these techniques and the progressive automation of laboratory work through microfluidics chips promises a future where data 26 analysis will be the bottleneck in biological research 17 . Unfortunately, the availability of all this data does not translate into 27 better predictive capabilities for biological systems: converting these data into actionable insights to achieve a given goal (e.g. 28 higher bioproduct yields) is far from trivial or routine. 29 Mathematical modeling provides a systematic manner to leverage these data to predict the behavior of engineered systems. 30 Hence, increasingly, computational biology is focusing on large-scale modeling of dynamical systems predicting phenotype 31 from genotype 18, 19 . However, computational biology is still nascent and not able to provide the high accuracy predictions that 32 we are accustomed to seeing in other engineering fields 20 . Arguably, the most widely used and successful modeling technique 33 in metabolic engineering involves analysis of internal metabolic fluxes (i.e. reaction rates) through stoichiometric models 34 of metabolism. Metabolic flux values are constrained by stochiometry, thermodynamic and evolutionary assumptions 21, 22 , 35 or experimental data (e.g. 13 C labeling experimental data 23-25 ), and used to suggest genetic interventions that bring cell 36 metabolism closer to the desired phenotype. While this approach has provided significant successes [26][27][28][29] , it has also shown its 37 limitations 30 due to its simplicity. Stochiometric models are limited for bioengineering purposes because they ignore enzyme 42 have the desired titers, rates and yields. Kinetic models rest on an explicit functional relationship connecting the rate of change 43 of a metabolite and the proteins and metabolites involved in the reaction (see Fig. 1 ): Michaelis-Menten kinetics 32, 33 is the 44 most common choice, but the fact is that the true mechanistic kinetic rate law for each specific reaction is unknown for most 45 enzymes 34 (alternatives include generalized mass action 35 , lin-log kinetics 36, 37 or power-law models 38 ). However, there is a 46 lack of reliable data for the enzyme activity and substrate affinity parameters used in these models: in-vitro characterization may 47 not be extrapolatable to in vivo conditions, and the effect of activators and inhibitors are typically unknown. Approaches such as 48 ensemble modeling 39-43 tackle the parsity of known kinetic constants by producing an ensemble of models displaying different 49 combinations of randomly chosen kinetic parameters and selecting only those models that match known experimental data, or 50 by optimizing the selection of these parameters through genetic algorithms 44, 45 . In a similar fashion, ORACLE 46, 47 produces 51 populations of models which are consistent with reaction stochiometry, thermodynamics and available concentration and 52 fluxomic data. By design, these approaches are able to match measured final production levels and flux data, and the predictions 53 have been shown to improve as the model approaches genome-scale coverage 45 . However, a significant problem remains in that 54 essential mechanisms are only sparsely known: allosteric regulation, for example, is known to be critical in order to determine 55 fluxes 48, 49 , and yet a comprehensive map of this regulatory mechanism is unavailable. Post-translational modifications of 56 proteins are also known to markedly affect catalytic activity 50 , but are still largely unmapped. Pathway channeling, too, 57 significantly affects catalytic rates but the degree to which this phenomenon occurs in metabolism has only begun to be 58 explored 51-53 . These and other gaps in our knowledge of mechanisms will require significant time and effort to be cleared. 59 Given the urgent need of predictive capabilities by the emerging biotech industry, it may be useful to consider a different 60 2/21 approach while this knowledge is gathered. 61 Here we propose an alternative to traditional kinetic modeling involving a machine learning approach (Figs. 1 and 2) , 62 in which the function that determines the rate of change for each metabolite from protein and metabolite concentrations is 63 directly learned from training data (equation 1 and Fig. S1 ), without presuming any specific relationship. Machine learning 64 has shown remarkable success in well bounded problems where a mechanistic model is impossible or difficult to develop: e.g. 65 artificial vision for driverless cars 54 , automated playing of the Go game 55 , automated language translation 56 or private trait 66 prediction from digital records of human behavior 57 with direct impact on national elections 58 . In biology, these methods have 67 recently been successfully applied to challenging problems such as predicting DNA and RNA protein binding sequences 59 , skin 68 cancer diagnosis 60 , single nucleotide polymorphism (SNP) and small indel variant calling 61 , and tumor detection in breast 69 histopathology 62 . 70 This alternative, machine-learning based, approach provides a faster development of predictive pathway dynamics models 71 since all required knowledge (regulation, host effects... etc) is inferred from experimental data, instead of arduously gathered 72 and introduced by domain experts (see supplementary material for an example). In this way, the method provides a general 73 approach, valid even if the host is poorly understood and there is little information on the heterologous pathway, and provides a 74 systematic way to increase prediction accuracy as more data is added. This method obtains better predictions than the traditional 75 Michaelis Menten approach for the limonene and isopentenol producing pathways studied here (Fig. 3) using only two times 76 series (strains), and is shown to significantly improve its prediction performance as more time series are added. The new 77 method is accurate enough to drive bioengineering efforts: we show it is able to predict the relative production ranking for 78 several designs, given enough data. This approach is a specific solution to the more general type of problem of determining 79 dynamics from observed data (system identification) 63-65 , a problem generally recognized as hard. We believe this approach is 80 scalable to genome-scale models, or generally applicable to other types of data (e.g. transcriptomics) or dynamic systems (e.g. 81 microbiome dynamics). 82 Mathematical Problem Formulation 83 Here, we describe the problem and its solution in succinct mathematical terms. Let us assume we are given q sets of time 84 99 Solving Problem 1 is equivalent to finding the metabolic dynamics which best describe the time series data provided. Once the 100 dynamics are learned we can then predict the behavior of the metabolic pathway by solving an initial value problem (equations 3 101 and 4). 102 Results and Discussion 103 We used the supervised learning method described above (Figs 1 and 2, Eqns 1, 2, 3 and 4) to predict pathway dynamics (i.e. 104 metabolite concentrations as a function of time) from protein concentration data for two pathways of relevance to metabolic 105 3/21 128 testable hypotheses. Finally, in a minority of cases (25%), the predictions are wrong both quantitatively and qualitatively: e.g. 129 HMG-CoA for the isopentenol producing pathway (Fig. 5b) , Mevalonate Phosphate (Fig. 6d) and IPP/DMAPP (Fig. 6e) for the 130 limonene producing pathway. Interestingly, the predictions for both final products (limonene and isopentenol) fell in the group 131 of quantitatively accurate predictions. This is important because, for the purpose of guiding metabolic engineering, it is the 132 final product predictions that are relevant. 133 The machine learning approach outperforms a handcrafted kinetic model of the limonene pathway (Fig. 6) . A realistic 134 kinetic model of this pathway was built and fit to the data, leaving all kinetic constants as free parameters (Figs. 3 and 4) . The 135 kinetic model notably fails to capture the qualitative dynamics for Acetyl-CoA, HMG-CoA, Mevalonate and IPP/DMAPP 136 (Figs 6a,b,c and e). More quantitatively, the machine learning model produces an average 130% error (RMSE = 8.42) versus an 137 5/21 1 [M K][M ev] K 4,2 [GP P ] + K 4,3 [M evP ] + K 4,4 [M ev] + K 4,5 d[M evP ] dt = K 4,1 [M K][M ev] K 4,2 [GP P ] + K 4,3 [M evP ] + K 4,4 [M ev] + K 4,5 − K5, 1[P M K][M evP ] K 5,1 + [M evP ] d[M evP P ] dt = K5, 1[P M K][M evP ] K 5,1 + [M evP ] − K 6,1 [P M D][M evP P ] K 6,2 [M evP ] + K 6,3 [M ev] + K 6,4 [M evP P ] + K 6,5 d[IP P ] dt = K 6,1 [P M D][M evP P ] K 6,2 [M evP ] + K 6,3 [M ev] + K 6,4 [M evP P ] + K 6,5 − K 7,1 [IDI][IP P ] K 7,2 + [IP P ] − K 8,1 [GP P S][IP P ][DM AP P ] K 8,2 + K 8,3 [IP P ]K 8,4 [DM AP P ] + [IP P ][DM AP P ] d[DM AP P ] dt = K 7,1 [IDI][IP P ] K 7,2 + [IP P ] − K 8,1 [GP P S][IP P ][DM AP P ] K 8,2 + K 8,3 [IP P ] + K 8,4 [DM AP P ] + [IP P ][DM AP P ] d[GP P ] dt = K 8,1 [GP P S][IP P ][DM AP P ] K 8,2 + K 8,3 [IP P ]K 8,4 [DM AP P ] + [IP P ][DM AP P ] − K 9,1 [LS][GP P ] K 9,2 + [GP P ] d[Limonene] dt = K 9,1 [LS][GP P ] K 9,2 + [GP P ] Supporting Information 427 S1 Appendix. Details of kinetic model for the limonene pathway. This file contains the details of the derivation and 428 sources for the model development for the limonene pathway model used in this paper, as well as supplementary figures S1, S2 429 and S3. 430 Code availability 431 Code is available at the following code repository: https://github.com/JBEI/KineticLearning. 432 Data availability 433 All data was obtained from Brunk et al 66 , and is also available at the code repository: https://github.com/JBEI/KineticLearning. 434 Acknowledgments 435 This work was part of the DOE Agile BioFoundry
doi:10.1038/s41540-018-0054-3 pmid:29872542 pmcid:PMC5974308 fatcat:b67iumgdn5hqloaxevlerb7l5y