Biochemical and statistical network models for systems biology

Nathan D Price, Ilya Shmulevich
2007 Current Opinion in Biotechnology  
Introduction The creation of models of the integrated functions of genes and proteins in cells is of fundamental and immediate importance to the emerging field of computational systems biology. Some of the most successful attempts at cell-scale modeling to date have been based on piecing together networks that represent hundreds of experimentally-determined biochemical interactions, while others have been very successful at inferring statistical networks from large amounts of high-throughput
more » ... a. These networks (metabolic, regulatory, or signaling) can be analyzed, and predictions about cellular behavior made and tested. Many types of models have been built and applied to study cellular behavior and in this review we focus on two broad types: biochemical network models and statistical inference models. Through iterative model prediction, experimentation, and network refinement, the molecular circuitry and functions of biological networks can be elucidated. The construction of genomescale models that integrate the myriad components that produce cellular behavior is a fundamental goal of systems biology today. Biochemical Reaction Networks Biochemical reaction networks represent the underlying chemistry of the system, and thus at a minimum represent stoichiometric relationships between inter-converted biomolecules. The stoichiometry of biochemical reaction networks can now be reconstructed at the genome-scale, and at smaller scale with sufficient detail to generate kinetic models. These biochemical reaction networks represent many years of accumulated experimental data and can be interrogated in silico to determine their functional states. Genome-scale models based on biochemical networks provide a comprehensive, yet concise, description of cellular functions. For metabolism the reconstruction of the biochemical reaction network is a well-established procedure [1] [2] [3] [4] [5] [6] [7] , while methods for the reconstruction of the associated regulatory [8, 9] and signaling networks [10] [11] [12] with stoichiometric detail are being developed. The typically used formalism is to reconstruct the stoichiometric matrix, where each row represents a molecular compound and each column represents a reaction. For metabolism, these networks are often focused on just the metabolites, where the existence of a protein that catalyzes this reaction is used to allow that reaction to be present in the network. It is also possible (and truer to the realities in the system) to represent the proteins themselves as compounds in the network, which enables the integration of proteomics, metabolomics, and flux data ( Figure 1 ). For regulatory and signaling networks, the inclusion of the proteins as compounds is essential. This process of reconstructing biochemical reaction networks has been referred to as the "two-dimensional annotation" of genomes [13] . Genome-scale models can be generated using a constraint-based approach, as has been reviewed in detail elsewhere [14] . This modeling process involves a three-step procedure. First, the biochemical network is reconstructed, as described above. Second, the physico-chemical constraints under which the reconstructed network operates are represented mathematically. Such constraints are based upon properties such as enzyme capacity, reaction stoichiometry and thermodynamics associated with reaction directionality and biochemical loops. This approach is most commonly used to describe fluxes through biochemical networks at steady state (i.e. flux balance analysis). The statement of constraints leads to the definition of a solution space that contains all non-excluded network states, describing the possible functions of the reconstructed network or all the allowable phenotypes. The third step is the determination of the possible solutions in this space that correspond to physiologically meaningful states. This constraint-based modeling procedure has been successfully utilized to study phenotypes in various model [15] [16] [17] [18] and infectious [19, 20] microorganisms. Recently, this approach has passed a significant milestone, namely the reconstruction of the human metabolic network [1]. Constraint-based analysis has been applied to selected human systems in the past [21, 22] , but with this reconstruction in place, this approach is poised to begin to have a much more significant impact on modeling human systems. Stoichiometric reaction networks can be expanded into dynamic models using standard differential equation models and the addition of information in the form of kinetic rate constants. These systems, unlike the usual constraint-based approach, do not make the steadystate assumption for all compounds but rather can simulate detailed dynamic behavior. Thus, they can model a greater degree of complexity for a given network and simultaneously account for both concentrations of compounds and fluxes through reactions. The disadvantage of these systems compared with constraint-based analysis is that they require many more parameters and are thus more data intensive to create and/or can be more prone to overfitting when large numbers of parameters are unknown. Thus, these models are not typically created at the whole cell scale, with a notable exception being cell-scale models of red blood cell metabolism [23, 24] . In recent years, these types of dynamic models have been used very successfully to provide mechanistic models of, for example, key signaling pathways involved in critical physiological and pathophysiological processes. For example, a differential equation model of NF-kB signaling [25] was updated and used to elucidate the role of tumor necrosis factor in controlling the sustained phase of lipopolysaccharide-induced IKK activation [26, 27] , demonstrating the usefulness of these models for deciphering functions of biological networks. Another example was the use of a model to study the integrated dynamic behavior of a network of epidermal growth factor receptor family members [28] . Christopher et al. have constructed a dynamic simulator of gene expression and signaling networks in a human cancer cell [29] . In the long run, such dynamic models set the stage for personalized medicine by offering the promise of rational drug design and control of the outcome of induced molecular perturbations. Statistical Influence Networks A second approach to modeling biological networks also holds tremendous potential for advancing knowledge of biology -namely, statistical influence networks. The growing amounts of microarray gene expression data along with improvements in data fidelity are now making it possible to make robust statistical systems-level inferences about the structure and dynamics of biomolecular control mechanisms, such as transcriptional regulatory networks. Price and Shmulevich Many approaches attempt to infer relationships between gene expression measurements using deterministic or stochastic formalisms. The fundamental idea behind these approaches is that models that faithfully capture such relationships have predictive capacity as regards system behavior and can be used to gain insight about system-wide properties, such as steady-state behavior or responses to perturbations or specific stimuli. There are a number of ways in which such relationships can be represented, both in the discrete and continuous domains. One popular modeling approach that captures the nonlinear multivariate relationships exhibited by biological control circuits, such as gene regulatory networks, is the class of Boolean networks, which owes its inception to the work of Stuart Kauffman in the late 1960s [30] . In the Boolean network model, the variables (e.g., genes or proteins) are binary-valued, meaning that their states can be either on or off, and the relationships between the variables are captured by Boolean functions. Each (target) gene is assigned a Boolean rule that determines its value as a function of the values of a set of other (predictor) genes, possibly including the target gene itself. System dynamics are generated by updating the Boolean functions, either synchronously or asynchronously, causing the system to transition from state to state in accordance with its Boolean update rules, where a state is a binary representation of the activities of all of the variables in the system (i.e., a binary vector representing the genes that are on or off at any given time). Boolean network models have been constructed and analyzed for a number of developmental and physiological processes. For example, Albert et al. constructed a Boolean network model for a subset of genes of the fruitfly Drosophila melanogaster, which describes different stable gene expression patterns in the segmentation process of the developing embryo [31] . The steady-state behavior of this model was in excellent agreement with experimentally observed expression patterns under wild type and several gene mutation conditions. This study highlighted the importance of the network topology in determining biologically correct asymptotic states of the system. Indeed, when the segment polarity gene control network was modeled with more detailed kinetic models, such as systems of nonlinear differential equations, exceptional robustness to changes in the kinetic parameters was observed [32]. Boolean networks have also been used to model the yeast and mammalian cell cycle [33, 34] . Li et al. demonstrated that the cell cycle sequence of protein states, which is a globally attracting trajectory of the dynamics, is extremely robust with respect to small perturbations to the network. The Boolean network formalism was also recently used to model systems-level regulation of the host immune response, which resulted in experimentally validated predictions regarding cytokine regulation and the effects of perturbations [35] . Boolean rules can be learned from gene expression data using methods from computational learning theory [36] and statistical signal processing [37] . A limitation of the Boolean network approach is its inherent determinism. Because of the inherent stochasticity of gene expression and the uncertainty associated with the measurement process due to experimental noise and possible interacting latent variables (e.g. protein concentrations or activation states that are not measured), the inference of a single deterministic function may result in poor predictive accuracy, particularly in the context of small sample sizes (e.g., number of microarrays) relative to the number of genes. One approach to "absorb" this uncertainty is to infer a number of simple functions (having few variables), each of which performs relatively well, and probabilistically synthesize them into a stochastic model, called a probabilistic Boolean network (PBN) [38] . The contribution of each function is proportional to its determinative potential as captured by statistical measures such as the coefficient of determination, which are estimated from the data [37] . The dynamical behavior of PBNs can be studied using the theory of Markov chains, which allows the Price and Shmulevich determination of steady-state behavior as well as systematic intervention and control strategies designed to alter system behavior in a specified manner [39] [40] [41] . The PBN formalism has been used to construct networks in the context of several cancer studies, including glioma [42], melanoma [41] , and leukemia [40] . PBNs, which are stochastic rule-based models, bear a close relationship to dynamic Bayesian networks [43] -a popular model class for representing the dynamics of gene expression. Bayesian networks are graphical models that have been used to represent conditional dependencies and independencies among the variables corresponding to gene expression measurements [44] . One limitation of Bayesian networks for modeling genetic networks is that these models must be in the form of directed acyclic graphs and, as such, are not able to represent feedback control mechanisms. Dynamic Bayesian networks, on the other hand, are Bayesian networks that are capable of representing temporal processes [45, 46] that may include such feedback loops. Since not all causal relationships can be inferred from correlation data, meaning that there can be different directed graphs that explain the data equally well, intervention experiments where genes are manipulated by overexpression or deletion have been proposed to learn networks [47] . The Bayesian network formalism has also been used to infer signaling networks from multicolor flow cytometry data [48] . There exist a number of other approaches for inferring large-scale molecular regulatory networks from high-throughput data sets. One example is a method, called the Inferelator, that selects the most likely regulators of a given gene using a nonlinear model that can incorporate combinatorial nonlinear influences of a regulator on target gene expression, coupled with a sparse regression approach to avoid overfitting [49] . In order to constrain the network inference, the Inferelator performs a preprocessing step of biclustering using the cMonkey algorithm [50] , which results in a reduction of dimensionality and places the inferred interactions into experiment-specific contexts. The authors used this approach to construct a model of transcriptional regulation in Halobacterium that relates 80 transcription factors to 500 predicted gene targets. Another method that predicts functional associations among genes by extracting statistical dependencies between gene expression measurements is the ARACNe algorithm [51]. This information-theoretic method uses a pairwise mutual information criterion across gene expression profiles to determine significant interactions. A key step in the method is the use of the so-called data processing inequality, which is intended to eliminate indirect relationships in which two genes are co-regulated through one or more intermediaries. Thus, the relationships in the final reconstructed network are more likely to represent the direct regulatory interactions. The ARACNe algorithm was applied to 336 genome-wide expression profiles of human B cells, resulting in the identification of MYC as a major regulatory hub along with newly identified and validated MYC targets [52] . A method related to the ARACNe algorithm, called the context likelihood of relatedness (CLR), also uses the mutual information measure but applies an adaptive background correction step to eliminate false correlations and indirect influences [53] . CLR was applied to a compendium of 445 E. coli microarray experiments collected under various conditions and compared to other inference algorithms on the same data set. The CLR algorithm had superior performance as compared to the other algorithms, which included Bayesian networks and ARACNe, when tested against experimentally determined interactions curated in the RegulonDB database. It also identified many novel interactions, a number of which were verified with chromatin immunoprecipitation [53] . Price and Shmulevich There are fundamental differences between the biochemical and statistical classes of network modeling described herein. One clear difference is the manner in which these underlying networks are reconstructed. For biochemical networks, reconstruction is typically a workintensive process requiring significant biochemical characterization with little network inference done (other than inferences of single gene function for catalyzing a reaction based on sequence homology). Thus, the ability to rapidly generate networks for organisms that are relatively uncharacterized from high-throughput data is an inherent advantage of the inferred statistical networks. One advantage of the biochemical network models is that, once reconstructed, the networks are not as subject to change (other than addition) since many of the links are based directly on biochemical evidence. Inferred networks, on the other hand, can undergo substantial changes in light of additional data. Another common difference, although not fundamental, is that constraint-based biochemical network models have mostly been used to model flux, whereas inference networks have mostly been used to predict substance amounts (e.g. mRNA expression). One way this can be thought of is that the biochemical network models currently link more closely to functional phenotype (i.e. fluxes) [54] , while the inferred networks relate more directly to available high-throughput data (i.e. transcriptomes). The kinetic biochemical network models, of course, have the capacity to account for both flux and abundance, but suffer from the limitation that they are by far the most data intensive to reconstruct. Another key advantage of biochemical reaction networks, stemming from their basis in chemistry, is that physico-chemical laws apply, such as mass-energy balance, while such laws are not generally applicable to the inferred networks. Of course, the advantage of the inferred networks is that, since they do not need to be mechanistic or require biochemical detail, they can be applied very broadly to systems that are not yet well characterized and can link very disparate data types as long as underlying correlations exist. In summary, both modeling types are essential to contemporary computational systems biology and provide advantages over each other in different settings. One interesting challenge going forward is whether hybrid models that take advantage of the strengths of the different modeling approaches can be constructed to move us further towards the goal of predictive whole-cell models and beyond. Early attempts have been done to link Boolean regulatory networks with constraint-based flux models [8] , but the extent to which these approaches can be married to provide significant advances in our ability to model biological networks remains an open question.
doi:10.1016/j.copbio.2007.07.009 pmid:17681779 pmcid:PMC2034526 fatcat:bpdls7rhbvdwjkoedgnkdzd5ci