Retrosynthetic design of heterologous pathways.

Tools from metabolic engineering and synthetic biology are synergistically used in order to develop high-performance cell factories. However, the number of successful applications has been limited due to the complexity of exploring efficiently the metabolic space for the discovery of candidate heterologous pathways. To address this challenge, retrosynthetic biology provides an integrated framework to formalize and rationalize the problem of importing biosynthetic pathways into a chassis organism using methods at the interface from bottom-up and top-down strategies. Here, we describe step by step the process of implementing a retrosynthetic framework for the design of heterologous biosynthetic pathways in a chassis organism. The method consists of the following steps: choosing the chassis and the target, selection of an in silico model for the chassis, definition of the metabolic space, pathway enumeration, gene selection, estimation of yields, toxicity prediction of pathway metabolites, definition of an objective function to select the best pathway candidates, and pathway implementation and verification.


Introduction
Production of value-added compounds such as drugs or biofuels in chassis organisms often requires importing heterologous genes to build efficient biosynthetic pathways.
Computational techniques have provided useful methods for the development of such cell factories through a streamlined process for modeling and design of a complete pathway at each step of its construction. Recent advances in systems and synthetic biology are further enabling a systematic practice through an integrated computational framework for rational biosynthetic pathway design. Nowadays, genome-scale metabolic network reconstructions providing an accurate in silico model of the metabolism are available for many industrial chassis organisms (1). In such models, the effects on steady-state fluxes of metabolic interventions like enhancement of substrate uptake, supplements addition, reduction of undesirable by-products fluxes, introduction of heterologous pathways, or product export to the extracellular medium (2) can be predicted with a remarkable degree of agreement with experimental observations (3). Furthermore, an increasingly formalization of the space of biochemical transformations in metabolic networks is allowing the designer to explore creative ways to implement alternative biosynthetic pathways (4).
To that end, metabolic modeling for heterologous pathway design can be done from two complementary approaches: a topological approach using hypergraphs, where catalytic reactions are hyperedges connecting node substrates to products; and a steady-state approach, where stoichiometry of reactions is used in order to study the properties of all feasible equilibrium states. Knowledge-based comparative analysis, graph search algorithms, and constraint-based models are alternative approaches used in order to infer meaningful pathways in metabolic networks, even if there is missing information on the enzymes.
Several metabolic databases with rich information are available: one of the most comprehensive is MetaCyc and its associated BioCyc collection of pathway/genome databases (5); similarly, KEGG is a database resource that integrates genomics, chemical and systemic functional information (6); BRENDA is another database that contains one of the most complete collections of enzyme functional data (7). Gaps or incomplete knowledge, however, are still present in many cases, especially when looking for novel ways to synthesize compounds. In this regard, computational approaches can provide new alternatives by predicting putative heterologous pathways producing the target compound. In order to successfully handle this challenging task, the design process needs to be rationalized by 3 following the principles of synthetic biology: modeling of the biological system of interest, modular design through standardization, goal-oriented optimization, and experimental validation. To contribute to this endeavor, we present here a retrosynthetic design approach that aims to provide a streamlined methodology for addressing the general problem of obtaining successful high-yield production of target compounds in cell factories.

The retrosynthetic framework for heterologous pathway design
Retrosynthesis algorithms are applied to metabolic networks in order to perform a backwards search from the target compound to the host metabolites through the iterative application of a defined set of biochemical transformation rules. Depending on the level of atomic resolution of those rules, recruited enzymes in the biosynthetic pathways may involve novel compound intermediates and putative reactions with unknown efficiency. A successful expression of those genes in the chassis organism needs to be addressed. Therefore, subsequent optimization of the engineered strain through genetic, metabolic and enzyme design approaches would be usually necessary in order to maximize production yields of the target.
We present here a unified framework that combines several techniques involved in the design of heterologous biosynthetic pathways through a retrosynthetic biology approach, enabling by these means the flexible design of industrial microorganisms for the efficient on-demand production of chemical compounds of interest. The method for retrosynthetic design of heterologous pathways consists of the following steps: 1. First, the problem is defined by choosing the chassis organism and the target compound.
2. Second, an in silico reconstructed model of the organism containing at least the stoichiometric reactions involved in its metabolism is defined from biological databases and literature.
3. Third, the metabolic space is constructed from all known metabolic reactions and expanded to putative promiscuous reactions. 4. Fourth, heterologous pathways producing the target compound from endogenous metabolites in the chassis are enumerated using a retrosynthetic algorithm. 5. Fifth, gene sequences encoding heterologous enzymes are chosen in order to maximize gene expression and enzyme performance in the chassis organism.
6. Sixth, steady state fluxes for each pathway are estimated through flux balance analysis. 7. Seventh, toxicity of intermediate metabolites are estimated by using a QSAR model. 8. Eight, a cost function is defined for the pathway and the best pathways are chosen. 9. Ninth, selected pathways are implemented and their efficiency is verified.

Materials
The following list provides a review of the main metabolic engineering tools used at each step of the heterologous pathway design, including some specific tools for retrosynthesis design: 1. Choosing the chassis: a host organism optimized for metabolic engineering, such as strains from Escherichia coli, Bacillus subtilis or Saccharomyces cerivisiae.

2.
Selecting an in silico model of the chassis organism: from repositories of in silico model organisms like the databases BIGG (8) or BioModels (9).

Methods
The design methodology for any metabolic engineering application starts with the selection of a chassis organism along with an associated genome-scale in silico model of its metabolic network (see Note 1). The retrosynthetic approach offers as well the possibility of performing an additional preliminary modeling step to expand the starting metabolic reaction space and, thus, increasing the possibility of discovering novel biosynthetic routes. These prior steps will provide the designer with a detailed knowledge base about the metabolic system that can be used advantageously at later stages of the design. In the same fashion, target compounds need to be defined at this stage.
A basic methodology for retrosynthetic design of heterologous pathways will consist of the following steps ( Fig. 1): 1) choosing the chassis; 2) selecting an in silico model for the chassis; 3) definition of the metabolic space; 4) pathway enumeration; 5) gene selection; 6) estimation of yields; 7) toxicity prediction of pathway metabolites; 8) definition of an objective function to select the best pathway candidates; 9) pathway implementation and verification.

Choosing the chassis
An early decision that necessarily influences the rest of the retrosynthetic design process is the choice of the chassis organism where the desired compound will be produced. For example, in order to increase the production of a compound naturally produced in plants, its biosynthetic pathway, if known, is imported into an industrial chassis organism. Factors that need to be considered when choosing the chassis include the following: 1. The extent and level of curation of the organism's metabolic pathways in databases.
2. The availability of a genome-wide reconstructed in silico model that has been experimentally verified (16), and that is ready to be used in constraint-based modeling to quantitatively estimate steady-state fluxes (see Note 2).
3. The availability of information about toxicity effects of heterologous metabolite intermediates in the organism. 4. The fact that biosynthetic pathways may involve large enzymatic complexes (such as polyketide synthases or non-ribosomal synthases for secondary metabolite synthesis) (17). 5. Similarly, specific redox reactions catalyzed by the CYP450, which is often needed in the last steps of metabolite synthesis, add another layer of complexity because of the difficulty to model these reactions and often a need to optimize further its catalytic activity through protein engineering (18).

Selecting an in silico model for the chassis
In silico organisms models are currently available for many industrial strains, including strains evolved for efficient production in Escherichia coli, Saccharomyces cerevisiae or Bacillus Subtilis (19-21). Most

Definition of the retrosynthetic metabolic space
The power of retrosynthesis for heterologous pathway design resides in the way representations of chemical biotransformations can provide a generalization of important chemical features. The most valuable information (but also the most challenging) obtained from retrosynthesis analysis is the identification of putative metabolic pathways involving promiscuous biochemical transformations that often had not yet been well annotated. Several techniques have been proposed in order to expand the metabolic reaction space to contain such putative reactions. The basic idea is to use a set of reaction rules from which not only known reactions can be generated but also novel reactions. To define reaction rules, several representations have been used, such as those derived from bond-electron matrices (25) (see Besides this native reaction, however, 4CL is reportedly able to catalyze promiscuously, 9 among others, reactions producing caffeoyl-CoA from caffeate, feruloyl-CoA from ferulate, sinapoyl-CoA from sinapate, cinnamoyl-CoA from cinnamic acid (27). Interestingly, substrate cinnamic acid is transformed into cinnamoyl-CoA, which may also serve as a precursor for the production of resveratrol. As shown in Fig. 3, both reactions can be derived from a single reaction rule, since the net balance of bonds that are formed and broken is identical for both reactions.

Enumerating heterologous pathways
The metabolic space that has been defined in the previous step consists of both endogenous and heterogeneous reactions. In order to produce exogenous compounds, the corresponding metabolic routes containing heterologous enzymes that start from endogenous metabolites must be found (see Note 4). Two methods can be applied in order to list all possible pathways leading to the target compound (10) (see Note 5): steady state and topological methods. 1. Input: any metabolite that can be produced in the chassis organism; 2. Output: any metabolite that is produced and is not further consumed by the heterologous network; 3. Stoichiometric matrix: it is given by the reactions that are heterologous to the chassis organism.
Special care needs to be taken with those endogenous metabolites that are also produced in the heterologous network (usually co-factors and currency metabolites such as ATP, NAD and protons, which participate in a large number of reactions (29), but also by-products of the biosynthesis), since they will appear in the system defined above as both inputs and outputs, generating thus elementary modes containing loops. An easy way to prevent the enumeration of these return loops is by replacing endogenous products by a generic end node sink (see

Note 6).
Under this set up, each elementary mode will correspond to a pathway that produces heterologous compounds. Because the number of pathways needs to be kept minimal, all pathways of interest producing a target compound should be contained in the elementary modes. In addition, elementary modes containing loops should not be considered as pathway candidates. Several software packages are available that compute the elementary modes of a given metabolic network. A popular implementation is Metatool (28).
Pathway enumeration through the topological approach. In the topological approach, each reaction is represented by a node of a hypergraph that is connected through hyperedges to the substrates (see Note 7). The strategy used in the hypergraph approach for pathway enumeration is the application of a recursive backward algorithm that traverses the network starting from the target in order to search for all possible pathways connecting the target to the source (see Note 8).
The main advantage of this approach is computational efficiency. Another remarkable feature of the topological approach is that it allows for supplements and bootstraps molecules identification. Supplements are compounds that provide new biosynthetic pathways if added to the medium because they act as precursors of the target compounds. Bootstraps are 11 compounds that are needed to be present in the medium at least in small amounts in order to allow the reactions in the pathway to start producing the target compound. A software tool for enumerating pathways using the topological approach is Metahype (10).
Example of pathway enumeration of resveratrol producing pathways in E. coli. Starting from precursors in E. coli, five alternative viable pathways producing resveratrol are identified by the retrosynthetic approach (shown in Fig. 4). One of the pathways consists of three enzymatic steps, while the rest contain four enzymes. The question that is investigated about these pathways in the next sections is how to prioritize them depending on their expected performance, as a preliminary step before selecting the ones that would eventually be implemented.

Gene compatibility for expression in the host
Heterologous enzymes of the pathway need to be successfully expressed. Gene compatibility with the expression host is crucial, although it remains still a challenging task. Facilities proposing gene synthesis with codon optimization (30) have been developed in the past few years and are often used for metabolic engineering. In order to select the gene, several strategies are possible: 1. Rare codons, and GC content are known parameters that can influence gene expression, as well as RNA secondary structure (30). In addition, other parameters such as sequence length or hydrophobicity might also influence a successful gene expression.

Homology search of heterologous genes: A blast search of the National Center of
Biotechnology Information (NCBI) nucleotide data bank can identify sequences predicted to encode the enzyme having the desired activity. Phylogenetic trees can be built to identify groups of the different enzymes identified (31). Minimizing the phylogenetic distance between the chassis organism and the organism where the gene is endogenous can also help in order to choose the homologue enzyme.
3. Scoring gene compatibility: An adequate strategy for gene sequence selection is to associate a score to each sequence, so that only sequences with top score are further considered. In a simple approach, the score can be built as a weighted sum of the considered factors, such as GC content or phylogenetic distance. Because of the multiplicity of factors that can influence enzyme expression, it might be difficult to blindly assign weighting priorities to each factor. One possible approach to address this issue is to build a statistical learning predictor based on techniques such as multilinear regression, support vector machines, or decision trees (32). The training set consists of the selected sequence properties with positive data formed by the list of enzyme sequences in the chassis, while the negative set has to be chosen as a significantly diverse selection of heterologous enzyme sequences (see Note 9).
Example of gene selection in resveratrol production. Besides its production from stilbene synthase (STS), production of resveratrol has also been observed as a cross-reaction from chalcone synthase (CHS, EC 2.3.1.74) (33). Interestingly, CHS is ubiquitous in plants and is also found in bacteria, while STS is only found in plant species that accumulate resveratrol and other related compounds (22). Selecting a prokaryotic CHS gene from an organism closer to E. coli could, thus, ease its successful expression. To that end, the retrosynthesis methodology can be applied in order to select best gene candidates expressing CHS enzyme, with the goal of converting it into an efficient resveratrol-producing factory. Table 1 provides the score of CHS genes as candidates for promiscuously producing resveratrol in E. coli.

Estimating yield and drains
The next step in pathway design corresponds to metabolic analysis of the enumerated pathways in order to get an estimation of growth and yield of the target product associated with each metabolic intervention. This step is typically performed for the steady state through flux balance analysis (FBA) (3). To that end, an in silico model of the metabolic network of the chassis organism, normally given in SBML representation, needs to be obtained from the literature or from databases such as BIGG. Several software packages for FBA like the COBRA toolbox (11) are available. The following steps should be followed to perform flux balance analysis of the heterologous pathways: 1. Consistency check between the in silico SBML model and the metabolic database: Because they contain only reactions reconstructed from high-throughput omics data, genome-wide reconstructed in silico models do not contain such level of detail in the pathways as the one in metabolic databases like MetaCyc or KEGG. Therefore, a minimum level of consistency needs to be guaranteed between the metabolic network model from databases described in Section 3.3 and the in silico SBML model for FBA, since it might happen otherwise that the heterologous pathway obtained from pathway enumeration in Section 3.4 is fully or partially disconnected from the chassis in the in silico model. Therefore, it is necessary to verify that the endogenous precursors in the pathway are present in the in silico model, either because they are being produced by some enzymatic reaction or by adding the ones missing to the medium.
2. Inserting the heterologous pathway and their corresponding transport and exchange processes: Next, the heterologous pathway has to be imported into the model by adding as many reactions as enzymatic steps are present in the pathway. The target product is exported out of the cell through the use of the relevant transport reactions (see Note 10).
In addition, any side product of the reaction steps, which is not being further degraded by any reaction has to be exported out of the cell. In silico models generally make distinction between the exchange reaction between the extracellular medium and the periplasmic space, and the net exchange reaction of the metabolite with the system (see   Table 2. Optimal flux for biomass in wild type strain is 0.737 (a.u.). A similar value is obtained in the strain that has been engineered with Pathway 1, while the maximum yield for resveratrol is 1.951. The strain with Pathway 2 shows an increase in biomass (1.111), indicating that some of the by-products can be used to increase growth.
Maximum production of resveratrol (3.589) is significantly increased in this pathway.
Pathway 3 and 4 correspond to biomass and resveratrol yields that are in between the maximum values obtained by Pathway 1 and 2. Finally, Pathway 5 is the one that yields the maximum production of resveratrol (3.951).

Toxicity effects of heterologous pathways
In metabolic engineering, the importance of compound toxicity has been pointed it out by several authors (35,36). Indeed, for pathway performance the less toxic molecule is usually desired, while conversely the highly toxic molecule is wanted when producing therapeutics such as antimicrobials. In both cases, detailed information on compound toxicity is important in the design of metabolic pathways and toxicity is a parameter that should be included in the computer-aided pathway design framework (15). In order to establish a reference database of toxicity data in the chassis organism, a library of MIC (minimal inhibitory concentration) or    Fig. 4, is the most toxic compound in the list.

Defining a cost function for the pathways
As presented in previous sections, several aspects are to be considered when estimating the cost of pathway insertion. A quantitative definition of a cost function associated with the pathway provides the possibility of ranking enumerated pathways, so that top pathways can be selected by the designer for implementation. The process is as follows: 1. A simplified scheme consists on dividing the effects of pathway insertion into three main factors: enzyme compatibility and reaction efficiency (Section 3.3 and Section 3.5), expected yield (Section 3.6), and metabolite toxicity (Section 3.7). A possible definition of the pathway cost function is as follows (15): where ν c (r) is the flux for pathway r producing compound c, as described in Section 3.6, K(S(r)) is the cost associated with sequence S of the enzyme catalyzing the reaction r in the pathway r, and T(p) is the toxicity (-log 10 (IC 50 )) associated with the metabolite p product of reaction r, as defined in Section 3.7.
2. The cost for the sequence K(S(r)) has to take into account the fact of whether reaction is found annotated in databases for the given sequence or it is found based on a prediction as described in Section 3.3. In the case of a putative reaction, a penalty is added to the cost as follows: where ! "(S) is the compatibility for sequence S as defined in Section 3.5, and ! " pred (r) is defined in order to assign an additional cost to those enzyme sequences S where the 19 corresponding reaction r is either catalyzed as a promiscuous or side reaction or its assignment is only putative. Therefore: ! " pred (r) = " penalty promiscuous/predicted 0 annotated with penalty constant G penalty arbitrarily set to a value that is a upper bound for the score of sequence compatibility:  Table 4. The total cost associated with each pathway, according to Equation 2 and to the weighting parameters is shown in Table 5. From these results, Pathway 2 appears finally as the best candidate pathway to engineer in E. coli for the production of resveratrol, a result that is due both to the fact that the pathway contains less putative enzymatic steps and to the higher expected yield. Pathway 1 appears as the second ranked pathway because even if it involves only three enzymes in comparison with the four enzymes from the other pathways, it contains two predicted or putative reactions (cinnamic acid production from PAL and again STS).
Pathway 3, which is similar to Pathway 2 except for the first step (2-enoate reductase), predicted as a promiscuous reaction, appears next in the ranking. Finally, Pathways 5 and 4, containing three out of four predicted reactions (promiscuous activity predicted for 4CL and CH4, and STS), appear at the end of the ranking.

Pathway implementation
Validation of the retrosynthetic design is performed through pathway implementation of the top ranked heterologous pathways. The process consists typically of the following steps ( Fig.   1): 1. Gene amplification: Genes can be amplified from genomic DNA, when available, or synthesized. The ability to synthesize genes in whole novel genetic pathways is now routinely used for metabolic engineering. Software and websites to facilitate the execution of oligonucleotide assembly into long custom sequences are available (39).
Chemical synthesis allows synthesizing oligonucleotides of up to 120-150 nucleotides in length. Numerous methods have been developed to assemble relatively short synthetic oligonucleotides into longer gene sequences through ligation or PCRmediated assembly. Also, codon usage varies by organism and has implications for heterologous expression of proteins. Codon optimization, which consists to render a nucleotide sequence with suitable codon usage for the expression host, might also help the gene expression but will not necessarily maximize the protein expression level. Following this step, preparative methods such as HPLC coupled with spectrometry are usually used to quantify the metabolite production. For example, metabolite can be specifically separated on a C18 column with a determined acetonitrile/water gradient.
A large number of purification methods exist and need to be optimized for each compound. thermodynamics, kinetics as well as other information is increasingly becoming available in these models and will bring in the future the design to finer levels of detail.

Notes
2. There is a basic difference between the information that is required in the model in order to design heterologous metabolic pathways and to estimate steady-state fluxes. In the former case, the most essential information is the knowledge about the metabolites that are endogenous to the organism and therefore can be used as precursors in the heterologous pathway. In the latter case, the accuracy of the stoichiometric relationship between those reactions that directly influence the pathway is required, while partial knowledge about upstream reactions with low influence into the pathway can be tolerated.
3. In bond-electron matrices (BEM), each row and column correspond to one atom of the compound, and each entry is the order of the covalent bond between the atoms. The BEM of a reaction is defined as the difference between the end (right) and begin (left) BEMs.
4. The reason why we need to enumerate all pathways instead of searching for the shortest one is because not always the shortest is the best in terms of the cost associated to the pathway, as described in Section 3.8.
6. The basic limitation of the elementary modes approach is its computational complexity, which can make slow the computation in case of large heterologous networks.
7. The hypergraph representation is used in order to require all substrates to be present for the reaction to be activated.
8. The hypergraph approach can lead to solutions that are not stoichiometrically balanced, since stoichiometry is not taken into account. These solutions need to be filtered out from the output of the algorithm. 9. A detailed description of such type of predictor can be found in (15) as well as in the patented method from DNA 2.0 (30). 10. As in silico models become more detailed, higher attention needs to be paid in order to describe accurately the process inside the cell. Cell compartmentalization, for instance, might imply the need for enzyme co-localization in order for the pathway to proceed. This aspect is especially relevant for plant metabolism. Similarly, exporting the metabolite out of the cell might involve several transport processes through different cell compartments (11).
11. The flux balance might require for metabolites to be taken outside of the extracellular medium in order to make the net flux zero, avoiding accumulation.
12. Several clustering methods can be applied, depending first on the type of molecular descriptors used to define molecular similarity and on the clustering algorithm. Hierarchical agglomerative clustering should be preferred, since it allows building libraries of variable size. if the reaction appears as a putative promiscuous reaction. Next, steady state yields are estimated, as well as toxicity of reaction products. These values are combined in order to score and select the best pathway(s), which are finally implemented for verification.     content, probability to be expressed as inclusion bodies, isoelectric point (pI), hydrophobicity, secondary structure distribution, and distance to prokaryotes. Table 2. Optimal steady state fluxes for the five pathways in Fig. 3 maximizing biomass (first column) or the production of resveratrol (second column). Table 3. Predicted toxicity in E. coli of metabolite intermediates of the resveratrol pathways in Fig. 3. Table 4. Gene costs K(S(r)) associated with each gene in the pathway in Fig. 3.    Table 5