Extended metabolic space modeling

Determining the fraction of the chemical space that can be processed in vivo by using natural and synthetic biology devices is crucial for the development of advanced synthetic biology devicesapplications. The extended metabolic space is a coding system based on molecular signatures that enables the derivation of reaction rules for metabolic reactions and the enumeration of all possible substrates and products corresponding to the rules. The extended metabolic space expands capabilities for controlling the production, processing, sensing, and the release of speciﬁc molecules in chassis organisms


Introduction
The set of chemical compounds that organisms can process and synthesize is finite. Such a finite set, however, is not fully known yet. Based on a model that accounts for versatility of enzymatic reactions, we describe here a computational protocol to estimate the extent of such a full metabolic space. The extended metabolic space can be screened to list any possible biological circuit that can be conceived, such as the ones that are used to produce, detect, and process chemicals.
To fully exploit the metabolic space, an essential requirement is having a thorough knowledge of the metabolome associated with any given organism. However, experimental evidences from metabolomics analyses often show that with currently known metabolites one cannot cover the ranges of masses found in actual samples, and consequently there is an impelling need of completing the metabolomes and reactomes of interest for metabolic design [1,2 ]. Furthermore, the metabolic phenotype of an organism may vary upon different conditions such as during different growth states leading to variations in the metabolite profile [3]. Besides such sources of uncertainty in samples, many unassigned peaks should be due to promiscuous activities of enzymes not yet characterized because of the lack of an appropriate description of the mechanisms of enzyme promiscuity.
Our group has addressed the issue of complexity by proposing a tradeoff solution based on molecular signatures [4]. Our molecular signature codes for changes in atom bonding environments where the reaction is taking place. The advantage of the signature method is that the reaction rules describe the changes in the environments of the atoms belonging to the catalytic center of the reactions, and the size of the environment (named diameter) can be tuned to control the combinatorial explosion of possible compounds. Moreover, reaction signatures are robust to unbalanced reactions and can be created automatically without the need of any atom-atom mapping. The signature representation has shown itself to be specially well suited for modeling the mechanisms of enzyme promiscuity [5], paving by these means the way toward engineering innovation in metabolic networks. Either through directed evolution [6] or random selection [7], latent capabilities present in enzymes as modeled by the extended metabolic space can be potentially enhanced to optimize the desired activity and eventually implemented as a biological part containing a metabolic circuit.
Here, we describe the necessary steps to generate an extended metabolic space and how to compute all viable routes within the extended space that determine a viable pathway connecting a desired target to the chassis organism ( Fig. 1).

Fig. 1
Steps involved in the construction of the extended metabolic space. The first step consists of converting compounds and reactions into molecular signatures. The second step enumerates new products by an iterative algorithm applied to the reaction signatures. The third step consists of choosing a target, i.e., a reaction or a compound, and a chassis organism. The fourth step determines the metabolic scope linking the chassis to the target. Finally, the fifth step enumerates all viable pathways connecting the chassis to the target 2. Materials Materials for the described computational protocols consist basically of datasets obtained from public databases and processing software.
-A metabolic database of reference covering chemical structures and reactions. Metanetx [8] is a consensus database that reconciliates multiple databases.
-Models of metabolism for chassis organisms. Biomodels [9] and BiGG [2], among others, are databases containing genome-scale models for most commonly used organisms.
-Software to compute molecular signatures, which are a specialized type of topological chemical descriptors. MolsigMolSig [4], among others, is an open-source package that provides such capabilities.
-Computation of elementary modes. Efmtool [10] provides both a Java and matlabMATLABbased efficient implementations.
-Software for chemical manipulation. Some of the most popular implementations are RDKit, Marvin, CDK, KNIME ( Table 1). Table 1 A selection of software tools for modeling in the extended metabolic space

Name Keyword Comment
Stand-alone software

Cytoscape Graph visualization
Cytoscape can be used to manually explore and visualize the EMS [11] efmtool Elementary flux Computation of elementary flux modes [10] KNIME Workflow Knime propose to create automatic processes ("workflow") through a drag-n-drop interface of small tasks ("node"). It is useful for reproducibility of data analysis [12] MarvinSketch Chemical editor

Computation of Molecular Signatures
The first step to generate an extended metabolic space is to encode all compounds of a metabolic database in a format that will allow the subsequent encoding of enzymatic reactions. We propose here to showcase the important steps that should be kept in mind through the use of one of the available encoding methods, the molecular signature [4] (see Note 1).
2. Check compounds for incomplete structural data. Some compounds can be defined with incomplete Markush structure or wildcard atoms. Those compounds typically stand to define classes of compounds (e.g., "an alcohol") and should be removed since they cannot be interpreted through the molecular signatures algorithm used in this protocol.
3. Standardize compounds. Molecular signatures encode directly molecular graphs from a MDL MolFile input. Users must ensure that compounds (resp., chemical groups) that should be considered identical have the same molecular graph (resp., subgraph) (see Note 3).
(a) Neutralize or remove charges. As much as possible, chemical groups should be represented with the same protonation state to prevent different tautomeric forms. One can either use heuristics to add or remove hydrogen when necessary or simply remove all charges from the compound dataset.
(b) Choose one conjugated form by compound. This is particularly important for aromatic compounds, which could appear under different kekulé forms in the database. A good solution is to explicitly use aromatic bonds in the molecular graph description.
(c) Use a consistent hydrogen representation, either implicit or explicit.
4. To compute the signature of a chemical compound, we need initially to consider its molecular graph. Let G(V, E) be the molecular graph associated with some chemical compound C and let a V (b E) be an atom (bond) of G. The atomic signature of atom a of diameter d, σ(a), is a canonical representation of the subgraph of G spanned by its vertices at a maximum distance of d/2 from a. From a chemical point of view, this corresponds to a circular fragment of the compound centered on a.
5. The molecular signature of a molecular graph G of diameter d associated with C, σ(G), is defined as the list of all atomic signatures of diameter d (one by atom). Therefore, a molecular signature is a list of overlapping molecular fragments.
6. Depending on the diameter d, a molecular fingerprint can show degeneracy, i.e., a same molecular signature can represent more than one molecular graph G, much like a chemical formula can correspond to several compounds.
7. Based on previous definitions, the computation of the molecular signature involves two steps: (a) Choose a diameter to encode enzymatic promiscuity. To some extent, enzymes have the ability to process additional reactants that are structurally similar to the known ones. In a context where it is important to maximize the number of reactions to get more leads, modeling promiscuity can reveal itself to be a critical feature (see Note 4). We recommend starting with a diameter of 12 and going lower (down to 4) if no satisfying solution can be found.

Thus, σ(R)
is the difference in terms of atomic signatures (i.e., molecular fragments) occurring during a reaction; created (resp. consumed or needed) fragments being positives (resp. negatives). In this context, the diameter d corresponds to the reacting moieties and their neighboring atoms (the environment), hence the possibility to tune the degree of the enzymatic promiscuity hypothesis by increasing or decreasing (Fig. 2).

Products Enumeration
Once reactions have been encoded into reaction signatures, they can be applied to compounds to predict potential products under the enzymatic promiscuity hypothesis. 4. Being able to model enzymatic promiscuity assumes that reaction signatures can be used with other substrates than the ones in the native reaction. In turn, alternative substrates produce new products. Those compounds may be absent from the metabolic space, i.e., the set of known metabolites. Therefore, reaction signatures extend the metabolic space by linking potentially new compounds to the metabolism (see Note 9).

Chassis Modeling in the Extended Metabolic Space
In the previous sections we have described the protocol that allows extending the metabolic space. When the extension is applied to a metabolic network consisting of all known metabolic reactions, we arrive at the full description of all available metabolic capabilities. Some of these capabilities are going to be common to several groups of organisms, such as reactions in the central metabolism, while others like secondary metabolism will be specific to some groups. In applications such a biotechnology, the organism that is engineered is known as the chassis organism and often the objective will be to expand the natural capabilities of the chassis by introducing heterologous enzymes. In this section, we will describe how to model the chassis organism as a subset of the extended metabolic space.
1. The extended metabolic space of diameter d, denoted by M , represents all the possible compounds C and allowed transformations (reactions) R between compound as spanned by the enumerated reactions computed by following the method described. 3. The list of nominal metabolic reactions for a given organism can be compiled from databases such as KEGG [14], MetaCyc [15], BiGG [2], BRENDA [16], etc. The choice of one database over the others depends on several factors: (c) The way the model is going to be analyzed, i.e., network analysis, steady-state

Computing the Scope
The next step in modeling in the extended metabolic space is to have an understanding of the design space for a given target metabolic activity. In other words, we want to compute the metabolic scope connecting some target reaction to the chassis. To that end, we provide in this section some relevant definitions and a two-step procedure that allows the determination of the metabolic scope.

1.
A minimal pathway is defined as any set of reactions connecting the chassis to the target that are minimal: (a) They form a viable production pathway in terms of precursors availability.
(b) All reactions are essential, i.e., the removal of any reaction renders nonviable the pathway (see Note 12).
2. Based on that definition, the metabolic scope is defined as follows: given an initial set of source metabolites S (the chassis) and a final set of target metabolites T, the scope is the set of enzymes that are at least involved in one minimal pathway connecting elements of T to the source S, i.e., the scope should contain only enzymes that are at least essential for establishing one of the metabolic pathways. To compute the scope for a given compound, a two-step procedure can be applied, as described in the following.
3. Reduction of the extended metabolic space to the reachable space of reactions. It consists of the following steps: (a) A compound is defined as reachable if there exists a reachable reaction that can produce it, i.e., a reaction for which all substrates are available.
(b) Start from the set of initial compounds S and iteratively find newly reachable compounds.
(c) The process stops when no new reachable compounds are found.  (c) The recursion stops when initial compounds S are reached.

Enumerating Pathways
Once the extended metabolic scope has been determined, we should be interested in enumerating all viable metabolic pathways connecting the source to the target. This turns out to be a computationally complex problem that can be solved through several approaches [18]. We describe here a solution based on the computation of elementary flux modes [19] (see Note 13). EFMs are the set of minimal pathways that are nontrivial solutions to the steady-state equation whose combination can describe any possible path in the network (see Note 14).
1. Define the augmented metabolic space formed by the union of the reactions in the chassis and in the scope (Fig. 3a).
2. Construct a stoichiometric matrix S where each row corresponds to a compound and each column to a reaction of the previous augmented metabolic spaces and the value of each cell is the stoichiometric coefficient (Fig. 3b).

Remove all rows representing initial compounds (see Note 15).
4. Remove all rows representing compounds that are produced by a reaction but never used in any other.
5. Merge identical columns by deleting redundant columns and renaming the remaining column with the names of all reactions (see Note 16).
6. Add an additional column to create a flux out for the target compound.
7. Several toolboxes exist that allow efficiently computing the elementary modes (Fig. 3c). For instance, efmtool [10] provides an efficient implementation that can either run in matlabMATLAB or in Java.
8. Expand resulting elementary modes into the pathway solutions by enumerating all combinations of merged reactions in each elementary mode (Fig. 3d).

Fig. 3
Example of pathway enumeration in the extended metabolic space. Panel (a) shows the scope graph connecting compounds in the extended chassis (C , C , C , C , C , C ) to target compound C through reactions R , R , R , R and intermediate compounds C , C , C in the extended metabolic space (EMS). Panel (b) displays the equivalent stoichiometric matrix. Grayed columns and rows are discarded in the enumeration, as described in the enumeration protocol. Panel (c) shows the reduced matrix used for enumeration, containing an additional reaction T for the selected target compound. The enumeration algorithm found two elementary modes EM and EM . Panel (d) shows the resulting four pathways solution P -P after expansion of topological equivalent reactions. Pathways P and P involve three reactions, while pathways P and P involve two reactions

Design in the Extended Metabolic Space
We have described in previous sections step-by-step methods that generate extended metabolic spaces for (a) global metabolic capabilities; (b) chassis organisms; (c) organisms augmented with desired target activities. From here, resulting extended models can be used in multiple engineering biology applications, from production of chemicals to their sensing and regulation. Some of the main applications developed to date in extended metabolic spaces include the following: 1. Engineering of heterologous pathways for the production of a desired chemical in a chassis organism. To select enzyme sequences for each enzymatic step in the pathway for the most promising routes in the extended metabolic space, a pathway ranking function needs to be defined. The approach is described in detail in the retrosynthetic RetroPath protocol [20] and a demonstration of the application of such a protocol is shown in the XTMS web service [21].
2. Development of novel biosensors based of metabolic pathways. Metabolic pathways that transform a target compound into a detectable compound allow the expansion of the observable extended metabolic space [22]. Such an application has been demonstrated through the SensiPath web service [23]. 4. Notes 1. Molecular signatures are an efficient and intuitive way to model metabolites. They are similar to the well-known Extended Connectivity FingerPrint (ECFP) topological fingerprint, which summarizes compounds in lists of circular molecular fragments.
2. Chemical structures and reactions can be found in multiple formats. Reactions are often defined in a database-specific flat-file where reactants are referenced by their compound identifier. Most of the time, you will find a file in MDL SDF or MOL format binding the compound identifiers to their respective structures. Other interchangeable formats are usually available such as SMILES and InChI. Inter-conversion between formats using standard software such as Open Babel [24] yields to equivalent representations of the compound. A sanity check can help to ensure that they all refer to the same compound. This will eventually filter out wrong annotations.
3. Before being converted into molecular signatures, molecular graphs do not need to represent chemically valid compounds in terms of valence, charges, etc. The important point is that compounds (moieties) that should be considered identical according to the final application share the same molecular graph (subgraph). Of course, those simplifications introduced at the compound encoding step must be kept in mind while interpreting the results.
4. Putative enzymes promiscuity can be modeled through molecular signatures given an appropriate diameter. Obviously, as we lower the diameter, the stronger is the promiscuity hypothesis and the riskier are the predictions.
5. Molecular signatures can take into account stereo-chemistry, which is particularly appealing when working with enzymes. Nonetheless, if stereo information is considered, it is important to ensure that it is available (and valid) for most of the compounds; otherwise, compounds with and without stereo information will be perceived differently through signatures.
6. Metabolic databases contain generally a substantial portion of reactions that are not stoichiometrically balanced. Reaction signatures can be computed for reactions that do not need strictly balanced input reaction. Nonetheless, working with balanced reactions is always recommended and is a sign of a well-curated database.
7. This mathematical expression simply states that the reaction signature is the set formed by the difference between product signatures and reactions signatures. Intuitively, it can be understood as the chemical groups that are transferred or transformed through the reaction.
8. Multi-substrate reactions are difficult to handle with the proposed equation. Indeed, testing all compounds with a reaction would take N tries, where N is the total number of compounds in the database and m the number of substrates anticipated for that reaction. A more practical option is to allow promiscuity for only one substrate at a time, therefore limiting the number of trials toto N m. A complementary approach is to allow promiscuity only for non-cofactors compounds.
9. This feature is particularly desirable to untap enzymes full potential in metabolic engineering applications since it can find an unexpected synthesis route.
10. There is a basic difference between the information that is required in the model to design heterologous metabolic pathways and to estimate steady-state fluxes. In the former case, the most essential information is the knowledge about the metabolites that are endogenous to the organism and therefore can be used as precursors in the heterologous pathway. In the latter case, the accuracy of the stoichiometric relationship between those reactions that directly influence the pathway is required, while partial knowledge about upstream reactions with low influence into the pathway can be tolerated.
11. The extended metabolic space of the model of an organism provides useful information to discover previously unidentified routes and to fill gaps in present models.
12. Pathway minimality is a heuristic condition based on reducing metabolic burden in the cell (a pathway with a less number of enzymes should be more tolerated by the cell because it potentially imposes less stress).
13. Metabolic networks are formally modeled as hypergraphs for pathway enumeration. Basically, the availability of each substrate is required in the reaction to produce the product. That creates some level of complexity higher than in the classical graph pathway enumeration algorithm. Moreover, standard graph approaches do not consider stoichiometry. The stoichiometric approach, in turn, based on linear algebraic decomposition provides an easier analytic approach.
14. Pathway enumeration based on elementary flux modes can become computationally intractable for highly connected networks such as central metabolism. However, in cases where we want to produce some heterologous compound in a chassis organism, pathways are generally almost linear and the elementary flux mode enumeration remains tractable. The enumeration of elementary flux modes can be also expressed as a dual problem using minimal cut sets. 15. We remove all the initial compounds in the chassis, as we already know that they are available. Products of reactions in the scope consuming the initial compounds will be kept for the enumeration. 16. Identical columns represent routes that are topologically equivalent. In order to make the enumeration algorithm more efficient, we remove duplicated columns. However, for the final enumeration we should list each topologically equivalent reaction as an alternative pathway.