Iterated local search with partition crossover for computational protein design

Structure‐based computational protein design (CPD) refers to the problem of finding a sequence of amino acids which folds into a specific desired protein structure, and possibly fulfills some targeted biochemical properties. Recent studies point out the particularly rugged CPD energy landscape, suggesting that local search optimization methods should be designed and tuned to easily escape local minima attraction basins. In this article, we analyze the performance and search dynamics of an iterated local search (ILS) algorithm enhanced with partition crossover. Our algorithm, PILS, quickly finds local minima and escapes their basins of attraction by solution perturbation. Additionally, the partition crossover operator exploits the structure of the residue interaction graph in order to efficiently mix solutions and find new unexplored basins. Our results on a benchmark of 30 proteins of various topology and size show that PILS consistently finds lower energy solutions compared to Rosetta fixbb and a classic ILS, and that the corresponding sequences are mostly closer to the native.


| INTRODUCTION
Proteins are responsible for a wide range of vital functions in all living organisms, such as cell signaling, transport, regulation, defense against pathogens and catalysis of various chemical reactions. By exploiting the relationship between the sequence of amino acids of a protein, its three-dimensional structure, and its function, it is possible to engineer new proteins for various applications in health, environment, and bionanotechnologies. [1][2][3][4][5] The need for efficient computational protein design (CPD) methods emerged from the fact that it is impossible to experimentally test all possible protein sequences corresponding to a target protein structure. CPD, therefore, aims at finding a sequence of amino acids that fold into a target three-dimensional protein structure using purely in silico methods. It can be formalized as a combinatorial optimization problem where variables are amino acid conformations at each sequence position and where an energy function capturing interactions between amino acids within a target three-dimensional structure is to be minimized. 6 In the most common representation, the energy function is pairwise decomposable, the variables take their values in a discrete set of preferred amino acid side chain nature and orientations, and the backbone of the target structure is fixed. Under these assumptions, the CPD problem has been proven to be NPhard. 7 For this reason, most CPD methods rely on stochastic optimization. For example, the widely used molecular modeling suite Rosetta relies on a simulated annealing algorithm. 8 Other existing methods rely on evolutionary algorithms such as genetic algorithms 9 or estimation of distribution algorithms. 10 Along with local search methods, exact and deterministic methods that can provably identify the global minimum of the energy function (global minimum energy conformation, GMEC) also exist. 11 The state-of-the-art here relies on Cost Function Networks algorithms, 12,13 and has recently been extended to optimize sequences for one or several protein states at the same time. 14 It is available as open source software under the name of "POMP d ." In addition to providing access to the protein sequence of lowest energy, POMP d , which relies on the constraint programming solver ToulBar2, 15 is able to exhaustively enumerate all protein sequences within a threshold to the global minimum. Using this ability, a recent fitness landscape analysis around the optimum of CPD problems showed that the structure of CPD problems can prevent simulated annealing from approaching the GMEC. 16 Indeed, the number and depth of local minima on CPD problems requires simulated annealing methods to accept several unfavorable local moves in a row in order to escape local minima. As a result, the search often gets stuck with infinitesimal chances of finding the GMEC. Besides this, it has also been shown that the gap between the best solutions found with simulated annealing and the GMEC increases with the length of the target proteins. It is thus crucial to develop new local search methods that could shrink this gap once POMP d hits the complexity barrier of NP-hardness and cannot provide the GMEC anymore. A careful analysis of CPD fitness landscapes put in evidence that local search methods with high exploration abilities would be more suitable.
Here, we study the performance of iterated local search algorithms (ILS). 17 ILS is an iterated process of steepest descent and random solution perturbations. The steepest descent ensures to reach a local minimum, and the perturbation is used to jump off its attraction basin with the hope of reaching a lower local minimum after the next steepest descent. ILS algorithms can be augmented with a so-called partition crossover, which can mix two solutions together by taking advantage of the problem structure, in order to reach better local minima. This kind of algorithms, mixing iterated local search and partition crossover, has already demonstrated good performance for pseudo-Boolean optimization. 18 In this article, we show the benefits of using explorative local search methods on CPD problems. We generalize partition crossover to combinatorial optimization, combine it with an ILS in a new algorithm that we named PILS, and compare the performances with Rosetta's simulated annealing algorithm as implemented in the fixbb protocol and a classical ILS on a benchmark of 30 protein targets.

| Definition
Under the assumption that the protein backbone is fixed, the CPD  (1), where x X represents a solution of length ℓ from the sequence/conformation space X (which contains all possible rotamer assignments for a given protein structure), and where E i and E ij are, respectively, unary and binary energy terms whose values are function of rotamer assignments at position x i (for E i ) and at positions x i , x j (for E ij ). G is the undirected interaction graph, whose edges represent all interactions between pairs of residues.
In our experiments, we minimize the energy function beta_nov16, as provided by the Rosetta modeling software. 20

| Neighborhood
Local search algorithms rely on the notion of neighborhood. In CPD, we define a neighbor of a solution as an assignment that differs at one position. The neighborhood relation N is defined as: where d hamming is the Hamming distance defined over residues: the distance is 1 if only one rotamer differs between the two solutions.
The size of the neighborhood is then j N j¼ P ℓ i¼1 n i À 1 ð Þ¼ P ℓ i¼1 n i À ℓ, where i is a variable index, n i is the domain size of variable i (the number of available rotamers) and ℓ the number of variables.

| Local update
The time complexity to evaluate E x ð Þ can be reduced using incremen- Let δ i,v ð Þ x ð Þ be the difference of energy between the neighbors x and op i,v x ð Þ: Only a few terms from Equation (1) are modified in order to com- The time complexity of the incremental evaluation is then j l i j þ1 which is bounded by ℓ. This time complexity is linear instead of quadratic using Equation (1).
The time complexity can be further reduced using double incremental evaluation. Similar to second derivative computation, let be ð Þ x ð Þ can be rewritten as: The computation complexity of δ 2 However, according to the values of i and k, some simplifications reduce the complexity.
This case is the worst case. The time complexity is the same as This case is the best case, and if j l i j is bounded, it is the most common case. The complexity is 0. When i ≠ k, and k l i , then, In this case, the time complexity to compute δ 2 . j l i j and j l k j are bounded by ℓ À 1, the complexity of the update of all δ values is bounded by 4ℓ, which is a linear complexity.

| ILS
Iterated local search consists in building a sequence of locally optimal solutions by iteratively perturbing the current local minimum and applying a local search operator. 17 The ILS used in this work is given in Algorithm S1. The main components of the algorithm are the perturbation operator and the local search operator. A perturbation of strength k consists in randomly modifying the value of k variables in the solution.
The local search operator used is a steepest descent. This cycle of perturbations and local searches iterates while keeping track of the best solution until a limit number of solution evaluations is reached, or if no improvement occurs in a predetermined number of iterations.
The steepest descent algorithm used as local search operator is given in Algorithm S2. Double incremental evaluation is used to find the best neighbor at each iteration of the algorithm. A Binary Search Tree is used to select one of the best neighbor with complexity O logðjN jÞ ð Þ .

| PILS
The algorithm of PILS (see Algorithm 1) is built on the generic form of the ILS. It introduces a second perturbation operator: the partition crossover. The partition crossover exploits the structure of the variable interaction graph in order to efficiently mix two solutions. 21 A description of this crossover operator is presented in Algorithm S3.
The partition crossover operator takes two parent solutions as input.  each problem instance, results in terms of energy minimization and native sequence recovery rates, and provide some statistics on the search dynamics of PILS.

| Benchmark description and statistics
The PDB code and length of each protein in the benchmark, as well as main statistics extracted from the interaction graph associated with each protein are shown in Table S1.

| Energy minimization
PILS performs consistently better than ILS and Rosetta fixbb in terms of energy minimization. Table 1 Table 1), whereas ILS and Rosetta fixbb were respectively able to locate the GMEC on four and one instances. As pointed out in a previous study, the energy gap between sequences predicted with Rosetta fixbb and the GMEC increases with the size of the problem. 12 PILS closes this gap on almost all instances for which the GMEC could be found, showing that such enhanced iterative local search methods are preferable to simulated annealing for solving CPD instances when the size of the problems prevent global optimization methods from returning the global optimum. They could also speed up these global optimization methods by providing better initial upper bounds.

| Native sequence recovery
Solutions returned by PILS are closer to the native sequences on average in comparison with ILS and Rosetta fixbb. Native sequence recovery is a well-known and accepted measure for CPD in silico assessment. This test relies on the relationship between protein sequence and structure: the more two sequences are similar, the more they tend to fold into the same three-dimensional structure. Thus, if a computationally designed sequence is close to the native sequence of the target structure, it has good chances of adopting the correct fold.   to quickly identify good solutions. The curve flattens a bit faster in the case of 2erw, which is an easier target. In both cases, we observe that PILS converges to a minimum in less than 600 iterations.

| PILS search dynamics
We then looked at the average number of connected components, and the average number of successful crossovers on the same two protein targets (Figure 2, bottom). We consider a crossover as successful if it allows to reach a solution with a lower energy than that of its two parent solutions. We observe different behaviors depending on the protein target. The average number of connected components seems to be constant across iterations in both cases, but is smaller in the case of 2erw. As a consequence, there are few successful crossovers in the first iterations and no successful crossover happens after 50 iterations. In the case of 1z2u, the average number of connected components is higher and allows more successful crossovers.
Although the frequency decreases with the iterations, the crossover still has some impact near the end of the runs. 2erw is an easy target on which PILS and ILS could find the global optimum. On this target, the iterated local search on its own is sufficient to quickly find good solutions. The smaller number of connected components translates the fact that the search quickly focuses on the correct region in the search space. It leaves no room for solution diversity that is essential for connected components to appear. On the other side, 1z2u is the longest protein target, with the biggest neighborhood size (Table S1), and could be considered as the most difficult target in the benchmark.
In this case, the partition crossover in PILS helps finding better solu-

CONFLICT OF INTERESTS
You may be asked to provide a conflict of interest statement during the submission process. Please check the journal's author guidelines for details on what to include in this section. Please ensure you liaise with all co-authors to confirm agreement with the final statement.

PEER REVIEW
The peer review history for this article is available at https://publons. com/publon/10.1002/prot.26174.

DATA AVAILABILITY STATEMENT
The source code of PILS is freely available (https://forgemia.inra.fr/ david.simoncini/pils). Datasets will be provided upon request.