Extending Population-Based Incremental Learning to Continuous Search Spaces

Abstract. An alternative to Darwinian-like artificial evolution is offered by Population-Based Incremental Learning (PBIL): this algorithm memorizes the best past individuals and uses this memory as a distribution to generate the next population from scratch. This paper extends PBIL from boolean to continuous search spaces. A Gaussian model is used for the distribution of the population. The center of this model is constructed as in boolean PBIL. Several ways of defining and adjusting the variance of the model are investigated. The approach is validated on several large-sized problems.


Introduction
Evolutionary algorithms (EAs) [13,6,5] are mostly used to find the optima of some fitness function $F$ defined on a search space $\Omega$, $F : \Omega \to \mathbb{R}$. From a machine learning (ML) perspective [9], evolution is similar to learning by query: learning by query starts with a void hypothesis and gradually refines the current hypothesis by asking questions to some oracle. In ML, the sought hypothesis is the description of the target concept; the system generates examples and asks the oracle (the expert) whether these examples belong to the target concept. In EA, the sought "hypothesis" is the distribution of the optima of $F$; the system generates individuals and asks the oracle (a routine or the user) what their fitness is. In all cases, the system alternately generates questions (examples or individuals) depending on its current hypothesis, and refines this hypothesis depending on the oracle's answers.
One core difference between ML and evolution is that ML, in the artificial intelligence vein, manipulates a high-level, or intensional, description of the hypothesis sought. Conversely, evolution deals with a low-level, or extensional, description of the sought distribution: the distribution of the optima is represented by a collection of individuals (the current population).
The Population-Based Incremental Learning (PBIL) approach bridges the gap between ML and EAs: it explicitly constructs an intensional description of the optima of $F$, expressed as a distribution on $\Omega$ [2,3]. This distribution is alternately used to generate the current population, and updated from the best individuals of the current population. The advantage of the approach is that, as claimed throughout artificial intelligence [12], the higher the level of the information, the more explicit and simple the information processing can be. And indeed, PBIL involves far fewer parameters than even the canonical GAs [6].
PBIL was designed for binary search spaces. It actually constructs a distribution on $\Omega = \{0,1\}^N$, represented as an element of $[0,1]^N$. The basics of this scheme are first briefly recalled in order for this paper to be self-contained (section 2). Our goal here is to extend this scheme to a continuous search space $\Omega \subset \mathbb{R}^N$. Continuous PBIL, noted PBIL$_C$, evolves a Gaussian distribution on $\Omega$, noted $N(X, \sigma)$. The center $X$ of the distribution is evolved much like in the binary case; evolving the standard deviation $\sigma$ of this distribution is more critical, and several heuristics to this aim are proposed (section 3). PBIL$_C$ is finally validated and compared to evolution strategies on several large-sized problems (section 4). The paper ends with some perspectives for further research.
Let $\Pi$ denote a population of individuals in $\Omega = \{0,1\}^N$. An element $h$ of $H = [0,1]^N$ can be associated to $\Pi$, by defining $h_i$ as the fraction of individuals in $\Pi$ having their $i$-th bit set to 1. Conversely, an element $h$ in $H$ defines a distribution over $\Omega$: one draws an element $X = (X_1, \ldots, X_N)$ in $\Omega$ by setting $X_i$ to 1 with probability $h_i$.
PBIL relies on the following premises [2]: a) if evolution succeeds, the population converges toward a single optimum of $F$; b) the more converged the population $\Pi$, the better it is represented by $h$. Assuming these, PBIL discards all information in the population not contained in $h$: the population is simply considered as a manifestation of $h$. The attention is thus shifted from evolving $\Pi$ by means of mutation and recombination, to evolving $h$ (Fig. 1). To this aim, PBIL uses the information contained in the current population $\Pi^t$: $h$ is evolved, or rather updated, by relaxation from the best individual $X^{max}$ in $\Pi^t$:
$$h^{t+1} = (1-\alpha)\, h^t + \alpha\, X^{max}, \qquad \alpha \in\, ]0,1[$$
Distribution $h^t$ can be viewed as the memory of the best individuals generated by evolution. Relaxation factor $\alpha$ corresponds to the fading of the memory: the higher $\alpha$, the faster $h^t$ moves toward the current local optimum.
In contrast to standard evolution, PBIL explicitly explores the space $H$ of distributions on $\Omega$. And, as noted already, this higher-level representation allows for simpler information processing: besides the population size, PBIL involves a single key parameter, $\alpha$, to be compared to the various parameters controlling mutation and recombination. Further, the exploration is deterministic, in the sense that $h^t$ is deterministically updated from the current population.
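The loop described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation of [2]: the function name `pbil`, the maximization convention, the seed handling and the OneMax usage example (`fitness=sum`) are assumptions made here for the sake of a runnable example.

```python
import random


def pbil(fitness, n_bits, pop_size=50, alpha=0.01, generations=200, seed=0):
    """Minimal binary PBIL sketch (maximization is an assumption made here)."""
    rng = random.Random(seed)
    h = [0.5] * n_bits  # most general distribution: every bit is a fair coin
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Draw the whole population from scratch from the distribution h.
        population = [[1 if rng.random() < p else 0 for p in h]
                      for _ in range(pop_size)]
        x_max = max(population, key=fitness)
        f_max = fitness(x_max)
        if f_max > best_fit:
            best, best_fit = x_max, f_max
        # Relaxation: h <- (1 - alpha) * h + alpha * X_max
        h = [(1 - alpha) * p + alpha * b for p, b in zip(h, x_max)]
    return h, best, best_fit


# Hypothetical usage on OneMax (fitness = number of bits set to 1).
h, best, best_fit = pbil(sum, n_bits=20, alpha=0.1, generations=300)
```

Note that the only state carried from one generation to the next is `h`; the population itself is discarded, which is the point made in the text.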

Discussion
Let us reformulate PBIL as a learning-by-query algorithm, by defining a partial generality order on the set of distributions $H$. The generality of a distribution $h$ is clearly related to the diversity of the population generated from $h$, and the diversity of the population with regard to bit $i$ is inversely proportional to $|h_i - .5|$. Accordingly, a distribution $h$ is more specific than $h'$ if, for each bit $i$, either $0 \le h_i \le h'_i \le .5$, or $.5 \le h'_i \le h_i \le 1$.
PBIL initializes $h$ to the most general distribution $h^0 = (.5, \ldots, .5)$, and gradually specializes it along generations. Let $X^h$ denote the (boolean) individual most similar to $h^t$; then, $h^t$ is specialized on all bits $i$ such that $X^h_i = X^{max}_i$.
The complete convergence of the scheme is avoided as $h^t_i$ never reaches 0 or 1; in theory, PBIL can generate any individual at any time.
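The generality order can be made concrete with a short check. The sketch below is illustrative only: the names `more_specific` and `closest_individual` are introduced here, and ties at exactly .5 are resolved toward 0 by assumption.

```python
def more_specific(h, h_prime):
    """True if h is more specific than h_prime on every bit.

    Bit-wise condition from the text:
    0 <= h_i <= h'_i <= .5   or   .5 <= h'_i <= h_i <= 1.
    """
    return all((0.0 <= a <= b <= 0.5) or (0.5 <= b <= a <= 1.0)
               for a, b in zip(h, h_prime))


def closest_individual(h):
    """Boolean individual most similar to h (ties at .5 go to 0 by assumption)."""
    return [1 if p > 0.5 else 0 for p in h]


# One relaxation step specializes h exactly where X^h and X^max agree.
alpha = 0.1
h = [0.7, 0.3]
x_max = [1, 1]
new_h = [(1 - alpha) * p + alpha * b for p, b in zip(h, x_max)]
x_h = closest_individual(h)  # [1, 0]: agrees with x_max on bit 0 only
```

In this example bit 0 moves from .7 toward 1 (more specific), while bit 1 moves from .3 toward .5 (more general). Two distributions may also be incomparable, which is why the order is only partial.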
In practice, PBIL can suffer from premature convergence. This happens when $h^t$ gets too specific and no new good individual is discovered. PBIL offers two heuristics to resist premature convergence [2]. The first is to use the average of the two best individuals in $\Pi^t$, rather than the single best one; this way, $h^t$ is generalized on all bits discriminating these individuals. The second is to perturb $h^t$ with Gaussian noise: with a given probability (5%), a Gaussian variable with a low standard deviation is added to $h^t_i$. This way, the center of the distribution is durably perturbed, which helps escape from local minima.
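Both heuristics fit in a single update step, sketched below. Only the 5% perturbation probability comes from the text; the function name, the noise standard deviation of 0.02 and the clipping of $h$ to $[0, 1]$ are assumptions made for this illustration.

```python
import random


def update_with_heuristics(h, ranked, alpha=0.01, p_noise=0.05,
                           noise_std=0.02, rng=random):
    """One PBIL update using the two anti-premature-convergence heuristics.

    ranked: population sorted from best to worst (lists of 0/1).
    noise_std and the clipping to [0, 1] are assumptions, not from the source.
    """
    # Heuristic 1: relax toward the average of the two best individuals.
    target = [(a + b) / 2 for a, b in zip(ranked[0], ranked[1])]
    new_h = []
    for p, t in zip(h, target):
        p = (1 - alpha) * p + alpha * t
        # Heuristic 2: occasionally add low-variance Gaussian noise to h_i.
        if rng.random() < p_noise:
            p += rng.gauss(0.0, noise_std)
        new_h.append(min(1.0, max(0.0, p)))
    return new_h
```

On a bit where the two best individuals disagree, the target is .5, so the update pulls $h_i$ back toward the most general value, as stated in the text.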
A more fundamental limitation of PBIL comes from the distribution space, which implicitly assumes the linear separability of the problem (genes are considered independent). This distribution space appears too poor to fit complex fitness landscapes, such as the Long Path problem [7]. Previous experiments show that the distributions used in PBIL have difficulty covering the narrow path [14]. Recent extensions to PBIL have considered richer distribution spaces [4].

Continuous PBIL
This section first briefly discusses a previous attempt to extend PBIL to continuous search spaces, then details the proposed method and outlines PBIL$_C$.

Continuous PBIL with dichotomic distributions
To the best of our knowledge, the only extension of PBIL to continuous search spaces has been proposed in [15]. This algorithm explores the search space much like the delta-coding approach [17]. The domain $[a, b]$ of each gene is divided into two intervals ("low" and "high" values); the current distribution $h$ ($h$ in $[0,1]^N$) is used to determine which interval an individual belongs to: $P\left(X_i > \frac{a+b}{2}\right) = h_i$. $X_i$ is then drawn with uniform probability in the selected interval.
At each generation, $h$ is updated like in the boolean case, by memorizing whether the best individual takes low or high values for each gene. When $h_i$ gets specific enough ($h_i < .1$ or $h_i > .9$), the population gets concentrated in a single interval (resp. $[a, \frac{a+b}{2}]$ or $[\frac{a+b}{2}, b]$). The search is then focused: the domain of the gene is set to the interval considered and $h_i$ is reinitialized to $.5$.
In this scheme, evolution gradually focuses on the region most often containing the best individuals. One limitation is that a region which has been discarded at some point is hardly ever explored afterwards, and this violates the ergodicity requirement. Furthermore, the search might be insufficiently focused, given the poor (uniform) distribution used within the selected interval.
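The per-gene mechanics of this dichotomic scheme can be sketched as follows. This is an illustration of the description above, not the algorithm of [15]: the function names are hypothetical, and $h_i$ is read as the probability of drawing from the "high" half of the current domain, which matches the thresholds (.1 and .9) and intervals given in the text.

```python
import random


def sample_gene(h_i, low, high, rng):
    """Draw one gene: pick the high half with probability h_i, then sample
    uniformly inside the selected half."""
    mid = (low + high) / 2
    if rng.random() < h_i:
        return rng.uniform(mid, high)
    return rng.uniform(low, mid)


def refocus(h_i, low, high, lo_thr=0.1, hi_thr=0.9):
    """Shrink the gene domain once h_i is specific enough, and reset h_i to .5."""
    mid = (low + high) / 2
    if h_i < lo_thr:
        return 0.5, low, mid
    if h_i > hi_thr:
        return 0.5, mid, high
    return h_i, low, high
```

Because `refocus` never widens the domain again, a discarded half stays out of reach, which is the ergodicity issue raised above.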

Continuous PBIL with Gaussian distributions
Our approach rather explores Gaussian distributions $N(X, \sigma)$ on the search space $\Omega$, given as products of Gaussian distributions $N(X_i, \sigma_i)$ on each gene domain. With no loss of generality, $\Omega$ is set to $[0,1]^N$ in the following.
Like PBIL, PBIL$_C$ starts with a rather general distribution; it then alternately uses this distribution to draw the population, and uses the population to update the distribution. The center of the distribution $X^t$ is initialized to the center of the search space $(.5, \ldots, .5)$. At each generation, $X^t$ is updated from a linear combination of the two best and the worst individuals in the current population, inspired from PBIL and Differential Evolution [16]:
$$X^{t+1} = (1-\alpha)\, X^t + \alpha\, (X^{best,1} + X^{best,2} - X^{worst})$$
The diversity of the population, which controls the convergence of evolution, depends on the variance $\sigma = (\sigma_1, \ldots, \sigma_N)$ of the distribution. Several heuristics have been investigated to adjust the parameters $\sigma_i$.
A. The simplest possibility is to use a constant value. The trade-off between exploration and exploitation is thus settled once and for all: the search cannot become too specific, and it cannot be sped up either.
B. A second possibility is to make evolution itself adjust $\sigma$. PBIL$_C$ here proceeds exactly as a self-adaptive $(1, \lambda)$-evolution strategy (ES), where $\lambda$ stands for the size of the population, except that the parent is replaced by the center $X^t$ of the distribution.
C. A third possibility is to adjust $\sigma$ depending on the diversity of the current best offspring; $\sigma^t$ is then set to the variance of the $K$ best current offspring:
$$\sigma_i^{t+1} = \sqrt{\frac{\sum_{j=1}^{K} (X_i^j - \bar{X}_i)^2}{K}}$$
where $\bar{X}$ denotes the average of the best $K$ offspring $X^1, \ldots, X^K$.
D. Last, $\sigma$ can be learned in the same way as $X$ itself, by memorizing the diversity of the $K$ best offspring:
$$\sigma_i^{t+1} = (1-\alpha)\, \sigma_i^t + \alpha \sqrt{\frac{\sum_{j=1}^{K} (X_i^j - \bar{X}_i)^2}{K}}$$
At first sight, PBIL$_C$ is quite similar to a $(1, \lambda)$-ES, the offspring being generated from the single parent $(X^t, \sigma^t)$. The difference is twofold.
In $(1, \lambda)$-ES, the parent is simply replaced by the best offspring, whereas PBIL$_C$ updates $X^t$ by relaxation. Let any offspring $X^k$ be written $X^t + Z^k$, with $Z^k$ being a random vector drawn according to $N(0, \sigma^t)$. It follows that:
$$X^{t+1} = (1-\alpha)\, X^t + \alpha\, (X^{best,1} + X^{best,2} - X^{worst}) = X^t + \alpha\, (Z^{best,1} + Z^{best,2} - Z^{worst})$$
The evolution of $X^t$ can be viewed as a particular case of weighted recombination as studied by Rudolph [11]; a theoretical analysis shows that weighted recombination with optimal weights should be preferred to the simple replacement of the parents. Interestingly, the heuristic recombination used in PBIL$_C$ is intermediate between two particular cases with good theoretical properties (for $F(X) = \sum_i X_i^2$): the half sum of the two best offspring, and the difference of the best and the worst offspring. PBIL$_C$ uses fixed, hence non-optimal, weights; but note that $\alpha$ intervenes as an additional scaling factor, controlling the variance of $X^t$.
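For readers who want the intermediate step, the substitution behind the rewriting of $X^{t+1}$ can be spelled out as below; it uses nothing beyond the definition $X^k = X^t + Z^k$.

```latex
\begin{align*}
X^{best,1} + X^{best,2} - X^{worst}
  &= (X^t + Z^{best,1}) + (X^t + Z^{best,2}) - (X^t + Z^{worst}) \\
  &= X^t + Z^{best,1} + Z^{best,2} - Z^{worst}, \\
X^{t+1}
  &= (1-\alpha)\, X^t + \alpha\, \bigl(X^t + Z^{best,1} + Z^{best,2} - Z^{worst}\bigr) \\
  &= X^t + \alpha\, \bigl(Z^{best,1} + Z^{best,2} - Z^{worst}\bigr).
\end{align*}
```

The recombination weights $(1, 1, -1)$ sum to one, which is why the three copies of $X^t$ collapse into a single one and the center only moves by $\alpha$ times a combination of the perturbations.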
Independently, the variance of $X^t$ is also controlled from $\sigma^t$. PBIL$_C$ uses global mechanisms (options A, B and D) to adjust $\sigma^t$, by opposition to the local adjustment of $\sigma$ achieved by self-adaptive mutation. (In self-adaptive ES, besides the $X_i$, an individual $X$ carries the variances $\sigma_i$ of the mutation to be applied on the $X_i$ [13, 1]: mutation first evolves the $\sigma_i$, then uses the new $\sigma_i$ to perturb the $X_i$. Evolution thus hopefully adjusts the $\sigma_i$ "for free", at the individual level.) Actually, the adjustment of $\sigma$ (option D) much resembles the 1/5th rule used to globally adjust $\sigma$ in early evolution strategies [10]. The difference is that the 1/5th rule criterion compares the offspring to the parents, and considers whether a sufficient fraction of the offspring is more fit than the parents. By contrast, PBIL$_C$ only examines the diversity of the fittest offspring: it does not need to restrict the exploration, even if the offspring are less fit than the parent, because the center of the explored region moves more slowly than in standard ES.
To sum up, PBIL$_C$ controls the exploration-exploitation trade-off in a way rather different from that of $(1, \lambda)$-ES. First of all, the single parent does not jump directly to a desirable location (the best offspring, or some weighted combination of the remarkable offspring), but rather makes a very small step toward this desirable location (e.g. $\alpha$ is set to $10^{-2}$ in the experiments). Variance $\sigma$ is adjusted in a similarly cautious way. It appears that ES takes instant decisions, on the basis of the instant information. By contrast, PBIL$_C$ maintains a long-term memory, slowly updated from the instant information, and bases its cautious decisions on this long-term memory.
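The update rules of this section can be put together in a short sketch of PBIL$_C$ with the relaxed variance (option D). Several choices below are assumptions made for illustration and are not taken from the source: the function name `pbil_c`, the minimization convention, the initial value `sigma0`, the use of the standard deviation of the $K$ best offspring (normalized by $K$) as the diversity measure, the absence of any clipping to $[0,1]^N$, and the shifted sphere function used in the usage example.

```python
import math
import random


def pbil_c(fitness, n, pop_size=50, k=10, alpha=0.01, sigma0=0.5,
           generations=500, seed=0):
    """Sketch of continuous PBIL with relaxed spread (option D), minimizing."""
    rng = random.Random(seed)
    center = [0.5] * n      # center of the search space [0, 1]^N
    sigma = [sigma0] * n    # initial spread: an assumption, not from the source
    for _ in range(generations):
        # Draw all offspring from the product of Gaussians N(X_i, sigma_i).
        offspring = [[rng.gauss(c, s) for c, s in zip(center, sigma)]
                     for _ in range(pop_size)]
        offspring.sort(key=fitness)  # lowest fitness (best) first
        best1, best2, worst = offspring[0], offspring[1], offspring[-1]
        # Center: X <- (1 - alpha) X + alpha (X_best1 + X_best2 - X_worst)
        center = [(1 - alpha) * c + alpha * (b1 + b2 - w)
                  for c, b1, b2, w in zip(center, best1, best2, worst)]
        # Spread: relax sigma_i toward the diversity of the K best offspring.
        top = offspring[:k]
        for i in range(n):
            mean_i = sum(x[i] for x in top) / k
            spread = math.sqrt(sum((x[i] - mean_i) ** 2 for x in top) / k)
            sigma[i] = (1 - alpha) * sigma[i] + alpha * spread
    return center, sigma


def shifted_sphere(x):
    """Hypothetical test function with its minimum at (0.2, ..., 0.2)."""
    return sum((xi - 0.2) ** 2 for xi in x)


center, sigma = pbil_c(shifted_sphere, n=5, alpha=0.1, generations=200)
```

Replacing the last line of the loop body by `sigma[i] = spread` gives the instant variance of option C, and leaving `sigma` untouched gives the constant variance of option A.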

Validation
This section describes the goal of the experiments and the problems considered. We then report and discuss the results obtained.

Experiment Goals and Problems
Our goal is to study the respective advantages of evolving extensional vs intensional information about the fitness landscape. Practically, PBIL$_C$, evolving intensional information represented as a distribution, is compared to a self-adaptive evolution strategy, evolving extensional information represented as usual as a population. We deliberately consider large-sized search spaces ($N = 100$) for the following reason. In low or middle-sized spaces, populations or distributions might convey similarly accurate information about the fitness landscape. This is not true in large-sized spaces: any reasonable number of points can only convey very poor information about $\mathbb{R}^{100}$. Experimenting with PBIL$_C$ in $\mathbb{R}^{100}$ will show how intensional evolution withstands the curse of dimensionality. Functions and search spaces considered are displayed in Table 1. Functions $F_1$ to $F_3$ have been used to evaluate binary PBIL [2]. Besides the size of the search space, $F_1$ and $F_2$ suffer from an additional difficulty, epistasis (the genes are linked via the $y_i$). Functions $F_6$ to $F_8$ have been extensively studied in the literature, for lower-sized search spaces ($N \le 30$).

Experimental setting
We used two reference algorithms: boolean PBIL working on a discretization of the continuous problem (each continuous variable is coded through 9 binary variables), using either a binary or a Gray coding; and a $(10 + 50)$-ES with self-adaptive mutation [1]. In the PBIL case, the size of the population is set to 50 and the relaxation factor $\alpha$ is set to $.01$. PBIL$_C$ involves the same setting as PBIL ($\lambda = 50$ and $\alpha = .01$). Four options regarding the variance of the distributions have been considered (section 3.2):
A. Constant variance.
B. Self-adapted variance: PBIL$_C$ here behaves like a self-adaptive $(1, \lambda)$-ES, except that the parent is replaced by $X^t$.
C. Instant variance: $\sigma_i$ is set to the variance of the best $K$ offspring in the population. Several values of $K$ were considered: $\lambda/2$, $\lambda/3$, $\lambda/5$.
D. Relaxed variance: $\sigma_i$ is the variance of the best $K$ offspring relaxed over the past generations; the relaxation factor is again set to $\alpha = .01$.
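The discretization used for the boolean PBIL baseline can be illustrated as follows. This sketch only assumes what the text states (9 bits per variable, binary or Gray coding); the function names, the most-significant-bit-first convention and the linear mapping of the integer code onto an interval `[low, high]` are assumptions made here.

```python
def bits_to_int(bits):
    """Standard binary code, most significant bit first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def gray_to_int(bits):
    """Reflected Gray code, most significant bit first."""
    value = 0
    acc = 0
    for g in bits:
        acc ^= g  # each binary bit is the XOR of all Gray bits seen so far
        value = (value << 1) | acc
    return value


def decode(bits, low, high, gray=False, n_bits=9):
    """Map a bit string to continuous variables, n_bits per variable."""
    to_int = gray_to_int if gray else bits_to_int
    top = 2 ** n_bits - 1
    return [low + (high - low) * to_int(bits[s:s + n_bits]) / top
            for s in range(0, len(bits), n_bits)]
```

With $N = 100$ and 9 bits per variable, the boolean PBIL baseline therefore works on strings of 900 bits, and each variable can take only 512 distinct values.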

Results
Table 2 displays the results obtained on functions $F_1$, $F_2$ and $F_3$. Results obtained by boolean PBIL are taken from [2]; additional results, not reported here, show that boolean PBIL significantly outperforms several variants of GAs and Hill-Climbers on these functions. Note that all algorithms end rather far from the actual optimum ($10^7$). Still, PBIL$_C$ significantly outperforms standard ES on these problems, provided that the variance of the distribution is adequately set. Note also that PBIL$_C$ outperforms PBIL itself, working on a binary or Gray discretization of these continuous problems. This might be due either to the loss of information entailed by discretization, or to the fact that PBIL, as already mentioned, explores too restricted a distribution space.
The worst results of PBIL$_C$ are obtained when $\sigma$ is self-adapted or set to the diversity of the current best offspring (options B and C); they are due to a fast decrease of $\sigma$. And, in retrospect, a vicious circle occurs when $\sigma$ tightly depends on the diversity of the offspring: the less diverse the offspring, the smaller $\sigma$, hence the less diverse the offspring... Setting $\sigma$ to a constant value (option A; the particular values were chosen after preliminary runs of 10,000 evaluations) leads to satisfactory results, even outperforming those of standard ES. Further experiments will show whether this is rather due to the superiority of weighted recombination (replacing a parent by a combination of offspring) over replacement, or to the "long-term memory" effect, as the parent slowly moves toward the weighted combination of the offspring instead of jumping there.
The best option appears to be learning the variance in the same way as the center of the distribution $X^t$ (option D). Further, the fraction $K$ of the offspring considered to update $\sigma$ apparently is not a critical parameter.
These trends are confirmed by preliminary experiments on $F_6$, $F_7$ and $F_8$ (Table 3): PBIL$_C$ significantly outperforms self-adaptive ES on two out of the three problems, the best option for adjusting $\sigma$ being the relaxation from a small fraction of the best offspring.

Conclusion
The main originality of PBIL is to reformulate evolution in new, higher-level terms: rather than specifying all operations needed to transform a population into another population (selection, recombination, mutation, replacement), one only specifies how to evolve or update a distribution given the additional information supplied by the current population. At this level, many core traits of evolution (e.g. diversity, speed of changes) are explicit and can be directly controlled.
Overall, evolution shifts from the stochastic exploration of the search space $\Omega$, to learning a distribution on $\Omega$ by reinforcement from the current population. This paper extends PBIL from boolean to continuous search spaces, by learning Gaussian distributions $N(X, \sigma)$. The resulting PBIL$_C$ algorithm can be thought of as a $(1, \lambda)$-ES, with the following differences. ES takes instant decisions, on the basis of the instant information. PBIL$_C$ maintains a long-term memory, takes its decisions on the basis of this long-term memory, and slowly updates the memory from the instant information. Practically, the parent of a $(1, \lambda)$-ES jumps toward the best offspring; by contrast, the center of the distribution in PBIL$_C$ cautiously moves toward a weighted combination of the offspring. Similarly, self-adaptive ES locally adjusts the variance of mutation by means of instant decisions; by contrast, PBIL$_C$ cautiously updates the variance from the global diversity of the best offspring.
One argument for learning distributions is that it is expected to scale up more easily than evolving populations: a population of reasonable size gives little information on a large-sized search space. Experimental results on large-sized problems show that PBIL$_C$ actually outperforms standard ES on five out of six problems (by one or two orders of magnitude) and also outperforms the original PBIL working on a discretized version of the continuous problems considered.
Nevertheless, given the size of the search space, PBIL$_C$ ends rather far from the optimum on four out of six problems. Further experiments will consider other problems, and study how PBIL$_C$ behaves in the last stages of exploitation. Another perspective of research is to evolve several distributions rather than a single one. This would relax the main limitation of the PBIL scheme, that is, the fact that it can only discover a single optimum. Indeed, learning several distributions simultaneously is very comparable to evolving several species. The advantage is that comparing an individual to a few distributions might be less expensive, and again more transparent, than clustering the population or adjusting the selection or the fitness function to ensure the co-evolution of species.