Comparison of Kriging-based methods for simulation optimization with heterogeneous noise

In this article we investigate the unconstrained optimization (minimization) of the performance of a system that is modeled through a discrete-event simulation. In recent years, several algorithms have been proposed which extend the traditional Kriging-based simulation optimization algorithms (assuming deterministic outputs) to problems with noise. Our objective in this paper is to compare the relative performance of a number of these algorithms on a set of well-known analytical test functions, assuming different patterns of heterogeneous noise. We also apply the algorithms to a popular inventory test problem. The conclusions and insights obtained may serve as a useful guideline for researchers aiming to apply Kriging-based algorithms to solve engineering and/or business problems, and may be useful in the development of future algorithms.


Introduction
Assume that we would like to find the solution that minimizes some goal function f(x):

min_{x ∈ X} f(x),    (1)

where f : X → R and x = (x_1, x_2, ..., x_d) (with d the dimension of the solution space). This goal function cannot be directly observed, and must be estimated through a stochastic simulation model: we thus only have access to noisy observations

f̃_j(x_i) = f(x_i) + ε_j(x_i),

where f̃_j(x_i) represents the observed goal value in the j-th simulation replication at point x_i. The noise is additive; the distribution of ε_j(x_i) has mean zero, and its variance depends on x_i (denoted by τ²(x_i); we thus have heterogeneous noise). We usually estimate the value of f(x_i) by performing n_i simulation replications:

f̄(x_i) = (1/n_i) Σ_{j=1}^{n_i} f̃_j(x_i).

This type of problem is very common in engineering and business applications since, in many practical problems, the objective function is analytically intractable (Law, 2015). In inventory management problems, for instance, we often need to optimize a system modeled through a discrete-event simulation to obtain the optimal stocking quantities (Jalali & Van Nieuwenhuyse, 2015). Examples in other fields include supply chain design (Saif & Elhedhli, 2015), pump scheduling (Naoum-Sawaya, Ghaddar, Arandia, & Eck, 2015), and ambulance fleet allocation (McCormack & Coates, 2015).
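As a concrete illustration of this setup, the following sketch generates noisy replications f̃_j(x_i) and forms the sample-mean estimate together with its estimated variance. The quadratic objective and the linear noise function tau are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def f(x):
    """True (unobservable) objective; an illustrative quadratic, not from the paper."""
    return float(np.sum(np.asarray(x) ** 2))

def tau(x):
    """Heterogeneous noise standard deviation tau(x); an assumed form."""
    return 0.5 + 0.1 * f(x)

def simulate(x, rng):
    """One replication: f~_j(x) = f(x) + eps_j(x), with eps_j(x) ~ N(0, tau^2(x))."""
    return f(x) + rng.normal(0.0, tau(x))

def estimate(x, n, rng):
    """Sample mean over n replications and the estimated variance of that mean."""
    reps = np.array([simulate(x, rng) for _ in range(n)])
    return reps.mean(), reps.var(ddof=1) / n    # Var[fbar(x)] ~= tau^2(x)/n

rng = np.random.default_rng(0)
fbar, v = estimate([1.0, 1.0], n=55, rng=rng)   # f([1, 1]) = 2
```

Since Var[f̄(x_i)] ≈ τ²(x_i)/n_i shrinks with the number of replications n_i, the algorithms discussed below must trade off precision per point against the number of points sampled.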
In this paper, we focus on problems where X is continuous and bounded by box constraints; for reasons explained in Section 3.1, we discretize the function domain to obtain a finite set of points (as is common in the literature; see, e.g., Frazier, Powell, and Dayanik, 2009; Kleijnen, van Beers, and Van Nieuwenhuyse, 2012; and Sun, Hong, and Hu, 2014). Many methods are available for solving problem (1) with a finite set of points, such as ranking and selection (R&S) methods (see Hong, Nelson, and Xu, 2015, for a review). Most of them are, however, unsuitable when (1) the number of feasible solutions is large, or (2) simulation replications are time-consuming (Xu, 2012). Kriging-based or Bayesian optimization is among the few techniques that can handle this problem for low problem dimensionality (d ≤ 20; Brochu, Cora, & De Freitas, 2010; Preuss, Wagner, & Ginsbourger, 2012). Based on the observed f̄(x_i) for several x_i, Kriging provides an approximation or metamodel for f(x). The traditional Efficient Global Optimization (EGO) approach of Jones, Schonlau, and Welch (1998) is one of the most popular Kriging-based algorithms for optimizing noiseless simulations; in this case, the fitted metamodel is a deterministic Kriging model (see Kleijnen, 2015, for references on the use of EGO in constrained and multi-objective optimization problems). In stochastic simulation (e.g., discrete-event simulation), however, EGO might not be very appropriate, since it ignores the noise in the observations, assuming they were sampled with infinite precision (Quan, Yin, Ng, & Lee, 2013). Being aware of this deficiency, researchers have tried to extend EGO to stochastic simulation. Most of the approaches assume homogeneous simulation noise, meaning that the variance of the noise does not depend on x (Picheny, Ginsbourger, Richet, & Caplin, 2013a); Picheny, Wagner, and Ginsbourger (2013b) compare several Kriging-based algorithms for optimizing functions with this type of noise.
In practice, however, the noise is heterogeneous (Kim & Nelson, 2006; Kleijnen & Van Beers, 2005; Yin, Ng, & Ng, 2011). In recent years, a number of Kriging-based optimization algorithms have been proposed that can handle heterogeneous noise, based on stochastic Kriging models (see Ankenman, Nelson, and Staum, 2010; Cressie, 1993; and Yin et al., 2011). To the best of our knowledge, the performance of these algorithms has not yet been compared. Our objective in this paper is to evaluate the effectiveness and the relative performance of these algorithms on simulation optimization problems with heterogeneous noise. More specifically, we evaluate the performance of six Kriging-based algorithms (details will be explained in Section 2.2; see the overview in Table 2):
• MQ: the minimum quantile criterion (Picheny et al., 2013b), which is similar to the Gaussian process upper confidence bound (GP-UCB) method of Auer, Cesa-Bianchi, and Fischer (2002) and Jones (2001);
• SKO: an adapted version of the sequential Kriging optimization approach (Huang, Allen, Notz, & Zeng, 2006);
• CKG: correlated knowledge-gradient (Frazier et al., 2009);
• EQI: expected quantile improvement (Picheny et al., 2013a);
• TSSO: two-stage sequential optimization (Quan et al., 2013);
• eTSSO: extended two-stage sequential optimization (Liu, Pedrielli, & Ng, 2014).
To this end, we apply these algorithms to minimize three well-known analytical test functions (rescaled Branin, six-hump Camelback, and Hartmann-6), perturbed with heterogeneous Gaussian noise, over a finite set of solutions. We also apply the algorithms to optimize the (s, S) inventory system of Fu and Healy (1997), which has been a popular test problem in the simulation optimization literature (Jalali & Van Nieuwenhuyse, 2015).
Though our work is closely related to Picheny et al. (2013b), it clearly differs in the following respects: (1) we consider heterogeneous noise; (2) we add two recent Kriging-based algorithms (TSSO and eTSSO); (3) we adapt the SKO algorithm (Huang et al., 2006) to settings with heterogeneous noise; and (4) we explicitly consider the replication strategy used by the different algorithms, i.e., the algorithms need to smartly allocate the limited replication budget to obtain reasonably good estimates. In Picheny et al. (2013b), the noise magnitude is known and the replication strategy is limited to simple revisits (i.e., each time the algorithms visit x_i, n_i = 1 replication is performed). While Picheny et al. (2013b) compare the algorithms only in terms of how they perform the search for interesting alternatives in X, we also study the effectiveness and importance of the replication strategies and their influence on the choice of the final solution (i.e., the identification step; see Fig. 1, last block, and Section 2.1, last paragraph).
The complexity and computational requirements vary widely across the algorithms. As the MQ algorithm is the most straightforward, we consider it as a benchmark in our experiments. Some algorithms (i.e., CKG, SKO, and EQI; see Section 2) require information on the noise variance function τ²(x) at any x ∈ X. This may limit the applicability of these algorithms in practice, as an unknown noise variance function is "a fact of life in system simulation problems" (Kim & Nelson, 2006), so it needs to be estimated.
In our experiments, we test whether these "informed" algorithms outperform the ones that don't need this information.
The following conclusions may serve as a useful guideline for researchers who want to apply Kriging-based algorithms to solve engineering or business problems, and may be useful in the development of future algorithms:
1. Not only the magnitude but also the structure (i.e., shape) of the noise can have a large impact on the performance of the algorithms.

Table 1. Overview of the most important notation:
f(x_i): the true objective value at point x_i, which is assumed to be unknown.
f̄(x_i): estimate of the objective value at point x_i (sample mean over replications).
f̂(x_i): Kriging prediction of the objective value at point x_i.
s²(x_i): estimate of the Kriging prediction variance at point x_i.
2. Among the algorithms that require an estimate of τ²(x), SKO and CKG usually outperform MQ, while EQI performs worse.
3. Two-stage sequential optimization (TSSO) simulates the most promising alternatives but usually fails to identify them at the end. If the identification is improved, TSSO will outperform MQ and provide very competitive results, making it preferable over SKO or CKG, since it does not require an estimate of the noise variance function. Also, eTSSO does not seem to provide better results than TSSO.
4. None of the existing algorithms has an appropriate replication strategy. While the focus of the literature has been primarily on designing better search methods, our results show that there is a need for smarter replication strategies (e.g., by adding an appropriate ranking and selection procedure). This can likely lead to significant improvements in the performance of all algorithms.
The remainder of this article is organized as follows. Section 2 provides a brief explanation of the Kriging-based algorithms studied, Section 3 details the experiments, and Section 4 contains the main results (additional results are provided in an appendix). We present conclusions in Section 5.

Overview of Kriging-based algorithms
Section 2.1 provides a brief introduction to Kriging-based simulation optimization, and explains the general steps involved in this process. Section 2.2 highlights the similarities and differences between the algorithms studied in this paper.

Kriging-based simulation optimization
As noted in Section 1, we need a stochastic simulation model to estimate the objective value at a point within X (see Eq. (1)). As the simulation output is noisy, we use the sample mean of the simulation output across several replications to obtain an estimate. This estimate is denoted by f̄(x_i), with estimated variance Var[f̄(x_i)], at any arbitrary point x_i.
As shown in Fig. 1, the Kriging-based algorithms studied in this article are sequential. In the first step, they fit an initial Kriging metamodel using n_0 initial points. Space-filling designs (e.g., Latin hypercube sampling) are often used for sampling these initial points (Kleijnen, 2009); simulation replications are used to estimate the objective value at these points. Since our observations are noisy, we use stochastic Kriging (Ankenman et al., 2010) to fit the initial Kriging metamodel. This metamodel provides us with (1) f̂(x): a prediction for the objective function value at any x ∈ X, and (2) s²(x): an estimate of the variance of f̂(x), which provides a measure for the Kriging prediction uncertainty or error (Kleijnen, 2014); see Appendix A for the related expressions. Table 1 provides an overview of the most important notations in this paper. We refer to Quan et al. (2013), Ankenman et al. (2010), Yin et al. (2011), and Cressie (1993) for further details on stochastic Kriging.
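A minimal sketch of the stochastic Kriging predictor described above, under simplifying assumptions: a constant trend, fixed hyperparameters (in practice σ² and θ are estimated by maximum likelihood), a Gaussian kernel rather than the Matérn covariance used later in the paper, and a prediction variance that ignores trend-estimation uncertainty.

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation function (a short stand-in for the Matern kernel)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * theta ** 2))

def sk_predict(X, ybar, var_ybar, Xnew, sigma2=1.0, theta=0.5):
    """Stochastic Kriging prediction with a constant trend, for FIXED
    hyperparameters. var_ybar[i] = Var[fbar(x_i)] = tau^2(x_i)/n_i is the
    intrinsic (simulation) noise; sigma2 * gauss_corr is the extrinsic
    covariance. Returns f_hat(x) and s^2(x)."""
    n = len(X)
    Sigma = sigma2 * gauss_corr(X, X, theta) + np.diag(var_ybar)
    ones = np.ones(n)
    # generalized least squares estimate of the constant trend
    beta0 = ones @ np.linalg.solve(Sigma, ybar) / (ones @ np.linalg.solve(Sigma, ones))
    r = sigma2 * gauss_corr(Xnew, X, theta)              # cross-covariances
    mu = beta0 + r @ np.linalg.solve(Sigma, ybar - beta0)
    s2 = sigma2 - np.einsum('ij,ji->i', r, np.linalg.solve(Sigma, r.T))
    return mu, np.maximum(s2, 0.0)

X = np.array([[0.0], [0.5], [1.0]])
ybar = np.array([0.0, 0.25, 1.0])                        # estimates of f(x) = x^2
mu, s2 = sk_predict(X, ybar, var_ybar=np.full(3, 1e-8), Xnew=X)
```

With near-zero intrinsic noise the predictor essentially interpolates the sample means, and s²(x) collapses at the sampled points; with substantial noise it smooths across them instead.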
In the search step, the algorithm iteratively chooses a point and simulates it for some number of replications to obtain f̄(x) and Var[f̄(x)]; it then refits the Kriging metamodel, re-estimating the Kriging parameters with the new point included. The algorithm uses an infill criterion to choose these points (known as infill points): the purpose of the infill criterion is to discover the good alternatives in X, and to guide the search to regions with promising objective values (Forrester, Sobester, & Keane, 2008).
Optionally, the replication step allows us to perform extra replications at some of the points already sampled (either initial design points or infill points), to increase the precision of both f̄(x) and the estimated variance at those points. Such a replication step is present in TSSO, eTSSO, and EQI (see Section 2.2), but the actual choice of points and the number of replications allocated to this step differ between these algorithms.
Each time we execute the search and replication steps, we consume part of our finite replication budget. Hence, after some iterations, the replication budget is depleted and we enter the identification step: the algorithm looks back at all simulated or tested alternatives (the set X_T) and returns one of them as the best solution to the optimization problem. In noiseless simulation, this is trivial, since f(x) is available for all the simulated points: x_r = arg min_{x ∈ X_T} f(x), where x_r denotes the proposed solution. In the presence of noise, only f̂(x) is available, along with s²(x) and the observed f̄(x), at each alternative x ∈ X_T. Consequently, the success of an algorithm is impacted by the infill criterion (only alternatives that have been simulated are considered in the identification step), by the criterion used to choose x_r in the identification step (i.e., the identification criterion), and by the replication strategy (more precise estimates can help to distinguish superior alternatives in the identification step).
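Once an identification criterion is chosen, the identification step itself is a one-liner. The sketch below (a hypothetical helper, not the paper's code) shows two natural choices: the minimum Kriging prediction and the minimum sample mean.

```python
import numpy as np

def identify(X_T, fhat, fbar, criterion="prediction"):
    """Identification step: return the proposed solution x_r among the tested
    set X_T, using either the minimum Kriging prediction f_hat(x) or the
    minimum sample mean fbar(x) (two common identification criteria)."""
    values = fhat if criterion == "prediction" else fbar
    return X_T[int(np.argmin(values))]

X_T = np.array([[0.2], [0.6], [0.9]])
x_r = identify(X_T, fhat=np.array([1.0, 0.3, 0.7]), fbar=np.array([1.1, 0.5, 0.2]))
```

Note that the two criteria can disagree (here they point to different alternatives), which is exactly why the replication strategy matters for the final choice.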

Description of the algorithms
Due to the limited budget, all algorithms must make a trade-off between the total number of infill points and their precision (Picheny et al., 2013a). If the replication strategy performs only a few replications at each point, we can sample many infill points, but the estimates f̄(x) may remain very noisy. This may result in inaccurate Kriging models, which in turn affects the search step (i.e., we might sample points with bad objective values). Additionally, it may cause problems in the identification step, as it becomes more difficult to differentiate between good and inferior solutions. Allocating more replications to the points mitigates these issues, but decreases the number of infill points, implying that we might not get the chance to discover (some) interesting alternatives.
Table 2 summarizes the main characteristics of the algorithms (infill criterion, replication strategy, identification criterion). As evident from this table, the identification criterion varies between the algorithms; yet, most of them rely on f̂(x) and s²(x) for proposing the final solution. The remainder of this section briefly explains the search and the replication strategy for each algorithm.

MQ algorithm
MQ is straightforward and acts as a benchmark for the other algorithms: it simply chooses the point with the minimum Kriging quantile

q(x) = f̂(x) + Φ⁻¹(β) s(x),

with β ∈ (0, 0.5] in MQ, while β ∈ [0.5, 1] in SKO and EQI (Picheny & Ginsbourger, 2014). Obviously, MQ does not require information on τ²(x). The algorithm allocates a fixed number B of replications per iteration; the analyst needs to decide on B prior to running the algorithm. However, MQ allows for revisits: at any iteration k, we might sample a point that has been simulated previously, and add B additional replications. The sampled points can thus receive a different total number of replications, depending on how often they are revisited.
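A sketch of the MQ choice, using the quantile form q(x) = f̂(x) + Φ⁻¹(β)s(x) of Picheny and Ginsbourger (2014); the candidate values shown are illustrative.

```python
import numpy as np
from statistics import NormalDist

def mq_next_point(fhat, s, beta=0.1):
    """MQ infill criterion: pick the candidate minimising the Kriging quantile
    q(x) = fhat(x) + Phi^{-1}(beta) * s(x). With beta in (0, 0.5],
    Phi^{-1}(beta) <= 0, so q(x) is an optimistic lower bound (cf. GP-UCB)."""
    q = fhat + NormalDist().inv_cdf(beta) * s
    return int(np.argmin(q)), q

# the more uncertain second candidate wins despite its worse prediction
idx, q = mq_next_point(fhat=np.array([1.0, 1.2]), s=np.array([0.1, 1.0]))
```

The smaller β is chosen, the more the criterion rewards prediction uncertainty over a low predicted mean, i.e., the more exploratory the search becomes.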

Adapted sequential Kriging optimization (SKO)
The SKO algorithm of Huang et al. (2006) chooses the alternative with maximum augmented expected improvement (AEI) as the next infill point:

AEI(x) = E[max(f̂(x**) − Y(x), 0)] × (1 − τ(x) / sqrt(s²(x) + τ²(x))),    (2)

where f̂(x**) is the Kriging prediction at the current effective best solution x**, i.e., the point with minimum q(x) among the simulated points, with β ∈ [0.5, 1]. Thus, the first term of the above equation represents how much we expect the cost value at x to be lower than f̂(x**); see Huang et al. (2006) for details. This algorithm was originally designed for homogeneous noise: in the second term of Eq. (2), we introduce τ²(x) instead of τ² to reflect the presence of heterogeneous noise. SKO also uses a fixed number of replications per iteration (B) and allows for revisits. Note that the second term in Eq. (2) prevents the algorithm from getting trapped at a given point x, as continued resampling causes s²(x) to gradually approach zero, which in turn pushes AEI(x) towards zero (Huang et al., 2006).
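The AEI of Huang et al. (2006), with the heterogeneous τ²(x) swapped in, can be sketched as below. We assume the usual closed form of expected improvement; the candidate values are illustrative.

```python
import math
import numpy as np

def Phi(z):
    """Standard normal cdf, vectorized via math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(z, float) / math.sqrt(2.0)))

def pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(z, float) ** 2) / math.sqrt(2.0 * math.pi)

def aei(fhat, s, tau2, fhat_best):
    """Augmented expected improvement: classical EI with respect to the
    effective best f_hat(x**), multiplied by 1 - tau(x)/sqrt(s^2(x)+tau^2(x)),
    which penalises points whose noise dominates the remaining uncertainty."""
    s = np.maximum(np.asarray(s, float), 1e-12)
    z = (fhat_best - fhat) / s
    ei = (fhat_best - fhat) * Phi(z) + s * pdf(z)
    return ei * (1.0 - np.sqrt(tau2) / np.sqrt(s ** 2 + tau2))

# two candidates: equal prediction uncertainty, but the second is much noisier
scores = aei(fhat=np.array([0.5, 0.2]), s=np.array([0.3, 0.3]),
             tau2=np.array([0.0, 0.09]), fhat_best=0.4)
```

In this example the noise penalty reverses the ranking: the second candidate has the better predicted mean, yet its large τ²(x) makes the first candidate the AEI choice.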

Correlated knowledge-gradient (CKG)
The idea behind the correlated knowledge-gradient (CKG) infill criterion is that in noisy environments, the Kriging prediction f̂(x) may be closer to f(x) than the sample mean f̄(x); therefore, points are selected based on their effect on the Kriging prediction (Frazier et al., 2009; Picheny & Ginsbourger, 2014). More specifically, at iteration k + 1, the improvement that would result from sampling alternative x^{k+1} is defined as:

I(x^{k+1}) = min_{x ∈ X^k} f̂_k(x) − min_{x ∈ X^k ∪ {x^{k+1}}} f̂_{k+1}(x),

where X^k is the set of simulated points at the end of iteration k, f̂_k(x) is the current Kriging prediction, and f̂_{k+1}(x) is the Kriging prediction after refitting the Kriging metamodel with x^{k+1} included. We choose the point with the highest CKG(x) = E[I(x)] as the next point. As with SKO, we need to know τ²(x) (or its estimate) to calculate this expectation (see Frazier et al., 2009, for further details on this calculation). Analogous to MQ and SKO, the CKG algorithm allows for revisits, and uses a fixed number of replications B per iteration.

Expected quantile improvement (EQI)
In EQI, at iteration k + 1, simulating an alternative x^{k+1} is beneficial if its quantile under the updated Kriging model is lower than the current minimum quantile:

EQI(x^{k+1}) = E[max(min_{x ∈ X^k} q_k(x) − q_{k+1}(x^{k+1}), 0)],

where q_k(x) denotes the Kriging quantile at iteration k. EQI allows for revisits, and additionally includes a replication step after each search step (see Fig. 1). The number of replications per iteration is no longer fixed to a constant B. In the search step of iteration k + 1, EQI samples the infill point x^{k+1} with n_inc replications and refits the Kriging metamodel. In the replication step, it checks whether performing another n_inc replications at x^{k+1} is beneficial, by looking at the evolution of EQI(x^{k+1}) (see Picheny et al., 2013a, for details). If so, the algorithm adds another n_inc replications, updates f̄(x^{k+1}) and Var[f̄(x^{k+1})], and refits the Kriging metamodel. This is repeated until adding extra replications is no longer beneficial; at that moment, EQI leaves the replication step.
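The replication-step logic above can be sketched as a simple loop. All helper callables here (still_beneficial, simulate, refit) are hypothetical stand-ins for EQI's actual criterion evaluation and model updates, and the "beneficial until 25 replications" rule is a toy example.

```python
def replication_step(x_new, n_inc, budget, still_beneficial, simulate, refit):
    """Sketch of EQI's replication step: keep adding batches of n_inc
    replications at the new infill point x_new as long as this is still
    judged beneficial (by re-evaluating EQI) and budget remains."""
    spent = 0
    while spent + n_inc <= budget and still_beneficial(x_new):
        simulate(x_new, n_inc)   # n_inc extra replications at x_new
        refit()                  # refit Kriging with the updated fbar, Var[fbar]
        spent += n_inc
    return spent

# toy stand-ins: "beneficial" until 25 replications have accumulated at the point
state = {"n": 0}
spent = replication_step(
    x_new=0.0, n_inc=10, budget=100,
    still_beneficial=lambda x: state["n"] < 25,
    simulate=lambda x, n: state.update(n=state["n"] + n),
    refit=lambda: None,
)
```

Because the stopping rule is re-evaluated after every batch, the number of replications a point receives is data-driven rather than fixed in advance, which is the key difference from the fixed-B algorithms.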

Two-stage sequential optimization (TSSO)
Analogous to EQI, TSSO has a search and a replication step (called the allocation stage in Quan et al., 2013). In the search step of iteration k + 1, TSSO looks at all unvisited alternatives (x ∈ X \ X^k) and simulates the one with maximum modified expected improvement (MEI):

MEI(x) = E[max(f̂(x_min) − Y_D(x), 0)],

where f̂(x_min) is the stochastic Kriging prediction at x_min = arg min_{x ∈ X^k} f̄(x) (i.e., the alternative with the lowest sample mean among the already simulated points), and Y_D(x) is normally distributed with mean f̂(x) and the deterministic Kriging variance s²_D(x). The number of replications per iteration is fixed to a predetermined value B. This number is shared between the search and replication steps according to a heuristic (Quan et al., 2013). In the replication step, the algorithm uses the optimal computing budget allocation (OCBA), a Bayesian ranking and selection procedure (Chen, Lin, Yücesan, & Chick, 2000). OCBA allocates most of the replication step budget to points with low f̄(x) and high Var[f̄(x)] (Quan et al., 2013), in view of maximizing the probability of selecting the best point among the sampled ones.
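A sketch of the MEI computation: the classical expected-improvement closed form, but measured against f̂(x_min) and using the deterministic Kriging standard deviation s_D(x). The candidate values are illustrative.

```python
import math
import numpy as np

def mei(fhat, sD, fhat_xmin):
    """Modified expected improvement: EI measured against f_hat(x_min), the
    prediction at the point with the lowest sample mean, and using the
    DETERMINISTIC Kriging standard deviation s_D(x), so MEI does not vanish
    at points whose noise has merely been averaged out."""
    sD = np.maximum(np.asarray(sD, float), 1e-12)
    z = (fhat_xmin - fhat) / sD
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (fhat_xmin - fhat) * Phi + sD * pdf

# candidate 1 matches the incumbent prediction; candidate 2 is clearly better
values = mei(fhat=np.array([0.0, -0.5]), sD=np.array([1.0, 0.1]), fhat_xmin=0.0)
```

Using s²_D rather than the stochastic Kriging variance is the "modification": it keeps the criterion from collapsing at heavily replicated points, which would otherwise stall the search.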

Extended two-stage sequential optimization (eTSSO)
The eTSSO algorithm differs from the TSSO approach in the number of replications per iteration: while this was fixed to a given value B in TSSO, it now varies per iteration. The budget B_k of iteration k is driven by Var[f̄(x_OCBA)], the estimate of the variance of the sample mean at the point where OCBA allocates the highest number of replications (x_OCBA), and by s²_D(x^k), the estimate of the deterministic Kriging prediction error at the point that maximizes MEI at iteration k (see Liu et al., 2014, for the exact expression). In eTSSO, the search step budget of each iteration is fixed to a value (n_s) prior to running the algorithm: at iteration k, the replication step budget of eTSSO is thus B_k − n_s.

Designing numerical tests
In this section, we explain our experiments in detail. We apply the Kriging-based algorithms to three well-known analytical test functions; Section 3.1 provides the details. We also test our algorithms on the inventory problem of Fu and Healy (1997); see Section 3.2. Section 3.3 explains the method used for sampling the initial design, and Section 3.4 discusses the different scenarios tested in our experiments. We discuss the implementation of the algorithms in Section 3.5. Finally, Section 3.6 explains the performance measure used to compare the algorithms.

Analytical test functions and candidate points
The algorithms are applied to minimize three well-known analytical test functions (six-hump Camelback, rescaled Branin, and Hartmann-6; Dixon & Szegö, 1978), perturbed with heterogeneous Gaussian noise; we thus obtain noisy observations as in Eq. (1). Table 3 gives further details on these analytical test functions. All test functions are continuous; however, as discussed in the introduction, we discretize the domain to obtain a large but finite set of solutions. In this way, we avoid the challenging problem of maximizing the expected improvement criteria in continuous space; also, some of the criteria might be more difficult to maximize than others. We discretize the domains using Faure low-discrepancy sequences (a quasi-Monte Carlo sampling method with good space-filling properties; see chapter 5 of Lemieux, 2009), and we take 1000 points for the Camelback and Branin functions (d = 2), and 10,000 points for the Hartmann-6 function (d = 6). These points represent our set of candidate points; our objective is to find the global minimum among these alternatives (referred to as x*, with function value f(x*)) using our Kriging-based algorithms.
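Building such a candidate set might look as follows. Since Faure sequences are not available in common Python libraries, this sketch uses a hand-rolled Halton sequence as a stand-in with similar space-filling behavior, and the 2-d box shown is illustrative.

```python
import numpy as np

def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv

def halton(n, bases=(2, 3)):
    """First n points of a Halton low-discrepancy sequence."""
    return np.array([[radical_inverse(i, b) for b in bases]
                     for i in range(1, n + 1)])

# 1000 candidate points, rescaled from [0,1]^2 to an illustrative 2-d box
lo, hi = np.array([-2.0, -1.0]), np.array([2.0, 1.0])
cand = lo + halton(1000) * (hi - lo)
```

The candidate set is fixed once before the optimization starts, so each infill criterion only needs to be evaluated (not maximized in continuous space) over these points.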
Figs. 2 and 3 illustrate the rescaled Branin and the Camelback functions, along with their respective candidate points. As evident from the figures, the global minimum of Camelback lies within a small valley, while Branin has a large flat valley. Although the shape of Hartmann-6 cannot be visualized, it is well known to be a multimodal function.

The ( s , S ) inventory problem and candidate points
In the inventory control problem of Fu and Healy (1997), a firm employs a periodic review (s, S) policy to manage the inventory of a single product. At the end of each period (e.g., each week), the firm checks the inventory. If the inventory position (i.e., inventory on-hand + on-order − backorders) is above the level s, the firm does not order. If it is below s, the firm orders the difference between the order-up-to level S and the inventory position. Demand is stochastic and continuous, and is i.i.d. across the periods. The goal is to determine the optimal s and S in view of minimizing the long-run expected cost per period. The costs include the fixed ordering (K), per-unit ordering (c), holding (h), and backorder (b) costs. Zero replenishment lead time and full backlogging are assumed.
To simulate this problem, we use a simulation run length of 1000 periods per replication, with a warm-up length of 100 periods. The inventory parameters (s, S) are continuous; we again use the Faure sequence to take 1000 space-filling solutions (Fig. 4). The expected cost function is convex and, with exponential demand, it has a closed form, such that the optimal s and S can be calculated analytically; we can thus evaluate the performance of our Kriging-based algorithms (see Table 4 for details about the parameters used in our experiment, based on Table 1 in Fu & Healy, 1997).
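A minimal replication of this simulator might look as follows. The cost parameters and demand rate are illustrative placeholders, not the values of Table 4, and the event order (demand, cost accrual, then review with instantaneous delivery) is one common convention for the zero-lead-time case.

```python
import numpy as np

def ss_inventory_cost(s, S, K=36.0, c=2.0, h=1.0, b=5.0, lam=1.0,
                      periods=1000, warmup=100, rng=None):
    """One replication of a periodic-review (s, S) system with zero lead time,
    full backlogging, and i.i.d. exponential(lam) demand, in the spirit of
    Fu and Healy (1997). Returns the average cost per period after warm-up."""
    if rng is None:
        rng = np.random.default_rng()
    inv = S                                     # start at the order-up-to level
    costs = []
    for t in range(periods):
        inv -= rng.exponential(1.0 / lam)       # demand of this period
        cost = h * max(inv, 0.0) + b * max(-inv, 0.0)   # holding + backorder
        if inv < s:                             # review: order up to S
            cost += K + c * (S - inv)           # fixed + per-unit ordering cost
            inv = S                             # zero lead time: arrives at once
        if t >= warmup:
            costs.append(cost)
    return float(np.mean(costs))

cost = ss_inventory_cost(5.0, 15.0, rng=np.random.default_rng(42))
```

Each call is one replication f̃_j(s, S); averaging several calls with independent generators yields the sample mean f̄ and its variance, exactly as in Section 2.1.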

Initial design and Kriging metamodeling
For the initial design, we follow the suggestion of Jones et al. (1998): n_0 = 10d (so 20 initial points for the Camelback and rescaled Branin functions, as well as for the (s, S) inventory problem; 60 initial design points for the Hartmann-6 function). A maximin Latin hypercube sample (LHS) is used to obtain these points (as is common in Kriging metamodeling; see Picheny et al., 2013b); we determine f̄(x_i) for each i = 1, ..., n_0 using 55 replications (in view of controlling the noise; see Section 3.4). As prevalent in the literature, we consider only a constant trend (i.e., an intercept) in the Kriging metamodel (Kleijnen, 2009). We also don't use common random numbers (CRN): as shown by Chen, Ankenman, and Nelson (2012), CRN increases the variance of the Kriging predictor and the variance of the constant trend estimator. We fit the Kriging metamodel by means of maximum likelihood; the Matérn covariance function with smoothness ν = 2.5 is used (see Rasmussen and Williams, 2006, chapter 4).

(Table 4 lists the parameters of the (s, S) inventory problem with exponential(λ) demand; Table 5 summarizes the noise structure for the 3 analytical test functions, where R_f is the range of the objective value in the region of interest.)
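A cheap way to obtain a maximin-type LHS is to draw several random Latin hypercube designs and keep the one with the largest minimum pairwise distance; this is a stand-in sketch, not the optimized maximin construction typically used in practice.

```python
import numpy as np

def maximin_lhs(n, d, n_candidates=50, rng=None):
    """Simple maximin Latin hypercube on [0,1]^d: generate n_candidates random
    LHS designs and keep the one maximising the minimum pairwise distance."""
    if rng is None:
        rng = np.random.default_rng()
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        # one random LHS: a stratified permutation per dimension
        X = (np.array([rng.permutation(n) for _ in range(d)]).T
             + rng.random((n, d))) / n
        dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        score = dists[np.triu_indices(n, k=1)].min()   # minimum pairwise distance
        if score > best_score:
            best, best_score = X, score
    return best

X0 = maximin_lhs(n=20, d=2, rng=np.random.default_rng(1))   # n_0 = 10d for d = 2
```

Every column places exactly one point in each of the n strata, so marginal coverage is guaranteed; the maximin selection then improves the spread in the joint space.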

Algorithmic factors
There are many factors that can influence the performance of the algorithms: some are internal or algorithm-specific parameters (e.g., β in SKO), while others are external (such as the replication budget, the noise magnitude, and the noise structure). Table 5 provides an overview, which we now detail.
We test our algorithms using low and high replication budgets; this refers to the replication budget available after simulating the n_0 initial points, for performing the search and replication steps. We focus on problems for which only a very small number of solution evaluations is possible, e.g., because the simulation is very time-consuming. The number of replications per iteration (B) is equal to 55 for SKO, CKG, TSSO, and MQ, implying that with a low (resp. high) budget we simulate 10 (resp. 50) infill points for these algorithms. In TSSO, the number of replications per infill point is usually less than 55, since the 55 replications are shared between the search and replication steps (see Section 2.2.5). In EQI and eTSSO, B is not fixed; the number of infill points is thus not known prior to running these algorithms.
The maximum and minimum values of τ(x) are linked to R_f, i.e., the range of the objective value in the region of interest (as in Huang et al., 2006 and Picheny & Ginsbourger, 2014). With light noise, τ(x) varies between 15% and 60% of R_f. With heavy noise, these numbers increase to 150% and 600% of R_f. We sample the initial designs with 55 replications: for these points, the noise on f̄(x) thus varies between 1.5 R_f/√55 ≈ 0.2 R_f and 6 R_f/√55 ≈ 0.8 R_f under heavy noise. In real-life problems, the noise can follow any type of structure. In the (s, S) inventory problem, this structure is unknown. For the 3 analytical test functions, we assume that the standard deviation of the noise (i.e., τ(x)) varies linearly in terms of the objective value: τ(x) = a f(x) + a × b (see Appendix B for details). Here, we focus on two cases:
1. Best case: the noise standard deviation decreases linearly as the objective value decreases; we therefore have minimum noise at the global minimum.
2. Worst case: the noise standard deviation increases linearly as the objective value decreases; we then have maximum noise at the global minimum.
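The two cases above can be sketched as follows. The fractions 0.15 and 0.60 of R_f correspond to the light-noise setting (heavy noise would use 1.5 and 6.0), and expressing the linear form τ(x) = a f(x) + a·b through these fractions is our simplification of Appendix B.

```python
import numpy as np

def linear_noise_sd(fvals, frac_min=0.15, frac_max=0.60, worst_case=False):
    """Noise standard deviation varying linearly with the objective value,
    between frac_min*R_f and frac_max*R_f. Best case: minimal noise at the
    minimum of f; worst case: maximal noise there."""
    fvals = np.asarray(fvals, float)
    R_f = fvals.max() - fvals.min()
    u = (fvals - fvals.min()) / R_f        # 0 at the global minimum, 1 at the max
    if worst_case:
        u = 1.0 - u                        # flip: maximum noise at the minimum
    return (frac_min + (frac_max - frac_min) * u) * R_f

best = linear_noise_sd([0.0, 1.0, 2.0])
worst = linear_noise_sd([0.0, 1.0, 2.0], worst_case=True)
```

The same objective values thus yield mirrored noise profiles, which is what lets the experiments isolate the effect of noise structure from that of noise magnitude.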
As mentioned earlier, CKG, EQI, and SKO need an estimate of τ(x) at all the candidate points. For the analytical test functions, the true τ(x) is known and is provided to these algorithms. In practical implementations, however, τ(x) will be unknown. Therefore, for the analytical test functions, the noise estimates offered to CKG, EQI, and SKO are much better than the estimates that one would likely obtain in practice, where no prior information on τ(x) is available. This is the case in the inventory test problem. Analogous to Section 3.1 in Ankenman et al. (2010), we then fit a deterministic Kriging model to the sample variances of the simulated function values and use it to predict τ(x) at any x ∈ X. Fig. 5 illustrates the structure of heavy worst case and heavy best case noise for the Branin function, at x_2 = 0. Here R_f = 6, so the maximum and minimum standard deviations of the noise with 55 replications are 0.8 × 6 = 4.8 and 0.2 × 6 = 1.2, respectively. As shown in the figure, with best case noise, this maximum occurs at (x_1 = 0, x_2 = 0), and the noise drops linearly as f(x) decreases. With worst case noise, we have the opposite behavior.
The noise structure may substantially impact the performance of the algorithms, as it affects the search pattern in step 2. As an illustration, Fig. 6 shows the structure of MEI for the Camelback function, with best and worst case noise, at the first iteration of an arbitrary run (so only the n_0 design points have been simulated). With best case noise, the maximum MEI is near the optimal solution; with worst case noise, the search may be driven to uninteresting areas of the search space. The noise structure can thus completely change the search behavior of the algorithms.
Table 6 provides the parameter settings of the different algorithms. Huang et al. (2006) recommend β = 0.84 for SKO (Φ⁻¹(0.84) ≈ 1); for the β values in MQ and EQI, we follow the suggestions of Picheny et al. (2013b). We follow the recommendation of Picheny et al. (2013a) for γ in EQI; we set n_inc = 10. In TSSO, n_min (i.e., the minimum number of replications for simulating an infill point) only affects the way B is shared between the search and replication steps (see Quan et al., 2013). In eTSSO, we chose n_s = 10 to match n_inc of EQI.

Implementation details
For best case as well as worst case noise, we have the following 4 settings: (1) low budget, light noise; (2) low budget, heavy noise; (3) high budget, light noise; and (4) high budget, heavy noise. These are applied to each of the 3 analytical test functions, yielding a total of 2 × 4 × 3 = 24 scenarios. For the inventory problem, we have 2 scenarios: low budget and high budget. Recall that the budget is for the search and replication steps (steps 2 and 3 in Fig. 1). To evaluate the performance of an algorithm for a given scenario, we use 100 macroreplications: we run the algorithm 100 times, each time using a different initial Latin hypercube design. In a given macroreplication, all algorithms start with the same set of initial points (same x_i, f̄(x_i), and Var[f̄(x_i)]) and, hence, the same Kriging model (analogous to Picheny et al., 2013b). All the algorithms were coded in MATLAB. To fit the stochastic Kriging models, we used the code provided on http://stochastickriging.net/ ; for CKG we used the code on http://people.orie.cornell.edu/pfrazier/src.html ; the TSSO code was based on http://www.ise.nus.edu.sg/staff/ngsh/download/matlabdocs/SOK/ ; and the eTSSO code was obtained from the authors of Liu et al. (2014). The remaining algorithms were coded by the authors themselves and compared with the DiceOptim package in R (Picheny & Ginsbourger, 2014) to ensure correctness.
As most of the algorithms allow for revisits, the covariance matrix of the metamodel may become ill-conditioned; we follow the guidelines in Section 4.3 of Picheny and Ginsbourger (2014) for handling this issue. Table 7 shows the average running time of the different algorithms for one macroreplication in all the test problems (on a Dell desktop with a 3.4 gigahertz Core i7-2600 processor and 8 gigabytes of RAM). For the analytical test functions, we focused on heavy worst case noise; the structure or magnitude of the noise had only minor effects on the running times. As evident from the table, CKG and then EQI have the longest running times, while eTSSO requires the smallest computation time, followed by MQ. In eTSSO, the number of replications per iteration quickly increases (see the budget rule in Liu et al., 2014), and the budget is depleted after a relatively small number of iterations: this explains its short running time. CKG requires a lot of time to calculate the expected improvement of all 10,000 feasible alternatives in Hartmann-6 and thus, as shown in the table, takes around 24 hours per macroreplication. For the experiments in Section 4, we implemented this algorithm on a server with 5 cores and 100 threads, running each macroreplication on a different thread. However, in many problems, the feasible region contains millions of points (Xu, 2012); the computational requirements may limit the applicability of CKG in these cases.

Performance measure
Let x_m^r be the point returned by the algorithm in the identification step at macroreplication m = 1, 2, ..., 100. GAP_m = f(x_m^r) − f(x*) then represents the difference in objective value between the alternative suggested by the algorithm and the optimal solution. The distribution of GAP = (GAP_1, GAP_2, ..., GAP_100) is our main performance measure, visualized by means of a boxplot for each of the algorithms.
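In code, the GAP measure amounts to evaluating the returned points on the true (noise-free) objective. A minimal Python sketch, using a toy quadratic and made-up returned points standing in for the test functions and algorithms:

```python
import numpy as np

def gap_distribution(f, returned_points, f_opt):
    """GAP_m = f(x_m^r) - f(x*) for each macroreplication m."""
    return np.array([f(x) - f_opt for x in returned_points])

f = lambda x: (x - 2.0) ** 2        # toy objective: minimum f(x*) = 0 at x* = 2
returned = [2.1, 1.8, 2.0, 2.4]     # hypothetical x_m^r from 4 macroreplications
gaps = gap_distribution(f, returned, f_opt=0.0)

# the boxplots report exactly these summaries of the GAP distribution
q25, median, q75 = np.percentile(gaps, [25, 50, 75])
```

Note that GAP is always nonnegative, since no returned point can beat the true optimum; a distribution concentrated near zero indicates a well-performing algorithm.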

Results
In this section, we provide the results of our analysis. As outlined above, we study the following research questions:
1. Do the structure and magnitude of the noise, and the replication budget, affect the performance of the algorithms?
2. Do the algorithms that require an estimate of the noise variance function in step 2 (CKG, SKO, EQI) outperform the others, which do not need this information?
3. How do the algorithms perform in comparison to our benchmark algorithm (MQ)?
4. How do the algorithms compare in terms of their search strategies? Do they manage to locate good solutions?
5. Are the replication strategies of the algorithms effective? Do they help the algorithms distinguish the superior solutions in the identification step, or do the algorithms experience the identification problem?
In Section 4.1, we discuss the effect of the external factors listed in Table 5. Section 4.2 details the performance of the algorithms with respect to our performance measure, GAP. In Section 4.3 we discuss the effect of the identification step on the performance of the algorithms. As the results were similar for the Camelback and Branin functions and the (s, S) problem, we moved the results for the latter two to Appendix C and Appendix D, respectively; only the results for Camelback (Figs. 7 and 9) and Hartmann-6 (Figs. 8 and 10) are discussed in the text.

Effect of external factors
The effect of the external factors on the performance of the algorithms is largely the same for the different test problems. From Figs. 7 and 8, we can derive the following effects:
• Replication budget: comparing the low and high budget figures reveals that a higher replication budget improves the GAP distribution for all algorithms; for the analytical test functions, though, the difference is smaller with heavy noise: simulating more infill points does not significantly improve GAP when the noise is heavy.
• Noise structure (for the analytical test functions): Figs. 7 and 8 show that the performance of all algorithms usually degrades with worst case noise compared to best case noise, for all scenarios.
• Noise magnitude (for the analytical test functions): comparing the light and heavy noise figures, we see that stronger noise has a substantial negative impact on the performance of all algorithms. Moreover, the noise magnitude has a stronger influence on the performance of the algorithms than the noise structure.

Comparison with respect to GAP
Figs. 7 and 8 show that no algorithm clearly provides a smaller GAP than the others in all the tested scenarios. The benchmark algorithm MQ is very competitive in Camelback, Branin, and the (s, S) problem: only CKG and SKO tend to outperform MQ. TSSO and eTSSO give the largest GAP on average (especially with best case noise). The results for Hartmann-6 are somewhat different: as evident from Fig. 8, MQ is less competitive (especially with light noise); together with EQI, it usually gives the largest GAP. TSSO usually provides the best outcome (except with heavy best case noise), and is similar in performance to CKG.
The above observations show that, among the algorithms that require τ²(x) in step 2, CKG and SKO usually outperform the benchmark MQ algorithm, while EQI performs worse. This holds even in the (s, S) inventory problem, where τ²(x) is estimated via a metamodel (see Appendix D). Analogous to MQ, the TSSO and eTSSO algorithms do not require an estimate for τ²(x) in step 2.

Effect of the identification step
In the previous section, we noted that TSSO usually provides a poor GAP: compared to other algorithms, the final solutions it returns usually have higher objective values. This poor GAP performance can have two causes: (1) the algorithm fails to discover promising alternatives in the search step, so T (the set of sampled solutions) does not contain interesting points (the search problem); or (2) promising alternatives are present in T, but the algorithm fails to identify them in the identification step and thus returns an inferior solution to the user (the identification problem).
For the Camelback function, Fig. 9 shows the GAP performance of the algorithms if we had achieved perfect identification in the last step, i.e., if there were no identification problem. Each algorithm then returns the true best point among T as the final solution (the search strategies, and thus the sampled solutions, are unchanged). Compared to Fig. 7, TSSO performs much better: as evident from Fig. 9, in the absence of the identification problem, TSSO provides very good performance in all scenarios. Therefore, the poor GAP performance of TSSO in Fig. 7 is caused by the identification problem rather than the search problem; Fig. 9 shows that TSSO actually has a relatively good search strategy, as it usually discovers better alternatives than the other algorithms. It thus seems that the replication strategy of TSSO is not very successful: the algorithm usually cannot distinguish between the inferior and good solutions in the identification step (the same observations hold for the Branin function). For Hartmann-6, most of the algorithms already achieve perfect identification using the identification criteria in Table 2 (i.e., the GAP boxplots in Figs. 8 and 10 are identical in most cases).
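The perfect-identification counterfactual can be made concrete: keep the sampled set T and the replications fixed, and compare the point an algorithm would return (here, a minimum-sample-mean criterion such as TSSO's) against an oracle that returns the true best point in T. A hedged Python sketch, with a made-up objective, made-up replication counts, and homogeneous noise for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x * np.sin(x)               # stand-in for the true objective
T = rng.uniform(0.0, 10.0, size=20)       # T: the set of sampled solutions
n_reps = rng.integers(3, 12, size=T.size) # replications spent per point
tau = 2.0                                 # noise std (heavy, for illustration)

# sample means: noisy estimates of f, more precise where n_i is larger
y_bar = f(T) + rng.normal(0.0, tau / np.sqrt(n_reps))

x_identified = T[np.argmin(y_bar)]        # identification step: min sample mean
x_oracle = T[np.argmin(f(T))]             # perfect identification (as in Fig. 9)

# the identification problem is the extra gap from picking the wrong point in T
extra_gap = f(x_identified) - f(x_oracle)  # >= 0 by construction
```

Whenever `extra_gap` is positive, a good point was sampled but not returned: exactly the identification problem described above, which no amount of additional searching can fix.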
The replication strategy of the algorithms also affects their performance in the search step. Appendix F shows the number of distinct infill points (i.e., not counting revisits) sampled by the algorithms for the high budget scenarios. As evident from these figures, TSSO always samples 50 distinct infill points, as it does not allow revisits; all other algorithms sample fewer distinct points. This can also contribute to the relatively good search performance of TSSO. EQI and eTSSO usually yield the lowest number of distinct infill points. Especially with heavy noise, EQI becomes very "conservative" and spends many replications to obtain more precise estimates of the objective value: the replication budget is depleted after sampling only a few infill points. In eTSSO, the number of replications per iteration quickly increases (see Eq. (6)), implying that the budget is likewise depleted after a relatively small number of iterations.
The above observations lead us to conjecture that if the identification problem of TSSO could be solved, this algorithm would probably provide much better GAP results, and might even be preferred over CKG or SKO (as it does not require τ²(x) in step 2). This could be done either by changing the replication strategy used by the algorithm (for instance, using other ranking and selection procedures instead of OCBA), or by modifying the identification criterion. For the Branin and Camelback functions, Figs. C.3 and E.1 in the Appendix show, for instance, the change in performance when the identification criterion of TSSO is changed to choose the point with minimum Kriging prediction (as in CKG) instead of the point with minimum sample mean. This modified TSSO (MTSSO) almost always gives a better GAP performance than TSSO, and performs similarly to MQ; the identification criterion thus has a significant effect on the outcome of Kriging-based algorithms.
Although it is most acute in TSSO, all algorithms have identification issues in the inventory problem (Fig. D.1) and the Camelback and Branin functions (compare Fig. 7 vs. Fig. 9, and Fig. C.1 vs. Fig. C.2); in the latter two functions, the impact on the resulting GAP distribution is more severe with heavy noise, as higher noise also has a negative impact on the search step. The replication strategies of the algorithms are thus, in general, not helpful. While increasing B (the replication budget per iteration) might, at first sight, seem the obvious way to mitigate this problem, it is in general not easy to find the right B in practical problems where the optimal solution is unknown. Moreover, a higher B causes us to sample fewer infill points. While some algorithms (i.e., eTSSO and EQI) determine B dynamically during the run, our results reveal that these schemes are not very successful.
We thus believe that it is essential to focus on smarter replication strategies. One could include, for instance, an R&S procedure that acts as a "clean-up" in the identification step (see, e.g., the procedure proposed in Boesel, Nelson, & Kim, 2003); the aim of this R&S procedure is to discard the bad alternatives in T and find a sufficiently good one using as few replications as possible. Alternatively, R&S could be used in the optional replication step (see Fig. 1); Hong and Nelson (2007), for instance, propose an R&S procedure to identify the best alternative among the sampled ones in each iteration. In addition to mitigating the identification problem, this approach might also enable the algorithm to choose better infill points in the search step. While the OCBA technique (Chen et al., 2000) used in TSSO pursues a similar goal, our results indicate that it is not very effective. Although perfect identification (as in Figs. 9 and 10) is unlikely to result from any given R&S method, these figures illustrate the best possible improvement in GAP that we could hope to obtain.
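For reference, the OCBA rule of Chen et al. (2000) allocates replications so that, for non-best designs i and j, N_i/N_j = ((s_i/δ_{b,i}) / (s_j/δ_{b,j}))², with N_b = s_b √(Σ_{i≠b} N_i²/s_i²), where b is the current sample-best design and δ_{b,i} the gap in sample means. The following is a minimal one-round Python sketch, not the TSSO implementation; the input numbers are hypothetical, and ties in the sample means are not handled:

```python
import numpy as np

def ocba_split(means, stds, budget):
    """One round of OCBA (Chen et al., 2000) for minimization: returns a
    fractional split of `budget` replications over the designs."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    b = int(np.argmin(means))              # current sample-best design
    delta = means - means[b]               # gaps delta_{b,i} to the best
    others = [i for i in range(means.size) if i != b]
    ref = others[0]                        # reference non-best design
    ratio = np.ones(means.size)
    for i in others:
        ratio[i] = ((stds[i] / delta[i]) / (stds[ref] / delta[ref])) ** 2
    # N_b = s_b * sqrt(sum over i != b of N_i^2 / s_i^2)
    ratio[b] = stds[b] * np.sqrt(np.sum(ratio[others] ** 2 / stds[others] ** 2))
    return budget * ratio / ratio.sum()

# close competitors of the best receive most of the budget;
# a clearly inferior design (mean 2.0) receives very little
alloc = ocba_split(means=[1.0, 1.2, 2.0], stds=[1.0, 1.0, 1.0], budget=100)
```

The allocation concentrates replications on the best design and its close competitors, which is precisely the behavior a replication strategy needs in order to separate the top alternatives before the identification step.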
Evidently, R&S consumes replication budget. Note, however, that in our case, all algorithms provide better GAP performance in the low budget scenarios with perfect identification than in their corresponding high budget scenarios without perfect identification (e.g., compare Figs. 7 and 9 for the Camelback function; the same observations hold for the Branin function and the inventory problem). Therefore, reducing the budget of the search step to allow for a suitable R&S procedure could be beneficial. In Hartmann-6, R&S is less useful, since the criteria employed by the algorithms usually already achieve good identification; only with heavy noise might R&S still be beneficial.
Further research is needed on how to combine Kriging-based algorithms with R&S to achieve an appropriate replication strategy. In addition, deciding how much replication budget to allocate to the R&S step is not straightforward. For instance, when using R&S in a clean-up phase, we must decide when to stop simulating infill points and move to the R&S stage.

Conclusions and future research
In this paper, we have compared six recent Kriging-based algorithms for continuous optimization via simulation with heterogeneous noise. Using Kriging-based algorithms for optimizing systems modeled through stochastic simulation (especially with heterogeneous noise) is relatively new and constitutes an active research area: we think that the insights obtained in this article can be useful in improving these algorithms, and provide useful guidance for researchers who wish to employ them for solving engineering or business problems.
Our results showed that not only the magnitude but also the structure of the noise can affect the performance of the algorithms. More specifically, we observed that with worst case noise (highest noise in the global minimum region), all algorithms performed worse than with best case noise (lowest noise in the global minimum region). We also showed that increasing the replication budget could improve the performance of the algorithms, especially when the noise magnitude was low.
CKG, SKO, and EQI require an estimate of the noise variance function, but in practice, finding a reasonable estimate can be challenging. Among these algorithms, CKG and SKO tend to outperform our benchmark algorithm (MQ), while EQI does not. CKG is computationally very intensive, especially when the number of candidate points is high (e.g., the 10,000 feasible alternatives in the Hartmann-6 function). As the feasible space in some optimization problems includes millions of solutions, applying CKG to such problems might be impractical. In these cases, one can use the approximate knowledge gradient policy (Scott, Frazier, & Powell, 2011), which employs an approximation of the expected improvement function to speed up the algorithm. As shown by Scott et al. (2011), this policy performs similarly to SKO, at least in a setting with homogeneous noise.
Analogous to MQ, TSSO and eTSSO do not require information about the noise variance function. We did not observe a large difference between TSSO and eTSSO; TSSO usually even gave better results. The quality of the solutions returned by TSSO was strongly affected by its inability to identify good solutions among the simulated ones in the identification step: if it were successful in identifying the best alternatives, it would give significantly better results than MQ and would perform as well as CKG or SKO. In that case, it would likely even be preferred over CKG or SKO, since it does not require information about the noise variance function.
Our results indicate that all algorithms suffer from the identification problem; the resulting GAP distributions are worse in settings with heavy noise, as higher noise also negatively impacts the search step. Consequently, it is essential to combine the algorithms with smarter replication strategies (e.g., using appropriate ranking and selection methods). In our opinion, this is a key area for future research.
the quantile of the updated Kriging metamodel at x_{k+1}, with β ∈ [0.5, 1]. The next infill point is the point with maximum EQI(x) = E[I(x)]; Picheny et al. (2013a) explain the calculation of this expectation. Again, this calculation requires an estimate of τ²(x).

Fig. 4 .
Fig. 4. Expected cost function of the (s, S) inventory problem with exponential(λ) demand; an arrow points to x*.

Fig. 6 .
Fig. 6. MEI of the Camelback function for the first iteration of a given run, with best and worst case noise. The dark red regions of the figures have the highest values of MEI and are most likely to be sampled in the next step.

Figs. 7
Figs. 7 and 8 show that TSSO usually provides a better GAP than eTSSO (the same holds for the Branin function and the inventory problem, see Figs. C.1 and D.1).

Fig. 7 .
Fig. 7. GAP for the Camelback function (R_f = 7.3). For each scenario, the boxplots show the distribution of GAP for each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles.

Fig. 8 .
Fig. 8. GAP for the Hartmann-6 function (R_f = 3.3). For each scenario, the boxplots show the distribution of GAP for each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles.

Fig. 9 .
Fig. 9. GAP results for the Camelback function with perfect identification (R_f = 7.3). For each scenario, the boxplots show the distribution of GAP for each algorithm when the true best point among the sampled points is returned as the final solution. The median is indicated by a solid star; the edges of the boxes are the 25th and 75th percentiles.

Fig. 10 .
Fig. 10. GAP results for the Hartmann-6 function with perfect identification (R_f = 3.3). For each scenario, the boxplots show the distribution of GAP for each algorithm when the true best point among the sampled points is returned as the final solution. The median is indicated by a solid star; the edges of the boxes are the 25th and 75th percentiles.

Fig. C1 .
Fig. C1. GAP for the Branin function (R_f = 6). For each scenario, the boxplots show the distribution of GAP for each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles. The results are analogous to the Camelback function: CKG and SKO perform somewhat better than the benchmark algorithm (MQ), while the other algorithms do not. TSSO and eTSSO usually have the largest GAP.

Fig. C2 .
Fig. C2. GAP results for the Branin function with perfect identification (R_f = 6). For each scenario, the boxplots show the distribution of GAP for each algorithm when the true best point among the sampled points is returned as the final solution. The median is indicated by a solid star; the edges of the boxes are the 25th and 75th percentiles. Analogous to the Camelback function, we see that all algorithms suffer from the identification problem (especially TSSO).

Fig. C3 .
Fig. C3. GAP for the Branin function with MTSSO included (R_f = 6). For each scenario, the boxplots show the distribution of GAP for each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles. MTSSO usually provides a smaller GAP than TSSO.
Fig. F2. Number of distinct points sampled for the Camelback function (High budget). For each scenario, the boxplots show the distribution of the number of distinct points sampled by each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles.

Fig. F3 .
Fig. F3. Number of distinct points sampled for the Hartmann-6 function (High budget). For each scenario, the boxplots show the distribution of the number of distinct points sampled by each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles.

Fig. F4 .
Fig. F4. Number of distinct points sampled for the (s, S) inventory problem (High budget). For each scenario, the boxplots show the distribution of the number of distinct points sampled by each algorithm. The median is indicated by a solid circle; the edges of the boxes are the 25th and 75th percentiles.

Table 1
Overview of notations.

Table 3
Analytical test functions.

Table 5
External factors considered.

Table 6
Parameter settings for different algorithms.
Note: CKG does not require any parameters. τ(x) is known: as mentioned above, we assume a linear relation between τ(x) and the function value. We then estimate τ(x) at a point x by using the Kriging prediction f̂(x) instead of the unknown true function value.

Table 7
Average running time of different algorithms for one macroreplication (high budget) in seconds.