
I am new to GP and BO, and I have been playing with the two in a simple 1D context which happens to be practically relevant to what I am working on. Essentially, I am trying to find a peak (modeled as a Gaussian) buried in additive Gaussian noise: $f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac{x-p}{\sigma}}\right)^{2}} + \epsilon$ with $\epsilon\sim N(0,1)$, where I am trying to recover $p$. This is a toy version of a real-life problem where evaluating $f(x)$ is very expensive. I would also like to minimize the assumptions about the true $f(x)$, although I think it is likely to be unimodal (single-peaked). True knowledge of $\epsilon$ is also imprecise.

The only complexity is that the observational noise is of the same order of magnitude as the peak size.

I am using an off-the-shelf GP estimation function, Matlab's fitrgp. I haven't noticed the choice of kernel, or the initialization of the observation noise or kernel parameters, impacting the result very drastically. The way I *think* the data is being modeled is: $y = f(x) + \epsilon$ with $f(x) \sim GP(m(x), k(x, x'))$, where the variance of $\epsilon$ is the observational noise estimated from the data (although it can also be held constant).

I tried a few acquisition functions, but none seemed to work very well. For example, expected improvement gets stuck very quickly in a non-optimal area.

I think what is happening is that the GP is picking up on the observational noise as constant and large. The GP is returning a predictive standard deviation that is governed by the observational noise rather than by the uncertainty in the estimate of the mean. Because of this, the basic EI and PI functions are driven by the mean alone: they go for whatever the highest mean is and get stuck there. They get stuck because the standard deviation estimate does not change measurably the more you sample a given $x$, so that $x$ remains the most valuable point.

I made two modifications that both seem to correct this issue:

  1. Use PI, but make the choice of what to sample next by drawing from the normalized probabilities given by PI, rather than simply taking the max (a sketch of this follows the list).
  2. Instead of using the standard deviation, estimate the standard error of the mean (SEM) by calculating the number of effective samples at any given point. I do this by placing a Gaussian with peak value 1 on each sample and summing the result across samples.
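
For concreteness, here is a minimal sketch of modification 1, reusing the variable names from the Matlab code below (mu, sd, ybest_so_far, xs): instead of taking the argmax of PI, draw the next sample point with probability proportional to PI.

    % draw the next point with probability proportional to PI
    p = normcdf((mu - ybest_so_far)./(sd + 1e-9));  % PI at the candidates xs
    w = p ./ sum(p);                                % normalize to a distribution
    ix = find(rand <= cumsum(w), 1, 'first');       % inverse-CDF draw of one index
    xopt = xs(ix);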

Using the standard error of the mean makes a lot more sense to me than using the standard deviation. In the context of optimizing for machine learning, it seems to me that using the SD will result in trying to get the best possible single score, regardless of the actual underlying mean at a given location, while using the SEM will try to find the location with the highest mean score.

Since it seems that all the algorithms use SD rather than SEM, I suspect I might have misunderstood something along the way - so I was wondering if someone could point me in the right direction. Thanks!

Additional edit: this question stems from my surprise that the GP + PI/EI approach often does not converge (in my hands) on what I thought would be a relatively simple problem. As user Banach has pointed out, this is probably not the best approach for the problem I defined, but I would like to understand why it isn't working, since any real applications (at least in my field) are likely to be more complex than what I outlined. My more general question is: why aren't PI and EI based on the standard error of the mean rather than the standard deviation? Isn't the standard error of the mean a better reflection of the confidence in the mean, and therefore a better estimate of the likely payout of future acquisitions at any particular $x$?

Matlab code:

    clear;

    r = rng(0);            % fix the random seed for reproducibility
    xx = 10+rand*2 - 1;    % true peak location, uniform on [9, 11]

    effect_size = 1.5;  effect_breadth = 1;  observation_noise = 1;
    draw_noiseless = @(xo) effect_size * normpdf(xo, xx, effect_breadth)./normpdf(0, 0, effect_breadth);  % Gaussian peak scaled to height effect_size
    draw_noisy = @(xo) draw_noiseless(xo) + observation_noise*randn(size(xo));  % add observation noise

    n_its = 100;
    xl = 6; xu = 14; % limit acquisition range

    xo = [8];  % arbitrary starting point
    yo = draw_noisy(xo);

    for ix_its = 1:n_its
        model = fitrgp(xo, yo,  'KernelFunction', 'squaredexponential');

        % acquisition: score a fresh random candidate set each iteration
        xs = xl + rand(round((xu - xl) * 50), 1) * (xu - xl);  % ~50 candidates per unit length
        ybest_so_far = max(predict(model, xo));  % plug-in incumbent: best predicted mean at the sampled points
        [mu, sd] = predict(model, xs);
        % ss = 0.05;   % EDIT: potential SEM modification. Ideally ss would come from information about how the standard deviation varies.
        % X = normpdf(xs, xo.', ss)./normpdf(0, 0, ss);  % EDIT: potential SEM modification
        % sd_effective_n = sum(X, 2);  % EDIT: potential SEM modification
        % sd = sd./sqrt(sd_effective_n);  % EDIT: potential SEM modification

        [~, ixbest] = max(mu);
        xs_best = xs(ixbest);
        p = normcdf((mu - ybest_so_far)./(sd+1e-9));  % PI

        [~, ix] = max(p);
        xopt = xs(ix);

        % concatenate new data
        xo = [xo; xopt];
        yo = [yo; draw_noisy(xopt)];

        % plot
        x = linspace(0, 15).';
        [yp, sd] = predict(model,x);

        figure(1);clf;
        hold on
        scatter(xo,yo,'xr') % Observed data points
        plot(x,yp,'g')                   % GPR predictions
        patch([x;flipud(x)],[yp - sd(:,1);flipud(yp + sd(:,1))],'k','FaceAlpha',0.1);
        plot(xx * ones(1, 2), get(gca, 'ylim'))
        plot(x, draw_noiseless(x), 'k-');
        title(sprintf(['%d, %0.1f --> %0.1f'], ix_its, xx, xs_best));
        drawnow;
    end
  • Great question. Can you clarify a bit more the specification of the surrogate? For example, what do you mean exactly by "observational noise"? Do you estimate it, or set it ad hoc? If you spell out the math it will be easier for us, and it will also help you organize your thoughts. Commented Jan 1, 2023 at 20:41
  • Thanks for the reply. I tried to expand a bit. Commented Jan 2, 2023 at 3:14

1 Answer


Let me first make sure that we are on the same page: BO is a terrible approach for this particular problem, for many reasons, some of which I will now enumerate.

  1. If you could sample directly from the peak, i.e., draw $y_i \sim N(p, \sigma^2)$, "the best" way to get $p$ (best in multiple ways; for example, it is the MLE) would be the sample mean, $\hat{p} = \frac{1}{N} \sum_{i=1}^N y_i$.

But my understanding is that you are trying to learn BO, so let's pretend we want to estimate $p$ by maximizing the Gaussian curve. Your model can then be written as follows: I want to maximize the function $f \colon \mathbb{R} \to \mathbb{R}$ given by $$ f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac{x-p}{\sigma}}\right)^{2}}.$$

  2. This is not a black-box problem. You have an expression for $f$, so you can compute the gradient and use Newton's method (see the sketch after this list). Even if the formula were more complicated than this, BO would be very inefficient; you would be much better off computing the gradient by autodiff and using a gradient-based optimizer.

  3. This is a single-peaked problem. BO is a global algorithm, meaning that it is designed to explore the state space rather than to excel at exploiting local curvature the way local algorithms do (Newton, quasi-Newton, or trust-region methods). With an appropriate acquisition function and a suitably smooth kernel, BO is guaranteed to converge globally even on non-convex and noisy functions, but the price you pay is a much lower rate of convergence on convex problems than the alternatives mentioned above.

  4. $f$ is computationally cheap. The selling point of BO is that it can locate the basin of attraction of the global optimum of a very wild function with as few calls to $f$ as possible. It will not be good at polishing the optimum up to numerical precision (it will always over-explore), but it is useful when each call to $f$ is very costly and you only need an ok-ish solution. That is not the case here.
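
To make the point about Newton's method concrete, here is a minimal sketch with assumed values for $p$ and $\sigma$ (unknown in the real problem, of course): because $\log f$ is an exact quadratic, a single Newton step recovers the peak from any starting point.

    % Newton's method on log f(x); log f is quadratic, so one step suffices
    sigma = 1; p = 10;             % assumed true parameters, unknown in practice
    grad = @(x) -(x - p)/sigma^2;  % d/dx log f(x)
    hess = -1/sigma^2;             % d2/dx2 log f(x), a negative constant
    x0 = 0;                        % arbitrary starting point
    x1 = x0 - grad(x0)/hess;       % x1 equals p exactly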

Ok, but let's pretend it is a black-box function and you don't know it is single-peaked. You can only obtain noise-corrupted measurements, $$ y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0,1).$$ In a small sample, GP regression will not be able to tell apart the non-linearity in $f$ from the noise $\varepsilon_i$; Figure 5.5 in Rasmussen & Williams, "Gaussian Processes for Machine Learning", says it all. So it is usually best to fix the noise variance rather than estimate it, especially at the beginning.
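
In fitrgp, one way to do this is to supply the noise standard deviation and hold it fixed during fitting; a minimal sketch, assuming the noise level is known to be about 1:

    % fix the noise standard deviation at 1 instead of estimating it
    model = fitrgp(xo, yo, 'KernelFunction', 'squaredexponential', ...
                   'Sigma', 1, 'ConstantSigma', true);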


Update

To your point about the standard deviation: we use the predictive standard deviation in the formula for EI because it follows from the expression for $E[\max\{0, f - f^*\}]$ given that $f \sim GP(m, k)$. The SD is also nice because it turns out to be equal to the so-called "power function", which can be used to bound the approximation error in the $L_{\infty}$ norm, $\sup_u |f(u) - \hat{\mu}(u)|$, where $\hat{\mu}$ is the GP predictive mean (see e.g. [4], Proposition 3.5). So the SD is not as useless as it looks in this particular example.
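
Writing out that expectation gives the familiar closed form $EI(x) = (\mu(x) - f^*)\,\Phi(z) + s(x)\,\phi(z)$ with $z = (\mu(x) - f^*)/s(x)$, which in the notation of the question's code (mu, sd, ybest_so_far) would look like:

    % closed-form EI at the candidate points xs
    z  = (mu - ybest_so_far) ./ (sd + 1e-9);
    ei = (mu - ybest_so_far) .* normcdf(z) + sd .* normpdf(z);
    [~, ix] = max(ei);
    xopt = xs(ix);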

Three final remarks

  • EI (or PI) with noisy observations is always problematic because you don't know what the best value so far actually is. In the Matlab code, you have ybest_so_far = max(predict(model, xo));, which may be far off the truth given the magnitude of the noise. The max of the actual realizations isn't perfect either, again due to the noise. See [3] for more.
  • It is common practice to initialize BO over some quasi-uniform set, e.g., uniform random points (see e.g. [2], p. 473), because the initial recommendations of any acquisition function are rubbish (a sketch follows these remarks).
  • Take a look at the fitrgp documentation and try to limit the admissible bounds on the hyperparameters; too much change in the hyperparameters is not good. In the same paper, Jones et al. suggest re-estimating them every 10th or so iteration, not continuously.
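
On the initialization point, a minimal sketch (draw_noisy, xl, and xu are from the question's code; n_init = 8 is an arbitrary choice): seed the model with a jittered grid before starting the acquisition loop.

    % quasi-uniform initial design: one jittered point per grid cell
    n_init = 8;
    xo = xl + (xu - xl) * ((0:n_init-1)' + rand(n_init, 1)) / n_init;
    yo = draw_noisy(xo);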

But ultimately, this is a very difficult problem given that noise and signal are essentially isomorphic. Good luck!

[2]: Jones, D. R., Schonlau, M., and Welch, W. J. (1998). "Efficient Global Optimization of Expensive Black-Box Functions." Journal of Global Optimization, 13(4): 455–492. https://doi.org/10.1023/A:1008306431147

[3]: Picheny, V., Wagner, T., and Ginsbourger, D. (2013). "A Benchmark of Kriging-Based Infill Criteria for Noisy Optimization." Structural and Multidisciplinary Optimization, 48: 607–626.

[4]: Stuart, A., and Teckentrup, A. (2018). "Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions." Mathematics of Computation, 87(310): 721–753.

  • Thank you for your detailed answer. I should have clarified initially (and will edit the question later): the toy problem reflects a simplified version of a real-life problem. In the real problem, the function evaluations are very expensive. The function to be optimized is not known, although I can probably assume that it has a single peak. Estimating the mean and SD of a simpler peaked function and doing BO on that may be a better strategy in any case, so thank you for that. Regarding SEM, I did indeed mean standard error of the mean. Commented Jan 2, 2023 at 14:46
  • Regarding the point on standard deviation: "you just reduce the exploration incentives as n increases": I agree with this statement. "so the acquisition exploits relatively more the areas where the predictive mean is known to be high": I think what is happening is that the acquisition then goes for areas where the SEM is high (rather than the mean, because the precise mean is known where the SEM is low, so the biggest potential upside is where the SEM is high despite the mean being low). Commented Jan 2, 2023 at 15:21
  • But the SEM is high in the exact same areas in which the SD is high; I don't get this part. It seems it's not in the Matlab code. The Matlab code only does the PI exploration, right? Commented Jan 2, 2023 at 16:40
  • The Matlab code does PI; I added in comments what an SEM estimate might look like. The SEM is not high in the same locations as the SD: the SD is approximately constant everywhere. My thought is that because the observation noise is high, the SD doesn't really shrink when we repeatedly sample the same location, whereas the SEM does drop, so the algorithm then moves on to sampling something more useful. Thanks for taking the time to engage with this. Perhaps in most BO-GP scenarios observation noise is low relative to model uncertainty, so this issue does not crop up. Commented Jan 2, 2023 at 20:22
  • I added some final general thoughts on this. Given that the signal is isomorphic to the noise in this setup, I guess you can't expect too much from any algorithm. But an interesting problem, thanks! Commented Jan 3, 2023 at 19:25
