I am new to Gaussian processes (GPs) and Bayesian optimization (BO), and I have been playing with the two in a simple 1D context that happens to be practically relevant to what I am working on. Essentially, I am trying to find a peak (modeled as a Gaussian) buried in additive Gaussian noise, something like $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-p}{\sigma}\right)^{2}} + \epsilon$ with $\epsilon \sim N(0,1)$, where I am trying to recover $p$. This is a toy version of a real-life problem in which evaluating $f(x)$ is very expensive. I would also like to minimize the assumptions about the true $f(x)$, although I think it is likely to be a convex function. My knowledge of the true noise level is also imprecise.
The one complication is that the observation noise is of the same order of magnitude as the peak height.
I am using an off-the-shelf GP estimation function, MATLAB's 'fitrgp'. I haven't noticed the choice of kernel, or the initialization of the observation noise or kernel parameters, affecting the result very drastically. The way I *think* the data is being treated is this: $y(x) = f(x) + \epsilon$ with $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $\epsilon$ is the observation noise estimated from the data (although it can also be held constant).
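For reference, a minimal sketch of what I mean by holding the noise constant (assuming I am reading the 'Sigma' and 'ConstantSigma' options of 'fitrgp' correctly):

% fix the noise standard deviation at a guessed value instead of estimating it from the data
model = fitrgp(xo, yo, 'KernelFunction', 'squaredexponential', ...
    'Sigma', 1.0, 'ConstantSigma', true);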
I tried a few acquisition functions, but none seemed to work very well. For example, expected improvement (EI) gets stuck very quickly in a non-optimal area.
I think what is happening is that the GP is picking up the observation noise as constant and large. The standard deviation it returns is governed by the observation noise rather than by the uncertainty in the estimate of the mean. Because of this, the basic EI and PI functions are driven almost entirely by the mean itself: they go for whatever the highest mean is and get stuck there. They get stuck because the standard deviation estimate does not change measurably no matter how often you sample a given $x$, so that $x$ remains the most valuable point.
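If I understand the documentation, the standard deviation that 'predict' returns is for a new observation (latent uncertainty plus noise), and the fitted noise standard deviation is stored in the model's 'Sigma' property, so the noise contribution could in principle be peeled back out. A sketch of that assumption, mirroring the prediction line in the code below:

[mu, sd_obs] = predict(model, xs);                   % SD of a new observation (includes noise)
sd_latent = sqrt(max(sd_obs.^2 - model.Sigma^2, 0)); % approximate SD of the underlying mean itself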
I made two modifications that both seem to correct this issue:
- Use PI, but choose the next sample by drawing from the normalized PI values (treated as probabilities), rather than simply taking the max (see the sketch after this list).
- Instead of using the standard deviation, estimate the standard error of the mean (SEM) by computing an effective number of samples at any given point. I do this by placing a Gaussian with peak height 1 on each sample and summing the result across samples (this is the commented-out modification in the code below).
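A sketch of the first modification, using 'randsample' (Statistics and Machine Learning Toolbox) for the weighted draw; it would replace the '[~, ix] = max(p);' step in the code below:

p = normcdf((mu - ybest_so_far)./(sd + 1e-9));  % PI at each candidate point
w = p./sum(p);                                  % normalize PI into selection probabilities
ix = randsample(numel(xs), 1, true, w);         % draw the next point in proportion to PI
xopt = xs(ix);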
Using the standard error of the mean makes a lot more sense to me than using the standard deviation. In the context of optimizing for machine learning, it seems to me that using the SD will result in chasing the best possible single score, regardless of what the actual underlying mean at a given location is, while using the SEM will try to find the location with the highest mean score.
Since it seems that all the algorithms use the SD rather than the SEM, I suspect I might have misunderstood something along the way, so I was wondering if someone could point me in the right direction. Thanks!
Additional edit: this question stems from my surprise that the GP + PI/EI approach often does not converge (in my hands) on what I thought would be a relatively simple problem. As user Banach has pointed out, this is probably not the best approach for the problem I defined, but I would like to understand why it isn't working, since any real applications (at least in my field) are likely to be more complex than what I outlined. My more general question is: why aren't PI and EI based on the standard error of the mean rather than the standard deviation? Isn't the standard error of the mean a better reflection of the confidence in the mean, and therefore a better estimate of the likely payoff of future acquisitions at any particular $x$?
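To make the question concrete: the PI I compute is $\mathrm{PI}(x) = \Phi\!\left(\frac{\mu(x) - y^{*}}{\sigma(x)}\right)$, with $\Phi$ the standard normal CDF, $y^{*}$ the best predicted value so far, and $\sigma(x)$ the predictive standard deviation returned by the GP, which here includes the observation noise. My question is whether $\sigma(x)$ should instead be an SEM-like measure of the uncertainty in $\mu(x)$ itself, rather than the spread of individual noisy observations.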
MATLAB code:
clear;
rng(0);                             % fix the random seed for reproducibility
xx = 10 + rand*2 - 1;               % true peak location (to be recovered)
effect_size = 1.5; effect_breadth = 1; observation_noise = 1;
% noiseless objective: a Gaussian bump of height effect_size centred at xx
draw_noiseless = @(xo) effect_size * normpdf(xo, xx, effect_breadth)./normpdf(0, 0, effect_breadth);
% noisy objective: the bump plus additive Gaussian observation noise
draw_noisy = @(xo) draw_noiseless(xo) + observation_noise*randn(size(xo));
n_its = 100;                        % number of BO iterations
xl = 6; xu = 14;                    % limit acquisition range
xo = 8;                             % arbitrary starting point
yo = draw_noisy(xo);                % first (noisy) observation
for ix_its = 1:n_its
    % fit a GP to all data observed so far
    model = fitrgp(xo, yo, 'KernelFunction', 'squaredexponential');

    % acquisition: evaluate the model on random candidate locations
    xs = xl + rand(round((xu - xl) * 50), 1) * (xu - xl);
    ybest_so_far = max(predict(model, xo));
    [mu, sd] = predict(model, xs);
    % ss = 0.05; % EDIT: potential SEM modification. Ideally ss would come from information about how the standard deviation varies.
    % X = normpdf(xs, xo.', ss)./normpdf(0, 0, ss); % EDIT: potential SEM modification
    % sd_effective_n = sum(X, 2); % EDIT: potential SEM modification
    % sd = sd./sqrt(sd_effective_n); % EDIT: potential SEM modification
    [~, ixbest] = max(mu);
    xs_best = xs(ixbest);                          % location of highest predicted mean (for the title)
    p = normcdf((mu - ybest_so_far)./(sd + 1e-9)); % PI
    [~, ix] = max(p);
    xopt = xs(ix);                                 % next point to sample

    % append the new (noisy) observation
    xo = [xo; xopt];
    yo = [yo; draw_noisy(xopt)];

    % plot the current model against the truth
    x = linspace(0, 15).';
    [yp, sd] = predict(model, x);
    figure(1); clf;
    hold on
    scatter(xo, yo, 'xr')                   % observed data points
    plot(x, yp, 'g')                        % GP predictive mean
    patch([x; flipud(x)], [yp - sd; flipud(yp + sd)], 'k', 'FaceAlpha', 0.1); % +/- 1 SD band
    plot(xx * ones(1, 2), get(gca, 'ylim')) % true peak location
    plot(x, draw_noiseless(x), 'k-');       % noiseless objective
    title(sprintf('%d, %0.1f --> %0.1f', ix_its, xx, xs_best));
    drawnow;
end