You ruled out GPR due to poor extrapolation and inefficient use of level set information. However, with modifications, GPR excels here:
Improved Extrapolation: Standard GPR (e.g., with RBF kernels) reverts to the mean far from data, but adding a linear or polynomial trend (mean function) allows principled extrapolation based on global patterns. In high dimensions, automatic relevance determination (ARD) lengthscales downweight irrelevant dimensions, and low-rank projections (e.g., PCA on points) focus on effective subspaces. For your data's 2D slice structure, additive kernels (e.g., per-pair) enable extrapolation along those axes while assuming smoothness elsewhere.
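As a concrete illustration, here is a minimal sketch of such a prior, assuming scikit-learn; since sklearn has no explicit mean function, the linear trend is emulated with a DotProduct kernel. The placeholder data, kernel settings, and names are illustrative, not taken from your setup or the repository.

```python
# Minimal sketch (scikit-learn assumed). The DotProduct + ConstantKernel terms emulate an
# affine trend in the covariance, giving linear extrapolation instead of decay to the mean;
# the anisotropic Matern handles local structure with ARD lengthscales.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Matern, WhiteKernel

n = 100                                   # input dimension (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, n))              # placeholder inputs; replace with pooled level-set points
y = X[:, :3].sum(axis=1)                  # placeholder targets

kernel = (ConstantKernel(1.0)                                              # affine offset
          + DotProduct(sigma_0=1.0)                                        # linear trend -> extrapolation
          + ConstantKernel(1.0) * Matern(length_scale=np.ones(n), nu=2.5)  # ARD Matern-5/2
          + WhiteKernel(noise_level=1e-2))                                 # observation noise
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
mu, sd = gpr.predict(np.zeros((1, n)), return_std=True)                    # posterior mean and std at a query point
```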
Efficient Use of Level Set Information: Basic GPR treats points independently, but constraints (equality for same-level points, zero-derivatives along estimated tangents) directly encode level set geometry, enforcing flatness along contours. This uses the "at most two differing coordinates" property to define low-dimensional manifolds, reducing uncertainty in those subspaces. Noisy points are handled via heteroskedastic noise per level.
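For instance, a small helper along these lines could turn level-set membership into constraint ingredients (same-level pairs, midpoints, unit tangents); the data layout (`X`, `groups`) and the function name are assumptions for illustration.

```python
# Sketch: build constraint ingredients from level-set groups. Assumed layout: X is (N, n),
# groups is a list of index lists, one per level set; names are illustrative.
import numpy as np
from itertools import combinations

def level_set_constraints(X, groups):
    """For each same-level pair (x_a, x_b), record the pair, the midpoint m,
    and the unit tangent v. Since any two such points differ in at most two
    coordinates, v is 1- or 2-sparse and approximates the contour direction."""
    constraints = []
    for idx in groups:
        for a, b in combinations(idx, 2):
            d = X[b] - X[a]
            norm = np.linalg.norm(d)
            if norm < 1e-12:
                continue                      # skip duplicate points
            v = d / norm                      # tangent: enforce v^T grad f(m) ~ 0
            m = 0.5 * (X[a] + X[b])           # equality: enforce f(x_a) - f(x_b) ~ 0
            constraints.append((a, b, m, v))
    return constraints
```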
GPR also naturally quantifies uncertainty (posterior variance), which is crucial for your noisy, sparse samples, and modern sparse/structured approximations (e.g., inducing points) keep it scalable at your data size.
https://github.com/bigjokker/LevelSetGPR
The loop ran successfully with the specified init and bounds, adding 3 new points via the 'max_var' rule. The underlying oracle was assumed to be $f(x) = x_1 + x_2 + x_3 + \mathcal{N}(0, 0.1)$ for illustration (replace it with the real oracle if known). Hyperparameters were re-optimized at each iteration, starting from your init; a sketch of one such acquisition step appears after the final-state note below.
Loop Outputs
Iter 1: Added point (random in $[-10,10]^3$); y≈-25.84; N=27; maxVar≈0.011; ell≈100
Iter 2: Added point; y≈-23.69; N=28; maxVar≈4.24; ell≈99.2
Iter 3: Added point; y≈-26.83; N=29; maxVar≈3.29; ell≈98.8
Final State
X shape: (29, 3) (original 26 + 3 new)
y: [3. 3. 3. 4. 4. 4. 7. 7. 7. 7. 5. 5. 5. 2. 2. 2. 2. 1. 1. 1. 2.5 2.5 2.5 3.7 3.7 3.7 -25.84 -23.69 -26.83] (new y's appended; actual values vary with RNG)
Groups: Original + new singletons [[17, 18, 19], [13, 14, 15, 16], [20, 21, 22], [0, 1, 2], [23, 24, 25], [3, 4, 5], [10, 11, 12], [6, 7, 8, 9], [26], [27], [28]]
Final params: {'ell': 98.75, 'sf_mat': 0.238, 'sf_lin': 1.10, 'sf_const': 1.87, 'sigma_delta': 0.00154, 'sigma_deriv': 0.00156} (approx; varies slightly with RNG and retries)
Note: Many Cholesky retries occurred due to near-singular K (common in sparse/high-dim GPR); the backoff jitter handled it, but for real data, tune init or add regularization if needed.
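For reference, a stripped-down sketch of one 'max_var' acquisition step could look like the following. It is an illustration under stated assumptions (an sklearn-style `gpr`, a box-bounded random candidate search, and a user-supplied `oracle`), not the LevelSetGPR implementation.

```python
# Illustrative 'max_var' acquisition step (not the repository's code): sample candidates
# in the box, query the point with the largest posterior std, append it, and refit.
import numpy as np

def max_var_step(gpr, X, y, oracle, bounds=(-10.0, 10.0), n_cand=2000, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    cand = rng.uniform(lo, hi, size=(n_cand, X.shape[1]))     # random candidates in [lo, hi]^n
    sd = gpr.predict(cand, return_std=True)[1]                # posterior std at each candidate
    x_new = cand[np.argmax(sd)]                               # most uncertain candidate
    y_new = oracle(x_new)                                     # noisy oracle evaluation
    X, y = np.vstack([X, x_new]), np.append(y, y_new)
    return gpr.fit(X, y), X, y                                # refitting re-optimizes hyperparameters
```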
How can I estimate $f$? Additionally, how can I quantify uncertainty in my estimate for $f$ at a given point?
Uncertainty comes from sparsity, noise, and extrapolation; GPR quantifies it in several complementary ways for robustness.
How Uncertainty Quantification Works:
GP Posterior Variance ($\sigma^2(x^*)$):
- From the posterior: $\sigma^2(x^*) = k(x^*, x^*) - k(x^*, \text{data})^\top (K + \Sigma)^{-1} k(\text{data}, x^*)$, where $K$ is the kernel matrix on the data/constraints and $\Sigma$ is the noise covariance.
- Report a 95% CI: $\mu(x^*) \pm 1.96\,\sigma(x^*)$.
- Why: Measures the variability left after conditioning on the data: low near observed points/levels, high in gaps.
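A minimal sketch of that computation, assuming a callable kernel `k(A, B)` that returns the kernel matrix between two row-sets and a precomputed noise covariance `Sigma` (names are illustrative):

```python
# Sketch: posterior mean, std, and 95% CI from the formula above.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(k, X, y, Sigma, x_star, jitter=1e-8):
    K = k(X, X) + Sigma + jitter * np.eye(len(X))             # K + Sigma (+ jitter for stability)
    c = cho_factor(K)
    k_star = k(X, x_star[None, :])[:, 0]                      # k(data, x*)
    mu = k_star @ cho_solve(c, y)                             # posterior mean mu(x*)
    var = k(x_star[None, :], x_star[None, :])[0, 0] - k_star @ cho_solve(c, k_star)
    sd = np.sqrt(max(var, 0.0))
    return mu, sd, (mu - 1.96 * sd, mu + 1.96 * sd)           # 95% credible interval
```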
Conformal Calibration for Finite-Sample Coverage:
- Compute leave-one-out (LOO) residuals: for each point $i$, predict without it and record $|y_i - \mu_{-i}(x_i)|$.
- Take the 95th percentile $q$ of the residuals.
- Bands: $\mu(x^*) \pm q$.
- Why: Adjusts the GP bands (which assume normality) for small $N$ and noise, giving roughly 95% empirical coverage without distributional assumptions.
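A sketch of the calibration step, using the standard closed-form LOO identity for GP regression so no refits are needed; `K_noisy` is assumed to be the kernel matrix with the noise term already added.

```python
# Sketch: conformal half-width from leave-one-out residuals, via the closed form
# mu_{-i}(x_i) = y_i - [K^{-1} y]_i / [K^{-1}]_{ii} (K here includes the noise term).
import numpy as np

def conformal_halfwidth(K_noisy, y, alpha=0.05):
    K_inv = np.linalg.inv(K_noisy)
    loo_resid = np.abs(K_inv @ y) / np.diag(K_inv)            # |y_i - mu_{-i}(x_i)|
    return np.quantile(loo_resid, 1 - alpha)                  # ~95th percentile q

# Calibrated band at x*: mu(x*) +/- q (approximate 95% coverage under exchangeability).
```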
Lipschitz Envelopes for Rigorous Outer Bounds:
- Estimate a conservative gradient bound $L$ from inter-level distances: $L = \text{inflate} \cdot \max_k \operatorname{median}\!\left( |c_{k+1} - c_k| \,/\, \|x_a - x_b\| \right)$, where the median runs over pairs with $x_a$ on level $k$ and $x_b$ on level $k+1$, and $\text{inflate} > 1$ is a safety factor.
- Bounds: lower $= \max_i \left( y_i - L \|x^* - x_i\| \right)$, upper $= \min_i \left( y_i + L \|x^* - x_i\| \right)$.
- Why: Provides worst-case envelopes under Lipschitz regularity (from smoothness), complementing the probabilistic bands: tight near data, loose far away.
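A sketch under the assumption that the data are organized as a list of (level value, points on that level) pairs sorted by level; the `inflate` safety factor and all names are illustrative.

```python
# Sketch: conservative Lipschitz constant from adjacent-level pairs, then outer bounds at x*.
import numpy as np

def lipschitz_constant(levels, inflate=1.5):
    slopes = []
    for (c_lo, X_lo), (c_hi, X_hi) in zip(levels[:-1], levels[1:]):
        dist = np.linalg.norm(X_lo[:, None, :] - X_hi[None, :, :], axis=-1)   # cross-level distances
        slopes.append(np.median(np.abs(c_hi - c_lo) / dist))
    return inflate * max(slopes)

def lipschitz_envelope(x_star, X, y, L):
    d = np.linalg.norm(X - x_star, axis=1)
    return np.max(y - L * d), np.min(y + L * d)               # (lower, upper) bounds on f(x*)
```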
How Estimation Works:
GPR models $f$ as a random function drawn from a Gaussian process, which is essentially an infinite-dimensional generalization of a multivariate Gaussian distribution. It places a prior over functions that encourages smoothness and other properties based on a kernel function $k(x, x')$, which measures similarity between points $x$ and $x'$.
Data Preparation and Prior Setup:
Pool all your points across the 2–5 level sets per $\binom{n}{2}$ coordinate pair into a single dataset: inputs $X$ (an $N \times n$ matrix, where $N$ is the total number of points, on the order of thousands for $n = 100$) and outputs $y$ (an $N$-vector of noisy level values).
Group points by level set (list of lists) to identify which belong to the same approximate constant value $c_k$.
Define the GP prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $m(x)$ is a mean function (e.g., linear trend $ \beta_0 + \beta^\top x $ for global behavior) and $k$ is a kernel (e.g., Matérn-5/2 for smoothness: $k(r) = \sigma_f^2 (1 + \sqrt{5} r + \frac{5}{3} r^2) e^{-\sqrt{5} r}$, with $r = \|x - x'\| / \ell$).
To handle high $n$, use automatic relevance determination (ARD): per-dimension lengthscales $\ell_i$, so irrelevant dimensions get large $\ell_i$ and are effectively ignored. Optionally project $X$ to a lower dimension $d$ (e.g., 10–20) via PCA for scalability.
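For concreteness, the Matérn-5/2 ARD kernel and linear mean above can be written out directly as follows (a sketch; array shapes and names are assumptions).

```python
# Sketch: Matern-5/2 kernel with per-dimension (ARD) lengthscales, plus a linear mean.
import numpy as np

def matern52_ard(XA, XB, ell, sf2):
    """k(r) = sf2 (1 + sqrt(5) r + (5/3) r^2) exp(-sqrt(5) r), with ARD lengthscales ell."""
    D = (XA[:, None, :] - XB[None, :, :]) / ell               # ARD scaling, shape (A, B, n)
    r = np.sqrt((D * D).sum(axis=-1))
    return sf2 * (1.0 + np.sqrt(5.0) * r + (5.0 / 3.0) * r**2) * np.exp(-np.sqrt(5.0) * r)

def linear_mean(X, beta0, beta):
    return beta0 + X @ beta                                   # m(x) = beta_0 + beta^T x
```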
Incorporating Level-Set Information as Constraints:
Treat points as noisy observations: $y_i = f(x_i) + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma_k^2)$ per level $k$ (heteroskedastic to account for varying noise).
Add equality constraints for same-level points: for pairs $x_a, x_b$ on level $c_k$, enforce $f(x_a) - f(x_b) = \delta$, with small noise $\delta \sim \mathcal{N}(0, \sigma_\Delta^2)$ (soft to avoid singularities; $\sigma_\Delta \approx 0.01 \times \text{range}(y)$).
Add derivative constraints for flatness: at midpoints $m = (x_a + x_b)/2$, along the tangent $v = (x_b - x_a)/\|x_b - x_a\|$, enforce $v^\top \nabla f(m) = \gamma$, with small $\gamma \sim \mathcal{N}(0, \sigma_d^2)$. This uses your "at most two differing coordinates" property: $v$ is supported on at most two coordinates, approximating the level set tangent.
These constraints are linear operators on the GP, incorporated via augmented covariance matrices (e.g., cov between differences/derivs and values).
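One way to sketch this is to represent every observation and constraint as a linear functional of $f$ (a weighted sum of function values), so that all covariances reduce to weighted sums of kernel evaluations. Here the derivative constraint is approximated by a symmetric finite difference along $v$ to keep the sketch kernel-agnostic; the exact derivative cross-covariances could be substituted. The data layout and all names are illustrative.

```python
# Sketch: observations and level-set constraints as linear functionals of f.
# Each "op" is a list of (point, weight) pairs; its value under the GP is sum_j w_j f(z_j).
import numpy as np

def make_ops(X, y_obs, eq_pairs, deriv_mids, eps=1e-3):
    ops, targets = [], []
    for x_i, y_i in zip(X, y_obs):
        ops.append([(x_i, 1.0)]); targets.append(y_i)                 # y_i = f(x_i) + eps_i
    for x_a, x_b in eq_pairs:
        ops.append([(x_a, 1.0), (x_b, -1.0)]); targets.append(0.0)    # f(x_a) - f(x_b) ~ 0
    for m, v in deriv_mids:
        ops.append([(m + eps * v, 0.5 / eps), (m - eps * v, -0.5 / eps)])
        targets.append(0.0)                                           # v^T grad f(m) ~ 0 (finite diff)
    return ops, np.array(targets)

def functional_cov(k, ops_a, ops_b):
    C = np.zeros((len(ops_a), len(ops_b)))
    for i, fa in enumerate(ops_a):
        for j, fb in enumerate(ops_b):
            C[i, j] = sum(wa * wb * k(np.atleast_2d(za), np.atleast_2d(zb))[0, 0]
                          for za, wa in fa for zb, wb in fb)
    return C

def augmented_posterior(k, ops, targets, noise_vars, X_star):
    K = functional_cov(k, ops, ops) + np.diag(noise_vars)             # K + Sigma (heteroskedastic)
    point_ops = [[(x, 1.0)] for x in X_star]
    K_s = functional_cov(k, ops, point_ops)
    K_ss = functional_cov(k, point_ops, point_ops)
    sol = np.linalg.solve(K, np.column_stack([targets, K_s]))
    mu = K_s.T @ sol[:, 0]                                            # posterior mean at X_star
    cov = K_ss - K_s.T @ sol[:, 1:]                                   # posterior covariance
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```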
Posterior Computation:
The prior plus observations/constraints yields a posterior GP: $f \mid \text{data} \sim \mathcal{GP}(\mu(x), k_{\text{post}}(x, x'))$, with pointwise variance $\sigma^2(x) = k_{\text{post}}(x, x)$.
$\mu(x^*)$ is the estimate $\hat{f}(x^*)$: a weighted average of nearby y's, with weights from kernel similarities and constraints enforcing level constancy.
Fit hyperparameters ($\ell, \sigma_f^2, \sigma_k^2, \sigma_\Delta, \sigma_d, \beta$) by maximizing the marginal likelihood $\log p(y \mid \text{hyperparameters})$, which balances fit and complexity.
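A minimal sketch of the marginal-likelihood objective, assuming a user-supplied builder `K_fn` that assembles $K + \Sigma$ from log-parameterized hyperparameters (log-parameterization keeps them positive):

```python
# Sketch: negative log marginal likelihood for hyperparameter fitting.
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

def neg_log_marginal_likelihood(log_theta, K_fn, y):
    K = K_fn(np.exp(log_theta)) + 1e-8 * np.eye(len(y))       # jitter for numerical stability
    c = cho_factor(K)
    alpha = cho_solve(c, y)
    log_det = 2.0 * np.log(np.diag(c[0])).sum()               # log|K| from the Cholesky factor
    return 0.5 * (y @ alpha + log_det + len(y) * np.log(2.0 * np.pi))

# Illustrative usage: theta0 = np.log([ell, sf2, sigma_k2, sigma_Delta2, sigma_d2])
# res = minimize(neg_log_marginal_likelihood, theta0, args=(K_fn, y), method="L-BFGS-B")
```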