
I'm dealing with a regression problem and have two datasets at my disposal. Dataset A is properly labeled and I use it to fit and validate my model; dataset B is unlabeled, so I can only visually inspect my model's performance on it. For all practical purposes, B can be thought of as the real-world data I would like to deploy the trained model on, so naturally the results on this data matter more.

The problem is that A and B have been drawn from slightly different "areas" of the problem domain. When I randomly split A into train and validation subsets, I often obtain a fit with a very good $R^2$ on the validation data that nevertheless performs very poorly on the test set B. My understanding is that this happens because the model interpolates during validation on the subset of A, while it has to extrapolate on B. The figure below illustrates this case on a simple 1-D example:

[Figure: bad fit with a good $R^2$]
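For readers who want to reproduce the effect, here is a minimal toy sketch of it (my own illustration, not part of the original setup): a polynomial regressor fit on a synthetic region A scores a high $R^2$ on a random validation split of A but fails badly on a shifted region B. The ranges, noise level, and choice of model are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_fn(x):
    # Some smooth ground truth; any nonlinear function works for this illustration
    return np.sin(x) + 0.1 * x

# Dataset A lives in x in [0, 6]; dataset B lives in the shifted range x in [6, 9]
x_a = rng.uniform(0, 6, 200)
y_a = true_fn(x_a) + rng.normal(0, 0.1, x_a.shape)
x_b = rng.uniform(6, 9, 100)
y_b = true_fn(x_b)  # labels for B exist only in this toy example, to quantify the failure

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())

# Random split of A: validation points are surrounded by training points,
# so the model only has to interpolate and the R^2 looks optimistic
x_tr, x_val, y_tr, y_val = train_test_split(x_a, y_a, test_size=0.3, random_state=0)
model.fit(x_tr.reshape(-1, 1), y_tr)

print("R^2 on random validation split of A:",
      r2_score(y_val, model.predict(x_val.reshape(-1, 1))))
print("R^2 on region B (extrapolation):",
      r2_score(y_b, model.predict(x_b.reshape(-1, 1))))
```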

Is there a way to perform the training-and-validation procedure on A that would give me a better estimate of the model's extrapolation performance on B? Or, more generally, what should I read to learn that what I'm trying to do (a) is called X and I should just refer to some source, (b) is generally impossible and/or wrong because of Y and I should read about that instead, or (c) is better handled by some approach Z that I should get acquainted with?

What I came up with so far is a "structured" way of splitting A into train and validation subsets: instead of a random split that samples A evenly, perform a "cut" and assign samples to the subsets by their location in the feature space (example in the figure below, with a sketch of the procedure after it). This forces the model to extrapolate during validation on A. I have already sketched a proof of concept based on zero-centering the data and cutting out a sphere of some radius (selected to achieve a desired proportion of train/validation sample counts); the model is fit to the data inside the sphere and validated everywhere outside it. In this setting, a poor $R^2$ on the validation subset of A does give me some indication of a poor fit on B. But is this methodologically valid? And is there something I could cite instead of giving an elaborate explanation of this procedure in my paper (which is not statistics-centric; I just use regression to solve a practical problem)?

[Figure: bad fit with a worse $R^2$]
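For concreteness, here is a minimal sketch of the radial split described above, assuming numpy and a 2-D feature matrix X for dataset A; the function name and the quantile-based radius selection are my own choices, not a quotable procedure.

```python
import numpy as np

def radial_split(X, train_fraction=0.7):
    """Boolean masks (train, val): train inside a sphere around the centroid, validate outside."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)               # zero-center the data
    dist = np.linalg.norm(centered, axis=1)     # distance of each sample from the centroid
    radius = np.quantile(dist, train_fraction)  # radius that yields the desired train proportion
    train_mask = dist <= radius
    return train_mask, ~train_mask

# Usage, assuming X, y hold dataset A and `model` is any regressor:
# train_mask, val_mask = radial_split(X, train_fraction=0.7)
# model.fit(X[train_mask], y[train_mask])
# print(r2_score(y[val_mask], model.predict(X[val_mask])))
```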

  • Great question! I'm interested in the same thing: if we're building a predictive machine learning model, and we know we are going to extrapolate (and aren't able to gather additional training data in the region in which we know we are going to extrapolate), how should we structure our train/validate/test splits? The default/textbook i.i.d. train/val/test split seems sub-optimal in the setting I am describing. – Commented Jul 29, 2020 at 22:52

1 Answer


My answer: since one cannot, in practice, even confirm the long-term accuracy of a simple regression model, one should expect only a diminishing probability of successful forecasting.

To attempt to quantify this diminishing accuracy in the classic regression scenario, a simple approach is to obtain a very long history, fit a best-fitting model on some randomly selected short time period, apply it to the longer historical series, and tabulate the declining accuracy. Repeat for various selected periods.

Since regression theory also supplies probabilistic predictions, it may be meaningful to assess accuracy by counting the number of time frames into the future before it becomes evident, based on history, that the model has failed.

This gives a contextually based quantitative estimate (and perhaps meaningful insight) into how a model built on the latest time period may behave going forward. It remains questionable, however, whether the driving forces producing change have stayed stationary: perhaps yes, if physical laws of nature drive the process, but otherwise, not likely.

With respect to machine learning, the good news: there is no need to divide up your database. The not-so-good news: obtain a related, much longer (essentially older) time series, perform the analysis suggested above, and use it as a guide to avoid proclaiming excessive expected forecasting accuracy.
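One possible reading of that tabulation procedure, sketched in numpy with a simple polynomial trend as the stand-in model (the window, horizon, and model choices are placeholders, not the exact method described above):

```python
import numpy as np

def error_by_horizon(y, window=50, horizon=100, degree=1, n_trials=20, seed=0):
    """Average absolute error of a polynomial trend fit, by number of steps past the fitting window.

    Requires len(y) > window + horizon.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    errors = np.zeros(horizon)
    for _ in range(n_trials):
        # Pick a random short period, fit on it, then score the following `horizon` steps
        start = rng.integers(0, len(y) - window - horizon)
        coeffs = np.polyfit(t[start:start + window], y[start:start + window], degree)
        future = slice(start + window, start + window + horizon)
        errors += np.abs(np.polyval(coeffs, t[future]) - y[future])
    return errors / n_trials  # errors[k] ~ typical error k+1 steps beyond the fitting window
```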

  • Neither my background nor my problem is in time series forecasting, so I assume that in a broader sense your argument could be stated as: obtain more samples from a larger area of the problem space and use them to quantify the model's performance under various conditions -- am I understanding you right? In any case, I am unable to acquire more data at this point (in this specific case it is rather expensive), so I have to work with what I've already got. Is there any reading you would recommend on this subject? – Commented Jul 1, 2020 at 14:09
