
I have a program for heat-exchanger calculations that uses complex, highly non-linear correlations. I need to come up with an approximation of the function it computes, using regression.

The function takes two inputs: inlet air temperature $T_{a,i}$ and valve control signal $u$. The output is $\Delta T_a$. I have plotted the output values from the program for a range of input values. How can the function be approximated with regression?

EDIT: The data are generated by the following Python program, which implements the heat-exchanger model in [2] that is also used in the Building Simulation Library. I need to find approximations for the efficiency and for the outlet-inlet temperature differences of the air and water streams. The efficiency is easy to approximate with an exponential function, but I am struggling with the temperature differences.

# Air-water finned heat exchanger model
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib
exp = np.exp

def fcumodel(T_a_in, T_w_in, fs, val):
    # Nominal parameters
    T_a_in_0 = 20.0  # Nominal air inlet temperature
    T_w_in_0 = 40.0  # Nominal water inlet temperature
    T_a_out_0 = 27.0 # Nominal air outlet temperature
    T_w_out_0 = 32.0 # Nominal water outlet temperature

    c_a = 1005.0     # specific heat capacity of air
    c_w = 4180.0     # specific heat capacity of water

    rh = 0.5         # ratio of heat transfers (at nominal conditions)

    mdota_0 = 5.0    # Nominal air mass flow rate
    mdotw_0 = 1.0    # Nominal water mass flow rate

    #NTU0 = calculate_nominal(T_a_in_0, T_w_in_0, T_a_out_0, T_w_out_0, mdota_0, mdotw_0)
    #print NTU0

    NTU0 = 0.747     # precomputed externally for above parameters

    UA_0 = NTU0 * min(c_a * mdota_0, c_w * mdotw_0)   # Nominal conductance

    hw_0 = UA_0 * (rh+1)/rh   # water side heat transfer

    ha_0 = rh * hw_0          # air side heat transfer

    # operating values
    mdota = fs * mdota_0      # air mass flow rate  
    mdotw = val * mdotw_0     # water mass flow rate

    Cdota = c_a * mdota       # air capacitance rate
    Cdotw = c_w * mdotw       # water capacitance rate

    x_a = 1 + 4.769 * 1e-3 * (T_a_in - T_a_in_0)   # air side heat transfer temperature variation

    ha = ha_0 * x_a * (mdota/mdota_0)**0.7         # air side heat transfer

    s_w = 0.014/(1 + 0.014 * T_w_in)               # water side heat transfer temperature sensitivity

    x_w = 1 + s_w * (T_w_in - T_w_in_0)            # water side heat transfer temperature dependence

    hw = hw_0 * x_w * (mdotw/mdotw_0)**0.85        # water side heat transfer

    UA = 1/(1/hw + 1/ha)         # overall heat exchanger conductance             

    Cdotmin = min(Cdota, Cdotw) 
    Cdotmax = max(Cdota, Cdotw)
    Z = Cdotmin/Cdotmax

    if Cdota < Cdotw:
        small = 'air'
    else:
        small = 'water'

    NTU = UA/Cdotmin

    eff = 1 - exp((exp(-Z * (NTU ** 0.78) ) - 1) * (NTU ** 0.22) / Z)    # Cross flow heat exchanger eff

    Qdot = eff * Cdotmin * (T_w_in - T_a_in)

    T_a_out = T_a_in + Qdot/Cdota
    T_w_out = T_w_in - Qdot/Cdotw

    print "eff", eff 
    #print "det:", eff*(Z+1)

    return Qdot, T_a_out, T_w_out, 1/UA, NTU, eff, small


R_a = np.arange(15,21,0.5)
R_u = np.arange(0.1,1,0.05)

N_a = R_a.shape[0]
N_v = R_u.shape[0]

EFF = np.zeros((N_a,N_v),dtype = 'float64')
Ta = np.zeros((N_a,N_v), dtype = 'float64')
Tw = np.zeros((N_a,N_v), dtype = 'float64')


for i in range(N_a):
    for j in range(N_v):
        T_a_in = R_a[i]
        T_w_in = 35          # fixed water inlet temperature

        fs = R_u[j]
        val = R_u[j]

        Qdot, T_a_out, T_w_out, R, NTU, eff, small = fcumodel(T_a_in, T_w_in, fs, val)

        EFF[i][j] = eff
        Ta[i][j] = T_a_out - T_a_in
        Tw[i][j] = T_w_out - T_w_in


x, y = np.meshgrid(R_u, R_a)

fig1 = plt.figure(1)
ax = fig1.add_subplot(111, projection='3d')
ax.plot_surface(x, y, EFF)


fig2 = plt.figure(2)
ax = fig2.add_subplot(111, projection='3d')
ax.plot_surface(x, y, Ta)

fig3 = plt.figure(3)
ax = fig3.add_subplot(111, projection='3d')
ax.plot_surface(x, y, Tw)

plt.show()

[Figure: heating coil]

[2]: Wetter, M. (1999). Simulation model: Finned water-to-air coil without condensation (Report No. LBNL-42355). Lawrence Berkeley National Laboratory, Berkeley, CA (US).

  • It's so smooth that any nonlinear curve fitting should work here. In regression you either linearize the data or use something like a GAM, which should work very well. (Commented Dec 26, 2017 at 21:36)
  • Would you please post a link to the data? I will run it through the 3D surface equation "function finder" on my web site to see if I can find a simple approximating function. (Commented Dec 26, 2017 at 21:37)
  • For your approximating function, how much do you care about precision (i.e. how close the approximation is to the actual curve) vs. intuitive ease of communication? For example, fractional polynomial regression can provide precise functional fitting, while things like hinge functions can intuitively describe functions in a nonlinear least squares model (a la "linear until $X =$ some value, when the line changes slope"). Also: how much do you care about robustness across different samples? (Commented Dec 26, 2017 at 22:51)
  • It looks like the output is a linear function of one of the two variables (I guess $T_{a,i}$), with a slope (and possibly an intercept) which depends on the value of the other variable, and a simple power function in the other variable. Something like $y(T_{a,i}, u) = \beta_0+\beta_1 T_{a,i}u^{\frac{1}{\beta_2}}$ with $\beta_2\geq 1$. You could easily fit this with constrained NLS. If, for each fixed $u$, the intercept of the resulting linear function of $T_{a,i}$ depends on $u$, then the expression gets a little more complicated. (Commented Dec 26, 2017 at 23:25)

2 Answers


Introduction

You don't need interpolation for this problem. In general it's true that interpolation is well suited to approximating an unknown, smooth function of a few inputs ($\mathbf{x}=(x_1,\dots,x_n)$, with $n$ small) over a hypercube, starting from the function values over a set of training points $S=\{(\mathbf{x}_i,y_i)\}_{i=1}^N$ located in the hypercube. This is especially true when function values at new training points can be generated cheaply, which is your case since your code runs very quickly. Also, interpolation has (close to machine) zero approximation error at points in $S$: this is a disadvantage (overfitting) when trying to approximate noisy data, but it's actually the correct behavior when approximating data which are known without error, as in your case.

On the other hand, interpolation has some distinct disadvantages:

  • the number of parameters you need to store to perform interpolation grows with the size of the training set, so memory requirements grow quickly as you add training points.
  • making predictions with some interpolation methods (for example vanilla Gaussian Processes) is much slower than making predictions with a simple parametric regression model, even though it's probably still way faster than running your computer code
  • the approximation error (difference between values predicted by the interpolation method $y_p$ and unknown function values $y$) grows a lot as soon as you move outside the region of the input parameter space "spanned" by the training set (extrapolation). Regression models usually have the same issue, but not always (see below).

In your case it's evident that, for each fixed $u$, $\Delta T$ is very well approximated by a linear function of $T_{a,in}$, while for each fixed $T_{a,in}$, $\Delta T$ is very well approximated by a logarithmic function of $u$ (plus an intercept). It's important to verify that the model makes physical sense: based on your physical knowledge of the problem, is it plausible that for fixed $u$, $\Delta T$ is linear in $T_{a,in}$, and for fixed $T_{a,in}$, it is logarithmic in $u$? If so, we can try to fit the following model:

$$\Delta T \approx \beta_0+\beta_1 T_{a,in} + \beta_2 \log(u) + \beta_3 T_{a,in}\log(u) $$

and check whether its accuracy is still good on extrapolated data points not used for training. Note that this model has only 4 parameters.


Model Fitting

I will use R to show the approach:

# R_a and R_u are the corresponding vectors in your code
R_a <- seq(15, 20.5, by = 0.5)
R_u <- seq(0.1, 0.95, by = 0.05)

# DeltaT_matrix is the output matrix which you called Ta in your Python code
DeltaT_matrix <- structure(c(9.16474143429817, 8.94118290271588, 8.71731921196482, 
8.49315204987976, 8.26868309204161, 8.04391400188728, 7.81884643081838, 
7.59348201830876, 7.36782239201091, 7.14186916786124, 6.9156239501842, 
6.68908833179539, 8.77268383665777, 8.55914642773912, 8.3452924610834, 
8.13112363233619, 7.91664162521941, 7.70184811163494, 7.48674475176713, 
7.27133319418426, 7.05561507593893, 6.83959202266747, 6.62326564868829, 
6.40663755709921, 8.48960063743449, 8.28327754024661, 8.07663085966664, 
7.86966228789457, 7.66237350551092, 7.45476618157531, 7.24684197372405, 
7.03860252826679, 6.8300494802822, 6.62118445371259, 6.4120090614577, 
6.20252490546751, 8.26779964588656, 8.06711631671647, 7.86610461555747, 
7.66476622681244, 7.46310282353722, 7.26111606753518, 7.05880760945069, 
6.85617908886182, 6.65323213437205, 6.44996836370108, 6.24638938377474, 
6.04249679081404, 8.08543143687201, 7.88937668440084, 7.69299009870359, 
7.49627335453585, 7.29922811555063, 7.10185603438936, 6.90415875277201, 
6.70613790158635, 6.50779510097636, 6.30913196042974, 6.11015007886457, 
5.91085104471518, 7.93061256207576, 7.73848107017399, 7.54601514630728, 
7.3532164548195, 7.16008664917265, 6.96662737203485, 6.7728402553675, 
6.57872692051164, 6.38428897827341, 6.18952802900874, 5.99444566270719, 
5.79904345907504, 7.79614628340926, 7.60741771126701, 7.41835270460851, 
7.2289529171241, 7.03921999182307, 6.84915556111902, 6.65876124691438, 
6.46803866068416, 6.27698940355893, 6.08561506640695, 5.89391722991556, 
5.70189746467181, 7.67733775913916, 7.49161228525216, 7.30554880534835, 
7.11914896248988, 6.93241438924182, 6.74534670775515, 6.55794752984902, 
6.37021845709215, 6.18216108088354, 5.99377698253235, 5.80506773333712, 
5.61603489466418, 7.5709520023041, 7.3879127571883, 7.20453425705099, 
7.02081813449161, 6.83676601178218, 6.65237950094813, 6.46766020384846, 
6.28260971225508, 6.09722960793146, 5.91152146271053, 5.72548683857184, 
5.53912728771803, 7.47466418569886, 7.29405385400307, 7.11310326613761, 
6.9318140444779, 6.75018780122923, 6.5682261385056, 6.38593064840804, 
6.20330291310201, 6.02034450489423, 5.83705698630867, 5.65344191016203, 
5.46950081963829, 7.3867468512066, 7.20835232899164, 7.02961674413013, 
6.85054170904833, 6.67112882614881, 6.49137968788769, 6.31129587685102, 
6.13087896583062, 5.95013051789909, 5.76905208648422, 5.58764521544269, 
5.40591143913309, 7.30588106490289, 7.12952297600297, 6.95282317363471, 
6.77578326056511, 6.59840482967455, 6.4206894640324, 6.24263873697182, 
6.06425421216397, 5.88553744369151, 5.70648997612139, 5.52711334457709, 
5.34740907481004, 7.23103683577564, 7.05656212263367, 6.88174517153485, 
6.70658757587974, 6.53109091931105, 6.35525677578768, 6.17908670965812, 
6.00258227573309, 5.82574501935767, 5.64857647648261, 5.47107817373513, 
5.29325162848903, 7.16139436134207, 6.98867089917336, 6.81560477840227, 
6.64219758335165, 6.46845088870781, 6.29436625959328, 6.1199452516386, 
5.94518941105382, 5.77010027469909, 5.59467937015477, 5.41892821579085, 
5.24284832083573, 7.09629042461093, 6.92520301178304, 6.75377260599608, 
6.58200078277584, 6.40988910812628, 6.23743913860088, 6.06465242137331, 
5.89153049430752, 5.71807488602711, 5.54428711598418, 5.37016869452754, 
5.19572112297026, 7.03518087250459, 6.86562818364286, 6.69573223984451, 
6.52549460810856, 6.35491684602097, 6.18400050182467, 6.01274711448908, 
5.84115821377896, 5.66923532032263, 5.49697994567962, 5.32439359240773, 
5.15147775412946, 6.97761370854148, 6.80950593760835, 6.64105471097168, 
6.47226158736275, 6.30312811620306, 6.13365583767325, 5.96384628278145, 
5.79370097343104, 5.62322142248771, 5.45240913384602, 5.28126560249541, 
5.1097923145855, 6.92320938647168, 6.75646639488318, 6.58937979896262, 
6.42195114942078, 6.25418198775705, 6.08607384632729, 5.91762824841104, 
5.74884670827825, 5.57973073125531, 5.41028181379056, 5.24050144351922, 
5.07039109932776), .Dim = c(12L, 18L), .Dimnames = list(NULL, 
    NULL))

We now reshape the data to streamline the estimation of the regression model:

# flatten response and create design matrix from predictor levels
DeltaT <- as.vector(DeltaT_matrix)
Ta_in <- rep(R_a, 18)
u <- rep(R_u, each = 12)

# assemble data frame for modeling
df <- data.frame(Ta_in, u, DeltaT)

Finally, we fit the linear model:

# fit linear model
my_model <- lm(DeltaT ~ Ta_in * log(u), data = df)
summary(my_model)
# 
# Call:
#   lm(formula = DeltaT ~ Ta_in * log(u), data = df)
# 
# Residuals:
#   Min         1Q     Median         3Q        Max 
# -0.0165326 -0.0020604  0.0003775  0.0032458  0.0070474 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  11.8946112  0.0051852  2294.0   <2e-16 ***
#   Ta_in        -0.3343834  0.0002908 -1150.1   <2e-16 ***
#   log(u)       -1.7572983  0.0050434  -348.4   <2e-16 ***
#   Ta_in:log(u)  0.0504915  0.0002828   178.5   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.004541 on 212 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 2.513e+06 on 3 and 212 DF,  p-value: < 2.2e-16


my_summary <- summary(my_model)
my_summary$adj.r.squared
# [1] 0.9999715

The adjusted $R^2$ is very high, especially considering that we have a lot of training points and just 4 parameters. To get another measure of approximation accuracy, let's compute the root mean squared relative error (RMSRE); we can do this because DeltaT never gets close to 0.

RMSRE <- function(residuals, response){
     sqrt(mean((residuals/response)^2))
}
RMSRE(my_summary$residuals, df$DeltaT)
# [1] 0.0006351842

Model check on extrapolated data

Let's see how we perform in extrapolation: I re-ran your Python code changing only these lines

R_a = np.arange(1, 15, 1)
R_u = np.arange(0.01, 0.1, 0.01)

and obtained a new Ta array from your code. Repeating the same steps as before, I store the extrapolated points in the data frame df_extrapolated:

# DeltaT_matrix_extrapolated holds the new 14 x 9 Ta matrix from the modified run (entered with structure(), as above)
DeltaT <- as.vector(DeltaT_matrix_extrapolated)
R_a <- seq(1, 14)
R_u <- seq(0.01, 0.09, by = 0.01)
Ta_in <- rep(R_a, 9)
u <- rep(R_u, each = 14)
df_extrapolated <- data.frame(Ta_in, u, DeltaT)

To get predictions at the new points, we can pass df_extrapolated directly to the generic function predict:

my_predictions <- predict(my_model, newdata = df_extrapolated)
extrapolation_residuals <- my_predictions - DeltaT
RMSRE(extrapolation_residuals, DeltaT)
# [1] 0.0160831

The root mean squared relative error is still quite small ($\sim 1.6\%$), even though we tested the model at values of $T_a$ and $u$ considerably lower than those used to fit it. Remember that $\log(u) \to -\infty$ as $u \to 0$, so the fact that the approximation remains good even as we approach the $u=0$ line is definitely remarkable.
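
Since the question generates the data in Python, here is a minimal sketch of the same 4-parameter fit done directly with NumPy least squares. It assumes the R_a, R_u grids and the Ta array from the question's script are available in the same session; those names come from that code, not from any library.

# Fit DeltaT ~ b0 + b1*Ta_in + b2*log(u) + b3*Ta_in*log(u) by ordinary least squares.
# Assumes R_a, R_u and Ta (shape (len(R_a), len(R_u))) from the question's script.
import numpy as np

TT, UU = np.meshgrid(R_a, R_u, indexing='ij')   # input grids, same shape as Ta
ta_in = TT.ravel()
u = UU.ravel()
dT = Ta.ravel()                                 # Ta[i, j] = T_a_out - T_a_in

# design matrix: intercept, Ta_in, log(u), interaction Ta_in*log(u)
X = np.column_stack([np.ones_like(ta_in), ta_in, np.log(u), ta_in * np.log(u)])
beta, *_ = np.linalg.lstsq(X, dT, rcond=None)

pred = X @ beta
rmsre = np.sqrt(np.mean(((pred - dT) / dT) ** 2))
print("coefficients:", beta)
print("in-sample RMSRE:", rmsre)

On the same grid of inputs, this should reproduce the lm coefficients reported above up to numerical precision.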


Your problem is much more straightforward than a typical regression problem because: 1) you know the actual ground truth and can generate as many data points as you'd like, 2) there's no noise, and 3) the function is very well behaved (smooth and low-dimensional).

Simple interpolation should work well in this setting. For example, the surface plot you showed was almost certainly generated using interpolation. In this approach, you'd evaluate the function at a fixed set of sample points. The function is then approximated as a set of local pieces. Each piece is a simple function defined over the space between neighboring sample points, and passes through the true function values at those points.

The type of local function used determines the type of interpolation. For example, linear and cubic spline interpolation are popular choices. Cubic splines require more computation, but produce smoother approximations. Since you're using Python, check out scipy.interpolate.
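
For instance, here is a minimal sketch using scipy.interpolate.RectBivariateSpline (a bicubic spline on a rectangular grid). It assumes the R_a, R_u grids and the Ta array from the question's script; the evaluation points are arbitrary examples.

# Bicubic spline interpolation of the DeltaT_a surface.
# Assumes R_a, R_u and Ta (shape (len(R_a), len(R_u))) from the question's script.
from scipy.interpolate import RectBivariateSpline

spline = RectBivariateSpline(R_a, R_u, Ta)   # cubic in both directions by default

print(spline(17.3, 0.42))    # value at a single (T_a_in, u) point, returned as a 1x1 array
print(spline(R_a, R_u))      # evaluated on the training grid, reproduces Ta up to numerical error

A piecewise-linear alternative on the same grid is scipy.interpolate.RegularGridInterpolator((R_a, R_u), Ta).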

Interpolating using a greater number of sample points gives a more accurate approximation at the expense of more computation. In some cases it's possible to balance between accuracy and computational expense by allocating sample points adaptively, such that they're more densely distributed in regions where the function is more complicated. In the simplest case, sample points can be chosen on a regular grid. Extrapolating outside the range of the sample points may not work well, so they should be chosen to span the largest range you anticipate needing.

There are many fancier regression methods that would also give good results (e.g. Gaussian process regression, kernel methods, tree/forest-based methods, various basis expansions, etc.). But, because of the simplicity of your function and the lack of noise, I don't think the added complexity would buy you anything over simple interpolation.

Another alternative would be to come up with a simple, parametric form for the approximation, then fit the parameters. This would have the advantages of being highly computationally efficient and easily interpretable. But, it would require that such a form exists (with high enough accuracy), and that you can find it.

