
I understand that neural networks (NNs) can be considered universal approximators to both functions and their derivatives, under certain assumptions (on both the network and the function to approximate). In fact, I have done a number of tests on simple, yet non-trivial functions (e.g., polynomials), and it seems that I can indeed approximate them and their first derivatives well (an example is shown below).

What is not clear to me, however, is whether the theorems that lead to the above extend (or perhaps could be extended) to functionals and their functional derivatives. Consider, for example, the functional: \begin{equation} F[f(x)] = \int_a^b dx ~ f(x) g(x) \end{equation} with the functional derivative: \begin{equation} \frac{\delta F[f(x)]}{\delta f(x)} = g(x) \end{equation} where $f(x)$ depends entirely, and non-trivially, on $g(x)$. Can a NN learn the above mapping and its functional derivative? More specifically, if one discretizes the domain $x$ over $[a,b]$ and provides $f(x)$ (at the discretized points) as input and $F[f(x)]$ as output, can a NN learn this mapping correctly (at least theoretically)? If so, can it also learn the mapping's functional derivative?

I have done a number of tests, and it seems that a NN may indeed learn the mapping $F[f(x)]$, to some extent. However, while the accuracy of this mapping is OK, it is not great, and, more troubling, the computed functional derivative is complete garbage (though both of these could be related to issues with training, etc.). An example is shown below.

If a NN is not suitable for learning a functional and its functional derivative, is there another machine learning method that is?

Examples:

(1) The following is an example of approximating a function and its derivative: a NN was trained to learn the function $f(x) = x^3 + x + 0.5$ over the range $[-3,2]$ [plot: NN approximation of $f(x)$], from which a reasonable approximation to $df(x)/dx$ is obtained [plot: NN approximation of $df(x)/dx$]. Note that, as expected, the NN approximations to $f(x)$ and its first derivative improve with the number of training points, the NN architecture, as better minima are found during training, etc.

(2) The following is an example of approximating a functional and its functional derivative: a NN was trained to learn the functional $F[f(x)] = \int_1^2 dx ~ f(x)^2$. Training data was obtained using functions of the form $f(x) = a x^b$, where $a$ and $b$ were randomly generated. The following plot illustrates that the NN is indeed able to approximate $F[f(x)]$ quite well [plot: NN approximation of $F[f(x)]$]. The calculated functional derivatives, however, are complete garbage; an example (for a specific $f(x)$) is shown below [plot: computed functional derivative]. As an interesting note, the NN approximation to $F[f(x)]$ seems to improve with the number of training points, etc. (as in example (1)), yet the functional derivative does not.
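For concreteness, training data of this kind can be generated along the following lines (the sampling ranges for $a$ and $b$ and the trapezoidal quadrature below are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# Discretized domain [1, 2]; the f-values on this grid are the NN inputs
x = np.linspace(1.0, 2.0, 50)

# Random training functions f(x) = a*x**b (ranges for a and b are illustrative)
n_samples = 1000
a = rng.uniform(-2.0, 2.0, n_samples)
b = rng.uniform(0.0, 3.0, n_samples)
f_samples = a[:, None] * x[None, :]**b[:, None]

# Targets: F[f] = \int_1^2 f(x)^2 dx, approximated by the trapezoidal rule
F_targets = np.trapz(f_samples**2, x, axis=1)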

  • Interesting question. How are you representing the input f of the functional F? I assume f is being quantized to some vector of f-values (say a vector of 1000 samples). If so, what does the x-axis of your third plot mean? It seems to be different from the x-axis of your 4th plot. Is the network being trained to learn F[f] and dF/df, or are you computing dF/df once the network is trained? Commented Nov 10, 2019 at 22:22

4 Answers


Neural nets can approximate continuous mappings between Euclidean vector spaces, $f : \mathbb{R}^M \to \mathbb{R}^N$, when the hidden layer becomes infinite in size. That said, it's more efficient to add depth than width. A functional is simply a map whose range is $\mathbb{R}$, i.e. $N=1$. So yes, neural nets can learn functionals as long as the input is a finite-dimensional vector space, and the derivative is then easily found by reverse-mode differentiation, a.k.a. backpropagation. Also, quantising the input is indeed a good way to extend the network to continuous function inputs.
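For instance, here is a minimal sketch of this recipe; PyTorch, the toy functional $F[f]=\int_1^2 f(x)^2\,dx$, and the network size are illustrative choices, not a prescription. The trained network's gradient with respect to its input samples, divided by the grid spacing $\Delta x$, gives a discrete estimate of $\delta F/\delta f$ at the grid points.

import torch
import torch.nn as nn

# Grid on [1, 2]; the network's inputs are samples of f on this grid
N = 50
x = torch.linspace(1.0, 2.0, N)
dx = x[1] - x[0]

# Toy training set: f(x) = a*x**b, target F[f] = \int f(x)^2 dx (trapezoidal rule)
a = torch.empty(2000).uniform_(-2.0, 2.0)
b = torch.empty(2000).uniform_(0.0, 3.0)
f_samples = a[:, None] * x[None, :]**b[:, None]
targets = torch.trapz(f_samples**2, x, dim=1)

net = nn.Sequential(nn.Linear(N, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(f_samples).squeeze(-1) - targets)**2).mean()
    loss.backward()
    opt.step()

# Functional-derivative estimate at a test function: gradient of the scalar output
# with respect to the input samples, divided by dx
f_test = (x**2).clone().requires_grad_(True)
F_val = net(f_test.unsqueeze(0)).squeeze()
grad_f, = torch.autograd.grad(F_val, f_test)
dF_df = grad_f / dx   # for this toy F, compare against the exact derivative 2*f(x)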


This answer is a little late, but for future reference, I recently worked on a paper where we investigated learning functionals with neural networks, which you may find interesting: https://arxiv.org/abs/2505.13275, especially the section on implicitly learning functional derivatives of some toy problems. For example, the linear functional $\displaystyle F[u(x)] = \int_{-1}^1 u(x) \, x^2 \,dx$ has the functional derivative $\dfrac{\delta F}{\delta u} = x^2$; by training only on the scalar labels (minimizing $\|F_\theta(u(x)) - F[u(x)]\|$), the network $F_\theta$ (an NF, or Neural Functional) can implicitly learn the derivative, which is recovered by taking its autograd.

[figure: implicitly learned functional derivative for the toy problem]

As you observed earlier, an MLP trained to fit a functional tends to have poor functional derivatives when you take its autograd. We believe this is because a conventional neural network approximates a function rather than a functional, so backpropagating through it produces a gradient (a discrete vector) rather than a functional derivative (a continuous function). Approximating linear functionals can be done by making a connection to the Riesz representation theorem: in essence, a bounded linear functional can be represented as an inner product with a suitable function. In this manner, learning the functional is recast as learning an integral kernel, $F_\theta [u(x)] = \int_{\Omega} u(x)\,\kappa_{\theta}(x)\,dx$, where we train the kernel $\kappa_\theta(x)$.

From a theoretical standpoint, this can provably approximate linear functionals; however, implementation choices still need to be made to approximate the integral and parameterize the kernel $\kappa_\theta$. It turns out that for 1D problems and simple domains $\Omega$, the integral can be approximated with standard quadrature rules (Riemann sums, the trapezoidal rule, etc.), and the kernel can be parameterized with an MLP or neural-field-based methods.
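As an illustrative sketch of this construction (my own toy implementation, not code from the paper; PyTorch, the polynomial training functions, and the network size are arbitrary choices): parameterize $\kappa_\theta$ with a small MLP evaluated at the quadrature nodes, approximate the integral with a trapezoidal sum, and train only on scalar labels $F[u]$. For the toy functional $F[u]=\int_{-1}^1 u(x)\,x^2\,dx$, the trained kernel $\kappa_\theta(x)$ should then approximate the functional derivative $x^2$.

import torch
import torch.nn as nn

# Quadrature grid on Omega = [-1, 1] with trapezoidal weights
N = 101
x = torch.linspace(-1.0, 1.0, N)
w = torch.ones(N) * (x[1] - x[0])
w[0] *= 0.5
w[-1] *= 0.5

# kappa_theta: a small MLP from x to R, trained as the integral kernel
kappa = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

def F_theta(u_samples):
    # F_theta[u] ~= sum_i w_i * u(x_i) * kappa_theta(x_i)
    k = kappa(x.unsqueeze(-1)).squeeze(-1)
    return (u_samples * k * w).sum(dim=-1)

# Toy data: random cubic polynomials u(x) sampled on the grid (an illustrative choice),
# with labels F[u] = \int u(x) x^2 dx computed by the same quadrature
coeffs = torch.randn(4000, 4)
powers = x.unsqueeze(0) ** torch.arange(4, dtype=x.dtype).unsqueeze(-1)  # shape (4, N)
u = coeffs @ powers
labels = (u * x**2 * w).sum(dim=-1)

opt = torch.optim.Adam(kappa.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((F_theta(u) - labels)**2).mean()
    loss.backward()
    opt.step()

# After training, kappa evaluated at the nodes should approximate x^2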

There are other works that approximate function-to-vector mappings, which are closely related (e.g. https://arxiv.org/abs/2402.06031), and the broader field of operator learning is related to learning functionals as well.

  • This article is responsive to the question but this answer is a bit sparse on details. Can you edit to explain the key details in the article & how it solves the problem that MLPs exhibit? Commented Jun 4 at 19:04
  • Edited! Thank you for the suggestion. Commented Jun 18 at 13:45

This is a good question. I think it involves a theoretical mathematical proof. I have been working with deep learning (basically neural networks) for a while (about a year), and based on the papers I have read, I have not seen a proof of this yet. However, in terms of experimental evidence, I think I can provide some feedback.

Let's consider this example below:

[example figure omitted]

In this example, I believe a multi-layer neural network should be able to learn both $f(x)$ and $F[f(x)]$ via back-propagation. However, whether this applies to more complicated functions, or to all functions, requires more proof. Consider the ImageNet competition (classifying 1000 objects), where very deep neural networks are often used; the best models achieve an impressive error rate of ~5%. Such deep NNs contain more than 10 non-linear layers, which is experimental evidence that complicated relationships can be represented by deep networks [given that we know a NN with one hidden layer can already separate data non-linearly].

But whether ALL derivatives can be learned requires more research.

I am not sure whether there are any machine learning methods that can learn a function and its derivative completely. Sorry about that.

  • Thank you for your answer. I was actually a bit surprised at first that a neural network could approximate a functional at all. Accepting the fact that it could, though, it then intuitively seems that information about its functional derivative should be contained in the solution (as is the case with functions), especially for simple functions and functionals (as in your example). In practice, however, this is not the case. In light of your example, I added some examples to my original post. Commented Jun 23, 2015 at 22:36
  • Cool, what is the setting for your neural network? Such as number of layers, hidden units, activation functions, etc. Commented Jun 24, 2015 at 17:02
  • I have tried various settings: 1-3 hidden layers, 5 to 100 hidden units (per layer), various numbers of input points (while the functional is defined in the limit that this number goes to infinity, I have tried as few as four points), sigmoid and tanh (normal, as well as that recommended by LeCun) activation functions, and various training methods (backpropagation, QRPROP, particle swarm optimization, and others). I have tried both in-house and some well-known software. While I can get improvement in approximating the functional as I change things, I can't in the functional derivative. Commented Jun 24, 2015 at 17:41
  • Cool. What software did you use? Have you done cross-validation to optimize your network settings? Here are some of my thoughts: (1) I would expect 3 or more hidden layers may be required because the problem is highly non-linear, (2) try an undercomplete setting for the hidden units, i.e., input-100-50-20-output instead of input-20-50-100-output, (3) use ReLU instead of sigmoid or tanh; researchers published a few papers around 2010 showing that ReLU can lead to better results, (4) parameters such as weight decay and learning rate are important, so make sure you tune them appropriately, (5) try Caffe as a tool. Commented Jun 24, 2015 at 19:39
  • Besides in-house software, I have used stats++, Encog, and NeuroSolutions (the latter was only a free trial, and I don't use it anymore). I have not yet tried cross-validation to optimize things, but I will; I will also try your other suggestions. Thank you for your thoughts. Commented Jun 24, 2015 at 23:07

If the functional has the form $$F[f(x)]=\int\limits_a^b f(x)g(x)\,dx,$$ then $g(x)$ can be learned with a linear regression, given enough training functions $f_i(x)$, $i=1,\dots,M$, and target values $F[f_i(x)]$. This is done by approximating the integral with the trapezoidal rule: $$F[f(x)]\approx \Delta x\left[\frac{f_0g_0}{2}+f_1g_1+\dots+f_{N-1}g_{N-1}+\frac{f_Ng_N}{2}\right],$$ that is, $$\frac{F[f(x)]}{\Delta x}=y= \frac{f_0g_0}{2}+f_1g_1+\dots+f_{N-1}g_{N-1}+\frac{f_Ng_N}{2},$$ where $$f_0=f(a),~f_1=f(x_1),~\dots,~f_{N-1}=f(x_{N-1}),~f_N=f(b),$$ $$a=x_0<x_1<\dots<x_{N-1}<x_N=b,~~\Delta x=x_{j+1}-x_j,$$ and $g_j=g(x_j)$.

Suppose we have $M$ training functions $f_i(x),~i=1,\dots,M$. For each $i$ we have $$\frac{F[f_i(x)]}{\Delta x}=y_i= \frac{f_{i0}g_0}{2}+f_{i1}g_1+...+f_{i,N-1}g_{N-1}+\frac{f_{iN}g_N}{2}$$

The values $g_0,\dots, g_N$ are then found as the solution of a linear regression problem with the matrix of explanatory variables $$X=\begin{bmatrix} f_{10}/2 & f_{11} & \dots & f_{1,N-1} & f_{1N}/2 \\ f_{20}/2 & f_{21} & \dots & f_{2,N-1} & f_{2N}/2 \\ \dots & \dots & \dots & \dots & \dots\\ f_{M0}/2 & f_{M1} & \dots & f_{M,N-1} & f_{MN}/2 \end{bmatrix}$$ and the target vector $y=[y_1,\dots,y_M]$.

Let's test this on a simple example. Suppose $g(x)$ is a Gaussian.

import numpy as np 

def Gaussian(x, mu, sigma):
    return np.exp(-0.5*((x - mu)/sigma)**2)

Discretize the domain $x \in [a,b]$

x = np.arange(-1.0, 1.01, 0.01)
dx = x[1] - x[0]
g = Gaussian(x, 0.25, 0.25)

Let's take sines and cosines with different frequencies as our training functions. Calculating the target vector:

from math import cos, sin, exp
from scipy.integrate import quad

freq = np.arange(0.25, 15.25, 0.25)

y = []
for k in freq:
    y.append(quad(lambda x: cos(k*x)*exp(-0.5*((x-0.25)/0.25)**2), -1, 1)[0])
    y.append(quad(lambda x: sin(k*x)*exp(-0.5*((x-0.25)/0.25)**2), -1, 1)[0])
y = np.array(y)/dx

Now, the regressor matrix:

X = np.zeros((y.shape[0], x.shape[0]), dtype=float)
for i in range(len(freq)):
    X[2*i,:] = np.cos(freq[i]*x)
    X[2*i+1,:] = np.sin(freq[i]*x)

# trapezoidal rule: the endpoint samples enter with weight 1/2
X[:,0] = X[:,0]/2
X[:,-1] = X[:,-1]/2

Linear regression:

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
ghat = reg.coef_  # the fitted coefficients are the values of g at the grid points

import matplotlib.pyplot as plt

plt.scatter(x, g, s=1, marker="s", label='original g(x)')
plt.scatter(x, ghat, s=1, marker="s", label=r'learned $\hat{g}$(x)')
plt.legend()
plt.grid()
plt.show()

[plot: learned $\hat{g}(x)$ compared with the original $g(x)$]

The Gaussian function is successfully learned, although the data are spread somewhat around the true function. The spread is larger where $g(x)$ is close to zero. This spread can be smoothed with a Savitzky-Golay filter:

from scipy.signal import savgol_filter
ghat_sg = savgol_filter(ghat, 31, 3) # window size, polynomial order

plt.scatter(x, g, s=1, marker="s", label='original g(x)')
plt.scatter(x, ghat, s=1, marker="s", label=r'learned $\hat{g}$(x)')
plt.plot(x, ghat_sg, color="red", label=r'Savitzky-Golay $\hat{g}$(x)')
plt.legend()
plt.grid()
plt.show()

[plot: original $g(x)$, learned $\hat{g}(x)$, and the Savitzky-Golay smoothed $\hat{g}(x)$]

In general, $F[f(x)]$ does not depend linearly on $f(x)$; that is, $$F[f(x)]=\int\limits_a^b\mathcal{L}\left(f(x)\right)dx.$$ It can still be written as a function of $f_0, f_1,\dots,f_N$ after discretizing $x$, which is also true for functionals of the form $$F[f(x)]=\int\limits_a^b\mathcal{L}\left(f(x),f'(x)\right)dx,$$ because $f'$ can be approximated by finite differences of $f_0, f_1,\dots,f_N$. Since $\mathcal{L}$ makes $F$ a non-linear function of $f_0, f_1,\dots,f_N$, one may attempt to learn it with a non-linear method, e.g. neural networks or SVMs, although it will probably not be as easy as in the linear case.
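As a rough sketch of that non-linear case (the functional $F[f]=\int f(x)^2\,dx$, the sine training functions, and the scikit-learn MLP below are all illustrative choices): discretize $f$ as before and fit a non-linear regressor from the sample vector $(f_0,\dots,f_N)$ to $F[f]$.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = np.arange(-1.0, 1.01, 0.01)

# Training functions: random sines with random amplitudes, sampled on the grid
freq2 = rng.uniform(0.25, 15.0, 500)
amp = rng.uniform(0.5, 2.0, 500)
Xf = amp[:, None] * np.sin(freq2[:, None] * x[None, :])

# Non-linear functional F[f] = \int f(x)^2 dx, approximated by the trapezoidal rule
yf = np.trapz(Xf**2, x, axis=1)

model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=5000, random_state=0)
model.fit(Xf, yf)

# Sanity check on an unseen function f(x) = cos(3x)
f_test = np.cos(3.0*x)
print(model.predict(f_test[None, :])[0], np.trapz(f_test**2, x))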

