In "Stochastic modified equations and adaptive stochastic gradient algorithms" (Li et. al 2015) the authors approximate stochastic gradient descent, as in
$$x_{k+1} = x_k - \eta u_k \nabla f_{\gamma_k}(x_k),$$
by an SDE, a so-called stochastic modified equation (SME),
$$d X_t = -u_t \nabla f(X_t)dt + u_t \sqrt{\eta \Sigma(X_t)}dW_t,$$
in the sense of a first-order weak approximation, by viewing the SGD update as an application of the Euler scheme to the SDE.
Here
- $\eta > 0$ is the maximum learning rate
- $T > 0$ is the time horizon
- $u : [0,T] \to [0,1]$ controls the learning rate
- $(\gamma_k)_k$ are i.i.d. random variables with values in $\{1,\dots, n\}$ and $\mathbb P(\gamma_k = i) = 1/n$, representing, e.g., random samples drawn from a training set
- $f(x) = \frac 1 n \sum_{i=1}^n f_i(x)$ is the objective
- $(W_t)_t$ is a standard Brownian motion, independent of the $\gamma_k$
and
$$\Sigma(x) := \frac 1 n \sum_{i=1}^n (\nabla f(x) - \nabla f_i(x))(\nabla f(x) - \nabla f_i(x))^T.$$
Of course, many conditions are left implicit here, such as that the $f_i$ are continuously differentiable or that the coefficients of the SME satisfy growth conditions guaranteeing the existence of a unique solution.
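To make the weak-approximation / Euler-scheme viewpoint concrete, here is a minimal sketch with a toy objective of my own choosing (not from the paper): $f_i(x) = \tfrac12 (x - c_i)^2$, so that $\nabla f$ and $\Sigma$ are explicit, and SGD and an Euler–Maruyama discretisation of the SME can be run side by side and compared through $\mathbb E[f]$ (which is what "weak" refers to):

```python
# Toy sketch (my own construction): f_i(x) = 0.5*(x - c_i)^2, so that
# grad f(x) = x - mean(c) and Sigma(x) = Var(c) is constant.  SGD with step
# size eta*u_k is compared, in the weak sense (E[f] over many independent
# runs), with an Euler-Maruyama discretisation of the SME.
import numpy as np

rng = np.random.default_rng(0)
n, eta, T, n_paths, x0 = 50, 0.05, 5.0, 20_000, 2.0
steps = int(T / eta)
c = rng.normal(size=n)                         # data defining the f_i
c_bar, sigma2 = c.mean(), c.var()              # grad f and Sigma in closed form

def u(t):                                      # learning-rate control, here constant
    return 1.0

def f_vals(x):                                 # f(x) = (1/n) sum_i 0.5*(x - c_i)^2, per path
    return 0.5 * ((x[:, None] - c[None, :]) ** 2).mean(axis=1)

x_sgd = np.full(n_paths, x0)                   # SGD iterates, one per independent run
X_sme = np.full(n_paths, x0)                   # Euler-Maruyama iterates of the SME

for k in range(steps):
    t = k * eta
    idx = rng.integers(0, n, size=n_paths)     # gamma_k ~ Uniform{1,...,n}
    x_sgd = x_sgd - eta * u(t) * (x_sgd - c[idx])             # SGD step
    dW = rng.normal(scale=np.sqrt(eta), size=n_paths)         # Brownian increment over dt = eta
    X_sme = X_sme + (-u(t) * (X_sme - c_bar) * eta
                     + u(t) * np.sqrt(eta * sigma2) * dW)     # Euler-Maruyama step of the SME

print("E[f(x_K)] (SGD):        ", f_vals(x_sgd).mean())
print("E[f(X_T)] (SME, Euler): ", f_vals(X_sme).mean())
```

The two printed averages should be close, while individual paths of $x_k$ and $X_{\eta k}$ need not be.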
Now, the goal is to solve a control problem for the SME, such as
$$\min_{u} \mathbb E(f(X_T)),$$
and use the solution to control the learning rate of the original SGD (a crude Monte Carlo reading of this objective is sketched at the end of the post). What I wonder is:
Why can't we just control the original SGD directly? What can we gain by passing to this continuous problem?
Among other things, I think this matters because the pathwise difference between the actual iterate $x_k$ and its continuous counterpart, $|X_{\eta k} - x_k|$, can be quite large, since the approximation is only a weak one (in distribution, not pathwise). I think this prevents us from letting $u$ depend on $X_t$ as well (a feedback control), because then the solution could not be meaningfully transferred back to the original SGD.
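For concreteness, here is what evaluating the objective $\mathbb E(f(X_T))$ for a given deterministic (open-loop) schedule $u$ amounts to numerically: a plain Monte Carlo average over Euler–Maruyama paths of the SME. I reuse the toy $f_i(x) = \tfrac12 (x - c_i)^2$ from the sketch above; the example and the candidate schedules are my own illustration, not taken from the paper.

```python
# Hedged sketch of the continuous control objective: estimate E[f(X_T)] for a
# candidate deterministic schedule u by Monte Carlo over Euler-Maruyama paths
# of the SME, for the same toy f_i(x) = 0.5*(x - c_i)^2 as above.
import numpy as np

rng = np.random.default_rng(1)
n, eta, T, n_paths, x0 = 50, 0.05, 5.0, 20_000, 2.0
c = rng.normal(size=n)
c_bar, sigma2 = c.mean(), c.var()              # grad f(x) = x - c_bar, Sigma = Var(c)

def objective(u, steps=int(T / eta)):
    """Monte Carlo estimate of E[f(X_T)] for a schedule u: [0,T] -> [0,1]."""
    X = np.full(n_paths, x0)
    for k in range(steps):
        dW = rng.normal(scale=np.sqrt(eta), size=n_paths)
        X = X + (-u(k * eta) * (X - c_bar) * eta
                 + u(k * eta) * np.sqrt(eta * sigma2) * dW)
    return (0.5 * ((X[:, None] - c[None, :]) ** 2).mean(axis=1)).mean()

# Two candidate open-loop schedules (illustrative only):
print("constant u: ", objective(lambda t: 1.0))
print("decaying u: ", objective(lambda t: 1.0 / (1.0 + t)))
```

Since such a $u$ depends on $t$ only, the resulting schedule could presumably be plugged back into SGD as $u_k := u(\eta k)$; a feedback control $u(t, X_t)$, on the other hand, is exactly what my concern above is about.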