
Specifically, I mean

$$ f(x)= \begin{cases} -\log(1-x) & x \le 0 \\ \phantom{-}\log(1+x) & x \gt 0 \\ \end{cases} $$

which is shown in red in the plot below:

[Plot: the proposed activation (red) alongside $\tanh(x)$ (blue)]

It behaves similarly to the widely used $\tanh(x)$ (blue), except that it avoids saturation/vanishing gradients since it has no horizontal asymptotes. It's also less computationally expensive.
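A minimal NumPy sketch of what I mean (the name `symlog` is just an illustrative label, not an established one):

```python
import numpy as np

def symlog(x):
    # sign(x) * log(1 + |x|) reproduces both branches of the piecewise definition
    return np.sign(x) * np.log1p(np.abs(x))

x = np.linspace(-5.0, 5.0, 11)
print(np.round(symlog(x), 3))   # grows without bound in both directions
print(np.round(np.tanh(x), 3))  # saturates near -1 and +1
```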

Is there some issue with it I'm missing?


1 Answer


For a long time, neural network researchers believed that sigmoid activations like the inverse logit and $\tanh$ were the only activations that were necessary. This is because the Cybenko (1989) Universal Approximation Theorem (loosely) states that, under certain conditions, a neural network can approximate certain functions to a desired level of precision with 1 hidden layer & a finite number of units. One of the conditions is that the activation function is bounded. (For full details, consult the paper.)

The proposed function $f$ is not bounded, so it does not satisfy the boundedness condition.

However, in the time since Cybenko published his UAT, many other UAT variations have been proven in different settings & allowing more flexibility in the choice of activation functions, number of layers, and so on.

From the perspective of modern neural network theory, you would need to show that the proposed activation has some desirable property that is not found in alternative choices. One problem that I anticipate with this activation is that its derivative is $f^\prime(x)=\frac{1}{1+|x|}$, which goes to 0 as $|x|$ grows. This is undesirable because of the vanishing gradient phenomenon.
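As a rough numeric illustration (assuming the piecewise definition from the question; the helper name `grad` is just for this sketch), the derivative shrinks quickly as inputs move away from zero:

```python
import numpy as np

def grad(x):
    # derivative of sign(x)*log(1+|x|) is 1 / (1 + |x|)
    return 1.0 / (1.0 + np.abs(x))

for x in [0.0, 1.0, 3.0, 10.0, 100.0]:
    print(x, grad(x))   # 1.0, 0.5, 0.25, ~0.091, ~0.0099
```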

By contrast, an activation function whose derivative is exactly 1 on a "large" portion of its inputs is preferable because it ameliorates the vanishing gradient problem. The ReLU and related functions are examples of this type.
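To see why a unit derivative helps, here is a toy comparison that ignores weights and simply chains one activation derivative per layer (only a caricature of backpropagation; the function names and the depth of 20 are illustrative assumptions):

```python
def relu_grad(x):
    # ReLU derivative: exactly 1 for x > 0, 0 otherwise
    return 1.0 if x > 0 else 0.0

def proposed_grad(x):
    # derivative of the proposed activation: 1 / (1 + |x|)
    return 1.0 / (1.0 + abs(x))

depth = 20   # number of layers the gradient factor is multiplied through
x = 3.0      # a modest pre-activation value
print(proposed_grad(x) ** depth)  # 0.25**20 ≈ 9.1e-13 -- effectively vanished
print(relu_grad(x) ** depth)      # 1.0 -- magnitude preserved
```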


Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989). https://doi.org/10.1007/BF02551274

  • "[...] derivative [...] is strictly less than 1 almost everywhere [...]" -- Isn't this true for all activation functions in common use except ReLU and its variants, though? Do modern nets only use these? — Commented Feb 16, 2023 at 15:38
  • Indeed, this is the reason that ReLU-like activations are used in almost all modern neural networks. — Commented Feb 16, 2023 at 15:53
