I'm reading the Deep Learning book by Goodfellow, Bengio, and Courville (Chapter 8, Section 8.7.1 on Batch Normalization, page 315). The authors use a simple example of a deep linear network without activation functions:

$\hat{y} = x \, w_1 w_2 w_3 \cdots w_l$
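To make the setup concrete, here is a tiny NumPy sketch of this model with scalar weights (my own illustration, not code from the book):

```python
# Scalar deep linear model y_hat = x * w_1 * w_2 * ... * w_l (no activations).
import numpy as np

rng = np.random.default_rng(0)
l = 50                                      # number of layers
w = rng.normal(loc=0.0, scale=1.1, size=l)  # one scalar weight per layer
x = 1.0

# Forward pass: the whole stack collapses to a single effective weight.
y_hat = x * np.prod(w)

# The gradient of y_hat with respect to w_1 is x * w_2 * ... * w_l, i.e. it is
# scaled by the product of all the *other* weights, which is why a deep stack
# like this makes gradients explode or vanish.
grad_w1 = x * np.prod(w[1:])
print(y_hat, grad_w1)
```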

They explain how batch normalization helps prevent exploding/vanishing gradients by normalizing activations at each layer. However, they then state:

"Batch normalization has thus made this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lower layers useless. In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence."

Given that stacking linear layers is already "useless" (reducible to a single layer) even without batch normalization, what additional "uselessness" does batch normalization introduce? The book seems to suggest that batch normalization specifically makes the lower layers useless, and I do not understand why.
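For reference, here is the quick numerical check I used to see what is being "normalized out" (my own sketch, assuming plain per-layer batch normalization without the learned scale and shift $\gamma, \beta$, and positive weights so the sign never flips). The batch-normalized activations come out the same no matter which weights are drawn:

```python
# My own check, not from the book: with scalar activations, per-layer batch
# normalization removes any effect of the (positive) lower-layer weights.
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize a minibatch of scalar activations to zero mean, unit variance
    (the learned scale gamma and shift beta are omitted for simplicity)."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                    # a minibatch of scalar inputs

for trial in range(3):
    w = rng.uniform(0.1, 10.0, size=5)    # fresh positive weights each trial
    h = x
    for w_k in w:
        h = batch_norm(w_k * h)           # linear layer followed by batch norm
    print(h)                              # essentially identical output every trial
```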
