I'm reading the Deep Learning book by Goodfellow, Bengio, and Courville (Chapter 8, Section 8.7.1 on Batch Normalization, page 315). The authors use a simple example of a deep linear network without activation functions:
$$\hat{y} = x \, w_1 w_2 w_3 \cdots w_l$$
They explain how batch normalization helps prevent exploding/vanishing gradients by normalizing activations at each layer. However, they then state:
"Batch normalization has thus made this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lower layers useless. In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence."
Given that a stack of linear layers is already "useless" even without batch normalization, in the sense that it is reducible to a single layer, what additional "uselessness" does batch normalization introduce? The book seems to attribute the lower layers' uselessness specifically to batch normalization, and I do not understand why.
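To state the premise of my question precisely (my own restatement, not the book's): without batch normalization the linear chain already collapses to a single effective weight,

$$\hat{y} = x \, w_1 w_2 \cdots w_l = x \, w_{\text{eff}}, \qquad w_{\text{eff}} = \prod_{i=1}^{l} w_i,$$

so the extra depth contributes nothing representationally even before batch normalization is applied.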