Batch normalization before or after nonlinearity?

Batch normalization has emerged as a popular way to boost the performance of deep feed-forward neural networks; I discuss it briefly in this post. In essence, batch normalization helps stabilize network training.

In the BN paper, Ioffe and Szegedy recommended inserting batch normalization layers between convolution and activation. Here’s the relevant paragraph:

We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvarinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.
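
For concreteness, here is a minimal sketch of that ordering. The framework (PyTorch) and the channel sizes are my own choices for illustration; the paper itself prescribes only the placement:

```python
import torch.nn as nn

# Conv -> BN -> ReLU: the ordering recommended in the BN paper, with the
# batch-norm layer normalizing the pre-activation x = Wu + b.
block_bn_before_act = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```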

While their explanation makes sense, I always felt that batch normalization should still be performed immediately before any parameterized layer, i.e. after the activation. Think about it from a theoretical perspective: the original idea behind batch normalization was to push activations closer to zero mean and unit variance. Admittedly, the post-ReLU statistics of a layer are likely to vary considerably between iterations of stochastic gradient descent, but they should settle later in training, leading to similar if not improved performance.1 One could argue that applying batch normalization in this manner injects more noise early in training; even so, that noise may help the network “escape” poor local minima and improve results later on.
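
For comparison, a sketch of the post-activation placement I describe above, with the same illustrative caveats as before:

```python
import torch.nn as nn

# Conv -> ReLU -> BN: batch norm applied after the nonlinearity, so the
# normalized statistics are those of the activations feeding the next
# parameterized layer.
block_bn_after_act = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(128),
)
```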

Doing a quick search, I stumbled upon this, which seems to support the hypothesis. However, some of the co-workers I spoke with aren’t as convinced, and gave strong reasons why placing BN before the activation should lead to better performance. I’ll look to run some more comprehensive experiments in the coming weeks to see what works better for most networks.2

1 This is true of feedforward networks trained in a supervised fashion. For other applications, such as adversarial networks, this likely does not hold.
2 UPDATE: I did run some experiments, but unfortunately not as comprehensive as I had wanted.