Batch normalization before or after nonlinearity?

Batch normalization has emerged as a popular way to boost the performance of deep feed-forward neural networks, and I discuss it briefly in this post. In essence, batch normalization helps stabilize network training by normalizing each layer’s inputs over every mini-batch.

In the BN paper, Ioffe and Szegedy recommended inserting batch normalization layers between convolution and activation. Here’s the relevant paragraph:

We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvarinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

While their explanation makes sense, I always felt that batch normalization should still be performed immediately before any parameterized layer, i.e. after the activation. Think about it from a theoretical perspective – the original idea behind batch normalization was to push activations closer to zero mean and unit variance. Admittedly, the post-ReLU statistics of a layer are likely to have high variability between iterations of stochastic gradient descent, but those statistics should settle later on in training, thereby leading to similar if not improved performance.1 One can argue that applying batch normalization in such a manner can induce more noise early on in training; even so, noise may help the network “escape” poor local minima and improve results later on in training.
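To make the two orderings concrete, here’s a minimal sketch using Caffe’s Python net_spec interface; the layer sizes and fillers are arbitrary placeholders of mine, not settings from any particular network:

from caffe import layers as L

def conv_bn_relu(bottom, nout):
    # BN between the convolution and the nonlinearity, as recommended in the BN paper.
    conv = L.Convolution(bottom, kernel_size=3, pad=1, num_output=nout,
                         weight_filler=dict(type='xavier'))
    bn = L.BatchNorm(conv)
    return L.ReLU(bn, in_place=True)

def conv_relu_bn(bottom, nout):
    # BN after the nonlinearity, i.e. immediately before the next weight layer.
    conv = L.Convolution(bottom, kernel_size=3, pad=1, num_output=nout,
                         weight_filler=dict(type='xavier'))
    relu = L.ReLU(conv, in_place=True)
    return L.BatchNorm(relu)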

Doing a quick search, I stumbled upon this, which seems to support the hypothesis. However, some of the co-workers I spoke with aren’t as convinced, and gave strong reasons why placing BN immediately before the activation would lead to better performance. I’ll look to run some more comprehensive experiments in the coming weeks to see what works better for most networks.2

1 This is true of feedforward networks trained in supervised fashion. For other applications, such as adversarial networks, this is probably not true.
2 UPDATE: I did run some experiments, but unfortunately not as comprehensive as I had wanted.

Using Caffe to experiment with different architectures

For the past several months, I’ve been absorbed with deep learning, especially as it pertains to computer vision. Although essentially synonymous with neural networks, deep learning has emerged as the state-of-the-art method for visual and linguistic representation over the past several years.

Due to the continuation of Moore’s Law and various improvements in training techniques, convolutional neural networks have come to the forefront of computer vision research. Although most people consider Yann LeCun to be the primary founding father of convnets, I think an equally solid case could be made for Kunihiko Fukushima, who described a very complementary idea in 1980 in this paper. Owing to deep learning’s rise in popularity, various high-level neural network and machine learning libraries and frameworks have gained traction, almost all of which can perform computation on GPUs. In this post, I’ll be using Caffe, a popular C++ toolkit developed and maintained by the Berkeley Vision and Learning Center (BVLC). Its speed, flexibility, and production readiness make it a popular choice for researchers and industry engineers alike.

In this post, I’ll talk about how to use Caffe to experiment with different convnet architectures1. Specifically, I’ll show how you can easily define a network, swap layers, and re-train the updated network using Caffe’s network definition interface. Stanford’s CS231n lecture notes should provide a good refresher on deep convolutional neural networks, if you need it. If you’d like to follow along, it would also be good to ensure that you have the right dependencies prior to actually installing Caffe on your machine.

The entirety of this tutorial will be based on the well-known “VGG16” architecture, developed by the Visual Geometry Group at Oxford University for the ImageNet competition (ILSVRC’14). The VGG16 architecture is composed almost entirely of 3×3 convolutions and max pooling layers, with a couple of more traditional densely connected layers at the very end. For this tutorial, we’ll make some slight modifications to improve training time and performance. Specifically, we’ll nix all of the conv1_* layers in the network and add a single strided 3×3 convolutional layer in their place, replace all of the 2×2 max pooling operations with strided 3×3 convolutional layers, and use only a single penultimate layer (which is what we’ll be experimenting with in this tutorial). We’ll also make use of batch normalization after each of the convolutional layers to greatly speed up training. I’ll go into all of these changes in a bit more detail in the next couple of paragraphs.

The removal of the first layer greatly reduces the computational burden. A traditional VGG-like architecture uses 3×3 convolutions with a stride of 1 throughout the entire network. Using unstrided convolutional layers at the top of the network greatly increases the amount of computation required, since the initial set of convolutions operates over what is essentially the full-resolution input image. Thinking about it from a theoretical perspective, it’s likely that we won’t lose much representational power by increasing the size of the first layer’s kernel while simultaneously using a stride of 2 instead of 1. While this technically decreases the overall representational power of the network, the later layers will require much less computation, since the input size is decimated by a factor of 4 right off the bat (via conv1 and conv2_1).

Removal of max pooling layers makes sense from a theoretical perspective – we allow the network to learn the kernels associated with downsampling instead of fixing the pooling operation. These strided convolutional layers were explored in this paper with good results, so it seems reasonable to use them here as well.
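As a rough illustration (again via net_spec; the number of output channels is an assumption and would in practice match the following convolutional block), the swap amounts to replacing the fixed pooling operation with a learned, stride-2 convolution:

from caffe import layers as L, params as P

def fixed_downsample(bottom):
    # Original VGG-style downsampling: a fixed 2x2 max pooling with stride 2.
    return L.Pooling(bottom, pool=P.Pooling.MAX, kernel_size=2, stride=2)

def learned_downsample(bottom, nout):
    # Learned downsampling: a 3x3 convolution with stride 2 and padding 1,
    # which halves each spatial dimension but lets the network learn the kernels.
    return L.Convolution(bottom, kernel_size=3, stride=2, pad=1, num_output=nout,
                         weight_filler=dict(type='xavier'))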

The inclusion of batch normalization seems pretty standard nowadays. It allows the network to train considerably faster and tends to yield higher final validation accuracy. The premise is simple: machine learning algorithms are known to train faster when their inputs are normalized. Batch normalization learns normalization parameters over the output activations, resulting in more stable gradients and a faster training process. Batch normalization layers are typically applied immediately after each of the convolutional layers in the network, i.e. pre-ReLU. However, I noticed that the use of batch normalization does, in some extremely rare cases, cause the network to learn incredibly slowly in early iterations. This could’ve easily been an error on my part – I’ll leave the investigation of the root cause to future work.

Training methodology

We’ll try three different penultimate layers in this tutorial: 1) a 7×7 average pooling layer, 2) a fully connected layer, and 3) a 7×7 convolutional layer. Although the pooling operation doesn’t contain any learnable parameters, we’ll consider it a layer for the purposes of this tutorial. Each of these should generate different results, so we’ll try all three to see which performs best. The data we’ll be using to train the network is the 1000-class image data provided in the ILSVRC’12 training set, with each image resized to a fixed 256×256. You can prepare the data in LMDB format for Caffe on your own using this guide. I’m happy to share the pixel data as well, provided that you adhere to ImageNet’s terms and conditions.

Here’s what the final network looks like, minus the penultimate layer and the final fully connected layer:
[Figure: VGG16_reduced architecture diagram]

And here are the associated training files needed for Caffe (I’ve dubbed the network “VGG16_reduced” to reflect the way the network is constructed). Note that I’ve already selected learning rates, batch sizes, and initialization strategies to avoid the vanishing gradient problem. Again, I’ve left out the penultimate layer and the final fully connected layer:

Note how easy it is to define and update network architectures. Furthermore, note that it is possible to define standard neural networks via Caffe’s interface as well. If you need to define a large network, you can use the net_spec utility that ships directly with Caffe.
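For instance, here’s a rough net_spec sketch that builds the first few layers of a network and writes out the corresponding prototxt; the LMDB path and layer sizes are placeholders of mine:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=64, backend=P.Data.LMDB,
                         source='path/to/train_lmdb',
                         transform_param=dict(mirror=True, crop_size=224),
                         ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, stride=2, num_output=64,
                        weight_filler=dict(type='xavier'))
n.bn1 = L.BatchNorm(n.conv1)
n.relu1 = L.ReLU(n.bn1, in_place=True)

# Serialize the generated network definition to a prototxt file.
with open('train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))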

Attempt 1: 7×7 average pooling layer

Now that we have everything we need, let’s start training. I’m using a 4-GPU Amazon AWS2 setup with the solver above, meaning that each GPU will take a batch size of 64, and a total of 6,400,000 batches will be run across all four GPUs. We’ll start with the average pooling layer, which is used by many of the more recent architectures, such as Inception v3 and ResNet:

layer {
  bottom: "bn5_4"
  top: "pool"
  name: "pool"
  type: "Pooling"
  pooling_param {
    pool: AVE
    kernel_size: 7
  }
}
layer {
  bottom: "pool"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

Parsing the logs, we get a final top-5 validation accuracy of 87.4%. The final train/val curve looks like this:
[Figure: train/val curves for experiment pool5_no_dropout]

The first thing to note – there’s some pretty obvious overfitting going on here, i.e. the loss on the validation data is worse (by almost a factor of two) than the loss on the training data. This is actually quite common in neural networks during early stages of experimentation. The good news is that there are several ways to combat it. Here are a few common anti-overfitting strategies:
1) Bumping the weight decay,
2) Thinning the network,
3) Data augmentation, and
4) Applying dropout.

I’ll explain each of these in a bit more detail. Weight decay, also commonly known as L2 regularization, is a penalty applied to the weights which discourages “spikes” in the learned parameters. Effectively, this smooths the values learned by the network, thus encouraging a bit more generalization. Thinning the network simply reduces the number of parameters throughout the model, thereby making it less prone to simply memorizing the input images. This can be a bit difficult to visualize in high-dimensional space, but line-fitting in a 2D plane is a good analogy – imagine trying to fit a set of points on a Cartesian coordinate system using a line (2 parameters) versus a polynomial of degree 99 (100 parameters). Data augmentation applies judicious distortions to the input images, such as contrast or exposure shifts; this lets the network “see” a greater variety of images, artificially bumping the number of training samples and discouraging overfitting. Dropout removes a randomized subset of the neurons in a layer during each training pass, which forces the network to learn somewhat independent representations of the input using different subsets of neurons, thereby improving the generalizability of the model. Dropout is arguably the trickiest of these four to understand; this paper explores it in a bit more depth, and a toy sketch of the mechanism follows below.
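Since dropout is the least intuitive of the bunch, here’s a toy numpy sketch of the mechanism itself; it mirrors how “inverted” dropout is typically implemented (Caffe’s Dropout layer behaves along these lines), and is purely illustrative rather than taken from any library’s source:

import numpy as np

def dropout(activations, drop_prob=0.5, train=True):
    if not train:
        # At test time all neurons are kept and no extra scaling is needed.
        return activations
    # Zero each activation with probability drop_prob, then scale the survivors
    # by 1 / (1 - drop_prob) so the expected activation magnitude is unchanged.
    mask = np.random.rand(*activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)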

There are other methods of regularization as well, but the four above are perhaps the most general and best-understood approaches for convnets. For the purposes of this experiment, let’s try adding some dropout to the penultimate layer of our network. This can be done easily in Caffe by adding a Dropout layer.

layer {
  bottom: "bn5_4"
  top: "pool5"
  name: "pool5"
  type: "Pooling"
  pooling_param {
    pool: AVE
    kernel_size: 7
  }
}
layer {
  name: "drop5"
  type: "Dropout"
  bottom: "pool5"
  top: "pool5"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  bottom: "pool5"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This will apply in-place dropout (with probability 0.5) to the pool5 layer. With dropout enabled, we get a final top-5 validation accuracy of 86.5% and the following train/val curve:
[Figure: train/val curves for experiment pool5_0.5_dropout]

Although the train and validation curves are visibly closer, the final validation accuracy is actually marginally worse with dropout than without! This is an indicator that better regularization or a slightly different network architecture is needed. I’ll leave the exploration of this as a future exercise – we’ll continue with the experiment as is for now.

Attempt 2: fully connected layer (1024 hidden units)

Although we were already able to get decent results using the 7×7 average pooling layer, let’s continue with our experiments. This time, we’ll attempt a fully connected layer, meaning that the connections are dense – each neuron in this layer has a direct connection to every activation of the previous layer. This means that the “receptive field” of a fully connected layer is, in essence, the entire image. Note that the fully connected layers in VGG16 and AlexNet contain 4096 hidden units each – I’ve reduced the number of hidden units here to lower the number of parameters required by the network:

layer {
  bottom: "bn5_4"
  top: "fc6"
  name: "fc6"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1024
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}
layer {
  bottom: "fc6"
  top: "bn6"
  name: "bn6"
  type: "BatchNorm"
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
}
layer {
  bottom: "bn6"
  top: "bn6"
  name: "relu6"
  type: "ReLU"
}
layer {
  bottom: "bn6"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This network gets a final top-5 validation accuracy of 85.0%, with a train/val curve like so:
[Figure: train/val curves for experiment fc6]

Not looking so good – the overfitting here is considerably more noticeable, and the final validation accuracy is worse than that of the global average pooling variant. A couple of ways to improve this network would be to add dropout or to increase the weight decay on the penultimate layer, which can be accomplished by bumping the decay_mult parameter of the fc6 layer; a quick sketch of the latter follows.
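For reference, here’s a minimal, hypothetical net_spec snippet showing how a larger decay_mult could be attached to fc6; the value 10 is arbitrary and purely for illustration:

from caffe import layers as L

def fc6_with_stronger_decay(bottom, decay=10):
    # decay_mult multiplies the solver's global weight_decay for this layer's weights.
    return L.InnerProduct(bottom, num_output=1024,
                          weight_filler=dict(type='xavier'),
                          param=[dict(lr_mult=1, decay_mult=decay)])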

Attempt 3: 7×7 convolutional layer (no padding)

The final layer we’ll try is a 7×7 convolutional layer with 64 kernels (which equates to 7x7x512x64=1.6M weights). Since we’re applying no padding to this layer’s input, this essentially reduces the output of the conv5 layer down to a series of 1×1 activations. In terms of output size, it accomplishes something similar to the global average pooling layer, but with learnable parameters:

layer {
  bottom: "bn5_4"
  top: "conv6"
  name: "conv6"
  type: "Convolution"
  convolution_param {
    num_output: 64
    kernel_size: 7
    bias_term: false
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}
layer {
  bottom: "conv6"
  top: "bn6"
  name: "bn6"
  type: "BatchNorm"
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
}
layer {
  bottom: "bn6"
  top: "bn6"
  name: "relu6"
  type: "ReLU"
}
layer {
  bottom: "bn6"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This network gets a final top-5 validation accuracy of 77.0%, with a train/val curve like so:
[Figure: train/val curves for experiment conv6]

Closing words

Although global average pooling achieved the highest final validation accuracy, it is by no means the “best” method for training networks. Small tweaks may be necessary to squeeze out the last bit of performance on a given dataset, and different computer vision problems will respond differently to different architectures. Given the consistent overfitting, a reasonable conclusion to draw from this experiment is that the model we used simply isn’t regularized strongly enough for the input data.

With ever deeper (and wider) models being prevalent in today’s literature, I’d also like to point out that deeper does not necessarily mean better. At some point, your network will get diminishing returns, and over-complicating the model may cause it to generalize poorly to other datasets. This is why VGG16 remains my favorite network architecture, despite the fact that there are newer models which perform better on the ImageNet dataset.

It’s also important to note that Caffe is just one of many great tools for experimenting with neural networks. Theano, Torch, and TensorFlow are three other well-known machine learning and neural network libraries, all of which have GPU support and active community contributions. I’ll likely re-do this entire tutorial in TensorFlow sometime in the near future, so keep your eyes peeled for that!

1Before diving in too far, I should mention that I’m nowhere near an expert on deep learning or convolutional neural networks. This field happened to catch my attention over the past several months, which prompted me to go through tutorials and read papers related to the topic. This post reflects part of what I’ve learned in the last half-year or so.

2Although it’s relatively straightforward, I can talk about the setup process I had to go through here if there’s enough interest.

Installing Caffe on OS X El Capitan

(There won’t be much content in this post – it’s mostly for future reference.)

Step 0: Install brew and CUDA

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew doctor
brew update

Download and install CUDA

echo "export PATH=/usr/local/cuda/bin:$PATH" >> ~/.profile
echo "export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH" >> ~/.profile

Step 1: Caffe dependencies

brew install python
brew install wget
brew install protobuf
brew install boost-python
brew install lmdb
brew install leveldb
brew install snappy
brew install gflags
brew install glog
brew tap homebrew/science
brew install opencv
brew install hdf5

Step 2: Caffe Python dependencies

pip install numpy
pip install scipy
pip install Cython
pip install scikit-image
pip install protobuf
pip install lmdb
pip install leveldb
pip install python-gflags
pip install h5py

Step 3: Build Caffe from source

git clone https://github.com/fzliu/caffe.git
cd caffe && wget https://gist.githubusercontent.com/fzliu/bef9b8e8ea4f3c082101/raw/8697dff9233f42ffca50595e0fb6915b67a4566e/Makefile.config
make -j8 && make pycaffe
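
Once the build finishes, a quick sanity check of the Python bindings can look something like the following; the path is a placeholder for wherever you cloned the repository:

import sys
sys.path.insert(0, '/path/to/caffe/python')  # adjust to your clone location

import caffe
caffe.set_mode_cpu()
print('pycaffe imported successfully')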

Artistic style transfer

(Disclaimer: Aside from the picture of myself, I don’t own any of the images or artwork shown in this post.)

I recently implemented this paper on artistic style transfer in Python, using Caffe to perform the neural net operations. Dylan Paiton, who worked with the Flickr Vision team as a summer intern, benchmarked the code and committed several nice optimizations as well.

The premise behind the paper is quite simple – it aims to take the style of one image (preferably a sketch, painting, or drawing) and transfer it to another. I was pleasantly surprised by the quality of the results, especially given how difficult a problem this is. This page contains a few more artistic style transfer examples and shows some of the features of the code. To fully understand the paper (and the rest of this post), it’s probably good to have a little bit of knowledge in the realm of deep learning, with specific application to convolutional neural networks. If you’re interested in trying the code out for yourself, you can download it here, along with instructions on how to use it and a small set of initial examples.
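For a rough idea of what’s being optimized, here’s a simplified numpy sketch of the loss from the paper for a single convnet layer; the real formulation sums the style term over several layers and normalizes a bit differently, and the feature arrays are assumed to be channels × height × width activations extracted from the network. The style-to-content ratio mentioned below plays the role of the weighting term here:

import numpy as np

def gram_matrix(features):
    # Channel-to-channel correlations of the feature maps; this is what captures "style".
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f.dot(f.T) / (h * w)

def transfer_loss(gen_feats, content_feats, style_feats, style_to_content=1e5):
    # Content term: match the raw activations. Style term: match the Gram matrices.
    content_loss = 0.5 * np.sum((gen_feats - content_feats) ** 2)
    style_loss = 0.25 * np.sum((gram_matrix(gen_feats) - gram_matrix(style_feats)) ** 2)
    return content_loss + style_to_content * style_loss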

Changing the style image
Let’s start off with a few artistic styles. The target artwork is (in order from top to bottom) 1) American Gothic by Grant Wood, 2) Impression, Sunrise by Claude Monet, 3) Rain Princess by Leonid Afremov, 4) The Scream by Edvard Munch, and 5) The Persistence of Memory by Salvador Dali. I used the VGG model with a style-to-content ratio of 1e5 and cut off the loss optimization at 500 iterations for each of these examples.

[Figures: the content image, followed by outputs styled after American Gothic, Impression, Sunrise, Rain Princess, The Scream, and The Persistence of Memory]

As expected, different style images generate vastly different results.

Number of iterations
Next, let’s take a look at the effect of modifying the maximum number of iterations performed. If you’re concerned about runtime, this option could be very important for you, since each iteration under the VGG model is quite expensive (especially when run on the CPU). For each of the output images below, I used 50, 100, 200, 400, and 800 iterations respectively. The style image is Picasso’s 1907 self-portrait.

[Figures: the style and content images, followed by outputs after 50, 100, 200, 400, and 800 iterations]

As with nearly all iterative minimization algorithms, you get diminishing returns as the number of iterations goes up – an effect that is clearly evident here.

Model used
The model used can drastically change the way the results look as well. Currently, the style-transfer code supports three different models: AlexNet1, GoogleNet, and VGG. I let the L-BFGS optimizer run to completion in each case.

[Figures: the style and content images, followed by outputs from AlexNet, GoogleNet, and VGG]

VGG is an extremely wide and deep model. Each convolutional layer uses small kernels with a small stride, which keeps the feature maps at high resolution and enables the convnet to capture an extremely thorough style representation. As a result, VGG-based output images are arguably the “best” in terms of look. The results generated using CaffeNet appear quite awful in comparison, but its runtime is extremely low; optimization took a total of around 30 seconds on my NVIDIA GeForce GT 750M GPU with AlexNet. GoogleNet’s results probably lie somewhere in the middle. Although it has fewer total weights than AlexNet, it is nonetheless an extremely deep network.

More examples
I’ll continue to update this page with examples as I add functionality to the code.

1Here, AlexNet actually refers to CaffeNet, which is a slightly modified version of AlexNet used by BVLC as a reference convnet in many of their public examples and benchmarks.

Visual machine learning tutorial

The purpose of this post is to share an external article on machine learning. Unlike many other machine learning tutorials I’ve seen, which rely heavily on text and maybe a few plots, this one feels interactive, with the text following the diagrams.

You can read up on it here.

Overall, I think the tutorial is extremely well done – hats off to its creators. I do, however, have a few complaints:
1) The machine learning algorithm used in the example is the well-known decision tree; SVMs and neural networks, which are arguably more important than decision trees, aren’t referenced.
2) The tutorial leans toward computational statistics and doesn’t acknowledge applications of machine learning to computer vision or natural language processing.
3) There is no mention of boosting or ensembles, which are two important concepts in machine learning.

Hopefully these will be addressed in a future tutorial.

UPDATE: The tutorial is also available in Chinese.