Using Caffe to experiment with different architectures

For the past several months, I’ve been absorbed with deep learning, especially as it pertains to computer vision. Although deep learning is essentially synonymous with (deep) neural networks, over the past several years it has emerged as the state-of-the-art approach for learning visual and linguistic representations.

Due to the continuation of Moore’s Law and various improvements in training techniques, convolutional neural networks have come to the forefront of computer vision research. Although most people consider Yann LeCun to be the primary founding father of convnets, I think an equally solid case could be made for Kunihiko Fukushima, who described a closely related idea back in 1980 via this paper. With deep learning’s rise, a variety of high-level neural network and machine learning libraries and frameworks have gained popularity, almost all of which perform their computation on GPUs. In this post, I’ll be using Caffe, a popular C++ toolkit developed and maintained by the Berkeley Vision and Learning Center (BVLC). Its speed, flexibility, and production readiness make it a popular choice for researchers and industry engineers alike.

In this post, I’ll talk about how to use Caffe to experiment with different convnet architectures1. Specifically, I’ll show how you can easily define a network, swap layers, and re-train the updated network using Caffe’s network definition interface. Stanford’s CS231n lecture notes should provide a good refresher on deep convolutional neural networks, if you need it. If you’d like to follow along, it would also be good to ensure that you have the right dependencies prior to actually installing Caffe on your machine.

The entirety of this tutorial will be based on the well-known “VGG16” architecture, developed by the Visual Geometry Group at Oxford University for the ImageNet competition (ILSVRC’14). The VGG16 architecture is composed almost entirely of 3×3 convolutions and max pooling layers, with a couple of more traditional densely connected layers at the very end. For this tutorial, we’ll make some slight modifications to improve training time and performance. Specifically, we’ll nix all of the conv1_* layers in the network and add a single strided 3×3 convolutional layer in their place, replace all of the 2×2 max pooling operations with strided 3×3 convolutional layers, and utilize only a single penultimate layer (which we’ll be experimenting with in this tutorial). We’ll also make use of batch normalization after each of the convolutional layers to greatly speed up training. I’ll go into each of these changes in a bit more detail in the next couple of paragraphs.

Replacing the first block of convolutions greatly reduces the computational burden. A traditional VGG-like architecture uses 3×3 convolutions with a stride of 1 throughout the entire network. Using unstrided convolutional layers at the top of the network greatly increases the amount of computation required, since the initial set of convolutions operates over what is essentially the entire input image. Thinking about it from a theoretical perspective, it’s likely that we won’t lose much representational power by increasing the size of the first layer’s kernel while simultaneously using a stride of 2 instead of 1. While this technically decreases the overall representational power of the network, the later layers will require much less computation since the input is decimated by a factor of 4 right off the bat (via the strides in conv1 and conv2_1).
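As a rough sketch in Caffe’s prototxt format, the replacement first layer looks something like the following (the number of output channels and the padding are illustrative assumptions on my part, not necessarily what the final network below uses):

layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  convolution_param {
    num_output: 64    # illustrative channel count
    kernel_size: 3
    stride: 2         # stride 2 halves the spatial resolution right away
    pad: 1
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}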

Removing the max pooling layers also makes sense from a theoretical perspective – we allow the network to learn the kernels that perform the downsampling instead of fixing the pooling operation. These strided convolutional layers were explored in this paper with good results, so it seems reasonable to use them here as well.
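Concretely, wherever the original VGG16 would place a 2×2 max pooling layer between convolutional blocks, we let a stride-2 convolution do the downsampling instead. An illustrative replacement (the layer names and channel count here are hypothetical) looks like this:

layer {
  bottom: "bn2_2"
  top: "conv3_1"
  name: "conv3_1"
  type: "Convolution"
  convolution_param {
    num_output: 256   # illustrative channel count
    kernel_size: 3
    stride: 2         # the stride performs the downsampling that max pooling used to
    pad: 1
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}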

The inclusion of batch normalization seems pretty standard nowadays. It allows the network to train considerably faster and tends to yield higher final validation accuracy. The premise is simple: machine learning algorithms are known to train faster when their inputs are normalized. Batch normalization normalizes each layer’s output activations using statistics computed over the mini-batch, resulting in more stable gradients and a faster training process. Batch normalization layers are typically applied immediately after each of the convolutional layers in the network, i.e. pre-ReLU. However, I noticed that the use of batch normalization does, in some extremely rare cases, cause the network to learn incredibly slowly in early iterations. This could easily have been an error on my part – I’ll leave the investigation of the root cause to future work.
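To make the placement concrete, here’s a sketch of a single conv → batch norm → ReLU stack, mirroring the pattern used in the full network definition below (the layer names are illustrative):

layer {
  bottom: "conv3_1"
  top: "bn3_1"
  name: "bn3_1"
  type: "BatchNorm"
  # the layer's three internal blobs (running mean, running variance, and the
  # moving-average factor) are computed from batch statistics rather than
  # learned by backprop, hence the zero learning rate multipliers
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
}
layer {
  bottom: "bn3_1"
  top: "bn3_1"
  name: "relu3_1"
  type: "ReLU"    # applied in place after the normalization, i.e. pre-ReLU batch norm
}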

Training methodology

We’ll try three different penultimate layers in this tutorial: 1) a 7×7 average pooling layer, 2) a fully connected layer, 3) a 7×7 convolutional layer. Although the pooling operation doesn’t contain any learnable parameters, we’ll consider it a layer for the purposes of this tutorial. Each of these should generate different results – we’ll try each of them to see which performs the best. The data we’ll be using to train the network is the 1000-class image data provided in the ILSVRC’12 training set, with each image resized to a fixed 256×256. You can prepare the data in LMDB format for Caffe on your own using this guide. I’m happy to share pixel data as well, provided that you adhere to ImageNet’s terms and conditions.
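For reference, a training data layer reading such an LMDB might look roughly like the following; the source path, mean values, and 224×224 crop are assumptions on my part, so adjust them to match your own setup:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true      # random horizontal flips
    crop_size: 224    # random 224x224 crops out of the 256x256 images
    mean_value: 104   # approximate per-channel (BGR) ImageNet means
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "ilsvrc12_train_lmdb"   # hypothetical path to the LMDB built via the guide above
    batch_size: 64
    backend: LMDB
  }
}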

Here’s what the final network looks like, minus the penultimate layer and the final fully connected layer (click to enlarge):
VGG16 reduced architecture

And here are the associated training files needed for Caffe (I’ve dubbed the network “VGG16_reduced” to reflect the way the network is constructed). Note that I’ve already selected learning rates, batch sizes, and initialization strategies to avoid the vanishing gradient problem. Again, I’ve left out the penultimate layer and the final fully connected layer:
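For a rough idea of what the solver side looks like, here’s a sketch (not the exact file); the specific values are illustrative placeholders rather than the exact settings I used:

net: "VGG16_reduced_train_val.prototxt"   # hypothetical filename for the network definition
base_lr: 0.01           # illustrative starting learning rate
lr_policy: "step"
gamma: 0.1
stepsize: 100000
momentum: 0.9
weight_decay: 0.0005    # the L2 regularization strength discussed later
max_iter: 1600000       # illustrative; with 4 GPUs, each solver iteration consumes 4 batches
test_iter: 1000
test_interval: 10000
display: 100
snapshot: 50000
snapshot_prefix: "snapshots/VGG16_reduced"
solver_mode: GPU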

Note how easy it is to define and update network architectures. Furthermore, note that it is possible to define standard neural networks via Caffe’s interface as well. If you need to define a large network, you can use the net_spec utility that ships directly with Caffe.

Attempt 1: 7×7 average pooling layer

Now that we have everything we need, let’s start training. I’m using a 4-GPU Amazon AWS2 setup with the solver above, meaning that each GPU will take a batch size of 64, and a total of 6,400,000 batches will be run across all four GPUs. We’ll start with the average pooling layer, which is used by many of the more recent architectures, such as Inception v3 and ResNet:

layer {
  bottom: "bn5_4"
  top: "pool"
  name: "pool"
  type: "Pooling"
  pooling_param {
    pool: AVE
    kernel_size: 7
  }
}
layer {
  bottom: "pool"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

Parsing the logs, we get a final top-5 validation accuracy of 87.4%. The final train/val curve looks like this:
Experiment: pool5_no_dropout

The first thing to note – there’s some pretty obvious overfitting going on here, i.e. the loss on the validation data is worse (by almost a factor of 2) than the loss on the training data. This is actually quite common in neural networks during early stages of experimentation. The good news is that there are several ways to combat this. Here are a few common anti-overfitting strategies:
1) Bumping the weight decay,
2) Thinning the network,
3) Data augmentation, and
4) Applying dropout.

I’ll explain each of these in a bit more detail. The weight decay of the network, also commonly known as L2 regularization, is a penalty applied to the weights which discourages “spikes” in the learned parameters. Effectively, this smooths the values learned by the network, thus encouraging a bit more generalization. Thinning the network simply reduces the number of parameters throughout the entire model, thereby making it less prone to simply memorizing the input images. This can be a bit difficult to visualize in high-dimensional space, but line-fitting in a 2D plane is a good analogy – imagine trying to fit a set of points on a Cartesian coordinate system using a line (2 parameters) versus a polynomial of degree 99 (100 parameters). Data augmentation artificially increases the number of training samples by applying judicious distortions to the input images, such as contrast or exposure shifts. This lets the network “see” a greater variety of images without requiring additional labeled data, which discourages overfitting. Dropout randomly removes a fraction p of the neurons in a layer on each training pass. This forces the network to learn redundant, independent representations of the input rather than relying on any particular subset of neurons, thereby improving the generalizability of the model. Dropout is arguably the trickiest of these four to understand; this paper explores it in a bit more depth.

There are other methods of regularization as well, but the four above are perhaps the most general and best understood in the context of convnets. For the purposes of this experiment, let’s try adding some dropout to the penultimate layer of our network. This can be done easily in Caffe by adding a Dropout layer.

layer {
  bottom: "bn5_4"
  top: "pool5"
  name: "pool5"
  type: "Pooling"
  pooling_param {
    pool: AVE
    kernel_size: 7
  }
}
layer {
  name: "drop5"
  type: "Dropout"
  bottom: "pool5"
  top: "pool5"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  bottom: "pool5"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This will apply in-place dropout (with probability 0.5) to the pool5 layer. With dropout enabled, we get a final top-5 validation accuracy of 86.5% and the following train/val curve:
Experiment: pool5_0.5_dropout

Although the train and validation curves are visibly closer, the final validation accuracy is actually marginally worse with dropout than without! This is an indicator that better regularization or a slightly different network architecture is needed. I’ll leave the exploration of this as a future exercise – we’ll continue with the experiment as is for now.

Attempt 2: fully connected layer (1024 hidden units)

Although we were already able to get decent results using the 7×7 average pooling layer, let’s continue with our experiments. This time, we’ll attempt a fully connected layer, meaning that the connections are dense – each neuron in this layer is connected to every neuron of the previous layer. This means that the “receptive field” for fully connected layers is, in essence, the entire image. Note that the fully connected layers in VGG16 and AlexNet contain 4096 hidden units – I’ve reduced the number of hidden units here to lower the number of parameters required by the network:

layer {
  bottom: "bn5_4"
  top: "fc6"
  name: "fc6"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1024
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}
layer {
  bottom: "fc6"
  top: "bn6"
  name: "bn6"
  type: "BatchNorm"
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
}
layer {
  bottom: "fc6"
  top: "fc6"
  name: "relu6"
  type: "ReLU"
}
layer {
  bottom: "bn6"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This network gets a final top-5 validation accuracy of 85.0%, with the following train/val curve:
Experiment: fc6

Not looking so good – the overfitting here is considerably more noticeable, and the final validation loss is worse than that of the global average pooling layer. A couple of ways to improve this network would be to add dropout or to increase the weight decay on the penultimate layer; the latter can be accomplished by increasing the decay_mult parameter of the fc6 layer, as sketched below.
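As a sketch of what that might look like, here’s the fc6 layer with a heavier weight decay, plus an in-place dropout layer that would sit between relu6 and the classifier; the decay_mult of 10 and dropout ratio of 0.5 are arbitrary starting points, not tuned values:

layer {
  bottom: "bn5_4"
  top: "fc6"
  name: "fc6"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1024
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 10    # stronger L2 penalty on fc6's weights (the default multiplier is 1)
  }
}
layer {
  name: "drop6"
  type: "Dropout"
  bottom: "bn6"
  top: "bn6"
  dropout_param {
    dropout_ratio: 0.5   # in-place dropout on the normalized, rectified fc6 activations
  }
}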

Attempt 3: 7×7 convolutional layer (no padding)

The final layer we’ll try is a 7×7 convolutional layer with 64 kernels (which equates to 7×7×512×64 ≈ 1.6M weights). Since we’re applying no padding to this layer’s input, this essentially reduces the output of the conv5 block down to a series of 1×1 activations. In terms of output size, it accomplishes something similar to the global average pooling layer, but with learnable parameters:

layer {
  bottom: "bn5_4"
  top: "conv6"
  name: "conv6"
  type: "Convolution"
  convolution_param {
    num_output: 64
    kernel_size: 7
    bias_term: false
    weight_filler {
      type: "xavier"
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
}
layer {
  bottom: "conv6"
  top: "bn6"
  name: "bn6"
  type: "BatchNorm"
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
}
layer {
  bottom: "bn6"
  top: "bn6"
  name: "relu6"
  type: "ReLU"
}
layer {
  bottom: "bn6"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
}
layer {
  bottom: "classifier"
  bottom: "label"
  top: "loss"
  name: "loss"
  type: "SoftmaxWithLoss"
}

This network gets a final top-5 validation accuracy of 77.0%, with the following train/val curve:
Experiment: conv6

Closing words

Although global average pooling achieved the highest final validation accuracy, it is by no means the “best” method for training networks. Small tweaks may be necessary to squeeze the last bit of performance out of a particular dataset, and different computer vision problems will respond differently to different architectures. Given the consistent overfitting, a reasonable conclusion we can draw from this experiment is that the model we used simply isn’t regularized strongly enough for the input data.

With ever deeper (and wider) models being prevalent in today’s literature, I’d also like to point out that deeper does not necessarily mean better. At some point, your network will hit diminishing returns, and over-complicating the model may cause it to generalize poorly to other datasets. This is why VGG16 remains my favorite network architecture, despite the fact that there are newer models which perform better on the ImageNet dataset.

It’s also important to note that Caffe is just one of many great tools for experimenting with neural networks. Theano, Torch, and TensorFlow are three other well-known machine learning and neural network libraries, all of which have GPU support and active community contributions. I’ll likely re-do this entire tutorial in TensorFlow sometime in the near future, so keep your eyes peeled for that!

1Before diving in too far, I should mention that I’m nowhere near an expert on deep learning or convolutional neural networks. This field happened to catch my attention over the past several months, which prompted me to go through tutorials and read papers related to the topic. This post reflects part of what I’ve learned in the last half-year or so.

2Although it’s relatively straightforward, I can talk about the setup process I had to go through here if there’s enough interest.