Vision transformers (ViTs) have seen an incredible rise in the past three years. They have an obvious upside: in a visual recognition setting, the receptive field of a pure ViT is effectively the entire image1. A larger receptive field gives each neuron a wider view of the input pixels/voxels, at the cost of a greater potential to overfit on small datasets and significantly more computation. In particular, vanilla ViTs retain the quadratic time complexity (w.r.t. the number of input patches) of language models with dense attention.

Kernels in convolutional networks, on the other hand, are applied identically regardless of which input pixel/voxel they are centered on, a property typically referred to as translation equivariance. This is desirable because it allows the model to effectively recognize patterns and objects regardless of where they appear spatially. The weight sharing across convolutional layers also makes convnets highly parameter-efficient and less prone to overfitting - a property ViTs do not have.

As such, you might expect ViTs and convnets to be used about equally in production environments that leverage visual models - ViTs for more “global” tasks such as scene recognition, and convnets for more “local” tasks such as object detection. Instead, we’ve been inundated with work that utilizes ViTs, along with bold high-level claims (mostly by media outlets) that convnets are a thing of the past.

Curious to see if I could lend a hand in debunking this claim, I set out to figure out whether we could match the performance of both ViT and ConvNeXt with a mostly vanilla ResNet - which I’ll dub GResNet (Good/Great/Gangster/Godlike/etc. ResNet). With a bit of experimentation on Imagenet-1k, I was able to achieve 82.0% accuracy at 224x224 eval size with no extra training data, matching ConvNeXt-S and surpassing ViT-S, both of which have more parameters and flops.

Training methodology

Here, I started by adopting the training methodology set out in Pytorch’s late 2021 blog post, where they achieved an impressive 80.8% accuracy on Imagenet-1k with a stock ResNet50 model. I won’t list out the entire training methodology here, but here are a couple of key points to note:

  • We stick with SGD as the optimizer, rather than going for RMSProp or Adam (or any of their variants).
  • The scheduler uses cosine decay with five warmup epochs and 600 total epochs. That’s quite a bit, but we’ll get around to lowering it later.
  • We utilize a whole slew of augmentations and regularization techniques found in modern literature, including, but not limited to, label smoothing, mixup, cutmix, and model EMA.
  • To prevent overfitting on the validation dataset, I didn’t do much hyperparameter tuning or grid search beyond the stock training methodology listed in the blog post.

Nearly all of these training optimizations have already been used to boost the performance of modern visual recognition models.
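
For concreteness, here’s a rough sketch of what the optimizer and scheduler portion of that recipe looks like in PyTorch. The hyperparameter values below are illustrative placeholders rather than the exact settings from the Pytorch blog post, and `train_one_epoch` is a stand-in for your own training loop.

```python
import torch
from torchvision.models import resnet50
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

warmup_epochs, total_epochs = 5, 600

model = resnet50()  # stand-in backbone; swap in the modified ResNet described below
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9, weight_decay=2e-5)

# Linear warmup for the first five epochs, then cosine decay for the remainder.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # train_one_epoch(model, optimizer, ...)  # label smoothing, mixup/cutmix, EMA live here
    scheduler.step()
```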

Architectural modifications

Four small architectural modifications enable ResNet to surpass ViT and match ConvNeXt on the Imagenet-1k dataset.

ResNet-d

The first order of business is to embrace ResNet “modernizations”2. For completeness, here are the changes listed out:

  1) The initial 7x7 convolution is changed to a sequence of three 3x3 convolutions with 32, 64, and 128 output channels, respectively. The stride of two remains on the first convolutional layer. We now use exclusively 3x3 and 1x1 convolutions across the entire network, all while retaining the receptive field of the original stem.
  2) Strides in downsampling residual blocks are moved from the first 1x1 convolutional layer to the subsequent 3x3 convolutional layer. This has the effect of capturing all input pixels in a downsampling block, since a strided 1x1 convolution effectively skips three out of every four of them.
  3) The max pooling in the stem is removed. The first 3x3 convolution of the first residual block now has a stride of two, matching the other downsampling blocks. While max pooling is theoretically useful for retaining edges, corners, and other low-level features, I haven’t found it to be particularly useful in practice.
  4) The strided 1x1 convolution in the shortcut connections of downsampling blocks is replaced with 2x2 average pooling followed by a standard 1x1 convolutional layer. Again, this has the effect of capturing all input activations rather than just one out of every four.
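
Below is a minimal PyTorch sketch of what changes (1)-(4) look like in practice. The module and function names are mine, the channel widths come from the list above, and this illustrates the idea rather than reproducing the exact GResNet code.

```python
import torch.nn as nn

def conv_norm_act(in_ch, out_ch, kernel_size=3, stride=1):
    """Conv followed by BN and an activation (see the ReLU -> SiLU section below)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )

# (1) + (3): three 3x3 convs replace the 7x7 conv, the stride of two stays on the
# first conv, and there is no max pooling afterwards.
stem = nn.Sequential(
    conv_norm_act(3, 32, stride=2),
    conv_norm_act(32, 64),
    conv_norm_act(64, 128),
)

# (4): the downsampling shortcut average-pools first, then projects with a
# stride-1 1x1 conv, so every input activation contributes.
def downsample_shortcut(in_ch, out_ch):
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

# (2): inside the residual branch of a downsampling block, the stride of two sits
# on the 3x3 conv, i.e. 1x1 (stride 1) -> 3x3 (stride 2) -> 1x1 (stride 1).
```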

ReLU -> SiLU

ReLU has two weaknesses compared to other activation functions: 1) it is not smooth (ReLU is, strictly speaking, non-differentiable at 0), and 2) it suffers from the “dying ReLU” problem, where pre-activation values are near-universally negative during a forward pass, causing gradients to always be zero and the neuron to carry no information.

As a direct result, a number of novel activations have been proposed over the years - Leaky ReLU, Parametric ReLU, ELU, and Softplus are a few well-known examples. The idea behind all of these is to fix one or both of the above problems; Parametric ReLU, for example, attempts to fix the dying ReLU problem by introducing a learnable parameter $\alpha$ that defines the slope of the function for negative pre-activation values. For this experiment, I went with the Sigmoid Linear Unit (SiLU), defined by $SiLU(x) = \frac{x}{1+e^{-x}}$, which has already seen success with a number of visual recognition models.

Although I could’ve used Gaussian Error Linear Units (GELUs), I decided to use SiLU because it has an inplace parameter and could serve as a drop-in replacement for ReLU in the original reference implementation. GELU might have performed better; it is widely used with great success in language models. In fact, GELU and SiLU are very similar - so much so that there exists a fairly accurate approximation that relates the two: $GELU(x) \approx \frac{SiLU(1.702x)}{1.702}$. As a direct result, a common belief is that networks trained with GELU are essentially equivalent to networks trained with SiLU in terms of representational capacity. This, however, is not true without the proper initialization scheme and zero weight decay.
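
As a quick sanity check of that relationship (nothing here is specific to GResNet), you can compare the two activations numerically:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, steps=1001)

gelu = F.gelu(x)                          # GELU(x) = x * Phi(x)
silu_approx = F.silu(1.702 * x) / 1.702   # SiLU-based approximation

# The two curves agree to within roughly 1e-2 over this range.
print((gelu - silu_approx).abs().max())
```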

Lastly, I would expect a SiLU network to perform better with stochastic depth. I suspect ReLU may act as a mild regularizer by adding a certain level of “sparsity” to the network activations, which can be great for overparameterized models (pretty much all of today’s LLMs), but not so much for parameter-efficient models such as GResNet. SiLU, on the other hand, has nonzero gradients for all values of $x$ except $x = -W(\frac{1}{e}) - 1 \approx -1.278$. As such, with the switch from ReLU to SiLU, adding back a bit of regularization via dropout or stochastic depth might be warranted. I’ll have to experiment more with this in the upcoming weeks.
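
For reference, here’s roughly what stochastic depth looks like when bolted onto a residual block. This is a generic drop-path sketch (the `DropPath` name is mine), not something that is in the GResNet model as trained.

```python
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample during training."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast across channels and spatial dims.
        mask = x.new_empty(x.shape[0], 1, 1, 1).bernoulli_(keep_prob)
        return x * mask / keep_prob

# Usage inside a residual block: out = shortcut(x) + drop_path(branch(x))
```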

Split normalization

Vanilla ResNet uses a generous amount of batch normalization; one BN layer per convolutional layer, to be exact. The original BN paper argues that BN reduces internal covariate shift (ICS) - defined by the authors as the change any intermediate layer sees in its inputs as upstream network weights shift - but this has since been shown to be untrue (I’ll elaborate on this in a bit). I wanted to go back to the original ICS thesis, i.e. the normalization in BN was meant to re-center and re-scale the activations, while the affine transformation immediately following normalization was meant to preserve each layer’s representational capacity. It simply made no sense to me that these two must be applied back-to-back. Backpropagation effectively treats each individual layer of neurons as an independent learner, and the most sensible thing to do is to normalize layer inputs rather than outputs. The affine transformation, on the other hand, should remain sandwiched between the convolutions and the activations.

Long story short, I found that splitting BN into its two substeps - normalization occurring before the convolutional layer and the affine transformation occurring immediately after it - improves performance by a whopping 0.6% on Imagenet-1k. While this does affect training speed, it won’t affect inference speed, since the normalization and affine transformation can be represented as diagonal matrices and fused with the weights of the convolutional layer during inference.
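
To make the idea concrete, here’s a minimal sketch of what a “split normalization” conv block could look like in PyTorch. The module name and layout are my own shorthand for the description above, not the exact GResNet implementation.

```python
import torch
import torch.nn as nn

class SplitNormConv(nn.Module):
    """Normalize the conv's *input*, apply the per-channel affine to its *output*."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Normalization only -- no learnable scale/shift before the convolution.
        self.norm = nn.BatchNorm2d(in_ch, affine=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        # Affine transformation (per-channel scale and shift) after the convolution.
        self.gamma = nn.Parameter(torch.ones(1, out_ch, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, out_ch, 1, 1))
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        x = self.conv(self.norm(x))
        return self.act(x * self.gamma + self.beta)

# At inference, the normalization and the affine are both per-channel linear (diagonal)
# operations, so they can be folded into the conv weights just like a fused BN layer.
```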

I was curious to learn more about the theory behind “split” normalization, since I couldn’t find it anywhere else in ML literature - I tried searching Google Scholar as well as Ross Wightman’s timm library but couldn’t find anything related to this type of modification3. We still don’t fully understand why or how batch normalization improves network trainability, but the most compelling research in my eyes comes from Santurkar et al.’s 2018 paper. In it, they argue that batch normalization works well because it significantly smooths the loss landscape, thereby enabling the network to converge faster and to better local minima. Specifically, they show that BN improves the Lipschitzness of the loss function, i.e. $|L(\mathbf{x}_0) - L(\mathbf{x}_1)| \leq K\|\mathbf{x}_0 - \mathbf{x}_1\|$ for some constant $K$ and all $\mathbf{x}_0$, $\mathbf{x}_1$.

The results

Before I present the results, you might be asking: why Imagenet-1k? Aren’t there a number of great labelled datasets out there, e.g. YFCC, LAION, etc.? Secondly, since modern LLMs are exclusively transformer-based, isn’t it beneficial to also use transformers for vision in order to take advantage of techniques like cross-attention (e.g. Flamingo), or linearly projecting patches directly into the transformer decoder (e.g. Fuyu-8b)?

[Chart: Imagenet-1k top-1 validation accuracy (%) vs. training epoch, two curves over roughly 450 epochs, both converging to about 82%.]

The final model reaches 82.0% accuracy on Imagenet-1k without any external sources of data! For reference, ConvNeXt achieves 82.1% accuracy with 27M parameters, while ViT-S (DeiT-III) achieves 81.4% accuracy with 22M parameters.

Ending words

I recently came across a paper from folks at Google DeepMind titled ConvNets Match Vision Transformers at Scale. The concluding section contains a stark result: “Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly.” This simply reinforces a lesson that must be repeated time and time again: model architecture actually matters very little compared to 1) a large, high-quality dataset, 2) a solid, highly parallelizable training strategy, and 3) having lots of GPUs/TPUs. This is the beauty of stochastic gradient descent: neural architecture search has effectively never given us a top-performing model, and I’m fairly certain that a modern twist on old-school recurrent networks (LSTMs/GRUs) could potentially give us an LLM that performs similarly to GPT-3.5.

ResNet strikes back... again?

I’ve been working on a couple of other random projects in my spare time as well - I’ve been training some ResNet + attention models without patchification that seem to be approaching the performance of hybrid models. I’ll try to share some of my findings in another post. Stay tuned!

Addendum - comparing embedding quality

I thought it might be interesting to compare embeddings from ViT (DeiT), ConvNeXt, and this model (GResNet) by leveraging vector search on Zilliz Cloud. I won’t add much commentary here - just going to let the results do the talking:


  1. For a visual recognition model, the receptive field is the effective area of the input pixels/voxels that a layer or neuron “sees” and can capture. Early layers in a pure convolutional model, for example, have a very small receptive field, while each layer in a vision transformer with dense attention sees the entire input image. 

  2. Strictly speaking, the stem is not exactly that of ResNet-d, but it’s close enough. 

  3. I was admittedly very surprised since this was an idea I had a couple of months after the release of the batch normalization paper. I’d like to give credit where credit is due, so please reach out to me if you know of a prior paper that implements this. 

