11
$\begingroup$

This is a rather general question that came to mind while attending a talk on deep learning. A deep net had been trained on 32x32 images from 10 classes (the CIFAR-10 dataset), about 60000 images in total. The speaker said the network had learnt more than 10 million parameters. That seems a bit weird to me, because the total number of pixels in the dataset is of roughly the same order. How does one justify a model using this many parameters? Don't the chances of overfitting to a dataset increase with the number of parameters?

$\endgroup$
  • $\begingroup$ If there are 60000 images with 32*32 pixels each, and each pixel contains red, green and blue values, the total number of inputs would be 32*32*60000*3 = 184320000, some 18 times more than the number of parameters. $\endgroup$ Commented Oct 4, 2015 at 12:43
  • $\begingroup$ Those are approximate numbers. I might have worded my question wrongly. A better way to put it: how do you avoid overfitting when the parameter space is of such a huge order? $\endgroup$ Commented Oct 4, 2015 at 15:24

1 Answer

5
$\begingroup$

Yes, a model trained on CIFAR-10 should not need 10 million parameters, since the input dimension is small (32*32*3 = 3072). It can reach a million or so parameters, but such a model becomes prone to over-fitting. Here is a reasonable convnet structure for CIFAR-10:

Two convolution layers, one fully connected layer, and one classification layer (also fully connected). Most of the parameters are concentrated in the last two layers, as they are fully connected.

  • Filter size at the first convolution layer is 7x7@32
  • Pooling size at the first pooling layer is 2x2
  • Filter size at the second convolution layer is 5x5@16
  • Pooling size at the second pooling layer is 1x1 (no pooling)

I'm assuming 'valid' convolutions and a pooling stride equal to the pooling size. With this configuration, the first feature maps have dimension (32-7+1)/2 x (32-7+1)/2 = 13x13@32, and the second feature maps have dimension (13-5+1)/1 x (13-5+1)/1 = 9x9@16.
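These sizes follow mechanically from the valid-convolution and non-overlapping-pooling formulas; a quick sketch in plain Python (layer sizes taken from this answer):

```python
def conv_out(size, filt):
    """'Valid' convolution: output spatial size for a square input."""
    return size - filt + 1

def pool_out(size, pool):
    """Non-overlapping pooling: stride equals the pool size."""
    return size // pool

# First layer: 32x32 input, 7x7 filters, then 2x2 pooling -> 13x13
first = pool_out(conv_out(32, 7), 2)
# Second layer: 13x13 input, 5x5 filters, 1x1 pooling (none) -> 9x9
second = pool_out(conv_out(first, 5), 1)
print(first, second)  # 13 9
```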

Since the feature maps are flattened into a vector before being passed to the fully connected layer, the input dimension of the first fully connected layer is 9*9*16 = 1296. Let's assume the last hidden layer contains 500 units (this hyper-parameter has the largest effect on the total number of parameters), and the output layer has 10 units, one per class.

In total, the number of parameters (counting only filter and fully connected weights, and ignoring input-channel depth and biases for simplicity) is 7*7*32 + 5*5*16 + 1296*500 + 500*10 = 1568 + 400 + 648000 + 5000 = 654968.

But I expect a smaller network to yield better results, as the number of samples is relatively small. If the 500 hidden units are reduced to 100, the total drops to 1568 + 400 + 129600 + 5000 = 136568. It might also help to add another pooling layer after the second convolution, or to discard the first fully connected layer.

As you can see, most of the parameters are concentrated in the first fully connected layer. I don't think a deeper network would yield significant gains here, as the input dimension is small (relative to ImageNet). So your point is right.

If you are concerned about over-fitting, check the 'Reducing Overfitting' section of Alex Krizhevsky's convnet (AlexNet) paper.
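One of the techniques discussed there, dropout, is simple to sketch in plain Python. Note this is the 'inverted' variant that rescales survivors at training time (a common modern formulation), not the paper's exact test-time-scaling version:

```python
import random

def inverted_dropout(activations, p=0.5, rng=None):
    """Zero each unit with probability p and scale the survivors by
    1/(1-p) so the expected activation is unchanged (inverted dropout)."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible demo
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

acts = [0.2, 1.5, -0.7, 3.1, 0.9, -1.2]
dropped = inverted_dropout(acts, p=0.5)
```

Each output unit is either zeroed out or doubled (scaled by 1/(1-0.5)); at test time the full network is used with no scaling.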

$\endgroup$
  • $\begingroup$ Actually I have to disagree with you: ResNet was trained on CIFAR-10 as well, using 50, 101, and later 1000 layers, and got ~95.5 percent accuracy. So a deep network can do pretty well on a small dataset like CIFAR-10. I myself got 93.86% accuracy on CIFAR-10 using my own deep architecture (a 12-layer network). Apart from that, using batch normalization and dropout helps with over-fitting, so it need not be a problem, as ResNet showed. $\endgroup$ Commented Jun 5, 2016 at 5:13
