ImageNet Classification with Deep Convolutional Neural Networks
Published in NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems
Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Presenter: Chao-Chun Sung
Date: 107/10/24
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Introduction
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. To learn about thousands of objects from millions of images, we need a model with a large learning capacity: a CNN. Current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs.
Dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool. ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256.
Back propagation (1/)
Back propagation (2/)
Forward Propagation: feed the weighted sum into the activation function (sigmoid) to get out_h1 = 0.59326992.
The first layer is the input layer with two neurons i1, i2 and a bias term b1; the second layer is the hidden layer with two neurons h1, h2 and a bias term b2; the third layer is the output layer with o1, o2. Each w_i marked on an edge is the weight connecting one layer to the next, and the activation function defaults to the sigmoid.
The bias (intercept) term captures what the explanatory variables of the regression model cannot explain.
If no activation function is used, every layer's output is just a linear combination of the previous layer's input (a matrix multiplication), so the output remains a linear function of the input and a deep neural network loses its purpose.
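The slide quotes out_h1 = 0.59326992, but the inputs and weights behind it are not listed here. The sketch below assumes the commonly used tutorial values (i1 = 0.05, i2 = 0.10, w1–w8 = 0.15…0.55, b1 = 0.35, b2 = 0.60), chosen only because they reproduce the number on the slide; it walks through the forward pass of the 2-2-2 network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed example values (not given on the slide); they reproduce out_h1 ≈ 0.5933.
i1, i2 = 0.05, 0.10
b1, b2 = 0.35, 0.60
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input  -> hidden
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output

# Hidden layer: weighted sum plus bias, then sigmoid.
net_h1 = w1 * i1 + w2 * i2 + b1
net_h2 = w3 * i1 + w4 * i2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)   # out_h1 ≈ 0.593269992

# Output layer: same pattern applied to the hidden activations.
net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)   # ≈ 0.7514, 0.7729

print(out_h1, out_o1, out_o2)
```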
Back propagation (3/)
Total error:
E_o1 = 0.274811083
E_o2 = 0.023560026
E_total = 0.29837111
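These numbers follow from the half squared error summed over the two outputs, assuming target values target_o1 = 0.01 and target_o2 = 0.99 (an assumption carried over from the same tutorial example as above):

```latex
E_{total} = \sum_{k} \tfrac{1}{2}\,(\mathrm{target}_k - out_k)^2
E_{o1} = \tfrac{1}{2}\,(0.01 - 0.751365)^2 \approx 0.274811
E_{o2} = \tfrac{1}{2}\,(0.99 - 0.772928)^2 \approx 0.023560
E_{total} = E_{o1} + E_{o2} \approx 0.298371
```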
Back propagation (4/)
We want to know how much w5 contributes to the total error, so we take the partial derivative of the total error with respect to w5.
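A sketch of the standard chain-rule expansion for a sigmoid output unit with the half squared error (the slide's own derivation figure is not reproduced here):

```latex
\frac{\partial E_{total}}{\partial w_5}
  = \frac{\partial E_{total}}{\partial out_{o1}}
    \cdot \frac{\partial out_{o1}}{\partial net_{o1}}
    \cdot \frac{\partial net_{o1}}{\partial w_5}
  = (out_{o1} - \mathrm{target}_{o1}) \cdot out_{o1}\,(1 - out_{o1}) \cdot out_{h1}
```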
Back propagation (5/)
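Continuing the assumed worked example above, a minimal sketch of the gradient for w5 and one gradient-descent step; the learning rate of 0.5 is an assumption, not something stated on the slides:

```python
# Values carried over from the assumed forward pass above.
out_h1, out_o1 = 0.593269992, 0.751365070
target_o1 = 0.01          # assumed target value
w5, lr = 0.40, 0.5        # assumed initial weight and learning rate

# Chain rule: dE/dw5 = dE/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
dE_dout   = out_o1 - target_o1            # derivative of 1/2 * (target - out)^2
dout_dnet = out_o1 * (1.0 - out_o1)       # derivative of the sigmoid
dnet_dw5  = out_h1
grad_w5 = dE_dout * dout_dnet * dnet_dw5  # ≈ 0.0822

# One gradient-descent update of w5.
w5_new = w5 - lr * grad_w5                # ≈ 0.3589
print(grad_w5, w5_new)
```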
ReLU Nonlinearity (1/) (Rectified Linear Unit)
Activation functions come in two kinds:
Saturating nonlinearities: f(x) = tanh(x), f(x) = (1 + e^(−x))^(−1) (sigmoid)
Non-saturating nonlinearities: f(x) = max(0, x) (ReLU)
Why use a non-saturating nonlinearity?
- The piecewise-linear form of ReLU effectively overcomes the vanishing-gradient problem.
- ReLU sets some neurons' outputs to zero, which makes the network sparse and helps mitigate overfitting.
- ReLU is cheap to compute.
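A small illustration (not from the slides) of why saturating nonlinearities slow learning: their gradients vanish for large |x|, while the ReLU gradient stays 1 for any positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 2.0, 5.0, 10.0])
grad_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # -> 0 as x grows (saturates)
grad_tanh    = 1.0 - np.tanh(x) ** 2             # -> 0 as x grows (saturates)
grad_relu    = (x > 0).astype(float)             # stays 1 for any positive input

print(grad_sigmoid)   # [0.25     0.105    0.0066   4.5e-05]
print(grad_tanh)      # [1.       0.0707   0.00018  8.2e-09]
print(grad_relu)      # [0. 1. 1. 1.]
```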
ReLU Nonlinearity (2/)
Local Response Normalization
N is the total number of feature maps in the layer, and n is the size of the neighborhood (n/2 adjacent feature maps on each side of the current one) over which the normalization is computed.
The parameters used in the paper are k = 2, n = 5, α = 10^−4, β = 0.75, and the LRN layer is applied after the ReLU.
Training the network with LRN reduced the top-1 and top-5 error rates on ImageNet by 1.4% and 1.2%, respectively.
Because ReLU neurons have unbounded activations, LRN is used to normalize them. We want to detect high-frequency features with a large response: if we normalize over the local neighborhood of an excited neuron, it becomes even more sensitive compared with its neighbors.
This kind of lateral inhibition is observed in the brain, and it can also be seen as helping to sharpen responses: instead of carrying several blurry representations of a patch, the network is pushed to commit more strongly to one specific representation, freeing resources to analyze it better.
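For reference, the response-normalization formula from the paper that these parameters (k, n, α, β, N) plug into:

```latex
b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^{j}_{x,y}\big)^{2} \right)^{\beta}
```

Here a^i_{x,y} is the activity of kernel i at position (x, y) after the ReLU, and b^i_{x,y} is the response-normalized activity.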
Training on Multiple GPUs
Two GTX 580 GPUs are used, because a single GPU has only 3GB of memory.
Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory.
The GPUs communicate only in certain layers.
Reducing Overfitting
Overfitting means the parameters are tuned to predict the training data too perfectly, which actually hurts performance on unseen data.
[Figure: underfit vs. exact fit vs. overfit]
Data Augmentation (1/)
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations.
The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches ((256−224) × (256−224) × 2 = 2048, i.e. the training set is enlarged by a factor of 2048).
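A minimal sketch of this augmentation in NumPy (the helper name and the stand-in image are mine, not from the paper):

```python
import numpy as np

def random_crop_and_flip(image, crop=224):
    """Randomly crop a crop x crop patch and flip it horizontally half the time.

    image: H x W x 3 array (here 256 x 256 x 3); the label is unchanged.
    """
    h, w, _ = image.shape
    top  = np.random.randint(0, h - crop + 1)   # random crop offsets
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]                  # horizontal reflection
    return patch

img = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image
print(random_crop_and_flip(img).shape)          # (224, 224, 3)
```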
Data Augmentation (2/)
The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set; p_i is an eigenvector and λ_i an eigenvalue.
That is, PCA is first used to find the principal components of the RGB pixel values over the training set, and then a random multiple of each principal component is added to the pixels of every training image: to each RGB pixel we add [p1, p2, p3][α1·λ1, α2·λ2, α3·λ3]^T.
Here p_i and λ_i are the i-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, and α_i is a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. For a given training image, each α_i is drawn only once and is re-drawn the next time that image is used for training. This scheme reduces the top-1 error rate by over 1%.
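A sketch of this PCA color augmentation, assuming `pixels` is an N x 3 array of RGB values gathered from the training set (function and variable names are mine):

```python
import numpy as np

def pca_color_augment(image, pixels, sigma=0.1):
    """Add a random multiple of the RGB principal components to every pixel.

    image:  H x W x 3 float array (one training image)
    pixels: N x 3 float array of RGB values sampled from the training set
    """
    cov = np.cov(pixels, rowvar=False)             # 3 x 3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)         # lambda_i and p_i (as columns)
    alphas = np.random.normal(0.0, sigma, size=3)  # drawn once per use of the image
    shift = eigvecs @ (alphas * eigvals)           # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + shift                           # same shift added to every pixel

img = np.random.rand(256, 256, 3)
sample_pixels = img.reshape(-1, 3)                 # stand-in for training-set pixels
print(pca_color_augment(img, sample_pixels).shape) # (256, 256, 3)
```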
Dropout
The recently-introduced technique called "dropout" consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in backpropagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
Dropout prevents the network from relying too heavily on particular nodes and forces the other nodes to learn useful connections, which improves the overall result.
What causes overfitting?
1. Too little data, so the model never sees enough unexpected examples to become "general".
2. The model is too complex, i.e. more complex than the problem requires.
The drawback is a clear increase in training time: with dropout, each step effectively trains only a subnetwork of the original network, so more iterations are needed to reach the same accuracy.
Dropout is useful for large networks with scarce data; for small networks, or when data is plentiful, it is not recommended.
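A minimal sketch of dropout as described above: zero each hidden activation with probability 0.5 while training, and multiply the outputs by 0.5 at test time (the helper below is my illustration, not the paper's code):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Zero units with probability p_drop while training; scale outputs at test time."""
    if training:
        mask = np.random.rand(*activations.shape) >= p_drop  # keep with prob. 1 - p_drop
        return activations * mask           # dropped neurons output 0 and get no gradient
    return activations * (1.0 - p_drop)     # test time: multiply outputs by 0.5

h = np.random.rand(4, 8)                    # a batch of hidden activations
print(dropout(h, training=True))            # roughly half of the entries are zeroed
print(dropout(h, training=False))           # all units active, scaled by 0.5
```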
Overlapping pooling
Input (4×4):
 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16
Max pooling (2×2 window, stride 2):
 6  8
14 16
Overlapping pooling (2×2 window, stride 1):
 6  7  8
10 11 12
14 15 16
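A small sketch of 2-D max pooling where the stride may be smaller than the window; it reproduces both grids above (the helper name is mine):

```python
import numpy as np

def max_pool2d(x, window, stride):
    """Max pooling over a 2-D array; pooling regions overlap when stride < window."""
    h, w = x.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + window, c:c + window].max()
    return out

x = np.arange(1, 17).reshape(4, 4)
print(max_pool2d(x, window=2, stride=2))   # [[ 6  8] [14 16]]
print(max_pool2d(x, window=2, stride=1))   # [[ 6  7  8] [10 11 12] [14 15 16]]
# The paper itself pools with a 3x3 window and stride 2 (z = 3, s = 2).
```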
Details of learning
We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for a weight w was
v_{i+1} := 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w |_{w_i}⟩_{D_i}
w_{i+1} := w_i + v_{i+1}
where i is the iteration index, v is the momentum variable, ε is the learning rate, and ⟨∂L/∂w |_{w_i}⟩_{D_i} is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
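A sketch of this update rule in NumPy; the toy objective, its gradient, and the learning rate below are placeholders of mine, not values from the paper:

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule above:
    v <- momentum * v - weight_decay * lr * w - lr * grad ;  w <- w + v
    """
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

# Toy example: minimize 0.5 * ||w||^2, whose batch-averaged gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    grad = w                      # stand-in for the average gradient over a batch
    w, v = sgd_step(w, v, grad, lr=0.01)
print(w)                          # moves toward the minimum at [0, 0]
```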
Overall Architecture
Results