
1 Bi-weekly Report on Neural Networks Compression
Mainly a review of papers on neural network compression. -Hang, Luo -CSLT, THU

2 Content
Introduction
SVD Decomposition
Tensor Decomposition
Related work
Experiments on kaldi
Future work
A very brief introduction to neural network compression.

3 Introduction
What is compression? Reducing a neural network's memory footprint, by any of a variety of approaches.
Why do we need compression? Neural networks are memory- and compute-intensive, and their parameters are redundant: fully-connected layers hold a large share of the parameters, and many of them are redundant. In speech recognition, when there are many output targets the last layer can account for 50% of the parameters; in CNNs the fully-connected layers account for about 90% of all parameters.

4 Introduction
What can compression do?
Save memory.
Speed up test time, and sometimes training time.
Make deployment on mobile devices acceptable.
Enable real-time applications such as self-driving cars.

5 SVD Decomposition
For a fully-connected layer with m hidden units and n output targets, the weight matrix W is m*n. Writing the SVD as W = U Σ V^T and keeping only the k largest singular values gives the rank-k approximation W ≈ U_k Σ_k V_k^T.

6 SVD Decomposition
According to the SVD, the weight matrix can then be represented by two thin matrices, e.g. A = U_k Σ_k (m*k) and B = V_k^T (k*n). Advantages:
The original m*n parameters reduce to m*k + n*k.
Matrix-vector multiplication drops from O(m*n) to O(m*k + n*k).
Very well suited to weight matrices that are close to low-rank.
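A minimal numpy sketch of the idea (the sizes m = 2048, n = 5976, k = 512 anticipate the experiment on the next slide; the weight matrix is random here, standing in for a trained layer):

    import numpy as np

    m, n, k = 2048, 5976, 512
    W = np.random.randn(m, n).astype(np.float32)   # stand-in for a trained layer

    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]          # (m, k): left factor, singular values absorbed
    B = Vt[:k, :]                 # (k, n): right factor

    # One m*n matrix becomes two thin ones: m*k + k*n parameters.
    print(W.size, A.size + B.size)        # 12238848 vs 4108288

    # The forward pass y = x W becomes two cheaper multiplications.
    x = np.random.randn(m).astype(np.float32)
    err = np.linalg.norm(x @ W - (x @ A) @ B)   # O(mk + kn) instead of O(mn)
    print(err)    # reconstruction error of the rank-k truncation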

7 Implementation of SVD Decomposition
Approach 1: train the network normally, apply SVD to the learned weight matrix, and then fine-tune.
Experiments: 576 input features, 2048 hidden units, 5 layers, 5976 output targets, with SVD applied to the last layer. E.g. keeping the largest 1/4 of the singular values (k = 512), the parameters reduce from 2048*5976 ≈ 12M to 2048*512 + 512*5976 ≈ 4M. With each parameter a 32-bit float, that saves 32 MB of memory.

8 Implementation of SVD Decomposition
Approach 2: apply SVD while training the network, then fine-tune.
Experiments: using cross-entropy training and Hessian-free sequence training.

9 Results
Applying SVD to selected layers reduces the parameters by 30%-80%; the compression rate depends on the retained rank.
Accuracy barely decreases after fine-tuning.
Test time is accelerated, while only approach 2 also accelerates training time.
At large compression ratios the accuracy right after SVD is poor, but fine-tuning mostly recovers it.

10 Tensor Decomposition
SVD decomposition searches for a low-rank approximation of the weight matrix. Tensor decomposition instead treats the matrix as a higher-order tensor and applies a tensor decomposition algorithm (e.g. Tensor-Train decomposition). The most widely used today is Tensor-Train decomposition.

11 Traditional Tensor Decomposition
Tucker decomposition: for a d-way tensor with mode size n, the memory is O(r^d + dnr); not suitable when d is large, because the r^d core grows exponentially. Tucker decomposition is analogous to a higher-dimensional SVD.
CP decomposition: for a d-way tensor, the memory is O(dnr), but finding the best CP approximation is NP-hard. CP decomposes the tensor into a sum of rank-1 tensors.
A quick storage comparison follows.
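The snippet below works through the storage counts (the numbers n = 4, d = 10, r = 4 are illustrative, not from the slides):

    # Storage for a d-way tensor with mode size n and rank r.
    n, d, r = 4, 10, 4
    tucker = r**d + d * n * r   # r^d core plus d factor matrices of size n*r
    cp = d * n * r              # just d factor matrices of size n*r
    print(tucker, cp)           # 1048736 vs 160: the r^d core explodes with d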

12 Tensor-Train Decomposition
The Tensor-Train format (TT-format) represents the dense weight matrix of the fully-connected layers: a d-way tensor A is written as a product of cores, A(j_1, ..., j_d) = G_1[j_1] G_2[j_2] ... G_d[j_d]. For every j_k, the slice G_k[j_k] is an r_{k-1} * r_k matrix, so written with indices each core G_k is really a three-dimensional array of size r_{k-1} * n_k * r_k. By restricting the TT-ranks the parameter count is reduced: the memory is the sum of n_k r_{k-1} r_k over k, i.e. O(dnr^2). By convention r_0 = r_d = 1, so the product of the slice matrices is guaranteed to be a scalar.

13 Tensor-Train Decomposition
Vectors and matrices can be reshaped into tensors, so a fully-connected layer y = Wx + b can be evaluated directly in TT-format, reducing memory and speeding up the computation. The cores are obtained with the TT-SVD algorithm, sketched below.
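A minimal numpy sketch of TT-SVD under the conventions above (the tensor shape, the index, and max_rank are illustrative):

    import numpy as np

    def tt_svd(tensor, max_rank):
        # TT-SVD sketch: sweep over the modes; at each step reshape the
        # remainder into a matrix, truncate its SVD at max_rank, and keep
        # the left factor as the core G_k of shape (r_{k-1}, n_k, r_k),
        # with r_0 = r_d = 1 by convention.
        shape = tensor.shape
        cores, r_prev = [], 1
        C = np.asarray(tensor)
        for n_k in shape[:-1]:
            C = C.reshape(r_prev * n_k, -1)
            U, s, Vt = np.linalg.svd(C, full_matrices=False)
            r = min(max_rank, len(s))
            cores.append(U[:, :r].reshape(r_prev, n_k, r))
            C = s[:r, None] * Vt[:r]      # carry the remainder forward
            r_prev = r
        cores.append(C.reshape(r_prev, shape[-1], 1))
        return cores

    # Entry A(j_1, ..., j_d) is the product of matrices G_1[j_1] ... G_d[j_d].
    A = np.random.randn(4, 4, 4, 4)
    cores = tt_svd(A, max_rank=16)   # rank 16 keeps this small tensor exact
    j = (1, 2, 3, 0)
    v = np.eye(1)
    for G, jk in zip(cores, j):
        v = v @ G[:, jk, :]
    print(v.item(), A[j])            # the two values agree
    # Shrinking max_rank trades reconstruction error for memory ~ O(d*n*r^2).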

14 Related work
Dark knowledge
Structured matrices
Hashing tricks
2016 ICLR Best Paper
Surveyed the techniques related to compression.

15 Dark Knowledge
Learn a small model from a cumbersome model, also called "distilling".
Use the class probabilities produced by the cumbersome model as "soft targets" for training the small model.
The cumbersome model may be an ensemble, which performs well but is bad (slow) at test time.

16 Dark Knowledge
In softmax regression, the cost function is the cross-entropy against a one-hot hard target: J = -Σ_i y_i log p_i, with p_i = exp(z_i) / Σ_j exp(z_j).
In dark knowledge we instead learn from a soft target: the original hard target is replaced by the teacher's temperature-softened probabilities p_i = exp(z_i/T) / Σ_j exp(z_j/T).
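A small numpy sketch of the soft-target construction (the logits and the temperature T are made-up numbers for illustration):

    import numpy as np

    def softmax(z, T=1.0):
        # Temperature-scaled softmax; T > 1 softens the distribution so the
        # small "dark knowledge" probabilities become visible to the student.
        e = np.exp(z / T - np.max(z / T))
        return e / e.sum()

    # Made-up logits from a cumbersome teacher model for one input.
    teacher_logits = np.array([5.0, 2.0, 1.0, -1.0])
    T = 4.0
    soft_target = softmax(teacher_logits, T)   # replaces the one-hot hard target

    # The student is trained with cross-entropy against the soft target.
    student_logits = np.array([3.0, 1.5, 0.5, -0.5])
    loss = -np.sum(soft_target * np.log(softmax(student_logits, T)))
    print(soft_target, loss)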

17 Structured matrix
Use a circulant matrix to represent the weight matrix, which saves memory and speeds up computation with FFTs. If C is a circulant matrix, then y = Cx can be computed at "FFT speed", because C = F_n^* diag(F_n c) F_n: F_n is the Fourier matrix, F_n c gives the eigenvalues of C, and the columns of F_n^* are its eigenvectors.
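A short numpy/scipy sketch of the FFT trick (sizes are illustrative):

    import numpy as np
    from scipy.linalg import circulant

    # y = Cx at "FFT speed": c is the first column of the circulant matrix,
    # so Cx is a circular convolution, which the FFT diagonalizes.
    n = 8
    c, x = np.random.randn(n), np.random.randn(n)

    y_naive = circulant(c) @ x                                # O(n^2)
    y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real  # O(n log n)
    print(np.allclose(y_naive, y_fft))                        # True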

18 Circulant matrix

19 Hashing tricks Use a hash function to share weights randomly.
Weight sharing vs feature hashing

20 Hashing tricks
Forward pass: a_i = Σ_j w_{h(i,j)} x_j, where h hashes each connection (i, j) to one of the K shared weights.
Gradient over parameters: dL/dw_k sums (dL/da_i) x_j over every connection (i, j) with h(i, j) = k.
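A minimal sketch of a hashed layer; the paper (HashedNets) uses xxHash, and the integer mixing below is only a cheap deterministic stand-in:

    import numpy as np

    n_in, n_out, K = 16, 8, 10     # K real weights shared by 16*8 connections
    w = np.random.randn(K)         # the only stored parameters

    def h(i, j):
        # Cheap deterministic stand-in for the hash function in the paper.
        return (i * 2654435761 + j * 40503) % K

    def forward(x):
        # Virtual weight V[i, j] = w[h(i, j)]; no n_out x n_in matrix is stored.
        return np.array([sum(w[h(i, j)] * x[j] for j in range(n_in))
                         for i in range(n_out)])

    # Gradient over parameters: dL/dw_k accumulates dL/da_i * x_j over every
    # connection (i, j) hashed into bucket k.
    x = np.random.randn(n_in)
    print(forward(x))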

21 2016 ICLR Best Paper
Hashing tricks determine the weight sharing before the network sees any training data. There is another way: determine the sharing after the network is fully trained. How? K-means!

22 Weight sharing using K-means
Partition the n original weights into k clusters; the forward pass and gradient computations then work like those of the hashing trick.
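A small numpy/scipy sketch of the clustering step (random weights stand in for a trained matrix):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # Cluster a trained weight matrix into k centroids and store one small
    # codebook plus a per-weight index.
    W = np.random.randn(64, 64).astype(np.float32)
    k = 16
    codebook, labels = kmeans2(W.reshape(-1, 1), k, minit='points', seed=0)

    W_shared = codebook[labels].reshape(W.shape)   # each weight -> its centroid
    print(np.abs(W - W_shared).mean())             # mean quantization error
    # During fine-tuning, gradients of all weights in one cluster are summed
    # and applied to the shared centroid, as with the hashing trick above.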

23 Other tricks used in the paper
Pruning: removing the weights whose magnitude falls below a threshold. (A network can also be compressed by removing weights randomly; there are papers on that approach.) A minimal sketch follows.
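Magnitude pruning in numpy (the threshold is arbitrary here; in practice it is tuned per layer):

    import numpy as np

    # Zero out small weights and keep a binary mask so the pruned weights
    # stay at zero during later fine-tuning.
    W = np.random.randn(256, 256)
    threshold = 0.5
    mask = np.abs(W) >= threshold
    W_pruned = W * mask
    print(f"kept {mask.mean():.0%} of the weights")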

24 Huffman coding
In AlexNet, the distributions of the quantized weights and of the sparse-matrix indices are both heavily biased, which makes them well suited to Huffman coding.
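A self-contained sketch of Huffman coding over quantized weight indices (the index counts are made up to mimic a biased distribution):

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        # Standard Huffman construction: repeatedly merge the two least
        # frequent nodes; biased distributions give short codes to common
        # symbols. The integer idx only breaks ties in the heap.
        freq = Counter(symbols)
        heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        idx = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for s in lo[2]: lo[2][s] = "0" + lo[2][s]
            for s in hi[2]: hi[2][s] = "1" + hi[2][s]
            heapq.heappush(heap, [lo[0] + hi[0], idx, {**lo[2], **hi[2]}])
            idx += 1
        return heap[0][2]

    indices = [0]*50 + [1]*20 + [2]*5 + [3]*2   # biased, like quantized weights
    print(huffman_code(indices))                # frequent index 0 gets 1 bit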

25 Results

26 Experiments on kaldi
Ran the wsj example. Due to memory limits, the original 6-layer network was changed to 4 layers with 1000 hidden units and ReLU activations; the results are very close to the reference results.

27 Experiments on kaldi (run TDNN3 test)

Test set                 Original network WER   4-layer network WER
decode_bd_tgpr_dev93     7.19                   7.24
decode_bd_tgpr_eval92    3.93                   4.38
decode_tgpr_dev93        9.57                   9.98
decode_tgpr_eval92       6.86                   6.73

28 Future Work
Papers: keep reading papers on neural network compression.
Experiments: explore a suitable compression approach for ASR through experiments on wsj, starting from SVD.
Background: study automatic speech recognition and deep learning systematically.

29 Thanks~

30 Structured matrix
Structured matrices can save memory and speed up computation. For example, if the weight matrix is a Toeplitz matrix, a matrix-vector multiplication needs only O(n log n) time, as in the sketch below.
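A numpy/scipy sketch of the O(n log n) Toeplitz product via circulant embedding (sizes illustrative):

    import numpy as np
    from scipy.linalg import toeplitz

    # Embed the n x n Toeplitz matrix in a 2n-point circular convolution
    # and use the FFT.
    n = 8
    c = np.random.randn(n)                 # first column
    r = np.random.randn(n); r[0] = c[0]    # first row
    T = toeplitz(c, r)
    x = np.random.randn(n)

    # First column of the 2n x 2n circulant embedding: [c, 0, reversed tail of r].
    emb = np.concatenate([c, [0.0], r[:0:-1]])
    xx = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(emb) * np.fft.fft(xx)).real[:n]
    print(np.allclose(T @ x, y))           # True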

31 Structured matrix
Stein displacement: L(M) = M - A M B, where M, A, B, and L(M) are all n*n matrices; a matrix is "structured" when its displacement L(M) has low rank.
Krylov decomposition.

