Presentation on theme: "Deep Learning with Limited Numerical Precision" — Presentation transcript:

1 Deep Learning with Limited Numerical Precision
Authors: Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan. Publisher/Conference: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, JMLR: W&CP volume 37. Presenter: Yu-Hsiang Lin. Date: 2018/10/17. Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C. (CSIE CIAL Lab)

2 Introduction (1/3)
The neural networks considered in earlier limited-precision studies are often restricted to variants of the classical multilayer perceptron containing a single hidden layer and only a few hidden units, whereas state-of-the-art deep neural networks can easily contain millions of trainable parameters.
In the past, most neural networks had only a few hidden layers and very few hidden units, while today's deep neural networks routinely involve millions of parameters. This paper looks for a method that saves memory without degrading the accuracy of neural-network training.

3 Introduction (2/3)
Deep networks can be trained using only a 16-bit wide fixed-point number representation when stochastic rounding is used, and incur little to no degradation in classification accuracy.
Back-propagation for neural networks has traditionally been computed with 32-bit floating point. Compared with the conventional (single-precision) floating-point approach, the idea has two advantages:
1. Fixed-point arithmetic units are typically faster than floating-point units, consume fewer hardware resources and less power, and have a smaller circuit footprint, so more of them can be implemented in a given silicon area.
2. Low-precision data reduce the memory footprint, which allows larger models and lowers the required memory bandwidth.

4 Introduction (3/3)
The paper also demonstrates an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
This extends the idea from the previous slide to hardware design: with a large number of fixed-point computation units, the dataflow architecture described later, and a stochastic rounding module, the FPGA implementation achieves high throughput at low power.

5 Limited Precision Arithmetic (1/3)
This slide concerns the back-propagation algorithm under limited-precision arithmetic and introduces some notation.
<IL, FL>: IL is the integer length and FL is the fractional length; the word length WL (here 16) = IL + FL.
The <IL, FL> fixed-point format limits the precision to FL bits and sets the representable range to [-2^(IL-1), 2^(IL-1) - 2^-FL].
ε denotes the smallest positive number representable in the given fixed-point format, defined here as ε = 2^-FL.
⌊x⌋ is the largest integer multiple of ε (= 2^-FL) that is less than or equal to x.
Before introducing the rounding modes, a quick review: in the forward pass the inputs are propagated and computed layer by layer through to the output, whereas back-propagation takes the error between the prediction and the ground truth and propagates it backwards to update the parameters.
For example, with <IL=2, FL=6> the range is [-2, 127/64]; ε is the smallest precision step of the format, and ⌊x⌋ is the largest multiple of ε not exceeding x.
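As a concrete illustration (my own sketch, not from the slides), the toy format <2, 6> in a few lines of Python; the helper name floor_to_eps and the sample value 0.7 are assumptions made only for this example:

    import math

    IL, FL = 2, 6                      # toy word length WL = IL + FL = 8 (the paper uses WL = 16)
    eps = 2.0 ** (-FL)                 # smallest positive representable step, eps = 2^-FL = 1/64
    lo  = -2.0 ** (IL - 1)             # lower end of the range: -2
    hi  =  2.0 ** (IL - 1) - eps       # upper end of the range: 2 - 1/64 = 127/64

    def floor_to_eps(x, fl=FL):
        """Largest integer multiple of eps that is <= x (the floor(x) of the slide)."""
        e = 2.0 ** (-fl)
        return math.floor(x / e) * e

    print(lo, hi)                      # -2.0 1.984375
    print(floor_to_eps(0.7))           # 0.6875, i.e. 44 * 2^-6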

6 Limited Precision Arithmetic (2/3)
This slide introduces two ways to convert a high-precision number into a lower-precision one.
Conventional round-to-nearest: note that all quantities here are in binary, so rounding is to the nearest multiple of ε = 2^-FL.
The stochastic rounding scheme this paper advocates: whether to round up or down is decided probabilistically. The probability of rounding x down to ⌊x⌋ is 1 - (x - ⌊x⌋)/ε, so the closer x is to ⌊x⌋, the more likely it is rounded to ⌊x⌋; with the complementary probability (x - ⌊x⌋)/ε it is rounded up to ⌊x⌋ + ε. A small sketch of both modes follows below.
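A minimal Python sketch of the two rounding modes as defined above; the function names are mine, and random.random() simply stands in for the hardware random-number source:

    import math, random

    def round_nearest(x, fl):
        """Conventional round-to-nearest at precision eps = 2^-fl."""
        eps = 2.0 ** (-fl)
        lo = math.floor(x / eps) * eps            # floor(x)
        return lo if x - lo <= eps / 2 else lo + eps

    def round_stochastic(x, fl):
        """Stochastic rounding: round up to floor(x)+eps with probability
        (x - floor(x)) / eps, otherwise round down to floor(x)."""
        eps = 2.0 ** (-fl)
        lo = math.floor(x / eps) * eps
        return lo + eps if random.random() < (x - lo) / eps else lo

Averaged over many draws, round_stochastic returns x in expectation, which is the unbiasedness property the scheme relies on.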

7 Limited Precision Arithmetic (3/3)
If x lies outside the range [-2^(IL-1), 2^(IL-1) - 2^-FL], i.e., beyond what the fixed-point format can represent, x is simply saturated to the lower or upper bound of that range.
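Combining the range clipping with the rounding, a sketch of the Convert(x, <IL, FL>) operation; it reuses the round_nearest/round_stochastic helpers from the previous sketch, and the mode argument is my own naming:

    def convert(x, il, fl, mode="stochastic"):
        """Saturate x to the <IL, FL> range, then round it to a multiple of 2^-FL."""
        eps = 2.0 ** (-fl)
        lo, hi = -2.0 ** (il - 1), 2.0 ** (il - 1) - eps
        if x <= lo:                      # below the range: clip to the lower bound
            return lo
        if x >= hi:                      # above the range: clip to the upper bound
            return hi
        return round_stochastic(x, fl) if mode == "stochastic" else round_nearest(x, fl)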

8 Multiply and accumulate (MACC) (1/1)
Suppose a and b are d-dimensional vectors in fixed-point format <IL, FL> and c = a·b is their inner product, with c stored in some fixed-point format <IL', FL'>. The computation of c is split into the following two steps:
1. Compute z = Σ_{i=1..d} a_i · b_i, where each product a_i · b_i is a fixed-point number in the <2·IL, 2·FL> format.
2. Round z to the limit set by <IL', FL'>: c = Convert(z, <IL', FL'>).
Adopting this procedure has three advantages (a sketch follows below):
1. It closely mirrors the hardware arithmetic units in FPGAs, the Digital Signal Processing (DSP) units: a DSP slice accepts 18-bit inputs and accumulates into a 48-bit register, and it implements arithmetic and logic operations including fixed-point addition and multiplication.
2. Performing the rounding only after the entire accumulation is done greatly reduces the hardware overhead of the stochastic rounding logic.
3. It lets fixed-point computation be emulated efficiently on CPUs/GPUs with vendor-supplied Basic Linear Algebra Subprograms (BLAS) libraries: for example, convert the two matrices A and B to floating-point numbers, call the SGEMM routine to compute their product, and then pass the result through Convert. This advantage is less relevant to the paper itself, since the final implementation targets an FPGA.
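A minimal Python sketch of the two-step MACC, reusing the convert() helper sketched on the previous slide; the vectors are plain Python lists and the output-format arguments are illustrative only:

    def fixed_point_dot(a, b, il_out, fl_out, mode="stochastic"):
        """Step 1: accumulate all d products at full precision (no intermediate rounding),
        mirroring a DSP slice that multiplies 18-bit inputs into a 48-bit accumulator.
        Step 2: apply a single rounding + saturation into the <IL', FL'> output format."""
        z = sum(ai * bi for ai, bi in zip(a, b))
        return convert(z, il_out, fl_out, mode)

Because rounding happens once per inner product rather than once per multiply-add, only a single rounding unit is needed per output element, which is the hardware saving mentioned in advantage 2.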

9 Training Deep Networks
Two network types and datasets are used for evaluation: fully connected deep neural networks (DNN) on MNIST and convolutional neural networks (CNN) on CIFAR10.
MNIST is used to train handwritten-digit recognition.
CIFAR10 contains 50,000 RGB training images in 10 classes (5,000 images per class) plus a 10,000-image test set, and is used to train recognition of object categories such as airplanes, cars, frogs, dogs, and so on.
The parameters are initialized randomly at the start of training and then adjusted step by step in every epoch.

10 Training Deep Networks-MNIST in DNN
The difference between round-to-nearest and stochastic rounding during training shows up in the parameter updates: if an update falls inside (-ε/2, +ε/2), round-to-nearest always rounds it to zero, so applying the update is the same as not updating at all, whereas stochastic rounding still gives those updates some probability of being rounded to ±ε (see the sketch below).
The second issue concerns the ReLU activation function: because the fixed-point range is narrow, activations easily exceed the <IL, FL> limit and are clamped to its upper bound, which distorts the data.
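A toy illustration of the first point (the numbers are my own, assuming the <2, 14> weight-update format used in the MNIST experiments): an update far below ε/2 is always lost by round-to-nearest but survives in expectation under stochastic rounding.

    import random

    eps = 2.0 ** (-14)            # precision step of the <2, 14> format
    update = eps / 10             # a gradient update well below eps/2

    # Round-to-nearest: anything in (-eps/2, +eps/2) collapses to 0, so the weight never moves.
    print(0.0 if abs(update) < eps / 2 else eps)        # 0.0

    # Stochastic rounding: the update is applied (as +eps) roughly 1 time in 10,
    # so on average the weight still moves by about `update`.
    trials = 100_000
    ups = sum(random.random() < update / eps for _ in range(trials))
    print(ups / trials * eps)     # approximately eps/10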

11 Training Deep Networks-MNIST in CNN

12 Training Deep Networks-CIFAR10 in CNN
The notable case here is stochastic rounding with 12 fractional bits: during training the test error saturates at roughly 28% and will not come down, so for the last few epochs the word length is raised to 20 bits, i.e., <4, 16>, which recovers the accuracy. This is a form of mixed-precision training.

13 Hardware Prototyping (1/2)
Motivation:
A long series of GEMM operations, e.g., the feed-forward pass, error back-propagation, and weight-update calculations, accounts for a large fraction of the network's total execution time.
GPUs can speed up these computations, but they are optimized for floating-point performance, which does not match the requirements here.
Two main reasons for choosing an FPGA:
1. Compared with an ASIC, an FPGA offers a much shorter hardware development cycle at a lower cost.
2. It provides a large number of fixed-point DSP units, which fits this fixed-point arithmetic implementation well and raises the achievable performance and energy efficiency.
A brief description of the modules above:
Rather than keeping the entire matrices in the L2 cache, only a subset of rows and columns is cached and the rest stays in DDR3; the goal is to maximize the reuse of the data held in the L2 cache, compute partial results, and write them back to DDR3.
READ: requests data from DDR3 and places it into the L2 cache.
WRITE: writes the partial results produced by the SA back to DDR3.
L2-to-SA: feeds the cached rows and columns from the L2 cache into the SA for computation.
Systolic Array (SA): all additions, multiplications, and stochastic rounding are implemented here.
TOP: integrates all of the modules.

14 Hardware Prototyping (2/2)
Reuse allows efficient use of the bandwidth between the FPGA and the DDR3 memory.
The blocking factor p depends on the capacity of the on-chip memory; matrix B is processed n columns at a time and matrix A in bands of p·n rows.
1. n columns of matrix B and p·n rows of matrix A are loaded into the cache; the SA computes the p·n·n partial results and stores them back to DDR3.
2. The next n columns of matrix B are loaded into the cache and the step repeats until all m columns of B have been processed, which ends one round.
3. In the next round, the next p·n rows of matrix A together with a fresh cycle of n-column blocks of B are loaded into the cache, and so on, for a total of l/(p·n) rounds.
In this way, once the elements are on the FPGA, each element of matrix A is reused m times and each element of matrix B is reused p·n times (see the sketch below).
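A rough software model of this blocking scheme (my own sketch; plain Python lists stand in for DDR3 and the L2 cache, A is l×k, B is k×m, l and m are assumed to be multiples of p·n and n, and each output block is computed in full here rather than as partial results):

    def blocked_gemm(A, B, p, n):
        l, k = len(A), len(A[0])
        m = len(B[0])
        C = [[0.0] * m for _ in range(l)]
        for r0 in range(0, l, p * n):                 # one round: cache p*n rows of A
            a_block = A[r0:r0 + p * n]                # stays resident for the whole round
            for c0 in range(0, m, n):                 # stream n columns of B per step
                b_cols = range(c0, min(c0 + n, m))
                for i, a_row in enumerate(a_block):   # each cached A element ends up used for all m columns of B
                    for j in b_cols:                  # each cached B element is used by all p*n cached rows of A
                        C[r0 + i][j] = sum(a_row[t] * B[t][j] for t in range(k))
        return C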

15 Systolic Array Architecture (1/3)
Each node (DSP MACC) is a DSP unit that performs one multiplication and one addition every clock cycle.
This is an example of a wavefront-type systolic array; this organization reduces interconnect delays and raises the maximum operating frequency.
Each FIFO holds the elements of one row of A or one column of B.
Each DSP result (one element of the result matrix) is passed through a local storage register to the DSP Round unit, which rounds it up or down. A behavioral sketch of the wavefront dataflow follows below.
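For intuition, a small cycle-by-cycle Python model of a wavefront systolic array for square matrices (my own simplification: inputs are skewed by one cycle per row/column, values move one hop per cycle, and the rounding stage is omitted):

    def systolic_matmul(A, B):
        """PE(i, j) accumulates C[i][j]; A-values flow rightwards, B-values flow downwards."""
        n = len(A)                                    # assume A and B are n x n
        C = [[0.0] * n for _ in range(n)]
        a_reg = [[0.0] * n for _ in range(n)]         # value each PE passes to its right neighbour
        b_reg = [[0.0] * n for _ in range(n)]         # value each PE passes to the PE below it
        for t in range(3 * n - 2):                    # enough cycles for the wavefront to drain
            for i in reversed(range(n)):              # update bottom-right first so neighbours
                for j in reversed(range(n)):          # still hold last cycle's values
                    a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < n else 0.0)
                    b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < n else 0.0)
                    C[i][j] += a_in * b_in            # one MACC per PE per cycle
                    a_reg[i][j], b_reg[i][j] = a_in, b_in
        return C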

16 Systolic Array Architecture (2/3)
Diagram illustrating the operation of the wavefront-type systolic array.

17 Systolic Array Architecture (3/3)
Implementation environment:
A 28×28 systolic array implemented on a Kintex K325T FPGA using Xilinx's Vivado synthesis and place-and-route tools; the maximum circuit operating frequency is 166 MHz and the power consumption is 7 W.
Performance evaluation:
Throughput: 260 G-ops/s. Power efficiency: 37 G-ops/s/W, compared with a range of 1-5 G-ops/s/W for implementations on an Intel i7-3720QM CPU and NVIDIA GT650m and GTX780 GPUs.

