Deep Learning with Limited Numerical Precision
Authors: Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan
Publisher/Conference: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, JMLR: W&CP volume 37
Presenter: Yu-Hsiang Lin
Date: 2018/10/17
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C.
Introduction (1/)
- Neural networks considered in earlier work are often limited to variants of the classical multilayer perceptron containing a single hidden layer and only a few hidden units, whereas state-of-the-art deep neural networks can easily contain millions of trainable parameters.
- Older neural networks mostly had few hidden layers and few hidden units, while today's deep neural networks routinely involve millions of parameters.
- This paper looks for a way to save memory without degrading the accuracy of neural network training.
Introduction (2/)
- Deep networks can be trained using only a 16-bit wide fixed-point number representation when stochastic rounding is used, and incur little to no degradation in classification accuracy.
- Back-propagation has traditionally been computed in 32-bit (single-precision) floating point. Compared with that approach, the fixed-point idea has two advantages:
- Fixed-point arithmetic units are typically faster than floating-point units, consume fewer hardware resources and less power, and have a smaller logic footprint, so more of them fit in a given silicon area.
- Low-precision data reduces the memory footprint, which allows larger models and lowers the required memory bandwidth.
Introduction (3/)
- The paper also demonstrates an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
- The idea from the previous slide is extended to hardware design: with a large number of fixed-point computation units, plus the dataflow architecture and stochastic rounding module described later, an FPGA implementation achieves high throughput at low power.
Limited Precision Arithmetic(1/)
- Notation used for the back-propagation algorithm: <IL, FL>, where IL is the integer length and FL is the fractional length; the word length WL (16 here) satisfies WL = IL + FL.
- The <IL, FL> fixed-point format limits the precision to FL bits and sets the range to [-2^(IL-1), 2^(IL-1) - 2^(-FL)].
- ε denotes the smallest positive number representable in the given fixed-point format; here ε = 2^(-FL).
- ⌊x⌋ denotes the largest integer multiple of ε (= 2^(-FL)) that is less than or equal to x.
- Quick review before introducing the rounding modes: in the forward pass, inputs are propagated layer by layer to the output; back-propagation instead pushes the error between the prediction and the ground truth back through the network to adjust the parameters.
- Example with <IL=2, FL=6>: the range is [-2, 127/64] and ε = 1/64; ⌊x⌋ is the largest multiple of 1/64 not exceeding x (see the sketch below).
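To make the <IL, FL> notation concrete, here is a minimal Python sketch (my own helper names, not the authors' code) that computes ε, the representable range, and ⌊x⌋ for a given format:

```python
import math

def fixed_point_params(IL, FL):
    """Return (epsilon, lower_bound, upper_bound) of the <IL, FL> format."""
    eps = 2.0 ** (-FL)              # smallest positive step, epsilon = 2^-FL
    lo = -(2.0 ** (IL - 1))         # -2^(IL-1)
    hi = 2.0 ** (IL - 1) - eps      # 2^(IL-1) - 2^-FL
    return eps, lo, hi

def floor_to_multiple(x, FL):
    """Largest integer multiple of epsilon that is <= x (the paper's floor(x))."""
    eps = 2.0 ** (-FL)
    return math.floor(x / eps) * eps

# <IL=2, FL=6>: eps = 1/64, range = [-2, 127/64]
print(fixed_point_params(2, 6))     # (0.015625, -2.0, 1.984375)
print(floor_to_multiple(1.23, 6))   # 1.21875  (= 78/64)
```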
Limited Precision Arithmetic(2/)
- Two ways of converting a higher-precision number into the low-precision format are considered; both operate in binary.
- Round-to-nearest (conventional rounding): Round(x) = ⌊x⌋ if ⌊x⌋ ≤ x ≤ ⌊x⌋ + ε/2, and ⌊x⌋ + ε otherwise.
- Stochastic rounding (the scheme the paper adopts): whether to round down or up is decided probabilistically. Round(x) = ⌊x⌋ with probability 1 - (x - ⌊x⌋)/ε, and ⌊x⌋ + ε with probability (x - ⌊x⌋)/ε. In other words, the closer x is to ⌊x⌋, the more likely it is rounded down to ⌊x⌋, and the expected value of the rounded result equals x (see the sketch below).
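Both rules follow directly from the definitions above. A small Python sketch (hedged: the function names are mine, and the real implementation is a hardware rounding module):

```python
import math
import random

def round_nearest(x, FL):
    """Round-to-nearest: pick floor(x) if x is within eps/2 of it, else floor(x)+eps."""
    eps = 2.0 ** (-FL)
    fl = math.floor(x / eps) * eps
    return fl if (x - fl) <= eps / 2 else fl + eps

def round_stochastic(x, FL):
    """Stochastic rounding: round down with probability 1 - (x - floor(x))/eps,
    otherwise round up, so that E[Round(x)] = x (unbiased)."""
    eps = 2.0 ** (-FL)
    fl = math.floor(x / eps) * eps
    p_up = (x - fl) / eps          # fraction of the way from floor(x) to floor(x)+eps
    return fl + eps if random.random() < p_up else fl

# With FL = 6, x = 1.23: floor(x) = 1.21875, p_up = 0.72, so ~72% of draws round up.
print(round_nearest(1.23, 6), round_stochastic(1.23, 6))
```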
Limited Precision Arithmetic(3/)
- If x lies outside the range [-2^(IL-1), 2^(IL-1) - 2^(-FL)], i.e. it cannot be represented in the fixed-point format, the conversion saturates: x is simply clamped to the lower or upper bound of the range.
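Putting the saturation check together with rounding gives the complete conversion into <IL, FL>. A self-contained sketch under the same assumptions as the snippets above (round-to-nearest is used here for brevity; stochastic rounding would slot into the same place):

```python
import math

def convert(x, IL, FL):
    """Convert(x, <IL, FL>): saturate to the representable range, otherwise
    round to the nearest multiple of eps."""
    eps = 2.0 ** (-FL)
    lo, hi = -(2.0 ** (IL - 1)), 2.0 ** (IL - 1) - eps
    if x <= lo:                       # below the range: clamp to the lower bound
        return lo
    if x >= hi:                       # above the range: clamp to the upper bound
        return hi
    fl = math.floor(x / eps) * eps    # largest multiple of eps <= x
    return fl if (x - fl) <= eps / 2 else fl + eps

print(convert(3.7, 2, 6))    # 1.984375  (saturated to the upper bound)
print(convert(-5.0, 2, 6))   # -2.0      (saturated to the lower bound)
print(convert(0.5, 2, 6))    # 0.5       (already representable)
```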
Multiply and accumulate (MACC) (1/)
- Let a and b be d-dimensional vectors in fixed-point format <IL, FL>, and let c = a·b be their inner product, stored in some fixed-point format <IL', FL'>. The computation of c is split into two steps:
- 1. Each product a_i * b_i is a fixed-point number in the <2*IL, 2*FL> format, and the products are accumulated without intermediate rounding: z = Σ_{i=1..d} a_i * b_i.
- 2. Round z to the limits set by <IL', FL'>: c = Convert(z, <IL', FL'>).
- This procedure has three advantages:
- It closely matches the hardware arithmetic units of an FPGA, the Digital Signal Processing (DSP) units: a DSP unit takes 18-bit inputs and accumulates into a 48-bit register, and implements arithmetic and logic operations including fixed-point addition and multiplication.
- Rounding only once, after the entire accumulation, greatly reduces the hardware cost of the stochastic rounding logic.
- It lets us efficiently emulate fixed-point computation on CPUs/GPUs with vendor-supplied Basic Linear Algebra Subprograms (BLAS) libraries: for example, convert the two matrices A and B to floating point, call the SGEMM routine to compute the matrix product, and pass the result through Convert. This advantage matters less for the paper itself, since the final implementation is on an FPGA (see the sketch below).
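A minimal NumPy sketch of the two-step MACC along the emulation path (assumed, to illustrate the idea; the fixed-point formats in the example are illustrative choices):

```python
import numpy as np

def quantize(x, IL, FL):
    """Round-to-nearest plus saturation into the <IL, FL> format (elementwise)."""
    eps = 2.0 ** (-FL)
    lo, hi = -(2.0 ** (IL - 1)), 2.0 ** (IL - 1) - eps
    return np.clip(np.round(x / eps) * eps, lo, hi)

def macc(a, b, IL_out, FL_out):
    """c = Convert(sum_i a_i * b_i, <IL', FL'>): each product a_i * b_i lives in
    the <2*IL, 2*FL> format, the accumulation itself is never rounded (mirroring
    a 48-bit DSP accumulator), and rounding/saturation happens exactly once at
    the end."""
    z = float(np.dot(a, b))              # full-precision accumulation of the products
    return quantize(z, IL_out, FL_out)   # the single Convert step

# Vectors stored in <2, 6>; the result is kept in <4, 12> (an illustrative choice).
a = quantize(np.random.randn(64), 2, 6)
b = quantize(np.random.randn(64), 2, 6)
print(macc(a, b, 4, 12))
```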
Training Deep Networks
- Two network types are evaluated: fully connected deep neural networks (DNN) on MNIST and convolutional neural networks (CNN) on CIFAR10.
- MNIST is used to train handwritten-digit recognition.
- CIFAR10 contains 50,000 RGB training images in 10 classes (5,000 per class) plus a 10,000-image test set, and is used to train recognition of object categories such as airplane, automobile, frog, dog, and so on.
- Parameters are initialized randomly and then adjusted step by step at every epoch.
Training Deep Networks-MNIST in DNN
- The difference between round-to-nearest and stochastic rounding during training: when a parameter update falls in the interval (-ε/2, +ε/2), round-to-nearest always rounds it to zero, so applying the update is the same as not updating at all; stochastic rounding, in contrast, rounds such an update to ±ε with some probability, so small updates are preserved in expectation (see the sketch below).
- In addition, when the ReLU activation function is computed, the fixed-point range is small, so values easily exceed the <IL, FL> limit and are forced (saturated) to the upper limit of <IL, FL>, which distorts the data.
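The vanishing-update effect is easy to reproduce numerically. A toy sketch (my own, not the paper's experiment) that applies many updates smaller than ε/2 to a single weight:

```python
import math
import random

FL = 12
EPS = 2.0 ** (-FL)

def round_nearest(x):
    fl = math.floor(x / EPS) * EPS
    return fl if (x - fl) <= EPS / 2 else fl + EPS

def round_stochastic(x):
    fl = math.floor(x / EPS) * EPS
    return fl + EPS if random.random() < (x - fl) / EPS else fl

# Apply 1000 tiny gradient updates of +eps/10, re-quantizing the weight each time.
update = EPS / 10
w_nearest = w_stochastic = 0.0
for _ in range(1000):
    w_nearest = round_nearest(w_nearest + update)          # always snaps back to 0
    w_stochastic = round_stochastic(w_stochastic + update) # rounds up ~10% of the time

print(w_nearest)      # 0.0: all 1000 updates were lost
print(w_stochastic)   # ~100 * eps on average (~0.0244), i.e. the updates survive
```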
Training Deep Networks-MNIST in CNN
Training Deep Networks-CIFAR10 in CNN
- Notably, the stochastic-rounding run with 12 fractional bits ("stochastic 12") saturates at a test error of roughly 28% and stops improving, so for the last few epochs the word length is raised to WL = 20, i.e. <4, 16>, which recovers the accuracy. This is a form of mixed-precision training (a schedule sketch follows below).
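A minimal sketch of such a precision schedule (hedged: the pre-switch format <4, 12> and the number of high-precision epochs are my reading of the slide, not stated explicitly):

```python
def update_format(epoch, total_epochs, high_precision_epochs=5):
    """Pick the fixed-point format for weight updates: a 16-bit format for most
    of training, relaxed to the 20-bit <4, 16> format for the final few epochs."""
    if epoch >= total_epochs - high_precision_epochs:
        return 4, 16          # WL = 20
    return 4, 12              # WL = 16 (assumed pre-switch format)

# Example: epochs 0-94 use <4, 12>, epochs 95-99 use <4, 16>.
for epoch in range(100):
    IL, FL = update_format(epoch, total_epochs=100)
    # ... run one training epoch, quantizing the weight updates to <IL, FL> ...
```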
Hardware Prototyping(1/)
Motivation
- A long series of GEMM operations accounts for a large fraction of the network's total execution time: feed-forward, error back-propagation, and the weight update calculation.
- GPUs can accelerate these computations, but they are tuned for floating-point performance, which does not match the needs here.
Two main reasons for choosing an FPGA
- Compared with an ASIC, an FPGA offers a much shorter hardware development time and is cheaper.
- It provides a large number of fixed-point DSP units, which suit this fixed-point arithmetic implementation well and raise the potential performance and energy efficiency.
Brief description of the modules shown on the slide:
- Instead of storing entire matrices in the L2 cache, only a subset of rows and columns is kept there and the rest stays in DDR3; the goal is to maximize reuse of the data already in the L2 cache, compute partial results, and write them back to DDR3.
- READ: fetches data from DDR3 and places it in the L2 cache.
- WRITE: writes the SA's partial results back to DDR3.
- L2-to-SA: feeds the rows/columns held in the L2 cache to the SA for computation.
- Systolic Array (SA): all additions, multiplications, and stochastic rounding are implemented here.
- TOP: integrates all of the modules.
Hardware Prototyping(2/)
- Data reuse allows efficient use of the bandwidth between the FPGA and the DDR3 memory.
- p is determined by the capacity of the on-chip memory; matrix A is handled in groups of n rows and matrix B in groups of n columns, with p row groups (p*n rows of A) cached at a time.
- n columns of matrix B and p*n rows of matrix A are loaded into the cache, the SA computes the p*n*n partial results, and they are stored back to DDR3.
- The next n columns of matrix B are then loaded into the cache and the step is repeated until all m columns of B have been processed, which ends one round.
- In the next round, the next p*n rows of matrix A and a fresh pass over the n-column groups of matrix B are loaded into the cache, and so on, for a total of l/(p*n) rounds.
- In this way, once the elements are brought into the FPGA, each element of matrix A is reused m times and each element of matrix B is reused p*n times (see the loop-nest sketch below).
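The scheme is essentially a blocked (tiled) matrix multiply. A Python sketch of the loop nest (the matrix shapes and tile sizes are illustrative, and the L2 cache and DDR3 are only mimicked by array slices):

```python
import numpy as np

def blocked_gemm(A, B, p, n):
    """C = A @ B computed the way the slide describes: keep p*n rows of A and
    n columns of B on chip at a time, so each cached element of A is reused m
    times and each cached element of B is reused p*n times."""
    l, _ = A.shape
    _, m = B.shape
    C = np.zeros((l, m))                       # lives in DDR3 in the real design
    for r0 in range(0, l, p * n):              # one round: the next p*n rows of A
        a_tile = A[r0:r0 + p * n, :]           # READ: rows of A into the L2 cache
        for c0 in range(0, m, n):              # sweep all column groups of B
            b_tile = B[:, c0:c0 + n]           # READ: n columns of B into the cache
            # L2-to-SA + systolic array: (p*n) x n partial results, then WRITE to DDR3
            C[r0:r0 + p * n, c0:c0 + n] = a_tile @ b_tile
    return C

A = np.random.randn(64, 32)
B = np.random.randn(32, 48)
assert np.allclose(blocked_gemm(A, B, p=2, n=8), A @ B)
```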
Systolic Array Architecture(1/)
- Each node (DSP MACC) is a DSP unit that performs a multiply and an add every clock cycle.
- This is an example of a wavefront-type systolic array; the arrangement reduces interconnect delays and raises the maximum operating frequency.
- Each FIFO holds the elements of one row of A or one column of B.
- Each DSP result (one element of the result matrix) is sent through a local storage register to the DSP-Round unit, which rounds it up or down (a cycle-level sketch follows below).
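Below is a minimal cycle-level Python sketch (my own model, not the RTL) of an output-stationary wavefront array: operands enter skewed from the left and top edges, every PE performs one multiply-accumulate per cycle and forwards its operands right and down, and in the real design each accumulated element would then pass through the rounding stage.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary (wavefront) systolic array.
    PE(i, j) accumulates C[i, j]; A operands flow rightwards, B operands flow
    downwards, entering the array skewed by one cycle per row/column."""
    m, k = A.shape
    _, n = B.shape
    acc = np.zeros((m, n))       # per-PE accumulator (one element of C each)
    a_reg = np.zeros((m, n))     # A operand currently latched in each PE
    b_reg = np.zeros((m, n))     # B operand currently latched in each PE
    for t in range(m + n + k - 2):               # cycles until the last wavefront drains
        new_a, new_b = np.zeros((m, n)), np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                # take operands from the left/top neighbour, or from the skewed
                # input streams at the array boundary (zeros outside the stream)
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < k else 0.0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < k else 0.0)
                acc[i, j] += a_in * b_in         # one MACC per PE per cycle
                new_a[i, j], new_b[i, j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return acc                   # in hardware each element would now be rounded

A = np.random.randn(4, 5)
B = np.random.randn(5, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```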
Systolic Array Architecture(2/)
Illustration of the wavefront-type systolic array in operation.
Systolic Array Architecture(3/)
Implementation environment
- A 28x28 systolic array implemented on a Kintex K325T FPGA.
- Synthesis and place-and-route with Xilinx's Vivado tools.
- Maximum circuit operating frequency: 166 MHz; power consumption: 7 W.
Performance evaluation
- Throughput: 260 G-ops/s.
- Power efficiency: 37 G-ops/s/W.
- By comparison, implementations on an Intel i7-3720QM CPU and on NVIDIA GT650m and GTX780 GPUs achieve power efficiency in the range of 1-5 G-ops/s/W.