POWER-EFFICIENT RANGE-MATCH-BASED PACKET CLASSIFICATION ON FPGA 這篇主要提出的power efficient,是我們pipeline架構常見的一個現象,前面已經不符合某個條件下,那後面的比對也不用再進行 所以他們用的方法是關掉後面不用進行比對的memory module,後面有一些數據是我們可以參考的 Authors: Yun R. Qu, Viktor K. Prasanna Publisher: 2015 25th International Conference on Field Programmable Logic and Applications (FPL) Presenter: Chun-Yu, Li Date: 105/10/5 Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Computer & Internet Architecture Lab INTRODUCTION We construct a modular PE to match a stride of the packet header against a stride of a range boundary (A naïve approach to match ranges is to deploy a W-bit comparator for each range boundary). We concatenate multiple PEs into a systolic array to sustain high clock rates for large rule sets. We employ a self-enabled power gating technique on our architecture. The modular PEs are selectively enabled to save the memory access power. 每一個PE就是一個s-bit的comparator,如果是upper bound,就用小於等於的comparator,如果是lower bound,就用大於等於的comparator, 當然也可以直接做全部bit的比對,但是實作上就會出現電路比較慢的情形。 最後所有的PE會串成systolic array來進行比對(這個架構後面會看到) 這是剛剛提到的,後面不用進行比對的memory module,他們會用clock gating的方式去關掉memory省電。 Computer & Internet Architecture Lab CSIE NCKU
MODULAR PROCESSING ELEMENT s: s-bit stride. c: number of range boundaries compared by a modular PE. PE consist of c data memory module, c comparators. c = 2 1.解釋一個PE處理s bit, c boundaries 2.解釋小於等於的comparator 3.解釋紅圈 ≤ comparator eql0_in less0_in x & y eql0 less0 en0 * 1 en0_in x=y x<y x>y Computer & Internet Architecture Lab CSIE NCKU
Computer & Internet Architecture Lab SYSTOLIC ARRAY The strides from the input packet headers are propagated vertically. Each column of PEs compares an s-bit input stride with the corresponding strides of all the N rules. The number of rows required for N rule is 𝑁 𝑐 , while the total number of columns required for ( 𝑊 𝑚 )–bit packet is 2∙ 𝑊 𝑚 𝑠 . 所以我們可以看到所有的PE會串成這種SYSTOLIC ARRAY的架構,這個架構是Decomposition based,比對完某一維度,再比對下一個維度。 每個維度有各自的upperbound、lowerbound比對。 所以有幾個row就取決於一個PE可以處理多少rule,總共需要幾個row做處理。從column來看的話,就是 所有要進行比對的bit數除與stride數,乘與2是因為需要upperbound跟lowerbound的比對 Computer & Internet Architecture Lab CSIE NCKU
Computer & Internet Architecture Lab SYSTOLIC ARRAY (Type I) for the same field, both comparing upperbounds/lowerbounds 那接線之間會有三種type,如果是同一種 (Type III) for different fields (Type II) for the same field, one comparing upperbound but the other comparing lowerbound Computer & Internet Architecture Lab CSIE NCKU
SELF-ENABLED POWER GATING Chaining: If one data memory module is deactivated, a chain of data memory modules in the following PEs will be deactivated. To make the self-enabled power gating more effective, for a given packet header and a given rule, we need to report the mismatch, if any, as early as possible Computer & Internet Architecture Lab CSIE NCKU
ENTROPY-BASED SCHEDULING Number of bits in field 𝑚 as 𝑊 𝑚 , where 𝑚=0,1,…,𝑀−1 A range in field m may cover multiple values in [0, 2 𝑊 𝑚 ) Number of occurrences for value 𝑘 (𝑚) in field 𝑚 as 𝑓 𝑘 (𝑚) The probability that an input value 𝑘 (𝑚) matches some rules in field 𝑚 is 𝑝 𝑘 (𝑚) = 𝑓 𝑘 (𝑚) 𝑘 (𝑚) 𝑓 𝑘 (𝑚) The entropy of field 𝑚 is given: 𝐻 𝑚 =− 𝑘 (𝑚) [ 𝑝 𝑘 𝑚 ∙log 𝑝 𝑘 𝑚 ] Schedule all the fields in descending order of 𝐻(𝑚), a field with higher entropy reveals more information; thus, such a field should be matched early to identify mismatches. Computer & Internet Architecture Lab CSIE NCKU
PERFORMANCE EVALUATION We conducted experiments on the state-of-the-art Xilinx Virtex 7 FPGA (XC7VX1140T). We evaluated the performance using Xilinx Vivado 2014.2. We set a conservative clock constraint of 250MHz for all of our designs. We reported resource utilization and clock rate using post-place-and-route reports. For power estimation; we used Switching Activity Interchange Format (SAIF) files as inputs to Vivado power analysis tool. We built synthetic rule sets where the entropy of each field could be adjusted separately. Computer & Internet Architecture Lab CSIE NCKU
PERFORMANCE EVALUATION Fix the packet header length 𝑊 𝑚 =64 and the number of rule N = 128. Vary 𝑠=2, 4, 8, 16, and 𝑐=2, 4, 8, 16, 32, 64. The performance of 𝑠=4 and 𝑐=64 is best. Computer & Internet Architecture Lab CSIE NCKU
PERFORMANCE EVALUATION Our designs on FPGA sustains > 250 MPPS throughput even for 4K 512-bit rules. Computer & Internet Architecture Lab CSIE NCKU
PERFORMANCE EVALUATION We show the power performance with respect to N, for the classic packet classification (M = 5) and the OpenFlow table lookup (M = 15) under three scenarios: Without any power optimization technique. With self-enabled power gating, but randomly scheduling of packet header fields. With self-enabled power gating along with entropy-based scheduling of packet header fields. To see the effect of random scheduling, we run 100 tests for a fixed pair of 𝑀 and 𝑁. For each test, the power consumption is averaged on 1K random packets. Computer & Internet Architecture Lab CSIE NCKU
PERFORMANCE EVALUATION Self-enabled power gating reduces the average power consumption by 60%. Exploiting both optimization reduces the average power consumption further by 20%. red “o”: no optimization pink “x”: self-enabled power gating blue “+”: both optimization Computer & Internet Architecture Lab CSIE NCKU