Leftmost Longest Regular Expression Matching in Reconfigurable Logic 2018/5/8 Leftmost Longest Regular Expression Matching in Reconfigurable Logic Author: Kubilay Atasu (IBM Research – Zurich) Presenter: Kuan-Chieh Feng Date: 2016/10/11 Department of Computer Science and Information Engineering National Cheng Kung University CSIE CIAL Lab 1
Outline Introduction Related Work Baseline Architecture Optimized Architecture Experiment National Cheng Kung University CSIE Computer & Internet Architecture Lab
2018/5/8 Introduction The so-called big data has become a new natural resource, and discovering insights in big data will be the key capability of future computing platforms. The process of extracting information from large-scale unstructured text is called text analytics. Text analytics functions rely heavily on regexs and dictionaries for locating named entities. 按能源消耗的比例增加來增加processor的core數量相當不合成本。 所以要同時改善效能與能源消耗,就要仰賴其他的計算資源,像是GPU或是FPGA。 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
2018/5/8 Introduction 圖1顯示使用FPGA-based在商業分析平台上從不同的data resource持續收集新的entries 並且使用local news search engine把他們index。 當大量的使用者做query,第二階段就會變成bottleneck。 要減少bottleneck就需要增加processor的core數,且會導致使用更多的空間以及能量消耗 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Related Work When several regex matches that end at the same offset position exist, typically only the span with the smallest start-offset value will be reported. This technique is called leftmost regex matching heuristic. National Cheng Kung University CSIE Computer & Internet Architecture Lab
2018/5/8 Related Work Definition 1: Each regex match is associated with a span (s, e), where s is the start-offset position and e the end-offset position of the regex match. Definition 2: Span (s0, e0) contains span (s1, e1) if (s1 > s0) and (e1 < e0). 1 每個Regex match對應到一組span 2 每一組span包含s與e代表位置 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
2018/5/8 Related Work Definition 3: The leftmost regex match at offset position i is the regex match with the smallest start-offset position value that ends at offset position i. Definition 4: A leftmost longest regex match is a regex match that is not contained in any other regex match. 3 在offset位置i的leftmost regex match則是結束在位置i的regex match中其start-offset值最小者。 4 leftmost longest regex match代表部會包含在所有其他的regex match National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
2018/5/8 Related Work 假設regex是 a|aa|aaaa能夠match input stream aaaa 圖顯示可以match有八種情況。每個情況根據其start offset與end offset有不同的span(如圖) 其中只有四個是屬於leftmost matches,包括(0,0) (0,1) (1,2) (0,3) 當leftmost match在offset位置為1時,則是span (0,1) offset 2 : span (1,2) offset 3 : span (0,3) 其中start offset最小的為span(0,3)則(0,3)為leftmost longest match National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Baseline Architecture 2018/5/8 Baseline Architecture Leftmost longest regex matches combines : Regex matcher that supports start-offset reporting Sorting unit (increasing order of the start offsets, where the spans having the same start offset are sorted in decreasing order of their end offsets.) Containment filter sort過的span會被餵到containment filter(過濾所以包含在其他的span 且會產生指出leftmost longest regex match的過濾的span stream。 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Baseline Architecture 2018/5/8 Baseline Architecture 經過start offset遞增排序,且end-offset遞減排序後所得到的第一個span, 在抵達containment filter後會被儲存在local register且還沒產生其他output。 接著後面進來的span,其end offset必定小或等於原來儲存在register中的value。 總而言之,sorting的過程會保證後面進來的span必定會contain在原來的span當中。 之後span會儲存在local memory然後output,則後面的span在儲存在local register中。 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Baseline Architecture 2018/5/8 Baseline Architecture In general, the proposed architecture can produce multiple and possibly overlapping spans as output. Example : Assume span (0, 1), (2, 3), (2, 4), and (1, 5) . After sorting (0, 1), (1, 5), (2, 4), and (2, 3). The containment filter stores the first span (0, 1) in its registers. When the second span (1, 5) arrives, (0, 1) is written to output and (1, 5) is written to the registers. National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Baseline Architecture 2018/5/8 Baseline Architecture 當整個網路的資料都儲存在controller,很容易就會洩漏 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture 2018/5/8 Optimized Architecture We exploit two key properties to eliminate the redundancies that exist in the baseline architecture. Observe that to locate the leftmost longest matches, it is not necessary to generate all possible regex matches. The existing architectures that support leftmost regex matching [13], [14] produce partially sorted results. They report only the regex match with the smallest start offset. National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture 2018/5/8 Optimized Architecture 首先,在regex match我們只找出start-offset最小的span 接著,在inversion unit我們透過原來的順序來反轉,此時end-offset最大且start-offset最小的span會移到最前面 最後containment filer就可以直接取出LLM。 National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture 2018/5/8 Optimized Architecture Latency-Hiding Inversion Unit The inversion unit can be implemented using simple LIFO buffers. It will not produce any output until the leftmost regex matcher produces an eos (end-of-stream) signal. The latency of this can significantly reduce the throughput rate of the optimized architecture. National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture 2018/5/8 Optimized Architecture 透過兩種模式,我們可以同時針對兩條stream做讀寫的動作 start pointer與end pointer開始是相同的。而end 位置從LIFO buffer的end pointer讀取,而read pointer會在new span傳送到containment filter時遞減(遞增)。 當eos signal抵達時,inversion unit會執行: 1. 複製write pointer到read pointer , start pointer到end pointer 2.從forward切換到backward mode (或是相反) 3.初始化start pointer及write pointer LIFO buffer是否可能沒有空間? 發生的情況很少,因為read buffer會很快速的flush 而write buffer反而沒那麼頻繁的寫入buffer National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture Computing non-overlapping matches Definition 5: Two spans (s0, e0) and (s1, e1) overlap if (s1 > s0) and (e1 > e0) and (s1 <= e0) or if (s0 > s1) and (e0 > e1) and (s0 <= e1). Definition 6: The non-overlapping leftmost regex match at offset position i is the regex match with the smallest start offset that ends at position i and that does not overlap with any leftmost regex match that ends at a position smaller than i. National Cheng Kung University CSIE Computer & Internet Architecture Lab
Optimized Architecture 2018/5/8 Optimized Architecture Definition 7: A non-overlapping leftmost longest regex match is a non-overlapping leftmost regex match that is not contained in any other non-overlapping leftmost regex match. National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Optimized Architecture 2018/5/8 Optimized Architecture 在位置0時,只有(0,0) 所以它是該位置的non-overlapping match 位置1時,有(0,1)與(1,1)兩者,(0,1)並沒有與(0,0) overlap,且較(1,1)符合 位置2時有(1,2)與(2,2),但是(1,2)與(0,1)是overlap因此選擇(2,2) National Cheng Kung University CSIE Computer & Internet Architecture Lab CSIE CIAL Lab
Experiment National Cheng Kung University CSIE Computer & Internet Architecture Lab