1
Picos: A hardware runtime architecture support for OmpSs
2016/12/5 胡連鈞 Hu, Lien-Chun
電機系, Department of Electrical Engineering
國立成功大學, National Cheng Kung University, Tainan, Taiwan, R.O.C.
ext 62365, Office: 奇美樓, 6F, 95601
2
Content
1. Introduction
2. The Picos hardware
   2.1. Operational flow overview
   2.2. Picos modules
   2.3. Picos hardware latencies
   2.4. Design issues
3. Experimental setup
4. Space design exploration
   4.1. Picos for high performance computing
   4.2. Simple Picos for small multicores
5. Results
   5.1. Comparison with the software alternative
3
1. Introduction OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs to express task parallelism. It is based on runtime data-dependency analysis, dataflow scheduling and out-of-order task execution, and it has greatly influenced Version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library: Nanos++. Therefore, although powerful and easy to use, its performance benefits are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model.
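As an illustration of the annotation style this describes, the fragment below is a hypothetical OmpSs-style task (the function name and values are invented; the `inout` clause and `v[0;n]` array-section syntax follow OmpSs conventions). A plain C compiler ignores the pragmas, so the same code also runs sequentially:

```c
#include <stddef.h>

/* Hypothetical OmpSs-style annotation: the task declares the direction
 * of its dependencies (inout here) and the runtime builds the task
 * dependency graph from these clauses. Without an OmpSs compiler the
 * pragmas are ignored and the loop simply runs sequentially. */
void vector_scale(float *v, size_t n, float k)
{
    #pragma omp task inout(v[0;n])
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
    #pragma omp taskwait
}
```

The point of the model is exactly this: the sequential program plus a few directionality annotations is enough for the runtime (Nanos++ in software, Picos in hardware) to discover the parallelism.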
4
Superscalar (supplementary note) A superscalar CPU architecture is one that implements instruction-level parallelism within a single processor core, a form of parallel computation.
5
1. Introduction Picos is a novel hardware dataflow-based task scheduler that dynamically analyzes inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos hardware design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the Picos and Nanos++ runtime performance scalability with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using parallel strategies with a large number of fine-granularity tasks.
6
2. The Picos hardware Picos hardware is a major revision of the Hardware Task Superscalar architecture with several improvements in its work-flow. The main improvements are related to architectural changes that add support for nested tasks, better memory management and faster task dispatching. It consists of a many-core system in which any number of threads send two types of task information to the Picos hardware: (1) the dependency information of new tasks, and (2) the notification of a task ending.
7
2. The Picos hardware The Picos hardware consists of one Gateway (GW), one or more Dependence Chain Trackers (DCT), one or more Task Reservation Stations (TRS) and one Task Scheduler (TS). All these components work together in parallel in order to build the dynamic task dependency graph and generate a list of ready tasks that are sent back to the threads to be executed. The connections between the modules are decoupled by FIFO queues that are interconnected by arbiter modules.
8
2.1. Operational flow overview
Once a thread reaches a task-creation point, it creates a new task descriptor. This information mainly includes the address of the task code to be executed and the address of all its dependencies. Once this descriptor is created, it is sent to the Picos hardware, which reads its information and stores the data of the corresponding task until all its dependencies are fulfilled. For the first task created, all its dependencies are ready because all of them are already in memory. However, the most common case is that a task has to wait until one or more of its dependencies become ready after other tasks finish.
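A task descriptor of the kind described above might be sketched as follows. All field names, the dependence limit, and the readiness check are assumptions for illustration; the paper does not give the exact layout:

```c
#include <stdint.h>

#define MAX_DEPS 15  /* assumed per-task dependence limit */

enum dep_dir { DEP_IN, DEP_OUT, DEP_INOUT };  /* dependence direction */

/* Hypothetical task descriptor: the address of the task code plus the
 * address and direction of each of its dependencies. */
struct task_descriptor {
    uintptr_t    task_code;           /* address of the code to execute */
    uint32_t     num_deps;            /* how many dependencies follow   */
    uintptr_t    dep_addr[MAX_DEPS];  /* address of each dependence     */
    enum dep_dir dep_dirs[MAX_DEPS];  /* in / out / inout               */
};

/* A task is ready only when no producer of its dependencies is still
 * pending; for the very first task, nothing is pending, so it is ready
 * immediately (its dependencies are already in memory). */
static int all_deps_ready(const struct task_descriptor *t,
                          const int *pending_producers)
{
    for (uint32_t i = 0; i < t->num_deps; i++)
        if (pending_producers[i])
            return 0;
    return 1;
}
```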
9
2.1. Operational flow overview
With this finishing message, Picos deletes the corresponding descriptor in the system and proceeds to mark all the task's dependencies as ready. The Picos hardware will then try to send the new ready tasks to be executed. This entails sending the descriptor to the TS, which will make it available to all the threads in the system. When a thread that is not busy notices that a new descriptor is available, it starts executing the corresponding task. If a new task is created, new descriptors are created and the dependency information is sent to Picos.
10
2.2. Picos modules (GW) The Gateway (GW) is a simple selector that reads the messages (new-task and finished-task messages) that arrive at the system and sends them to the associated module. If the message is a new task, the GW reads the Task ID and the dependencies, and sends this information in a packet to the corresponding TRS. Each dependency is then read and sent to the associated DCT in a packet containing the following information: the address and direction of the dependency, and the task-related identifiers.
11
2.2. Picos modules (GW) Therefore, to process a new-task message, the GW needs: an empty slot in the TRS modules, and the DCT in which each piece of dependence information should be stored. As we will see, each TRS sends a packet with this information to the GW as soon as it has an empty slot. On the other hand, the DCT associated with each dependence is computed directly from the dependence address with a hash function that should be properly selected to distribute the dependencies among all the DCTs.
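The DCT selection step could be sketched as below. The paper only says the hash function "should be properly selected"; the multiplicative (Fibonacci) hash here is one plausible choice, not the one Picos actually uses:

```c
#include <stdint.h>

/* Hypothetical DCT selection: the target DCT is computed directly from
 * the dependence address. A multiplicative hash mixes the address bits
 * so that nearby addresses spread across the DCTs. */
static unsigned dct_for_address(uint64_t dep_addr, unsigned num_dcts)
{
    uint64_t h = dep_addr * 0x9E3779B97F4A7C15ULL;  /* Fibonacci hashing */
    return (unsigned)(h >> 32) % num_dcts;
}
```

Because the mapping is a pure function of the address, every instance of the same dependence always reaches the same DCT, which is what lets one DCT own the whole chain for that address.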
12
2.2. Picos modules (TRS) The Task Reservation Station (TRS) is the module that manages all the processes related to the in-flight tasks. It has a Task Memory (TM), an indexed memory in which each TRS stores all the information about the tasks and their dependencies. A TRS may receive packets from the GW, the DCTs, and other TRSs:
1. From the GW, with information about a new task: the TRS stores it in the slot specified by the packet and then looks for a new free slot. The TRS then waits for all the dependencies of the task to be ready before sending it to execute.
2./3. Packets from the DCT or TRS modules can have two objectives: (1) to update information about a dependence in order to create a chain of dependencies, or (2) to notify that a dependence is now ready.
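The "wait until all dependencies are ready" behavior amounts to a per-slot countdown. The sketch below is a minimal model of a TM entry with such a counter; the field names and widths are assumptions, not the real TM layout:

```c
#include <stdint.h>

#define MAX_DEPS 15  /* assumed per-task dependence limit */

/* Hypothetical Task Memory (TM) entry: one slot per in-flight task.
 * The TRS tracks how many dependencies are still outstanding; when the
 * counter reaches zero the task can be sent to the Task Scheduler. */
struct tm_entry {
    int       valid;                 /* slot in use                    */
    uintptr_t task_code;             /* code address from the GW packet*/
    uint32_t  deps_pending;          /* dependencies not yet ready     */
    uint16_t  dep_vm_addr[MAX_DEPS]; /* VM address of each dependence  */
};

/* Called when a DCT (or another TRS) notifies that one dependence of
 * the task in `slot` is now ready; returns 1 if the whole task became
 * ready and should be dispatched. */
static int trs_dep_ready(struct tm_entry *tm, unsigned slot)
{
    if (--tm[slot].deps_pending == 0)
        return 1;   /* all dependencies fulfilled: send to execute */
    return 0;
}
```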
13
2.2. Picos modules (DCT) The Dependence Chain Tracker (DCT) manages all the dependencies in the system. The dependence information is stored in two different memories: the Dependence Memory (DM) and the Version Memory (VM).
1. The DM is indexed by a hash function of the initial dependence address and stores the basic dependence information and a pointer to the last version of each dependence.
2. Version information is stored in the VM, which keeps one entry for every version of the dependence, and also the TRS address of the producer of the next version.
14
Hash Function (supplementary note) A hash function (or hash algorithm) is a method for creating a small digital value from any kind of data. It compresses a message or data into a digest, making the data smaller and fixing its format.
15
2.2. Picos modules (DCT) When the DCT receives a packet from the GW with a new dependence instance entering the system, it searches for the dependence in the DM. If the dependence is not found, the DCT creates an entry for it in both the DM and the VM. A dependence-ready message is then sent, together with its VM address, to the corresponding TRS. Otherwise, when the dependence instance is found in the DM, two main possibilities exist:
1. The dependence has an input direction (i.e. it is a consumer).
2. It has an output or inout direction (i.e. it is a producer).
In the first case the dependence is added to the dependency chain of its last version and a packet with the dependency information is sent to the TRS. In the second case, a new version should be created and all memories are updated correspondingly.
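The decision tree above can be sketched as a tiny software model. A direct-mapped table stands in for the hashed DM, and version chaining is reduced to a counter; everything here (slot count, hash, return convention) is an assumption for illustration:

```c
#include <stdint.h>

#define DM_SLOTS 64  /* assumed DM size */

/* Hypothetical DM entry: one dependence address plus a pointer (here
 * just an index) to its latest version in the VM. */
struct dm_entry {
    int      valid;
    uint64_t addr;         /* dependence address                   */
    int      last_version; /* index of the latest VM version       */
};

enum dep_dir { DEP_IN, DEP_OUT, DEP_INOUT };

/* Models the DCT's handling of a new dependence instance.
 * Returns 1 if the dependence is immediately ready (not found in the
 * DM), 0 if it was chained behind an existing version. */
int dct_new_dependence(struct dm_entry *dm, uint64_t addr, enum dep_dir dir)
{
    unsigned slot = (unsigned)(addr >> 3) % DM_SLOTS;  /* assumed hash */
    struct dm_entry *e = &dm[slot];
    if (!e->valid || e->addr != addr) {  /* not found: create DM+VM entry */
        e->valid = 1;
        e->addr = addr;
        e->last_version = 0;
        return 1;                        /* dependence-ready message      */
    }
    if (dir == DEP_IN)                   /* consumer: join the last chain */
        return 0;
    e->last_version++;                   /* producer: open a new version  */
    return 0;
}
```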
16
2.2. Picos modules (TS) The Task Scheduler (TS) is a simple dispatcher of tasks. As soon as it receives a ready task, it searches for an available thread and forwards the descriptor associated with the task to that thread.
17
2.3. Picos hardware latencies
To obtain the Picos hardware latencies we have undertaken a full VHDL implementation of all the modules in the Picos hardware and synthesized them targeting a Virtex-7 device. Table 1 shows the set of latencies corresponding to the synthesis of the main functionality of the Picos hardware modules.
18
2.3. Picos hardware latencies
In addition, the queues do not allow reading and writing at the same time. With this implementation, the first cycle in the process of writing to or reading from a queue is used to reserve the slot, while the second cycle performs the action. With this information, we have built a cycle-accurate simulator that exactly replicates the hardware design. The software simulator allows us to fully explore the optimum design point of the modules, in terms of number of modules and memory sizes, that cannot at present be synthesized in a real FPGA. It also allows us to implement accurate simulation of complex behavior patterns, such as nested tasks or a large number of modules, as well as detecting their bottlenecks and thereby correcting them.
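The two-cycle queue protocol (reserve, then act) can be modeled in a few lines. This is a minimal sketch of the write side only, with an assumed size and API; the read side would mirror it:

```c
#define QSIZE 8  /* assumed queue depth */

/* Minimal model of the two-cycle queue: cycle 1 reserves a slot,
 * cycle 2 performs the write. While a write is reserved, no other
 * write can be reserved, and a read never shares a cycle with it. */
struct twocycle_queue {
    int data[QSIZE];
    int head, tail, count;
    int reserved_write;  /* a write was reserved last cycle */
    int pending_value;
};

/* Cycle 1: try to reserve a write slot; fails if full or already reserved. */
int q_reserve_write(struct twocycle_queue *q, int value)
{
    if (q->reserved_write || q->count >= QSIZE)
        return 0;
    q->reserved_write = 1;
    q->pending_value = value;
    return 1;
}

/* Cycle 2: commit the reserved write into the FIFO. */
void q_commit_write(struct twocycle_queue *q)
{
    if (!q->reserved_write)
        return;
    q->data[q->tail] = q->pending_value;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    q->reserved_write = 0;
}
```

Splitting reserve and commit is exactly what makes the simulator cycle-accurate: each queue operation visibly costs two cycles, as in the synthesized hardware.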
19
2.4. Design issues The modular design of Picos introduces challenges that an integrated design may not pose. The main problem sources in this kind of design are the unexpected effects that arise from the interaction of simple behaviors in the modules. The key point that makes the modules work together is the fact that the only module that can stall is the TS. If the TS stalls, it means that all the threads are busy and the system simply waits for them to finish. TS stalls → all threads wait!
20
2.4. Design issues The GW cannot stall, as the system has to wait for tasks to finish. This means that the finished-task messages use a different queue than the new-task messages. If a task cannot be issued, the GW stores it internally and continues processing finished tasks. The same behavior occurs in the DCTs: if any of the DCT memories is full, the DCT stops processing new dependencies but continues processing packets (releasing dependencies) from the TRSs.
21
2.4. Design issues The TRSs have the most complicated behavior.
The TRSs may deadlock for two reasons:
1. They try to send a ready task to execute but all the threads are busy. This can be solved simply by keeping one bit of information for every ready task, indicating that it has not been sent for execution yet.
2. The finished-task problem: when processing a finished task, a TRS sends dependence messages to the associated DCTs. The deadlock occurs when a DCT follows those messages trying to awaken other tasks in other (or the same) TRSs. If two or more TRSs are doing the same thing at the same time, a deadlock may occur.
Solution: the queues from the TRSs to the DCTs are different from the others in the system, and the TRS does not block on finished-task messages but continues to do other work.
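The first fix (the per-task "not yet sent" bit) can be sketched as follows; the structure and names are illustrative, not from the paper:

```c
#define MAX_TASKS 32  /* assumed ready-list capacity */

/* Sketch of the first deadlock fix: every ready task carries a
 * "dispatched" bit. If all threads are busy, the TRS leaves the bit
 * clear and simply retries later, instead of blocking on the send. */
struct ready_list {
    int ready[MAX_TASKS];      /* task has all dependencies fulfilled */
    int dispatched[MAX_TASKS]; /* task already sent for execution     */
};

/* Try to hand one ready, not-yet-dispatched task to a free thread.
 * Returns the task index, or -1 if nothing can be dispatched now. */
int try_dispatch(struct ready_list *rl, int free_threads)
{
    if (free_threads <= 0)
        return -1;  /* all threads busy: do NOT stall, just retry later */
    for (int i = 0; i < MAX_TASKS; i++)
        if (rl->ready[i] && !rl->dispatched[i]) {
            rl->dispatched[i] = 1;
            return i;
        }
    return -1;
}
```

Because dispatch never blocks, the TRS stays free to keep draining DCT messages, which is what breaks the circular-wait condition for the first deadlock case.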
22
2.4 Deadlock (supplementary note) A deadlock requires all four of the following conditions:
Mutual exclusion: a resource can be used by only one process at a time.
Hold and wait: a process holds a resource while waiting for other resources.
No preemption: a resource can be released only by the process holding it, not by any other means.
Circular wait: each process holds a resource requested by another process, so every process is waiting for another to release a resource.
23
3. Experimental setup Picos hardware
24
3. Experimental setup Picos hardware
Conf = configuration file
25
4. Picos for high performance computing
(The speed-up levels off.)
26
4. Picos for high performance computing
(Almost no effect.)
27
4. Picos for high performance computing
(More TRS modules: better speed-up. Larger memory size: the gains level off.)
28
4. Picos for high performance computing
29
5. Results (The more cores, the better the parallelization scales.)
30
5. Results Picos > Nanos++ (Picos outperforms Nanos++.)
31
5. Results (The left side shows the average task size and the right side the number of tasks; the intersection of the two values gives the required block size.)