Picos: A hardware runtime architecture support for OmpSs

Slides:



Advertisements
Similar presentations
智慧老伯的一席話 原稿 : 溫 Sir 中譯 : 老柳 A man of 92 years, short, very well- presented, who takes great care in his appearance, is moving into an old people’s.
Advertisements

第七课:电脑和网络. 生词 上网 vs. 网上 我上网看天气预报。 今天早上看了网上的天气预报。 正式 zhèngshì (报告,会议,纪录) 他被这所学校正式录取 大桥已经落成,日内就可以正式通车 落伍 luòw ǔ 迟到 chídào 他怕迟到,六点就起床了.
2014 年上学期 湖南长郡卫星远程学校 制作 13 Getting news from the Internet.
专题八 书面表达.
雅思大作文的结构 Presented by: 总统秘书王富贵.
第4章 VHDL设计初步.
-Artificial Neural Network- Hopfield Neural Network(HNN) 朝陽科技大學 資訊管理系 李麗華 教授.
Academic Year TFC EFL Data Collection Outline 学年美丽中国英语测试数据收集概述
How can we be a member of the Society? You should finish the following tasks if you want to be a member of the Birdwatching Society.
云实践引导产业升级 沈寓实 博士 教授 MBA 中国云体系产业创新战略联盟秘书长 微软云计算中国区总监 WinHEC 2015
Operating System Process Management - 4 Monday, August 11, 2008.
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
Semantic-Synaptic Web Mining: A Novel Model for Improving the Web Mining 報告者:陳宜樺 報告日期:2015/9/25.
Operating System CPU Scheduing - 2 Monday, August 11, 2008.
CHT Project Progress Report
Operating System CPU Scheduing - 3 Monday, August 11, 2008.
Homework 4 an innovative design process model TEAM 7
An Adaptive Cross-Layer Multi-Path Routing Protocol for Urban VANET
指導教授:許子衡 教授 報告學生:翁偉傑 Qiangyuan Yu , Geert Heijenk
Operating System Concepts 作業系統原理 Chapter 3 行程觀念 (Process Concept)
第六章 应用程序结构.
Applied Operating System Concepts
第五讲 数据的分组、合并与转换.
Retail Customer Online Registration 零售顧客線上註冊教學
HLA - Time Management 陳昱豪.
Chapter 3 行程觀念 (Process Concept)
创建型设计模式.
Programmable Logic Architecture Verilog HDL FPGA Design
The Wise Old Man 智慧老伯的一席話 原稿: 溫Sir 中譯 : 老柳 中譯潤稿:風刀雨箭
子博弈完美Nash均衡 我们知道,一个博弈可以有多于一个的Nash均衡。在某些情况下,我们可以按照“子博弈完美”的要求,把不符合这个要求的均衡去掉。 扩展型博弈G的一部分g叫做一个子博弈,如果g包含某个节点和它所有的后继点,并且一个G的信息集或者和g不相交,或者整个含于g。 一个Nash均衡称为子博弈完美的,如果它在每.
重點 資料結構之選定會影響演算法 選擇對的資料結構讓您上天堂 程式.
软件工程 Software Engineering
增强型MR可解决 临床放射成像的 多供应商互操作性问题
客户服务 询盘惯例.
Lesson 44:Popular Sayings
簡易 Visual Studio 2005 C++ 使用手冊
Try to write He Mengling Daqu Middle School.
SectionA(Grammar Focus-4c)
IBM SWG Overall Introduction
資料結構 Data Structures Fall 2006, 95學年第一學期 Instructor : 陳宗正.
資料庫 靜宜大學資管系 楊子青.
Version Control System Based DSNs
高性能计算与天文技术联合实验室 智能与计算学部 天津大学
2019/4/8 A Load Balancing Mechanism for multiple SDN Controllers based on Load Informing Strategy Miultiple controller 的 load balancing 機制,使用一個叫 Load informing.
Guide to a successful PowerPoint design – simple is best
Changhua University of Education
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
虚 拟 仪 器 virtual instrument
中国科学技术大学计算机系 陈香兰 Fall 2013 第三讲 线程 中国科学技术大学计算机系 陈香兰 Fall 2013.
Common Qs Regarding Earnings
The Wise Old Man 智慧老伯的一席話 原稿: 溫Sir 中譯 : 老柳
关联词 Writing.
Review and Analysis of the Usage of Degree Adverbs
從 ER 到 Logical Schema ──兼談Schema Integration
Transformational Leadership
高考应试作文写作训练 5. 正反观点对比.
第10章 存储器接口 罗文坚 中国科大 计算机学院
An organizational learning approach to information systems development
BiCuts: A fast packet classification algorithm using bit-level cutting
计算机问题求解 – 论题1-5 - 数据与数据结构 2018年10月16日.
Chapter 10 Mobile IP TCP/IP Protocol Suite
CHAPTER 6 Concurrency:deadlock And Starvation
名词从句(2).
The Wise Old Man 智慧老伯的一席話 原稿: 溫Sir 中譯 : 老柳
常州市教育学会学业水平监测 九年级英语试卷分析 常州市第二十四中学 许喆 2012年2月.
怎樣把同一評估 給與在不同班級的學生 How to administer the Same assessment to students from Different classes and groups.
Operating System Software School of SCU
Race Conditions and Semaphore
MGT 213 System Management Server的昨天,今天和明天
Experimental Analysis of Distributed Graph Systems
Presentation transcript:

Picos: A hardware runtime architecture support for OmpSs 2016/12/5 2016/./. Picos: A hardware runtime architecture support for OmpSs 胡連鈞 Hu, Lien Chun 電機系, Department of Electrical Engineering 國立成功大學, National Cheng Kung University Tainan, Taiwan, R.O.C 06-2757575 ext 62365, Office: 奇美樓, 6F, 95601 Email: seido5310@gmail.com Website address: http://j92a21b.ee.ncku.edu.tw/broad/index.html or http://www.ee.ncku.edu.tw/chinese/member/professor/T405-jou/T0000000c.htm

Content Introduction The Picos hardware Experimental setup Space design exploration Results 2.1. Operational flow overview 2.2. Picos modules 2.3. Picos hardware latencies 2.4. Design issues 4.1. Picos for high performance computing 4.2. Simple Picos for small multicores 5.1. Comparison with the software alternative

1. Introduction OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs and task parallelism. Based on runtime data dependency analysis, dataflow scheduling and out-of-order task execution. It has greatly influenced Version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library:Nanos++. Therefore, although powerful and easy to use, the performance benefits are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model. 1.OmpSS是一個編程模型去提供簡單有利的方法,去註釋順序和執行task平行化 -透過相依分析,資料流scheduling和亂序task執行。 -他有大大的改善在第四版本中。 -另外現在OmpSs能夠容納存軟體去運駔透過一個Library:Nanos++ 2.然侯有利且簡單去使用,效能的好處被軟體的運行發生的開銷所限制。 3.為了解決這個障礙,我們提供的Picos,一個實現Task超存量價購,去提供硬體之原OmpSs程序model。

Superscalar (補充) 超純量(superscalar)CPU架構是指在一顆處理器內核中實行了指令級並行的一類並行運算。

1. Introduction Picos is a novel hardware dataflow-based task scheduler that dynamically analyzes inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos Hardware Design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the Picos and Nanos++ runtime performance scalability with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using parallel strategies with a large number of fine granularity tasks. 1.Picos是一種基於硬件數據流的新型任務調度程序,可以動態分析任務間依賴關係,並在運行時識別任務級並行性。 2.在本文中,我們描述了Picos硬件設計和其組件的主要功能的延遲,基於它們的VHDL設計的綜合。 3.我們已經實施了一個基於這些延遲的完整週期精確模擬器,以在合理的時間內對其組件的特性和數量進行設計探索。 4.最後,我們比較Picos和Nanos ++透錯real benchmarks 運行時性能。 5.使用Picos,寫程式人員可以使用具有高度且理想的精細度去平行處理大量的細的(粒度)任務。

2. The Picos hardware Picos hardware is a major revision of the Hardware Task Superscalar architecture with several improvements in its work-flow. The main improvements are related with architectural changes to add support to nested tasks, better memory management and faster task dispatching. It consists of a many-core with any number of threads that send two types of task information to the Picos hardware: (1) the dependency information of new tasks (2) the notification of ending a task. 1.Picos硬體主要是改善了幾個優點在硬體超純量(superscalar)架構的工作流中。 2.最主要的改善是,能夠支援槽狀Task,更好的MEM管理,還有更快的task發送。 3.他是由多核心和任和數量的行程,送出兩種task資訊給Picos硬體。 -1.相依性資訊給新task -2.通知結束的task

2. The Picos hardware The Picos hardware consists of one gateway (GW), one or more Dependence Chain Trackers (DCT), one or more Task Reservation Stations (TRS) and one Task Scheduler (TS). All these components work together in parallel in order to build the dynamic task dependency graph and generate a list of ready tasks that are sent back to the threads to be executed. The connections between the modules are decoupled by FIFO queues that are interconnected by arbiter modules. Picos 1.下圖是Picos硬體包含一個Gateway(入口)、一或多個DCT(依賴相依追蹤器)、一或多個TRS(任務追蹤站)、TS(任務計畫程序) 2.這些平行處理為了建立一個動態Task相依的圖形,產生一條ready task送回給thread去做執行。 -他是FIFO 先進新出 first in first out queue 中間是由互聯是由比較器決定。

2.1. Operational flow overview Once a thread reaches a task creation it creates a new task descriptor. This information mainly includes the address of the task code to be executed and the address of all its dependencies. Once this descriptor is created, it is sent to the Picos hardware that reads its information and stores the data of the corresponding task until all its dependencies are fulfilled. For the first task created, all its dependencies are ready because all its dependencies are already in memory. However, the most common case is that a task has to wait until one or more of its dependencies become ready after other tasks finish. 1.一但行程來了的時候,他會產生新的Task descriptor.(描述者) -這個資訊保含task code的address去執行和相依的address. -當這個Task descriptor被產生時,他會送給Picos硬體去讀取資訊和儲存資料去儲存和對直到相依關係被充滿。 2.當第一個Task被產生時,所有相依的關係會被寫入mem中。 3.然後,最常見的case情況是,任務必須等待,直到其中一個或多個相依關係,在其他任務完成後才能執行。

2.1. Operational flow overview With this finishing message Picos will delete the corresponding descriptor in the system and proceed to mark as ready all the task dependencies. The Picos hardware will then try to send the new ready task to be executed. This entails sending the descriptor to the TS, which will make it available to all the threads in the system. When one thread that is not busy realizes that a new descriptor is available it starts executing the corresponding task. If a new task is created, new descriptors are created and the dependency information is sent to Picos. 1.當Picos有完成的資訊,Picos會去合對descriptor在系統中和繼續去標記那些標記那些task相依是ready。讓其他程序可以使用。 2.Picos硬體會嘗試去送出新的準備好的Task去做運算。 3.接著descriptor會告訴TS,這個threads可以被使用在這個系統中。 4.當一個thread不是busy狀態,新的descriptor會備運作去執行對應的task。 5.如果新的task產生,新的描述會被創造還有相依資訊會被送給Picos

2.2. Picos modules (GW) The GateWay (GW) is a simple selector that reads the messages (new task and finishing task messages) that arrive to the system and sends them to the associated module. If the message is a new task, the GW reads the Task ID and the dependencies and sends this information in a packet to the corresponding TRS. Each dependency is then read and sent to the associated DCT with a packet containing the following information: address and direction of the dependency, and the task related identifiers. Dependence Task New Task 1.GW是一個簡單的選擇器,他會去讀取message的資訊,包含新來的task和已經完成的task資訊,會甕給系統和對應的module中。 2.如果訊息是一個新的Task,GW會去讀task ID and 相依性會去送資訊給相應的TRS. 3.每一個相依情況去讀取和送相關的DCT,只要有 addr和方向的相依性,和task相關標示。

2.2. Picos modules (GW) Therefore, to process the new task message, the GW needs: an empty slot in the TRS modules, and the DCT in which each dependence information should be stored. As we will see, each TRS sends a packet with this information to the GW as soon as it has an empty slot. On the other hand, the DCT associated with each dependence is computed directly from the dependence address with a hash function that should be properly selected to all the DCTs. 1.所以去處理一個新的task,GW需要一個空的插槽在TRS和DCT需要相依資訊被儲存。 2.TRS會去送informatio告訴GW只要TRS是空的情況下。 3.同樣的,DCT會去偵測相依性,直接從相依的addr,使用一個hash function (散列含數),能夠適當的被選取給DCT中。

2.2. Picos modules (TRS) The Task Reservation Station (TRS) is the module that manages all the processes related to the in-flight tasks. It has a Task Memory (TM) that is an indexed memory in which each TRS stores all the information about the tasks and their dependencies. A TRS may receive packets from the GW, the DCTs, and the TRSs. 1.From the GW with information about a new task, the TRS stores it in the slot specified by the packet and then looks for a new free slot. The TRS will then wait for all the dependencies of the task to be ready before sending it to execute 2,3 From the DCT or TRS modules can have two objectives: (1) To update information about a dependence in order to create a chain of dependencies. (2) To notify that a dependence is now ready. 1.TRS是一個module會去管理所有在處理有關於傳輸中的task。 2.他有TM去指出memory在每一個TRS store 所有的資訊關於task和他們的相依性。 3.TRS可能會收到來自GW或是DCT或是TRS的packet資訊。 -1.從GW來的的new task資訊,TRS會去儲存在插槽中,他會被指定然後去尋找新的插槽。 -TRS會去等待所有的相依性已經好了再送出資料執行之前。 -2.3.來自DCT或是TRS的資訊包含兩種 --1.為了更新相依資訊為了去創造新的相依鍊 --2.告訴相依還沒好。

2.2. Picos modules (DCT) The Dependence Chain Tracker (DCT) manages all the dependencies in the system. The dependence information is stored in two different memories: the Dependence Memory (DM) and the Version Memory (VM). 1. DM is indexed by a hash function of the initial dependence address and stores the basic dependence information and a pointer to the last version of each dependence. 2. Version information is stored in the VM, which keeps one entry for every version of the dependence, and also the TRS address of the producer of the next version. 1.DCT管理所有相依關西在這個系統。 2.相依資訊會存在DM和VM中。 -1DM會透過hash function指引到駔吹的相依位址和儲存的時駔相依資訊和最後一個版本相依性的指標。 -2.版本 資訊會存在VM中,保留一個entry給每個版本的相依關係,還有下一個生產的TRS的addr版本。(最新的)

Hash Function (補充) 雜湊函式(或雜湊演算法,又稱雜湊函式,英語:Hash Function)是一種從任何一種資料中建立小的數字。 雜湊函式把訊息或資料壓縮成摘要,使得資料量變小,將資料的格式固定下來。

2.2. Picos modules (DCT) When the DCT receives a packet from the GW with a new dependence instance entering the system, it searches for the dependence in the DM. If the dependence is not found, the DCT creates an entry for it both in DM and VM. Dependence is ready is then sent together with its VM address to the corresponding TRS. Otherwise, when a dependence instance is found in the DM, two main possibilities exist: The dependence has an input direction (i.e. it is a consumer). LW It has an output or inout direction (i.e. it is a producer). SW In the first case the dependence is added to the dependency chain in its last version and a packet to the TRS is sent with the dependency information. In the second case, a new version should be created and all memories are correspondingly. 1.當DCT收到新的packe來自GW有著新的相依關係例子會進入系統中,在DM中找尋相依關係。 -.如果沒找到,DCT會產生一個entry給DM和VM使用。 -相依關係會備標示成ready狀態,他會一起寄送和VM addr去合對TRS。 2. 當相依關係被找到在DM中 ,有兩種可能 -1.相依關西事input 方向,像是消耗者 LW -2相依關係事outpur or inout 像是生產者SW 3.第一個立子相依關係,需要加入相依鍊最後版本中,還有得到關於TRS送的相依資訊。(可能有些版本不是最新的要forwarding) 4.第二個例子,所有版本的需要被建立,還有所有的meM要被對應的更新

2.2. Picos modules (TS) The Task Scheduler (TS) is a simple dispatcher of tasks. As soon as it receives a ready task, it searches for an available thread and forwards the descriptor associated to the task to that thread. descriptor 1.TS是一個簡單的發送者在task中。 2.當他收到Ready task,他會搜尋可用的thread然後forward給描述器去關聯thread中的task

2.3. Picos hardware latencies 目的:To obtain the Picos hardware latencies we have undertaken a full VHDL implementation of all the modules in the Picos hardware and synthesized them targeting a Virtex-7 device. Table 1 shows the set of latencies corresponding to synthesis of the main functionality of the Picos hardware modules. 為了獲得Picos硬體延遲他們經歷的VHDL實現所有module 對於Picos 硬體和合成在 Virtex-7 這個裝置上。 Table 1 秀出了延遲集合對應合成的Picos 硬體module。

2.3. Picos hardware latencies In addition, the queues does not allow them to be read from and written to at the same time. With this implementation, the first cycle in the process of writing from/reading to/from a queue is used to reserve the slot, while the second cycle performs the action. With this information, we have built a cycle-accurate simulator that exactly replicates the hardware design. The software simulator allows us to fully explore the optimum design point of the modules in terms of number of modules and memory sizes that cannot at present be synthesizable in a real FPGA. It also allows us to implement accurate simulation of complex behavior patterns. Like nested tasks or by a large number of modules, as well as detecting their bottlenecks and thereby correcting them. 1.此外,queue 不允許同時寫和讀。 2.因為這個限制,第一個cycle在處理queue時,是用來預留插槽位,第二個cycle才是執行動作。 3.有這些資訊,我們可以建立一個cycle正確的模擬器,精確的複製這個硬體設計。 -軟體模擬器允許我們充分的探索最佳化設計(包含modul的數量和mem size),這些不能能夠在FPGA上合成出來。 -他也允許五們實現叫複雜行為的pattern模擬器。 -像是槽狀的Task或是大量數量的module,同樣的能夠偵測他們的瓶頸從而去解決他們。

2.4. Design issues The modular design of Picos introduces challenges that an integrated design may not pose. The main problem sources in this kind of design are the unexpected effects that arise from the interaction of simple behaviors in the modules. The key point that makes the modules work together is the fact that the only module that can stall is the TS. If the TS stalls, it means that all the threads are busy and the system simply waits for them to finish. 1.我們介紹的這個Picos module 設計有個問題是他可能不會被提出。 2.因為還有幾個無法預期的原因,從交互反應的簡單行為中產生。 3.要讓每一個module能夠一起工作的原因是暫停TS。 -但是Ts stall的話,代表所有的程序都會busy,而且代表系統會等待他TS完成 TS stalls → all waits!

2.4. Design issues The GW cannot stall as the system has to wait for tasks to finish. This means that the finished task messages use a different queue than the new task messages. if a task cannot be issued the GW stores it internally and continues processing finished tasks. The same behavior occurs in the DCTs. If any of the DCT memories are full, the DCT stops processing new dependencies but continues processing packets (releasing dependencies) from the TRSs. GW 也不能stall,如果stall 所有task會等他。 Finished task和new task 使用不同的queue,就能夠解決。 -如果一個task不能夠issue從GW,GW會儲存起來,然後繼續處理finished task, 先把finish傳出去。 同樣的情況也會發盛在DCT -如果任何一個DCT memory full,DCT會停止處理新來的相依性,他會去處理中的packet,交給TRS。

2.4. Design issues The TRSs have the most complicated behavior. The TRSs may deadlock for two reasons: 1.They try to send a ready task to execute but all the threads are busy. This can be solved simply by keeping a bit information for every ready task to indicate that it has not been sent for execution yet. 2. Finished task problem. TRS sending dependencies messages to the associated DCTs. The deadlock occurs when the DCT follows those messages trying to awake other tasks in other (or the same) TRSs. If two or more TRSs are doing the same thing at the same time a deadlock may occur. 解決方法:the queues from the TRSs to the DCTs are different than the others in the system. The TRS does not process finished tasks messages but continues to do other work. TRS有兩個死結 1.他們嘗試送出已經好的Ttask,但所有程序都busy -這可以被簡單的解決靠者一個bit的資訊去指出那些要被送出去的資料是否被運算過。 2. -完成的task代表TRS -死結發生在DCT允許這些訊息嘗試去告訴TRS可以傳資料給他(DCT好了)。 -但如果TRSs同時要跟DCT說 相依性的資料 3.為了避免這樣的情況發生,queue 從TRS給DCT的資料要不同 4.另外如果死結真的發生,TRS不會傳送完成的Task 訊息(給DCT)但會繼續工作。

2.4 Deadlock (補充) 死結(Deadlock)必須要滿足以下四個條件 Mutual exclusion:一個資源一次只能被一個process所使用 Hold and Wait: process取得一個資源之後等待其他的資源 No preemption:資源只能由process自己釋放,不能由其他方式釋放 Circular wait:每個process都握有另一個process請求的資源,導致每一個process都在等待另一個process釋放資源 DCT TRS

3. Experimental setup Picos hardware bit

3. Experimental setup Picos hardware Conf = confirm 設定檔

4. Picos for high performance computing 趨緩

4.Picos for high performance computing 幾乎不影響

4. Picos for high performance computing TRS module Speed up 越好 Mem size 趨緩 Mem size

4. Picos for high performance computing

5. Results 越平行化 可看出core越高,越能夠平行化。

5. Results Picos>Nanos

5. Results 左邊是平均的task size 右邊是task的數量,當你兩個值取交集後,可以算出所需要用到的blocksize。 Mem size