Streamlining Inter-operation Memory Communication via Data Dependence Prediction
Hidden slide, 2016/8/22, 2016/8/29
胡連鈞 (Hu, Lien Chun)
Department of Electrical Engineering (電機系), National Cheng Kung University (國立成功大學), Tainan, Taiwan, R.O.C.
Ext. 62365, Office: 奇美樓, 6F, 95601
Website address: or

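Content
Abstract
Introduction
Memory as an Inter-operation Communication Agent
Memory Traffic Analysis
Method 1: Speculative Memory Cloaking
Method 2: Speculative Memory Bypassing
Method 3: Transient Value Cache
Experimental Evaluation
Summary and Conclusions
Notes:
1. Sec. 2 discusses how inter-operation memory communication works and the problems it raises.
2. Sec. 3 gives a quantitative analysis of memory traffic.
3. Sec. 4 covers cloaking.
4. Sec. 5 covers bypassing.
5. Sec. 6 covers building a TVC.
6. Sec. 7 covers the quantitative evaluation of these techniques.
7. Sec. 9 gives the final summary.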

Abstract
Method 1 (Cloaking): We use data dependence prediction to identify and link dependent loads and stores, so that they can communicate without incurring the overhead of address calculation, disambiguation and data cache access.
Method 2 (Bypassing): We also use data dependence prediction to convert DEF-store-load-USE chains within the instruction window into DEF-USE chains prior to address calculation and disambiguation.
Notes:
1. Cloaking: data dependence prediction identifies the load and store instructions that are likely to be dependent. This avoids waiting on address calculation and reduces memory communication latency. (P.S. A load that follows a store to the same location is data dependent on it, and this dependence is what delays memory communication.)
2. Bypassing: data dependence prediction is also used to shorten a DEF-store-load-USE chain into a DEF-USE chain, which again reduces memory communication latency.

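Abstract
Method 3 (TVC): We use true and output data dependence status prediction to introduce and manage a small storage structure called the Transient Value Cache (TVC). The TVC captures memory values that are short-lived.
Notes: The third method builds a TVC. Dependence status prediction is used to create and manage this small structure. The data it holds is short-lived, and it can quickly hold values that are about to be used. The TVC is a separate structure from the other levels of the memory hierarchy, such as the data cache.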

Abstract
Method 1 (Cloaking) and Method 2 (Bypassing) are aimed at reducing the effective communication latency.
Method 3 (TVC) is aimed at reducing data cache bandwidth requirements and increasing the effective memory bandwidth.
All three methods serve to streamline inter-operation memory communication.

Abstract
Experimental analysis of the proposed techniques shows that:
(i) the proposed speculative communication methods correctly handle a large fraction of memory dependences;
(ii) a large number of the loads and stores never have to reach the data cache when the TVC is in place.

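1. Introduction
With an implicit specification, communication can take place only after address calculation and disambiguation.
With an explicit specification, communication can take place as soon as the two instructions are encountered and the value is available (faster).
Notes: Under an implicit specification, memory communication waits until all address calculation has finished. Under an explicit specification, the link exists as soon as the two instructions appear, so the value can be passed on as soon as it is produced, which is faster.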

1. Introduction
Goal: We are primarily concerned with methods of converting the traditional implicit specification of memory communication into an explicit form.
Method: To do so, we use data dependence prediction to explicitly link loads and stores that are likely to be dependent. These loads and stores can then communicate via a dynamically created name space without incurring the overhead of address calculation, disambiguation and data cache access.
Notes:
1. Data dependence prediction is what makes the explicit link possible.
2. Communicating through a dynamically created name space avoids the extra calculation.

2. Memory as an Inter-operation Communication Agent
Memory communication can be viewed as a two-step process:
1. Dependences are established.
2. Actual values are communicated.
To streamline memory communication we need to:
(i) establish the dependences as quickly as possible;
(ii) provide storage structures that best meet the communication requirements (low latency, high bandwidth).

2. Memory as an Inter-operation Communication Agent
In this paper we do not consider a static approach, since it would require static knowledge of the dependences and would also involve changing the program representation completely. Instead, we investigate dynamic approaches. Dependences are predicted at run time, and these speculative dependences are used to create a dynamic name space through which the dependent loads and stores can communicate without incurring the overhead of address calculation. The dynamically collected information can also be used to develop and manage novel memory hierarchies.

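3. Memory Traffic Analysis
Figure: percentage of load/store dependences (top) and store/store dependences (bottom) as a function of store distance, for programs from SPECint95.
Notes: Several SPECint95 programs are used as test programs. The horizontal axis is the distance measured in stores; the upper plot shows the percentage of load/store dependences and the lower plot the percentage of store/store dependences. Within a distance of 256 stores, about 50% of the following loads have a data dependence and about 60% of the following stores have a data dependence. If these dependences can be predicted in advance, latency can be reduced and performance improved.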
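
The kind of measurement summarized above can be sketched as follows. This is only an illustration under assumptions: the trace format (a list of `("load" | "store", address)` tuples), the function name `dependence_fractions`, and the way the window is counted are all hypothetical and are not taken from the slides or from the tools used on SPECint95.

```python
from collections import deque

def dependence_fractions(trace, window):
    """Return (load_fraction, store_fraction) for a trace of (op, addr) tuples.

    load_fraction:  share of loads whose address was written by one of the
                    last `window` stores (a true dependence).
    store_fraction: share of stores whose address was written by one of the
                    last `window` stores (an output dependence).
    """
    recent = deque(maxlen=window)  # addresses of the last `window` stores
    loads = stores = dep_loads = dep_stores = 0
    for op, addr in trace:
        if op == "load":
            loads += 1
            dep_loads += addr in recent
        elif op == "store":
            stores += 1
            dep_stores += addr in recent
            recent.append(addr)
    load_fraction = dep_loads / loads if loads else 0.0
    store_fraction = dep_stores / stores if stores else 0.0
    return load_fraction, store_fraction

# Hypothetical five-operation trace, for illustration only.
trace = [("store", 0x10), ("load", 0x10), ("store", 0x20),
         ("store", 0x10), ("load", 0x30)]
print(dependence_fractions(trace, window=256))
```

Sweeping `window` over a range of store distances would give one point per distance, which is the shape of the plot described in the notes.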

4. Speculative Memory Cloaking
The purpose of cloaking is to streamline memory communication by dynamically converting the implicit specification of dependences into an explicit form. In part (a), detecting a load-store dependence results in an association among the load, the store and a synonym.
Notes: This is the first of the three methods.

When a subsequent instance of the store is encountered and a dependence is predicted (action 1), this association results in the generation of a new version of the synonym (action 2), which the next dependent load will use. Storage for the synonym is allocated in the Synonym File (SF), a small, low-latency, high-bandwidth storage structure. Upon value reception the synonym file entry is updated and marked as full (action 3). Finally, when the store computes its address it accesses memory (action 4).

When the appropriate instance of the load is brought into the instruction window, the association is used again to derive the synonym (action 5) and to locate the appropriate element in the synonym file (action 6). Instructions that use the load value may at this point execute speculatively using this value (action 7). When the load data address becomes available, the memory system is accessed to read the actual value (action 8).

Verification: The value read from memory is compared with the value obtained earlier via the cloaking mechanism. If the two values are the same, cloaking was successful and no further action is required. Otherwise, data value mis-speculation has occurred, and any instructions that used the wrong data have to be re-executed.
Conclusion: Speculative memory cloaking has the following requirements:
(1) predicting dependences;
(2) creating synonyms, associating them with the dependent instructions and assigning storage for the communication;
(3) verifying the speculatively communicated values.

Cloaking relies on three structures:
(a) Dependence detection table (DDT): (1) data address (ADDR), (2) store PC (STPC), (3) a valid bit.
(b) Dependence prediction and naming table (DPNT): (1) instruction address (PC), (2) dependence status predictor (PRED), (3) dependence tag (DTAG), (4) a valid bit.
(c) Synonym file (SF): (1) name, (2) value, (3) full/empty bit, (4) valid bit.
Notes: The DDT detects dependences, the DPNT predicts dependences and names them, and the SF holds the value left by the store and supplies it to the load.
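
As a rough illustration of the fields listed above, the three entry types could be written down as plain records. This is a sketch only: the Python types, the field names, and the integer representation of PRED are assumptions made here, since the slides give the field list but not widths or encodings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DDTEntry:
    """Dependence detection table entry."""
    addr: int            # ADDR: data address written by the store
    store_pc: int        # STPC: PC of the store that wrote it
    valid: bool = True   # valid bit

@dataclass
class DPNTEntry:
    """Dependence prediction and naming table entry."""
    pc: int              # instruction address (load or store)
    pred: int            # PRED: predictor state (integer encoding assumed)
    dtag: int            # DTAG: dependence tag used to form the synonym
    valid: bool = True   # valid bit

@dataclass
class SFEntry:
    """Synonym file entry."""
    name: int                    # synonym name
    value: Optional[int] = None  # value left by the store, once available
    full: bool = False           # full/empty bit; empty until the value arrives
    valid: bool = True           # valid bit

entry = SFEntry(name=3)
print(entry)
```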

In parts (b) and (c) we show the actions that lead to the detection of the dependence. In part (b), the first instance of the store executes and records in the DDT its PC and the data address it updated (action 1). Later on, in part (c), the first instance of the load probes the DDT using its data address (action 2); because the address matches the one recorded by the store, it determines that a dependence exists. In reaction to this detection, two entries are allocated in the DPNT, recording the association (action 3).
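
The detection steps (actions 1 to 3) can be sketched as below. This is a toy model under assumptions: unbounded dictionaries stand in for the finite tables, the class and method names are hypothetical, and the tag allocation policy (a simple counter, reused when the store already has a tag) is an illustrative choice not described in the slides.

```python
class DependenceDetector:
    """Toy DDT plus DPNT, using dictionaries instead of finite tables."""

    def __init__(self):
        self.ddt = {}       # data address -> PC of the last store to it
        self.dpnt = {}      # instruction PC -> dependence tag
        self.next_tag = 0   # hypothetical tag allocator

    def store_executes(self, pc, addr):
        # Action 1: record the store's PC and the address it updated.
        self.ddt[addr] = pc

    def load_executes(self, pc, addr):
        # Action 2: probe the DDT with the load's data address.
        store_pc = self.ddt.get(addr)
        if store_pc is None:
            return None  # no dependence detected
        # Action 3: allocate DPNT entries for the store and the load.
        # Both share one tag; the store's existing tag is reused if present.
        tag = self.dpnt.get(store_pc)
        if tag is None:
            tag = self.next_tag
            self.next_tag += 1
        self.dpnt[store_pc] = tag
        self.dpnt[pc] = tag
        return store_pc

detector = DependenceDetector()
detector.store_executes(pc=0x100, addr=0x40)
print(detector.load_executes(pc=0x200, addr=0x48))  # different address
print(detector.load_executes(pc=0x200, addr=0x40))  # same address as the store
```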

A later instance of the store enters the instruction window. The PC of the store is used to probe the DPNT for a matching entry (action 4). Assuming that the predictor indicates a dependence, a synonym is generated based on the tag recorded in the DPNT entry, and it is used to allocate space in the SF (action 5). The full/empty bit of the SF entry is set to empty to indicate that the value is not yet available.

Meanwhile, the store also records the location of the SF entry, since the actual data value, when it becomes available, will have to be written in the SF entry (action 6). When the value is written, the full/empty bit is set to full. Eventually, the store also accesses the traditional memory hierarchy (action 7).

When the next instance of the load enters the window, its PC is used to probe the DPNT (action 8). After a dependence status prediction is made, the tag recorded in the DPNT entry leads to the generation of the same synonym generated previously for the store. This synonym is used to access the appropriate SF entry (action 9) and to obtain the data left there by the store. At this point the load may use this data to execute (action 10). When the data address becomes available, the load accesses the traditional memory hierarchy to obtain the actual data value (action 11).

Verification: This value is compared against the value read previously from the SF, and appropriate action is taken if the two values differ. The traditional memory hierarchy thus serves as the check that guards against errors.
Update: At this point we may also update the predictors in the DPNT entries for both the load and the store.
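
The store side (actions 4 to 7), the load side (actions 8 to 11) and the final verification can be put together in a small sketch. It is a simplification under stated assumptions: the predictor is omitted (a DPNT hit always predicts a dependence), the dependence tag is used directly as the synonym, memory is a dictionary, and all class and method names are hypothetical.

```python
class CloakingModel:
    """Toy model of cloaking: DPNT lookup, synonym file, verification."""

    def __init__(self):
        self.dpnt = {}    # PC -> dependence tag (predictor omitted: assumption)
        self.sf = {}      # synonym -> {"value": ..., "full": ...}
        self.memory = {}  # stands in for the traditional memory hierarchy

    def link(self, store_pc, load_pc, tag):
        # Result of an earlier dependence detection (two DPNT entries).
        self.dpnt[store_pc] = tag
        self.dpnt[load_pc] = tag

    def store(self, pc, addr, value):
        synonym = self.dpnt.get(pc)                     # action 4
        if synonym is not None:
            entry = {"value": None, "full": False}      # action 5: empty entry
            self.sf[synonym] = entry
            entry["value"], entry["full"] = value, True  # action 6: value arrives
        self.memory[addr] = value                       # action 7

    def load(self, pc, addr):
        synonym = self.dpnt.get(pc)                     # action 8
        entry = self.sf.get(synonym) if synonym is not None else None
        speculative = None
        if entry is not None and entry["full"]:
            speculative = entry["value"]                # actions 9 and 10
        actual = self.memory.get(addr)                  # action 11
        # Verification: dependents must re-execute if the values differ.
        misspeculated = speculative is not None and speculative != actual
        return actual, speculative, misspeculated

model = CloakingModel()
model.link(store_pc=0x100, load_pc=0x200, tag=7)
model.store(pc=0x100, addr=0x40, value=5)
print(model.load(pc=0x200, addr=0x40))   # cloaked value matches memory
model.store(pc=0x100, addr=0x48, value=9)
print(model.load(pc=0x200, addr=0x40))   # store went elsewhere: mismatch
```

In the second load the store wrote a different address, so the value taken from the SF differs from the one in memory and the model reports a mis-speculation, which is the case where dependent instructions have to be re-executed.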

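Block diagram: when an instruction enters, the DPNT is probed for a matching entry. If there is one, a synonym is generated from the tag recorded in the DPNT entry and used to link to the SF. Instruction decode and renaming follow, while the SF supplies the predicted value, so that the value of a dependence is available early. The execute (EX) stage then verifies the value, and the data is updated at commit.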

5. Speculative Memory Bypassing
Consider the I1-store-load-I4 chain shown in part (a).
Drawback: the value has to travel through these two instructions (the store and the load) before it can reach I4 (slower).
When the dependent load and store co-exist in the instruction window, further reduction in the communication latency is possible with speculative memory bypassing.
Notes: This section covers the second method, bypassing. Its precondition is that the dependent load and store are in the same instruction window.

As shown in part (b), with speculative memory bypassing the value can be sent directly from I1 to I4 (faster). As was the case with speculative memory cloaking, this communication is speculative and has to be verified. Speculative memory bypassing can be implemented as a simple extension to speculative memory cloaking.

At step (1), instruction I1 is decoded and register renaming creates a new name, TAG1, for the target register R1. At step (2), the store instruction is decoded and determines the current name of its source register R1. In parallel, via the use of cloaking, a synonym is created for the memory communication; we also record in the synonym the current name, TAG1, of the store's source register R1.

At step (3), the load instruction is decoded and register renaming creates a new name, TAG2, for the destination register R2. At step (4), when I4 is decoded, it can determine that its source register R2 has two names: one actual, TAG2, and one speculative, TAG1. By using the speculative name TAG1, I4 can link directly to I1 and execute speculatively as soon as I1 produces its value. Later on, after the load has accessed the traditional memory hierarchy, the integrity of the communication can be verified.
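
Steps (1) to (4) can be traced with a small renaming sketch. It is an illustration under assumptions: the class `BypassRenamer`, its method names, the string tags and the synonym key `"S0"` are all hypothetical, and verification against memory is left out.

```python
class BypassRenamer:
    """Toy register renamer that also tracks speculative names via synonyms."""

    def __init__(self):
        self.actual = {}        # architectural register -> actual tag
        self.speculative = {}   # architectural register -> speculative tag
        self.synonym_tag = {}   # synonym -> tag of the store's source register
        self.count = 0

    def define(self, reg):
        # Renaming creates a fresh tag for a destination register.
        self.count += 1
        tag = f"TAG{self.count}"
        self.actual[reg] = tag
        self.speculative.pop(reg, None)  # any old speculative name is stale
        return tag

    def store(self, src_reg, synonym):
        # Step (2): record the current name of the store's source register.
        self.synonym_tag[synonym] = self.actual[src_reg]

    def load(self, dst_reg, synonym):
        # Step (3): new actual tag, plus the speculative name from the synonym.
        tag = self.define(dst_reg)
        if synonym in self.synonym_tag:
            self.speculative[dst_reg] = self.synonym_tag[synonym]
        return tag

    def source_names(self, reg):
        # Step (4): a consumer sees (actual, speculative) names for its source.
        return self.actual[reg], self.speculative.get(reg)

r = BypassRenamer()
r.define("R1")                 # step (1): I1 writes R1
r.store("R1", synonym="S0")    # step (2): store reads R1
r.load("R2", synonym="S0")     # step (3): load writes R2
print(r.source_names("R2"))    # step (4): names seen by I4
```

With the speculative name in hand, I4 can wait on I1's result directly instead of waiting for the store and the load to complete.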

6. Transient Value Cache
Motivation: most of the values stored to memory are quickly killed (Sec. 3). Motivated by these observations, we extend the memory hierarchy by introducing a small storage structure, the Transient Value Cache (TVC).
The TVC holds stored values until they are communicated or killed. It is used for:
1. stores whose values are likely to be killed soon;
2. loads that are likely to access recently stored values.

6. Transient Value Cache
Stores that are likely to be killed soon are initially sent only to the TVC, in the hope that they will be killed in it before they are forced to go to the data cache (part (a)). Other stores are sent to both caches to keep them coherent (part (b)).

6. Transient Value Cache
Loads that are likely to have true dependences with recent stores are initially sent only to the TVC. Such a load is directed to the data cache only if it misses in the TVC (part (c)). Other loads have to access both the TVC and the data cache in parallel (part (d)).
Function: reducing data cache bandwidth requirements and increasing the effective memory bandwidth.
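
The routing rules in parts (a) to (d) can be summarized in a short sketch. The function names and the boolean inputs (which stand in for the dependence status predictions and the TVC lookup result) are assumptions made for illustration; how the predictions are produced is not modelled.

```python
def route_store(likely_killed_soon):
    """Return the structures a store is sent to."""
    if likely_killed_soon:
        return ["TVC"]               # part (a): TVC only
    return ["TVC", "data cache"]     # part (b): both, to keep them coherent

def route_load(likely_dependent, tvc_hit):
    """Return the structures a load accesses."""
    if likely_dependent:
        # part (c): TVC first, data cache only on a TVC miss
        return ["TVC"] if tvc_hit else ["TVC", "data cache"]
    return ["TVC", "data cache"]     # part (d): both, in parallel

def data_cache_accesses(routes):
    """Count how many routed operations touch the data cache."""
    return sum("data cache" in route for route in routes)

routes = [
    route_store(likely_killed_soon=True),
    route_store(likely_killed_soon=False),
    route_load(likely_dependent=True, tvc_hit=True),
    route_load(likely_dependent=True, tvc_hit=False),
    route_load(likely_dependent=False, tvc_hit=False),
]
print(data_cache_accesses(routes), "of", len(routes), "operations reach the data cache")
```

Every operation routed to the TVC alone is one fewer data cache access, which is how the structure reduces data cache bandwidth requirements.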

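7. Experimental Evaluation
Figure: dependence prediction accuracy. Legend: (1) prediction bars under finite hardware resources; (2) mostly correct; (3) incorrect; (4) infinite resources.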

7. Experimental Evaluation
Figure: percentage of true dependences communicated correctly via cloaking. The dark bar is for an infinite DPNT; the gray bars are for 512, 1K, 2K and 4K entries.
The majority of all dynamic dependences are communicated correctly, so cloaking can reduce the effective communication latency.

8. Summary and Conclusions
(1) We show that the data dependence status of most memory operations can be predicted with high accuracy on a per-instruction basis, based solely on the history of previous data dependences.
(2) We show that the traditional implicit specification of memory communication can be dynamically converted into a faster explicit specification.
(3) We propose speculative memory cloaking, and its extension speculative memory bypassing, to take the address calculation and the load and store instructions themselves off the communication path.
(4) We propose the Transient Value Cache, a storage structure managed by dependence status prediction that can reduce the contention for data cache resources. It is a separate cache that temporarily holds dependent data, such as stores that are about to be killed.
