Chapter 7 Large and Fast: Exploiting Memory Hierarchy


1 Chapter 7 Large and Fast: Exploiting Memory Hierarchy
Instructor: Chuan-Yu Chang, Ph.D. (張傳育 博士) Tel: (05) ext. 4337

2 Memories: Review
SRAM: value is stored on a pair of inverting gates; very fast, but takes up more space than DRAM (4 to 6 transistors per bit)
DRAM: value is stored as a charge on a capacitor (must be refreshed); much smaller per bit, but slower than SRAM (by a factor of 5 to 10)

3 Exploiting Memory Hierarchy
Users want large and fast memories! Build a memory hierarchy.
Memory technology | Typical access time | $ per GB (2004)
SRAM | 0.5-5 ns | $4,000-$10,000
DRAM | 50-70 ns | $100-$200
Magnetic disk | 5,000,000-20,000,000 ns | $0.50-$2
Toward the top of the hierarchy: more expensive, faster. Toward the bottom: larger capacity, slower.

4 Locality
Principle of locality:
temporal locality: an item referenced now will tend to be referenced again soon
spatial locality: items near a referenced item will tend to be referenced soon
Our initial focus: two levels (upper, lower)
block: the minimum unit of data transferred between levels
hit: the data requested is in the upper level
miss: the data requested is not in the upper level
hit rate: the fraction of memory accesses found in the upper level
miss rate: 1 - hit rate
hit time: the time to access the upper level of the memory hierarchy
miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor

5 Cache
A small amount of fast memory that sits between normal main memory and the CPU; may be located on the CPU chip or module.
Write policy: 1. Write back 2. Write through
Replacement algorithm
Mapping function: 1. Direct mapping 2. Associative mapping 3. Set-associative mapping

6 Cache operation - overview
CPU requests the contents of a memory location
Check the cache for this data
If present, get it from the cache (fast)
If not present, read the required block from main memory into the cache
Then deliver it from the cache to the CPU
The cache includes tags to identify which block of main memory is in each cache slot

7 Direct Mapped Cache
Direct mapping: each main-memory location maps to exactly one cache location, given by (block address) mod (number of blocks in the cache).
Fig. 7.5 shows an eight-word direct-mapped cache, illustrating how memory word addresses between 0 and 31 map to the same cache locations.
The lowest 3 bits of the main-memory address select the cache block.
The upper 2 bits serve as the tag. The tag holds the address information needed to tell whether the word in the cache corresponds to the requested word.
A valid bit indicates whether a cache block holds valid data.

8 7.2 The Basics of Caches: Example (Accessing a Cache)
Fig. 7.6 shows how an eight-word direct-mapped cache responds to a sequence of requests from the processor.
Temporal locality: recently accessed words replace less recently referenced words.
Decimal address | Binary address | Hit or miss | Assigned cache block (where found or placed)
22 | 10110 | miss (7.6b) | 10110 mod 8 = 110
26 | 11010 | miss (7.6c) | 11010 mod 8 = 010
22 | 10110 | hit | 10110 mod 8 = 110
26 | 11010 | hit | 11010 mod 8 = 010
16 | 10000 | miss (7.6d) | 10000 mod 8 = 000
3 | 00011 | miss (7.6e) | 00011 mod 8 = 011
16 | 10000 | hit | 10000 mod 8 = 000
18 | 10010 | miss (7.6f) | 10010 mod 8 = 010
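The mapping in this example is easy to check with a short simulation. Below is a minimal sketch (not from the slides; names are illustrative) of an eight-block, one-word-per-block direct-mapped cache replaying the trace above.

```python
# A minimal sketch of the eight-block direct-mapped cache from Fig. 7.6.
# Block size is one word; addresses are word addresses, as in the example.

def simulate_direct_mapped(addresses, num_blocks=8):
    """Return a list of (address, block, 'hit'/'miss') for a word-address trace."""
    blocks = [None] * num_blocks           # each entry holds the tag currently cached
    results = []
    for addr in addresses:
        index = addr % num_blocks          # (block address) mod (number of blocks)
        tag = addr // num_blocks           # upper address bits identify the block
        if blocks[index] == tag:
            results.append((addr, index, "hit"))
        else:
            blocks[index] = tag            # fetch the block from memory on a miss
            results.append((addr, index, "miss"))
    return results

# The reference trace from the example: 22, 26, 22, 26, 16, 3, 16, 18
for addr, index, outcome in simulate_direct_mapped([22, 26, 22, 26, 16, 3, 16, 18]):
    print(f"address {addr:2d} ({addr:05b}) -> block {index:03b}: {outcome}")
```

Running this reproduces the hit/miss column of the table above.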

9 7.2 The Basics of Caches: Example (cont.) - Fig. 7.6

10 Mapping Function Example
Cache of 64 KB, cache blocks of 4 bytes
i.e. the cache has 16K (2^14) lines (blocks) of 4 bytes each
16 MB main memory, 24-bit address (2^24 = 16M)

11 Direct Mapping
Each block of main memory maps to only one cache line: if a block is in the cache, it must be in one specific place.
A referenced address is divided into:
a cache index, used to select the block
a tag field, compared with the tag field stored in the cache
The address is in two parts:
the least significant w bits identify a unique word within a block
the most significant s bits specify one memory block
The MSBs are split into a cache index (line) field of r bits and a tag of s - r bits (most significant)

12 Direct Mapping Address Structure
24-bit address, laid out as:
Tag (s - r): 8 bits | Index (line or slot, r): 14 bits | Word (byte offset, w): 2 bits
2-bit word identifier (4-byte block)
22-bit block identifier: 8-bit tag (= 22 - 14) plus 14-bit slot or line
No two blocks that map to the same line have the same tag field.
Check the contents of the cache by finding the line and comparing the tag.
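A short sketch (illustrative, not from the slides) of splitting this 24-bit address into its tag / line / word fields of 8 + 14 + 2 bits:

```python
# Decompose a 24-bit byte address into (tag, line, word) fields: 8 + 14 + 2 bits.
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split_address(addr: int):
    """Return the (tag, line, word) fields of a 24-bit byte address."""
    word = addr & ((1 << WORD_BITS) - 1)                 # low 2 bits: byte in block
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)  # next 14 bits: cache line
    tag = addr >> (WORD_BITS + LINE_BITS)                # top 8 bits: tag
    return tag, line, word

tag, line, word = split_address(0x16339C)                # hypothetical address
print(f"tag={tag:#04x} line={line:#06x} word={word}")
```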

13 Direct Mapping Cache Line Table
Cache line | Main memory blocks held
0 | 0, m, 2m, 3m, ..., 2^s - m
1 | 1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1 | m-1, 2m-1, 3m-1, ..., 2^s - 1

14 Direct Mapping Example

15 Direct Mapping
Characteristics of direct mapping:
Simple and inexpensive
Fixed location for a given block
If a program repeatedly accesses two blocks that map to the same line, the miss rate is very high

16 Direct Mapped Cache
For a 32-bit memory address, a direct-mapped cache with 2^n blocks (entries) and 2^m words per block needs a tag field of 32 - (n + m + 2) bits.
Fig. 7.7 shows how a referenced address is divided. For MIPS: tag = 32 - 10 - 2 = 20 bits.
There are 1024 blocks, so a 10-bit index is needed to identify the block.
Each block is 4 bytes, so a 2-bit byte offset is needed to identify the requested data.

17 Example: Bits in a Cache (1)
How many total bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
Solution:
1 word = 32 bits = 4 bytes, so the byte offset w = 2 bits.
The cache holds 64 KB / 4 = 16K words (blocks) = 2^14 words, so the index is 14 bits.
Tag = 32 - 14 - 2 = 16 bits.
Besides the data (32 bits), each entry also stores the tag (16 bits) and a valid bit (1 bit), so the total cache size is
16K * (32 + 16 + 1) = 16K * 49 = 784 Kbits

18 Example: Bits in a Cache (2)
How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
Solution:
4 words = 4 * 32 bits = 16 bytes, so the offset w = 4 bits (2-bit byte offset plus 2-bit block offset).
The cache holds 16 KB / 16 = 1K blocks = 2^10 blocks, so the index is 10 bits.
Tag = 32 - 10 - 4 = 18 bits.
Besides the data (4 * 32 = 128 bits), each entry also stores the tag (18 bits) and a valid bit (1 bit), so the total cache size is
1K * (128 + 18 + 1) = 1K * 147 = 147 Kbits
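Both sizing examples follow the same recipe, so here is a minimal sketch (illustrative names, assuming power-of-two sizes) that reproduces them:

```python
# Total bits for a direct-mapped cache given data size, block size, address width.

def cache_total_bits(data_bytes, block_words, addr_bits=32, word_bytes=4):
    """Return (index_bits, tag_bits, total_bits) for a direct-mapped cache."""
    block_bytes = block_words * word_bytes
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1              # log2(number of blocks)
    offset_bits = block_bytes.bit_length() - 1            # byte + block offset bits
    tag_bits = addr_bits - index_bits - offset_bits
    data_bits = block_bytes * 8
    total_bits = num_blocks * (data_bits + tag_bits + 1)  # +1 for the valid bit
    return index_bits, tag_bits, total_bits

print(cache_total_bits(64 * 1024, 1))   # -> (14, 16, 802816) = 784 Kbits
print(cache_total_bits(16 * 1024, 4))   # -> (10, 18, 150528) = 147 Kbits
```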

19 Example: Mapping an Address to a Multiword Cache Block
Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
Solution:
First compute the memory block number for address 1200: block number = byte address / bytes per block = 1200 / 16 = 75.
Since there are 64 blocks, the cache block is 75 mod 64 = 11.
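A quick check of the arithmetic above (an illustrative sketch):

```python
byte_address = 1200
bytes_per_block = 16
num_blocks = 64

memory_block = byte_address // bytes_per_block   # 1200 // 16 = 75
cache_block = memory_block % num_blocks          # 75 mod 64 = 11
print(memory_block, cache_block)                 # -> 75 11
```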

20 7.2 The Basics of Caches: Handling Cache Misses
The control unit detects a miss and processes it by fetching the data from memory: stall the CPU (freezing the contents of all registers), fetch the data into the cache from memory, then restart execution at the cycle that caused the cache miss.
Steps taken on an instruction cache miss (a miss during IF):
1. Send the original PC value (current PC - 4) to memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write into the cache: put the data from memory into the appropriate cache entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on.
4. Restart the instruction fetch; this time the data is found in the cache.
Control of the cache on a data access is essentially identical: stall the processor until the memory responds with the data.
See the exercises and the cache example: the DECStation 3100, pages 552 to 555 (2nd Ed.).

21 An Example Cache: The DECStation 3100
Separate instruction and data caches, each 64 KB with one-word blocks.
1 word = 4 bytes => w = 2 bits
64K / 4 = 16K blocks = 2^14 blocks => index = 14 bits
Tag = 32 - 14 - 2 = 16 bits

22 Direct Mapped Cache
To exploit spatial locality, adopt the following strategy:
Use multiword cache blocks (a cache block is larger than one word in length).
When a miss occurs, several adjacent words are fetched together; there is a high probability that the other words will be needed soon.
Example 1 (pp. 557, 2nd Ed.): a 64 KB cache with 4-word (16-byte) blocks has 4K blocks (64 KB / 16 = 4K).
A 32-bit address is interpreted as:
block offset (2 bits): each block has 4 = 2^2 words, so 2 bits are needed
byte offset (2 bits): each word has 4 = 2^2 bytes, so 2 bits are needed
index (12 bits): there are 4K = 2^12 blocks, so 12 bits are needed
tag (16 bits): 32 - 2 - 2 - 12 = 16

23 7.2 The Basics of Caches - Fig. 7.9 shows how the required cache block is found for a given address.

24 Example: The Intrinsity FastMATH Processor
The Intrinsity FastMATH is a fast embedded microprocessor that uses the MIPS architecture. The processor has a 12-stage pipeline.
Each cache is 16 KB with 16-word blocks.
A 32-bit address is interpreted as:
byte offset (2 bits): each word has 4 = 2^2 bytes, so 2 bits are needed
block offset (4 bits): each block has 16 = 2^4 words, so 4 bits are needed
index (8 bits): there are 16K / (16 * 4) = 256 = 2^8 blocks, so 8 bits are needed
tag (18 bits): 32 - 2 - 4 - 8 = 18

25 Direct Mapped Cache
The 16 KB caches in the Intrinsity FastMATH each contain 256 blocks with 16 words per block, taking advantage of spatial locality.

26 7.2 The Basics of Caches
Instruction-fetch time = cache access time + miss rate * miss penalty
How increasing the block size interacts with this:
Larger blocks lower the miss rate (more spatial locality is captured).
But for a fixed cache size, larger blocks mean fewer entries in the cache, and they increase the miss penalty.
The larger miss penalty can be overcome by increasing the bandwidth between main memory and the cache, or by interleaving.
If the extra time paid on a miss (the miss penalty) outweighs the reduction in miss rate from larger blocks, performance suffers.

27 Performance
Increasing the block size tends to decrease the miss rate.
Use split instruction/data caches, because there is more spatial locality in code.
For a 1 KB cache, 256-byte blocks have a higher miss rate than 16-byte or 64-byte blocks.

28 Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
Write hits: either replace the data in both cache and memory (write-through), or write the data only into the cache and write it back to memory later (write-back).
Write misses: read the entire block into the cache, then write the word.

29 Hardware Issues: designing the memory system to support caches
Make reading multiple words easier by using banks of memory

30 Designing the memory system to support caches
Assume a set of hypothetical memory access times:
1 clock cycle to send the address
15 clock cycles for each DRAM access initiated
1 clock cycle to send a word of data
For a cache block of 4 words and a one-word-wide bank of DRAMs:
miss penalty = 1 + 4 * 15 + 4 * 1 = 65 clock cycles
bytes transferred per clock cycle for a single miss = 4 * 4 / 65 = 0.25
With a main memory width of two words:
miss penalty = 1 + 2 * 15 + 2 * 1 = 33 clock cycles
bytes transferred per clock cycle for a single miss = 4 * 4 / 33 = 0.48
With a main memory width of four words:
miss penalty = 1 + 1 * 15 + 1 * 1 = 17 clock cycles
bytes transferred per clock cycle for a single miss = 4 * 4 / 17 = 0.94
With 4 banks (interleaving):
miss penalty = 1 + 1 * 15 + 4 * 1 = 20 clock cycles
bytes transferred per clock cycle for a single miss = 4 * 4 / 20 = 0.8
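A minimal sketch (illustrative; the interleaved case is simplified to one DRAM latency plus one transfer per word, as on the slide) that reproduces the four miss penalties above:

```python
# Miss-penalty arithmetic for the memory organizations above.
ADDR_CYCLES, DRAM_CYCLES, XFER_CYCLES = 1, 15, 1   # hypothetical slide timings
BLOCK_WORDS, WORD_BYTES = 4, 4

def miss_penalty(memory_width_words, banks=1):
    """Clock cycles to fetch a 4-word block for a given memory width / bank count."""
    if banks > 1:   # interleaved: one DRAM access latency, one transfer per word
        return ADDR_CYCLES + DRAM_CYCLES + BLOCK_WORDS * XFER_CYCLES
    accesses = BLOCK_WORDS // memory_width_words
    return ADDR_CYCLES + accesses * DRAM_CYCLES + accesses * XFER_CYCLES

for label, width, banks in [("one-word-wide", 1, 1), ("two-word-wide", 2, 1),
                            ("four-word-wide", 4, 1), ("4-bank interleaved", 1, 4)]:
    p = miss_penalty(width, banks)
    print(f"{label}: {p} cycles, {BLOCK_WORDS * WORD_BYTES / p:.2f} bytes/cycle")
```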

31 Measuring and Improving Cache Performance
Two ways of improving performance:
Decrease the miss ratio: reduce the probability that two different memory blocks will contend for the same cache location.
Decrease the miss penalty: add an additional level to the hierarchy (multilevel caching).
Simplified model:
CPU execution time = (CPU execution cycles + memory-stall cycles) * cycle time
Memory-stall clock cycles = read-stall cycles + write-stall cycles
Read-stall cycles = (reads / program) * read miss rate * read miss penalty
Write-stall cycles (write-through) = [(writes / program) * write miss rate * write miss penalty] + write buffer stalls
In most write-through cache organizations, the read and write miss penalties are assumed to be the same, so
Memory-stall clock cycles = (memory accesses / program) * miss rate * miss penalty
Memory-stall clock cycles = (instructions / program) * (misses / instruction) * miss penalty

32 Example: Calculating Cache Performance (2nd Ed.)
Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. Use the instruction frequencies for gcc from Fig. 4.54.
Solution:
Let the instruction count be I.
Instruction miss cycles = I * 2% * 40 = 0.80 I
From Fig. 4.54, loads and stores make up 36% (37%?) of the instructions in gcc, so
Data miss cycles = I * 36% * 4% * 40 = 0.58 I
Memory-stall cycles = 0.80 I + 0.58 I = 1.38 I
CPI with memory stalls = 2 + 1.38 = 3.38
CPU time with stalls / CPU time with perfect cache = CPI_stall / CPI_perfect = 3.38 / 2 = 1.69
If the processor becomes faster but memory stays the same, the time spent on memory stalls becomes a growing fraction of execution time.
With a base CPI of 1, the ratio of stalled to perfect CPI becomes 2.38 / 1 = 2.38.
The fraction of execution time spent on memory stalls is 1.38 / 3.38 = 41% and 1.38 / 2.38 = 58%, respectively.

33 Example: Cache Performance with Increased Clock Rate (2nd Ed.)
Suppose we increase the performance of the machine in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the machine be with the faster clock, assuming the same miss rate as the previous example?
Solution:
The clock rate doubles, so the miss penalty in cycles also doubles, to 80 clock cycles.
Let the instruction count be I.
Instruction miss cycles = I * 2% * 80 = 1.6 I
From Fig. 4.54, loads and stores make up 36% of the instructions in gcc, so
Data miss cycles = I * 36% * 4% * 80 = 1.15 I
Memory-stall cycles = 1.6 I + 1.15 I = 2.75 I
CPI with memory stalls = 2 + 2.75 = 4.75
Performance with fast clock / performance with slow clock
= execution time_slow / execution time_fast
= (IC * CPI_slow * clock cycle) / (IC * CPI_fast * 0.5 * clock cycle)
= 3.38 / (0.5 * 4.75) = 1.42

34 Example: Calculating Cache Performance (3rd Ed.)
Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. Use the instruction frequencies for gcc from Fig. 3.26, on page 228.
Solution:
Let the instruction count be I.
Instruction miss cycles = I * 2% * 100 = 2.0 I
From Fig. 3.26, loads and stores make up 36% of the instructions in gcc, so
Data miss cycles = I * 36% * 4% * 100 = 1.44 I
Memory-stall cycles = 2.0 I + 1.44 I = 3.44 I
CPI with memory stalls = 2 + 3.44 = 5.44
CPU time with stalls / CPU time with perfect cache = CPI_stall / CPI_perfect = 5.44 / 2 = 2.72
If the processor becomes faster but memory stays the same, the time spent on memory stalls becomes a growing fraction of execution time.
With a base CPI of 1, the ratio of stalled to perfect CPI becomes 4.44 / 1 = 4.44.
The fraction of execution time spent on memory stalls is 3.44 / 5.44 = 63% and 3.44 / 4.44 = 77%, respectively.

35 Example: Cache Performance with Increased Clock Rate (3rd Ed.)
Suppose we increase the performance of the machine in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the machine be with the faster clock, assuming the same miss rate as the previous example?
Solution:
The clock rate doubles, so the miss penalty in cycles also doubles, to 200 clock cycles.
Let the instruction count be I.
Instruction miss cycles = I * 2% * 200 = 4.0 I
From Fig. 3.26, loads and stores make up 36% of the instructions in gcc, so
Data miss cycles = I * 36% * 4% * 200 = 2.88 I
Memory-stall cycles = 4.0 I + 2.88 I = 6.88 I
CPI with memory stalls = 2 + 6.88 = 8.88
Performance with fast clock / performance with slow clock
= execution time_slow / execution time_fast
= (IC * CPI_slow * clock cycle) / (IC * CPI_fast * 0.5 * clock cycle)
= 5.44 / (0.5 * 8.88) = 1.23
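The four examples above all apply the same stall formula, so a minimal sketch (illustrative; the 36% load/store frequency is the value used in the slides) can reproduce them:

```python
# CPI with instruction- and data-cache stall cycles, as in the examples above.

def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_ref_per_instr, penalty):
    """Base CPI plus instruction- and data-miss stall cycles per instruction."""
    instr_stalls = i_miss_rate * penalty
    data_stalls = mem_ref_per_instr * d_miss_rate * penalty
    return base_cpi + instr_stalls + data_stalls

# 3rd-edition example: 2% I-miss, 4% D-miss, 36% loads/stores, 100-cycle penalty
cpi = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100)
print(cpi, cpi / 2.0)            # -> 5.44 and 2.72x slower than a perfect cache

# Doubled clock rate: the same absolute miss time is now 200 cycles
cpi_fast = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 200)
print(cpi / (0.5 * cpi_fast))    # -> ~1.23x speedup from the faster clock
```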

36 Reducing Cache Misses by More Flexible Placement of Blocks
Fully associative: a block in memory can be placed in any location in the cache; to find a given block, all entries in the cache must be searched.
Set associative (n-way): the cache consists of a number of sets, each containing n blocks. A block is first mapped directly to a set, and then all the blocks in that set are searched for a match:
set = (block number) mod (number of sets in the cache)

37 7.3 Measuring and Improving Cache Performance
Example: Fig. 7.13 shows where memory block 12 can be placed in an eight-block cache under direct-mapped, set-associative, and fully associative placement.

38 Decreasing miss ratio with associativity
Increasing associativity lowers the miss rate; the drawback is an increased hit time.
The figure below shows the possible configurations of an eight-block cache.
Cache size = number of sets * associativity

39 Example: Associativity in Caches
There are three small caches, each consisting of four one-word blocks. The first is fully associative, the second is two-way set associative, and the third is direct mapped. Given the following sequence of block addresses: 0, 8, 0, 6, 8, find the number of misses for each cache organization.
Solution:
Case I. Direct mapped: 5 misses.

40 Example: Associativity in Caches
Solution (cont.)
Case II. Two-way set associative, replacing the least recently used block in a set (LRU): 4 misses.

41 Example: Associativity in Caches
Solution (cont.)
Case III. Fully associative, replacing the least recently used block (LRU): 3 misses.
The figure shows how associativity improves the miss rate.
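All three cases can be replayed with one small simulator. Below is a minimal sketch (illustrative, not from the slides) of a four-block LRU cache with configurable associativity, run on the trace 0, 8, 0, 6, 8:

```python
# Misses for a four-block cache that is direct-mapped (1-way), 2-way set
# associative, or fully associative (4-way), with LRU replacement.
from collections import OrderedDict

def count_misses(trace, num_blocks=4, ways=1):
    """Count misses for an LRU set-associative cache with the given associativity."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order of blocks
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)                     # hit: mark most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)                # evict the least recently used
            s[block] = True
    return misses

trace = [0, 8, 0, 6, 8]
print(count_misses(trace, ways=1))   # direct mapped     -> 5 misses
print(count_misses(trace, ways=2))   # 2-way set assoc.  -> 4 misses
print(count_misses(trace, ways=4))   # fully associative -> 3 misses
```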

42 Example: Associativity in Caches (cont.)
Fig. 7.17 shows an implementation of a four-way set-associative cache:
The index selects the set containing the address of interest.
The tags of all blocks in the set must be searched in parallel.
This example needs four comparators and a 4-to-1 multiplexor to choose among the four possible members of the set.
Conclusions:
As the examples show, the benefit of increasing associativity is usually a lower miss rate.
The extra cost of an associative cache is the comparators and the added delay.

43 An Implementation (Fig. 7.17): the address is divided into Tag, Index, and Block Offset fields.

44 Example: Size of Tags vs. Set Associativity
Assuming a cache of 4K blocks, a 4-word block size, and a 32-bit address, find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative.
Solution:
A 4-word block = 4 * 4 = 16 bytes, so the block offset is 4 bits.
(a) Direct mapped: 4K blocks = 2^12 blocks, so the index is 12 bits.
Tag bits = 32 - 4 - 12 = 16 bits; total tag bits = 4K * 16 = 64 Kbits.
(b) Two-way set associative: 4K / 2 = 2K sets = 2^11 sets, so the index is 11 bits.
Tag bits = 32 - 4 - 11 = 17 bits; total tag bits = 4K * 17 = 68 Kbits.
(c) Four-way set associative: 4K / 4 = 1K sets = 2^10 sets, so the index is 10 bits.
Tag bits = 32 - 4 - 10 = 18 bits; total tag bits = 4K * 18 = 72 Kbits.
(d) Fully associative: one set, so there is no index. Tag bits = 32 - 4 = 28 bits; total tag bits = 4K * 28 = 112 Kbits.
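A minimal sketch (illustrative) that reproduces the tag-size arithmetic above for all four associativities:

```python
# Tag bits as a function of associativity, for 4K blocks of 16 bytes, 32-bit address.
from math import log2

def tag_bits(num_blocks, block_bytes, ways, addr_bits=32):
    """Return (index_bits, tag_bits_per_block, total_tag_bits)."""
    num_sets = num_blocks // ways
    index = int(log2(num_sets))          # fully associative: 1 set -> 0 index bits
    offset = int(log2(block_bytes))
    tag = addr_bits - index - offset
    return index, tag, num_blocks * tag

for ways in (1, 2, 4, 4096):   # direct, 2-way, 4-way, fully associative (4K blocks)
    print(ways, tag_bits(4096, 16, ways))
```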

45 Choosing Which Block to Replace
Replacement algorithms:
Direct mapped: no choice; each block maps to only one line, so that line is replaced.
Associative and set associative:
Least recently used (LRU): e.g., in a two-way set-associative cache, which of the two blocks was used least recently?
First in, first out (FIFO): replace the block that has been in the cache longest.
Least frequently used (LFU): replace the block that has had the fewest hits.
Random.

46 Reducing the Miss Penalty Using Multilevel Caches
To narrow the speed gap between the CPU and main memory (built from DRAM), add a second-level cache (built from SRAM) between them. Because SRAM is faster than DRAM, this lowers the effective access time.
Add a second-level cache:
The primary cache is often on the same chip as the processor.
Use SRAMs to add another cache level between the primary cache and main memory (DRAM).
The miss penalty goes down if the data is found in the second-level cache.
Using multilevel caches:
Try to optimize the hit time of the first-level cache.
Try to optimize the miss rate of the second-level cache.

47 Example 1: Performance of Multilevel Caches
Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 500 MHz. Assume a main memory access time of 200 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 5%. How much faster will the machine be if we add a secondary cache that has a 20 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 2%?
Solution:
Clock rate = 500 MHz, so one clock cycle = 1 / 500M = 2 ns.
Miss penalty to main memory = 200 ns / 2 ns = 100 clock cycles.
Total CPI = base CPI + memory-stall cycles per instruction = 1 + 5% * 100 = 6.0
Miss penalty to the secondary cache = 20 / 2 = 10 clock cycles.
Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
= 1 + 5% * 10 + 2% * 100 = 3.5
Thus the machine with two cache levels is 6.0 / 3.5 = 1.7 times faster than the machine with one level.

48 Example 2: Performance of Multilevel Caches
Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 5 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?
Solution:
Clock rate = 5 GHz, so one clock cycle = 1 / 5G = 0.2 ns.
Miss penalty to main memory = 100 ns / 0.2 ns = 500 clock cycles.
Total CPI = base CPI + memory-stall cycles per instruction = 1 + 2% * 500 = 11.0
Miss penalty to the secondary cache = 5 / 0.2 = 25 clock cycles.
Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
= 1 + 2% * 25 + 0.5% * 500 = 4.0
Thus the processor with two cache levels is 11.0 / 4.0 = 2.8 times faster than the processor with one level.
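A minimal sketch (illustrative) of the two-level CPI arithmetic used in both examples:

```python
# One-level vs. two-level cache CPI, as in Examples 1 and 2 above.

def single_level_cpi(base_cpi, l1_miss_rate, mem_cycles):
    return base_cpi + l1_miss_rate * mem_cycles

def multilevel_cpi(base_cpi, l1_miss_rate, l2_cycles, l2_miss_rate, mem_cycles):
    """L1 misses pay l2_cycles; references that also miss in L2 pay mem_cycles."""
    return base_cpi + l1_miss_rate * l2_cycles + l2_miss_rate * mem_cycles

# Example 2: 5 GHz clock (0.2 ns cycle), 100 ns memory -> 500 cycles, 5 ns L2 -> 25 cycles
one = single_level_cpi(1.0, 0.02, 500)            # -> 11.0
two = multilevel_cpi(1.0, 0.02, 25, 0.005, 500)   # -> 4.0
print(one, two, one / two)                        # -> 11.0 4.0 2.75 (~2.8x)
```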

49 Performance

50 Overview: Virtual Memory
With virtual memory, a program is not loaded entirely into main memory before execution; only the parts that will be used are loaded. Programs are therefore not limited by the size of physical memory, and because each user occupies less main memory, the system can run more user programs at once, raising CPU utilization. Conceptually, the CPU accesses the data it needs in main memory; when the data is not there, a page fault trap occurs and the missing data must be swapped into main memory. Because only the needed page is brought in rather than the whole program, this is called lazy swapping. The common technique for implementing virtual memory is demand paging.
Demand paging: a paging system with swapping. Programs normally reside in secondary storage; when a program runs, only the parts it touches are loaded into main memory. If execution reaches a page that is not in main memory, a page fault trap occurs, and that page must be brought into main memory so the program can continue.

51 Overview: Virtual Memory
Page replacement. The main steps of page replacement are:
1. Find the location of the desired page on disk.
2. Find a free frame:
(a) If there is a free frame, use it.
(b) If there is no free frame, use a page-replacement policy to select a victim frame.
(c) Write the victim frame out.
3. Read the desired page into the frame.
4. Restart the user process.
Common replacement policies:
FIFO (first in, first out): replace the page that entered the system earliest; generally the least effective of these policies.
Optimal replacement: an idealized policy that replaces the page that will not be referenced for the longest time. It is not realizable in practice, because the system cannot know future page references.
Least recently used (LRU): record the time each page was last referenced; when a page must be replaced, choose the one unused for the longest time.

52 Virtual Memory
Main memory can act as a cache for secondary storage (disk). Main memory need contain only the active portions of the many programs, just as a cache contains only the active portion of one program.
Advantages: the illusion of having more physical memory, program relocation, and protection.
Virtual addresses are translated to physical addresses.

53 Pages: virtual memory blocks
Mapping from a virtual to a physical address.
Page faults: the data is not in memory and must be retrieved from disk.
The miss penalty is huge, so pages should be fairly large (e.g., 4 KB).
Reducing page faults is important, so fully associative placement of pages is allowed (LRU is worth the price).
Faults can be handled in software instead of hardware.
Using write-through is too expensive, so write-back is used.
In the figure: number of physical pages = 2^18; page size = 2^12 = 4 KB.

54 Placing a Page and Finding it again
Page table: the page table resides in memory, is indexed with the virtual page number from the virtual address, and contains the corresponding physical page number. Each program has its own page table, which maps that program's virtual address space to main memory.
Page table register: to indicate the location of the page table in memory, the hardware includes a register that points to the start of the page table.

55 Page Tables
Each program has its own page table; a page table register points to its first entry.

56 Page Tables
Virtual page number:
If the valid bit is 1, the page table supplies the physical page number that the virtual page number maps to.
If the valid bit is 0, the page currently resides on disk, and the page table supplies the disk address.
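A minimal sketch (illustrative; the page_table mapping is hypothetical) of the translation just described, with 4 KB pages so the low 12 bits are the page offset:

```python
# Virtual-to-physical translation through a page table, 4 KB pages.
PAGE_BITS = 12                     # 4 KB pages
PAGE_SIZE = 1 << PAGE_BITS

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_BITS       # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn not in page_table:      # valid bit off: page fault, fetch from disk
        raise LookupError(f"page fault on virtual page {vpn}")
    return (page_table[vpn] << PAGE_BITS) | offset

page_table = {0: 5, 1: 9}          # hypothetical VPN -> physical page number
print(hex(translate(0x1234, page_table)))   # VPN 1, offset 0x234 -> 0x9234
```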

57 Example
With a 32-bit virtual address, 4 KB pages, and 4 bytes per page table entry, what is the total page table size?
Solution:
Number of page table entries = 2^32 / 4K = 2^32 / 2^12 = 2^20
Page table size = 2^20 * 4 bytes = 2^22 bytes = 4 MB
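A quick check of this sizing (illustrative sketch):

```python
ADDR_BITS, PAGE_BITS, ENTRY_BYTES = 32, 12, 4

entries = 1 << (ADDR_BITS - PAGE_BITS)     # 2^20 page-table entries
size_bytes = entries * ENTRY_BYTES         # 2^22 bytes = 4 MB
print(entries, size_bytes // (1024 * 1024), "MB")
```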

58 Making Address Translation Fast
A cache for address translations: the translation-lookaside buffer (TLB).
This special cache holds recently used translations: the TLB tag records the virtual page number, and the TLB data records the physical page number.
If a page is not found in the TLB, the page table is consulted; if the translation is found there, it is added to the TLB.
If it is not in the page table either, a page fault occurs, and the needed page must be loaded from disk and entered into the page table (and the TLB).
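A minimal sketch (illustrative; capacity and mappings are hypothetical) of the TLB lookup path just described: a small fully associative, LRU-managed cache of recent VPN-to-PPN translations, refilled from the page table on a miss:

```python
# A tiny fully associative, LRU TLB in front of the page table.
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()       # VPN -> PPN, kept in LRU order

    def lookup(self, vpn, page_table):
        if vpn in self.entries:            # TLB hit
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        if vpn not in page_table:          # not in the page table either: page fault
            raise LookupError(f"page fault on virtual page {vpn}")
        ppn = page_table[vpn]              # TLB miss: refill from the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn

tlb = TLB()
page_table = {0: 5, 1: 9}                  # hypothetical translations
print(tlb.lookup(1, page_table))           # miss, refilled from the page table -> 9
print(tlb.lookup(1, page_table))           # hit in the TLB -> 9
```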

59 TLBs and caches

60 Modern Systems Very complicated memory systems:

61 Some Issues
Processor speeds continue to increase very fast, much faster than either DRAM or disk access times.
Design challenge: dealing with this growing disparity.
Trends:
synchronous SRAMs (provide a burst of data)
redesign DRAM chips to provide higher bandwidth or processing
restructure code to increase locality
use prefetching (make the cache visible to the ISA)

