1 Advanced Computer Architecture: Multiprocessor Systems (Lecture 8). Cheng Xu (程旭), May 16, 2011

2 Uniprocessor Performance (SPECint)
[Figure: uniprocessor performance (SPECint). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year, 1978 to 1986; RISC + x86: 52%/year, 1986 to 2002; RISC + x86: ??%/year, 2002 to present; a 3X gap is marked on the plot.]

3 Déjà vu all over again?
“… today’s processors … are nearing an impasse as technologies approach the speed of light.” David Mitchell, The Transputer: The Time Is Now (1989)
The Transputer had bad timing (uniprocessor performance kept improving) => procrastination rewarded: 2X sequential performance every 1.5 years.
“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2005)
All microprocessor companies have switched to MP (2X CPUs every 2 years) => procrastination penalized: 2X sequential performance every 5 years.
Manufacturer/Year: AMD/'07, Intel/'07, IBM/'07, Sun/'07. Processors/chip: 4, 2, 8. Threads/Processor: 1. Threads/chip: 64.

4 Other Factors => Multiprocessors
Growth in data-intensive applications: databases, file servers, …
Growing interest in servers and server performance.
Increasing desktop performance is less important (outside of graphics).
Improved understanding of how to use multiprocessors effectively, especially servers, where there is significant natural TLP.
Advantage of leveraging a design investment by replication, rather than a unique design.

5 Parallel computers
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Almasi and Gottlieb, Highly Parallel Computing, 1989
Some questions about parallel computers:
Resource allocation: how many processing elements? How powerful is each processing element? How much memory?
Data access, communication, and synchronization: how do the processing elements cooperate and communicate? What kind of interconnect? How is data transferred between processors? What primitives does the programmer use?
Performance and scalability: how do the factors above affect performance? How is scalability supported?

6 The parallel-processing “faith”
A dream of computer designers since the 1960s: raise performance by adding processors, versus designing a faster single processor.
Because “the uniprocessor cannot keep improving,” many machine organizations tied to specific programming models were created; for example, the argument that, being limited by the speed of light, uniprocessor speed would stop improving: 1972, …, 1989.
Almost religious fervor: you had to believe!
The fervor cooled in the 1990s, when well-known companies such as Thinking Machines and Kendall Square left the commercial arena.
The argument then became the “pull” of scalable-performance opportunity rather than the “push” of stagnant uniprocessor performance.

7 What level of parallelism?
Bit-level parallelism: roughly 1970 to 1985 (4-bit, 8-bit, 16-bit, and 32-bit processors).
Instruction-level parallelism (ILP): 1985 to today: pipelining, superscalar execution, VLIW, out-of-order execution; what are the limits of ILP?
Process-level or thread-level parallelism: can it become the mainstream of general-purpose computing?

8 Multiprocessors and Multicomputer Clusters
MIMD
Multiprocessors (single address space with shared memory):
  UMA (Uniform Memory Access): PVP (Cray T90); SMP (Intel SHV, Sun, SGI, IBM)
  NUMA (Non-Uniform Memory Access): COMA (KSR-1); CC-NUMA (Stanford Dash, SGI Origin, Sequent NUMA, HP Exemplar); NCC-NUMA (Cray T3E, etc.); DSM or SC-NUMA (TreadMarks, Wind Tunnel, Shrimp)
Multicomputers (multiple address spaces with no remote memory access):
  NORMA (NO Remote Memory Access): MPP (Intel Option Red, IBM Blue Pacific, SGI/Cray Blue Mountain); Cluster (IBM SP2, TruCluster, Solaris MC, Tandem Himalaya, Wolfpack, NOW, PearlCluster)
Acronyms: UMA = Uniform Memory Access; NUMA = Non-Uniform Memory Access; NORMA = NO Remote Memory Access; PVP = Parallel Vector Processor; SMP = Symmetric Multiprocessor; COMA = Cache-Only Memory Access; CC-NUMA = cache-coherent NUMA; NCC-NUMA = non-cache-coherent NUMA; SC-NUMA = software-coherent NUMA

9 Parallel architecture
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
A parallel architecture extends the conventional architecture with a communication architecture: abstractions (the hardware/software interface) and the organizational structures that implement those abstractions efficiently.

10 History of parallel architectures
Historically, parallel architectures were tightly coupled to programming models: a large number of divergent structures appeared (systolic arrays, SIMD, message passing, dataflow, shared memory), cutting across application software, system software, and architecture, and there was no way to predict how things would develop.
Uncertainty about the direction of development seriously hindered the development of parallel software!

11 Two Models for Communication and Memory Architecture
Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (aka multicomputers). Modern cluster systems contain multiple stand-alone computers communicating via messages.
Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either UMA (Uniform Memory Access time) for shared-address, centralized-memory MPs, or NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MPs.
In the past there was confusion over whether “sharing” means sharing physical memory (symmetric MP) or sharing an address space.

12 Centralized vs. Distributed Memory
[Figure: centralized memory (processors P1…Pn, each with a cache, sharing memory through an interconnection network) versus distributed memory (a memory attached to each processor/cache node, with the nodes joined by an interconnection network); the “Scale” arrow points from the centralized toward the distributed organization.]

13 Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors. Large caches => a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases.

14 Distributed Memory Multiprocessor
Pro: cost-effective way to scale memory bandwidth, if most accesses are to local memory.
Pro: reduces the latency of local memory accesses.
Con: communicating data between processors is more complex.
Con: software must be aware of data placement to take advantage of the increased memory bandwidth.

15 Challenges of Parallel Processing
The big challenge is the fraction of a program that is inherently sequential. What does it mean to be inherently sequential?
Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential? (a) 10%  (b) 5%  (c) 1%  (d) <1%
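Working the numbers with Amdahl's law (the standard way to answer this question; the derivation below is added here and is not on the slide): with sequential fraction f and N processors,

    Speedup = 1 / (f + (1 - f)/N)
    80 = 1 / (f + (1 - f)/100)  =>  f + (1 - f)/100 = 1/80  =>  99 f = 0.25  =>  f ≈ 0.25%

So the sequential fraction must be under 1%, i.e., the last answer.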

16 Symmetric Multiprocessors
[Figure: processors, each with a cache, on a CPU-memory bus with main memory and a bridge to an I/O bus carrying I/O controllers, graphics output, and networks.]
Symmetric: all memory is equally far away from all processors, and any processor can do any I/O (e.g., set up a DMA transfer).

17 Parallel architecture today
Extend “computer architecture” to support communication and cooperation.
Old definition: instruction set architecture.
New definition: communication architecture: the critical abstractions, boundaries, and primitives (interfaces), plus the organizational structures (hardware or software) that implement those interfaces.
Today, compilers, libraries, and the operating system are important bridges.

18 Layered framework of modern parallel computer systems
[Figure: layers from top to bottom: parallel applications (CAD, databases, scientific modeling, multiprogramming); programming models (multiprogramming, shared address, message passing, data parallel); compiler or library; communication abstraction (the user/system boundary); operating system support; communication hardware (the hardware/software boundary); physical communication medium.]

19 Parallel frameworks
Layers: programming model and communication abstraction.
Programming models: multiprogramming (many jobs, no communication between them); shared address space (communicate through memory); message passing (send and receive messages); data parallel (multiple agents operate on different data sets simultaneously, then exchange information globally and simultaneously, via shared memory or message passing).
Communication abstraction: shared address space (e.g., load, store, atomic swap); message passing (e.g., library calls such as send and receive).
There is an ongoing debate about this topic (ease of programming vs. scalability).

20 Shared-address model summary
Each processor can name every physical location in the machine; each process can name all the data it shares with other processes.
Data is transferred via loads and stores. Data sizes: byte, word, …, or cache block.
Virtual memory maps virtual addresses to local or remote physical addresses.
The memory-hierarchy model requires that communication move data into the local processor's cache (just as a load moves data from memory into the cache).
For communication: what about latency, bandwidth, and scalability?
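A minimal sketch (added here, not from the slides) of what “data transferred via loads and stores” looks like in practice: two threads of one process communicate through an ordinary shared array. The variable and function names are illustrative only.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_data[4];              /* lives in the single shared address space */

    static void *producer(void *arg) {
        for (int i = 0; i < 4; i++)
            shared_data[i] = i * i;         /* communication is just a store */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(&t, NULL);             /* join orders the stores above before the loads below */
        for (int i = 0; i < 4; i++)
            printf("%d\n", shared_data[i]); /* communication is just a load */
        return 0;
    }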

21 Shared address space model
[Figure: each process P0…Pn has a virtual address space with a private portion (private code, stack, and heap) and a shared portion; the shared portions of all the processes map onto a common region of the machine's physical address space, so a store by one process to a shared address is seen by loads in the other processes.]
Communication among the processes of a group happens through the shared addresses.

22 Shared-address/shared-memory multiprocessor model
Communication happens through loads and stores. This is the oldest and most widely used model.
Based on timesharing: multiple processes on multiple processors, versus sharing a single processor.
A process: a single virtual address space with one or more threads of control. Multiple processes may overlap (share), but all threads share the single address space of their process.
A write to the shared address space by one thread is visible to reads by the other threads.
Usual model: shared code, private stacks, some shared heap, some private heap.

23 Example: small-scale multiprocessor design
Memory: uniform access time (UMA), bus interconnect, I/O.
Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro.
[Figure: processors, each with one or more levels of cache, on a shared bus with main memory and the I/O system.]

24 Scalability
[Figure: “dance-hall” organization (all processors and caches on one side of the interconnection network, all memory modules on the other) versus a distributed organization (a memory attached to each processor/cache node).]
The interconnection network is the problem: cost (crossbar) or bandwidth (bus).
Dance hall: bandwidth is still scalable at lower cost than a crossbar, but the latency to memory is uniform, and uniformly the maximum.
Distributed memory, i.e., non-uniform memory access (NUMA): build a shared address space out of simple message transactions on a general-purpose network (e.g., read-request, read-response).
Cache shared (especially non-local) data?

25 SMP interconnect
Processors connect to memory and to I/O.
Bus-based: all memory locations have the same access time, hence SMP = symmetric multiprocessor; as processors and I/O are added, sharing limits the bandwidth.
Crossbar: expensive to scale.
Multistage networks: cheaper to scale than a crossbar of comparable bandwidth.
“Dance-hall” design: all processors on one side of the network, all memory on the other.

26 Large-scale multiprocessor design
Note: distribution (distributed memory) is achieved with non-uniform access time (NUMA) and a scalable interconnection network.
[Figure: nodes of processor + cache, memory, and I/O system connected by an interconnection network; approximate latencies: 1 cycle to the cache, 40 cycles to local memory, 100 cycles across the network.]
Example: Cray T3E: 480 MB/sec per link, 3 links per node, memory on the node, switch based, up to 2048 nodes, $30M to $50M.

27 Message-passing model
Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations; essentially NUMA, but integrated at the I/O devices rather than in the memory system.
Send specifies a local buffer + the receiving process on the remote computer.
Receive specifies the sending process on the remote computer + a local buffer in which to place the data.
Typically, send includes a process tag and receive has a matching rule based on that tag: match exactly one, or match any.
Synchronization: when a send completes, when the buffer is free, when the request is accepted, receive waits for a send.
Send + receive => a memory-to-memory copy, where each primitive supplies a local address, with pairwise synchronization!
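A minimal message-passing sketch in MPI (MPI itself is introduced later, on the convergence slide; the buffer name and tag value here are illustrative): process 0 sends an array to process 1, which receives it into its own local buffer, exactly the send/receive pairing described above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send: local buffer + receiving process (rank 1), tag 99 */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int buf[4];
            /* Receive: sending process (rank 0) + local buffer for the data */
            MPI_Recv(buf, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 got %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
        }
        MPI_Finalize();
        return 0;
    }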

28 Message-passing abstraction
[Figure: process P executes Send X, Q, t (local address X, destination process Q, tag t); process Q executes Receive Y, P, t (local address Y, source process P, tag t); the matching send/receive pair copies the data from address X in P's local address space to address Y in Q's local address space.]

29 Message-passing model (continued)
Send + receive => a memory-to-memory copy even when running on a single processor, with synchronization through the operating system.
History of message passing: network topology mattered greatly because data could only be sent to the nearest neighbors; typical synchronization was blocking send and receive.
Later, DMA with non-blocking sends: DMA keeps received data in a buffer until the processor actually performs the receive, then the data is moved into local memory.
Later, software libraries allowed arbitrary (any-to-any) communication.
Example: IBM SP-2: rack-mounted RS6000 workstations; a network interface card using an Intel 960; an 8x8 crossbar as the basic communication building block; 40 MByte/sec per link.

30 Communication models
Shared memory: processors communicate through a shared address space; easy to implement on small machines. Advantages: the model of choice for uniprocessors and small-scale multiprocessors; ease of programming; low latency; easy to use hardware-controlled caching.
Message passing: processors have private memories and communicate via messages; needs less hardware and is easier to design; focuses attention on the costly non-local operations.
Either software model can be supported on either kind of hardware.

31 The popular Flynn taxonomy
SISD (Single Instruction, Single Data): the uniprocessor.
MISD (Multiple Instruction, Single Data): ???
SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2; simple programming model, low overhead, flexible, all custom integrated circuits.
MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D, SGI Origin; built from commodity microprocessors.

32 Data-parallel model
The same operation is applied in parallel to every element of a large, regular data structure (e.g., an array).
One control processor broadcasts to many PEs; data is distributed across the per-PE memories.
Early 1980s: VLSI => a SIMD revival: 32 one-bit PEs plus on-chip memory per chip.
Data-parallel programming languages specify the layout of data across the processors.
[Figure: a control processor broadcasting to an array of PEs, each with its own memory.]

33 Data-parallel model (continued)
Vector processors have a similar ISA, but without restrictions on data placement.
SIMD gave rise to the data-parallel programming languages.
Advances in VLSI produced single-chip FPUs and whole fast microprocessors, weakening the appeal of SIMD.
The SIMD programming model evolved into the Single Program Multiple Data (SPMD) model: all processors execute the same program.
Data-parallel programming languages are still useful, and perform all communication at once: bulk synchronous: phases in which all communication completes at a global barrier.

34 Convergence of parallel architectures
Connect complete computers to a scalable network through a communication assist, with memory distributed across the nodes. Different programming models place different requirements on the communication assist.
Shared address space: integrate tightly with memory to capture the memory events that involve interaction with other processors, and to accept requests from other nodes.
Message passing: send messages quickly and respond to incoming messages: compare tags, allocate buffers, transfer data, wait until a receive is posted.
Data parallel: fast global synchronization.
High Performance Fortran (HPF): shared memory, data parallel. Message Passing Interface (MPI): a message-passing library. Both run on many machines and have multiple implementations.

35 Fundamental issues
Three issues characterize a parallel machine: naming, synchronization, and latency and bandwidth.

36 Fundamental issue #1: naming
Naming: which data are shared; how they are addressed; which operations can access the data; how processes refer to one another.
The choice of naming affects the code the compiler produces: with loads, an address is remembered; with message passing, the compiler must track processor numbers and local virtual addresses.
The choice of naming affects data replication: replication through the cache hierarchy on loads, or replication and coherence in software.

37 Fundamental issue #1: naming (continued)
Global physical address space: any processor can generate the address and access the location in a single operation; the memory can be anywhere: virtual address translation handles it.
Global virtual address space: if the address space of each process can be configured to contain all the shared data of the parallel program.
Segmented shared address space: locations are named <process number, address>, uniformly for all processes of the parallel program.

38 Fundamental issue #2: synchronization
To cooperate, processes must coordinate.
Message passing provides implicit coordination with the transmission or arrival of data.
A shared address space => extra operations are needed for explicit coordination: e.g., write a flag, wake up a thread, interrupt a processor.

39 Fundamental issue #3: latency and bandwidth
Bandwidth: communication needs high bandwidth; it is limited by the network, the memory, and the processor; the overhead of communication is a major problem in many machines.
Latency: affects performance, since the processor may have to wait; affects ease of programming, since overlapping communication with computation requires thinking about many issues.
Latency hiding: how can a mechanism help hide latency? Examples: overlap message sends with computation, prefetch data, switch to other tasks.

40 Small-scale multiprocessors
[Figure: processors, each with one or more levels of cache, on a shared bus with main memory and the I/O system.]
What caches buy us: more bandwidth than the bus/memory alone; lower access latency; very effective for both private and shared data.
What about cache coherence?

41 The cache-coherence problem
[Figure: two processors cache the same location; (b) one processor writes its copy, leaving the other cached copy stale; (c) even after memory is updated, the stale copy remains in the other cache.]

42 What does coherence mean?
Informal: any read must return the result of the most recent write. Too strict, and too hard to implement.
Better: any write must eventually be seen by some read; all writes are seen in the proper order (serialization).
Two rules guarantee this:
If P writes x and P1 reads x, and the read and write are sufficiently far apart, then P's write will be seen by P1.
Writes to a single location are serialized: they are seen in one order, so the latest write will be seen; otherwise writes could be seen in an illogical order (an older value seen after a newer value was written).

43 Possible hardware coherence solutions
Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see whether they have a copy and respond accordingly; requires broadcast, since the caching information lives at the processors; works well with a bus (a natural broadcast medium); dominates for small-scale machines (most of the market).
Directory-based schemes: keep track of what is being shared in one centralized place; distributed memory => a distributed directory, for scalability (avoids bottlenecks); send point-to-point requests to processors over the network; scales better than snooping; actually existed BEFORE snooping-based schemes.

44 Basic snooping protocols
Write-invalidate protocol: multiple readers, a single writer. A write to shared data sends an invalidate to all caches, which snoop and invalidate any copies. Read miss: with write-through, memory is always up-to-date; with write-back, snoop the caches to find the most recent copy.
Write-broadcast protocol (typically write-through): a write to shared data is broadcast on the bus; processors snoop and update any copies. Read miss: memory is always up-to-date.
Write serialization: the bus serializes requests! The bus is the single point of arbitration.

45 An example snooping protocol
Invalidation protocol, write-back caches.
Each block of memory is in one state: clean in all caches and up-to-date in memory (Shared); or dirty in exactly one cache (Exclusive); or not in any cache.
Each cache block is in one state (tracked per block): Shared: the block can be read; or Exclusive: this cache has the only copy, it is writable, and it is dirty; or Invalid: the block contains no data.
Read misses cause all caches to snoop the bus; writes to a clean line are treated as misses.
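A compact sketch (added here, not part of the lecture) of the processor-side transitions that the next three slides draw as state diagrams, using the three block states just defined. The helper functions are hypothetical stubs; bus-side (snooping) transitions are omitted.

    /* Sketch of the CPU-side transitions of the invalidation protocol above. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } CacheState;

    static void place_read_miss_on_bus(void)  { /* bus transaction stub */ }
    static void place_write_miss_on_bus(void) { /* bus transaction stub */ }
    static void write_back_block(void)        { /* write the dirty block to memory */ }

    CacheState cpu_read(CacheState s) {
        if (s == INVALID) {                  /* read miss */
            place_read_miss_on_bus();
            return SHARED;
        }
        return s;                            /* read hit in Shared or Exclusive */
    }

    CacheState cpu_write(CacheState s) {
        if (s != EXCLUSIVE)                  /* a write to a clean or absent line is a miss */
            place_write_miss_on_bus();       /* this also invalidates the other copies */
        return EXCLUSIVE;                    /* the line is now writable and dirty */
    }

    CacheState cpu_evict(CacheState s) {
        if (s == EXCLUSIVE)
            write_back_block();              /* only a dirty copy needs to be written back */
        return INVALID;
    }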

46 Snoopy-cache state machine I
State machine for CPU requests, for each cache block.
[Figure: Invalid --CPU read, place read miss on bus--> Shared; Invalid --CPU write, place write miss on bus--> Exclusive; Shared --CPU read hit--> Shared; Shared --CPU read miss, place read miss on bus--> Shared; Shared --CPU write, place write miss on bus--> Exclusive; Exclusive --CPU read hit / CPU write hit--> Exclusive; Exclusive --CPU read miss: write back the block, place read miss on bus--> Shared; Exclusive --CPU write miss: write back the cache block, place write miss on bus--> Exclusive.]
(Speaker note: from Invalid, a read goes to Shared and a write goes to dirty; the Shared case looks the same.)

47 Snoopy-cache state machine II
State machine for bus requests, for each cache block.
[Figure: a write miss on the bus for this block moves Shared to Invalid, and moves Exclusive to Invalid after writing back the block (aborting the memory access); a read miss on the bus for this block moves Exclusive to Shared after writing back the block (aborting the memory access).]

48 Snooping cache: combined state machine
CPU requests and bus requests together, for each cache block.
[Figure: Invalid --CPU read, place read miss on bus--> Shared; Invalid --CPU write, place write miss on bus--> Exclusive; Shared --CPU read hit--> Shared; Shared --CPU read miss, place read miss on bus--> Shared; Shared --CPU write, place write miss on bus--> Exclusive; Shared --remote write, or miss due to an address conflict--> Invalid; Exclusive --CPU read hit / CPU write hit--> Exclusive; Exclusive --remote read: write back the block--> Shared; Exclusive --remote write, or miss due to an address conflict: write back the block--> Invalid; Exclusive --CPU write miss: write back the cache block, place write miss on bus--> Exclusive.]

49 Example
Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block but A1 ≠ A2.
[Figure: step-by-step walk through the snooping-protocol state diagram above.]

50 Example (continued 1)
[Figure: the same walk-through, with the next transition highlighted (active arrow).]

51 Example (continued 2)
[Figure: the same walk-through, next transition highlighted.]

52 Example (continued 3)
[Figure: the same walk-through, next transition highlighted.]

53 Example (continued 4)
[Figure: the same walk-through, next transition highlighted.]

54 Example (continued 5)
[Figure: the same walk-through, final transition highlighted.]
Speaker note: why handle the write miss first? In general a write touches only part of a block, so the block may need to be read first to obtain a full block; the write-back is a low-priority event.
Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block but A1 ≠ A2.

55 Implementation complications
Write races: a cache cannot be updated until the bus is obtained; otherwise another processor may get the bus first and write the same cache block! Two-step process: (1) arbitrate for the bus; (2) place the miss on the bus and complete the operation. If a miss to the block occurs while waiting for the bus, handle the miss (an invalidate may be needed) and then restart.
Split-transaction bus: a bus transaction is not atomic, so there can be multiple outstanding transactions for a block; multiple misses can interleave, allowing two caches to grab the block in the Exclusive state; the protocol must track and prevent multiple misses for one block.
Must support interventions and invalidations.

56 Cache State Transition Diagram The MSI protocol
M: Modified, S: Shared, I: Invalid. Each cache line holds state bits alongside the address tag.
Cache state in processor P1:
[Figure: I --read miss--> S; I --write miss--> M; S --P1 intent to write--> M; M --other processor reads, P1 writes back--> S; M or S --other processor intent to write--> I; M --P1 reads or writes--> M; S --read by any processor--> S.]

57 MESI: an enhanced MSI protocol with increased performance for private data
M: Modified Exclusive, E: Exclusive unmodified, S: Shared, I: Invalid. Each cache line holds state bits alongside the address tag.
Cache state in processor P1:
[Figure: I --read miss, not shared--> E; I --read miss, shared--> S; I --write miss--> M; E --P1 write--> M; E or S --P1 intent to write--> M; M --other processor reads, P1 writes back--> S; M, E, or S --other processor intent to write--> I; E --P1 read--> E; M --P1 write or read--> M; S --read by any processor--> S.]

58 Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of: (1) uniprocessor cache miss traffic, and (2) traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a 4th C, coherence misses, joining compulsory, capacity, and conflict.

59 Coherency misses
True sharing misses arise from the communication of data through the cache-coherence mechanism: invalidates due to the first write to a shared block, and reads by another CPU of a block modified in a different cache; these misses would still occur if the block size were one word.
False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written: the invalidation does not communicate a new value, it only causes an extra cache miss; the block is shared, but no word in it is actually shared, so the miss would not occur if the block size were one word.
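A small sketch (added here, not from the slides) that makes false sharing concrete: two threads update different counters that happen to share a cache block, so each write invalidates the other core's copy even though no word is logically shared. Padding each counter to its own block removes the coherence misses; the 64-byte block size is an assumption.

    #include <pthread.h>

    #define BLOCK 64                              /* assumed cache-block size in bytes */

    struct padded { long count; char pad[BLOCK - sizeof(long)]; };

    static long shared_counters[2];               /* adjacent: false sharing */
    static struct padded padded_counters[2];      /* one block each: no false sharing */

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 10000000; i++) {
            shared_counters[id]++;                /* invalidates the other core's copy */
            padded_counters[id].count++;          /* private block: no invalidations */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        return 0;
    }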

60 MP Performance 4 Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine
True sharing and false sharing misses are essentially unchanged going from a 1 MB to an 8 MB L3 cache, while uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as the cache size increases.

61 MP Performance 2MB Cache Commercial Workload: OLTP, Decision Support (Database), Search Engine
True sharing, false sharing increase going from 1 to 8 CPUs

62 Larger multiprocessor systems
Separate memory per processor; local or remote access goes through the memory controller.
Cache-coherence solution 1: non-cached pages.
Alternative: a directory per cache that tracks the state of every block in every cache: which caches have copies of the block, dirty vs. clean, and so on.
Keep the information per memory block or per cache block?
PLUS: in memory => a simpler protocol (centralized, one location).
MINUS: in memory => the directory size is f(memory size) rather than f(cache size).
To prevent the directory from becoming a bottleneck, distribute the directory entries with the memory, each keeping track of which processors have copies of its blocks.

63 Distributed-directory multiprocessor
[Figure: nodes consisting of processor + cache, memory, a distributed directory, and an I/O system, connected by an interconnection network; each node's directory tracks the sharing state of the blocks in its local memory.]

64 Directory protocol
Similar to the snoopy protocol: three states.
Shared: one or more processors have the data, and memory is up-to-date.
Uncached: no processor has it; not valid in any cache.
Exclusive: one processor (the owner) has the data; memory is out-of-date.
In addition to the cache state, the directory must track which processors have the data when it is in the Shared state (usually a bit vector: bit i is 1 if processor i has a copy).
Keep it simple(r): writes to non-exclusive data => a write miss; the processor blocks until the access completes; assume messages are received and acted upon in the order they were sent.
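A minimal data-structure sketch (added here, not from the slides) of a directory entry as just described: a state plus a sharer bit vector, sized here for up to 64 processors; the field and helper names are illustrative.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;      /* bit i set => processor i has a copy (or is the owner) */
    } DirEntry;                /* one entry per memory block */

    static inline void add_sharer(DirEntry *e, int p)      { e->sharers |= (uint64_t)1 << p; }
    static inline void clear_sharers(DirEntry *e)          { e->sharers = 0; }
    static inline int  is_sharer(const DirEntry *e, int p) { return (int)((e->sharers >> p) & 1); }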

65 Directory protocol (continued)
There is no bus, and we don't want to broadcast: the interconnect is no longer a single arbitration point, and all messages have explicit responses.
Terms (typically three processors are involved):
Local node: where the request originates.
Home node: where the memory location of the address resides.
Remote node: has a copy of the cache block, whether exclusive or shared.
Example messages on the next slide: P = processor number, A = address.

66 Directory protocol messages
Read miss: local cache -> home directory; content P, A. Processor P reads data at address A; make P a read sharer and arrange to send the data back.
Write miss: local cache -> home directory; content P, A. Processor P writes data at address A; make P the exclusive owner and arrange to send the data back.
Invalidate: home directory -> remote caches; content A. Invalidate a shared copy at address A.
Fetch: home directory -> remote cache; content A. Fetch the block at address A and send it to its home directory.
Fetch/Invalidate: home directory -> remote cache; content A. Fetch the block at address A, send it to its home directory, and invalidate the block in the cache.
Data value reply: home directory -> local cache; content: data. Return a data value from the home memory (the read-miss response).
Data write-back: remote cache -> home directory; content A, data. Write back a data value for address A (the invalidate response).

67 State-transition diagram for an individual cache block in a directory-based system
The states are identical to the snoopy case, and the transactions are very similar.
Transitions are caused by read misses, write misses, invalidates, and data-fetch requests.
The cache generates read-miss and write-miss messages to the home directory; the write misses that were broadcast on the bus for snooping become explicit invalidate and data-fetch requests.

68 CPU-cache state machine
State machine for CPU requests, for each memory block; a block present only in memory is in the Invalid state.
[Figure: Invalid --CPU read: send read-miss message--> Shared; Invalid or Shared --CPU write: send write-miss message to the home directory--> Exclusive; Shared --invalidate, or miss due to an address conflict--> Invalid; Exclusive --fetch/invalidate, or miss due to an address conflict: send data write-back message to the home directory--> Invalid; Exclusive --fetch: send data write-back message to the home directory--> Shared; CPU read hits and write hits stay in the current state.]

69 State-transition diagram for the directory
Same states and structure as the transition diagram for an individual cache.
Two actions: update the directory state, and send messages to satisfy requests.
The directory tracks all copies of each memory block; it also indicates actions that update the sharing set, Sharers, in addition to sending a message.

70 Directory state machine
State machine for directory requests, for each memory block; a block present only in memory is in the Uncached state.
[Figure: Uncached --read miss: Sharers = {P}; send Data Value Reply--> Shared; Uncached --write miss: Sharers = {P}; send Data Value Reply--> Exclusive; Shared --read miss: Sharers += {P}; send Data Value Reply--> Shared; Shared --write miss: send Invalidate to Sharers, then Sharers = {P}; send Data Value Reply--> Exclusive; Exclusive --data write-back: Sharers = {}; write back the block--> Uncached; Exclusive --read miss: Sharers += {P}; send Fetch to the owner; send Data Value Reply to the requesting cache; write back the block--> Shared; Exclusive --write miss: Sharers = {P}; send Fetch/Invalidate to the old owner; send Data Value Reply to the requesting cache; write back the block--> Exclusive.]

71 Directory protocol example
A message sent to the directory causes two actions: update the directory, and send further messages to satisfy the request.
Block in the Uncached state (the copy in memory is the current value); the only possible requests for that block are:
Read miss: the requesting processor is sent the data from memory and is made the only sharing node; the state of the block becomes Shared.
Write miss: the requesting processor is sent the value and becomes the sharing node; the block is made Exclusive to indicate that the only valid copy is cached; Sharers records the identity of the owner.
Block in the Shared state (the memory value is up-to-date):
Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
Write miss: the requesting processor is sent the value; all processors in the set Sharers are sent invalidate messages; Sharers is set to the identity of the requesting processor; the state of the block becomes Exclusive.

72 Directory protocol example (continued)
Block in the Exclusive state: the current value of the block is held in the cache of the processor identified by Sharers (the owner), so there are three possible directory requests:
Read miss: the owner is sent a data-fetch message, which causes the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor; the identity of the requesting processor is added to Sharers, which still contains the identity of the former owner (since it still has a readable copy); the state becomes Shared.
Data write-back: the owner is replacing the block and therefore must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached and the Sharers set is empty.
Write miss: the block has a new owner; a message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner; Sharers is set to the identity of the new owner, and the state of the block becomes Exclusive.

73 Example
Processor 1, Processor 2, interconnect, directory, memory. A1 and A2 map to the same cache block.
[Figure: step-by-step walk through the directory protocol.]

74 Example (continued 2)
[Figure: the same walk-through, next step.]

75 Example (continued 3)
[Figure: the same walk-through, next step.]

76 Example (continued 4)
[Figure: the same walk-through, next step, including a write-back.]

77 Example (continued 5)
[Figure: the same walk-through, next step, now for address A1.]

78 Example (continued 6)
[Figure: the same walk-through, final step, for address A1.]

79 Implementing a directory
We assume operations are atomic, but they are not; reality is much harder; we must avoid deadlock when the network runs out of buffers.
Optimization: on a read miss or write miss to a block in the Exclusive state, send the data directly from the owner to the requestor, rather than first to memory and then from memory to the requestor.

80 Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data.
Issues for synchronization: an uninterruptible instruction to fetch and update memory (an atomic operation); user-level synchronization operations built from this primitive; for large-scale MPs, synchronization can be a bottleneck, so techniques are needed to reduce the contention and latency of synchronization.

81 Locks or Semaphores E. W. Dijkstra, 1965
A semaphore is a non-negative integer with the following operations:
P(s): if s > 0, decrement s by 1, otherwise wait.
V(s): increment s by 1 and wake up one of the waiting processes.
P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.
Process i: P(s); <critical section>; V(s)
The initial value of s determines the maximum number of processes in the critical section.
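The same pattern with POSIX semaphores (a usage sketch added here, not from the lecture): an initial value of 1 makes the semaphore a mutual-exclusion lock, and a larger initial value would admit that many threads into the critical section at once.

    #include <semaphore.h>

    static sem_t s;
    static long shared_total;

    void worker(void) {
        sem_wait(&s);           /* P(s): decrement, or block while s == 0 */
        shared_total += 1;      /* critical section */
        sem_post(&s);           /* V(s): increment and wake one waiter */
    }

    int main(void) {
        sem_init(&s, 0, 1);     /* initial value 1 => at most one thread inside */
        worker();
        sem_destroy(&s);
        return 0;
    }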

82 Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, such protocols for mutual exclusion are difficult to design...
Simpler solution: atomic read-modify-write instructions.
Examples (m is a memory location, R is a register):
Test&Set (m), R:       R <- M[m]; if R == 0 then M[m] <- 1;
Fetch&Add (m), RV, R:  R <- M[m]; M[m] <- R + RV;
Swap (m), R:           Rt <- M[m]; M[m] <- R; R <- Rt;
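One way this maps onto a modern language (a sketch added here, not part of the lecture): C11's atomic_flag provides a test-and-set primitive, from which a simple spin lock giving P/V-style mutual exclusion follows directly.

    #include <stdatomic.h>

    static atomic_flag mutex = ATOMIC_FLAG_INIT;

    void lock(void) {
        /* atomic test-and-set: sets the flag and returns its previous value */
        while (atomic_flag_test_and_set(&mutex))
            ;                               /* spin until the flag was clear */
    }

    void unlock(void) {
        atomic_flag_clear(&mutex);          /* release: let one spinner succeed */
    }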

83 Using the Test&Set Instruction
P:    Test&Set (mutex), Rtemp
      if (Rtemp != 0) goto P
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
V:    Store (mutex), 0
      process(R)
The code between P and V is the critical section. Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's.
What if the process stops or is swapped out while in the critical section?

84 Nonblocking Synchronization
Compare&Swap (m), Rt, Rs: if (Rt == M[m]) then M[m] = Rs; Rs = Rt; status = success; else status = fail;   (status is an implicit argument)
try:  Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rnewhead = Rhead + 1
      Compare&Swap (head), Rhead, Rnewhead
      if (status == fail) goto try
      process(R)
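The same retry idiom written with C11 atomics (a sketch added here, not from the slides): atomically advance a shared head index only if no other thread changed it since it was read; the names are illustrative and the queue-empty test is omitted.

    #include <stdatomic.h>

    static _Atomic long head;

    /* Atomically claim the current head slot and advance head by one. */
    long claim_next(void) {
        long old = atomic_load(&head);
        /* If head still equals old, store old + 1 and stop; otherwise old is
           reloaded with the current value and we retry, like "goto try". */
        while (!atomic_compare_exchange_weak(&head, &old, old + 1))
            ;
        return old;                         /* the slot this thread claimed */
    }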

85 Load-reserve & Store-conditional
Special register(s) hold the reservation flag and address, and the outcome of the store-conditional.
Load-reserve R, (m):        <flag, adr> <- <1, m>; R <- M[m];
Store-conditional (m), R:   if <flag, adr> == <1, m> then cancel other processors' reservations on m; M[m] <- R; status = succeed; else status = fail;
try:  Load-reserve Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store-conditional (head), Rhead
      if (status == fail) goto try
      process(R)

86 Performance of locks
Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap), versus non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional), versus protocols based on ordinary Loads and Stores.
Performance depends on several interacting factors: the degree of contention, caches, and out-of-order execution of Loads and Stores. More later...

87 Another multiprocessor issue: the memory consistency model
What is consistency? When must a processor see a value that has been updated by another processor? For example:
P1: A = 0;               P2: B = 0;
    .....                    .....
    A = 1;                   B = 1;
L1: if (B == 0) ...      L2: if (A == 0) ...
It seems impossible for both if statements L1 and L2 to be true. But what if a write invalidate is delayed and the processor continues?
Memory consistency models: what are the rules for such cases?
Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => the assignments above happen before the ifs.
SC: delay all memory accesses until all invalidates are done.
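A runnable version of the two-flag experiment (a sketch added here, not from the slides): with the default seq_cst atomics, both reads cannot both return 0; with relaxed ordering or plain non-atomic variables, real hardware can and does produce r1 == 0 and r2 == 0. The variable names follow the slide.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int A, B;
    static int r1, r2;

    static void *p1(void *arg) {
        atomic_store(&A, 1);                /* A = 1 */
        r1 = atomic_load(&B);               /* L1: read B */
        return NULL;
    }

    static void *p2(void *arg) {
        atomic_store(&B, 1);                /* B = 1 */
        r2 = atomic_load(&A);               /* L2: read A */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        atomic_store(&A, 0);
        atomic_store(&B, 0);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Under sequential consistency (memory_order_seq_cst, the default),
           r1 == 0 && r2 == 0 is impossible; under relaxed ordering it is not. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }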

88 Memory consistency models
There are schemes that allow faster execution than sequential consistency.
This is not really an issue for most programs, because they are synchronized: a program is synchronized if all accesses to shared data are ordered by synchronization operations, e.g.
    write (x)
    release (s) {unlock}
    ...
    acquire (s) {lock}
    read (x)
Only programs willing to be nondeterministic are not synchronized: a “data race”, where the outcome is a function of processor speed.
Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude toward RAR, WAR, RAW, and WAW orderings to different addresses.

89 Sequential Consistency
Sequential concurrent tasks T1 and T2; shared variables X and Y (initially X = 0, Y = 10).
T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)
What are the legitimate answers for X' and Y'? Is (X', Y') in {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.

90 Sequential Consistency A Memory Model
[Figure: processors P1…Pn issuing memory references to a single shared memory.]
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.” (Leslie Lamport)
Sequential consistency = an arbitrary order-preserving interleaving of the memory references of sequential programs.

91 Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?
T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)
(The additional SC requirements are the cross-task orderings shown in the figure.)
Does (or can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? More on this later.

92 Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues.
Out-of-order execution capability:
    Load(a); Load(b):   yes
    Load(a); Store(b):  yes if a ≠ b
    Store(a); Load(b):  yes if a ≠ b
    Store(a); Store(b): yes if a ≠ b
Caches: caches can prevent the effect of a store from being seen by other processors.

