1 Advanced Computer Architecture: Multiprocessor Systems (Lecture 8). Cheng Xu (程旭), May 16, 2011

2 Uniprocessor Performance (SPECint)
[Figure: uniprocessor performance (SPECint). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. VAX: 25%/year, 1978 to 1986; RISC + x86: 52%/year, 1986 to 2002; RISC + x86: ??%/year, 2002 to present; a 3X gap is marked on the plot.]

3 Déjà vu all over again?
“… today’s processors … are nearing an impasse as technologies approach the speed of light.” David Mitchell, The Transputer: The Time Is Now (1989)
The Transputer had bad timing (uniprocessor performance kept improving) => procrastination rewarded: 2X sequential performance every 1.5 years.
“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2005)
All microprocessor companies have switched to MP (2X CPUs every 2 years) => procrastination penalized: 2X sequential performance every 5 years.
Manufacturer/Year: AMD/'07, Intel/'07, IBM/'07, Sun/'07. Processors/chip: 4, 2, 8. Threads/Processor: 1. Threads/chip: 64.

4 Other Factors => Multiprocessors
Growth in data-intensive applications: databases, file servers, …
Growing interest in servers and server performance.
Increasing desktop performance is less important (outside of graphics).
Improved understanding of how to use multiprocessors effectively, especially servers, where there is significant natural TLP.
Advantage of leveraging a design investment by replication, rather than a unique design.

5 Parallel computers
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Almasi and Gottlieb, Highly Parallel Computing, 1989
Some questions about parallel computers:
Resource allocation: how many processing elements? How powerful is each processing element? How much memory?
Data access, communication, and synchronization: how do the processing elements cooperate and communicate? What kind of interconnect? How is data transferred between processors? What primitives does the programmer use?
Performance and scalability: how do the factors above affect performance? How is scalability supported?

6 The parallel-processing “faith”
A dream of computer designers since the 1960s: raise performance by adding processors, versus designing a faster single processor.
Because “the uniprocessor cannot keep improving,” many machine organizations tied to specific programming models were created; for example, the argument that, being limited by the speed of light, uniprocessor speed would stop improving: 1972, …, 1989.
Almost religious fervor: you had to believe!
The fervor cooled in the 1990s, when well-known companies such as Thinking Machines and Kendall Square left the commercial arena.
The argument then became the “pull” of scalable-performance opportunity rather than the “push” of stagnant uniprocessor performance.

7 What level of parallelism?
Bit-level parallelism: roughly 1970 to 1985 (4-bit, 8-bit, 16-bit, and 32-bit processors).
Instruction-level parallelism (ILP): 1985 to today: pipelining, superscalar execution, VLIW, out-of-order execution; what are the limits of ILP?
Process-level or thread-level parallelism: can it become the mainstream of general-purpose computing?

8 Multiprocessors and Multicomputer Clusters
MIMD
Multiprocessors (single address space with shared memory):
  UMA (Uniform Memory Access): PVP (Cray T90); SMP (Intel SHV, Sun, SGI, IBM)
  NUMA (Non-Uniform Memory Access): COMA (KSR-1); CC-NUMA (Stanford Dash, SGI Origin, Sequent NUMA, HP Exemplar); NCC-NUMA (Cray T3E, etc.); DSM or SC-NUMA (TreadMarks, Wind Tunnel, Shrimp)
Multicomputers (multiple address spaces with no remote memory access):
  NORMA (NO Remote Memory Access): MPP (Intel Option Red, IBM Blue Pacific, SGI/Cray Blue Mountain); Cluster (IBM SP2, TruCluster, Solaris MC, Tandem Himalaya, Wolfpack, NOW, PearlCluster)
Acronyms: UMA = Uniform Memory Access; NUMA = Non-Uniform Memory Access; NORMA = NO Remote Memory Access; PVP = Parallel Vector Processor; SMP = Symmetric Multiprocessor; COMA = Cache-Only Memory Access; CC-NUMA = cache-coherent NUMA; NCC-NUMA = non-cache-coherent NUMA; SC-NUMA = software-coherent NUMA

9 Parallel architecture
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
A parallel architecture extends the conventional architecture with a communication architecture: abstractions (the hardware/software interface) and the organizational structures that implement those abstractions efficiently.

10 History of parallel architectures
Historically, parallel architectures were tightly coupled to programming models: a large number of divergent structures appeared (systolic arrays, SIMD, message passing, dataflow, shared memory), cutting across application software, system software, and architecture, and there was no way to predict how things would develop.
Uncertainty about the direction of development seriously hindered the development of parallel software!

11 Two Models for Communication and Memory Architecture
Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (aka multicomputers). Modern cluster systems contain multiple stand-alone computers communicating via messages.
Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either UMA (Uniform Memory Access time) for shared-address, centralized-memory MPs, or NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MPs.
In the past there was confusion over whether “sharing” means sharing physical memory (symmetric MP) or sharing an address space.

12 Centralized vs. Distributed Memory
[Figure: centralized memory (processors P1…Pn, each with a cache, sharing memory through an interconnection network) versus distributed memory (a memory attached to each processor/cache node, with the nodes joined by an interconnection network); the “Scale” arrow points from the centralized toward the distributed organization.]

13 Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors. Large caches => a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases.

14 Distributed Memory Multiprocessor
Pro: cost-effective way to scale memory bandwidth, if most accesses are to local memory.
Pro: reduces the latency of local memory accesses.
Con: communicating data between processors is more complex.
Con: software must be aware of data placement to take advantage of the increased memory bandwidth.

15 Challenges of Parallel Processing
The big challenge is the fraction of a program that is inherently sequential. What does it mean to be inherently sequential?
Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential? (a) 10%  (b) 5%  (c) 1%  (d) <1%
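Working the numbers with Amdahl's law (the standard way to answer this question; the derivation below is added here and is not on the slide): with sequential fraction f and N processors,

    Speedup = 1 / (f + (1 - f)/N)
    80 = 1 / (f + (1 - f)/100)  =>  f + (1 - f)/100 = 1/80  =>  99 f = 0.25  =>  f ≈ 0.25%

So the sequential fraction must be under 1%, i.e., the last answer.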

16 Symmetric Multiprocessors
[Figure: processors, each with a cache, on a CPU-memory bus with main memory and a bridge to an I/O bus carrying I/O controllers, graphics output, and networks.]
Symmetric: all memory is equally far away from all processors, and any processor can do any I/O (e.g., set up a DMA transfer).

17 Parallel architecture today
Extend “computer architecture” to support communication and cooperation.
Old definition: instruction set architecture.
New definition: communication architecture: the critical abstractions, boundaries, and primitives (interfaces), plus the organizational structures (hardware or software) that implement those interfaces.
Today, compilers, libraries, and the operating system are important bridges.

18 Layered framework of modern parallel computer systems
[Figure: layers from top to bottom: parallel applications (CAD, databases, scientific modeling, multiprogramming); programming models (multiprogramming, shared address, message passing, data parallel); compiler or library; communication abstraction (the user/system boundary); operating system support; communication hardware (the hardware/software boundary); physical communication medium.]

19 Parallel frameworks
Layers: programming model and communication abstraction.
Programming models: multiprogramming (many jobs, no communication between them); shared address space (communicate through memory); message passing (send and receive messages); data parallel (multiple agents operate on different data sets simultaneously, then exchange information globally and simultaneously, via shared memory or message passing).
Communication abstraction: shared address space (e.g., load, store, atomic swap); message passing (e.g., library calls such as send and receive).
There is an ongoing debate about this topic (ease of programming vs. scalability).

20 Shared-address model summary
Each processor can name every physical location in the machine; each process can name all the data it shares with other processes.
Data is transferred via loads and stores. Data sizes: byte, word, …, or cache block.
Virtual memory maps virtual addresses to local or remote physical addresses.
The memory-hierarchy model requires that communication move data into the local processor's cache (just as a load moves data from memory into the cache).
For communication: what about latency, bandwidth, and scalability?
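A minimal sketch (added here, not from the slides) of what “data transferred via loads and stores” looks like in practice: two threads of one process communicate through an ordinary shared array. The variable and function names are illustrative only.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_data[4];              /* lives in the single shared address space */

    static void *producer(void *arg) {
        for (int i = 0; i < 4; i++)
            shared_data[i] = i * i;         /* communication is just a store */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(&t, NULL);             /* join orders the stores above before the loads below */
        for (int i = 0; i < 4; i++)
            printf("%d\n", shared_data[i]); /* communication is just a load */
        return 0;
    }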

21 Shared address space model
[Figure: each process P0…Pn has a virtual address space with a private portion (private code, stack, and heap) and a shared portion; the shared portions of all the processes map onto a common region of the machine's physical address space, so a store by one process to a shared address is seen by loads in the other processes.]
Communication among the processes of a group happens through the shared addresses.

22 Shared-address/shared-memory multiprocessor model
Communication happens through loads and stores. This is the oldest and most widely used model.
Based on timesharing: multiple processes on multiple processors, versus sharing a single processor.
A process: a single virtual address space with one or more threads of control. Multiple processes may overlap (share), but all threads share the single address space of their process.
A write to the shared address space by one thread is visible to reads by the other threads.
Usual model: shared code, private stacks, some shared heap, some private heap.

23 Example: small-scale multiprocessor design
Memory: uniform access time (UMA), bus interconnect, I/O.
Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro.
[Figure: processors, each with one or more levels of cache, on a shared bus with main memory and the I/O system.]

24 Scalability
[Figure: “dance-hall” organization (all processors and caches on one side of the interconnection network, all memory modules on the other) versus a distributed organization (a memory attached to each processor/cache node).]
The interconnection network is the problem: cost (crossbar) or bandwidth (bus).
Dance hall: bandwidth is still scalable at lower cost than a crossbar, but the latency to memory is uniform, and uniformly the maximum.
Distributed memory, i.e., non-uniform memory access (NUMA): build a shared address space out of simple message transactions on a general-purpose network (e.g., read-request, read-response).
Cache shared (especially non-local) data?

25 SMP interconnect
Processors connect to memory and to I/O.
Bus-based: all memory locations have the same access time, hence SMP = symmetric multiprocessor; as processors and I/O are added, sharing limits the bandwidth.
Crossbar: expensive to scale.
Multistage networks: cheaper to scale than a crossbar of comparable bandwidth.
“Dance-hall” design: all processors on one side of the network, all memory on the other.

26 Large-scale multiprocessor design
Note: distribution (distributed memory) is achieved with non-uniform access time (NUMA) and a scalable interconnection network.
[Figure: nodes of processor + cache, memory, and I/O system connected by an interconnection network; approximate latencies: 1 cycle to the cache, 40 cycles to local memory, 100 cycles across the network.]
Example: Cray T3E: 480 MB/sec per link, 3 links per node, memory on the node, switch based, up to 2048 nodes, $30M to $50M.

27 Message-passing model
Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations; essentially NUMA, but integrated at the I/O devices rather than in the memory system.
Send specifies a local buffer + the receiving process on the remote computer.
Receive specifies the sending process on the remote computer + a local buffer in which to place the data.
Typically, send includes a process tag and receive has a matching rule based on that tag: match exactly one, or match any.
Synchronization: when a send completes, when the buffer is free, when the request is accepted, receive waits for a send.
Send + receive => a memory-to-memory copy, where each primitive supplies a local address, with pairwise synchronization!
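A minimal message-passing sketch in MPI (MPI itself is introduced later, on the convergence slide; the buffer name and tag value here are illustrative): process 0 sends an array to process 1, which receives it into its own local buffer, exactly the send/receive pairing described above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send: local buffer + receiving process (rank 1), tag 99 */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int buf[4];
            /* Receive: sending process (rank 0) + local buffer for the data */
            MPI_Recv(buf, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 got %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
        }
        MPI_Finalize();
        return 0;
    }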

28 Message-passing abstraction
[Figure: process P executes Send X, Q, t (local address X, destination process Q, tag t); process Q executes Receive Y, P, t (local address Y, source process P, tag t); the matching send/receive pair copies the data from address X in P's local address space to address Y in Q's local address space.]

29 Message-passing model (continued)
Send + receive => a memory-to-memory copy even when running on a single processor, with synchronization through the operating system.
History of message passing: network topology mattered greatly because data could only be sent to the nearest neighbors; typical synchronization was blocking send and receive.
Later, DMA with non-blocking sends: DMA keeps received data in a buffer until the processor actually performs the receive, then the data is moved into local memory.
Later, software libraries allowed arbitrary (any-to-any) communication.
Example: IBM SP-2: rack-mounted RS6000 workstations; a network interface card using an Intel 960; an 8x8 crossbar as the basic communication building block; 40 MByte/sec per link.

30 Communication models
Shared memory: processors communicate through a shared address space; easy to implement on small machines. Advantages: the model of choice for uniprocessors and small-scale multiprocessors; ease of programming; low latency; easy to use hardware-controlled caching.
Message passing: processors have private memories and communicate via messages; needs less hardware and is easier to design; focuses attention on the costly non-local operations.
Either software model can be supported on either kind of hardware.

31 The popular Flynn taxonomy
SISD (Single Instruction, Single Data): the uniprocessor.
MISD (Multiple Instruction, Single Data): ???
SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2; simple programming model, low overhead, flexible, all custom integrated circuits.
MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D, SGI Origin; built from commodity microprocessors.

32 Data-parallel model
The same operation is applied in parallel to every element of a large, regular data structure (e.g., an array).
One control processor broadcasts to many PEs; data is distributed across the per-PE memories.
Early 1980s: VLSI => a SIMD revival: 32 one-bit PEs plus on-chip memory per chip.
Data-parallel programming languages specify the layout of data across the processors.
[Figure: a control processor broadcasting to an array of PEs, each with its own memory.]

33 Data-parallel model (continued)
Vector processors have a similar ISA, but without restrictions on data placement.
SIMD gave rise to the data-parallel programming languages.
Advances in VLSI produced single-chip FPUs and whole fast microprocessors, weakening the appeal of SIMD.
The SIMD programming model evolved into the Single Program Multiple Data (SPMD) model: all processors execute the same program.
Data-parallel programming languages are still useful, and perform all communication at once: bulk synchronous: phases in which all communication completes at a global barrier.

34 Convergence of parallel architectures
Connect complete computers to a scalable network through a communication assist, with memory distributed across the nodes. Different programming models place different requirements on the communication assist.
Shared address space: integrate tightly with memory to capture the memory events that involve interaction with other processors, and to accept requests from other nodes.
Message passing: send messages quickly and respond to incoming messages: compare tags, allocate buffers, transfer data, wait until a receive is posted.
Data parallel: fast global synchronization.
High Performance Fortran (HPF): shared memory, data parallel. Message Passing Interface (MPI): a message-passing library. Both run on many machines and have multiple implementations.

35 Fundamental issues
Three issues characterize a parallel machine: naming, synchronization, and latency and bandwidth.

36 Fundamental issue #1: naming
Naming: which data are shared; how they are addressed; which operations can access the data; how processes refer to one another.
The choice of naming affects the code the compiler produces: with loads, an address is remembered; with message passing, the compiler must track processor numbers and local virtual addresses.
The choice of naming affects data replication: replication through the cache hierarchy on loads, or replication and coherence in software.

37 Fundamental issue #1: naming (continued)
Global physical address space: any processor can generate the address and access the location in a single operation; the memory can be anywhere: virtual address translation handles it.
Global virtual address space: if the address space of each process can be configured to contain all the shared data of the parallel program.
Segmented shared address space: locations are named <process number, address>, uniformly for all processes of the parallel program.

38 Fundamental issue #2: synchronization
To cooperate, processes must coordinate.
Message passing provides implicit coordination with the transmission or arrival of data.
A shared address space => extra operations are needed for explicit coordination: e.g., write a flag, wake up a thread, interrupt a processor.

39 Fundamental issue #3: latency and bandwidth
Bandwidth: communication needs high bandwidth; it is limited by the network, the memory, and the processor; the overhead of communication is a major problem in many machines.
Latency: affects performance, since the processor may have to wait; affects ease of programming, since overlapping communication with computation requires thinking about many issues.
Latency hiding: how can a mechanism help hide latency? Examples: overlap message sends with computation, prefetch data, switch to other tasks.

40 Small-scale multiprocessors
[Figure: processors, each with one or more levels of cache, on a shared bus with main memory and the I/O system.]
What caches buy us: more bandwidth than the bus/memory alone; lower access latency; very effective for both private and shared data.
What about cache coherence?

41 The cache-coherence problem
[Figure: two processors cache the same location; (b) one processor writes its copy, leaving the other cached copy stale; (c) even after memory is updated, the stale copy remains in the other cache.]

42 What does coherence mean?
Informal: any read must return the result of the most recent write. Too strict, and too hard to implement.
Better: any write must eventually be seen by some read; all writes are seen in the proper order (serialization).
Two rules guarantee this:
If P writes x and P1 reads x, and the read and write are sufficiently far apart, then P's write will be seen by P1.
Writes to a single location are serialized: they are seen in one order, so the latest write will be seen; otherwise writes could be seen in an illogical order (an older value seen after a newer value was written).

43 Possible hardware coherence solutions
Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see whether they have a copy and respond accordingly; requires broadcast, since the caching information lives at the processors; works well with a bus (a natural broadcast medium); dominates for small-scale machines (most of the market).
Directory-based schemes: keep track of what is being shared in one centralized place; distributed memory => a distributed directory, for scalability (avoids bottlenecks); send point-to-point requests to processors over the network; scales better than snooping; actually existed BEFORE snooping-based schemes.

44 Basic snooping protocols
Write-invalidate protocol: multiple readers, a single writer. A write to shared data sends an invalidate to all caches, which snoop and invalidate any copies. Read miss: with write-through, memory is always up-to-date; with write-back, snoop the caches to find the most recent copy.
Write-broadcast protocol (typically write-through): a write to shared data is broadcast on the bus; processors snoop and update any copies. Read miss: memory is always up-to-date.
Write serialization: the bus serializes requests! The bus is the single point of arbitration.

45 An example snooping protocol
Invalidation protocol, write-back caches.
Each block of memory is in one state: clean in all caches and up-to-date in memory (Shared); or dirty in exactly one cache (Exclusive); or not in any cache.
Each cache block is in one state (tracked per block): Shared: the block can be read; or Exclusive: this cache has the only copy, it is writable, and it is dirty; or Invalid: the block contains no data.
Read misses cause all caches to snoop the bus; writes to a clean line are treated as misses.
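A compact sketch (added here, not part of the lecture) of the processor-side transitions that the next three slides draw as state diagrams, using the three block states just defined. The helper functions are hypothetical stubs; bus-side (snooping) transitions are omitted.

    /* Sketch of the CPU-side transitions of the invalidation protocol above. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } CacheState;

    static void place_read_miss_on_bus(void)  { /* bus transaction stub */ }
    static void place_write_miss_on_bus(void) { /* bus transaction stub */ }
    static void write_back_block(void)        { /* write the dirty block to memory */ }

    CacheState cpu_read(CacheState s) {
        if (s == INVALID) {                  /* read miss */
            place_read_miss_on_bus();
            return SHARED;
        }
        return s;                            /* read hit in Shared or Exclusive */
    }

    CacheState cpu_write(CacheState s) {
        if (s != EXCLUSIVE)                  /* a write to a clean or absent line is a miss */
            place_write_miss_on_bus();       /* this also invalidates the other copies */
        return EXCLUSIVE;                    /* the line is now writable and dirty */
    }

    CacheState cpu_evict(CacheState s) {
        if (s == EXCLUSIVE)
            write_back_block();              /* only a dirty copy needs to be written back */
        return INVALID;
    }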

46 Snoopy-cache state machine I
State machine for CPU requests, for each cache block.
[Figure: Invalid --CPU read, place read miss on bus--> Shared; Invalid --CPU write, place write miss on bus--> Exclusive; Shared --CPU read hit--> Shared; Shared --CPU read miss, place read miss on bus--> Shared; Shared --CPU write, place write miss on bus--> Exclusive; Exclusive --CPU read hit / CPU write hit--> Exclusive; Exclusive --CPU read miss: write back the block, place read miss on bus--> Shared; Exclusive --CPU write miss: write back the cache block, place write miss on bus--> Exclusive.]
(Speaker note: from Invalid, a read goes to Shared and a write goes to dirty; the Shared case looks the same.)

47 Snoopy-cache state machine II
State machine for bus requests, for each cache block.
[Figure: a write miss on the bus for this block moves Shared to Invalid, and moves Exclusive to Invalid after writing back the block (aborting the memory access); a read miss on the bus for this block moves Exclusive to Shared after writing back the block (aborting the memory access).]

48 Snooping cache: combined state machine
CPU requests and bus requests together, for each cache block.
[Figure: Invalid --CPU read, place read miss on bus--> Shared; Invalid --CPU write, place write miss on bus--> Exclusive; Shared --CPU read hit--> Shared; Shared --CPU read miss, place read miss on bus--> Shared; Shared --CPU write, place write miss on bus--> Exclusive; Shared --remote write, or miss due to an address conflict--> Invalid; Exclusive --CPU read hit / CPU write hit--> Exclusive; Exclusive --remote read: write back the block--> Shared; Exclusive --remote write, or miss due to an address conflict: write back the block--> Invalid; Exclusive --CPU write miss: write back the cache block, place write miss on bus--> Exclusive.]

49 Example
Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block but A1 ≠ A2.
[Figure: step-by-step walk through the snooping-protocol state diagram above.]

50 Example (continued 1)
[Figure: the same walk-through, with the next transition highlighted (active arrow).]

51 Example (continued 2)
[Figure: the same walk-through, next transition highlighted.]

52 Example (continued 3)
[Figure: the same walk-through, next transition highlighted.]

53 Example (continued 4)
[Figure: the same walk-through, next transition highlighted.]

54 Example (continued 5)
[Figure: the same walk-through, final transition highlighted.]
Speaker note: why handle the write miss first? In general a write touches only part of a block, so the block may need to be read first to obtain a full block; the write-back is a low-priority event.
Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block but A1 ≠ A2.

55 Implementation complications
Write races: a cache cannot be updated until the bus is obtained; otherwise another processor may get the bus first and write the same cache block! Two-step process: (1) arbitrate for the bus; (2) place the miss on the bus and complete the operation. If a miss to the block occurs while waiting for the bus, handle the miss (an invalidate may be needed) and then restart.
Split-transaction bus: a bus transaction is not atomic, so there can be multiple outstanding transactions for a block; multiple misses can interleave, allowing two caches to grab the block in the Exclusive state; the protocol must track and prevent multiple misses for one block.
Must support interventions and invalidations.

56 Cache State Transition Diagram The MSI protocol
M: Modified, S: Shared, I: Invalid. Each cache line holds state bits alongside the address tag.
Cache state in processor P1:
[Figure: I --read miss--> S; I --write miss--> M; S --P1 intent to write--> M; M --other processor reads, P1 writes back--> S; M or S --other processor intent to write--> I; M --P1 reads or writes--> M; S --read by any processor--> S.]

57 MESI: an enhanced MSI protocol with increased performance for private data
M: Modified Exclusive, E: Exclusive unmodified, S: Shared, I: Invalid. Each cache line holds state bits alongside the address tag.
Cache state in processor P1:
[Figure: I --read miss, not shared--> E; I --read miss, shared--> S; I --write miss--> M; E --P1 write--> M; E or S --P1 intent to write--> M; M --other processor reads, P1 writes back--> S; M, E, or S --other processor intent to write--> I; E --P1 read--> E; M --P1 write or read--> M; S --read by any processor--> S.]

58 Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of: (1) uniprocessor cache miss traffic, and (2) traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a 4th C, coherence misses, joining compulsory, capacity, and conflict.

59 Coherency misses
True sharing misses arise from the communication of data through the cache-coherence mechanism: invalidates due to the first write to a shared block, and reads by another CPU of a block modified in a different cache; these misses would still occur if the block size were one word.
False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written: the invalidation does not communicate a new value, it only causes an extra cache miss; the block is shared, but no word in it is actually shared, so the miss would not occur if the block size were one word.
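A small sketch (added here, not from the slides) that makes false sharing concrete: two threads update different counters that happen to share a cache block, so each write invalidates the other core's copy even though no word is logically shared. Padding each counter to its own block removes the coherence misses; the 64-byte block size is an assumption.

    #include <pthread.h>

    #define BLOCK 64                              /* assumed cache-block size in bytes */

    struct padded { long count; char pad[BLOCK - sizeof(long)]; };

    static long shared_counters[2];               /* adjacent: false sharing */
    static struct padded padded_counters[2];      /* one block each: no false sharing */

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 10000000; i++) {
            shared_counters[id]++;                /* invalidates the other core's copy */
            padded_counters[id].count++;          /* private block: no invalidations */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        return 0;
    }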

60 MP Performance 4 Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine
True sharing and false sharing misses are essentially unchanged going from a 1 MB to an 8 MB L3 cache, while uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as the cache size increases.

61 MP Performance 2MB Cache Commercial Workload: OLTP, Decision Support (Database), Search Engine
True sharing, false sharing increase going from 1 to 8 CPUs

62 Larger multiprocessor systems
Separate memory per processor; local or remote access goes through the memory controller.
Cache-coherence solution 1: non-cached pages.
Alternative: a directory per cache that tracks the state of every block in every cache: which caches have copies of the block, dirty vs. clean, and so on.
Keep the information per memory block or per cache block?
PLUS: in memory => a simpler protocol (centralized, one location).
MINUS: in memory => the directory size is f(memory size) rather than f(cache size).
To prevent the directory from becoming a bottleneck, distribute the directory entries with the memory, each keeping track of which processors have copies of its blocks.

63 Distributed-directory multiprocessor
[Figure: nodes consisting of processor + cache, memory, a distributed directory, and an I/O system, connected by an interconnection network; each node's directory tracks the sharing state of the blocks in its local memory.]

64 Directory protocol
Similar to the snoopy protocol: three states.
Shared: one or more processors have the data, and memory is up-to-date.
Uncached: no processor has it; not valid in any cache.
Exclusive: one processor (the owner) has the data; memory is out-of-date.
In addition to the cache state, the directory must track which processors have the data when it is in the Shared state (usually a bit vector: bit i is 1 if processor i has a copy).
Keep it simple(r): writes to non-exclusive data => a write miss; the processor blocks until the access completes; assume messages are received and acted upon in the order they were sent.
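A minimal data-structure sketch (added here, not from the slides) of a directory entry as just described: a state plus a sharer bit vector, sized here for up to 64 processors; the field and helper names are illustrative.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;      /* bit i set => processor i has a copy (or is the owner) */
    } DirEntry;                /* one entry per memory block */

    static inline void add_sharer(DirEntry *e, int p)      { e->sharers |= (uint64_t)1 << p; }
    static inline void clear_sharers(DirEntry *e)          { e->sharers = 0; }
    static inline int  is_sharer(const DirEntry *e, int p) { return (int)((e->sharers >> p) & 1); }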

65 Directory protocol (continued)
There is no bus, and we don't want to broadcast: the interconnect is no longer a single arbitration point, and all messages have explicit responses.
Terms (typically three processors are involved):
Local node: where the request originates.
Home node: where the memory location of the address resides.
Remote node: has a copy of the cache block, whether exclusive or shared.
Example messages on the next slide: P = processor number, A = address.

66 Directory protocol messages
Read miss: local cache -> home directory; content P, A. Processor P reads data at address A; make P a read sharer and arrange to send the data back.
Write miss: local cache -> home directory; content P, A. Processor P writes data at address A; make P the exclusive owner and arrange to send the data back.
Invalidate: home directory -> remote caches; content A. Invalidate a shared copy at address A.
Fetch: home directory -> remote cache; content A. Fetch the block at address A and send it to its home directory.
Fetch/Invalidate: home directory -> remote cache; content A. Fetch the block at address A, send it to its home directory, and invalidate the block in the cache.
Data value reply: home directory -> local cache; content: data. Return a data value from the home memory (the read-miss response).
Data write-back: remote cache -> home directory; content A, data. Write back a data value for address A (the invalidate response).

67 State-transition diagram for an individual cache block in a directory-based system
The states are identical to the snoopy case, and the transactions are very similar.
Transitions are caused by read misses, write misses, invalidates, and data-fetch requests.
The cache generates read-miss and write-miss messages to the home directory; the write misses that were broadcast on the bus for snooping become explicit invalidate and data-fetch requests.

68 CPU-cache state machine
State machine for CPU requests, for each memory block; a block present only in memory is in the Invalid state.
[Figure: Invalid --CPU read: send read-miss message--> Shared; Invalid or Shared --CPU write: send write-miss message to the home directory--> Exclusive; Shared --invalidate, or miss due to an address conflict--> Invalid; Exclusive --fetch/invalidate, or miss due to an address conflict: send data write-back message to the home directory--> Invalid; Exclusive --fetch: send data write-back message to the home directory--> Shared; CPU read hits and write hits stay in the current state.]

69 State-transition diagram for the directory
Same states and structure as the transition diagram for an individual cache.
Two actions: update the directory state, and send messages to satisfy requests.
The directory tracks all copies of each memory block; it also indicates actions that update the sharing set, Sharers, in addition to sending a message.

70 Directory state machine
State machine for directory requests, for each memory block; a block present only in memory is in the Uncached state.
[Figure: Uncached --read miss: Sharers = {P}; send Data Value Reply--> Shared; Uncached --write miss: Sharers = {P}; send Data Value Reply--> Exclusive; Shared --read miss: Sharers += {P}; send Data Value Reply--> Shared; Shared --write miss: send Invalidate to Sharers, then Sharers = {P}; send Data Value Reply--> Exclusive; Exclusive --data write-back: Sharers = {}; write back the block--> Uncached; Exclusive --read miss: Sharers += {P}; send Fetch to the owner; send Data Value Reply to the requesting cache; write back the block--> Shared; Exclusive --write miss: Sharers = {P}; send Fetch/Invalidate to the old owner; send Data Value Reply to the requesting cache; write back the block--> Exclusive.]

71 Directory protocol example
A message sent to the directory causes two actions: update the directory, and send further messages to satisfy the request.
Block in the Uncached state (the copy in memory is the current value); the only possible requests for that block are:
Read miss: the requesting processor is sent the data from memory and is made the only sharing node; the state of the block becomes Shared.
Write miss: the requesting processor is sent the value and becomes the sharing node; the block is made Exclusive to indicate that the only valid copy is cached; Sharers records the identity of the owner.
Block in the Shared state (the memory value is up-to-date):
Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
Write miss: the requesting processor is sent the value; all processors in the set Sharers are sent invalidate messages; Sharers is set to the identity of the requesting processor; the state of the block becomes Exclusive.

72 Directory protocol example (continued)
Block in the Exclusive state: the current value of the block is held in the cache of the processor identified by Sharers (the owner), so there are three possible directory requests:
Read miss: the owner is sent a data-fetch message, which causes the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor; the identity of the requesting processor is added to Sharers, which still contains the identity of the former owner (since it still has a readable copy); the state becomes Shared.
Data write-back: the owner is replacing the block and therefore must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached and the Sharers set is empty.
Write miss: the block has a new owner; a message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner; Sharers is set to the identity of the new owner, and the state of the block becomes Exclusive.

73 Example
Processor 1, Processor 2, interconnect, directory, memory. A1 and A2 map to the same cache block.
[Figure: step-by-step walk through the directory protocol.]

74 Example (continued 2)
[Figure: the same walk-through, next step.]

75 Example (continued 3)
[Figure: the same walk-through, next step.]

76 Example (continued 4)
[Figure: the same walk-through, next step, including a write-back.]

77 Example (continued 5)
[Figure: the same walk-through, next step, now for address A1.]

78 Example (continued 6)
[Figure: the same walk-through, final step, for address A1.]

79 Implementing a directory
We assume operations are atomic, but they are not; reality is much harder; we must avoid deadlock when the network runs out of buffers.
Optimization: on a read miss or write miss to a block in the Exclusive state, send the data directly from the owner to the requestor, rather than first to memory and then from memory to the requestor.

80 Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data.
Issues for synchronization: an uninterruptible instruction to fetch and update memory (an atomic operation); user-level synchronization operations built from this primitive; for large-scale MPs, synchronization can be a bottleneck, so techniques are needed to reduce the contention and latency of synchronization.

81 Locks or Semaphores E. W. Dijkstra, 1965
A semaphore is a non-negative integer with the following operations:
P(s): if s > 0, decrement s by 1, otherwise wait.
V(s): increment s by 1 and wake up one of the waiting processes.
P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.
Process i: P(s); <critical section>; V(s)
The initial value of s determines the maximum number of processes in the critical section.
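The same pattern with POSIX semaphores (a usage sketch added here, not from the lecture): an initial value of 1 makes the semaphore a mutual-exclusion lock, and a larger initial value would admit that many threads into the critical section at once.

    #include <semaphore.h>

    static sem_t s;
    static long shared_total;

    void worker(void) {
        sem_wait(&s);           /* P(s): decrement, or block while s == 0 */
        shared_total += 1;      /* critical section */
        sem_post(&s);           /* V(s): increment and wake one waiter */
    }

    int main(void) {
        sem_init(&s, 0, 1);     /* initial value 1 => at most one thread inside */
        worker();
        sem_destroy(&s);
        return 0;
    }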

82 Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, such protocols for mutual exclusion are difficult to design...
Simpler solution: atomic read-modify-write instructions.
Examples (m is a memory location, R is a register):
Test&Set (m), R:       R <- M[m]; if R == 0 then M[m] <- 1;
Fetch&Add (m), RV, R:  R <- M[m]; M[m] <- R + RV;
Swap (m), R:           Rt <- M[m]; M[m] <- R; R <- Rt;
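One way this maps onto a modern language (a sketch added here, not part of the lecture): C11's atomic_flag provides a test-and-set primitive, from which a simple spin lock giving P/V-style mutual exclusion follows directly.

    #include <stdatomic.h>

    static atomic_flag mutex = ATOMIC_FLAG_INIT;

    void lock(void) {
        /* atomic test-and-set: sets the flag and returns its previous value */
        while (atomic_flag_test_and_set(&mutex))
            ;                               /* spin until the flag was clear */
    }

    void unlock(void) {
        atomic_flag_clear(&mutex);          /* release: let one spinner succeed */
    }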

83 Using the Test&Set Instruction
P:    Test&Set (mutex), Rtemp
      if (Rtemp != 0) goto P
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
V:    Store (mutex), 0
      process(R)
The code between P and V is the critical section. Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's.
What if the process stops or is swapped out while in the critical section?

84 Nonblocking Synchronization
Compare&Swap (m), Rt, Rs: if (Rt == M[m]) then M[m] = Rs; Rs = Rt; status = success; else status = fail;   (status is an implicit argument)
try:  Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rnewhead = Rhead + 1
      Compare&Swap (head), Rhead, Rnewhead
      if (status == fail) goto try
      process(R)
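The same retry idiom written with C11 atomics (a sketch added here, not from the slides): atomically advance a shared head index only if no other thread changed it since it was read; the names are illustrative and the queue-empty test is omitted.

    #include <stdatomic.h>

    static _Atomic long head;

    /* Atomically claim the current head slot and advance head by one. */
    long claim_next(void) {
        long old = atomic_load(&head);
        /* If head still equals old, store old + 1 and stop; otherwise old is
           reloaded with the current value and we retry, like "goto try". */
        while (!atomic_compare_exchange_weak(&head, &old, old + 1))
            ;
        return old;                         /* the slot this thread claimed */
    }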

85 Load-reserve & Store-conditional
Special register(s) hold the reservation flag and address, and the outcome of the store-conditional.
Load-reserve R, (m):        <flag, adr> <- <1, m>; R <- M[m];
Store-conditional (m), R:   if <flag, adr> == <1, m> then cancel other processors' reservations on m; M[m] <- R; status = succeed; else status = fail;
try:  Load-reserve Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store-conditional (head), Rhead
      if (status == fail) goto try
      process(R)

86 Performance of locks
Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap), versus non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional), versus protocols based on ordinary Loads and Stores.
Performance depends on several interacting factors: the degree of contention, caches, and out-of-order execution of Loads and Stores. More later...

87 Another multiprocessor issue: the memory consistency model
What is consistency? When must a processor see a value that has been updated by another processor? For example:
P1: A = 0;               P2: B = 0;
    .....                    .....
    A = 1;                   B = 1;
L1: if (B == 0) ...      L2: if (A == 0) ...
It seems impossible for both if statements L1 and L2 to be true. But what if a write invalidate is delayed and the processor continues?
Memory consistency models: what are the rules for such cases?
Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => the assignments above happen before the ifs.
SC: delay all memory accesses until all invalidates are done.
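A runnable version of the two-flag experiment (a sketch added here, not from the slides): with the default seq_cst atomics, both reads cannot both return 0; with relaxed ordering or plain non-atomic variables, real hardware can and does produce r1 == 0 and r2 == 0. The variable names follow the slide.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int A, B;
    static int r1, r2;

    static void *p1(void *arg) {
        atomic_store(&A, 1);                /* A = 1 */
        r1 = atomic_load(&B);               /* L1: read B */
        return NULL;
    }

    static void *p2(void *arg) {
        atomic_store(&B, 1);                /* B = 1 */
        r2 = atomic_load(&A);               /* L2: read A */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        atomic_store(&A, 0);
        atomic_store(&B, 0);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Under sequential consistency (memory_order_seq_cst, the default),
           r1 == 0 && r2 == 0 is impossible; under relaxed ordering it is not. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }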

88 Memory consistency models
There are schemes that allow faster execution than sequential consistency.
This is not really an issue for most programs, because they are synchronized: a program is synchronized if all accesses to shared data are ordered by synchronization operations, e.g.
    write (x)
    release (s) {unlock}
    ...
    acquire (s) {lock}
    read (x)
Only programs willing to be nondeterministic are not synchronized: a “data race”, where the outcome is a function of processor speed.
Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude toward RAR, WAR, RAW, and WAW orderings to different addresses.

89 Sequential Consistency
Sequential concurrent tasks T1 and T2; shared variables X and Y (initially X = 0, Y = 10).
T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)
What are the legitimate answers for X' and Y'? Is (X', Y') in {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.

90 Sequential Consistency A Memory Model
[Figure: processors P1…Pn issuing memory references to a single shared memory.]
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.” (Leslie Lamport)
Sequential consistency = an arbitrary order-preserving interleaving of the memory references of sequential programs.

91 Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?
T1:                          T2:
Store (X), 1    (X = 1)      Load R1, (Y)
Store (Y), 11   (Y = 11)     Store (Y'), R1   (Y' = Y)
                             Load R2, (X)
                             Store (X'), R2   (X' = X)
(The additional SC requirements are the cross-task orderings shown in the figure.)
Does (or can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? More on this later.

92 Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues.
Out-of-order execution capability:
    Load(a); Load(b):   yes
    Load(a); Store(b):  yes if a ≠ b
    Store(a); Load(b):  yes if a ≠ b
    Store(a); Store(b): yes if a ≠ b
Caches: caches can prevent the effect of a store from being seen by other processors.

