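Modern Computer Architecture
- Instructor: Prof. Zhang Gang (张钢), School of Computer Science, Tianjin University
- Course materials, assignments, and discussion: http://glearning.tju.edu.cn/
- Email: gzhang@tju.edu.cn
- 2018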
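Main Reference (1)
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 5th edition (English edition), China Machine Press (机械工业出版社)
- E-book: http://www.doc88.com/p-112663203506.html
Main Reference (2)
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 5th edition (Chinese edition: 计算机体系结构:量化研究方法), translated by Jia Hongfeng (贾洪峰), Posts & Telecom Press (人民邮电出版社)
Main Reference (3)
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition (English edition), China Machine Press (机械工业出版社)
Main Reference (4)
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition (Chinese edition: 计算机系统结构:一种定量的方法), translated by Bai Yuebin (白跃彬), Publishing House of Electronics Industry (电子工业出版社)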
Hennessy's Profile on the Stanford Website
- [Figure: screenshot of the Stanford page introducing John L. Hennessy]
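Main Reference (5)
- Kai Hwang (黄铠) and Zhiwei Xu (徐志伟), Scalable Parallel Computing: Technology, Architecture, Programming (Chinese edition: 可扩展并行计算:技术、结构与编程), translated by Lu Xinda (陆鑫达) et al., China Machine Press (机械工业出版社)
Main Reference (6)
- Zheng Weimin (郑纬民) et al., Computer System Architecture (计算机系统结构), 2nd edition, Tsinghua University Press (清华大学出版社)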
Course Schedule
- Start date: March 8, 2018
- Class time: weeks 1-8, every Thursday, 1:30-5:00 p.m.
- Location: Room 117, Zone A, Building 55
The Main Contents
- Chapter 1. Fundamentals of Quantitative Design and Analysis
- Chapter 2. Memory Hierarchy Design
- Chapter 3. Instruction-Level Parallelism and Its Exploitation
- Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Architectures
- Chapter 5. Thread-Level Parallelism
- Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
- Appendix A. Pipelining: Basic and Intermediate Concepts
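Prerequisites
- Undergraduate courses: Computer Organization, Computer Architecture, Operating Systems, Computer Networks
Exams and Grading
- Attendance (including quizzes and answering questions): 20%
- Homework (submitted online): 20%
- Final exam (closed book): 60%
- Homework submission requirements:
  - State your name and the assignment number clearly, e.g. "Zhang XX, Assignment N"
  - Submit the homework as an attachment; do not use WPS format
  - Deadline: before 8:00 a.m. on Monday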
Computer Technology
- Performance improvements:
  - Improvements in semiconductor technology: feature size, clock speed
  - Improvements in computer architectures: enabled by high-level language (HLL) compilers and UNIX; led to RISC architectures
- Together these have enabled:
  - Lightweight computers
  - Productivity-based managed/interpreted programming languages
Uniprocessor Performance
- Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: 20%/year, 2002 to present
Single Processor Performance
- [Figure: single-processor performance over time, annotated "RISC" and "Move to multi-processor"]
Original Food Chain
- Big fishes eating little fishes
1986 Computer Food Chain
- [Figure labels: Mainframe, Workstation, PC, Minicomputer, Mini-supercomputer, Supercomputer, Massively Parallel Processors]
2002 Computer Food Chain
- [Figure labels: Mainframe, Workstation, PC, Server, Supercomputer; also shown: Mini-supercomputer, Minicomputer, Massively Parallel Processors]
- Now who is eating whom?
Why Such Change in 16 Years? Performance
- Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
- Computer architecture advances improve the low end: RISC, superscalar, RAID, …
Assignment 1
- List the new technologies that have emerged in computer architecture over the past 20 years
Why Such Change in 16 Years? Price
- Lower costs due to:
  - Simpler development (CMOS VLSI: smaller systems, fewer components)
  - Higher volumes (CMOS VLSI: same development cost for 10,000 vs. 10,000,000 units)
  - Lower margins by class of computer, due to fewer services
Why Such Change in 16 Years? Function
- Rise of networking/local interconnection technology
Moore's Law
- Exponential growth: doubling of transistors every couple of years
Growth in CPU Transistor Count
- [Figure: CPU transistor count over time]
Moore's Law Graph
- Gordon Moore's 1965 prediction, popularly known as Moore's Law, states that the number of transistors on a chip will double about every two years.
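Moore's Law Graph
- Is a larger die better, or a smaller one?
- [Figure: the gray circle is the wafer; the yellow dots are impurities]
- Imagine what would happen if a whole wafer yielded only one chip.
- An appropriate number of chips per wafer minimizes the total cost.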
Do You Want to Be a Millionaire?
- You double your investment every day, starting with one cent.
- How long does it take to become a millionaire? 20 days? 27 days? 37 days? 365 days? A lifetime or more?
- Answer:
  - 20 days: one million cents
  - 27 days: millionaire
  - 37 days: billionaire
- Transistor counts double every 18 months; this growth rate is hard to imagine.
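The day counts on the slide above can be checked with a short loop. This is only a sketch: the function name `days_to_reach` and the choice to express every target in cents are assumptions made for illustration.

```python
def days_to_reach(target_cents: int, start_cents: int = 1) -> int:
    """Count the daily doublings needed until the amount reaches target_cents."""
    days = 0
    amount = start_cents
    while amount < target_cents:
        amount *= 2  # the investment doubles once per day
        days += 1
    return days

# Targets expressed in cents (1 dollar = 100 cents).
ONE_MILLION_CENTS = 1_000_000
ONE_MILLION_DOLLARS = 100 * 1_000_000
ONE_BILLION_DOLLARS = 100 * 1_000_000_000

for label, target in [
    ("one million cents", ONE_MILLION_CENTS),
    ("millionaire", ONE_MILLION_DOLLARS),
    ("billionaire", ONE_BILLION_DOLLARS),
]:
    print(f"{label}: {days_to_reach(target)} days")
```

The loop uses integer arithmetic, so there is no rounding to worry about; the same count could also be obtained as the ceiling of a base-2 logarithm.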
Uniprocessor Performance
- Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: 20%/year, 2002 to present
Why Has the Improvement Dropped?
- The end of the uniprocessor era
- The single biggest change in the history of computing systems
Current Trends in Architecture
- Cannot continue to leverage instruction-level parallelism (ILP)
  - Single-processor performance improvement ended in 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)
- These require explicit restructuring of the application
Trends in Technology
- Integrated circuit technology
  - Transistor density: 35%/year
  - Die size: 10-20%/year
  - Integration overall: 40-55%/year
- DRAM capacity: 25-40%/year (slowing)
- Flash capacity: 50-60%/year
  - 15-20X cheaper/bit than DRAM
- Magnetic disk technology: 40%/year
  - 15-25X cheaper/bit than Flash
  - 300-500X cheaper/bit than DRAM
Memory Capacity (Single-Chip DRAM)
  Year   Size (Mb)   Cycle time
  1980   0.0625      250 ns
  1983   0.25        220 ns
  1986   1           190 ns
  1989   4           165 ns
  1992   16          145 ns
  1996   64          120 ns
  2000   256         100 ns
Bandwidth and Latency
- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - 300-1200X improvement for memory and disks
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks
- [Figure: log-log plot of bandwidth and latency milestones]
Transistors and Wires
- Feature size
  - Minimum size of a transistor or wire in the x or y dimension
  - 10 microns in 1971 to 0.032 microns in 2011
  - Transistor performance scales linearly
    - Wire delay does not improve with feature size!
  - Integration density scales quadratically
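As a rough illustration of the quadratic scaling claim, the sketch below applies it to the two feature sizes quoted on the slide. It assumes idealized scaling (density proportional to the inverse square of feature size); real chips also depend on layout and die size, and the function names are hypothetical.

```python
def linear_shrink(old_feature_um: float, new_feature_um: float) -> float:
    """Factor by which the feature size shrank in one dimension."""
    return old_feature_um / new_feature_um

def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Idealized gain in transistors per unit area: the linear shrink squared."""
    return linear_shrink(old_feature_um, new_feature_um) ** 2

# Feature sizes from the slide: 10 microns (1971) and 0.032 microns (2011).
shrink = linear_shrink(10.0, 0.032)
gain = density_gain(10.0, 0.032)
print(f"linear shrink: about {shrink:.1f}x")
print(f"idealized density gain: about {gain:,.0f}x")
```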
Power and Energy
- Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as the target for the power supply and cooling system
  - Lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement
- Note: for the Core i7 processor, Intel specifies a maximum TDP (Max TDP), not a TDP
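Dynamic Energy and Power
- Dynamic energy, for a transistor switching from 0 -> 1 or 1 -> 0:
  - $\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$
- Dynamic power:
  - $\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
- Reducing clock rate reduces power, not energy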
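The two formulas above can be turned into a small calculation that shows why lowering the clock rate cuts power but not energy, while lowering the voltage cuts both. The capacitance, voltage, and frequency values below are hypothetical and chosen only to make the ratios visible.

```python
def dynamic_energy(cap_load_f: float, voltage_v: float) -> float:
    """Energy (joules) of one 0->1 or 1->0 transition: 1/2 * C * V^2."""
    return 0.5 * cap_load_f * voltage_v ** 2

def dynamic_power(cap_load_f: float, voltage_v: float, freq_hz: float) -> float:
    """Power (watts): 1/2 * C * V^2 * frequency switched."""
    return dynamic_energy(cap_load_f, voltage_v) * freq_hz

# Hypothetical operating point (not taken from the slides).
C, V, F = 1e-9, 1.0, 3.3e9

# Halving the clock rate halves power; energy per transition is unchanged.
power_ratio_half_clock = dynamic_power(C, V, F / 2) / dynamic_power(C, V, F)

# Lowering the voltage by 15% reduces energy by the square of the factor.
energy_ratio_low_voltage = dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V)

print(f"power at half clock: {power_ratio_half_clock:.2f} of original")
print(f"energy at 85% voltage: {energy_ratio_low_voltage:.4f} of original")
```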
Power
- The Intel 80386 consumed ~2 W
- A 3.3 GHz Intel Core i7 (1st generation) consumes 130 W
- Heat must be dissipated from a 1.5 cm x 1.5 cm chip
- This is the limit of what can be cooled by air
Static Power
- Static power consumption: $\text{Current}_{\text{static}} \times \text{Voltage}$
- Scales with the number of transistors
- To reduce it: power gating, i.e. turning off the power supply to idle circuits to reduce leakage
Energy Saving
- Do nothing well
  - Turn off the clock of inactive modules, e.g. floating-point unit, cores
- Dynamic Voltage-Frequency Scaling (DVFS)
  - When the CPU is at only 3% utilization, does it really need to run at full speed?
  - Why DVS rather than DVFS? "Figure 5.11 shows the potential power savings of CPU dynamic voltage scaling (DVS) for that same server by plotting the power usage across a varying compute load for three frequency-voltage steps."
- Design for the typical case
  - Memory and storage offer low-power modes
  - "Emergency slowdown"
- Overclocking
  - Intel has offered Turbo mode in its chips since 2008.
  - In Turbo mode, a few cores may run for a short time at a frequency higher than the nominal clock rate.
  - For example, the 3.3 GHz Core i7 is a multicore microprocessor (different Core i7 models have between 2 and 8 cores), and it can run some of its cores at 3.6 GHz for a short period.
- The primary evaluation now is tasks per joule or performance per watt, rather than performance per mm^2 of silicon
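Question for Thought
- An observation: when the same program runs on the same computer, a change in room temperature can affect how fast the program executes.
- Why does room temperature affect the execution speed of a program? In other words, why does room temperature affect the performance of a computer system?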
Trends in Cost
- Cost is driven down by the learning curve
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume
  - 10% less for each doubling of volume
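The "10% less for each doubling of volume" rule can be written as a simple formula. The sketch below is an idealized model under that single assumption; the function name and the base price and volumes are hypothetical.

```python
import math

def price_after_volume(base_price: float, base_volume: float, volume: float,
                       drop_per_doubling: float = 0.10) -> float:
    """Price after volume grows from base_volume, assuming a fixed
    fractional drop for every doubling of volume."""
    doublings = math.log2(volume / base_volume)
    return base_price * (1.0 - drop_per_doubling) ** doublings

# Hypothetical numbers: volume grows 8x, i.e. three doublings,
# so the price is scaled by 0.9 three times.
print(f"{price_after_volume(100.0, 1_000, 8_000):.2f}")
```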
Dependability
- Module reliability
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - Mean time between failures (MTBF) = MTTF + MTTR
  - Availability = MTTF / MTBF
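A minimal sketch of the availability formula above. The MTTF and MTTR values are hypothetical and serve only to show how the two quantities combine.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours

# Hypothetical module: fails on average every 1,000,000 hours,
# and takes 24 hours to repair.
print(f"availability: {availability(1_000_000, 24):.6f}")
```

Because MTTR appears only in the denominator, shortening repair time raises availability even when MTTF stays the same.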
Conventional Wisdom in Computer Architecture
- Old conventional wisdom (CW): power is free, transistors are expensive
- New CW: the "power wall": power is expensive, transistors are free (we can put more on a chip than we can afford to turn on)
- Old CW: sufficiently increasing instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
- New CW: the "ILP wall": law of diminishing returns on more hardware for ILP
- Old CW: multiplies are slow, memory access is fast
- New CW: the "memory wall": memory is slow, multiplies are fast (200 clock cycles to DRAM memory, 4 clocks for a multiply)
- Old CW: uniprocessor performance 2X / 1.5 years
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  - Uniprocessor performance is now 2X / 5(?) years
  - Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
  - More, simpler processors are more power efficient
Content of Computer Architecture Courses
- 1950s to 1960s: computer arithmetic
- 1970s to mid 1980s: instruction set design, especially ISAs appropriate for compilers
- 1990s: design of CPU, memory system, I/O system, multiprocessors, networks
- 2010s: self-adapting systems? Self-organizing structures? DNA systems/quantum computing?
Research Topics in Computer Architecture
- Further improving the performance of a single microprocessor (the speed-of-light limit)
- Microprocessor-based multiprocessor architectures
- Improving overall system qualities: availability, maintainability, scalability
- Processors built from new devices, such as optical computers; computers based on new principles (biological, molecular, and more recently DNA computers)
What is Computer Architecture?
- [Figure: "Application" at the top and "Physics" at the bottom; the gap is too large to bridge in one step (but there are exceptions, e.g. the magnetic compass)]
- In its broadest definition, computer architecture is the design of the abstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.
Abstraction Layers in Modern Systems
- Layers, from top to bottom: Application; Algorithm; Programming Language; Operating System/Virtual Machine; Instruction Set Architecture (ISA); Microarchitecture; Gates/Register-Transfer Level (RTL); Circuits; Devices; Physics
- Figure annotations:
  - Original domain of the computer architect ('50s-'80s)
  - Domain of recent computer architecture ('90s)
  - Reinvigoration of computer architecture, mid-2000s onward: reliability, power, …; parallel computing, security, …
Computer Engineering Methodology
- Steps in the cycle: evaluate existing systems for bottlenecks; simulate new designs and organizations; implement the next-generation system
- Inputs shown in the figure: benchmarks, workloads, technology trends, implementation complexity
Types of Computers (Traditional)
- Computers come in many shapes and sizes: supercomputers; mainframes; minicomputers; microcomputers, also known as PCs; palm computers, also known as PDAs; embedded computers
Supercomputers
- Designed for ultra-high-performance tasks such as weather analysis
- Large, expensive, massively parallel processing
Mainframes
- Require high performance
- Generate and process large numbers of transactions
- IBM S/390: 126 MIPS in a single-processor configuration
Minicomputers
- Designed for real-time dedicated applications or as high-performance, multiple-user applications
- Examples: Digital Alpha, IBM RS/6000, Sun Ultra
Microcomputers
- The most prevalent form
- Sitting on a standard desktop, or even a laptop
- The first PC was built by IBM
- [Figure: Apple]
Palm Computers
- These computers are about the size of a human hand
- Uses: word processing, spreadsheet calculations, handwriting recognition, game playing, faxing
Types of Computers Now
- Personal Mobile Device (PMD)
- Desktop computing
- Servers
- Clusters/Warehouse-Scale Computers (WSC)
  - Many desktop computers or servers connected by local area networks to act as a single larger computer
  - WSCs are the largest of the clusters
- Embedded computers
  - What are embedded computers?
Classes of Parallelism and Parallel Architectures
- In applications:
  - Data-Level Parallelism (DLP)
  - Task-Level Parallelism (TLP)
- Hardware support:
  - Instruction-Level Parallelism
  - Vector architectures and Graphics Processing Units (GPUs)
  - Thread-Level Parallelism
  - Request-Level Parallelism
Flynn Categories
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data stream (SIMD)
- Multiple instruction stream, single data stream (MISD)
- Multiple instruction stream, multiple data stream (MIMD)
- Some further divide the MIMD category into SPMD (Single Program, Multiple Data) and MPMD (Multiple Program, Multiple Data)
  - SPMD: multiple autonomous processors simultaneously executing the same program on different data
  - MPMD: multiple autonomous processors simultaneously operating at least 2 independent programs
- [Figure: Flynn's web page, copied from Stanford University]
Intel Processors
- [Figures: Intel 4004, Intel 8008, Intel 80286, Intel 80386, Intel 80486, Intel Pentium, Intel Pentium Pro, Intel Pentium II]
Pentium Evolution (1)
- 8080: first general-purpose microprocessor; 8-bit data path; used in the first personal computer, the Altair
- 8086: much more powerful; 16-bit; instruction cache prefetches a few instructions; the 8088 variant (8-bit external bus) was used in the first IBM PC
- 80286: 16 MB of addressable memory, up from 1 MB
- 80386: 32-bit; support for multitasking
Pentium Evolution (2)
- 80486: sophisticated, powerful cache and instruction pipelining; built-in maths co-processor
- Pentium: superscalar; multiple instructions executed in parallel
- Pentium Pro: increased superscalar organization; aggressive register renaming; branch prediction; data flow analysis; speculative execution
Pentium Evolution (3)
- Pentium II: MMX technology; graphics, video, and audio processing
- Pentium III: additional floating-point instructions for 3D graphics
- Pentium 4: note Arabic rather than Roman numerals; further floating-point and multimedia enhancements
Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm^2 chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm^2 chip
- A 125 mm^2 chip in 0.065 micron CMOS = 2312 x (RISC II + FPU + Icache + Dcache)
  - RISC II shrinks to ~0.02 mm^2 at 65 nm
  - Caches via DRAM or 1-transistor SRAM?
- Processor is the new transistor?
Problems with Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, … are not ready to supply thread-level parallelism or data-level parallelism for 1000 CPUs/chip
- Architectures are not ready for 1000 CPUs/chip
  - Unlike instruction-level parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of architects
- This edition of the course and the 4th edition of the textbook "Computer Architecture: A Quantitative Approach" explore the shift from instruction-level parallelism to thread-level parallelism / data-level parallelism
Measurement and Evaluation
- Architecture is an iterative process:
  - Searching the space of possible designs
  - At all levels of computer systems
- [Figure: creativity feeds cost/performance analysis, which sorts ideas into good ideas, mediocre ideas, and bad ideas]
- Note: the English term "cost/performance" is the exact inverse of the ratio commonly used in Chinese, performance/price (性能/价格)!
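Performance and Cost
- "X is n times faster than Y" means $\frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} = n$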
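Amdahl's Law
- Speedup = (Performance for entire task using the enhancement) / (Performance for entire task without using the enhancement)
- Speedup = (Execution time for entire task without using the enhancement) / (Execution time for entire task using the enhancement)
Amdahl's Law Depends on Two Factors
- $\text{Fraction}_{\text{enhanced}}$: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement
  - (time spent in the part that can be enhanced) / (execution time of the entire task before the enhancement) < 1
  - Example: the entire task takes 60 seconds before the enhancement and the part that can be enhanced takes 20 seconds, so $\text{Fraction}_{\text{enhanced}} = 20/60$
- $\text{Speedup}_{\text{enhanced}}$: the improvement gained by the enhanced execution mode
  - (execution time of the enhanced part before the enhancement) / (execution time of the enhanced part after the enhancement) > 1
  - Example: the enhanced part takes 5 seconds before the enhancement and 2 seconds after it, so $\text{Speedup}_{\text{enhanced}} = 5/2$
Conclusions from Amdahl's Law
- Conclusion (1): $\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}$ = (execution time of the enhanced part after the enhancement) / (execution time of the entire task before the enhancement)
- Conclusion (2), which follows from conclusion (1): $\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$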
Amdahl's Law: Example (1)
- Floating-point instructions are improved to run 2X faster, but only 10% of actual instructions are FP
- $\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$
- $\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$
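The floating-point example above can be reproduced with a few lines of code. This is a sketch of the standard Amdahl's Law formula; the function and parameter names are chosen here for illustration.

```python
def speedup_overall(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# The example from the slide: 10% of the work runs 2X faster.
fp_speedup = speedup_overall(0.10, 2.0)
print(f"overall speedup: {fp_speedup:.3f}")

# Upper bound: even an infinitely fast FP unit cannot beat 1 / (1 - F).
fp_limit = 1.0 / (1.0 - 0.10)
print(f"limit as the FP speedup grows: {fp_limit:.3f}")
```

The second print shows the practical message of the law: the unenhanced 90% of the work caps the overall gain, no matter how large the local speedup is.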
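CPU Time
- $\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$
- or $\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$
Cycles Per Instruction (Throughput)
- "Average cycles per instruction"
- $\text{CPI} = \frac{\text{CPU time} \times \text{Clock rate}}{\text{Instruction count}} = \frac{\text{Cycles}}{\text{Instruction count}}$
- $\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$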
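The CPU time and CPI relations above translate directly into code. The sketch below uses hypothetical values for instruction count, CPI, and clock rate; the function names are also assumptions.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

def average_cpi(cpu_time_s: float, clock_rate_hz: float,
                instruction_count: float) -> float:
    """Average CPI = (CPU time * clock rate) / instruction count."""
    return cpu_time_s * clock_rate_hz / instruction_count

# Hypothetical program: 1e9 instructions, CPI of 1.5, 2 GHz clock.
t = cpu_time(1e9, 1.5, 2e9)
print(f"CPU time: {t} s")

# Going the other way recovers the CPI from the measured time.
print(f"CPI: {average_cpi(t, 2e9, 1e9)}")
```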
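Cycles Per Instruction (Throughput)
- $\text{CPU time} = \text{Clock cycle time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$
- $\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j$, where $F_j = \frac{I_j}{\text{Instruction count}}$ is the "instruction frequency"
- Invest resources where time is spent!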
Aspects of CPU Performance
- $\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
                 Inst Count   CPI   Clock Rate
  Program        X
  Compiler       X            (X)
  Inst. Set      X            X
  Organization                X     X
  Technology                        X
Example: Calculating CPI
- Base machine (Reg/Reg), typical mix
  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        0.5      (33%)
  Load     20%    2        0.4      (27%)
  Store    10%    2        0.2      (13%)
  Branch   20%    2        0.4      (27%)
  Total                    1.5
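The table above can be recomputed from the frequency and cycle columns alone. The sketch below uses the instruction mix from the slide; the dictionary layout and variable names are assumptions made for illustration.

```python
# Instruction mix from the slide: operation -> (frequency, cycles per instruction).
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# Each class contributes frequency * cycles to the average CPI.
contributions = {op: freq * cycles for op, (freq, cycles) in mix.items()}
total_cpi = sum(contributions.values())

# Share of execution time spent in each instruction class.
time_share = {op: c / total_cpi for op, c in contributions.items()}

print(f"average CPI: {total_cpi:.2f}")
for op in mix:
    print(f"{op:<7} CPI(i) = {contributions[op]:.1f}  time = {time_share[op]:.0%}")
```

ALU operations are half of all instructions but only a third of the time, which is why the time column, not the frequency column, indicates where an optimization pays off.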
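Performance Metrics: MIPS
- MIPS (Million Instructions Per Second) $= \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}$
- Drawbacks:
  - Depends on the instruction set
  - Varies from program to program on the same machine
  - May vary inversely with performance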
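The last drawback, that MIPS can move in the opposite direction from real performance, is easiest to see with numbers. The two program versions below are hypothetical: version 2 uses fewer but slower instructions on the same 100 MHz machine.

```python
def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

def mips(instruction_count: float, exec_time_s: float) -> float:
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

CLOCK_HZ = 100e6  # hypothetical 100 MHz machine

# Version 1: many simple instructions. Version 2: fewer, more complex ones.
time_v1 = execution_time(8e6, 1.0, CLOCK_HZ)
time_v2 = execution_time(4e6, 1.5, CLOCK_HZ)
mips_v1 = mips(8e6, time_v1)
mips_v2 = mips(4e6, time_v2)

print(f"version 1: {time_v1:.2f} s, {mips_v1:.1f} MIPS")
print(f"version 2: {time_v2:.2f} s, {mips_v2:.1f} MIPS")
```

Version 2 finishes sooner yet reports the lower MIPS rating, so ranking by MIPS would pick the slower program.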
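Performance Metrics: MFLOPS
- MFLOPS (Million Floating-Point Operations Per Second) $= \frac{\text{Number of floating-point operations in the program}}{\text{Execution time} \times 10^6}$
- Advantage: can be used to compare different machines
- Drawbacks:
  - Does not reflect overall performance
  - Depends on the type of floating-point operations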
Performance Metrics: Benchmark Programs
- The only consistent and reliable measure of performance is the execution time of real programs.
- Real applications
- Kernels
- Toy benchmarks
- Synthetic benchmarks
Benchmark Suites
- Desktop
  - SPEC CPU2006: 12 integer, 17 floating-point
  - SPECviewperf, SPECapc: graphics benchmarks
- Server
  - SPEC CPU2006: running multiple copies, SPECrate
  - SPECSFS: for NFS performance
  - SPECWeb: Web server benchmark
  - TPC-x: measure transaction-processing, queries, and decision-making database applications
- Embedded processor
  - New area
  - EEMBC: EDN Embedded Microprocessor Benchmark Consortium
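Performance Comparison
- Execution times of two programs on three computers
- Total execution time: a consistent measure
- Mean execution time: the arithmetic mean of the execution times, $\frac{1}{n}\sum_{i=1}^{n} T_i$, where $T_i$ is the execution time of the $i$-th program
- Harmonic mean of execution rates: $\frac{n}{\sum_{i=1}^{n} \frac{1}{R_i}}$, where $R_i = 1/T_i$ and $T_i$ is the execution time of the $i$-th program
- Weighted execution time: the weighted arithmetic mean, $\sum_{i=1}^{n} W_i \times T_i$, where $W_i$ is the weight of the $i$-th program in the workload and $T_i$ is its execution time
- Geometric mean: $\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$
  - Each execution time ratio is normalized to a base machine
  - Used to compute SPECrate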
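The four summary measures above are short enough to compare side by side. The sketch below uses made-up execution times (they are not the data from the slides), and the helper names are assumptions; `math.prod` requires Python 3.8 or later.

```python
import math

def arithmetic_mean(times):
    """(1/n) * sum of T_i."""
    return sum(times) / len(times)

def weighted_mean(times, weights):
    """Sum of W_i * T_i; the weights are expected to add up to 1."""
    return sum(w * t for w, t in zip(weights, times))

def harmonic_mean_rate(times):
    """n / sum(1/R_i), with rates R_i = 1/T_i."""
    rates = [1.0 / t for t in times]
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(ratios):
    """n-th root of the product of the normalized execution time ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical execution times (seconds) of two programs on machines A and B,
# plus a base machine used for normalization.
times_a, times_b, base = [2.0, 8.0], [4.0, 4.0], [1.0, 16.0]

print(arithmetic_mean(times_a), arithmetic_mean(times_b))
print(weighted_mean(times_a, [0.5, 0.5]))
print(harmonic_mean_rate(times_a))

# Geometric means of ratios normalized to the base machine.
gm_a = geometric_mean([t / b for t, b in zip(times_a, base)])
gm_b = geometric_mean([t / b for t, b in zip(times_b, base)])
print(gm_a / gm_b)
```

One design point worth noting: the ratio of two geometric means does not depend on which base machine is chosen, whereas arithmetic means of normalized times can change ranking with the base.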
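Assignment 2
- Read English-language literature on the Power Wall, ILP Wall, and Memory Wall
- Requirements:
  - Each student reads at least one English-language paper
  - Write a reading report in the style of an extended abstract (Chinese or English is acceptable), citing the source of the paper
  - Submit the paper together with the reading report (file name: Assignment 2 + your name)
Assignment 3
- Case Study 1.4 from the 5th edition of the textbook; the full problem statement is on the next slide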