Computer Architecture Chapter 1 Computer Abstractions and Technology Yu-Lun Kuo 郭育倫 Department of Computer Science and Information Engineering Tunghai University, Taichung, Taiwan R.O.C. sscc6991@gmail.com http://www.csie.ntu.edu.tw/~d95037/ CS252 S05
This book http://www.elsevierdirect.com/product.jsp?isbn=9780123744937
Computer Architecture Computer Organization Related Courses Parallel & Advanced Computer Architecture Parallel Architectures, Hardware-Software Interactions System Optimization Computer Organization Computer Architecture Hardware-Software Co-design Why, Analysis, Evaluation How to build it, Implementation details How to make embedded systems better Software Embedded Systems Software Special Topics on Computer Performance Optimization OS, Programming Lang, System Programming RTOS, Tools-chain, I/O & Device drivers, Compilers Performance tools, Performance skills, Compiler optimization tricks 2018/11/28 CS252 S05
Computer Architecture and Organization. Architecture is the set of attributes visible to the programmer: instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques. E.g., is there a multiply instruction? Organization is how those features are implemented: control signals, interfaces, memory technology. E.g., is there a hardware multiply unit, or is multiplication done by repeated addition?
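To make the multiply example concrete, here is a minimal sketch in Python (the slides contain no code, so the language and the function name `multiply_by_repeated_addition` are assumptions made for illustration). The architecture only promises that a multiply instruction yields the correct product. An organization without a hardware multiplier could deliver the same result with a loop of additions like this one.

```python
def multiply_by_repeated_addition(a: int, b: int) -> int:
    """Multiply two integers using only addition, negation, and comparison.

    Illustrative only: models an organization that has an adder but no
    hardware multiply unit.
    """
    negative = (a < 0) != (b < 0)   # sign of the result
    a, b = abs(a), abs(b)
    product = 0
    for _ in range(b):              # add `a` to the running sum `b` times
        product += a
    return -product if negative else product


# The programmer-visible result matches a hardware multiply.
print(multiply_by_repeated_addition(6, 7) == 6 * 7)
```

Both implementations are the same architecture from the programmer's point of view; only the cost (here, `b` additions instead of one multiply) differs.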
Computer Architecture and Organization. All members of the Intel x86 family share the same basic architecture, as do all members of the IBM System/370 family. This gives code compatibility, at least backwards. Organization differs between versions.
Class of Computing Applications (1/2). Desktop computers: emphasize delivering good performance to a single user at low cost; price-performance and graphics performance matter; Intel, AMD, Apple, Microsoft, Linux. Servers: accessed only via a network; provide for greater expandability of both computing and input/output capacity; availability, scalability, and throughput matter; IBM, HP-Compaq, Sun, Intel, Microsoft, Linux.
Class of Computing Applications (2/2). Supercomputers: consist of hundreds to thousands of processors; usually gigabytes to terabytes of memory and terabytes to petabytes of storage; cost millions to hundreds of millions of dollars. Embedded computers: a computer inside another device, including the microprocessors found in washing machines, cars, cell phones, video games, PDAs, and digital TVs.
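Where is the Market? (Axis label: millions of computers.) Figure 1.1: The number of distinct processors sold between 1998 and 2002. These counts were obtained somewhat differently, so some caution is required in interpreting the results. For example, the totals for desktops and servers count complete computer systems; because some of these include multiple processors, the number of processors sold is somewhat higher, but probably by only 10-20% in total (servers may average more than one processor per system, but they amount to only about 3% of the sales of desktops, which are single-processor systems). The totals for embedded computers actually count processors. Many embedded processors are not even visible, and some single devices contain multiple processors.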
Instruction Set Architecture (ISA) ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on. “... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964
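(Axis label: millions of processors.) Figure 1.2: Sales of processors between 1998 and 2002 by instruction set architecture. The "other" category refers to application-specific or customized processors. In the case of ARM, roughly 80% of the sales are for cell phones, where an ARM core is combined with application-specific logic on a single chip.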
Hierarchical Layers. System software sits between the hardware and the applications software; it includes operating systems, compilers, and assemblers.
Compilers & assemblers. Compiler: translates a program written in a high-level language, such as C or Java, into instructions that the hardware can execute. Assembler: translates a symbolic version of an instruction into the binary version. Assembly language: a symbolic representation of machine instructions.
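Figure 1.4: A C program compiled into assembly language and then assembled into binary machine language. Although the translation from high-level language to binary machine language is shown in two steps, some compilers cut out the middle step and produce binary machine language directly. These languages and this program are examined in more detail in Chapter 2. Stages shown in the figure: high-level language program (in C), compiler, assembly language program (for MIPS), assembler, binary machine language program (for MIPS).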
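The assembler step in the figure can be sketched for a single instruction. This is a minimal illustration in Python, not the program from Figure 1.4: the instruction `add $t0, $s1, $s2`, the helper name `encode_add`, and the three-entry register table are assumptions chosen for the example. The field layout is the standard MIPS R-type format (opcode, rs, rt, rd, shamt, funct).

```python
# Register numbers for the three registers used in this example.
REGISTERS = {"$t0": 8, "$s1": 17, "$s2": 18}

ADD_OPCODE = 0x00   # R-type instructions use opcode 0
ADD_FUNCT = 0x20    # funct field that selects `add`


def encode_add(rd: str, rs: str, rt: str) -> int:
    """Assemble `add rd, rs, rt` into a 32-bit MIPS machine word."""
    return (
        (ADD_OPCODE << 26)        # bits 31-26: opcode
        | (REGISTERS[rs] << 21)   # bits 25-21: first source register
        | (REGISTERS[rt] << 16)   # bits 20-16: second source register
        | (REGISTERS[rd] << 11)   # bits 15-11: destination register
        | (0 << 6)                # bits 10-6: shift amount (unused by add)
        | ADD_FUNCT               # bits 5-0: function code
    )


word = encode_add("$t0", "$s1", "$s2")
print(f"{word:032b}")   # binary machine language
print(f"0x{word:08x}")  # 0x02324020
```

The symbolic form on the left of the assembler and the 32-bit word on the right carry the same information; the assembler only packs the fields.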
What is “Computer Architecture”? Levels shown on the slide: Applications, Compiler, Operating System, Firmware, Instruction Set Architecture (ISA), I/O system, Instr. Set Proc., Digital Design, Circuit Design, Datapath & Control, Layout & fab, Semiconductor Materials. Coordination of many levels of abstraction, under a rapidly changing set of forces. Design, measurement, and evaluation.
Registers vs. Memory. Operands of arithmetic instructions must be registers, and only 32 registers are provided. The compiler associates variables with registers. What about programs with lots of variables?
Impacts of Advancing Technology. Processor: logic capacity increases about 30% per year; performance doubles every 1.5 years. ClockCycle = 1/ClockRate: a 500 MHz clock rate gives a 2 ns clock cycle, 1 GHz gives 1 ns, and 4 GHz gives 250 ps.
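The relation ClockCycle = 1/ClockRate can be checked with a few lines of Python. This is a minimal sketch; the function name `clock_cycle_seconds` is hypothetical, and the three clock rates are the ones listed on the slide.

```python
def clock_cycle_seconds(clock_rate_hz: float) -> float:
    """Return the clock cycle time in seconds for a clock rate in hertz."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz


# Clock rates taken from the slide: 500 MHz, 1 GHz, 4 GHz.
for rate_hz in (500e6, 1e9, 4e9):
    cycle_ps = clock_cycle_seconds(rate_hz) * 1e12  # seconds -> picoseconds
    print(f"{rate_hz / 1e6:.0f} MHz -> {cycle_ps:.0f} ps")
```

Working in picoseconds keeps all three results as whole numbers: 2 ns is 2000 ps and 1 ns is 1000 ps.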
Impacts of Advancing Technology. Memory: DRAM capacity grew 4x every 3 years, now 2x every 2 years; memory speed improves 1.5x every 10 years; cost per bit decreases about 25% per year. Disk: capacity increases about 60% per year.
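Figure 1.6: A desktop computer. The liquid crystal display (LCD) screen is the primary output device, and the keyboard and mouse are the primary input devices. The box contains the processor as well as additional I/O devices. This system is a Dell Optiplex GX260.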
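Figure 1.8: Inside the personal computer of Figure 1.6 on page 15. This packaging is sometimes called a clamshell because of the way it opens, with hinges on one side. To see what is inside, start at the top left. The metal box at the top left is the power supply. Just below it is the fan with its cover. To the lower right of the fan is a printed circuit (PC) board, called the motherboard in a PC, which contains most of the electronics of the computer. Figure 1.10 is a close-up of such a board. The processor is the large raised rectangle just to the right of the fan. On the right-hand side are the bays that hold the various drives: the DVD drive at the top, the Zip drive in the middle, and the hard disk at the bottom. Labels in the figure: DVD drive, power supply, Zip drive, fan with cover, motherboard, hard disk.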
Example Machine Organization. Workstation design target: 25% of cost on processor, 25% of cost on memory (minimum memory size), and the rest on I/O devices, power supplies, and box. Diagram: Computer = CPU (Control, Datapath), Memory, Devices (Input, Output). That is, any computer, no matter how primitive or advanced, can be divided into five parts: 1. The input devices bring data from the outside world into the computer. 2. These data are kept in the computer’s memory until ... 3. The datapath requests and processes them. 4. The operation of the datapath is controlled by the computer’s controller. All the work done by the computer does us no good unless we can get the data back to the outside world. 5. Getting the data back to the outside world is the job of the output devices. The most common way to connect these five components is a network of buses.
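Figure 1.5: The organization of a computer, showing the five classic components. The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output. Labels in the figure: compiler, interface, computer, input, output, control, datapath, processor, memory, evaluating performance.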
Inside the Pentium 4 Processor Chip
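Figure 1.9: Inside the processor chip used on the board shown in Figure 1.8. The left-hand side is a microphotograph of the Pentium 4 processor chip, and the right-hand side shows the major blocks in the processor. Labels in the figure: control unit, other interface logic, I/O interface, instruction cache, data cache, enhanced floating-point and multimedia unit, control unit, second-level cache and interface, advanced pipelining and multithreading support unit.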
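Figure 1.10: Close-up of a PC motherboard. This board uses the Intel Pentium 4 processor, which is located in the upper left of the board. It is covered by a set of metal fins that look like a radiator. This is a heat sink, which helps the chip shed heat. The memory consists of one or more small circuit boards plugged vertically into the motherboard near the center. The DRAM chips are mounted on these small boards (called dual inline memory modules, DIMMs), which are then plugged into connectors. Most of the rest of the motherboard serves to connect external I/O devices, such as audio/MIDI, the parallel/serial ports on the right, the two Peripheral Component Interconnect (PCI) card slots near the bottom, and the advanced technology attachment (ATA) connector for hard disks. Labels in the figure: memory, interface, I/O device bus slots, graphics interface card, disk and USB ports.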
Safe Place for Data. Primary memory (main memory): volatile, so it loses its contents when it loses power. Secondary memory: nonvolatile. Examples: magnetic disk (hard disk); floppy disks; optical disks (CDs, DVDs, HD DVD, BD); flash-based removable memory.
Figure 1.11: A hard disk showing 10 platters and the read/write heads.
Total transistors in PCs (approximate, by processor and year of introduction): 1971, 4004: 2,300; 1974, 8080: 6,000; 1978, 8086: 29,000; 1982, 286: 134,000; 1985, 386: 275,000; 1989, 486: 1.2 million; 1993, Pentium: 3.1 million; 1997, Pentium II: 7.5 million; 1999, Pentium III: 9.5 million.
Moore’s Law. In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time). Amazingly visionary: the million-transistor-per-chip barrier was crossed in the 1980s. Data points: 2,300 transistors, 1 MHz clock (Intel 4004), 1971; 16 million transistors (UltraSPARC III); 42 million transistors, 2 GHz clock (Intel Xeon), 2001; 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4), 2004; 140 million transistors (HP PA-8500). Tbyte = 2^40 bytes (or 10^12 bytes). Note that Moore’s law is not a prediction about speed but about chip complexity.
Moore’s Law. “Cramming More Components onto Integrated Circuits,” Gordon Moore, Electronics, 1965. The number of transistors on a cost-effective integrated circuit doubles every 18 months.
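The doubling rule can be turned into a small projection. The sketch below is illustrative only: the function name `projected_transistors` is hypothetical, the starting point (2,300 transistors in 1971, the Intel 4004) is taken from the slides, and the doubling period is a parameter so that both the 18-month and the 24-month figures can be tried.

```python
def projected_transistors(start_count: float, start_year: int,
                          target_year: int, doubling_months: float) -> float:
    """Project a transistor count forward assuming a fixed doubling period."""
    elapsed_months = (target_year - start_year) * 12
    doublings = elapsed_months / doubling_months
    return start_count * 2 ** doublings


# Starting from the Intel 4004 data point on the slide (2,300 transistors, 1971).
for months in (18, 24):
    count = projected_transistors(2300, 1971, 2001, months)
    print(f"doubling every {months} months -> {count:,.0f} transistors in 2001")
```

Over the 30 years from 1971 to 2001, a 24-month period gives 15 doublings (about 75 million transistors) and an 18-month period gives 20 doublings (about 2.4 billion). The slides list 42 million transistors for the 2001 Intel Xeon, so for this pair of data points the slower rate is the closer fit.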
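Figure 1.14: The chip manufacturing process. After being sliced from the silicon ingot, blank wafers are put through 20 to 40 steps to create patterned wafers (see Figure 1.15 on page 28). These patterned wafers are then tested with a wafer tester, and a map of the good parts is made. The wafers are then diced into dies (see Figure 1.9 on page 19). In this figure, one wafer produced 20 dies, of which 17 passed testing (X marks a bad die). The yield of good dies in this case was 17/20, or 85%. The good dies are then bonded into packages and tested one more time before being shipped to customers. In this example, one packaged part was found to be bad in the final test. Stages shown in the figure: blank wafers, 20 to 40 processing steps, patterned wafers, wafer tester, tested wafer, dicer, tested dies, bond die to package, packaged dies, part tester, tested packaged dies, ship to customers.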
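The yield figure in the caption is a simple ratio of good dies to total dies. A minimal sketch, assuming a hypothetical helper named `die_yield` and using the 17-of-20 numbers from Figure 1.14:

```python
def die_yield(good_dies: int, total_dies: int) -> float:
    """Return the fraction of dies on a wafer that pass testing."""
    if total_dies <= 0:
        raise ValueError("total_dies must be positive")
    if not 0 <= good_dies <= total_dies:
        raise ValueError("good_dies must be between 0 and total_dies")
    return good_dies / total_dies


# Numbers from Figure 1.14: 20 dies on the wafer, 17 pass the wafer test.
print(f"yield = {die_yield(17, 20):.0%}")
```

The same function applies to the final package test; the caption only states that one packaged part failed there, so no second ratio is computed here.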
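Figure 1.15: An 8-inch (200 mm) wafer containing Intel Pentium 4 processors. At 100% yield the wafer holds 165 Pentium dies. Figure 1.9 on page 19 is a photomicrograph of one of these Pentium 4 dies. The die area is 250 mm², and it contains 55 million transistors. This die uses a 0.18 micron technology, which means that the smallest transistors are approximately 0.18 microns in size, although they are typically somewhat smaller than the actual feature size, which refers to the size of the transistors as drawn versus the final manufactured size. The Pentium 4 is also made using a more advanced 0.13 micron technology. The several dozen partial dies at the edge of the wafer are useless; they are included because it makes the masks used to pattern the wafer easier to design. (Micrometre, µm: 10^-6 m.)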
Figure 1.16: An Intel Pentium 4 (3.06 GHz) chip mounted on a heat sink, which must dissipate the 82 watts of heat generated by the chip.
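Figure 1.12: Relative performance per unit cost of technologies used in computers over time. Source: Computer Museum, Boston; the 2005 figure is the authors' extrapolation. Columns: year; technology used in computers; relative performance per unit cost. Rows: vacuum tube: 1; transistor: 35; integrated circuit: 900; very large-scale integrated circuit: 2,400,000; 2005, ultra large-scale integrated circuit: 6,200,000,000.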
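(Axis label: performance.) Figure 1.17: Growth in workstation performance, 1978 to 2003. Performance is expressed as approximately how many times faster than the VAX-11/780, a commonly used yardstick. The performance growth rate is between 1.5 and 1.6 times per year. These performance numbers are based on SPECint (see Chapter 2), adjusted over time to account for changes in the benchmark programs. For the x/y listed after each processor name, x is the model number and y is the speed in MHz.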
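(Axis labels: Kbit capacity; year of introduction.) Figure 1.13: Growth of capacity per DRAM chip over time. The y-axis is measured in Kbits, where K = 1024. For 20 years the DRAM industry quadrupled capacity almost every 3 years, equivalent to about 60% per year. This "four times every three years" estimate was called the DRAM growth rule. In recent years the rate has slowed and is closer to doubling every two years, or quadrupling every four years.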
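The link between "four times every three years" and "about 60% per year" is a compound-growth calculation. A minimal sketch, assuming a hypothetical helper named `annual_growth_rate`:

```python
def annual_growth_rate(factor: float, years: float) -> float:
    """Convert 'grows by `factor` every `years` years' to a per-year rate."""
    return factor ** (1.0 / years) - 1.0


# DRAM growth rule: 4x every 3 years.
print(f"4x / 3 years -> {annual_growth_rate(4, 3):.1%} per year")
# Slower recent trend: 2x every 2 years (the same rate as 4x every 4 years).
print(f"2x / 2 years -> {annual_growth_rate(2, 2):.1%} per year")
```

Quadrupling every three years works out to roughly 58.7% per year, which the caption rounds to 60%. Doubling every two years is roughly 41.4% per year.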
Disks: Archaic (Nostalgic) vs. Modern (Newfangled). CDC Wren I, 1983: 3600 RPM; 0.03 GBytes capacity; tracks/inch: 800; bits/inch: 9,550; three 5.25” platters; bandwidth: 0.6 MBytes/sec; latency: 48.3 ms; cache: none. Seagate 373453, 2003: 15,000 RPM (4X); 73.4 GBytes (2500X); tracks/inch: 64,000 (80X); bits/inch: 533,000 (60X); four 2.5” platters (in 3.5” form factor); bandwidth: 86 MBytes/sec (140X); latency: 5.7 ms (8X); cache: 8 MBytes.
Latency Lags Bandwidth (for last ~20 years). Performance milestones (latency improvement, bandwidth improvement): Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x). Disk: 143X BW, 8X latency. (Latency = simple operation w/o contention; BW = best case.)
Memory: Archaic (Nostalgic) vs. Modern (Newfangled). 1980 DRAM (asynchronous): 0.06 Mbits/chip; 64,000 xtors, 35 mm²; 16-bit data bus per module, 16 pins/chip; 13 Mbytes/sec; latency: 225 ns; no block transfer. 2000 Double Data Rate Synchronous (clocked) DRAM: 256 Mbits/chip (4000X); 256,000,000 xtors, 204 mm²; 64-bit data bus per DIMM, 66 pins/chip (4X); 1600 Mbytes/sec (120X); latency: 52 ns (4X); block transfers (page mode).
Latency Lags Bandwidth (last ~20 years). Performance milestones (latency improvement, bandwidth improvement): Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x). Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x). DRAM: 120X BW, 4X latency. The jog between the 3rd and 4th points is there because a lot of time passed between 32b fast page mode and 64b. (Latency = simple operation w/o contention; BW = best case.)
LANs: Archaic (Nostalgic) vs. Modern (Newfangled). Ethernet 802.3, year of standard 1978: 10 Mbits/s link speed; latency: 3000 μsec; shared media; coaxial cable. Ethernet 802.3ae, year of standard 2003: 10,000 Mbits/s link speed (1000X); latency: 190 μsec (16X); switched media; Category 5 copper wire. Coaxial cable: copper core, insulator, braided outer conductor, plastic covering. Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect; "Cat 5" is 4 twisted pairs in a bundle.
Latency Lags Bandwidth (last ~20 years). Performance milestones (latency improvement, bandwidth improvement): Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x). Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x). Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x). Network: 1000X BW, 16X latency. (Latency = simple operation w/o contention; BW = best case.)
CPUs: Archaic (Nostalgic) vs. Modern (Newfangled). 1982 Intel 80286: 12.5 MHz; 2 MIPS (peak); latency 320 ns; 134,000 xtors, 47 mm²; 16-bit data bus, 68 pins; microcode interpreter, separate FPU chip; no caches. 2001 Intel Pentium 4: 1500 MHz (120X); 4500 MIPS (peak) (2250X); latency 15 ns (21X); 42,000,000 xtors, 217 mm²; 64-bit data bus, 423 pins; 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution; on-chip 8 KB data cache, 96 KB instruction trace cache, 256 KB L2 cache.
Latency Lags Bandwidth (last ~20 years). Performance milestones (latency improvement, bandwidth improvement): Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x). Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x). Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x). Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x). CPU high, memory low (“Memory Wall”). Processor: 2250X BW, 21X latency.
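The improvement factors in these milestone slides can be recomputed from the archaic and modern figures given on the comparison slides. This is a minimal sketch; the table name `MILESTONES` and the function `improvement_ratios` are hypothetical, and the input numbers are copied from the disk, memory, LAN, and CPU slides. Each pair uses the same unit for old and new values, so the units cancel in the ratios.

```python
# technology: (old bandwidth, new bandwidth, old latency, new latency)
# Values are copied from the comparison slides; units match within each pair.
MILESTONES = {
    "Disk":      (0.6, 86.0, 48.3, 5.7),         # MBytes/sec, ms
    "Memory":    (13.0, 1600.0, 225.0, 52.0),    # Mbytes/sec, ns
    "Ethernet":  (10.0, 10000.0, 3000.0, 190.0), # Mbits/s, slide latency units
    "Processor": (2.0, 4500.0, 320.0, 15.0),     # peak MIPS, ns
}


def improvement_ratios(old_bw, new_bw, old_latency, new_latency):
    """Return (bandwidth improvement, latency improvement) as ratios > 1."""
    return new_bw / old_bw, old_latency / new_latency


for name, figures in MILESTONES.items():
    bw_gain, latency_gain = improvement_ratios(*figures)
    print(f"{name:9s} bandwidth {bw_gain:7.0f}X   latency {latency_gain:4.1f}X")
```

In every row the bandwidth ratio is far larger than the latency ratio, which is the point of the slide title.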
Computing Devices Then… EDSAC, University of Cambridge, UK, 1949
Computing Devices Now: sensor nets, cameras, games, set-top boxes, media players, laptops, servers, robots, routers, smart phones, automobiles, supercomputers. CS152-Spring’08