1
Ch 6: CPU - Datapath and Control
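Central Processing Unit: Datapath and Control
Lecture 1: Designing a single-cycle datapath
Lecture 2: Designing a single-cycle controller
Lecture 3: Designing a multi-cycle processor
Lecture 4: Microprogrammed controller design and exception handling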
2
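Lecture 1: Designing a Single-Cycle Datapath
Outline:
CPU functions and their relationship to computer performance
Where the datapath sits
Designing a single-cycle datapath
Function and implementation of the datapath
  Operational elements (combinational logic components)
  State / storage elements (sequential logic components)
Datapath timing
Choosing a subset of the MIPS instruction set as the implementation target
Next-instruction address calculation and the instruction fetch unit
Datapath for R-type instructions
Datapath for memory-access instructions
Datapath for immediate-operand instructions
Datapath for branch and jump instructions
Combining the datapaths of all instructions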
3
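CPU Functions and Their Relationship to Computer Performance
How the CPU executes an instruction:
Fetch phase:
  Fetch the instruction
  Send PC + "1" to the PC
Decode and execute phase:
  Decode the instruction
  Compute the main-memory address
  Fetch the operands
  Perform the arithmetic / logic operation
  Store the result
  Check for "exception" events; if one occurred, switch automatically to the exception handler
  Check for "interrupt" requests; if one is pending, go to interrupt handling
Questions:
  Must "fetch the instruction" always come first?
  Must "PC + 1" always happen before decoding?
  Must decoding happen before the instruction executes?
  What is the difference between an "exception" and an "interrupt"?
Basic functions of the CPU:
(1) Control the order in which instructions execute
(2) Control the operations each instruction performs
(3) Control the timing of operations
(4) Perform operations on data
(5) Control memory accesses and I/O accesses
(6) Handle exceptions and interrupts
CPU implementation and computer performance:
Performance (how fast a program runs) is determined by three key factors: instruction count, CPI, and clock cycle time.
The instruction count is determined by the compiler and the instruction set.
The clock cycle time and CPI are determined by the CPU implementation.
CPU design and implementation therefore matter a great deal: they directly affect the computer's performance.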
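The three performance factors combine as execution time = instruction count × CPI × clock cycle time. The sketch below is only an illustration: the function name and all numbers are hypothetical and do not come from the slides.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, clock_period_s: float) -> float:
    """Execution time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_period_s


# Hypothetical designs running the same program (same instruction count).
# Design A: CPI = 1 but a long clock cycle (10 ns).
# Design B: CPI = 4 but a short clock cycle (2 ns).
time_a = cpu_time_seconds(1_000_000, 1.0, 10e-9)
time_b = cpu_time_seconds(1_000_000, 4.0, 2e-9)

print(f"A: {time_a * 1e3:.1f} ms, B: {time_b * 1e3:.1f} ms")
```

With these made-up numbers the design with the higher CPI still finishes sooner, which is why CPI and cycle time have to be weighed together.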
4
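The Four Basic Operations That Make Up an Instruction
The function of every instruction is implemented with the following four basic operations:
(1) Read the contents of a main-memory location and load them into a register.
(2) Store data from a register into a given main-memory location.
(3) Transfer data from one register to another register or to the ALU.
(4) Perform an arithmetic or logic operation and place the result in a register.
These operations can be described formally. The description language is called Register Transfer Language (RTL). The RTL conventions used in this chapter are:
(1) R[r] denotes the contents of register r.
(2) M[addr] denotes the contents of main-memory location addr.
(3) The direction of transfer is written with "←": the source is on the right and the destination on the left.
(4) The program counter is written PC, which denotes its contents.
(5) OP[data] denotes applying operation OP to data.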
5
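Block Diagram of Basic CPU Organization
[Figure: the CPU, divided into a control unit and an execution unit.]
The CPU consists of an execution unit and a control unit.
The control unit consists of an instruction decoder and control-signal generation logic.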
6
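Where the Datapath Sits
The five components of a computer: Input, Output, Memory, and the Control and Datapath inside the CPU.
[Figure: the CPU (Control + Datapath) connected to Memory, Input, and Output.]
What is the datapath?
The path that data travels during instruction execution, including the components along that path. It is the part of the CPU that executes instructions.
What does the control unit do?
It decodes each instruction and generates the corresponding control signals, which direct the actions of the datapath. It is the part of the CPU that controls instruction execution.
Before we go any further, let's step back for a second and take a look at the big picture. All computers consist of five components: (1) input and (2) output devices, (3) the memory system, and the (4) control and (5) datapath of the processor. Today's lecture covers the datapath design. In the next lecture, I will show you how to design the processor's control unit.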
7
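Basic Structure of the Datapath
The datapath is built from two kinds of components:
  Combinational logic elements (also called operational elements)
  Storage elements (also called state elements)
Elements are connected in one of two ways:
  Bus-based connection
  Dedicated (point-to-point) connection
How is the datapath built? From operational elements and storage elements, connected by buses or by dedicated links.
What does the datapath do? It stores, processes, and transfers data.
In short, the datapath is the path, formed by operational and storage elements connected by buses or dedicated links, along which data is stored, processed, and transferred.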
8
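Operational Elements: Combinational Logic Circuits
Adder: 32-bit inputs A and B plus CarryIn; outputs Sum (32 bits) and Carry.
Multiplexer (MUX): 32-bit inputs A and B, control signal Select, 32-bit output Y. The one shown is 2-to-1; a MUX can also select among more inputs.
ALU: 32-bit inputs A and B, control signal OP; outputs Result (32 bits) and Zero.
Decoder: 3-bit input; outputs out0, out1, out2, ..., out7.
Questions: What control signals does an adder need? When do we need an adder, an ALU, a MUX, or a decoder?
Based on the Register Transfer Language examples we have so far, we know we will need the following combinational logic elements. We will need an adder to update the program counter, a MUX to select the results, and finally an ALU to do various arithmetic and logic operations.
Properties of combinational logic elements:
The output depends only on the current inputs: the same inputs always give the same output.
Timing: once all inputs have arrived, the output changes after some gate delay and holds until the next change. No clock signal is needed for timing.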
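As a rough behavioural sketch of these four elements, each can be modelled as a pure function, since a combinational element's output depends only on its current inputs. The function names and the string encoding of the ALU operation are assumptions made for illustration; the slides only name the ports.

```python
MASK32 = 0xFFFFFFFF  # all values are treated as 32-bit words


def adder(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """Return (Sum, Carry) for a 32-bit addition."""
    total = (a & MASK32) + (b & MASK32) + (carry_in & 1)
    return total & MASK32, total >> 32


def mux2(select: int, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: Select = 0 passes in0, Select = 1 passes in1."""
    return in1 if select else in0


def decoder3to8(code: int) -> list[int]:
    """3-to-8 decoder: exactly one of out0..out7 is 1."""
    return [1 if i == (code & 0b111) else 0 for i in range(8)]


def alu(op: str, a: int, b: int) -> tuple[int, int]:
    """Return (Result, Zero). The op names 'add', 'sub', 'or' are illustrative."""
    if op == "add":
        result = (a + b) & MASK32
    elif op == "sub":
        result = (a - b) & MASK32
    elif op == "or":
        result = (a | b) & MASK32
    else:
        raise ValueError(f"unsupported ALU operation: {op}")
    return result, int(result == 0)
```

For example, `alu("sub", 5, 5)` yields a zero result with the Zero flag set, which is the comparison a branch-on-equal instruction relies on.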
9
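State Elements: Sequential Logic Circuits
Properties of state (storage) elements:
They store information: under clock control the input is written into the circuit and held until the next clock arrives.
The clock determines when the input is written; the output can be read at any time.
Timing methodology: specifies when a signal is written into a state element and when it is read from one.
Edge-triggered methodology: the value in a state element changes only on a clock edge, once per clock cycle.
  Rising-edge triggered: reads/writes take place on the positive clock transition.
  Falling-edge triggered: reads/writes take place on the negative clock transition.
The simplest state element (a review from the digital logic course):
  D flip-flop: one clock input, one state input, one state output.
[Figure: clock waveform marking the cycle time, the rising edge, and the falling edge.]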
10
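When Does the State of a Storage Element Change?
[Figure: D flip-flop timing diagram.]
Between clock edges, changes on D do not affect Q.
Q follows D only after the clock-to-Q delay (latch propagation delay).
Remember: the input of a state element is written into the element only at "Clk-to-Q" after a clock edge arrives, and only then does the output reflect the new state value.
The datapath contains two kinds of state elements: registers (the register file) and memory.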
11
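Types of Registers
A register is built from N flip-flops. There are many different types of register:
  Temporary register built from latches: has a "write enable" signal
  Register connected to a bus, with a tri-state gate on its output: has a "tri-state enable" signal
  Register with reset (clear to 0): has a "reset" signal
  Register with counting (auto-increment): may have an "increment" signal
  Register with shifting: has a "shift" signal
  Register combining several of these functions: has several control signals
A register file consists of a number of registers.
  It usually has two read ports and one write port.
  It may have a clock input, which controls when the input is written into a register.
After one clock-to-Q delay, the input signal becomes valid at the register's output.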
12
Storage Elements: Register and Register File
Register
  Ports: N-bit Data In, N-bit Data Out, Clk, Write Enable.
  It has a Write Enable (WE) signal:
  0: when the clock edge arrives, the output does not change
  1: when the clock edge arrives, the output starts to take the value of the input
  If the register is written on every clock edge, no WE signal is needed.
Register File
  Ports: 5-bit register numbers RW, RA, RB; 32-bit busA, busB, busW; Clk; Write Enable. It holds 32 32-bit registers.
  Two read ports (combinational operation): RA and RB supply the addresses for busA and busB. Once RA or RB is valid, busA and busB become valid after one access time.
  One write port (sequential operation): when Write Enable is 1 and the clock edge arrives, the value on busW starts to be written into the register selected by RW.
As far as storage elements are concerned, we will need an N-bit register that is similar to the D flip-flop I showed you in class. The significant difference here is that the register has a Write Enable input. That is, the content of the register will NOT be updated if Write Enable is zero. The content is updated at the clock tick ONLY if the Write Enable signal is set to 1.
For a more detailed diagram of a register, see pages B-22 to B-25.
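The read/write behaviour of the register file can be sketched as a small class: reads are plain lookups (combinational), while a write takes effect only when a clock edge is modelled and Write Enable is 1. The class and method names are hypothetical, and access time and clock-to-Q delay are deliberately not modelled.

```python
class RegisterFile:
    """Behavioural model of a register file with 32 32-bit registers."""

    def __init__(self) -> None:
        self.regs = [0] * 32

    def read(self, ra: int, rb: int) -> tuple[int, int]:
        """Two read ports: return the values placed on (busA, busB)."""
        return self.regs[ra & 0x1F], self.regs[rb & 0x1F]

    def clock_edge(self, write_enable: int, rw: int, bus_w: int) -> None:
        """One write port: on a clock edge, write busW to register RW if enabled."""
        if write_enable:
            self.regs[rw & 0x1F] = bus_w & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_enable=0, rw=3, bus_w=99)   # WE = 0: nothing is written
before = rf.read(3, 0)
rf.clock_edge(write_enable=1, rw=3, bus_w=99)   # WE = 1: register 3 is updated
after = rf.read(3, 0)
```

Separating `read` from `clock_edge` mirrors the slide's point that the clock matters only for writes.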
13
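Internal Structure of the Register File
[Figure: a decoder driven by RW and Write Enable selects which of Register 0 to Register 31 captures busW on Clk; two 32-to-1 multiplexers, selected by RA and RB, drive busA and busB.]
Each register consists of 32 flip-flops.
Input data comes from busW; data read out goes to busA and busB.
The WriteEnable signal controls whether a new value is written.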
14
Storage Elements: Idealized Memory
To simplify the explanation of datapath operation, memory is reduced here to an idealized model with a clock signal Clk.
Idealized memory
  Data Out: 32-bit read data
  Data In: 32-bit write data
  Address: one 32-bit address shared by reads and writes
  Other inputs: Write Enable, Clk
  Read (combinational operation): once Address is valid, the data on Data Out becomes valid after one access time.
  Write (sequential operation): when Write Enable is 1 and the Clk edge arrives, the value on Data In starts to be written into the memory location selected by Address.
The last storage element you will need for the datapath is the idealized memory to store your data and instructions. This idealized memory block has just one input bus (DataIn) and one output bus (DataOut). When Write Enable is 0, the address selects the memory word to put on the Data Out bus. When Write Enable is 1, the address selects the memory word to be written via the DataIn bus at the next clock tick. Once again, the clock input is a factor ONLY during the write operation. During a read operation, the memory behaves as a combinational logic block. That is, if you put a valid value on the address lines, the output bus DataOut becomes valid after the access time of the memory.
15
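Datapath and Timing Control
Synchronous system
  All actions are timed by dedicated timing signals.
  The timing signals dictate when each action is issued.
  For example, every step of instruction execution is governed by control signals, and timing signals determine when each control signal is issued and how long it lasts.
What is a timing signal?
  The signal a synchronous system uses for synchronization.
What is the instruction cycle?
  The time taken to fetch and execute one instruction.
  Different instructions may have different instruction cycles.
The three-level timing system of early computers
  Machine cycle, beat, pulse.
  The instruction cycle was divided into several basic working cycles, such as fetching the instruction, reading operands, and executing and writing the result. These were called machine cycles.
  Machine cycles came in different types: instruction fetch, memory read, memory write, interrupt acknowledge, and so on.
Modern computers no longer use a three-level timing system, and the notion of a machine cycle has gradually disappeared.
The only timing signal in the whole datapath is the clock; one clock cycle is one beat.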
16
Datapath and Timing Control: The Clock Cycle in Modern Computers
[Figure: two state elements clocked by Clk with combinational logic between them; the waveform marks the Setup and Hold windows around each clock edge and the interval in which register inputs may change.]
The datapath has the form "... + state element + operational element (combinational circuit) + state element + ...".
Only state elements can store information. Every operational element takes its inputs from state elements and writes its output into a state element. Its inputs are the data produced in the previous clock cycle, and its output is the data used in the current clock cycle.
Assume falling-edge triggering (rising-edge triggering works as well).
All state elements are written on the falling edge, and their outputs become valid after Latch Prop (clk-to-Q).
Cycle Time = Latch Prop + Longest Delay Path + Setup + Clock Skew (maximum skew)
Constraint: (Latch Prop + Shortest Delay Path - Clock Skew) > Hold Time
Remember, we will be using a clocking methodology where all storage elements are clocked by the same clock edge. Consequently, our cycle time will be the sum of: (a) the clock-to-Q (or latch propagation) time of the input registers, (b) the longest delay path through the combinational logic block, (c) the setup time of the output register, and (d) the clock skew. To avoid a hold time violation, you have to make sure the inequality above is fulfilled.
Why use an edge-triggered clocking methodology? It is simpler to explain than a level-triggered one.
Clock skew: the difference in absolute time between the moments when two state elements see a clock edge. It arises because the clock signal often follows different paths, with slightly different delays, to reach two different state elements. Clock skew may let new inputs race forward to the next flip-flop, leading to incorrect operation (see Fig. B.31 on page B-41).
Clock-to-Q (or latch propagation): the propagation time of a signal through a flip-flop from the clock to the output Q, hence the name.
Setup time / hold time of a flip-flop: the minimum time during which the input must be valid (stable) before / after the clock edge (page B-24).
(Latch Prop + Shortest Delay Path - Clock Skew) > Hold Time: if this does not hold, the race problem shown in Fig. B.31 occurs.
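The cycle-time formula and the hold-time constraint can be checked numerically. The sketch below works in integer picoseconds; the function names and every delay value are hypothetical, chosen only to exercise the two expressions.

```python
def min_cycle_time(latch_prop: int, longest_path: int, setup: int, skew: int) -> int:
    """Cycle Time = Latch Prop + Longest Delay Path + Setup + Clock Skew."""
    return latch_prop + longest_path + setup + skew


def hold_constraint_met(latch_prop: int, shortest_path: int, skew: int, hold: int) -> bool:
    """(Latch Prop + Shortest Delay Path - Clock Skew) > Hold Time."""
    return (latch_prop + shortest_path - skew) > hold


# Hypothetical delays in picoseconds.
cycle = min_cycle_time(latch_prop=50, longest_path=800, setup=30, skew=20)
safe = hold_constraint_met(latch_prop=50, shortest_path=10, skew=20, hold=30)
racy = hold_constraint_met(latch_prop=50, shortest_path=0, skew=30, hold=30)

print(cycle, safe, racy)
```

The second call fails the check because a direct register-to-register path with large skew lets new data arrive before the hold window closes.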
17
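Datapath of an Early Accumulator-Based Instruction Set
The simplest datapath structure.
Data path for instruction fetch:
  PC→MAR, Read M, M→MBR→IBR→IR
Data path for fetching the operand, operating on it, and storing the result:
  operand address→MAR, M→MBR→ALU input, AC→ALU input, ALU operation, ALU result→MBR, Write M
The IAS computer (designed by von Neumann and others) is the prototype of modern computers.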
18
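Single-Bus Datapath: Timing of the Four Basic Operations
Transfer data between general-purpose registers (1 cycle?):
  R0out, Yin
Perform an arithmetic or logic operation (3 cycles?):
  R1out, Yin
  R2out, Add, Zin
  Zout, R3in
Fetch a word from main memory, R[R2] ← M[R[R1]] (3 cycles?):
  R1out, MARin
  Read, WMFC (wait for MFC)
  MDRout, R2in
Write a word to main memory, M[R[R1]] ← R[R2] (3 cycles?):
  R2out, MDRin, Write, WMFC
How many clock cycles does each of these four operations need?
The CPU communicates with memory in one of two ways:
  Early machines: direct access to main memory (MM), "asynchronous", using the MFC acknowledge signal
  Today: cache first, then MM, "synchronous", no acknowledge needed
Question: how is the width of the clock cycle determined? By the time taken by "Riout, OP, Rjin" or by the time taken by "Read/Write"?
Read/Write takes longer, so it sets the clock cycle.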
19
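Three-Bus Datapath
A single bus can carry only one data item per clock, so instruction execution is very inefficient.
Several buses can be used so that different data move on different buses at the same time, which improves efficiency.
Example: a three-bus datapath
  Buses A and B carry the two source operands; bus C carries the result.
  The register file is dual-ported.
  The temporary registers Y and Z of the single-bus design can be dropped. Why? The three buses each carry different data and never conflict, so Y and Z are not needed.
How is R[R3] ← R[R1] op R[R2] implemented?
  R1outA, R2outB, op, R3inC
  One clock cycle (beat) is enough.
Most computers today execute instructions in a pipeline, and pipelined execution is hard to implement on the single-bus or three-bus datapaths above.
The rest of this lecture uses the MIPS instruction set to introduce the design of a non-bus-based CPU.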
20
Review: The Three MIPS Instruction Types
Do you remember the three types? R-Type, I-Type, and J-Type.
R-Type: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | func [5:0] (6 bits)
I-Type: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
J-Type: op [31:26] (6 bits) | target address [25:0] (26 bits)
ADD and SUBTRACT
  add rd, rs, rt
  sub rd, rs, rt
OR Immediate:
  ori rt, rs, imm16
LOAD and STORE
  lw rt, rs, imm16
  sw rt, rs, imm16
BRANCH:
  beq rs, rt, imm16
JUMP:
  j target
These instructions are representative: they include arithmetic and logic operations, register-register and register-immediate forms, memory accesses, and both conditional and unconditional transfers of control.
Goal of this lecture: implement the datapath for the 7 instructions above. The textbook implements 11 instructions; comparing the 7-instruction and 11-instruction datapaths helps in understanding the design principles.
In today's lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and jump. The Add and Subtract instructions use the R format. The Op and Func fields together specify the different kinds of add and subtract instructions. Rs and Rt specify the source registers, and the Rd field specifies the destination register. The Or Immediate instruction uses the I format. It uses only one source register, Rs; the other operand comes from the immediate field. The Rt field specifies the destination register. Both the load and store instructions use the I format, and both add Rs and the immediate field together to form the memory address. The difference is that the load instruction loads data from memory into Rt, while the store instruction stores the data in Rt into memory. The branch-on-equal instruction also uses the I format. Here Rs and Rt specify the registers to compare. If the two registers are equal, we branch to a location specified by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to the memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don't worry, this is just an overview. You will keep seeing these formats all day today.
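Because the three formats share the same bit positions for op, rs, and rt, one routine can pull every field out of a 32-bit instruction word with shifts and masks. This is a sketch: the function name and dictionary keys are illustrative, and the two example words were encoded by hand from the field layout (op = 000000 with func = 100000 for add, op = 100011 for lw).

```python
def decode(instr: int) -> dict[str, int]:
    """Split a 32-bit MIPS instruction word into all R/I/J-format fields."""
    return {
        "op":     (instr >> 26) & 0x3F,       # bits 31..26
        "rs":     (instr >> 21) & 0x1F,       # bits 25..21
        "rt":     (instr >> 16) & 0x1F,       # bits 20..16
        "rd":     (instr >> 11) & 0x1F,       # bits 15..11 (R-type only)
        "shamt":  (instr >> 6) & 0x1F,        # bits 10..6  (R-type only)
        "func":   instr & 0x3F,               # bits 5..0   (R-type only)
        "imm16":  instr & 0xFFFF,             # bits 15..0  (I-type only)
        "target": instr & 0x03FFFFFF,         # bits 25..0  (J-type only)
    }


add_fields = decode(0x00221820)   # add with rd = 3, rs = 1, rt = 2
lw_fields = decode(0x8D280004)    # lw with rt = 8, rs = 9, imm16 = 4
```

Every field is extracted unconditionally; which ones are meaningful depends on the format that the op field selects.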
21
Steps in Designing a Processor
Step 1: Analyze what each instruction does and express it in RTL (Register Transfer Language).
Step 2: From each instruction's function, work out the components it needs and how to interconnect them.
Step 3: Determine the value each component's control signals must take.
Step 4: Collect the control signals involved in all instructions into a table that relates instructions to control signals.
Step 5: Derive a logic expression for each control signal from the table and design the control circuit from it.
Processor design covers both datapath design and control design.
The datapath contains two kinds of elements:
  Operational elements: implemented with combinational logic
  Storage (state) elements: implemented with sequential logic
So let's design a processor. How and where do we start? Well, the best place to start is the processor's instruction set architecture. After all, the goal of your design is to execute the instructions in the instruction set correctly. What you need to do is describe each instruction's operation in register transfer language. By looking at the Register Transfer Language description of an instruction, you can figure out the datapath components you need and how to connect them. As I will show you, each datapath component has its own set of control signals. The last step of the processor design task is to design the control unit that generates the control signals for the datapath. So what do we mean by Register Transfer Language?
22
RTL: The ADD Instruction
add rd, rs, rt
Format: 000000 [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | 100000 [5:0] (6 bits)
M[PC]
  Fetch the instruction from the memory location the PC points to.
R[rd] ← R[rs] + R[rt]
  Read the registers selected by rs and rt, add the values, and place the result in the register selected by rd.
PC ← PC + 4
  Add 4 to the PC so that it points to the next instruction.
Here is an example. In terms of Register Transfer Language, this is what the Add instruction needs to do. First, you need to fetch the instruction from memory. Then you perform the actual add operation. And finally, you need to update the program counter to point to the next instruction.
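The three RTL steps of add can be followed one by one on a toy machine state. This is only a sketch: the state layout (a dictionary with "PC" and a register list "R") and the word-per-address instruction memory are assumptions made for illustration, not the hardware organization described in the slides.

```python
MASK32 = 0xFFFFFFFF


def step_add(state: dict, instr_mem: dict[int, int]) -> None:
    """Execute one add instruction, following its RTL description."""
    instr = instr_mem[state["PC"]]                 # M[PC]: fetch the instruction
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    rd = (instr >> 11) & 0x1F
    regs = state["R"]
    regs[rd] = (regs[rs] + regs[rt]) & MASK32      # R[rd] <- R[rs] + R[rt]
    state["PC"] = (state["PC"] + 4) & MASK32       # PC <- PC + 4


# Hypothetical initial state: R[1] = 5, R[2] = 7, one add instruction at address 0.
state = {"PC": 0, "R": [0] * 32}
state["R"][1], state["R"][2] = 5, 7
instr_mem = {0: 0x00221820}                        # add with rd = 3, rs = 1, rt = 2
step_add(state, instr_mem)
```

After the step, register 3 holds 12 and the PC has advanced to 4.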
23
RTL: The Load Instruction
lw rt, rs, imm16
Format: 100011 [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
M[PC]
  (same as for the add instruction)
Addr ← R[rs] + SignExt(imm16)
  Compute the data address (the immediate must be sign-extended).
R[rt] ← M[Addr]
  Read the data from memory and load it into the register.
PC ← PC + 4
  (same as for the add instruction)
Here is another example. The load instruction also starts off by fetching the instruction from instruction memory. Then you calculate the memory address, use the address to fetch the data from memory (M[Addr]), and load the data into the register. Finally, you need to update the PC to point to the next sequential instruction.
24
The Critical Path in the Datapath (Load Operation)
Load operation: R[Rt] ← M[R[Rs] + Imm16]
Remember how the register file and the ideal memory are timed:
  For writes they behave as sequential logic: the input must be set up before the clock arrives, and the written data reaches the output one "Clk-to-Q" after the edge.
  For reads they behave as combinational logic: the output starts to be valid one access time after the address is valid.
Critical Path (Load Operation) =
  PC's prop time
  + Instruction Memory's Access Time
  + Register File's Access Time
  + ALU to Perform a 32-bit Add
  + Data Memory Access Time
  + Setup Time for Register File Write
  + Clock Skew
[Figure: abstract datapath. The PC drives the Instruction Address of the Ideal Instruction Memory; the instruction bus supplies Rd, Rs, Rt (5 bits each) and Imm (16 bits) to the register file (32 32-bit registers, ports Rw, Ra, Rb); the ALU output is the Data Address of the Ideal Data Memory (Data In, DataOut, 32 bits each); the PC, register file, and data memory are clocked by Clk.]
Now with the clocking methodology back in your mind, we can think about what the critical path of our "abstract" datapath may look like. One thing to keep in mind about the register file and the ideal memory (both instruction and data) is that the clock input is a factor ONLY during a write operation. For a read operation, the clock input is not a factor: the register file and the ideal memory behave as if they were combinational logic. That is, you apply an address to the input, and after a certain delay, which we call the access time, the output is valid. We will come back to these points later in this lecture. For now, let's look at this abstract datapath's critical path, which occurs when the datapath executes the Load instruction. The time it takes to execute the load instruction is the sum of: (a) the PC's clock-to-Q time, (b) the instruction memory access time, (c) the time it takes to read the register file, (d) the ALU delay in calculating the data memory address, (e) the time it takes to read the data memory, and (f) the setup time for the register file plus the clock skew.
For clock skew, see page B-40.
25
Instruction Fetch Unit
Operations common to every instruction:
  Fetch the instruction: M[PC]
  Update the PC: PC ← PC + 4
  On a transfer of control (branch or jump), the PC is updated again, to the target address.
Order: fetch the instruction first, then change the PC (an implementation may do both in parallel). Never change the PC first and then fetch the instruction.
[Figure: instruction fetch unit. The PC, clocked by Clk, supplies the Address to the Instruction Memory, which outputs the 32-bit Instruction Word; the Next Address Logic computes the next PC value.]
After the fetch, each instruction does something different, so information flows through the datapath differently for each one. The following slides design the datapath for each instruction in turn.
Now let's take a look at the first major component of the datapath: the instruction fetch unit. The common RTL operations for all instructions are: (a) fetch the instruction using the program counter (PC) at the beginning of an instruction's execution (PC -> Instruction Memory -> Instruction Word); (b) then, at the end of the instruction's execution, update the program counter (PC -> Next Address Logic -> PC). More specifically, you need to increment the PC by 4 if you are executing sequential code. For branch and jump instructions, you need to update the program counter to something other than PC plus 4. I will show you what is inside this Next Address Logic block when we talk about the branch and jump instructions. For now, let's focus our attention on the Add and Subtract instructions.
26
Add and Subtract Instructions (R-type)
Target (7 instructions): add, sub, ori, lw, sw, beq, j.
1. First consider the add and sub instructions (representative of the R-Type instructions).
27
Datapath for RR (R-type) Instructions
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | funct [5:0] (6 bits)
Function: R[rd] ← R[rs] op R[rt]
Example: add rd, rs, rt
Leaving the common operations aside, the datapath for the execute phase of an R-Type instruction is as follows:
[Figure: register file (32 32-bit registers) with 5-bit inputs Ra, Rb, Rw wired to rs, rt, rd; busA and busB (32 bits each) feed the ALU, controlled by ALUctr (add/sub); the 32-bit Result returns on busW; RegWr and Clk control the write.]
Ra, Rb, and Rw correspond to the instruction's rs, rt, and rd fields.
What should the control signals be for "add rd, rs, rt"? ALUctr=add, RegWr=1
ALUctr and RegWr are control signals produced after the instruction is decoded.
And here is the datapath that can do the trick. First of all, we connect the register file's Ra, Rb, and Rw inputs to the Rs, Rt, and Rd fields of the instruction bus. Then we connect busA and busB of the register file to the ALU. Finally, we connect the output of the ALU to the input bus of the register file. Conceptually, this is how it works. The instruction bus coming out of the instruction memory sets Ra and Rb to the register specifiers Rs and Rt. This causes the register file to put the value of register Rs onto busA and the value of register Rt onto busB. By setting ALUctr appropriately, the ALU performs either the add or the subtract for us. The result is then fed back to the register file, where the register specifier Rw should already be set to the instruction bus's Rd field. Since the control, which we will design in the next lecture, should already have set the RegWr signal to 1, the result is written back to the register file at the next clock tick.
28
Logic Instruction with an Immediate (the ori Instruction)
Target (7 instructions): add, sub, ori, lw, sw, beq, j.
2. Consider the ori instruction (representative of the I-Type instructions and of the logic instructions).
29
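RTL: The OR Immediate Instruction
ori rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
This is a logic operation, so the immediate is treated as a logical (unsigned bit-pattern) value.
M[PC]
  Fetch the instruction (common operation, done by the fetch unit).
R[rt] ← R[rs] or ZeroExt(imm16)
  Zero-extend the immediate and OR it with the contents of rs.
PC ← PC + 4
  Compute the next address (common operation, done by the fetch unit).
Zero extension, ZeroExt(imm16): bits [31:16] are filled with 16 zeros; bits [15:0] hold the 16-bit immediate.
Or Immediate is an I-type instruction. The immediate field of the instruction (imm16 in the format diagram) is zero-extended to 32 bits before it is combined with the other operand. The other operand is selected by the Rs field of the instruction. The destination register of this instruction is selected by the Rt field.
Think about it: which components and wires must be added to the previous datapath, and which control signals are needed to control them?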
30
Datapath for Logic Instructions with an Immediate
R[rt] ← R[rs] op ZeroExt(imm16)
Example: ori rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
[Figure: the R-type datapath with the added parts shown in blue: a RegDst multiplexer that selects between Rt and Rd for the register file's Rw input, a ZeroExt unit that widens imm16 from 16 to 32 bits, and an ALUSrc multiplexer that selects between busB and the extended immediate as the second ALU operand. Ra is driven by Rs; Rb is a don't-care (Rt).]
Why are the blue parts needed? An R-Type instruction writes its result to Rd and takes its second operand from busB, whereas ori writes to Rt and takes its second operand from the immediate.
What are the control signals for ori? RegDst=?; RegWr=?; ALUctr=?; ALUSrc=?
Control signals for ori: RegDst=1; RegWr=1; ALUSrc=1; ALUctr=or
Here is the datapath for the Or Immediate instruction. We cannot use the Rd field for Rw here because this instruction format has no Rd field: the bits that form Rd in the R-type format are part of the immediate field. For this instruction type, the Rw input of the register file, that is, the address of the register to be written, comes from the Rt field of the instruction. Recall from the earlier slide that for R-type instructions Rw comes from the Rd field. That is why we need a MUX here, to put Rd onto Rw for R-type instructions and Rt onto Rw for I-type instructions. Since the second operand of this instruction is the immediate field zero-extended to 32 bits, we also need a MUX to block off busB from the register file. Since busB is blocked off by the MUX, the value on busB is a don't-care, so we do not have to worry about what ends up on the register file's Rb register specifier. To keep things simple, we may as well keep it the same as for the R-type instructions and put the Rt field there. To summarize, this is how the datapath works. With Rs on the register file's Ra input, busA gets the value of Rs as the first ALU operand. The second operand comes from the immediate field of the instruction. Once the ALU completes the OR operation, the result is written into the register specified by the instruction's Rt field.
31
Memory-Access Instructions: the Load Instruction (lw)
Target (7 instructions): add, sub, ori, lw, sw, beq, j.
3. Consider the lw instruction (representative of the memory-access instructions).
32
RTL: The Load Instruction
lw rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
The immediate is in two's complement.
M[PC]
  Fetch the instruction (common operation, done by the fetch unit).
Addr ← R[rs] + SignExt(imm16)
  Compute the memory address (sign extension!).
R[rt] ← M[Addr]
  Load the data into register rt.
PC ← PC + 4
  Compute the next address (common operation, done by the fetch unit).
Sign extension (why not zero extension?): bits [31:16] are filled with 16 copies of bit 15 of the immediate, all 0s when the immediate is non-negative and all 1s when it is negative; bits [15:0] hold the 16-bit immediate.
Like the Or Immediate instruction I just showed you, the load instruction uses the I format. But unlike Or Immediate, the immediate field (imm16 in the format diagram) is sign-extended instead of zero-extended. That is, we replicate the most significant bit of the immediate 16 times to the left to form a 32-bit value. This sign-extended value is then added to the register selected by the Rs field of the instruction to form the memory address. The memory address is then used to load the value into the register specified by the Rt field of the instruction.
Think about it: which components and wires must be added to the existing datapath, and which control signals are needed to control them?
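The difference between the two extensions shows up as soon as the immediate has its top bit set. In the sketch below the function names are illustrative; the example uses the 16-bit pattern 0xFFFC, which is -4 in two's complement, as a load offset from a hypothetical base address of 0x1000.

```python
MASK32 = 0xFFFFFFFF


def zero_ext16(imm16: int) -> int:
    """Fill bits 31..16 with zeros (used by ori)."""
    return imm16 & 0xFFFF


def sign_ext16(imm16: int) -> int:
    """Fill bits 31..16 with copies of bit 15 (used by lw and sw)."""
    imm16 &= 0xFFFF
    if imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16


base = 0x1000                      # hypothetical value of R[rs]
offset_bits = 0xFFFC               # the 16-bit two's-complement encoding of -4

addr_signed = (base + sign_ext16(offset_bits)) & MASK32   # base - 4
addr_zero = (base + zero_ext16(offset_bits)) & MASK32     # base + 65532
```

Only the sign-extended version produces an address just below the base, which is why address offsets, being two's-complement numbers, need sign extension.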
33
Datapath for the Load (lw) Instruction
R[rt] ← M[ R[rs] + SignExt(imm16) ]
Example: lw rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
[Figure: the ori datapath with the added parts shown in blue: an Extender controlled by ExtOp in place of the zero extender, a Data Memory (ports Adr, Data In, WrEn driven by MemWr, Clk), and a MemtoReg multiplexer that selects between the ALU output and the data memory output for busW. The RegDst multiplexer selects between Rd and Rt for Rw; Ra is driven by Rs; Rb is a don't-care (Rt); the ALUSrc multiplexer selects between busB and the extended immediate.]
Why are the blue parts needed?
ExtOp: 0 selects zero extension, 1 selects sign extension.
What values do the control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, MemtoReg take?
RegDst=1, RegWr=1, ALUctr=add, ExtOp=1, ALUSrc=1, MemWr=0, MemtoReg=1
Once again we cannot use the instruction's Rd field for the register file's Rw input, because load is an I-type instruction and there is no Rd field in the I format. So instead of Rd, the Rt field is used to specify the destination register through this two-to-one multiplexer. The first operand of the ALU comes from busA of the register file, which carries the value of register Rs. The second operand comes from the immediate field of the instruction. Instead of the zero extender used in the Or Immediate datapath, I have to use a more general-purpose Extender that can do both sign extension and zero extension. The ALU then adds these two operands together to form the memory address. Consequently, the output of the ALU has to go to two places: (a) the address input of the data memory, and (b) the input of this two-to-one multiplexer. The other input of the multiplexer comes from the output of the data memory, so that we can place the output of the data memory onto the register file's input bus for the load instruction. For the Add, Subtract, and Or Immediate instructions, the output of the ALU is selected to be placed on the input bus of the register file. In either case, the control signal RegWr should be asserted so that the register file is written at the end of the cycle.
34
Memory-Access Instructions: the Store Instruction (sw)
Target (7 instructions): add, sub, ori, lw, sw, beq, j.
4. Consider the sw instruction (representative of the memory-access instructions).
35
RTL: The Store Instruction
sw rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
The immediate is in two's complement.
M[PC]
  Fetch the instruction (common operation, done by the fetch unit).
Addr ← R[rs] + SignExt(imm16)
  Compute the memory address (sign extension!).
M[Addr] ← R[rt]
  Store the contents of register rt into the memory location.
PC ← PC + 4
  Compute the next address (common operation, done by the fetch unit).
Just like the load instruction: (a) the store instruction uses the I format, and (b) it forms the memory address by adding the contents of the register selected by the Rs field to the sign-extended immediate field. However, unlike the load instruction, which gets data from memory and puts it into the register file, the store instruction: (a) reads the register selected by the Rt field of the instruction (R[rt]), and (b) writes that register's value into the data memory.
Think about it: which components and wires must be added to the existing datapath, and which control signals are needed to control them?
36
Datapath for the Store (sw) Instruction
M[ R[rs] + SignExt(imm16) ] ← R[rt]
Example: sw rt, rs, imm16
Format: op [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | immediate [15:0] (16 bits)
[Figure: the lw datapath with the added part shown in blue: busB is extended to the Data In port of the Data Memory. Ra is driven by Rs and Rb by Rt; the RegDst, ALUSrc, and MemtoReg multiplexers, the Extender (ExtOp), and the Data Memory (Adr, Data In, WrEn driven by MemWr, Clk) are unchanged.]
Why is the blue part needed?
What values do the control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, MemtoReg take?
RegDst=x, RegWr=0, ALUctr=add, ExtOp=1, ALUSrc=1, MemWr=1, MemtoReg=x
And here is the datapath for the store instruction. The register file, the ALU, and the Extender are the same as in the datapath for the load instruction, because the memory address has to be calculated in exactly the same way: (a) put the register selected by Rs onto busA and sign-extend the 16-bit immediate field; (b) then make the ALU (ALUctr) add these two values (busA and the output of the Extender) together. The new thing we added here is the extension of busB (DataIn). More specifically, in order to send the register selected by the Rt field (Rb of the register file) to data memory, we need to connect busB to the data memory's Data In bus. Finally, the store instruction is the first instruction we have encountered that does not write a register at the end. Therefore the control unit must make sure RegWr is zero for this instruction.
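The control values collected so far can be gathered into one table, which is the fourth step of the design procedure. In the sketch below, the entries for ori, lw, and sw repeat the values given on the slides, with `None` standing for the don't-care value x. The entries marked "inferred" (the add row beyond ALUctr and RegWr, and the ExtOp, MemWr, and MemtoReg values for ori) are assumptions derived from the datapath description, not values stated in the slides. Note that this deck uses RegDst=1 to select Rt.

```python
X = None  # don't care

CONTROL = {
    # add: ALUctr and RegWr are from the slides; the remaining values are inferred.
    "add": {"RegDst": 0, "RegWr": 1, "ALUctr": "add", "ExtOp": X,
            "ALUSrc": 0, "MemWr": 0, "MemtoReg": 0},
    # ori: ExtOp, MemWr, MemtoReg are inferred (zero extension, no store, ALU result).
    "ori": {"RegDst": 1, "RegWr": 1, "ALUctr": "or", "ExtOp": 0,
            "ALUSrc": 1, "MemWr": 0, "MemtoReg": 0},
    # lw and sw: all values as given on the slides.
    "lw":  {"RegDst": 1, "RegWr": 1, "ALUctr": "add", "ExtOp": 1,
            "ALUSrc": 1, "MemWr": 0, "MemtoReg": 1},
    "sw":  {"RegDst": X, "RegWr": 0, "ALUctr": "add", "ExtOp": 1,
            "ALUSrc": 1, "MemWr": 1, "MemtoReg": X},
}


def state_written(instr: str) -> list[str]:
    """List which state elements an instruction updates, besides the PC."""
    signals = CONTROL[instr]
    written = []
    if signals["RegWr"]:
        written.append("register file")
    if signals["MemWr"]:
        written.append("data memory")
    return written
```

A don't-care appears exactly where the signal steers a value that is never written: sw does not write a register, so neither RegDst nor MemtoReg matters for it.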
37
Branch (conditional transfer) instruction (branch on equal: beq)
Implementation target (7 instructions):
ADD and subtract: add rd, rs, rt; sub rd, rs, rt
OR Immediate: ori rt, rs, imm16
LOAD and STORE: lw rt, rs, imm16; sw rt, rs, imm16
BRANCH: beq rs, rt, imm16
JUMP: j target
R format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | rd <15:11> (5 bits) | shamt <10:6> (5 bits) | funct <5:0> (6 bits)
I format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | immediate <15:0> (16 bits)
J format: op <31:26> (6 bits) | target address <25:0> (26 bits)
5. Consider the beq instruction (the representative conditional branch instruction)
In today's lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and jump. The Add and Subtract instructions use the R format. The Op and Func fields together specify all the different kinds of add and subtract instructions. Rs and Rt specify the source registers, and the Rd field specifies the destination register. The Or immediate instruction uses the I format. It uses only one source register, Rs. The other operand comes from the immediate field. The Rt field specifies the destination register. Both the load and store instructions use the I format, and both add Rs and the immediate field together to form the memory address. The difference is that the load instruction loads the data from memory into Rt while the store instruction stores the data in Rt into memory. The branch on equal instruction also uses the I format. Here Rs and Rt specify the registers we need to compare. If these two registers are equal, we branch to a location specified by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to a memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don't worry, this is just an overview. You will keep seeing these (point to the formats) all day today.
38
RTL: The Branch Instruction
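The immediate is a two's-complement value.
I format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | immediate <15:0> (16 bits)
beq rs, rt, imm16
M[PC]   Fetch the instruction (common operation, performed by the instruction fetch unit)
Cond ← R[rs] - R[rt]   Subtract to compare the contents of rs and rt
if (Cond eq 0)   Compute the next address (modify the PC according to the comparison result)
    PC ← PC + 4 + ( SignExt(imm16) x 4 )
else
    PC ← PC + 4
How does the branch on equal instruction work? It calculates the branch condition by subtracting the register selected by the Rt field from the register selected by the Rs field. If the result of the subtraction is zero, then these two registers are equal and we take the branch. Otherwise, we keep going down the sequential path (PC ← PC + 4).
Think about it: what does the immediate mean? Is it a relative number of instructions or a relative number of memory units (bytes)? Which components and wires must be added to the existing datapath? Which control signals control them?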
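The beq RTL above can be traced with a few lines of Python. This is only a sketch of the register-transfer semantics, not of the hardware; the helper names `sign_extend16` and `beq_next_pc` are illustrative and do not come from the slides.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def beq_next_pc(pc: int, rs_val: int, rt_val: int, imm16: int) -> int:
    """Next PC for beq: Cond <- R[rs] - R[rt]; taken when Cond == 0."""
    cond = (rs_val - rt_val) & 0xFFFFFFFF
    if cond == 0:
        # The immediate counts instructions, so it is scaled by 4 bytes.
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF
```

The scaling by 4 answers the question on the slide: in this RTL the immediate is a number of instructions relative to PC + 4, not a number of bytes.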
39
Datapath for the conditional branch instruction: beq rs, rt, imm16. We need to compare Rs and Rt!
I format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | immediate <15:0> (16 bits)
[Figure: PC and Next Address Logic (inputs imm16, Branch, Zero; output to the Instruction Memory) on top of the main datapath: RegDst mux, 32 x 32-bit register file (Rw, Ra, Rb, busA, busB, busW), Extender, ALUSrc mux, and ALU producing Zero.]
Think about it: how should the next address logic be designed?
The datapath for calculating the branch condition is rather simple. All we have to do is feed the Rs and Rt fields of the instruction into the Ra and Rb inputs of the register file. Bus A will then contain the value from the register selected by Rs, and bus B will contain the value from the register selected by Rt. The next thing to do is to ask the ALU to perform a subtract operation and feed the output Zero to the next address logic. What does the next address logic block look like? Before I show you that, let's take a look at the binary arithmetic behind the program counter (PC).
What values do the control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, MemtoReg, and Branch take?
RegDst=x, RegWr=0, ALUctr=sub, ExtOp=x, ALUSrc=0, MemWr=0, MemtoReg=x, Branch=1
40
Design of the next address logic
The PC is a 32-bit address:
Sequential execution: PC<31:0> = PC<31:0> + 4
Taken branch: PC<31:0> = PC<31:0> + 4 + SignExt[Imm16] * 4
With a 32-bit PC, the "*4" operation can be implemented by shifting left by 2 bits, and computing the branch target takes two adders!
A simpler implementation is as follows:
MIPS is byte-addressed and every instruction is 32 bits (4 bytes) long, so the PC is always a multiple of 4, that is, its two low-order bits are 00. Therefore a 30-bit PC is sufficient.
With a 30-bit PC, the branch target logic becomes simpler.
In theory, the Program Counter (PC) is a 32-bit byte address into the instruction memory. The Program Counter is incremented by four after each sequential instruction. When a branch is taken, we need to sign-extend the 16-bit immediate field, multiply this sign-extended value by four, and add it to the sequential instruction address (PC + 4). Why does this magic number "4" always come up? The reason is that the 32-bit PC is a byte address and all MIPS instructions are four bytes, or 32 bits, long. In other words, if we keep a 32-bit Program Counter, then its two least significant bits will always be zeros. And if these two bits are always zeros, there is no reason to have hardware to keep them. So in practice, we will simplify the hardware by using a 30-bit program counter. That is, we will build a Program Counter that only keeps track of the upper 30 bits (<31:2>) of the instruction address because we know the 2 least significant bits will always be 0s. Then instead of always increasing the Program Counter by four for sequential operation, we only have to increase it by 1. And for a branch operation, we don't need to multiply the sign-extended immediate field by four before adding it to the sequential PC (PC + 1). And when we apply the program counter to the address of the instruction memory, we need to attach two zeros as its least significant bits.
The next address logic simplifies to:
Sequential execution: PC<31:2> = PC<31:2> + 1
Taken branch: PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
Instruction fetch: instruction address = PC<31:2> concat "00"
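To see that dropping the two low-order bits changes nothing, the 32-bit and 30-bit formulas can be compared numerically. The following is a small sketch; the function names are illustrative and the masks simply model 32-bit and 30-bit registers.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc32(pc: int, imm16: int, taken: bool) -> int:
    """32-bit PC: PC + 4, plus SignExt(imm16) * 4 when the branch is taken."""
    offset = sign_extend16(imm16) * 4 if taken else 0
    return (pc + 4 + offset) & 0xFFFFFFFF


def next_pc30(pc30: int, imm16: int, taken: bool) -> int:
    """30-bit PC<31:2>: PC + 1, plus SignExt(imm16) when the branch is taken."""
    offset = sign_extend16(imm16) if taken else 0
    return (pc30 + 1 + offset) & 0x3FFFFFFF


def fetch_address(pc30: int) -> int:
    """Instruction address = PC<31:2> concat "00"."""
    return pc30 << 2
```

For any word-aligned `pc`, `fetch_address(next_pc30(pc >> 2, imm, taken))` equals `next_pc32(pc, imm, taken)`, including wrap-around at the top of the address space.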
41
Next-address logic, design option 1: fast but expensive
Using a 30-bit PC:
Sequential execution: PC<31:2> = PC<31:2> + 1
Taken branch: PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
Instruction fetch: instruction address = PC<31:2> concat "00"
[Figure: 30-bit PC register driving Addr<31:2> of the Instruction Memory, with Addr<1:0> tied to "00"; a first adder computes PC + "1"; a second adder adds the 30-bit SignExt of imm16 (Instruction<15:0>); a 2-to-1 mux controlled by Branch and Zero selects the value written back to the PC.]
Zero: the zero flag (ZF), generated by the ALU!
The instruction is first fetched using the current PC; the computed next instruction address can only be written into the PC after the next clock edge arrives!
Why is an "Adder" used here instead of an "ALU"? What is the difference between an "ALU" and an "Adder"?
So let's see how we can put all these theories (point to the equations) into practice. PC plus one is implemented by the first adder. For a branch operation, we need to sign-extend the immediate field of the instruction and then add it to the output of the first adder to implement the equation PC + 1 + SignExt(imm16). For sequential operation, the output of the first adder is selected by the two-to-one mux so it will be saved into the PC register at the next clock tick. For a taken branch, that is, we have a branch on equal and the condition Zero is true, the output of the second adder is selected. In all cases, the 30-bit Program Counter is used as instruction address bits 31 to 2. The two least significant bits of the instruction address will always be zeros. One question you may want to ask is: do we really need an adder just to add "1"? Well, maybe not.
42
Next-address logic, design option 2: slow but cheap
Why is it slow? The address calculation can only start after "Zero" has a valid value.
Does it affect performance? No, because the Load instruction is even slower.
Why is it cheap? The "+1" operation is implemented through the carry-in, saving one "Adder".
Even for a non-branch instruction, the next instruction address cannot be obtained quickly.
[Figure: 30-bit PC register driving Addr<31:2> of the Instruction Memory, with Addr<1:0> tied to "00"; a single adder with Carry In tied to "1"; a mux controlled by Branch and Zero feeds the adder's second input with either "0" or the 30-bit SignExt of imm16 (Instruction<15:0>).]
One way to simplify the implementation is to use the CarryIn input of the adder to implement the PC<31:2> = PC<31:2> + 1 operation. Then we can put a MUX in front of the adder to add the branch offset if the branch is taken. If the branch is not taken, we simply set the second input of the adder to zeros so we only add one through the CarryIn input. Why is this implementation slow? Because we cannot start the address add until the Zero input is valid. And when will the Zero input become valid? Not until we have performed a 32-bit subtract in the main datapath. But does it matter that this is slow in the overall scheme of things? Probably not in this single-cycle implementation. The critical path of this single-cycle implementation will be the load instruction's memory access, so the extra time it takes to calculate the PC can be hidden behind the critical path.
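The two design options differ in hardware cost and delay, not in the value they compute. A short sketch can make that concrete: `scheme1` models the two adders followed by a mux, `scheme2` models the single adder with a mux on its second input and the carry-in tied to 1. The names are illustrative and timing is not modeled.

```python
MASK30 = (1 << 30) - 1


def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def scheme1(pc30: int, imm16: int, branch: int, zero: int) -> int:
    """Option 1: adder for PC+1, second adder for the target, mux selects last."""
    sequential = (pc30 + 1) & MASK30
    target = (sequential + sign_extend16(imm16)) & MASK30
    return target if (branch and zero) else sequential


def scheme2(pc30: int, imm16: int, branch: int, zero: int) -> int:
    """Option 2: mux selects the operand first, one adder with carry-in = 1."""
    operand = sign_extend16(imm16) if (branch and zero) else 0
    carry_in = 1
    return (pc30 + operand + carry_in) & MASK30
```

In `scheme2` the mux sits before the adder, which is why the addition cannot begin until Zero is known; in `scheme1` both sums are formed in parallel and only the final selection waits for Zero.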
43
Unconditional jump instruction
Implementation target (7 instructions):
ADD and subtract: add rd, rs, rt; sub rd, rs, rt
OR Immediate: ori rt, rs, imm16
LOAD and STORE: lw rt, rs, imm16; sw rt, rs, imm16
BRANCH: beq rs, rt, imm16
JUMP: j target
R format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | rd <15:11> (5 bits) | shamt <10:6> (5 bits) | funct <5:0> (6 bits)
I format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | immediate <15:0> (16 bits)
J format: op <31:26> (6 bits) | target address <25:0> (26 bits)
6. Consider the Jump instruction (the representative unconditional jump instruction)
44
RTL: The Jump Instruction
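J format: op <31:26> (6 bits) | target address <25:0> (26 bits)
j target
M[PC]   Fetch the instruction (common operation, performed by the instruction fetch unit)
PC<31:2> ← PC<31:28> concat target<25:0>   Compute the target address
[Figure: memory map of the 32-bit address space with regions labeled by PC<31:28> (0 up to 0FFF FFFF, ..., A up to AFFF FFFF, ..., E up to EFFF FFFF, F up to FFFF FFFF), showing where a "j target" can land.]
Think about it: how large is the range of the jump instruction? Is it 0x ~0xFFFFFFC after the current instruction?
No! It is not relative addressing but absolute addressing.
Finally, let's take a look at the jump instruction, which uses the J format. The effect of the jump instruction is to change the lower 26 bits of the 30-bit Program Counter (PC<27:2>) to the value specified in the address field of the instruction.
Think about it: which components and wires must be added to the existing datapath? Which control signals control them?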
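The jump RTL amounts to bit splicing, which a short sketch can show on full 32-bit byte addresses. It follows the slide's definition, which keeps PC<31:28> of the current instruction; `jump_target` is an illustrative name.

```python
def jump_target(pc: int, target26: int) -> int:
    """PC<31:28> concat target<25:0> concat "00", as a 32-bit byte address."""
    upper4 = pc & 0xF0000000                   # PC<31:28> stays unchanged
    middle26 = (target26 & 0x03FFFFFF) << 2    # target fills bits <27:2>
    return upper4 | middle26                   # bits <1:0> are always 00
```

Because only bits <27:2> are replaced, the destination is an absolute position inside the 256 MB region selected by PC<31:28>, regardless of where in that region the jump itself sits.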
45
Instruction Fetch Unit
j target
PC<31:2> ← PC<31:28> concat target<25:0>
[Figure: complete Instruction Fetch Unit: 30-bit PC register driving Addr<31:2> of the Instruction Memory, with Addr<1:0> tied to "00"; adder for PC + "1"; second adder adding the 30-bit SignExt of imm16 (Instruction<15:0>); a mux controlled by Branch and Zero; a second mux controlled by Jump that selects the 30-bit Target formed from PC<31:28> (4 bits) and Instruction<25:0> (26 bits); output Instruction<31:0>.]
The PC changes only after the next Clk edge arrives!
This is the complete design of the instruction fetch unit. 3 inputs: Jump, Branch, Zero. 1 output: the instruction word.
Well, this (points to the equation) is easy to implement. All we have to do is grab the four most significant bits of the PC and put them right next to the 26-bit target, and we will have the next PC for the jump (point to the feedback path). If we are running Powerview, what we will do now is create a symbol called Instruction Fetch Unit. The output of this symbol is the 32-bit instruction word. The inputs to the Instruction Fetch Unit are two control signals, Branch and Jump, and one condition input, Zero, from the datapath. Using this new symbol, we can complete our single-cycle datapath.
What values do RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, MemtoReg, Branch, and Jump take?
RegDst=ExtOp=ALUSrc=MemtoReg=ALUctr=x, RegWr=0, MemWr=0, Branch=0, Jump=1
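The complete next-address selection of the Instruction Fetch Unit can be sketched as one function over the 30-bit PC. This is a behavioral sketch only; `ifu_next_pc30` and its argument names are illustrative, and the instruction word is passed in directly instead of being read from an instruction memory.

```python
MASK30 = (1 << 30) - 1


def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def ifu_next_pc30(pc30: int, instr: int, branch: int, zero: int, jump: int) -> int:
    """Value presented to the PC input; it is latched at the next clock edge."""
    imm16 = instr & 0xFFFF             # Instruction<15:0>
    target26 = instr & 0x03FFFFFF      # Instruction<25:0>
    sequential = (pc30 + 1) & MASK30                         # first adder
    br_target = (sequential + sign_extend16(imm16)) & MASK30 # second adder
    after_branch_mux = br_target if (branch and zero) else sequential
    jump_addr = (pc30 & 0x3C000000) | target26   # PC<31:28> concat target
    return jump_addr if jump else after_branch_mux
```

The Jump mux is the last one in the chain, so with Jump = 1 the values of Branch and Zero no longer influence the result.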
46
The MIPS Subset (examining the datapath that implements the following instructions)
ADD and subtract: add rd, rs, rt; sub rd, rs, rt
OR Immediate: ori rt, rs, imm16
LOAD and STORE: lw rt, rs, imm16; sw rt, rs, imm16
BRANCH: beq rs, rt, imm16
JUMP: j target
R format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | rd <15:11> (5 bits) | shamt <10:6> (5 bits) | funct <5:0> (6 bits)
I format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | immediate <15:0> (16 bits)
J format: op <31:26> (6 bits) | target address <25:0> (26 bits)
The datapaths for all the instructions have now been designed. What does the combined datapath look like?
47
Putting it All Together: A Single Cycle Datapath
The datapath used by each of the completed instructions (components and their interconnections), together with its control signals, is as follows.
[Figure: single-cycle datapath. The Instruction Fetch Unit (inputs Branch, Jump, Zero, Clk) outputs Instruction<31:0>, from which Rs = Instruction<25:21>, Rt = Instruction<20:16>, Rd = Instruction<15:11> and Imm16 = Instruction<15:0> are extracted. RegDst mux (Rd / Rt) drives Rw; 32 x 32-bit register file with RegWr, Ra, Rb, busA, busB, busW; Extender (ExtOp) and ALUSrc mux feed the ALU (ALUctr, Zero); Data Memory (Adr, Data In, WrEn = MemWr, Clk); MemtoReg mux drives busW.]
So here is the single-cycle datapath we just built. If you push into the Instruction Fetch Unit, you will see the last slide showing the PC, the next address logic, and the Instruction Memory. Here I have shown how we can get the Rt, Rs, Rd, and Imm16 fields out of the 32-bit instruction word. The Rt, Rs, and Rd fields go to the register file as register specifiers while the Imm16 field goes to the Extender, where it is either zero-extended or sign-extended to 32 bits. The signals ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, RegDst, RegWr, Branch, and Jump are control signals. I will show you how to generate them in the next class.
The result of an instruction always starts to be saved into a register, the memory, or the PC when the next clock edge arrives!
To be considered in the next lecture: how to generate the control signals! (This is the subject of controller design.)
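Extracting the Rs, Rt, Rd, and Imm16 fields from the instruction word is pure wiring in hardware; in software it corresponds to shifts and masks. A minimal sketch follows, with `decode_fields` as an illustrative name.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # Instruction<31:26>
        "rs": (instr >> 21) & 0x1F,     # Instruction<25:21>
        "rt": (instr >> 16) & 0x1F,     # Instruction<20:16>
        "rd": (instr >> 11) & 0x1F,     # Instruction<15:11>
        "shamt": (instr >> 6) & 0x1F,   # Instruction<10:6>
        "funct": instr & 0x3F,          # Instruction<5:0>
        "imm16": instr & 0xFFFF,        # Instruction<15:0>
        "target": instr & 0x03FFFFFF,   # Instruction<25:0>
    }
```

All fields are extracted unconditionally, just as in the datapath; which of them is actually used depends on the control signals derived from `op` (and `funct` for R-type instructions).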
48
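Summary of Lecture 1
The CPU design directly determines the clock cycle time and the CPI, so it is very important for computer performance!
A CPU consists mainly of a datapath and a controller.
Datapath: implements the operations of all instructions in the instruction set.
Controller: controls the components in the datapath so that they operate correctly.
The datapath contains two kinds of elements:
Operational elements (combinational circuits): ALU, MUX, Ext., Adder, Reg/Mem Read, etc.
State / storage elements (sequential circuits): PC, Reg/Mem Write
Datapath timing:
Operational elements in the datapath have no storage capability, so their results must be written into storage elements.
A storage element starts to update its state one Clk-to-Q delay after the clock edge arrives.
A subset of the MIPS instruction set is the implementation target of the CPU:
Common operations: instruction fetch and PC + 4
Next address calculation: 30-bit PC, three-way selection: sequential, Branch (combined with the Zero flag), Jump
R-type: the two ALU operands come from rs and rt, and the result is written to rd
Memory access: sign extension; data is transferred between rt and a memory location
Immediate: the zero-extended operand is sent to one input of the ALU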
49
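Lecture 2: Design of the Single-Cycle Controller
Main contents
Examine how each instruction executes in the datapath and the values of the control signals involved:
Common operations: instruction fetch and next-address (PC) calculation
R-type instructions (add / sub)
Immediate instruction (ori)
Memory access instructions (lw / sw)
Branch instruction (beq)
Jump instruction (j)
Summarize the control signal values of all instructions
Two classes of control signals: those sent directly to the datapath / those sent to a local control unit
Analyze the relationship between the ALU control signals and the func field
Design the ALU local control unit
Design the main control unit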
50
The Big Picture: Where are We Now?
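The Five Classic Components of a Computer: Input, Output, Memory, and the Processor's Control and Datapath
So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath.
Next goal: design the controller for the single-cycle datapath.
Design method:
Based on the function of each instruction, analyze the values of the control signals and list them in a table.
Based on the listed relationship between instructions and control signals, write the logic expression of each control signal.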
51
ADD / SUB instructions
R format: op <31:26> (6 bits) | rs <25:21> (5 bits) | rt <20:16> (5 bits) | rd <15:11> (5 bits) | shamt <10:6> (5 bits) | funct <5:0> (6 bits)
add rd, rs, rt
M[PC]   Instruction fetch (the same for every instruction)
R[rd] ← R[rs] + R[rt]   The actual operation (may differ for each instruction)
PC ← PC + 4   Compute the PC for sequential execution (the same for every instruction)
OK, let's get on with today's lecture by looking at the simple add instruction. In terms of Register Transfer Language, this is what the Add instruction needs to do. First, you need to fetch the instruction from memory. Then you perform the actual add operation. More specifically: (a) you add the contents of the registers specified by the Rs and Rt fields of the instruction; (b) then you write the result to the register specified by the Rd field. And finally, you need to update the program counter to point to the next instruction. Now, let's take a detailed look at the datapath during the various phases of this instruction.
Complement: the next three slides show the details of these three steps: how the data flows and how that flow is controlled.
52
Actions in the instruction fetch unit at the start of an Add / Sub operation
Instruction fetch: Instruction ← M[PC]   The same for all instructions
The new instruction has not yet been fetched and decoded, so the control signals still hold the old values from the previous instruction.
The new instruction has not yet executed, so the flag also holds its old value.
[Figure: Instruction Fetch Unit (30-bit PC, Instruction Memory with Addr<31:2> and Addr<1:0> = "00", two adders, Branch/Zero mux, Jump mux with Target formed from PC<31:28> and Instruction<25:0>, SignExt of imm16) with Jump = previous, Branch = previous, Zero = previous.]
Bits 31-26 of the fetched instruction are decoded first as the opcode. If op = 000000, it is an R-type instruction.
The instruction fetch unit is controlled by the old control signals. Is that a problem? No problem! Why? Because the value at the PC input is not written until the next Clk edge arrives. It is enough to guarantee that the correct PC is produced before the next Clk edge!
First let's look at the Instruction Fetch Unit, where everything begins. Every instruction begins at the clock tick. The clock tick in this case is the high-to-low transition of Clk (points to the "bubble" of the PC). What happens right after the clock tick? After the Clk-to-Q delay, the PC gets the value that points to the Add instruction and fetches the add instruction from memory by sending the address to the ideal instruction memory. Notice that since this is the beginning of the instruction, the control signals Branch and Jump still have the old values from the previous instruction. At the beginning of the execution of ALL instructions, the instruction fetch unit behaves the same way as shown here, and we won't repeat this picture for every instruction.
53
Operation of an R-type instruction (Add / Sub) after instruction decode
R format: op <31:26> | rs <25:21> | rt <20:16> | rd <15:11> | shamt <10:6> | funct <5:0>
R[rd] ← R[rs] + / - R[rt]
[Figure: single-cycle datapath with Branch = 0, Jump = 0, RegDst = 1, ALUctr = Add or Sub, RegWr = 1, MemtoReg = 0, MemWr = 0, ALUSrc = 0, ExtOp = x; Rs = Instruction<25:21>, Rt = Instruction<20:16>, Rd = Instruction<15:11>, Imm16 = Instruction<15:0>.]
This picture shows the activities in the main datapath during the execution of the Add or Subtract instructions. The active parts of the datapath are shown in a different color as well as with thicker lines. First of all, the Rs and Rt fields of the instruction are fed to the Ra and Rb address ports of the register file and cause the contents of the registers specified by the Rs and Rt fields to be placed on busA and busB, respectively. With the ALUctr signals set to either Add or Subtract, the ALU will perform the proper operation, and with MemtoReg set to 0, the ALU output will be placed onto busW. The control we are going to design will also set RegWr to 1 so that the result will be written to the register file at the end of the cycle. Notice that ExtOp is a don't care because the Extender in this case can do either a SignExt or a ZeroExt. We DON'T care because ALUSrc will be equal to 0: we are using busB. The other control signals we need to worry about are: (a) MemWr has to be set to zero because we do not want to write the memory; (b) Branch and Jump have to be set to zero. Let me show you why.
54
Actions in the instruction fetch unit in the final phase of an R-type instruction (Add / Sub)
PC ← PC + 4   The same for all instructions other than Branch and Jump
[Figure: Instruction Fetch Unit (30-bit PC, Instruction Memory, two adders, Branch/Zero mux, Jump mux, SignExt of imm16) with Jump = 0, Branch = 0, Zero = x.]
This picture shows the control signal settings for the Instruction Fetch Unit at the end of the Add or Subtract instruction. Both the Branch and Jump signals are set to 0. Consequently, the output of the first adder, which implements PC plus 1, is selected through the two 2-to-1 muxes and placed at the input of the Program Counter register. The Program Counter is updated to this new value at the next clock tick. Notice that the Program Counter is updated at every cycle. Therefore it does not have a Write Enable signal to control the write. Also, this picture is the same for all instructions other than Branch and Jump. Therefore I will only show this picture again for the Branch and Jump instructions and will not repeat it for all other instructions.
Because the new control signals guarantee that the correct PC value is produced, the next clock edge (Clk) arrives after a sufficiently long time!
55
Register-Register (R-type instruction) Timing
[Timing diagram: Clk; PC (old value → new value after Clk-to-Q); Rs, Rt, Rd, Op, Func (old value → new value after the Instruction Memory Access Time); ALUctr and RegWr (old value → new value after the Delay through Control Logic); busA, busB (old value → new value after the Register File Access Time); busW (old value → new value after the ALU Delay); the register write occurs at the next clock edge, when the PC becomes PC+4.]
[Figure: register file (Rw, Ra, Rb from Rd, Rs, Rt; RegWr; Clk; busA, busB, busW) and ALU (ALUctr) producing Result.]
Let's take a more quantitative look at what is happening. At each clock tick, the Program Counter presents its latest value to the Instruction Memory after the Clk-to-Q time (the new value was produced much earlier, but the PC only changes under the control of the clock edge). After a delay of the Instruction Memory Access Time, the Opcode, Rd, Rs, Rt, and Function fields become valid on the instruction bus. Once we have the new instruction, that is, the Add or Subtract instruction, on the instruction bus, two things happen in parallel. First of all, the control unit decodes the Opcode and Func fields and sets the control signals ALUctr and RegWr accordingly. We will cover this in the next lecture. While this is happening (points to Control Delay), we are also reading the register file (Register File Access Time). Once the data is valid on busA and busB, the ALU performs the Add or Subtract operation based on the ALUctr signal. Hopefully, the ALU is fast enough to finish the operation (ALU Delay) before the next clock tick. At the next clock tick, the output of the ALU is written into the register file because the RegWr signal is equal to 1.
Complement: for the instruction bus, see slide 12, An Abstract View of the Critical Path.
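The chain of delays described above can be added up to estimate the shortest usable clock period for an R-type instruction. The sketch below only mirrors the sequence in the notes; the function name, the optional `setup` term, and any delay values passed to it are hypothetical and not taken from the slides.

```python
def rtype_min_cycle(clk_to_q: float, imem_access: float, control_delay: float,
                    regfile_access: float, alu_delay: float,
                    setup: float = 0.0) -> float:
    """Lower bound on the clock period for add/sub in the single-cycle design.

    Control decoding and the register file read proceed in parallel, so only
    the slower of the two lies on the path into the ALU. `setup` is an assumed
    extra term for the register file's setup time (hypothetical, default 0).
    """
    operands_and_control_ready = max(control_delay, regfile_access)
    return clk_to_q + imem_access + operands_and_control_ready + alu_delay + setup
```

For example, with hypothetical delays of 1, 10, 3, 5, and 8 time units, `rtype_min_cycle(1, 10, 3, 5, 8)` evaluates to 24: the control delay of 3 is hidden behind the register file access of 5.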
56
Execution of the ori instruction after decode
I format: op <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
R[rt] ← R[rs] or ZeroExt[Imm16]
[Figure: single-cycle datapath with Branch = 0, Jump = 0, RegDst = 0, ALUctr = Or, RegWr = 1, MemtoReg = 0, MemWr = 0, ALUSrc = 1, ExtOp = 0.]
Now let's look at the control signal settings for the Or immediate instruction. The OR immediate instruction ORs the contents of the register specified by the Rs field with the zero-extended immediate field and writes the result to the register specified by Rt. This is how it works in the datapath. The Rs field is fed to the Ra address port to cause the contents of register Rs to be placed on busA. The other operand for the ALU comes from the immediate field. In order to do this, the controller needs to set ExtOp to 0 to instruct the extender to perform a zero-extend operation. Furthermore, ALUSrc must be set to 1 so that the MUX blocks off busB from the register file and sends the zero-extended version of the immediate field to the ALU. Of course, ALUctr has to be set to OR so the ALU can perform an OR operation. The rest of the control signals (MemWr, MemtoReg, Branch, and Jump) are the same as for the Add and Subtract instructions. One big difference is the RegDst signal. In this case, the destination register is specified by the instruction's Rt field, NOT the Rd field, because we do not have an Rd field here. Consequently, RegDst must be set to 0 to place Rt onto the register file's Rw address port. Finally, in order to accomplish the register write, RegWr must be set to 1.
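The role of ExtOp can be illustrated with a sketch of the Extender, using the encoding from the slides (ExtOp = 0 for zero extension, ExtOp = 1 for sign extension). The function name `extender` is illustrative, and the result is returned as an unsigned 32-bit pattern.

```python
def extender(imm16: int, ext_op: int) -> int:
    """16-to-32-bit Extender: ext_op = 0 zero-extends, ext_op = 1 sign-extends."""
    imm16 &= 0xFFFF
    if ext_op == 1 and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000   # replicate the sign bit into bits <31:16>
    return imm16                    # upper 16 bits are zero


def ori(rs_val: int, imm16: int) -> int:
    """R[rt] <- R[rs] or ZeroExt[imm16]."""
    return (rs_val | extender(imm16, 0)) & 0xFFFFFFFF
```

With zero extension, ori can only modify the low 16 bits of the register value; sign extension would set the whole upper half to ones whenever bit 15 of the immediate is 1.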
57
Execution of the Load instruction after decode
I format: op <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
R[rt] ← Data Memory {R[rs] + SignExt[imm16]}
[Figure: single-cycle datapath with Branch = 0, Jump = 0, RegDst = 0, ALUctr = Add, RegWr = 1, MemtoReg = 1, MemWr = 0, ALUSrc = 1, ExtOp = 1.]
Let's continue our lecture with the load instruction. What does the load instruction do? It first adds the contents of the register specified by the Rs field to the sign-extended version of the immediate field to form the memory address. Then it uses this memory address to access the memory and writes the data back to the register specified by the Rt field of the instruction. Here is how the datapath works: first the Rs field is fed to the register file's Ra address port to place the register onto busA. Then the ExtOp signal is set to 1 so that the immediate field is sign-extended, and we place this value (the output of the Extender) onto the ALU input by setting ALUSrc to 1. The ALU then adds (ALUctr = add) the two together to form the memory address, which is then placed onto the Data Memory's address port. In order to place the Data Memory's output bus onto the register file's input bus (busW), the control needs to set MemtoReg to 1. Similar to the OR immediate instruction I showed you earlier, the destination register here is specified by the Rt field. Therefore RegDst must be set to 0. Finally, RegWr must be set to 1 to complete the register write operation. It should be obvious by now that we need to set Branch and Jump to 0 to make sure the Instruction Fetch Unit updates the Program Counter correctly.
58
Execution of the Store instruction after decode
I format: op <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
M{R[rs] + SignExt[imm16]} ← R[rt]
[Figure: single-cycle datapath with Branch = 0, Jump = 0, RegDst = x, ALUctr = Add, RegWr = 0, MemtoReg = x, MemWr = 1, ALUSrc = 1, ExtOp = 1.]
The store instruction performs the inverse function of the load. Instead of loading data from memory, the store instruction sends the contents of the register specified by Rt to data memory. Similar to the load instruction, the store instruction needs to read the contents of register Rs (points to the Ra port) and add it to the sign-extended version of the immediate field (Imm16, ExtOp = 1, ALUSrc = 1) to form the data memory address (ALUctr = add). However, unlike the load instruction, where busB is not used, the store instruction uses busB to send the data to the Data Memory. Consequently, the Rt field of the instruction has to be fed to the Rb port of the register file. In order to write the Data Memory properly, the MemWr signal has to be set to 1. Notice that the store instruction does not update the register file. Therefore, RegWr must be set to zero, and consequently the control signals RegDst and MemtoReg are don't cares. And once again we need to set the control signals Branch and Jump to zero to ensure proper Program Counter updating. By now, you are probably tired of this boring stuff where Branch and Jump are zero, so let's look at something different: the branch instruction.
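Load and store share the address calculation and differ only in the direction of the data transfer. A behavioral sketch, assuming a 32-entry Python list as the register file and a dict keyed by byte address as the data memory (both modeling choices and all names are illustrative):

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(regs: list, rs: int, imm16: int) -> int:
    """Addr <- R[rs] + SignExt(imm16); ALUctr = add, ALUSrc = 1, ExtOp = 1."""
    return (regs[rs] + sign_extend16(imm16)) & 0xFFFFFFFF


def exec_lw(regs: list, mem: dict, rs: int, rt: int, imm16: int) -> None:
    """R[rt] <- M[Addr]; MemtoReg = 1, RegWr = 1, MemWr = 0."""
    regs[rt] = mem.get(effective_address(regs, rs, imm16), 0)


def exec_sw(regs: list, mem: dict, rs: int, rt: int, imm16: int) -> None:
    """M[Addr] <- R[rt]; MemWr = 1, RegWr = 0."""
    mem[effective_address(regs, rs, imm16)] = regs[rt]
```

A store followed by a load with the same base register and offset returns the stored value, while the store itself leaves the register file untouched.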
59
Execution of the Branch instruction after decode
I format: op <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
if (R[rs] - R[rt] == 0) then Zero ← 1 ; else Zero ← 0
[Figure: single-cycle datapath with Branch = 1, Jump = 0, RegDst = x, ALUctr = Sub, RegWr = 0, MemtoReg = x, MemWr = 0, ALUSrc = 0, ExtOp = x.]
So how does the branch instruction work? As far as the main datapath is concerned, it needs to calculate the branch condition. That is, it subtracts the register specified in the Rt field from the register specified in the Rs field and sets the condition Zero accordingly. In order to place the register values on busA and busB, we need to feed the Rs and Rt fields of the instruction to the Ra and Rb ports of the register file and set ALUSrc to 0. Then we have to instruct the ALU to perform the subtract (ALUctr = sub) operation and set the Zero bit accordingly. The Zero bit is sent to the Instruction Fetch Unit. I will show you the internals of the Instruction Fetch Unit in a second. But before we leave this slide, I want you to notice that ExtOp, MemtoReg, and RegDst are don't cares, but RegWr and MemWr have to be ZERO to prevent any write from occurring. And finally, the controller needs to set the Branch signal to 1 so the Instruction Fetch Unit knows what to do. So now let's take a look at the Instruction Fetch Unit.
60
Actions in the instruction fetch unit in the final phase of the Branch instruction
I format: op <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
if (Zero == 1) then PC = PC + 4 + SignExt[imm16]*4 ; else PC = PC + 4
[Figure: Instruction Fetch Unit (30-bit PC, Instruction Memory, two adders, Branch/Zero mux, Jump mux, SignExt of imm16) with Jump = 0, Branch = 1, Zero = 1.]
Let's look at the interesting case where the branch condition Zero is true (Zero = 1). If Zero is not asserted, we have our boring case where PC + 1 is selected. Anyway, with Branch = 1 and Zero = 1, the output of the second adder is selected. That is, we add the sequential address, that is, the output of the first adder, to the sign-extended version of the immediate field to form the branch target address (the output of the second adder). With the control signal Jump set to zero, this branch target address is written into the Program Counter register (PC) at the end of the clock cycle.
61
Execution of the Jump instruction after decode
J format: op <31:26> | target address <25:0>
The target address in the IFU is sent to the PC and nothing else is done (it is enough to guarantee that no storage element is written). How do we guarantee that no storage element is written?
[Figure: single-cycle datapath with Branch = 0, Jump = 1, RegDst = x, ALUctr = x, RegWr = 0, MemtoReg = x, MemWr = 0, ALUSrc = x, ExtOp = x.]
The control signal settings in the main datapath for the Jump instruction are pretty boring because in most cases we DON'T CARE. More specifically, the control signals ExtOp, ALUSrc, and ALUctr are all don't cares because the ALU is not used at all for the Jump instruction. The control signals MemtoReg and RegDst are don't cares because Jump does not write the register file. For that same reason, we still need to set RegWr to zero. Furthermore, we also need to set MemWr to zero to avoid a Data Memory write. Finally, the control signal Branch is set to zero but Jump is set to 1.
Complement: as we have seen so far, almost all instructions use the ALU. The jump instruction is the exception. Which instruction doesn't use the ALU?
62
Actions in the IFU before the Jump instruction ends
J format: op <31:26> | target address <25:0>
PC ← PC<31:28> concat target<25:0> concat "00"
[Figure: Instruction Fetch Unit (30-bit PC, Instruction Memory, two adders, Branch/Zero mux, Jump mux with Target formed from PC<31:28> and Instruction<25:0>, SignExt of imm16) with Jump = 1, Branch = 0, Zero = x.]
Inside the Instruction Fetch Unit, with Branch set to zero and Jump set to 1, we will not use the output of either adder. What we will use is the concatenation of the four most significant bits of the current program counter and the twenty-six bits of the target address. With the control signal Jump set to 1, this value is sent to the Program Counter and written into the PC at the next clock tick (points to the Clk bubble).
A possible improvement: when Jump = 1, we don't have to set Branch = 0. We can set Branch = x to simplify the logic design of the control part. But we have to set Jump = 0 when Branch = 1.
63
Combining the analysis results gives the following table relating instructions to control signals
add and sub are identified by op together with func; for the other instructions only op matters and func is a don't care ("We Don't Care :-)").

              add   sub        ori   lw    sw    beq        jump
RegDst        1     1          0     0     x     x          x
ALUSrc        0     0          1     1     1     0          x
MemtoReg      0     0          0     1     x     x          x
RegWrite      1     1          1     1     0     0          0
MemWrite      0     0          0     0     1     0          0
Branch        0     0          0     0     0     1          0
Jump          0     0          0     0     0     0          1
ExtOp         x     x          0     1     1     x          x
ALUctr<2:0>   Add   Subtract   Or    Add   Add   Subtract   xxx

Instruction formats:
R-type (add, sub): op <31:26>, rs <25:21>, rt <20:16>, rd <15:11>, shamt <10:6>, funct <5:0>
I-type (ori, lw, sw, beq): op <31:26>, rs <25:21>, rt <20:16>, immediate <15:0>
J-type (jump): op <31:26>, target address <25:0>

Here is a table summarizing the control signal settings for the seven instructions (add, sub, ...) we have looked at. Instead of showing you the exact bit values for the ALU control (ALUctr), I have used symbolic values here. The first two columns are unique in the sense that they are R-type instructions, and in order to uniquely identify them we need to look at BOTH the op field and the func field. Ori, lw, sw, and branch on equal are I-type instructions and Jump is J-type. They can all be uniquely identified by looking at the opcode field alone. Now let's take a more careful look at the first two columns. Notice that they are identical except for the last row. So we can combine these two columns if we can "delay" the generation of the ALUctr signals. This leads us to something called "local decoding."
64
Main control unit and ALU local control unit
The MIPS instruction format has two fields that indicate the nature of the operation: op (main control) and func (ALU local control).

op         R-type         ori   lw    sw    beq        jump
RegDst     1              0     0     x     x          x
ALUSrc     0              1     1     1     0          x
MemtoReg   0              0     1     x     x          x
RegWrite   1              1     1     0     0          0
MemWrite   0              0     0     1     0          0
Branch     0              0     0     0     1          0
Jump       0              0     0     0     0          1
ExtOp      x              0     1     1     x          x
ALUctr     Add/Subtract   Or    Add   Add   Subtract   xxx

[Figure: op (6 bits) feeds Main Control, which outputs ALUop (N bits, N = ?); ALUop and func feed the ALU (Local) Control, which outputs ALUctr (3 bits).]
ALUctr depends on ALUop and func; all other control signals depend only on op.
ALUop has 5 cases, so N should be at least 3! Which 5? R-type, I-type ori, I-type lw/sw, I-type beq, and J-type.

That is, instead of asking the Main Control to generate the ALUctr signals directly (see the diagram with the ALU), the Main Control will generate a set of signals called ALUop. For all I-type and J-type instructions, ALUop tells the ALU Control exactly what the ALU needs to do (Add, Subtract, ...). But whenever the Main Control sees an R-type instruction, it simply throws its hands up and says: "Wow, I don't know what the ALU has to do, but I know it is an R-type instruction," and lets the local control block, the ALU Control, take care of the rest. Notice that this saves us one column from the table we had on the last slide. But let's be honest: if one column were the ONLY thing we saved, we probably would not do it. When you have to design for the entire MIPS instruction set, however, this column will be used for ALL R-type instructions, which is more than just the Add and Subtract I showed you here. Another advantage of this table over the last one, besides being smaller, is that we can uniquely identify each column by looking at the op field only. Therefore, as I will show you later, the Main Control ONLY needs to look at the opcode field. How many bits do we need for ALUop?
65
Decoding ALUop and the "func" field
[Figure: op (6 bits) feeds Main Control, which outputs ALUop (N bits); ALUop and func feed the ALU (Local) Control, which outputs ALUctr (3 bits) to the ALU.]
ALUop is encoded as follows. In Table 6.3 on p.182 of the textbook, only one of ori and beq may contain an x; otherwise the encodings conflict!

                   R-type     ori   lw    sw    beq        jump
ALUop (Symbolic)   "R-type"   Or    Add   Add   Subtract   xxx
ALUop<2:0>         1xx        010   000   000   0x1        xxx

Question: can ALUop use only 2 bits? Yes! Because ALUop is arbitrary for jump, two bits suffice: R-type: 11, I-type ori: 10, I-type beq: 01, I-type lw/sw: 00, J-type: xx.
With R-type taken as 1xx, no encoding conflict arises!
R-type format: op = 000000 <31:26>, rs <25:21>, rt <20:16>, rd <15:11>, shamt <10:6>, funct <5:0>
funct<5:0> instruction operations: add, subtract, and, or, set-on-less-than
ALUctr<2:0> codes listed: 000, 001, 100, 101, 010. ALU operations listed: Add, Subtract, And, Or.
ALUctr depends on the low 4 bits of func, so we need to establish the correspondence between ALUctr on one side and ALUop plus the low four bits of func on the other.

What this table and diagram imply is that if the ALU Control receives ALUop = 100, it has to decode the instruction's "func" field to figure out what the ALU needs to do. Based on the MIPS encoding in Appendix A of your textbook, the func field tells us whether we have an Add, a Subtract, and so on. Notice that bit 5 and bit 4 of this field are the same for all these operations, so as far as the ALU Control is concerned, these bits are don't cares. Now recall from your ALU homework that the ALUctr signals have the following meaning (point to the table): 000 means Add, 001 means Subtract, etc. Based on these three tables (point to the last row of the top table and then the two other tables) and the fact that bit 5 and bit 4 of the "func" field are don't cares, we can derive the following truth table for ALUctr.
66
Truth table for the ALUctr control signals
Build the table relating ALUop, the low 4 bits of func, and ALUctr. The logic expressions for ALUctr follow from this table: ALUctr[i] = f(ALUop[i], func[i]).

funct<3:0>   Instruction Op.
0000         add
0010         subtract
0100         and
0101         or
1010         set-on-less-than

                   R-type     ori   lw    sw    beq
ALUop (Symbolic)   "R-type"   Or    Add   Add   Subtract
ALUop<2:0>         100        010   000   000   0x1

ALUop<2:0>   func<3:0>   ALU Operation
0 0 0        x x x x     Add
0 x 1        x x x x     Subtract
0 1 x        x x x x     Or
1 x x        0 0 0 0     Add
1 x x        0 0 1 0     Subtract
1 x x        0 1 0 0     And
1 x x        0 1 0 1     Or
1 x x        1 0 1 0     Set on less than

The first three rows are non-R-type: the operation is determined by ALUop and is independent of func. For R-type, the operation is determined entirely by func.
ALUctr could use more bits to make extension easier, for example to add XOR, shift, and other operations.

That is, whenever ALUop is 000, we don't care about the func field at all because we know we need the ALU to do an ADD operation (point to the Add column). Whenever ALUop bit<2> is 0 and bit<0> is 1, we know we want the ALU to perform a Subtract regardless of what the func field is. Bit<1> is a don't care because, for our encoding here, ALUop<1> will never be 1 when bit<0> is 1 and bit<2> is 0. Similarly, whenever ALUop bit<2> is 0 and bit<1> is 1, we need the ALU to perform an Or. The tricky part occurs when ALUop bit<2> equals 1. In that case, we have an R-type instruction and we need to look at the func field. In any case, once we have this symbolic column, we can get the actual bit columns by referring to our ALU table on the last slide (use the last table of the last slide if time permits).
67
The Logic Equation for ALUctr<0>
ALUop<2:0>   func<3:0>   ALUctr<0>
0 x 1        x x x x     1
1 x x        0 0 1 0     1
1 x x        1 0 1 0     1

The last two rows make func<3> a don't care.
ALUctr<0> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>

From the truth table we had before the break, we can derive the logic equation for ALUctr bit<0> by collecting all the rows that have ALUctr bit<0> equal to 1, and this table is the result. Each row becomes a product term and we need to OR the product terms together. Notice that the last two rows are identical except for bit<3> of the func field: one is zero and the other is one. Together, they make bit<3> a don't care term. With all these don't care terms, the logic equation is rather simple. The first product term is: not ALUop<2> and ALUop<0>. The second product term, after making func<3> a don't care, becomes ...
68
The Logic Equation for ALUctr<1>
ALUop<2:0>   func<3:0>   ALUctr<1>
0 1 0        x x x x     1
1 x x        0 1 0 0     1
1 x x        0 1 0 1     1

ALUctr<1> = !ALUop<2> & ALUop<1> & !ALUop<0> + ALUop<2> & !func<3> & func<2> & !func<1>

Here is the truth table when we collect all the rows where ALUctr bit<1> equals 1. Once again, we can simplify the table: the last two rows differ only at func bit<0>, so func bit<0> becomes a don't care. Consequently, the logic equation for ALUctr bit<1> becomes ...
69
The Logic Equation for ALUctr<2>
ALUop<2:0>   func<3:0>   ALUctr<2>
0 1 0        x x x x     1
1 x x        0 1 0 1     1

ALUctr<2> = !ALUop<2> & ALUop<1> & !ALUop<0> + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>

Finally, after we gather all the rows where ALUctr bit<2> is 1, we have this truth table. Well, we are out of luck here. I don't see any simple way to simplify these product terms by just looking at them. There may be some if you draw out the 7-dimensional K-map, but I am not going to try it. So I just write down the logic equation as it is.
70
Local ALU control unit logic
[Figure: ALU Control (Local) block with inputs func and ALUop and output ALUctr.]
Summarizing the previous results, we obtain:
ALUctr<0> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>
ALUctr<1> = !ALUop<2> & ALUop<1> & !ALUop<0> + ALUop<2> & !func<3> & func<2> & !func<1>
ALUctr<2> = !ALUop<2> & ALUop<1> & !ALUop<0> + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
From these logic equations, the local ALU control unit can be implemented!
With all the logic equations available, you should be able to implement this logic block without any problem.
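As a quick sanity check, the three equations can be evaluated directly. The sketch below is an assumption-laden illustration: the function names are hypothetical, ALUop is passed as a 3-bit integer, and only the low four bits of func are used. It evaluates the equations exactly as written above and makes no claim about which ALUctr code the ALU assigns to which operation.

```python
def inv(bit: int) -> int:
    """Logical NOT of a single bit."""
    return 1 - bit


def alu_control(aluop: int, func4: int) -> int:
    """Evaluate the three ALUctr equations; returns ALUctr<2:0> as an integer.

    aluop is the 3-bit ALUop, func4 is func<3:0>. Hypothetical helper.
    """
    a2, a1, a0 = (aluop >> 2) & 1, (aluop >> 1) & 1, aluop & 1
    f3, f2, f1, f0 = (func4 >> 3) & 1, (func4 >> 2) & 1, (func4 >> 1) & 1, func4 & 1
    ctr0 = (inv(a2) & a0) | (a2 & inv(f2) & f1 & inv(f0))
    ctr1 = (inv(a2) & a1 & inv(a0)) | (a2 & inv(f3) & f2 & inv(f1))
    ctr2 = (inv(a2) & a1 & inv(a0)) | (a2 & inv(f3) & f2 & inv(f1) & f0)
    return (ctr2 << 2) | (ctr1 << 1) | ctr0


if __name__ == "__main__":
    # ALUop values from the encoding table: lw/sw 000, beq 001, ori 010, R-type 100.
    for name, aluop, func4 in [("lw/sw", 0b000, 0), ("beq", 0b001, 0), ("ori", 0b010, 0),
                               ("add", 0b100, 0b0000), ("sub", 0b100, 0b0010),
                               ("and", 0b100, 0b0100), ("or", 0b100, 0b0101)]:
        print(name, format(alu_control(aluop, func4), "03b"))
```

One useful property to look for in the output: instructions that need the same ALU operation (lw/sw and add, beq and sub, ori and or) should receive the same ALUctr code.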
71
Truth table of the main control unit
[Figure: op (6 bits) feeds Main Control, which outputs RegDst, ALUSrc, ... and ALUop; ALUop and func feed the ALU (Local) Control, which outputs ALUctr.]
Input of the main control unit: op. Outputs of the main control unit: the signals below.

op                 R-type     ori   lw    sw    beq        jump
RegDst             1          0     0     x     x          x
ALUSrc             0          1     1     1     0          x
MemtoReg           0          0     1     x     x          x
RegWrite           1          1     1     0     0          0
MemWrite           0          0     0     1     0          0
Branch             0          0     0     0     1          0
Jump               0          0     0     0     0          1
ExtOp              x          0     1     1     x          x
ALUop (Symbolic)   "R-type"   Or    Add   Add   Subtract   xxx
ALUop<2>           1          0     0     0     0          x
ALUop<1>           x          1     0     0     x          x
ALUop<0>           x          0     0     0     1          x

Now that we have taken care of the local control (ALU Control), let's refocus our attention on the Main Control. The job of the Main Control is to look at the opcode field of the instruction and generate these control signals for the datapath (RegDst, ... ExtOp) as well as the 3-bit ALUop field for the ALU Control. Here, I have shown you the symbolic value of the ALUop field as well as the actual bit assignment. For example, here (2nd column) the R-type ALUop is encoded as 100, and the Add operation (3rd column) is encoded as 000. This is called a quote "Truth Table" unquote because, if you think about it, this is like having the truth table rotated 90 degrees. Let me show you what I mean by that.
72
Examining the logic equation of each control signal (e.g., RegWrite)

op         R-type   ori   lw   sw   beq   jump
RegWrite   1        1     1    0    0     0

RegWrite = R-type + ori + lw
         = !op<5> & !op<4> & !op<3> & !op<2> & !op<1> & !op<0>   (R-type)
         + !op<5> & !op<4> & op<3> & op<2> & !op<1> & op<0>      (ori)
         + op<5> & !op<4> & !op<3> & !op<2> & op<1> & op<0>      (lw)

[Figure: instruction decoder. AND gates on op<5> ... op<0> produce the terms R-type, ori, lw, sw, beq, jump; an OR gate combines the R-type, ori, and lw terms into RegWrite.]
For example, consider the control signal RegWrite. If we treat all the don't cares as zeros, this row means RegWrite has to be equal to one whenever we have an R-type, an OR immediate, or a load instruction. Since we can determine whether we have any of these instructions (point to the column headers) by looking at the bits in the "op" field, we can transform this symbolic equation into this binary logic equation. For example, the first product term says we have an R-type instruction whenever all the bits in the "op" field are zeros. So each of these big AND gates implements one of the columns (R-type, ori, ...) in our table. Or, in more technical terms, each AND gate implements a product term. To finish implementing this logic equation, we have to OR the proper terms together. In the case of the RegWrite signal, we need to OR the R-type, ori, and load terms together.
73
PLA implementation of Main Control
[Figure: PLA. The instruction decoder (AND plane) takes op<5> ... op<0> and produces the terms R-type, ori, lw, sw, beq, jump; the OR plane produces RegWrite, ALUSrc, RegDst, MemtoReg, MemWrite, Branch, Jump, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>.]
Similarly, for ALUSrc, we need to OR the ori, load, and store terms together because we need to assert ALUSrc whenever we have an ori, load, or store instruction. The RegDst, MemtoReg, MemWrite, Branch, and Jump signals are very simple. They don't need to OR any product terms together because each is asserted for only one instruction. For example, RegDst is asserted ONLY for R-type instructions and MemtoReg is asserted ONLY for the load instruction. ExtOp, on the other hand, needs to be set to 1 for both the load and store instructions so the immediate field is sign-extended properly. Therefore, we need to OR the load and store terms together to form the signal ExtOp. Finally, we have the ALUop signals. By clever encoding of the ALUop field, we are able to keep them simple so that no OR gate is needed. If you don't already know, this regular structure with an array of AND gates followed by an array of OR gates is called a Programmable Logic Array, or PLA for short. It is one of the most common ways to implement logic functions, and there are a lot of CAD tools available to simplify them.
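The AND-plane / OR-plane structure can be mimicked in software. The following is a minimal sketch, not the lecture's implementation: the names `OPCODES` and `main_control` are hypothetical, don't cares are treated as zeros, and the opcodes for sw, beq, and jump are the standard MIPS encodings, which this section does not spell out (the R-type, ori, and lw opcodes match the RegWrite equation above).

```python
# Opcode per instruction (op<5:0>); sw/beq/jump values are assumed standard MIPS.
OPCODES = {
    "R-type": 0b000000,
    "ori":    0b001101,
    "lw":     0b100011,
    "sw":     0b101011,
    "beq":    0b000100,
    "jump":   0b000010,
}


def main_control(op: int) -> dict:
    """Return the main control outputs for a 6-bit opcode (don't cares as 0)."""
    # AND plane: one product term per instruction, 1 when op matches exactly.
    term = {name: int(op == code) for name, code in OPCODES.items()}
    # OR plane: each control signal ORs the product terms that assert it.
    return {
        "RegWrite": term["R-type"] | term["ori"] | term["lw"],
        "ALUSrc":   term["ori"] | term["lw"] | term["sw"],
        "RegDst":   term["R-type"],
        "MemtoReg": term["lw"],
        "MemWrite": term["sw"],
        "Branch":   term["beq"],
        "Jump":     term["jump"],
        "ExtOp":    term["lw"] | term["sw"],
        # ALUop<2>, ALUop<1>, ALUop<0> each come from a single term (no OR gate).
        "ALUop":    (term["R-type"] << 2) | (term["ori"] << 1) | term["beq"],
    }


if __name__ == "__main__":
    for name, code in OPCODES.items():
        print(name, main_control(code))
```

An opcode outside the six listed matches no product term, so every output is 0; a real design would have to decide what to do with such instructions.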
74
The complete single-cycle processor executing the seven instructions above
[Figure: complete single-cycle processor. Instr<31:26> (op) feeds Main Control and Instr<5:0> (func) feeds ALU Control, which receives ALUop and outputs ALUctr (3 bits). The Instruction Fetch Unit (inputs Branch, Jump, Zero, Clk) outputs Instruction<31:0>, split into Rs <25:21>, Rt <20:16>, Rd <15:11>, and Imm16 <15:0>. The datapath contains the RegDst mux, the 32 x 32-bit register file (Rw, Ra, Rb, busW, busA, busB, RegWr), the extender (ExtOp), the ALUSrc mux, the ALU, the data memory (WrEn = MemWr, Adr, Data In), and the MemtoReg mux.]
OK, now that we have the Main Control implemented, we have everything we need for the single-cycle processor, and here it is. The Instruction Fetch Unit gives us the instruction. The op field is fed to the Main Control for decoding and the func field is fed to the ALU Control for local decoding. The Rt, Rs, Rd, and Imm16 fields of the instruction are fed to the datapath. Based on the op field of the instruction, the Main Control will set the control signals RegDst, ALUSrc, etc. properly, as I showed you earlier using separate slides. Furthermore, the ALU Control uses the ALUop from the Main Control and the func field of the instruction to generate the ALUctr signals that ask the ALU to do the right thing: Add, Subtract, Or, and so on. This processor will execute each of the MIPS instructions in the subset in one cycle. There are, however, a couple of subtle differences between this single-cycle processor and a real MIPS processor in terms of instruction execution.
75
The lw instruction takes the longest to execute; its execution time is used as the clock cycle
[Figure: worst-case timing diagram for lw. Each signal changes from its old value to its new value in this order: PC (after Clk-to-Q); Rs, Rt, Rd, Op, Func (after the instruction memory access time); ALUctr, ExtOp, ALUSrc, MemtoReg, RegWr (after the delay through the control logic); busA (after the register file access time); busB (after the delay through the extender and mux); Address (after the ALU delay); busW (after the data memory access time). The register write occurs at the next clock edge.]
This timing diagram shows the worst-case timing of our single-cycle datapath, which occurs for the load instruction. A Clock-to-Q time after the clock tick, the PC presents its new value to the instruction memory. After a delay equal to the instruction memory access time, the instruction bus (Rs, Rt, ...) becomes valid. Then three things happen in parallel: (a) the Control generates the control signals (delay through control logic); (b) the register file is accessed to put Rs onto busA; (c) the immediate field is sign-extended to get the second operand (busB). Here I assume the register file access takes longer than the sign extension, so we have to wait until busA is valid before the ALU can start the address calculation (ALU delay). With the address ready, we access the data memory, and after a delay equal to the data memory access time, busW will be valid. By this time, the control unit will have set the RegWr signal to one, so at the next clock tick we will write the new data coming from memory (busW) into the register file.
76
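Performance of the single-cycle computer
What is the CPI of a single-cycle processor? CPI = 1!
Other conditions being equal, the smaller the CPI, the better the performance. CPI = 1: isn't that great?
Will a single-cycle processor perform well? Why or why not?
Besides CPI, computer performance also depends on the length of the clock cycle.
The clock cycle of a single-cycle processor equals the execution time of the most complex instruction.
Many instructions could complete in less time.
Performance of the single-cycle computer: an example
Assume that in a single-cycle processor the operation times of the main functional units are: memory units: 200 ps; ALU and adders: 100 ps; register file (read/write): 50 ps.
Assume that the MUXes, control unit, PC, extender, and wires have no delay. Which of the following implementations is faster, and by how much?
(1) Every instruction completes in one clock cycle of fixed length.
(2) Every instruction completes in one clock cycle, but the cycle is only as long as that instruction needs, i.e., it is variable (not feasible in practice; for comparison only).
Assume the instruction mix is: 25% loads, 10% stores, 45% ALU, 15% branches, 5% jumps.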
77
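Performance of the single-cycle computer
Solution: CPU execution time = instruction count x CPI x clock cycle = instruction count x clock cycle (since CPI = 1)
The time required by each instruction class is: load 600 ps, store 550 ps, ALU 400 ps, branch 350 ps, jump 200 ps.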
78
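Performance of the single-cycle computer
For implementation (1), the clock cycle is determined by the longest instruction, which is the load instruction: 600 ps.
For implementation (2), the clock cycle is whatever each instruction needs, ranging from 600 ps down to 200 ps.
Weighting by the frequency of each instruction class, the average clock cycle is:
CPU clock cycle = 600 x 25% + 550 x 10% + 400 x 45% + 350 x 15% + 200 x 5% = 447.5 ps
CPU performance ratio = (CPU execution time of (1)) / (CPU execution time of (2)) = (clock cycle of (1)) / (clock cycle of (2)) = 600 / 447.5 = 1.34
So the variable-clock implementation is 1.34 times as fast as the fixed-length one!
However, implementing a variable-length clock cycle for each instruction class is very difficult, and the extra overhead would be large, so it is not worth it!
Early computers with small instruction sets used the single-cycle technique, but modern computers do not.
The next lecture introduces the multi-cycle datapath and controller, whose characteristics are: a fixed clock cycle and a variable number of clock cycles per instruction.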
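The arithmetic on this slide can be reproduced with a short script. This is a sketch under assumptions: the dictionary names are hypothetical, and the list of functional units used by each instruction class (instruction memory, register read, ALU, data memory, register write) is not spelled out in this section; it is chosen so that the totals match the 600/550/400/350/200 ps used in the formula above.

```python
# Unit delays in picoseconds, as given in the example.
UNIT_PS = {"mem": 200, "alu": 100, "reg": 50}

# Assumed functional-unit usage per instruction class (not stated on the slide).
USES = {
    "load":   ["mem", "reg", "alu", "mem", "reg"],  # fetch, read, address, data, write back
    "store":  ["mem", "reg", "alu", "mem"],         # fetch, read, address, data
    "alu":    ["mem", "reg", "alu", "reg"],         # fetch, read, operate, write back
    "branch": ["mem", "reg", "alu"],                # fetch, read, compare
    "jump":   ["mem"],                              # fetch only
}

# Instruction mix in percent, kept as integers to avoid rounding noise.
MIX_PERCENT = {"load": 25, "store": 10, "alu": 45, "branch": 15, "jump": 5}


def instruction_times() -> dict:
    """Time in ps needed by each instruction class."""
    return {name: sum(UNIT_PS[u] for u in units) for name, units in USES.items()}


def fixed_cycle() -> int:
    """Implementation (1): the cycle must fit the slowest instruction."""
    return max(instruction_times().values())


def average_variable_cycle() -> float:
    """Implementation (2): frequency-weighted average cycle length."""
    times = instruction_times()
    return sum(times[name] * MIX_PERCENT[name] for name in times) / 100


if __name__ == "__main__":
    print(instruction_times())
    print(fixed_cycle(), average_variable_cycle())
    print(round(fixed_cycle() / average_variable_cycle(), 2))
```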
79
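Lecture 2 summary
We examined how each instruction executes in the single-cycle datapath.
Every instruction completes in one clock cycle.
Each time the clock arrives, the instruction fetch operation begins.
After clk-to-Q, the PC holds its new value; after the access time, the current instruction is available.
The next instruction address is computed in three ways, and under control of branch / zero / jump one of them is selected and sent to the PC input. It is not written into the PC immediately; the PC is updated only when the next clock arrives. The three next-address cases are:
branch = jump = 0: PC+4
branch = zero = 1: PC+4+signExt[imm16]*4
jump = 1: PC<31:28> concat target<25:0> concat "00"
After the instruction is fetched, it is decoded to produce the corresponding control signals:
R-type instructions: rd is the destination register, no memory access, ...
ori instruction: rt is the destination register, zero extension, no memory access, ...
lw instruction: rt is the destination register, address calculation, sign extension, memory read, ...
sw instruction: rt is a source register, address calculation, sign extension, memory write, ...
Collect the control signal values of every instruction, build the truth table, write the logic expressions, and design the main control logic and the ALU local control logic.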
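The three next-address cases listed in this summary can be written as one selection function. This is a minimal sketch: the names `sign_extend16` and `next_pc` are hypothetical, addresses are byte addresses held in plain integers, and jump is given priority over branch, which matches the remark that Branch may be a don't care when Jump = 1.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, branch: int, zero: int, jump: int,
            imm16: int = 0, target26: int = 0) -> int:
    """Select the value presented to the PC input for the next clock edge."""
    if jump:
        # PC<31:28> concat target<25:0> concat "00"
        return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)
    if branch and zero:
        # PC + 4 + signExt[imm16] * 4, wrapped to 32 bits
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    # Sequential execution
    return (pc + 4) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(next_pc(0x1000, branch=0, zero=0, jump=0)))
    print(hex(next_pc(0x1000, branch=1, zero=1, jump=0, imm16=3)))
    print(hex(next_pc(0x1000, branch=1, zero=0, jump=0, imm16=3)))
```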
80
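Lecture 3: Design of the multi-cycle processor
Main contents:
The idea behind the multi-cycle datapath
Differences between the single-cycle and multi-cycle datapaths
A brief analysis of the staged execution of the LOAD instruction, to deepen understanding of those differences
The "race" problem of storage elements in the multi-cycle datapath, and how to resolve it
Detailed analysis of how the 7 instructions execute in the multi-cycle datapath
Based on that analysis, determine the control signal values in each cycle and generate the corresponding states
Combine them into the state transition diagram for all instructions
From the state transition diagram, derive the logic expressions of the controller outputs
From the logic expressions, implement the control logic with a PLA (hardwired)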
81
Drawback of Single Cycle Processor
The CPI of the single-cycle processor is 1, and the execution time of every instruction is set by the longest one, the load instruction.
The cycle time must be long enough for the load instruction:
PC's Clock-to-Q + Instruction Memory Access Time + Register File Access Time + ALU Delay (address calculation) + Data Memory Access Time + Register File Setup Time + Clock Skew
The clock cycle is far longer than what the other instructions actually need, so efficiency is very low:
R-type and immediate arithmetic instructions do not need to read data memory.
Store instructions do not need to write a register.
Branch instructions need neither data memory access nor a register write.
Jump needs no ALU operation, no data memory read, and no register read or write.
Well, the last slide pretty much illustrates one of the biggest disadvantages of the single-cycle implementation: it has a long cycle time. More specifically, the cycle time must be long enough for the load instruction, which has the following components: Clock-to-Q time of the PC, .... Having a long cycle time is a big problem but not the only problem. Another problem of this single-cycle implementation is that this cycle time, which is long enough for the load instruction, is too long for all other instructions. We will show you why this is bad and what we can do about it in the next few lectures. That's all for today.
82
The idea behind the multi-cycle processor
Root of the single-cycle processor's problem: the clock cycle is set by the time the most complex instruction needs, which is too long!
Solution:
Divide instruction execution into several stages, each completed in one clock cycle.
The clock cycle is set by the time of the most complex stage.
Divide into stages of roughly equal length where possible.
Each stage may perform at most: 1 memory access, or 1 register file read/write, or 1 ALU operation.
Each step has corresponding storage elements; the result of each step is saved into them at the start of the next clock.
Benefits of the multi-cycle processor:
Short clock cycle.
Different instructions can take different numbers of cycles, e.g., Load: five cycles; Jump: three cycles.
A functional unit can be reused within the execution of one instruction, e.g.:
Adder + ALU (the multi-cycle design uses only one ALU, reused in different cycles)
Inst. / Data mem (the multi-cycle design shares one memory, reused in different cycles)
Well, the root of these problems, of course, is the fact that the single-cycle processor's cycle time has to be long enough for the slowest instruction. The solution is simple. Just break the instruction into smaller steps and, instead of executing an entire instruction in one cycle, execute each of these steps in one cycle. Since the cycle time in this case will be the time it takes to execute the longest step, our goal should be to keep all the steps of similar length when we break up the instruction. Well, the last two bullets pretty much summarize what a multiple-cycle processor is all about. The first advantage of the multiple-cycle processor is of course a shorter cycle time. The cycle time now only has to be long enough to execute the longest step. But maybe more importantly, different instructions can now take different numbers of cycles to complete. For example: (1) the load instruction will take five cycles to complete; (2) but the Jump instruction will only take three cycles. This feature greatly reduces the idle time inside the processor. Finally, the multiple-cycle implementation allows a functional unit to be used more than once per instruction as long as it is used in different clock cycles. For example, as I will show you later in today's lecture, we can use the ALU to increment the Program Counter as well as to do the address calculation.
83
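Multi-cycle datapath
Can Figure 6.32 on p.185 be adjusted as follows? No! Because….
Only one ALU and one memory; MUXes and temporary registers are added in several places.
[Figure: multi-cycle datapath; labels include "instruction" and "MDR".]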
84
The Load instruction divided into 5 stages: Instruction Fetch, Instr Decode / Reg. Fetch, Address, Data Memory, Reg Wr
[Figure: lw timing diagram divided into stages. Signals in order: PC (Clk-to-Q); Rs, Rt, Rd, Op, Func (instruction memory access time); ALUctr, ExtOp, ALUSrc, RegWr (delay through control logic); busA (register file access time); busB (delay through extender and mux); Address (ALU delay); busW (data memory access time); then the register file write time.]
Well, let's take a look at the Load instruction's timing diagram and see how we can break it up into smaller steps. The biggest contributors to the cycle time appear to be: (1) instruction memory access time; (2) delay through the control logic, which happens in parallel with register file access; (3) ALU delay; (4) data memory access time; (5) and register file write time. Therefore, it makes sense to break the Load instruction into these five steps: (1) Instruction Fetch; (2) Instruction Decode / Register Fetch; (3) Memory Address Calculation; (4) Data Memory Access; (5) and finally, Register File Write. Notice that here I have used the term register file write time instead of register file write setup time. The reason is that in a "real" register file, there is no such thing as setup time.
85
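Analysis of each stage of the Load instruction
(1) Instruction fetch stage
Perform one memory read.
The content read (the instruction) is saved into register IR (the instruction register).
IR is not updated on every clock, so IR must have a "write enable" control.
At the end of the fetch stage, the ALU output is PC+4 and is sent to the PC input, but the PC must not be updated on every clock, so the PC also needs a "write enable" control.
(2) Decode / register file read stage
After the control logic delay, the control signals take their new values.
Perform one register read.
The contents read (the operands) are saved into temporary registers A and B.
A and B are updated on every clock, so they need no "write enable" control.
The 16-bit immediate is sign-extended and sent to the multiplexer on the ALU's B input.
(3) Address generation stage (ALU operation)
Under the corresponding control signals, the multiplexers on the ALU's A and B inputs select the operands for an addition; when the next clock arrives, the result is saved into the temporary register BranchTarget (ALUout).
(4) Memory read stage
ALUout is used as the address to access memory; the data read is saved into the temporary register MDR.
(5) Write result to register
The content of MDR is written into the register file.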
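The five stages can be traced at the register-transfer level. The following is a simplified sketch, not the lecture's datapath: the function and variable names are hypothetical, memory is a dictionary keyed by byte address, and each commented step stands for the state saved at the end of one clock cycle.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def multicycle_load(pc: int, memory: dict, regs: list) -> int:
    """Run one lw through five cycles; updates regs in place, returns new PC."""
    # Cycle 1, instruction fetch: IR <- M[PC]; PC <- PC + 4
    ir = memory[pc]
    pc = (pc + 4) & 0xFFFFFFFF
    # Cycle 2, decode / register read: A <- R[rs]; B <- R[rt]; extend imm16
    rs, rt = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F
    a, b = regs[rs], regs[rt]          # B is read speculatively; lw does not use it
    imm = sign_extend16(ir & 0xFFFF)
    # Cycle 3, address generation: ALUout <- A + SignExt(imm16)
    alu_out = (a + imm) & 0xFFFFFFFF
    # Cycle 4, memory read: MDR <- M[ALUout]
    mdr = memory[alu_out]
    # Cycle 5, write back: R[rt] <- MDR
    regs[rt] = mdr
    return pc


if __name__ == "__main__":
    # lw with rs = 9, rt = 8, imm16 = 8 (op = 100011), built from its fields
    instr = (0b100011 << 26) | (9 << 21) | (8 << 16) | 8
    memory = {0x0: instr, 0x108: 42}
    regs = [0] * 32
    regs[9] = 0x100
    new_pc = multicycle_load(0x0, memory, regs)
    print(hex(new_pc), regs[8])
```

Keeping IR, A, ALUout, and MDR as separate variables mirrors why the multi-cycle datapath needs temporary registers: each value must survive past the clock edge that ends the cycle producing it.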
86
Write timing of the register file and memory (Ideal vs. Reality)
In the single-cycle machine, the register file and memory are simplified as ideal, clock-controlled elements:
A write happens only when the clock edge arrives.
Address, data, and write enable are already stable before the clock edge arrives.
In a real machine, the register file and memory behave as follows:
Registers have a clock input; memory has no clock input.
A write is not triggered by a clock edge; it is a combinational process:
once Write Enable is 1 and the Din signal is stable, Din is written to location Adr after the Write Access time.
Important: address and data must be stable before write enable becomes 1.
[Figure: Ideal Memory with Adr, Din (32), WrEn, Dout, and Clk inputs; Real Memory with Adr, Din (32), WrEn, Dout and no clock.]
Because in a real register file, there is NO clock input (use the bottom picture). In previous lectures, I tried to simplify things by giving both the register file and data memory a clock input such that all writes happen at the clock tick, that is, the H to L transition of the clock. Consequently, the address bus, the Data In bus, and the Write Enable signal must ALL be stable at least ONE setup time before the clock tick. But in real life, neither the register file nor the data memory has a clock input. The write path is purely combinational. That is: (1) after the control signal Write Enable has gone to 1 and the Data In bus has settled to a given value, (2) it will take a delay equal to the memory write access delay (3) BEFORE the value on the Data In bus is written into the memory location specified by the address bus. It is very, VERY important that the address bus is stable BEFORE the control signal Write Enable is set to 1. Otherwise, you may end up destroying data already in memory by writing to the wrong address if there are any glitches on the address bus when Write Enable is asserted.
Therefore, there is a "race" problem among the address Adr, the data Din, and the write enable WrEn signals!
87
The race problem
Register file:
A real register file (without Clk) cannot work reliably in the single-cycle datapath, because:
Rw cannot be guaranteed to be stable before RegWr = 1.
That is, there is a "race" between Rw and RegWr (write enable).
Memory:
A real memory cannot work reliably in the single-cycle datapath either:
Adr cannot be guaranteed to be stable before WrEn = 1.
That is, there is a "race" between Adr and WrEn.
[Figure: Reg File with Ra, Rb, Rw (5 bits each), busW, busA, busB (32 bits), and RegWr; Real Memory with Adr, Din (32), WrEn, Dout.]
Notice that this real register file, which does not have a clock input, may not work reliably in our single-cycle processor because, if you look at the timing diagram, you will notice that: (1) we cannot guarantee that Rw, which specifies the register to be written, will be stable BEFORE the control signal RegWr goes to 1; (2) in other words, we have a race between the settling of Rw and the assertion of RegWr. On a good day, if Rw does settle down before RegWr goes to 1, everything works. But once in a while, if RegWr happens to go to 1 before Rw settles down, we have a problem. A race condition like this is what causes a machine to crash mysteriously during initial testing. Similarly, I did not use this data memory in our single-cycle processor design because we cannot guarantee the address bus to be stable BEFORE Write Enable is set to 1. Once again, we have a race condition between the Address and the Write Enable signal. How can we avoid these two race conditions in our multiple-cycle implementation?
88
How to avoid the "race" problem in the multi-cycle datapath
The "race" problem can sometimes cause the machine to fail mysteriously, or even crash!
Solution for the multi-cycle design:
Make sure the address and data are stable by the end of cycle N.
Assert the write enable signal one cycle later (i.e., in cycle N+1).
Do not change the address or data until the write enable signal is deasserted.
[Figure: Reg File (Ra, Rb, Rw, busA, busB, busW, RegWr) and Real Memory (Adr, Din, Dout, WrEn).]
Well, for the multiple-cycle implementation, we can avoid this race condition by: (1) making sure the address bus is stable by the end of cycle N; (2) then asserting the write enable signal ONE cycle later, at cycle N + 1; (3) finally, making sure the address bus does not change until the Write Enable signal is deasserted.
89
At the start of the instruction fetch cycle (fetch the instruction, compute the next address)
The tasks of the fetch cycle begin at the falling edge of a clock: M[PC]; PC ← PC + 4
[Figure: clock waveform marking one "logic" clock cycle, with "You are here!" at the start of the cycle; PC drives RAdr of the real memory and one ALU input, the constant 4 drives the other ALU input; the memory output Dout goes to IR; control signals PCWr, MemWr, IRWr, ALUop.]
When the next clock arrives, what should be at the inputs of PC and IR? Can PC and IR be updated on every clock? How are PC and IR updated only when necessary? Add a "write enable" control!
Control signals PCWr = ?, MemWr = ?, IRWr = ?, ALUop = ?
Control signals PCWr = 1, MemWr = 0, IRWr = 1, ALUop = add
As far as LOGIC is concerned, I think the easiest way to think about a clock cycle is that a clock cycle begins right AFTER a clock tick and ends at the next clock tick. I have intentionally shown the L time to be much longer than the H time to emphasize a point: the H and L times do not affect your design as long as you use the simple clocking methodology where all storage elements are triggered at the same clock tick. The only important thing here is the time between two clock ticks, the cycle time. Most of the time, however, the high and low times are the same because it is much easier to generate a clock whose high and low times have the same length. Well, enough about clock ticks. Let's see what happens at the beginning (You are Here) of the Ifetch cycle: (a) we need to fetch the instruction from memory, so we send the address to the memory; (b) we also need to update the PC, so we had better send the address to the ALU as well. What values do you think the control signals PCWr and ALUop have at this point? Well, since we are only at the beginning of the cycle (You are Here), these two signals will still have the old values from the last cycle of the previous instruction. See the next slide for their new values.
90
At the end of the instruction fetch cycle
Every cycle ends when the next clock arrives (at which point the storage elements are updated): IR ← M[PC]; PC ← PC + 4
[Figure: clock waveform with "You are here!" at the end of the cycle; same fetch datapath with PCWr = 1, MemWr = 0, IRWr = 1, ALUOp = Add.]
At the end of the fetch cycle, does the new PC value (PC+4) start being written into the PC? Yes: in the next cycle, the PC already holds PC+4.
At the end of the fetch cycle, the current instruction starts being written into IR! To keep the instruction in IR unchanged for the duration of this instruction, IRWr should be 0 in the following cycles.
As time goes by, the output of the memory will become valid and the ALU, with ALUOp set to Add, will finish the 32-bit add. Hopefully, we are smart enough to set the cycle time so that the time between clock ticks is long enough to allow these (the outputs of memory and ALU) to stabilize. So at the end of the cycle, the clock tick will trigger the Instruction Register to save the current instruction word (the output of memory). Similarly, the Program Counter register is triggered (point to the clock input) to save the next instruction's address (the output of the ALU). Unlike the single-cycle processor, where a 30-bit PC can shorten two adders by two bits each, here we are using the 32-bit ALU to do the PC update anyway. So the only saving we could get from a 30-bit PC is two register bits. That's why we didn't bother and kept a 32-bit Program Counter. The memory unit here is also used to store data, and the ALU here is also used for instruction execution. Therefore, we know we will need some MUXes in front of them.
91
Examining the whole instruction fetch cycle (first cycle)
Think about it: how does this differ from the single-cycle design?
The time at which the PC is updated
An additional instruction register, IR
Each cycle generates its own control signals
……
Why is PCWr needed in the multi-cycle design?
Analysis: what values should the control signals take in the fetch cycle?
PCWr = 1, PCWrCond = x, PCSrc = 0, BrWr = 0, IorD = 0, MemWr = 0, IRWr = 1, ALUSelA = 0, ALUSelB = 01, ALUOp = Add
State "Ifetch": 1: PCWr, IRWr; ALUOp = Add; x: PCWrCond, RegDst, Mem2R; others: 0s
The result of the analysis is one generated "state".
[Figure: fetch-cycle datapath with the PCSrc mux (PC+4 or Target) in front of the PC, the IorD mux in front of the memory read address, the ALUSelA and ALUSelB muxes in front of the ALU, the Instruction Reg, and the ALU Control.]
For example, the memory can get its read address from the PC for instruction fetch, but it can also get the read address from another part of the datapath for data fetch. Similarly, the ALU can get its operands from the PC and a constant 4, as I showed you on the last slide, but we know the ALU can also get its operands from the register file. We will fill in the details here (Hole) later, but for now we know we need to set IorD, MemWr, and ALUSelA to zero, select the constant 4 with ALUSelB, and set IRWr and PCWr to 1. Notice that I have added a MUX in the PC feedback path because we know that, for the branch instruction, the next PC will have a value OTHER than PC plus 4 (ALU inputs). We will worry about how we get this "other" value (Target) later. For this cycle, we have to set the MUX control (PCSrc) to zero to select the PC plus 4 value. The settings of all the control signals are summarized in this circle. Due to space limitations, I have only shown the signals that have values other than zero. I want to emphasize that this is the picture at the END of the instruction fetch cycle, where evaluation is completed and control signals are settled. This is the interesting part. The start of the cycle is boring by comparison. Consequently, all the datapath pictures I show from now on are pictures at the end of a cycle.
92
Register fetch / instruction decode cycle (second cycle)
busA ← RegFile[rs]; busB ← RegFile[rt]; Decoder ← Op and Func; ALU is not being used: ALUctr = xx
The instruction has not been decoded yet, so only the common operations are performed.
The ALU is idle, so it can be used to "speculatively compute" the branch address!
PCWr = 0, PCWrCond = 0, PCSrc = x, IorD = x, MemWr = 0, IRWr = 0, RegDst = x, RegWr = 0, ALUSelA = x, ALUSelB = xx, ALUOp = xx
[Figure: multi-cycle datapath with PC, real memory, Instruction Reg, register file (Ra = Rs, Rb = Rt, Rw = Rt or Rd), ALU with input muxes, and ALU Control; Op (6 bits) and Func (6 bits) go to the Control, Imm (16 bits) to the extender.]
Question: the PC already holds the address of the next sequential instruction. Does this affect the execution of the current instruction? No, because IRWr = 0!
What does the datapath look like once the speculative branch address calculation is taken into account?
Now that we have the instruction word saved in the IR, the next thing we can do is decode the instruction (go to the Control) and fetch the registers from the register file (Rs, Rt). I want to point out that at this point we do not yet know what instruction we have, because we are still in the process of decoding the Op and Func fields. Therefore we are "jumping the gun" in fetching registers Rs and Rt from the register file. The Rt field may not even be a source register if we have an I-type instruction. But this is OK because if, after we decode the instruction, we realize we don't need the registers, we just don't use them. No big deal. Notice that the ALU is not being used in this cycle. That is not good. Instead of just letting the ALU sit idle, we may as well let it do something. Can we think of any way to use the ALU in this cycle? (See next slide.) We cannot use the ALU to do anything involving the registers because we are still in the process of reading them; we do not have the register values yet.
Complement: the path from Rb to busB is speculative because it might not be needed.
93
Register fetch / instruction decode cycle (second cycle)
busA ← Reg[rs]; busB ← Reg[rt]; Decoder ← Op and Func;
Speculation: Target ← PC + SignExt(Imm16)*4 (why not PC + 4 + SignExt(Imm16)*4?)
Why not send the result directly to the PC? Why add BrWr?
PCWr = 0, PCWrCond = 0, PCSrc = x, BrWr = 1, IorD = x, MemWr = 0, IRWr = 0, RegDst = x, RegWr = 0, ALUSelA = 0, ALUSelB = 10, ALUOp = Add, ExtOp = 1
State "Rfetch/Decode": 1: BrWr, ExtOp; ALUOp = Add; ALUSelB = 10; x: RegDst, PCSrc, IorD, MemtoReg; others: 0s
[Figure: multi-cycle datapath with the Target register added at the ALU output, the extender followed by "<< 2" feeding the ALUSelB mux, and the instruction decoder (Control) with outputs Beq, Rtype, Ori, Memory.]
What is the result of execution at the end of the second cycle?
What we can do is use this ALU to calculate the branch address in advance. (1) We set ALUSelA to 0 so that the PC is fed to the ALU input. (2) The other ALU input comes from (ALUSelB = 10) the sign-extended (ExtOp = 1) version of the 16-bit immediate field. Once we have added (ALUOp = Add) these two numbers together, we save the result in the Target register (BrWr = 1). We cannot write the ALU output to the PC yet (PCWr = 0). Op and Func are still being decoded in this cycle (Control), so we cannot update the PC to this value unless we are SURE we have a branch and the branch condition is met (AND-OR). Once again, I have summarized all the control signal settings inside this circle. So far, this and the instruction fetch cycle are shared by all instructions. But by the end of this cycle, we will know exactly what instruction we have (Control output). Let's say we have a branch; what do we do?
94
Register Fetch / Instruction Decode cycle (second cycle):
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=1 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=0 ALUSelB=10 ALUOp=Add ExtOp=1
If the instruction decoder output is Beq:
Let's go back to the end of the Register Fetch / Instruction Decode cycle. Assume the result of the instruction decode indicates we have a beq instruction. What do we do then? We go to the branch completion cycle.
The third cycle that follows is the first execution cycle of the beq instruction.
95
Branch execution and completion cycle (third cycle): BrFinish
If the instruction decoder output is Branch:
if (busA == busB) PC ← Target
State summary: 1: PCWrCond, ALUSelA, PCSrc; ALUOp=Sub; ALUSelB=01; x: IorD, MemtoReg, RegDst, ExtOp
What are the control signal values?
Control signals: PCWr=0 PCWrCond=1 PCSrc=1 BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=1 ALUSelB=01 ALUOp=Sub ExtOp=x
Without speculation, one more cycle would be needed before this one to compute the branch target address and save it in Target.
We already have the values of registers Rs and Rt on busA and busB from the last cycle; all we have to do is perform a subtract (ALUOp) to compare them (ALUSelA, ALUSelB). If they are equal, the ALU's Zero output is asserted and, with PCSrc and PCWrCond set to one, the branch target is written into the program counter. The branch is taken. If Rs and Rt are not equal, Zero is not asserted and the Target value is NOT written into the program counter. That is, the branch is NOT taken. All control signals not specified in the state circle default to zero.
Whether the PC is updated to Target at the next clock is decided by Zero.
The ALU is never idle: it is reused in every step.
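The conditional PC update in this cycle can be sketched as below. The function name and argument names are hypothetical; the gating expression follows the AND-OR arrangement mentioned in the speaker notes (PC is written when PCWr is set, or when PCWrCond is set and Zero is asserted).

```python
def pc_after_branch_cycle(pc, target, bus_a, bus_b, pc_wr=0, pc_wr_cond=1):
    """Return the PC value after the branch completion cycle.

    ALUOp=Sub compares busA and busB; Zero is 1 when they are equal.
    """
    zero = int(((bus_a - bus_b) & 0xFFFFFFFF) == 0)
    write_pc = bool(pc_wr or (pc_wr_cond and zero))  # AND-OR gate
    return target if write_pc else pc                # PCSrc=1 selects Target


print(hex(pc_after_branch_cycle(0x1004, 0x2000, 5, 5)))  # operands equal
print(hex(pc_after_branch_cycle(0x1004, 0x2000, 5, 6)))  # operands differ
```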
96
Register Fetch / Instruction Decode cycle (second cycle):
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=1 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=0 ALUSelB=10 ALUOp=Add ExtOp=1
If the instruction decoder output is R-Type:
Let's go back to the end of the Register Fetch / Instruction Decode cycle. Assume the result of the instruction decode indicates we have an R-type instruction. What do we do then? Simple enough: we go to the R-type execution cycle.
The third cycle that follows is the first execution cycle of the R-type instruction.
97
R-type execution cycle (third cycle): RExec
ALU Output ← busA op busB
State summary: 1: RegDst, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: PCSrc, IorD, MemtoReg, ExtOp
This is the first cycle specific to an R-type instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=1 RegWr=0 ALUSelA=1 ALUSelB=01 ALUOp=Rtype ExtOp=x MemtoReg=0
Once again, fetching the registers Rs and Rt in the previous cycle pays off. We need these two registers now and they are already on busA and busB. So all we need to do is set ALUSelA and ALUSelB to feed busA and busB into the ALU and tell the ALU local control that we have an R-type instruction (ALUOp). The ALU then generates the correct result (ALU output) by the end of this cycle. Notice that RegDst is set to 1 here even though we are not writing the register file (RegWr is zero); the register file is not written until the next cycle. You might think RegDst should be a don't care at this point. Why is it set to 1? Remember: for this real memory and register file, which have no clock input, the address (Rw) MUST be stable BEFORE Write Enable (RegWr) is set to one. Setting RegDst to one here guarantees that the Rw specifier is stable by the next clock cycle, where the write is performed by setting RegWr to 1.
To avoid the race, this cycle sets RegDst=1 while keeping RegWr=0. Why? So that the address Rw is stable before the write enable RegWr is asserted, ready for the write in the next cycle.
98
R-type completion cycle (fourth cycle): Rfinish
R[rd] ← ALU Output
If there is no Target register, there must be an ALUout register.
State summary: 1: RegDst, RegWr, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: IorD, PCSrc, ExtOp
This is the second cycle specific to an R-type instruction. What are the control signal values?
Control signals: PCSrc=x PCWr=0 PCWrCond=0 BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=1 RegWr=1 ALUSelA=1 ALUSelB=01 ALUOp=Rtype ExtOp=x MemtoReg=0
RegDst=1 keeps Rw stable; RegWr=1 writes the value on busW. ALUSelA=1, ALUSelB=01 and ALUOp=Rtype keep the ALU output stable until the end of this cycle.
Here we finish off the R-type instruction by writing the ALU output back to the register file (MemtoReg=0 and RegWr=1). Notice that, in order to keep the ALU output from changing, the ALUSelA, ALUSelB and ALUOp control signals must remain the same as in the previous cycle.
Has anyone noticed how this differs from the initial description of the multi-cycle datapath? What is missing here, and why can it be left out?
99
Register Fetch / Instruction Decode cycle (second cycle):
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=1 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=0 ALUSelB=10 ALUOp=Add ExtOp=1
If the instruction decoder output is ori:
Let's go back to the end of the Register Fetch / Instruction Decode cycle. Assume the result of the instruction decode indicates we have an OR immediate instruction. What do we do then? We go to the OR immediate execution cycle.
The third cycle that follows is the first execution cycle of the ori instruction.
100
Ori execution cycle (third cycle): OriExec
ALU output ← busA or ZeroExt[Imm16]
State summary: 1: ALUSelA; ALUOp=Or; ALUSelB=11; x: MemtoReg, IorD, PCSrc
This is the first cycle specific to the ori instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=0 RegWr=0 ALUSelA=1 ALUSelB=11 ALUOp=Or ExtOp=0 MemtoReg=0
The first operand of OR immediate comes from register Rs. It is already on busA, so we just set ALUSelA to 1. The second operand does NOT come from Rt. It comes from the zero-extended (ExtOp=0) version of the immediate field (ALUSelB=11). Once we have the operands, all we have to do is ask the ALU to OR (ALUOp) them together, and the ALU output will hold the correct result at the end of this cycle. Notice that RegDst is set to zero so that the Rt field of the instruction word is stable at the register file's Rw address port before the next cycle. What do we do in the next cycle?
To avoid the race, this cycle sets RegDst=0 while keeping RegWr=0.
101
Ori completion cycle (fourth cycle): OriFinish
R[rt] ← ALU output
State summary: 1: ALUSelA, RegWr; ALUOp=Or; ALUSelB=11; x: IorD, PCSrc
This is the second cycle specific to the ori instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=0 RegWr=1 ALUSelA=1 ALUSelB=11 ALUOp=Or ExtOp=0 MemtoReg=0
We do a register write (RegWr=1). Once again, the register write address Rw was set up in advance (RegDst=0) during the previous cycle, which guarantees that Rw is stable when RegWr is asserted in this cycle. Also, remember that there is a multiple-cycle delay path from the Instruction Register to the register file write port. Therefore IRWr must be 0, and ALUSelA, ALUSelB and ALUOp must remain the same as in the previous cycle, to guarantee that the ALU output is stable during the register write.
RegDst=0 keeps Rw stable; RegWr=1 writes the value on busW.
102
Register Fetch / Instruction Decode cycle (second cycle):
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=1 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=0 ALUSelB=10 ALUOp=Add ExtOp=1
If the instruction decoder output is a memory access instruction (lw or sw):
Let's go back to the end of the Register Fetch / Instruction Decode cycle. Assume the result of the instruction decode indicates we have a memory access instruction, that is, either a load or a store. The next cycle we need to enter is the memory address calculation cycle.
The third cycle that follows is the first cycle specific to the lw/sw instruction.
103
lw/sw memory address calculation cycle (third cycle): MemAdr
ALU output ← busA + SignExt[Imm16]
State summary: 1: ExtOp, ALUSelA; ALUOp=Add; ALUSelB=11; x: MemtoReg, PCSrc
This is the first cycle specific to the lw/sw instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=x RegWr=0 ALUSelA=1 ALUSelB=11 ALUOp=Add ExtOp=1 MemtoReg=x
How do we calculate the memory address? We add the contents of register Rs (busA) to the sign-extended (ExtOp=1) version of the immediate field (ALUSelB=11). With ALUOp set to Add, the memory address is valid at the ALU output by the end of this cycle. Let's say we have a store instruction and see what happens next.
The ALU output may be a read address or a write address. Which instruction uses it as a read address, and which as a write address?
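A small sketch of the address calculation, contrasting the two settings of ExtOp used by `ori` (zero extension) and by `lw`/`sw` (sign extension). The helper names `extend16` and `mem_address` are hypothetical, and registers are modeled as plain 32-bit integers.

```python
def extend16(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits.

    ext_op=1: sign extension (lw/sw); ext_op=0: zero extension (ori).
    """
    imm16 &= 0xFFFF
    if ext_op and imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16


def mem_address(bus_a: int, imm16: int) -> int:
    """ALU output <- busA + SignExt[Imm16], wrapped to 32 bits."""
    return (bus_a + extend16(imm16, ext_op=1)) & 0xFFFFFFFF


print(hex(mem_address(0x1000, 0xFFFC)))  # negative offset -4
print(hex(mem_address(0x1000, 0x0008)))  # positive offset +8
```

The same adder result serves as the read address for `lw` and as the write address for `sw`; only the following cycle differs.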
104
sw store cycle (fourth cycle): swFinish
M[ALU output] ← busB
State summary: 1: ExtOp, MemWr, ALUSelA; ALUOp=Add; ALUSelB=11; x: PCSrc, RegDst, MemtoReg
This is the second cycle specific to the sw instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=1 IRWr=0 RegDst=x RegWr=0 ALUSelA=1 ALUSelB=11 ALUOp=Add ExtOp=1 MemtoReg=x
The address is already set up at the memory's write address port. The data is also already available at the memory's data port via busB. Therefore, all we have to do is set MemWr to 1. It is very important to keep ALUSelA, ALUSelB and ALUOp the same as in the previous cycle, the memory address calculation cycle. If any of these control signals changes during the memory write, the address also changes, because there is no register that saves the ALU output. Any change in the address during this cycle with MemWr=1 is catastrophic: we end up destroying data stored in memory by writing to the wrong location.
ALUSelA, ALUSelB and ALUOp must keep the same values as in the previous cycle; only then is WrAdr guaranteed to stay stable.
105
lw memory read cycle (fourth cycle): MemFetch
Mem Dout ← M[ALU output]
State summary: 1: ExtOp, ALUSelA, IorD; ALUOp=Add; ALUSelB=11; x: MemtoReg, PCSrc
This is the second cycle specific to the lw instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=1 MemWr=0 IRWr=0 RegDst=0 RegWr=0 ALUSelA=1 ALUSelB=11 ALUOp=Add ExtOp=1 MemtoReg=1
If, after the memory address calculation cycle, we find that we have a load, we enter the load memory access cycle. All we have to do is set the control signal IorD to 1; after the memory read access delay, the data we want is available at the memory output (Dout). Once again, RegDst must be set to zero in this cycle so that Rt is stable at the register file's write address port (Rw) before the next cycle.
RegDst=0, RegWr=0 and MemtoReg=1 make Rw and busW stable before RegWr=1.
106
lw write-back cycle (fifth cycle): lwFinish
R[rt] ← Mem Dout
State summary: 1: ALUSelA, MemtoReg, RegWr, ExtOp; ALUOp=Add; ALUSelB=11; x: PCSrc, IorD
This is the third cycle specific to the lw instruction. What are the control signal values?
Control signals: PCWr=0 PCWrCond=0 PCSrc=x BrWr=0 IorD=x MemWr=0 IRWr=0 RegDst=0 RegWr=1 ALUSelA=1 ALUSelB=11 ALUOp=Add ExtOp=1 MemtoReg=1
In this write-back cycle we write the data from memory (MemtoReg=1) into the register specified by the Rt field of the instruction.
RegDst=0 keeps Rw stable after RegWr=1 is asserted. The ALU output stays stable as long as the ALU controls do not change, which keeps Dout and busW stable.
107
Complete multi-cycle datapath for the six instructions above
Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg
Putting it all together, here it is: the multiple cycle datapath we set out to build.
So far we have shown, for each instruction, how data flows in every cycle and which control signal values each cycle requires.
The figure differs somewhat from the one in the textbook, which covers seven instructions.
The key question now is how to generate different control signal values in different cycles. That is the controller's job, so next we consider how to design the controller.
108
State transition diagram
Each clock tick moves the machine to the next state.
We have concentrated on the multiple cycle datapath so far. But by summarizing all the control signals in circles along the way, we have pretty much specified the control as a state diagram. All instructions start at the Instruction Fetch cycle and continue to the Instruction Decode / Register Fetch cycle. Once the instruction is decoded, we either go to the Branch Complete cycle to complete the branch or go to one of the following: (1) R-type execution or OR immediate execution, for R-type or OR immediate instructions; (2) the memory address calculation cycle, for load and store instructions. The rest is straightforward.
Question: how many clock cycles does each instruction take? R-type: 4, ori: 4, beq: 3, jump: 3, lw: 5, sw: 4.
Next goal: design the "state transition circuit", that is, the controller.
109
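Implementing the multi-cycle controller
Recall the single-cycle controller: the control signals stay constant throughout the execution of an instruction, so a truth table captures the relation between instructions and control signals, and the controller can be built directly from that table.
Can the multi-cycle controller be built the same way? Controlling the multi-cycle datapath is more complex: each instruction takes several cycles, and the control signals take different values in each cycle.
Two ways to describe the function of a multi-cycle controller:
Finite state machine: designed as combinational logic and implemented with hardwired circuits (PLA); the result is a hardwired controller.
Microprogram: implemented by storing the microprogram in ROM; the result is a microprogrammed controller.
                           "hardwired control"            "microprogrammed control"
Initial representation     Finite State Diagram           Microprogram
Sequencing control         Explicit Next State Function   Microprogram counter + Dispatch ROMs
Logic representation       Logic Equations                Truth Tables
Implementation technique   PLA                            ROM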
110
Review: single-cycle datapath (The Main Control)
PLA inputs: op<5> to op<0>
Decoded instruction types (AND plane): R-type, ori, lw, sw, beq, jump
Outputs (OR plane): RegWrite, ALUSrc, RegDst, MemtoReg, MemWrite, Branch, Jump, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>
The Main Control is implemented in a rather regular structure called a PLA. The row of AND gates decodes the opcode bits to decide what type of instruction we have. The row of OR gates then generates the control signals, based on whether a particular control signal needs to be asserted for a given type of instruction. For example, the OR gate in the first row says that the control signal RegWr must be asserted for R-type, OR immediate and load instructions.
111
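Describing sequencing control
Idea: the next state is determined by the clock, the current state and the opcode. Different states output different control signal values.
The control logic is a Moore machine: the output function depends only on the current state.
Block diagram: a combinational control unit takes the Opcode and the current state as inputs; its outputs are the control signals for the multicycle datapath and the Next State, which is loaded into the state register on clk.
The next state is treated just like any other control signal.
The next state is a function of the current state and the opcode.
On every clock, the current state changes to the next state.
Different control signals are output in different states.
Next goal: design the control logic.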
112
State transition table of the multi-cycle controller
Columns: current state (S3 S2 S1 S0), opcode (OP5 OP4 OP3 OP2 OP1 OP0), next state (NS3 NS2 NS1 NS0)
States 2, 3, 5, 7, 9, 11   any opcode   State0 (IFetch)
State0                     any opcode   State1 (ID/RFetch)
State1                     beq          State2
State1                     jump         State3
State1                     ori          State4 (OriExec)
State4                     any opcode   State5
State1                     R-type       State6 (RExec)
State6                     any opcode   State7
State1                     lw, sw       State8 (MemAdr)
State8                     sw           State9
State8                     lw           State10 (MemFetch)
State10                    any opcode   State11
This function can be implemented with a PLA circuit.
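The explicit next-state function of the finite state machine can be written down directly as a lookup. The sketch below assumes the state numbering used in these slides (0 = IFetch, 1 = ID/RFetch, 2 = beq, 3 = jump, 4/5 = ori, 6/7 = R-type, 8 = MemAdr, 9 = sw, 10/11 = lw); the opcode strings and function names are illustrative only.

```python
# States that finish an instruction and return to instruction fetch.
TERMINAL = {2, 3, 5, 7, 9, 11}

# Dispatch out of the decode state (State1), keyed by instruction type.
DISPATCH = {"beq": 2, "jump": 3, "ori": 4, "rtype": 6, "lw": 8, "sw": 8}


def next_state(state: int, op: str) -> int:
    """Next state as a function of the current state and the opcode."""
    if state in TERMINAL:
        return 0
    if state == 0:
        return 1
    if state == 1:
        return DISPATCH[op]
    if state == 8:                       # MemAdr: lw and sw diverge here
        return 10 if op == "lw" else 9
    return state + 1                     # 4->5, 6->7, 10->11


def trace(op: str) -> list:
    """States visited by one instruction, starting at instruction fetch."""
    states = [0]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == 0:
            return states
        states.append(nxt)


for op in DISPATCH:
    print(op, trace(op), "cycles:", len(trace(op)))
```

The lengths of the traces reproduce the cycle counts quoted earlier: 4 for R-type and ori, 3 for beq and jump, 5 for lw, 4 for sw.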
113
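Combinational control unit implemented with a PLA (hardwired approach)
Upper left: the circuit that determines the next state from the opcode and the current state
Lower right: the circuit that determines the control signals from the current state
Can you find the errors in the figure? Three dots are in the wrong positions.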
114
Combinational control unit implemented with a PLA (an alternative layout)
Inputs: Op5 to Op0 and the current state S3 to S0
Decoded terms: beq, j, ori, R, lw, sw, lw, sw
Opcodes: R= beq=000100 lw= sw= ori= j=
State encoding: 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111, 8 = 1000, 9 = 1001, 10 = 1010, 11 = 1011
Outputs: NS3 to NS0, PCWr, IorD, …, RegDst, …
115
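Summary of Lecture 3
Cost comparison of single-cycle and multi-cycle CPUs:
In a single-cycle CPU a functional unit cannot be reused within an instruction; in a multi-cycle CPU it can, which makes the multi-cycle design cheaper in this respect.
In a single-cycle CPU results are saved directly in the PC, the register file and memory; a multi-cycle CPU needs extra temporary registers for intermediate results, which makes it more expensive in this respect.
Performance comparison of single-cycle and multi-cycle CPUs:
The single-cycle CPU has a CPI of 1, but its clock period is the execution time of the slowest instruction, load.
What is the CPI of the multi-cycle CPU? How long is its clock period?
Assume a program with 22% loads, 11% stores, 49% R-type, 16% branches and 2% jumps, and one clock cycle per state. What is the CPI?
Clock cycles per instruction type: Load: 5; Store: 4; R-type: 4; Branch: 3; Jump: 3
CPI = CPU clock cycles / instruction count = Σ(count_i × CPI_i) / instruction count = Σ(count_i / instruction count) × CPI_i
CPI = 0.22×5 + 0.11×4 + 0.49×4 + 0.16×3 + 0.02×3 = 4.04
If the single-cycle clock period is taken as 1, the multi-cycle clock period is roughly 1/5 of it. The total time per instruction is then about 4.04 × 1/5 ≈ 0.8 for the multi-cycle CPU, against 1 × 1 = 1 for the single-cycle CPU.
So the multi-cycle CPU is more efficient than the single-cycle CPU.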
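The CPI arithmetic above can be checked with a few lines of Python. The instruction mix and cycle counts are the ones given on the slide; the 1/5 clock ratio is the slide's own rough assumption, and the variable names are illustrative.

```python
# (fraction of instructions, cycles per instruction) for each class
MIX = {
    "load":   (0.22, 5),
    "store":  (0.11, 4),
    "rtype":  (0.49, 4),
    "branch": (0.16, 3),
    "jump":   (0.02, 3),
}


def average_cpi(mix: dict) -> float:
    """Weighted average: sum over classes of fraction_i * CPI_i."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


cpi = average_cpi(MIX)
multi_cycle_time = cpi * (1 / 5)   # multi-cycle clock assumed to be 1/5 as long
single_cycle_time = 1 * 1          # CPI of 1, clock period of 1

print(f"CPI = {cpi:.2f}")
print(f"relative time: multi-cycle {multi_cycle_time:.3f}, single-cycle {single_cycle_time}")
```

The result depends heavily on the assumed clock ratio; if the slowest multi-cycle stage were longer than 1/5 of the single-cycle period, the advantage would shrink.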
116
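Lecture 4: Microprogramming and exception handling
Contents
Advantages and disadvantages of hardwired controller design
Basic idea of a microprogrammed controller
Microprogram, microinstruction, micro-operation and microcommand: concepts and relationships
Microinstruction format design
Micro-operation code field
Horizontal microprograms: no-decoding method, direct field encoding, indirect field encoding
Vertical microprograms: vertical encoding
Determining the address of the next microinstruction
Increment method (counter method)
Next-address-field method
Microprogrammed controller design for a subset of MIPS instructions
Why processor design must consider exception handling
The concepts of "exception" and "interrupt"
How to add exception-handling components to the datapath
How to control the exception-handling components in the datapath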
117
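Hardwired design and microprogrammed design
Characteristics of hardwired design:
Advantage: fast; suited to simple or regular instruction sets such as the MIPS instruction set.
Disadvantage: it is one huge multi-input, multi-output logic network. For a complex instruction set the structure becomes unwieldy and hard to implement, hard to modify and maintain, and inflexible. It may not even be describable as a finite state machine.
One way to simplify controller design: microprogramming.
Basic idea of a microprogrammed controller:
Write a microprogram for each instruction, in the manner of ordinary programming.
Each microprogram consists of several microinstructions, and each microinstruction contains several microcommands (a microinstruction corresponds to a state; a microcommand is a control signal in that state).
The microprograms of all instructions are kept in a read-only memory. To execute an instruction, the microinstructions of its microprogram are fetched one by one and decoded into microcommands, which are the control signals.
This read-only memory is called the control store (Control Storage), CS for short.
Characteristics of microprogrammed design: regular, maintainable and flexible, but slow.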
118
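Basic structure of a microprogrammed controller
Input: instruction, condition codes
Output: control signals (microcommands)
Core: the control store CS
µPC: points to the location in CS of the microinstruction to be executed
µIR: the microinstruction being executed
One microinstruction is executed per clock.
The address of the first microinstruction of a microprogram is produced by the start-address generator.
Sequential execution: µPC+1
Branching: the branch-control field specifies which condition codes to test, and the branch-address generator modifies µPC according to them.
Microprograms fixed in read-only memory were originally called firmware, meaning a hardware component implemented in software. Today firmware is commonly understood as "software fixed in ROM".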
119
Correspondence between states and microprograms
Each instruction is implemented by one microprogram.
A microprogram consists of several microinstructions; each state corresponds to one microinstruction.
Instruction fetch and decode are implemented by a dedicated microprogram, called the fetch microprogram.
By grouping all the control signals in circles along the way, we have pretty much specified the control as a state diagram. All instructions start at the Instruction Fetch cycle and continue to the Instruction Decode / Register Fetch cycle. Once the instruction is decoded, we either go to the Branch Complete cycle to complete the branch or go to one of the following: (1) R-type execution or OR immediate execution, for R-type or OR immediate instructions; (2) the memory address calculation cycle, for load and store instructions. The rest is straightforward.
Question: how many microinstructions does the fetch microprogram contain? 2. How many microinstructions does the lw instruction have? 3.
120
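Relationship among microprogram, microinstruction, microcommand and micro-operation
Executing an instruction is turned into executing a microprogram.
A microprogram is a sequence of microinstructions.
Each microinstruction is a 0/1 string that contains several microcommands (that is, control signals).
Each microcommand controls an action of the datapath.
Hierarchy: one machine instruction; one microprogram; microinstruction 1, microinstruction 2, ..., microinstruction n; microcommand 1, microcommand 2, ..., microcommand m; micro-operations
What must be solved to control program execution? (1) Encoding and decoding of instructions. (2) Where to fetch the next instruction.
Microprogram execution must solve two corresponding problems: (1) How microcommands are encoded in a microinstruction. (2) Where the next microinstruction is.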
121
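First problem: microinstruction format design
A microinstruction contains several microcommands, the address of the next microinstruction (optional) and a constant (optional).
Microinstruction format: µOP | µADD | constant
µOP: micro-operation code field, which produces the microcommands. µADD: micro-address field, which produces the address of the next microinstruction.
The style of the microinstruction format depends on how the micro-operation code is encoded (microcommand: control signal).
Encoding methods for the micro-operation code:
No-decoding method (direct control)
Direct field encoding
Indirect field encoding
Minimal (shortest, vertical) encoding
The first three give the horizontal microinstruction style; the last gives the vertical microinstruction style. Which encoding method do machine instructions use?
Horizontal microinstructions
Basic idea: pack as many compatible microcommands as possible into one microinstruction.
Advantages: short microprograms and high parallelism; suited to higher-speed designs.
Disadvantages: long microinstructions, low utilization of the encoding space, and microprograms that are hard to write.
Vertical microinstructions
Basic idea: one microinstruction controls only one or two microcommands.
Advantages: short microinstructions and high encoding efficiency; the format resembles machine instructions, so microprograms are easy to write.
Disadvantages: long microprograms; with only one or two microcommands per microinstruction there is no parallelism, so execution is slow.
Vertical microinstructions are oriented toward describing algorithms; horizontal microinstructions are oriented toward describing the internal control logic.
Next, the second problem: specifying the next microinstruction.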
122
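No-decoding method (direct control)
Basic idea: one bit per microcommand (control signal); no decoding is needed.
A two-valued microcommand (0/1) takes one bit anyway, so no bits are added.
A multi-valued microcommand takes more bits because it is not encoded. For example:
4-to-1 MUX: 2 bits if encoded, 4 bits if not
ALUCtrl: 4 bits if encoded, 16 bits if not
Advantages: strong parallel control and no decoding, so execution is fast. Microprograms are short.
Disadvantages: the microinstruction word is very long, possibly several hundred bits. The encoding space is poorly used (only a few of those hundreds of bits may be 1).
When microprogramming was first proposed, the no-decoding method was the one used.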
123
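Wilkes microprogrammed controller
Diagram labels: IR, micro-address register II, micro-address register I, gate G, micro-address decoder, next microinstruction address, clock, condition signals, control signals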
124
Micro-operation code for the multi-cycle datapath
The length of the control word (that is, the microinstruction) equals the total number of bits of the control signals (microcommands).
Control signals in the datapath: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg
With the no-decoding method, the micro-operation code has one field per control signal, such as PCWr, IorD, MemWr, PCSrc and BrWr.
125
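Direct field encoding
Basic idea:
Divide the microinstruction into several fields and encode the microcommands of each field.
Put mutually exclusive microcommands in the same field and compatible microcommands in different fields.
The maximum number of microcommands that one microinstruction can issue at once equals the number of fields.
Advantages:
Fairly strong parallel control and fairly high speed.
Short microinstructions, compressed to 1/2 to 1/3 of the no-decoding length, which saves control store capacity.
Disadvantage:
Decoding circuits are added and cost some time. Because each field has few bits after partitioning, decoding has little effect on microinstruction execution speed.
Compatible micro-operations: micro-operations that can be performed at the same time.
Mutually exclusive micro-operations: micro-operations that cannot be performed at the same time, for example ALU operations (add/sub/or/...) or memory operations (read instruction / read data / write data).
What other mutually exclusive micro-operations can you think of? The input select signals of a multiplexer, and so on.
Because of these properties, this method is used by most microprogrammed computers.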
126
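Example of direct control and direct field encoding
Example 1: Assume the single-bus datapath of Figures 6.9 and 6.10 has four general-purpose registers R0, R1, R2 and R3 and 16 ALU operations. Main memory and the CPU communicate asynchronously, and memory accesses are controlled by the Read and Write signals. At the end of every instruction a common operation performs end-of-instruction processing (for example, checking for an external interrupt request), controlled by the signal End. Task: write the micro-operation code format for direct control and for direct field encoding.
Transfer signals between registers and the bus, three groups, 17 in total:
Rin: R0in, R1in, R2in, R3in, Yin, PCin, IRin
Rout: R0out, R1out, R2out, R3out, Zout, PCout, MARout, MDRout
MRin: MARin, MDRin
ALU operation types, 16: add/sub/or/and/xor/…/mov
ALU carry-in signal, 1: 1→C0
Clear signal for temporary register Y, 1: ClearY
Memory signals, 3: Read, Write, WMFC
End signal, 1: End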
127
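Example of direct control and direct field encoding (continued)
Direct control: the length of µOP equals the total number of control signals.
The ALU operation needs 4 control signals, not 16; this follows from the structure and function of the ALU (why?).
In total there are 17 + 4 + 1 + 1 + 3 + 1 = 27 control signals (microcommands).
The micro-operation code field is 27 bits. A bit set to 1 asserts the corresponding microcommand; otherwise the microcommand is inactive.
Direct field encoding (p. 193, Table 6.8)
Which micro-operations are mutually exclusive?
Among the Rout signals: only one register can drive the bus at any time.
Among the ALU operation control signals: the ALU can perform only one operation at a time.
Memory read/write signals: a read and a write cannot occur together, and some steps have neither (no action).
How should the signals be grouped?
By mutual exclusion: the three exclusive groups above go into three different fields.
Signals that could in principle act together but never occur together can also share a field, as within Rin and within MRin (signals from these two groups may occur at the same time as each other).
The rest are controlled directly (no encoding), such as 1→C0, ClearY, WMFC and End.
There are 9 groups in total and µOP is only 19 bits, 8 bits fewer than direct control.
Five of the 9 groups are encoded and must be decoded when the microinstruction executes.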
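The bit counts in this example can be reproduced with a short calculation. Table 6.8 is not reproduced here, so the field grouping below is an assumption chosen to be consistent with the totals stated on the slide (27 bits, 19 bits, 9 groups, 5 of them encoded); in particular, reserving one "no action" code in each encoded register or memory field is my assumption. The helper name `field_width` is hypothetical.

```python
from math import ceil, log2


def field_width(n_commands: int, idle_code: bool = True) -> int:
    """Bits needed to encode n mutually exclusive microcommands.

    idle_code=True reserves one extra code meaning "no microcommand".
    """
    codes = n_commands + (1 if idle_code else 0)
    return ceil(log2(codes))


# Direct control: 17 transfer signals, 4 ALU control bits, carry, ClearY,
# 3 memory signals and End.
DIRECT_BITS = 17 + 4 + 1 + 1 + 3 + 1

# Direct field encoding under the assumed grouping.
FIELDS = {
    "Rin":   field_width(7),                    # R0in..R3in, Yin, PCin, IRin
    "Rout":  field_width(8),                    # R0out..R3out, Zout, PCout, MARout, MDRout
    "MRin":  field_width(2),                    # MARin, MDRin
    "ALUop": field_width(16, idle_code=False),  # 16 operations
    "Mem":   field_width(2),                    # Read, Write
    "1->C0": 1, "ClearY": 1, "WMFC": 1, "End": 1,   # directly controlled bits
}
ENCODED_BITS = sum(FIELDS.values())

print(DIRECT_BITS, ENCODED_BITS, DIRECT_BITS - ENCODED_BITS)
```

The same helper gives the figures quoted for the no-decoding method: `field_width(4, idle_code=False)` is 2 for a 4-to-1 MUX, against 4 unencoded select lines.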
128
Example of direct field encoding (p. 194, Table 6.9 / p. 195, Table 6.10)
The signals can be divided into 8 groups: ALUop (2 bits), ALUSelA (1 bit), ALUSelB (2 bits), RegOP (3 bits: RegWr/RegDst/MemtoReg), MemOP (2 bits: MemWr/IorD/IRWr), ExtOp, BrWr, PCWrOP (2 bits: PCSource/PCWr/PCWrCond)
129
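Indirect field encoding
Basic idea:
Starting from direct field encoding, compress the microinstruction further.
The encoding of one field is interpreted with the help of the encoding of another field or a flag bit.
That is, one microcommand field can stand for several groups of microcommands; which group it stands for is determined by another dedicated field.
Characteristics:
The microinstruction word can be shortened further, saving control store capacity (of little practical benefit).
The decoding circuits are complex and the time overhead is large.
Because of these properties, it is used only in limited, local cases.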
130
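Minimal (shortest, vertical) encoding
Basic idea:
Borrow the idea of instruction encoding (each instruction produces one operation): each microinstruction contains only one microcommand. In other words, all microcommands are fully encoded.
Microinstructions encoded this way are called vertical microinstructions, and a microprogram composed of them is called a vertical microprogram.
Characteristics:
Gives the shortest microinstruction word.
Microprograms are regular, intuitive and easy to write.
But parallelism is poor, execution is slow and microprograms are long.
Used mainly in controllers with two levels of microprogramming: vertical microprograms interpret the instructions, and horizontal microprograms interpret the vertical microinstructions. The horizontal microprogram is then called a nanoprogram.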
131
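Second problem: determining the next micro-address
What is sequencing control of a microprogram?
It is how the address of the next microinstruction is generated once the current microinstruction has finished.
How is the execution order of a microprogram controlled?
By specifying, explicitly or implicitly in the current microinstruction, the control store address of the next microinstruction.
There are two ways to generate microinstruction addresses:
Increment (counter) method: the address of the next microinstruction is implicit in the microprogram counter µPC.
Next-address-field method: the current microinstruction explicitly specifies the address of the next microinstruction.
There are three cases when selecting the next microinstruction:
First microinstruction: when an instruction finishes, the next instruction is fetched; once it has been fetched, control must transfer to the first microinstruction of that instruction's microprogram.
Sequential execution: within the microprogram of an instruction, the next microinstruction is fetched in sequence.
Branching: when execution must go to different microinstructions depending on a condition, the next microinstruction is selected according to the inputs of the control unit.
One more case:
Start address of the fetch microprogram: every instruction first executes the fetch microprogram.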
132
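Controller structures for the different micro-address generation methods
Diagram labels: instruction, branch control; one diagram for the increment (counter) method and one for the next-address-field method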
133
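Design of a microprogrammed controller
Note: relative to p. 198 of the textbook, the BrCtr codes 01 and 10 are swapped.
Example: branches are implemented with a "branch control" field, and the start addresses of the instruction microprograms are held in ROM. Implement the microprogram of Table 6.10 with the counter method and with the next-address-field method, and draw the structure of the microprogrammed controller.
Table columns: state number (micro-address) 0000 to 1011, branch control field, next-address field
First scheme: BrCtr = 00: fetch start address; 01: ROM1; 10: ROM2; 11: µPC+1
Second scheme: BrCtr = 00: next-address field; 01: modified by op3; 10: ROM1. When op3 is 1, 1010 is corrected to 1001.
Question: which is the increment method and which is the next-address-field method?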
134
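Finite state machine of the multi-cycle CPU
Dispatch 1 (ROM 1): columns Op, Name, State; rows Rtype, jmp, beq, ori, lw, sw; state value shown: 1000
Dispatch 2 (ROM 2): rows lw, sw; state value shown: 1001
In dispatch 2, op3 is the bit that distinguishes lw from sw. When op3 is 1, 1010 is corrected to 1001, that is, the last two bits are inverted.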
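A counter-based microsequencer with two dispatch ROMs can be sketched as follows. The BrCtr encoding (00 fetch start, 01 ROM1, 10 ROM2, 11 µPC+1) is the one given on the previous slide. Only the entries 1000 and 1001 survive in the dispatch tables here, so the remaining ROM contents are reconstructed from the state numbering of the state transition table and should be read as an assumption; all identifiers are hypothetical.

```python
FETCH, USE_ROM1, USE_ROM2, INCREMENT = 0b00, 0b01, 0b10, 0b11

# Branch-control field of the microinstruction stored at each micro-address.
BR_CTR = {
    0: INCREMENT, 1: USE_ROM1,        # fetch, then decode and dispatch
    2: FETCH, 3: FETCH,               # beq, jump
    4: INCREMENT, 5: FETCH,           # ori execute, ori finish
    6: INCREMENT, 7: FETCH,           # R-type execute, R-type finish
    8: USE_ROM2, 9: FETCH,            # address calculation, sw finish
    10: INCREMENT, 11: FETCH,         # lw memory read, lw write-back
}

# Dispatch ROM contents (assumed, see the lead-in).
ROM1 = {"rtype": 0b0110, "jmp": 0b0011, "beq": 0b0010,
        "ori": 0b0100, "lw": 0b1000, "sw": 0b1000}
ROM2 = {"lw": 0b1010, "sw": 0b1001}


def next_upc(upc: int, op: str) -> int:
    """Select the next micro-address from the branch-control field."""
    ctl = BR_CTR[upc]
    if ctl == FETCH:
        return 0
    if ctl == USE_ROM1:
        return ROM1[op]
    if ctl == USE_ROM2:
        return ROM2[op]
    return upc + 1


def microprogram(op: str) -> list:
    """Micro-addresses executed for one instruction, fetch included."""
    addrs = [0]
    while True:
        nxt = next_upc(addrs[-1], op)
        if nxt == 0:
            return addrs
        addrs.append(nxt)


print(microprogram("lw"))
print(microprogram("sw"))
```

Compared with an explicit next-state function, only the dispatch points need table lookups; every other step is either µPC+1 or a return to fetch, which is why sequential states are given adjacent micro-addresses.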
135
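Interpretive execution of microinstruction words
Main memory (MM): user programs and data (ADD, SUB, AND, ..., DATA); modifiable
Each instruction corresponds to a microprogram made up of microinstructions.
CPU: the execution unit and the control store, which holds the ADD, SUB and AND microprograms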
136
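Handling exceptions and interrupts
During program execution the CPU encounters special situations that cause the running program to be "interrupted".
Two kinds of events interrupt program execution:
Internal "exceptions": unexpected or special events that occur inside the CPU
By cause, they divide into hardware-fault interrupts and program interrupts:
Hardware-fault interrupts: power failure, hardware circuit faults and so on
Program interrupts: "exceptions" raised while executing an instruction, such as overflow, page fault, out-of-bounds access, privilege violation, illegal instruction, division by zero, stack overflow, access timeout, breakpoint, single step and system call
By handling, they divide into faults, traps and aborts:
Fault: an exceptional event caused by executing an instruction, such as overflow, page fault, stack overflow or access timeout
Trap: a prearranged event, such as single-step tracing or a system call (executing a supervisor-call instruction); a voluntary interrupt
Abort: a hardware-fault event; the machine "aborts" and an interrupt service routine is invoked to restart the operating system
External "interrupts": special events outside the CPU that request service from the CPU through an interrupt-request signal, such as the real-time clock, the console, printer out of paper, device ready, sampling timer expired or DMA transfer complete
(Interrupts are an I/O mechanism, so the concepts related to interrupts are covered in Chapter 9.)
Think: after a trap is handled, which instruction does execution return to? The next instruction.
Think: which faults allow execution to continue after recovery, and which force the current process to terminate?
Page faults and the like: after recovery, execution returns to the faulting instruction and re-executes it.
Overflow, division by zero, illegal operation, memory protection error and the like: the current process is terminated.
This chapter mainly covers how to add detection and handling logic for program exceptions to the datapath.
Different architectures and textbooks define "exception" and "interrupt" differently; keep this in mind when reading.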
137
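Example: the 8086/8088 interrupt system
Everything is called an "interrupt": internal interrupts (internal exceptions) and external interrupts.
Internal interrupts: generated by the CPU itself rather than requested through an interrupt-request line; all are non-maskable.
Instruction-raised exceptions: exceptions that occur under specific conditions after the CPU executes a designated instruction.
INTO (overflow): after an arithmetic instruction, if overflow has occurred, a type 4 interrupt is generated.
INT n (user defined): the second byte of the instruction gives a type number (n = 0 to 255). n = 3 (INT 3) sets a breakpoint; executing it automatically generates a type 3 interrupt.
Processor-detected exceptions: exceptions raised while the CPU executes an instruction, such as divide error, invalid opcode, page fault and single-step debugging. For example:
Divide error: a zero divisor or quotient overflow generates a type 0 interrupt.
Single step: when the trap flag TF=1 and interrupts are enabled (IF=1), a type 1 interrupt is generated automatically after every instruction.
External interrupts: delivered through the interrupt-request lines INTR and NMI.
INTR: maskable interrupt (raised by peripheral interrupt sources).
NMI: non-maskable interrupt (important or urgent hardware faults); a type 2 interrupt.
Every event is assigned an "interrupt type number".
Every interrupt has a corresponding "interrupt service routine".
The entry address of the interrupt service routine can be found from the interrupt type number.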
138
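The 8086/8088 interrupt vector table
The interrupt vector table is also called the interrupt entry address table (or exception table). It occupies 0000H to 03FFH and has 256 entries of four bytes each (CS:IP). Vector address = interrupt type number × 4.
Example 1: the interrupt type number of divide error is 0, so its vector address is 0 × 4 = 0.
Example 2: the interrupt type number of NMI is 2, so its vector address is 2 × 4 = 8.
Table layout:
000~003H   divide error   CS:IP
004~007H   single step    CS:IP
008~00BH   NMI            CS:IP
...
3FC~3FFH                  CS:IP
Each entry of the interrupt vector table (exception table) is the entry address of the corresponding interrupt service routine and is called an interrupt vector (Interrupt Vector).
The start address of the interrupt vector table is held in an exception-table base register.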
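The vector address rule (type number × 4, four bytes per entry) is easy to check in code. The function names below are hypothetical; the table is assumed to start at address 0, as in the layout shown on this slide.

```python
def vector_address(int_type: int) -> int:
    """Address of the 4-byte CS:IP entry for an interrupt type number."""
    if not 0 <= int_type <= 255:
        raise ValueError("interrupt type number must be in 0..255")
    return int_type * 4


def vector_range(int_type: int) -> tuple:
    """First and last byte address occupied by the entry."""
    base = vector_address(int_type)
    return base, base + 3


for name, n in [("divide error", 0), ("single step", 1), ("NMI", 2), ("last entry", 255)]:
    first, last = vector_range(n)
    print(f"{name:12s} type {n:3d}: {first:03X}H to {last:03X}H")
```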
139
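Exception handling mechanism in the processor
When an exception is detected, the processor must perform the following basic steps:
(1) Disable interrupts: put the processor in the "interrupts disabled" state so that a new exception (or interrupt) cannot destroy the return address and the saved context.
(2) Save the return address and program status: save them on the stack or in special registers.
PC → stack or EPC (a register dedicated to holding the return address)
PSWR → stack or EPSWR (a register dedicated to saving the program status)
(PSW, Program Status Word: includes condition codes, interrupt codes, status bits and so on. PSWR, the PSW register: holds the program status word, for example FLAGS on x86.)
(3) Identify the exception: there are two approaches, software identification and hardware identification (vectored interrupts).
Software identification (used by MIPS): an exception status register (the Cause register in MIPS) records the cause of the exception. The operating system uses a single exception handler that polls the exception status register in priority order to identify the event. (For example, MIPS has a dedicated exception handler at kernel address 0x , which determines the specific cause and then transfers to the corresponding handler code in the kernel.)
Hardware identification (vectored interrupts, used by 80x86): dedicated hardware polling circuits identify the exception in priority order and produce an "interrupt type number", which is used to read the entry address of the corresponding interrupt service routine from the interrupt vector table.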
141
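MIPS datapath design with exception handling
MIPS identifies the exception source in software (a dedicated exception-dispatch routine provided by the operating system).
Two registers are added to the datapath:
EPC: 32 bits; holds the return address (the address of the instruction to return to after the exception is handled). The value written to EPC may be the address of the instruction being executed (for faults) or of the next instruction (for traps and interrupts). In the first case the PC minus 4 is written to EPC; in the second the PC is written to EPC directly.
Cause: 32 bits (some bits are not yet used); records the cause of the exception.
Assume two exception types are handled:
Undefined instruction (Cause=0)
Arithmetic overflow (Cause=1)
Write-enable control signals are needed for the two registers:
EPCWr: asserted when the return address is saved, so that it is written into EPC
CauseWr: asserted when the processor detects an exception (such as an illegal instruction or overflow), so that the exception type is written into the Cause register
A control signal IntCause is needed to select the correct value to write into Cause.
The entry address of the exception-dispatch routine (0x  in MIPS) must be written into the PC. This can be done by adding one more input, with value 0x , to the multiplexer controlled by PCSource.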
142
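Datapath with exception handling
Write-enable control signals are needed for the two added registers:
EPCWr: asserted when the return address is saved, so that it is written into EPC
CauseWr: asserted when the processor detects an exception (such as an illegal instruction or overflow), so that the exception type is written into the Cause register
A control signal IntCause is needed to select the correct value to write into Cause.
The entry address of the exception-dispatch routine (0x  in MIPS) must be written into the PC. One more input, with value 0x , can be added to the multiplexer controlled by PCSource.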
143
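Controller design with exception handling
Add exception-handling states to the finite state machine, one state per exception type.
Each exception-handling state must provide the following basic control:
Set the Cause register.
Compute the PC value of the return address (PC-4) and write it to EPC.
Write the entry address of the exception-dispatch routine to the PC.
Clear the interrupt-enable bit (disable interrupts).
Assume the datapath handles two exceptions:
Undefined instruction (Cause=0): state 12
Arithmetic overflow (Cause=1): state 13
These two exception-handling states are added to the original state transition diagram.
How are the two exceptions detected?
Undefined instruction: the instruction decoder finds that the op field holds an undefined encoding.
Arithmetic overflow: Overflow at the ALU output is 1 after an R-type instruction executes.
State 12, undefined-instruction exception (UndefInstr): ALUop=Sub, ALUSelA=0, ALUSelB=01, IntCause=0, CauseWrite=1, EPCWrite=1, PCSrc=11, PCWrite=1
State 13, overflow exception (Overflow): ALUop=Sub, ALUSelA=0, ALUSelB=01, IntCause=1, CauseWrite=1, EPCWrite=1, PCSrc=11, PCWrite=1
Note: the 7 instructions need 12 states in total, states 0 to 11.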
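The register updates performed in an exception state can be sketched as below. This is a behavioral illustration only: the class `CpuState`, the function `take_exception` and the parameter `handler_addr` are hypothetical, and the handler address is passed in as a parameter because its value is not given in these slides.

```python
from dataclasses import dataclass

UNDEFINED_INSTRUCTION = 0   # Cause value for state 12
OVERFLOW = 1                # Cause value for state 13


@dataclass
class CpuState:
    pc: int
    epc: int = 0
    cause: int = 0
    int_enable: bool = True


def take_exception(cpu: CpuState, int_cause: int, handler_addr: int) -> CpuState:
    """Apply the actions of one exception state to the CPU registers."""
    cpu.cause = int_cause                    # CauseWrite=1, IntCause selects the value
    cpu.epc = (cpu.pc - 4) & 0xFFFFFFFF      # ALUop=Sub gives PC-4; EPCWrite=1
    cpu.pc = handler_addr                    # PCSrc=11, PCWrite=1
    cpu.int_enable = False                   # disable interrupts
    return cpu


cpu = take_exception(CpuState(pc=0x1008), OVERFLOW, handler_addr=0x4000)
print(hex(cpu.epc), cpu.cause, hex(cpu.pc), cpu.int_enable)
```

Here `0x4000` is just a placeholder chosen for the demonstration, not the real MIPS handler address.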
144
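Finite state transition diagram with exception handling added
"Fault" exceptions are detected during instruction execution. How are "trap" exceptions detected? By instruction decoding (system call) or by checking the condition codes (single step).
Question: when is a page fault detected? During address translation in the MMU.
Question: can interrupts be detected during instruction execution, as exceptions are? Interrupts occur at random and are not synchronized with instruction execution. They cannot be detected during instruction execution and are always checked at the end of each instruction.
Question: why can an interrupt not be serviced in the middle of an instruction? Because execution cannot resume from the middle of an instruction.
By grouping all the control signals in circles along the way, we have pretty much specified the control as a state diagram. All instructions start at the Instruction Fetch cycle and continue to the Instruction Decode / Register Fetch cycle. Once the instruction is decoded, we either go to the Branch Complete cycle to complete the branch or go to one of the following: (1) R-type execution or OR immediate execution, for R-type or OR immediate instructions; (2) the memory address calculation cycle, for load and store instructions. The rest is straightforward.
The added states form the exception response cycle.
The controller with exception handling can be implemented from this finite state machine.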
145
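TLB miss handling and page fault handling
TLB miss handling (may be done in hardware, or by raising a "TLB miss" exception that software handles)
A TLB miss means one of two things:
The page is in memory: the page table entry in main memory only needs to be loaded into the TLB.
The page is not in memory (page fault): the OS brings the page in from disk and updates the page table in main memory and the TLB.
Page fault handling
A page fault occurs when the "valid" bit of the page table entry in main memory is 0.
A page fault is a "fault" exception and is handled as follows (MIPS exception handling):
Disable interrupts (clear the interrupt-enable bit).
Set the corresponding bit of the Cause register to 1.
Write the address of the faulting instruction (PC minus 4) to EPC.
Write 0x  (the entry of the exception-dispatch routine) to the PC.
Run the OS exception-dispatch routine, which examines the corresponding bit of the Cause register, finds that a page fault occurred, and transfers to the page fault handler.
A page fault must be caught within the clock cycle of the memory operation that misses, and control must transfer to exception handling in the next clock; otherwise errors occur.
Example: lw $1, 0($1). If the exception is not caught in time and $1 is changed, then when the instruction is re-executed the memory address it reads has changed, which is a serious error.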
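The `lw $1, 0($1)` hazard can be illustrated with a toy model. Everything here is hypothetical: memory is a dictionary, `resident` is the set of addresses whose page is present, and a late-caught fault is modeled by letting the faulting load overwrite `$1` with a junk value of 0 before the handler runs.

```python
def lw_self(regs: dict, memory: dict, resident: set, caught_in_time: bool) -> int:
    """Model lw $1, 0($1) when the first access may page-fault."""
    addr = regs[1]
    if addr not in resident:          # page fault on the memory access
        if not caught_in_time:
            regs[1] = 0               # destination already clobbered with junk
        resident.add(addr)            # OS brings the faulting page into memory
        addr = regs[1]                # the instruction at EPC is re-executed
    regs[1] = memory[addr]
    return regs[1]


memory = {0x100: 42, 0: 7}

ok = lw_self({1: 0x100}, memory, resident={0}, caught_in_time=True)
bad = lw_self({1: 0x100}, memory, resident={0}, caught_in_time=False)
print(ok, bad)   # the second load reads from the wrong address
```

When the fault is caught in the faulting cycle, `$1` still holds the original base address on re-execution, so the retry reads the intended word.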
146
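Case study: implementing the IA-32 processor
Question: is the IA-32 processor better suited to a single-cycle or a multi-cycle implementation?
Single-cycle:
Every instruction takes as long as the most complex instruction (low execution efficiency).
Functional units cannot be reused, so an instruction with several complex addressing modes may need many ALUs (high cost).
Multi-cycle:
Instructions can take different amounts of time: 3 to 4 clocks for simple instructions, dozens of clocks for complex ones (high execution efficiency).
Functional units can be reused within one instruction, which is a great advantage for instructions with several complex addressing modes (low cost).
Question: is the IA-32 processor better suited to a hardwired or a microprogrammed controller?
Hardwired control: fast, but cannot implement complex instructions.
Microprogrammed control: implements complex instructions easily, but is slow.
Starting with the 80486, a compromise is used:
Simple instructions (which complete in one pass through the datapath) use hardwired control.
Complex instructions use microcoded control, so no complex datapath has to be built for them.
The multi-cycle datapath and the microprogrammed controller provide an implementation framework for the IA-32 instruction set.
The next chapter describes in detail the pipelined implementation of the Pentium 4 processor (an IA-32 design).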
147
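Summary of this lecture
A hardwired controller is fast and suits datapaths for simple, regular instruction sets. Its drawbacks are a long and tedious design cycle, inflexibility, and difficulty in modifying, adding or removing instructions.
Microprogrammed controller design borrows the idea of programming: the states involved in each cycle are kept in a read-only memory, and when an instruction executes, its states are fetched in order and converted into control signals. Advantages: simpler design, flexible, easy to modify and maintain. Disadvantage: slow.
Microinstruction format design
The micro-operation code field mostly uses direct field encoding, which groups mutually exclusive microcommands into the same field for encoding. This shortens the microinstruction word while preserving parallelism, and it prevents two microcommands that cannot execute together from appearing in the same cycle.
The address of the next microinstruction can be determined with the counter (increment) method or the next-address-field method. Both must handle branching; a "branch control" signal can be added to control the sequencing of micro-addresses.
Exceptions change the flow of program execution, so processor design must consider exception handling.
Adding exception handling to the datapath requires saving the return address and the exception cause, and transferring control to the start address of the exception handler.
In a finite state machine with exceptions, each exception has its own state and a detection condition for entering that state.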
148
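Chapter summary 1
Main functions of the CPU
Execute instructions over and over.
If an exceptional condition is found while executing an instruction, transfer to exception handling.
Periodically check for DMA requests; if there is one, release the bus.
At the end of every instruction, check for interrupt requests; if there is one, respond to the interrupt.
Internal structure of the CPU
It consists of the datapath (Datapath) and the control unit (Control unit).
The datapath contains combinational logic elements and state elements that store information.
Combinational logic elements process data, for example adders, the ALU, extenders (zero extension or sign extension), multiplexers and the read circuitry of state elements.
State elements include flip-flops, registers, register files and data/instruction memories; they hold the intermediate state or final results of instruction execution.
The control unit decodes the fetched instruction and combines the result with the condition codes produced by execution or the current machine state and with timing signals (the clock) to generate the control signals that control the datapath.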
149
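Chapter summary 2
Registers in the CPU
User-visible registers (usable by user programs)
General-purpose registers: hold addresses or data and must be named explicitly in the instruction.
Special-purpose registers: hold specific addresses or data and need not be named explicitly in the instruction.
Data registers: dedicated to holding data; may be general-purpose or special-purpose.
Address registers: dedicated to holding addresses; may be general-purpose or special-purpose. Examples: segment pointers, index registers, base registers, the stack pointer and the frame pointer.
Flag (condition code) register: partly visible. It is set by the CPU according to the result of instruction execution; only some bits can be read, implicitly, and user (non-kernel) programs cannot change it.
Control and status registers (not usable by user programs)
Program counter PC
Instruction register IR
Memory address register MAR
Memory buffer (data) register MBR / MDR
Program status word register PSWR
Temporary registers: hold temporary information during instruction execution.
Other registers: for example the process control block pointer, the system stack pointer and the page table pointer.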
150
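Chapter summary 3
Instruction execution
Fetch, decode, read operands, compute, store the result, check for interrupts.
Instruction cycle: the time to fetch and execute one instruction; it consists of several clock cycles.
Clock cycle: the period of the clock signal that synchronizes signals in the CPU; it is the smallest unit of time in the CPU.
(Note: in traditional processors an instruction cycle consists of several machine cycles. A machine cycle is usually the time of one bus operation that accesses main memory or I/O, and it consists of several clocks.)
Timing of the datapath
Modern computers all use a clock signal for timing.
When the active clock edge arrives, the state elements in the datapath can start to be written.
A state element that is updated every cycle needs no write-enable control signal; otherwise a write-enable signal is needed so that information is written only when required.
Flow of information in the datapath
The instruction fetch and instruction decode phases are the same for every instruction.
Instructions have different functions, so the components and paths they use in the datapath may differ.
The flow of data through the datapath is determined by the control signals.
The control signals are generated by the controller from the instruction code.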
151
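Chapter summary 4
Single-cycle processor design
Every instruction completes in one clock cycle.
The clock period is set by the slowest instruction, load.
No temporary registers are needed for intermediate results.
A functional unit cannot be reused within an instruction.
The control signals do not change during an instruction, so the controller is simple to design: write the truth table relating instructions to control signals and build the controller from it.
Multi-cycle processor design
Each instruction is divided into several stages, and each stage completes in one clock.
Different instructions take different numbers of clocks.
The stages should be balanced, and each stage performs only one independent, simple function, such as one ALU operation, one memory access or one register access.
Temporary registers are needed for intermediate results.
A functional unit can be reused in different clocks.
The instruction execution flow can be represented as a finite state machine, from which the controller is designed.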
152
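Chapter summary 5
Ways to implement the control unit
Finite state machine description
The combination of control signal values in each clock cycle is regarded as a state. On every clock the control signals take a new set of values, that is, a new state.
The execution of all instructions can be described by one finite state transition diagram.
A combinational logic circuit (usually a PLA) generates the control signals, and a state register implements the transitions between states.
This is also called the combinational logic design approach.
The resulting controller is called a hardwired controller.
Microprogram description
The combination of control signal values in each clock cycle is regarded as a 0/1 string. Each control signal corresponds to a microcommand, and different values of the control signals issue different microcommands.
Several microcommands form a microinstruction. The actions of each instruction are carried out by several microinstructions, one per clock.
To execute an instruction, its first microinstruction is located, and the following microinstructions are then fetched and executed in a defined order.
The resulting controller is called a microprogrammed controller.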