Differences

This shows you the differences between two versions of the page.

--- projects:optimization_of_a_pipelined_cpu [2013/12/19 14:55]
cbailey [ID]
+++ projects:optimization_of_a_pipelined_cpu [2013/12/20 00:23] (current)
jhanford [Optimizing a Pipelined CPU]
@@ Line 2: / Line 2: @@
 **Brendan Caporaletti, Chelsea Bailey, Jeffrey Hanford**
+Code: [[https://www.dropbox.com/s/2ulskrgce5xltpq/CPU_ResourceFiles.zip]]
 ===== The Vision =====
@@ Line 26: / Line 27: @@
 As part of our pipeline optimization, the six comparators at the bottom of the diagram check to see if later registers being requested in later phases of the CPU are the same as the ones being operated on. If the registers are the same, the ID phase outputs the updated data instead of the original, outdated information in that register that has not yet been written back to.
-In a jump operation, **BRENDAN PUT STUFF HERE THAT IS GOOD AND ACCURATE ABOUT ALL THE JUMPS**the amount it should jump by is contained within 16 bits of the opcode. This register is fed through a sign extender, shifted by two, added to the current PC number and output to the next phase of the CPU.
+In a jump operation, the amount it should jump by is contained within 16 bits of the opcode. This immediate is fed through a sign extender, shifted by two, added to the current PC number and output to the next previous of the CPU. If this is a jump and link, Jal and RegWrite are put high. WriteReg allows us to write the PC to the register file. Jal selects the PC+4 to replace the ALUOut and to replace the WriteReg with register 31. If this is a normal Jump Jal is set high but RegWrite is not. Therefore the register will not be set despite Jal being on. Forwarding ignores register 0 in order to avoid forwarding of Jumps writing to register 31. In either case Jal forces PC to jump to the Branch location.
 If WriteEnable is high, the number being fed into WD3 is written to the write register A3 from the writeback phase.
-[TALK ABOUT BRANCH COMPARATOR THING]
+In order for us to avoid the number of No-Ops that incorrect branching would entail, we moved our BNE comparator into the ID phase. This way we only have to cancel a single instruction upon branching. Our method of branch prediction in this model is static. We always assume that we are not branching, until the comparator tells us otherwise.
 ==== EX ====
 {{:projects:ex.png?800|}}
 In the execution phase, ALU operations are performed on the registers pulled from the instruction decode phase. The ALU operation selected depends on the ALUControl line. For add immediate, instead of adding RD2 to RD1, the CPU selects the immediate line from ID and adds it to RD1 using the ALUSrc selector bit.
 The register that is to be written to is decided by the RegDst line. In operations like Add, where two registers are added and a third is written to, Rd is used. In operations like add immediate, where only one of the input registers to register file is used, Rt is used.
@@ Line 55: / Line 54: @@
 In our original CPU design, we treated branching in the same method as jumping, which was a different implementation from MIPS. To make our CPU fully MIPS compliant we had to add concatenation into the ID phase where we do our jumping. According to MIPS, branch commands are based on a delta where as jump commands require a target destination. The jump machine codes provides 26 bits which when concatenated with two 0's in the low order bits and the 4 highest order bits of the PC, provides the jump destination.
-EXPLAIN JAL JR 11111 00000 THING
+In order to use all 26 bits of immediate normal Jump operations must forcibly have their WriteReg set to 0 as to avoid errors in forwarding. This is why the RegWrite line, being the difference between Jump and Jal,   controls whether Jal will force WriteReg to be 31 or 0.
-EXPLAIN JALE LINE MUX
 {{:projects:pipelinedcpu_v2.0.png?800|}}
@@ Line 65: / Line 62: @@
 Once our design was MIPS compliant, it allowed us to implement another feature. The concatenation made jumping much quicker and therefore we decided to move it up into the IF phase. The sooner jumping can happen the less clock cycles we waste.
+{{:projects:pipelinedcpu_v3.0.png?800|}}
 ==== V4 Dynamic Branch Prediction ====
 Our original solution for hazards related to branching/jumping was static branch prediction. We always assumed that we were not branching. When we began our final project, we knew that our branch prediction was area in which we greatly wanted to add improvements. After much research we settled on dynamic 2-bit prediction.
-The idea behind dynamic prediction is that we are always changing our guess based on the accuracy of the previous guess. In the simplest case this is a one bit value. When a branch is taken the value is set to 1. When it is later not taken the value is set back to 0. A 2-bit predictor allows us slightly more accuracy. The diagram below explains how our 2-bit predictor works in the form of a finite state machine. The four states are strongly taken (11), weakly taken (01), weakly not taken (10), and strongly not taken (00). The green arrows represent when a branch was taken and the red arrows represent when a branch was not taken. Every time a branch is taken or not taken it follows the path dictated by those arrows and the state it is currently in.
+The idea behind dynamic prediction is that we are always changing our guess based on the accuracy of the previous guess. In the simplest case this is a one bit value. When a branch is taken the value is set to 1. When it is later not taken the value is set back to 0. A 2-bit predictor allows us slightly more accuracy. The diagram below explains how our 2-bit predictor works in the form of a finite state machine. The four states are strongly taken (11), weakly taken (10), weakly not taken (01), and strongly not taken (00). The green arrows represent when a branch was taken and the red arrows represent when a branch was not taken. Every time a branch is taken or not taken it follows the path dictated by those arrows and the state it is currently in.
 {{:projects:bitpredict.png?800|}}
@@ Line 77: / Line 76: @@
 {{:projects:bht.png?800|}}
-===== Timing Analysis =====
+To figure out how to implement this new way of branching, we relied on some of our oldest knowledge from Computer Architecture. First we generated the following truth table:
-==== Version 0 (No Ops)====
+^State^Taken^New State^
+|AB|C|DE|
+|00|0|00|
+|00|1|01|
+|01|0|00|
+|01|1|11|
+|10|0|00|
+|10|1|11|
+|11|0|10|
+|11|1|11|
+Then we used Karnaugh maps for each bit of the new state to determine the simplest logic that could be implemented. Those maps are shown below, followed by the final implementation that was settled on added to the CPU in red.
+^ ^A B^A !B^!A !B^!A B^
+|C|1|1|0|1|
+|!C|1|0|0|0|
+D = AB+BC+AC
+^ ^A B^A !B^!A !B^!A B^
+|C|1|1|1|1|
+|!C|0|0|0|0|
+E = C
+{{:projects:pipelinedcpu_v4final.png?900|}}
+===== Reflection =====
+The whole process of this project was incredibly gratifying and overall successful, but we did face a lot of difficulties along the way. One of our biggest hurdles was synthesizing our design. None of us had previous experience with synthesis using XILINX and it had a good bit of a learning curve for us to surmount. That is why we did not go as much into timing as we wish we could have. On the bright side, we spent a lot more time doing research, design, and implementation. This turned out to be the most interesting and beneficial part of the project so it is where we exerted most of our effort.
+We also put a lot of emphasis on the documentation. From the very start it was most important to us that whatever work we got done was incredibly well documented to be helpful to future classes. In this regard, we feel we succeeded. We worked very hard to make our diagrams comprehensive and understandable. We put a lot of thought into how we would describe our implementations so they would be as intuitive as possible.
+Looking back on our work plan, shown below, we were fairly accurate with our time predictions. The huge difference is that our final demo took a different form from our original plan. Although we had intended to do speed comparisons and show the timing differences, we earlier realized that we had bitten off more than we could chew. It consumed all of our time to simply focus on researching improvements, figuring out implementation, and then getting them all to a fully functional state in Verilog. We still had enough time to analyze the big picture of how our timings compared, although we did not do a lot of specific time testing through synthesis.
+If we had more time on this project our TODO would surely be to fully test the timing of each phase of each individual version of our CPU through synthesis with XILINX. That being said, we are completely happy with the work we managed to do during our time working on this project. We are proud of our outcome and each had an incredibly valuable experience.
+<code>
+Goal:
+Optimize and document beautifully our pipelined CPU from MP3.
+Final Demo:
+A presentation of speed tests and comparisons of simple programs on our original CPU to the optimized CPU.
+Schedule:
+Study timing and balancing of CPU: Dec 11: 10 hours
+Implement new design features for speed optimization: Dec 17: 20 hours
-^ Phase ^ Clockspeed (ns) ^
+Documentation: Dec 19: 10 hours
-| IF ||
-| ID ||
-| EX ||
-| MEM ||
-| WB ||
-**Throughput:**
+</code>
-==== Version 1 (Original)====
-==== Version 2 (MIPS Compliant)====
-==== Version 3 (Jump Forward)====
-==== Version 4 (Dynamic Branching)====

ca Wiki

User Tools

Site Tools

Differences

Page Tools