“It is good for programmers to understand what goes on inside a processor. The CPU is at the heart of our career. What goes on inside the CPU? How long does it take for one instruction to run? What does it mean when a new CPU has a 12-stage pipeline, or 18-stage pipeline, or even a ‘deep’ 31-stage pipeline? Programs generally treat the CPU as a black box. Instructions go into the box in order, instructions come out of the box in order, and some processing magic happens inside. As a programmer, it is useful to learn what happens inside the box. This is especially true if you will be working on tasks like program optimization. If you don’t know what is going on inside the CPU, how can you optimize for it? This article is about what goes on inside the x86 processor’s deep pipeline.”
I do believe the terminology used in the article is off.
The 486 didn’t introduce a “superscalar pipeline” to x86; it’s just a pipeline, meaning multiple instructions at different stages of execution in a single execution unit.
“Superscalar” refers to having multiple execution units, whether pipelined or not.
He also conflates “Core” with “Core 2”. They are different chips.
“Core” was derived from the Pentium M; it was 32-bit and came in single- or dual-core versions.
The Core 2 was 64-bit and available in single-, dual-, or quad-core versions.
Since we’re nitpicking: by your own definition the 486 is superscalar; it had multiple functional units.
Superscalar refers to the ability to run multiple functional units (usually redundant ones) in parallel, which I believe the 486 couldn’t do. The usual rule of thumb is that a superscalar microarchitecture can support theoretical IPCs larger than 1.
Superscalar means multiple instructions may be issued in a single cycle. The P5 Pentium was the first superscalar x86 chip. It could dispatch and issue one or two instructions per cycle in program order. The P6 (Pentium Pro through Pentium III) could dispatch three instructions per cycle and issue five instructions per cycle out of program order.
Pentium M and Core (1) are a direct evolution of P6 with the same three dispatch ports and five issue ports. Pentium M added micro-ops fusion with an additional two pipeline stages (12 to 14). Core allowed two cores to share a common L2 cache.
Core 2, besides the 64-bit GPRs, is wider than P6, with four dispatch ports and six issue ports. And Haswell is adding another two issue ports for a total of eight.
As for pipeline depth, the entire industry has converged on 12-15 cycles for CPUs designed to be clocked in the 2-4GHz range. Apple A6 and Qualcomm Snapdragon have 12-cycle pipelines. Atom is moving from a 14-cycle pipeline to a brand-new 13-cycle pipeline. ARM Cortex A15 has an eponymous 15-cycle pipeline.
But at clock frequencies below ~1.5GHz, a shorter pipeline is the better trade-off. The 7-cycle ARM Cortex A7 is the best example of a modern core designed to perform well at low clock frequencies.
I’ll admit that there is some variation in terminology, but the defining characteristic of a superscalar processor is that it will fetch, decode, and dispatch (queue up to wait for dependencies) multiple instructions on a cycle. It is a partially orthogonal issue that an out-of-order processor may issue (start actually executing) multiple instructions on a cycle due to multiple dependencies being resolved at the same time. There have been plenty of scalar processors with OoO capability (IBM 360 FPU using Tomasulo’s algorithm, CDC6600, etc.). And there have been plenty of in-order superscalar processors (original Pentium, original Atom, ARM Cortex A8, various SPARCs, etc.).
I say that these are PARTIALLY orthogonal, because superscalar processors typically have multiple functional units of the SAME type, while scalar processors typically do not. Having the ability to decode multiple instructions at once massively increases the probability that more than one of those will target the same type of functional unit, putting pressure on the execution engine to have redundant functional units to get good throughput.
Out-of-order was developed as a way to decouple instruction fetch from instruction execution. Some instruction types naturally lend themselves to taking multiple cycles (e.g. multiply). Fetch and decode are a precious resource, and you don’t want to hold them up just because you have a multiply instruction holding up the works. While that’s going on, it would be nice to keep the adder busy doing other independent work, for instance.
So OoO was developed to keep fetch and decode productive. But then that opens up another problem where fetching and decoding one instruction per cycle can’t keep the execution units busy. Thence came superscalar. It’s an interesting optimization problem to find the right dispatch width and the right number of redundant functional units in order to get the best throughput, especially when power consumption constraints come into play.
Note: I’m a professor of computer science, and I teach Computer Architecture.
Let me put this another way: the number of instructions per clock (IPC) of a scalar in-order processor is much less than 1, because there are many types of multi-cycle instructions. Adding out-of-order allows instruction fetch to continue despite the multi-cycle instructions, as long as those instructions are not dependent on the executing multi-cycle instructions. That brings the IPC closer to 1, but it cannot exceed 1. The objective of superscalar is to make IPC exceed 1, and it does this primarily by making instruction fetch and decode process multiple instructions per cycle. Whether or not you combine these ideas, superscalar invariably needs redundant functional units for some instruction types in order to actually achieve the goal of IPC > 1, in part because the majority of instructions are simple single-cycle integer instructions, so you’ll get lots of them together.
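If it helps, here’s a toy back-of-the-envelope version of that argument in Python. The instruction mix and latencies are numbers I made up purely for illustration; this isn’t a model of any real chip.

```python
# Toy IPC arithmetic -- the instruction mix below is made up purely for
# illustration and does not model any real microarchitecture.

mix = [
    (0.70, 1),   # 70% simple single-cycle integer ops
    (0.20, 3),   # 20% multi-cycle ops (multiplies, loads), 3 cycles each
    (0.10, 10),  # 10% long-latency ops (divides, cache misses), 10 cycles each
]

# Scalar in-order: every multi-cycle instruction stalls fetch for its full latency.
avg_cycles_per_instruction = sum(frac * lat for frac, lat in mix)
print("scalar in-order IPC        ~= %.2f" % (1.0 / avg_cycles_per_instruction))

# Scalar out-of-order: independent work hides the latency, but fetch/decode still
# handle only one instruction per cycle, so IPC approaches (but never exceeds) 1.
print("scalar OoO IPC ceiling      = 1.00")

# N-wide superscalar: fetch/decode/dispatch N per cycle, so the ceiling becomes N,
# achievable only with enough (often redundant) functional units to absorb the mix.
N = 2
print("%d-wide superscalar ceiling  = %.2f" % (N, float(N)))
```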
… which is why I said “theoretical” IPCs > 1. 😉
Cheers.
theosib,
You sound very knowledgeable, certainly more than me. What do you think about CPU cores eventually being replaced / enhanced with massive FPGAs?
The issue I have with current CPU architectures is how much hardware and R&D is being thrown at running sequential instruction sets in parallel rather than at actually using native parallel instruction sets in the first place. We have undeniably seen dramatic gains for sequential code, and yet all this focus on sequential code optimization seems to be a major distraction from what could have been a much better overall strategy for maximizing parallel computation.
For illustrative purposes, take the case of bitcoin mining as good example of a parallel problem where performance is king and software compatibility isn’t a factor. The next link contains a sizable dataset very roughly showing how different computational technologies compare:
https://en.bitcoin.it/wiki/Mining_Hardware_Comparison
Intel’s latest processors top out at ~65Mhash/s for 6*2 hyperthreaded cores at 200W. As we’ll see, the sequential programs running on these super-scalar CPUs cannot compete with real parallel algorithms.
The ARM processors listed top out at ~1Mhash/s running on .5W. If we ran 400 of these to match intel’s power consumption, we’d get very roughly 400Mhash/s.
Now when we look at FPGAs, they have 400+ Mhash/s running on less than 20W. If we ran 10 of these to match 200W, we’d get 4000Mhash/s, or 62X the processing power of the x86 cpu.
There are ASICs that do 5000Mhash/s on 30W (I mention them for comparison only; obviously an ASIC isn’t a reprogrammable part, so it wouldn’t have a place in a software-programmable PC).
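To put those figures on a common footing, here’s the rough hashes-per-watt arithmetic in Python, using the same ballpark numbers quoted above, so take it with the same grain of salt.

```python
# Rough hashes-per-watt comparison using the ballpark figures quoted above.
# The numbers are approximate and probably not apples-to-apples.

platforms = {
    "x86 CPU (6c/12t)": (65.0,   200.0),  # (Mhash/s, watts)
    "ARM SoC":          (1.0,    0.5),
    "FPGA":             (400.0,  20.0),
    "ASIC":             (5000.0, 30.0),
}

cpu_mhash, cpu_watts = platforms["x86 CPU (6c/12t)"]
cpu_efficiency = cpu_mhash / cpu_watts

for name, (mhash, watts) in platforms.items():
    efficiency = mhash / watts  # Mhash/s per watt
    print("%-18s %8.2f Mhash/s/W  (%6.1fx the CPU)"
          % (name, efficiency, efficiency / cpu_efficiency))
```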
While I know CUDA is doing its part to introduce parallel software to the PC via GPUs, it still fares poorly compared to the FPGAs. In fact GPU bitcoin miners are throwing in the towel (like CPU miners before them) because the electricity costs more than the value of the bitcoins earned.
So in your expert opinion, do you think we’re bumping against the wall of diminishing returns with today’s superscalar CPUs? Do you see FPGAs as a good contender for even higher performance PCs in the future (assuming we ever get away from sequentially based software programming practices)?
Edit: I realize the numbers are very imprecise and might not even be apples to apples. Nevertheless, bitcoin was the best example I could come up with to compare parallel computation technologies.
> all this focus on sequential code optimization seems to be a major distraction from what could have been a much better overall strategy for maximizing parallel computation.
Did you miss the Itanium?
It depends what the previous poster meant by parallelism.
tylerdurden,
Haha, I know right. Both Intel Itanium and superscalar x86 architectures are different ways to achieve a bit more parallelism by doing more work per cycle. But both of those architectures are still sequential in nature and neither of them are parallel in the sense that FPGAs are.
Well, FPGAs are just seas of programmable logic cells with somewhat flexible interconnects, so their “parallelism” depends on the designs being implemented. E.g. there are plenty of FPGAs used to synthesize algorithms which could be considered “sequential” and non-parallel in how they process data. However, modern large FPGAs also provide a sea of ALUs, which indeed lend themselves naturally to parallel programming models.
To be fair, modern CPUs do support most forms of parallelism; whether it be some form of instruction level parallelism (superscalar, SMT, out-of-order, multicore, etc), as well as data parallel structures like SIMD and Vector units. However, general purpose CPUs have to hit a certain “balance” when it comes to their designs: how much chip area/power should be dedicated to control structures, how much to execution, how much to memory, etc., in order to hit a wide range of performance targets while remaining generally programmable. Whereas GPUs and ASICs have more restricted application targets. In the case of GPUs, they’re used to run algorithms with elevated degrees of data parallelism, so they can dedicate most of their area to execution structures, rather than control (since they don’t have to dynamically squeeze as much performance from a single instruction stream as possible), as an oversimplified example.
AMD’s newer Fusion microarchitectures are something that may interest you, since they are starting to support elevated degrees of data parallelism on die.
tylerdurden,
“Well, FPGAs are just seas of programmable logic cells with somewhat flexible interconnects, so their ‘parallelism’ depends on the designs being implemented.”
Yes, it all depends on the design; it’d be very powerful in the hands of innovative software developers, but I don’t know if/when consumer CPUs will provide FPGA-like technology enabling software developers to take advantage of it.
“To be fair, modern CPUs do support most forms of parallelism; whether it be some form of instruction level parallelism (superscalar, SMT, out-of-order, multicore, etc), as well as data parallel structures like SIMD and Vector units.”
True, but it’s watered down. Every time I look at SSE I ask myself why Intel didn’t make SIMD extension instructions that could accommodate much greater parallelism. x86 SIMD extensions only offer low parallel scaling factors. I know you are right that Intel had to strike a balance somewhere, but nevertheless I feel their whole ‘lite SIMD’ approach is impeding significantly higher software scalability.
“In the case of GPUs, they’re used to run algorithms with elevated degrees of data parallelism, so they can dedicate most of their area to execution structures”
I like the way GPUs in particular are designed to scale to arbitrary numbers of execution units without otherwise changing the software. This is just awesome for “embarrassingly parallel algorithms” (https://en.wikipedia.org/wiki/Embarrassingly_parallel).
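To illustrate what I mean about scaling without changing the software, here’s a rough Python/NumPy sketch. NumPy is just standing in for “however wide the hardware happens to be”; it isn’t meant as a statement about any particular ISA.

```python
import numpy as np

# The same element-wise work expressed two ways. The scalar loop pins the "width"
# at one element per step; the array expression leaves the width up to whatever
# the hardware offers (4-wide SSE, 8-wide AVX, a GPU, or something wider still).

a = np.random.rand(100000)
b = np.random.rand(100000)

# Scalar view: one multiply-add per iteration.
out_scalar = np.empty_like(a)
for i in range(len(a)):
    out_scalar[i] = a[i] * b[i] + 1.0

# Data-parallel view: one logical operation over the whole array; how many lanes
# execute per cycle is the hardware's problem, not the program's.
out_vector = a * b + 1.0

assert np.allclose(out_scalar, out_vector)
```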
“AMD’s newer Fusion microarchitectures are something that may interest you, since they are starting to support elevated degrees of data parallelism on die.”
Yes maybe, I’ll have to read up on it.
Sorry for the long time to reply. Also sorry for not giving you a more thorough reply.
The bitcoin problem is interesting, and some friends and I are working on trying to get better compute/area out of an FPGA, just for the fun of it. As a chip designer, I see FPGAs as an obvious choice for accelerating this kind of thing far beyond what a general-purpose CPU can do.
One of the problems with using FPGAs is a general fear of hardware development. (Well, that and the cost of FPGA hardware, but that doesn’t apply to supercomputing.) Another problem is reasoned avoidance. For me, having worked as a chip designer, I like to just put together solutions straight in Verilog. But we can’t retrain all HPC programmers in chip design, and it’s sometimes not a good cost/benefit tradeoff. The holy grail is being able to convert software source code into logic gates. There’s plenty of work on that, but the results aren’t necessarily all that great. There’s a huge difference in performance between a custom-designed FPGA circuit (i.e. knowing what you’re doing) versus something that came out of an automatic translator.
theosib,
I’d like to say software engineers could figure it out given widely accessible hardware, but I might be overestimating our abilities. Most CS grads these days just end up becoming ridiculously overqualified web devs since that’s where most jobs are.
“The holy grail is being able to convert software source code into logic gates. There’s plenty of work on that, but the results aren’t necessarily all that great. There’s a huge difference in performance between a custom-designed FPGA circuit (i.e. knowing what you’re doing) versus something that came out of an automatic translator.”
This surprises me a bit. Even though the human mind is an incredible analytical machine, it has its limits, whereas computers just keep getting better. In the Kasparov vs Deep Blue chess match, it was inevitable that the brute force capabilities of the computer would ultimately overtake the best humans; the only question was when.
At university I made a realtime 3D Java program to place components on a virtual circuit board using genetic algorithms and a fitness function. It was just a fun project I presented for an undergrad GA course I was taking; to be honest I don’t know if its solutions were any good since they were never compared against expert solutions. But in any case my gut instinct tells me that, given enough computing power, even a naive algorithm should be able to brute force the finite solution space and consistently beat the best humans. I do believe you when you say automatic solutions aren’t as good as experts’; however, do you think that could change if more computing power were thrown at the FPGA problem?
I’m interested in what you have to say about it because I don’t have expertise with FPGAs and I don’t personally know anyone else who does either.
My opinion is that this is less about more compute power and more about the limits of compiler developers. This reminds me of Ray Kurzweil’s stupid singularity thing, which seems to imply that the instant that computers are as fast as the human brain, they’ll magically develop human intelligence. It doesn’t matter how fast they are if we don’t know the algorithms for human intelligence. And we still don’t.
There’s the same problem with compilers. I’m reminded of two events in computer history. One is LISP machines, and the other is Itanium. In both cases, hardware designers assumed that a “sufficiently smart compiler” would be able to take advantage of their features. But people were not able to develop those sufficiently smart compilers. Consider predicated execution for Itanium. Predication turns out to be a hard problem. With architectures (like ARM32) that have only one predicate, it gets used SOME of the time. Itanium has an array of 64 predicate bits. Humans can specially craft examples that show the advantages of the Itanium ISA, but compilers just don’t exist that can do that well in the general case.
theosib,
“This reminds me of Ray Kurzweil’s stupid singularity thing, which seems to imply that the instant that computers are as fast as the human brain, they’ll magically develop human intelligence. It doesn’t matter how fast they are if we don’t know the algorithms for human intelligence. And we still don’t.”
I don’t know if I could believe that. I think we might eventually get computers that are convincing enough to mentally pass as human and be indiscernible in every non-physical test, and yet I’d still have a lot of trouble considering that any such invention could be sentient because I “know” that it’s not, but then again it’s hard to understand what consciousness is at all.
“There’s the same problem with compilers. I’m reminded of two events in computer history. One is LISP machines, and the other is Itanium. In both cases, hardware designers assumed that a ‘sufficiently smart compiler’ would be able to take advantage of their features.”
I know what you mean; however, it’s not necessarily the case that we’d have to solve such problems directly. With brute forcing (or optimized variants like genetic algorithms) the programmer doesn’t solve the problem at all, but writes a fitness function whose sole purpose is to rate the success of solutions that are derived in random and/or evolutionary ways.
There was a pretty awesome game I played (a java applet) many years ago where you would specify the fitness function, and the game would evolve 2d “creatures” with muscles and basic neurons and after a few thousand iterations you’d have creatures that could walk. More iterations and they could avoid obstacles. Maybe it would be possible to develop a genetic algorithm for the compiler as well. It’s kind of what I meant earlier, even naive (unintelligent) approaches like this can produce good results, given enough iterations.
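Schematically I’m picturing something like this: a toy genetic algorithm over bitstrings with a pluggable fitness function. The bit-counting fitness is just a placeholder; a real synthesis flow would obviously need a far richer encoding and scoring.

```python
import random

def evolve(fitness, genome_len=64, pop_size=50, generations=200, mut_rate=0.02):
    """Toy genetic algorithm: the caller only supplies a fitness function."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]           # keep the fitter half
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(genome_len)
            child = a[:cut] + b[cut:]              # single-point crossover
            child = [g ^ (random.random() < mut_rate) for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Placeholder fitness: count of set bits. A circuit-layout version would instead
# score wire length, timing, area, etc. -- the GA itself doesn't care.
best = evolve(fitness=sum)
print(sum(best), "of 64 bits set in the best genome found")
```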
Yeah, we call that genetic programming. There’s lots of interesting work in that area.
theosib,
Thanks for the conversation! It’s nice to occasionally escape from the intellectually mind numbing OS turf wars that are so frequent here in the threads.
The problem is the number of iterations. The state space for finding an optimal layout for a given FPGA design would be immense, I would imagine. Something on the order of the universe’s age (in hours or even seconds). All the computers on earth wouldn’t be able to search that state space in any reasonable period of time.
Think about a simple problem like all the permutations of a 32-bit 1024×1024 bitmap. The number of possibilities is huge.
Still, with an intelligent algorithm that can reduce this state space without sacrificing optimal design, there is potential, I’d say.
asb_,
“The problem is the number of iterations. The state space for finding an optimal layout for a given FPGA design would be immense, I would imagine. Something on the order of the universe’s age (in hours or even seconds). All the computers on earth wouldn’t be able to search that state space in any reasonable period of time.”
Having never studied the problem in any serious detail, I cannot know the full scope. Nevertheless, the typical software engineering way to work with such problems is to “divide and conquer” them. So if we have a program that needs to be converted into FPGA logic, assuming the full state of the program is too big to handle (since complexity grows exponentially with the size of the program), then we could break it up into individual functions and optimize those independently from the rest of the system. This way it’s not necessarily brute forcing the entire space at once, just pieces of the puzzle at a time.
“Still, with an intelligent algorithm that can reduce this state space without sacrificing optimal design, there is potential, I’d say.”
Well, if our goal isn’t to find the global maximum, then it becomes far easier to search for local maxima. Consider a multidimensional function for which we want to find the “best” or highest (or lowest) point. The search space could be immense; the “intelligent” way to find the maximum is to use calculus to determine the roots and evaluate the function at each root. However, when the calculus is not feasible (or when it produces too many roots), we can cheat by solving the easier problem of searching for local maxima instead. Starting with candidates at random, we can trivially use Newton’s method to find peaks in the graph. By using sheer computing power we could find many such local maxima. The best solution in our set might not be the best solution absolutely possible, but it still might be better than a human can do, since humans generally are not able to solve for the global maximum either (i.e. we’re more fallible than an imperfect computer playing chess).
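Something like this rough sketch of the random-restart idea. I’m substituting a plain hill climb for Newton’s method just to keep it short, and a made-up bumpy 1-D function stands in for the real multidimensional search space.

```python
import math
import random

def f(x):
    # A deliberately bumpy function with many local maxima.
    return math.sin(5 * x) + 0.3 * math.cos(17 * x) - 0.05 * x * x

def hill_climb(x, step=0.01, iters=1000):
    # Greedily step toward higher f(x) until neither neighbor is better.
    for _ in range(iters):
        for candidate in (x + step, x - step):
            if f(candidate) > f(x):
                x = candidate
    return x

# Random restarts: each one cheaply finds *some* local maximum; keep the best.
starts = [random.uniform(-10, 10) for _ in range(100)]
best = max((hill_climb(x) for x in starts), key=f)
print(f"best local maximum found: f({best:.3f}) = {f(best):.3f}")
```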
It’s really very interesting to think about how probabilistic solutions like this can be used to solve hard problems.
I’m going to slightly abuse the term “NP-hard” here, but anyhow, the search space for place and route is such that if you had the optimal solution, you would not even be able to verify that it was. Immense doesn’t even begin to describe the complexity of automatic circuit layout. Oh, and humans still do better: Bulldozer performed sub-par by about 20% for the technology because they didn’t bother to have humans go back in and hand-optimize critical circuits.
theosib,
“I’m going to slightly abuse the term ‘NP-hard’ here, but anyhow, the search space for place and route is such that if you had the optimal solution, you would not even be able to verify that it was.”
Why is this abusing NP-hard? I think it is (can be converted into) an NP-hard problem.
“Immense doesn’t even begin to describe the complexity of automatic circuit layout. Oh, and humans still do better: Bulldozer performed sub-par by about 20% for the technology because they didn’t bother to have humans go back in and hand-optimize critical circuits.”
I had to look it up, you must be referring to this:
https://en.wikipedia.org/wiki/Bulldozer_%28processor%29
How do you arrive at the 20% figure? I have no idea what kind of algorithmic strategy was used to design bulldozer, but in any case being only 20% sub-par still sounds impressive to me. We’re on the cusp of computers being able to beat humans in many specialized domains, it’s only a matter of time before they can beat humans for circuit optimization.
Being NP-hard means it may never find the “best” solution, but as long as it can find incrementally better ones with each automated generation, that’s quite amazing.
Does your university do work on active research projects? Sometimes it seems like being a professor could be a lot of fun for professors who get to make a living overseeing cutting-edge research projects. This is assuming the distracting “diploma mill” responsibilities don’t get in the way, haha.
Although I have teaching responsibilities (that I take very seriously), my primary job is research. And doing better on automated synthesis is one of the projects I’m working on.
As for the 20%, this is where I got the figure from:
http://www.xbitlabs.com/news/cpu/display/20111013232215_Ex_AMD_Engi…
theosib,
Thanks again, and that’s a very interesting link.
There’s a lot of unverifiable info there, but I have to admit that at face value 20% worse compared to experts for basic adders and multipliers does seem disappointing. Reading in between the lines I am inclined to think the skill of the outsourced company may have been inferior to AMD’s engineers and they built upon inferior knowledge of the problem domain such that the algorithmic results were inevitably sub-par for AMD. (For all we know, these same algorithms may have been above-par compared to the engineers at the outsourced company).
I’m actually a bit shocked that AMD outsources such a crucial bit of their engineering process. I knew they went fabless a few years ago, but I didn’t realize they had begun outsourcing the block designs as well. My own experience with outsourcing companies is that they’ll say just about anything to land a contract to “save money”, but then they hire far less experienced engineers who routinely under-perform compared to the developers they just made redundant. I wouldn’t be surprised at all if that was the case here, although this is all uninformed speculation.
“And doing better on automated synthesis is one of the projects I’m working on.”
It sounds really cool. I’m envious
So, we are supposed to think inside the box? That’s different….te he. I enjoyed programming the i8085A back in the day. But I can’t see myself doing so with a modern processor. That’s way too heady for me now. The article still makes you think about the core though. And that can make one a better programmer.