The new crop of Pentium 4s, which will spawn a number of new desktop PC models, will include three chips based on a fresh processor design, code-named Prescott. Intel will add two new speed versions of its current Pentium 4, dubbed Northwood. A sixth Prescott Pentium 4, running at 3.4GHz, will be announced Monday, but it won’t be available until later in the quarter. Read the article at C|Net News.com and the pricelist at TheInquirer.
http://www.anandtech.com/cpu/showdoc.html?i=1956
IMHO Anandtech is the authority on processor reviews and benchmarks.
One thing that I was happy about is the attempt to do indirect branch prediction. Predicting indirect branches is extremely important for languages like Java/C#/Perl/Python, because all method calls are indirect. Current processors have *terrible* indirect branch performance. On my P4, an indirect branch can cost 50 clock-cycles or more!
I made a post on comp.lang.asm about this:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=a3995c0d…
It would be interesting to run these tests again on a Prescott or Pentium M CPU.
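Something along these lines would do; this is just a rough sketch of the kind of test I mean (my own illustration, not the code from that post), timing a loop of calls through a function pointer whose target keeps changing against the same loop with a fixed, predictable target:

/* Rough sketch of an indirect-branch microbenchmark (not the exact code
 * from the comp.lang.asm post). Compile with something like gcc -O1 so
 * the calls aren't folded away. On CPUs without a real indirect predictor,
 * the changing-target loop should be noticeably slower. */
#include <stdio.h>
#include <time.h>

#define N 50000000L

static int f0(int x) { return x + 1; }
static int f1(int x) { return x + 2; }
static int f2(int x) { return x + 3; }
static int f3(int x) { return x + 4; }

typedef int (*fn)(int);
fn table[4] = { f0, f1, f2, f3 };

int main(void)
{
    volatile unsigned int sink = 0;
    clock_t t0, t1;
    long i;

    /* Indirect calls whose target changes every iteration: hard for a
     * BTB that only remembers one target per branch. */
    t0 = clock();
    for (i = 0; i < N; i++)
        sink += table[i & 3]((int)i);
    t1 = clock();
    printf("changing target: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Indirect calls through a fixed target: trivially predictable. */
    t0 = clock();
    for (i = 0; i < N; i++)
        sink += table[0]((int)i);
    t1 = clock();
    printf("fixed target:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return sink & 1; /* keep the compiler from discarding the loops */
}

The gap between the two numbers gives a rough feel for the per-call misprediction cost, and it would be interesting to see how much Prescott’s new indirect predictor closes it.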
I’m rather suspicious about the game benchmarks done at Anandtech. Given that they know those benchmarks are essentially GPU-bound, why don’t they reduce the resolution to 640×480 instead of leaving it at 1024×768? I know that’s the resolution many people play at, but we’re benchmarking CPUs here, so it only makes sense to make the GPU as little of a bottleneck as possible.
I really liked Anandtech’s article, especially the page before the last, where they made a very interesting remark about Prescott’s scalability. If you haven’t read the article yet, go and check it out.
The processor also matters for gaming; you can see that in the gaming performance jump from the 512KB to the 1MB Athlon 64. Also, these review sites have pretty beefy GPUs, so I wouldn’t worry about a bottleneck.
-DaMouse
What would be the point of benchmarking a CPU in a less-than-realistic situation? Makes perfect sense to me that they would attempt to recreate a “real world” load for their testing.
I find it a safe assumption that most of the initial use for this CPU will be gaming, and apparently so did they when they tested it.
Amen to that, Anandtech is still largely excellent (wish the same could be said for THG..).
Sure, a real-life gaming situation is relevant. But why not include both 1024×768 AND 640×480 then?
And while they’re at it, add some tests with the video cards doing AA, to stress them to the max and see whether the CPU can help out there somehow.
I agree it isn’t the best test they have done. Recently I have found Anandtech’s tests a little weaker than they were even a year ago.
I especially dislike Derek Wilson’s work.
a 5-stage pipeline CPU at 1 GHz performs exactly like a 10-stage pipeline CPU at 2 GHz, which performs exactly like a 20-stage CPU at 4 GHz, which performs exactly like a 40-stage CPU at 8 GHz.
see the problem?
higher GHz is not in any way a reliable marker for performance.
I would much rather have a 1 GHz 5-stage CPU (as if one even exists) than an 8 GHz 40-stage CPU, if for nothing else than power consumption.
if more research went into shrinking the pipeline while maintaining the frequency, we would see much bigger boosts in performance.
I quote Debman: “a 5-stage pipeline CPU at 1 GHz performs exactly like a 10-stage pipeline CPU at 2 GHz, which performs exactly like a 20-stage CPU at 4 GHz, which performs exactly like a 40-stage CPU at 8 GHz.”
This is patently incorrect. Were it not for hazards (mostly branches), a 5-stage pipeline would perform EXACTLY THE SAME as a 50-stage pipeline at the same clock rate. The only difference is that the 50-stage pipeline could be clocked faster, allowing it to perform better.
Hazards (such as mispredicted branches) can be a problem because a longer pipeline wastes more clock cycles on them. However, advanced branch-prediction algorithms (especially with prediction of indirect branches!) are sophisticated enough to make mispredicted branches a tiny minority (a few percent).
Additionally, given that the longer pipeline can be clocked faster, the actual real time wasted by a branch misprediction becomes less important. For instance, a 4.5GHz Prescott would have roughly the same misprediction penalty, in nanoseconds, as a 3GHz Northwood.
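To put rough numbers on that (my own back-of-the-envelope arithmetic, using the commonly quoted stage counts of about 20 for Northwood and 31 for Prescott): the real-time cost of a flush is roughly the number of stages divided by the clock rate.

/* Back-of-the-envelope misprediction penalty in real time, assuming a
 * flush costs roughly one pipeline length. Stage counts are approximate. */
#include <stdio.h>

int main(void)
{
    printf("Northwood: ~%.1f ns  (20 stages / 3.0 GHz)\n", 20.0 / 3.0);
    printf("Prescott:  ~%.1f ns  (31 stages / 4.5 GHz)\n", 31.0 / 4.5);
    return 0;
}

Both come out to roughly 7ns per mispredict, which is the sense in which the penalties are comparable.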
Also, you should keep in mind that a cache miss can be much more expensive than a branch misprediction.
Thus, it is reasonable for Intel to say that Prescott delivers roughly the same performance as Northwood at the same clock speed, despite the 50% longer pipeline.
Please, go do some research into what pipelines are FOR.
I did… I read the Anandtech article before posting what I said.
Anandtech is really simplifying the issue, to the point of almost being incorrect.
A pipeline behaves, well, like a pipeline. A 20-stage pipeline isn’t inherently half as fast as a 10-stage pipeline. The difference is that with a 20-stage pipeline, it takes 20 clocks for a single instruction to come out the other end, while with a 10-stage pipeline, it takes 10 clocks for a single instruction to come out the other end. However, once both pipelines are filled, both spit out one instruction per clock cycle. So the latency is 10 clocks vs 20 clocks, but the throughput is 1 instruction per clock. In terms of “instructions executed per second” both are theoretically the same speed.
Now the performance loss of a 20-stage pipeline comes from the fact that CPUs can’t keep their pipelines filled all the time. Consider what happens when you have a branch instruction. Let’s simplify things for a moment and assume that there is no speculative execution. The CPU can’t stuff any instructions behind the branch instruction, because the CPU doesn’t know what code should go there until the branch instruction completes. With the 10-stage pipeline, the CPU waits for 10 clocks and executes only one instruction (the branch instruction). With the 20-stage pipeline, the CPU waits for 20 clocks and executes only one instruction. That’s where the performance hit comes into play. Now most modern CPUs use speculative execution, so they guess what code should come after the branch and start executing that. However, when the CPU guesses wrong, the 10-stage CPU pays a 10-cycle penalty to flush those unneeded instructions, while the 20-stage CPU pays a 20-cycle penalty. The net result of this is that while both pipelines only perform at a fraction of the 1-instruction/cycle ideal, the 10-stage CPU gets closer to that ideal.
Thus, the actual performance loss depends on the fraction of instructions that are branches. Integer code can be 25% branches. That means every four instructions, the 20-stage CPU faces a (slim, since branch prediction is ~95% accurate) possibility of a 20-cycle delay, while the 10-stage CPU only faces the possibility of a 10-cycle delay. This is why the P4’s performance on business applications, etc., is relatively poor. However, many types of code (particularly 3D, scientific, etc.) have a low density of branches. On this type of code, the 20-stage pipeline gets much closer to the 1-instruction/cycle ideal than it does on branch-heavy code.
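To make the arithmetic concrete, here’s a toy cycles-per-instruction model (my own simplification, with illustrative numbers rather than measurements of any real CPU):

/* Toy CPI model: CPI = 1 + branch_fraction * miss_rate * flush_penalty.
 * All numbers are illustrative, not measured figures for any real chip. */
#include <stdio.h>

static double cpi(double branch_frac, double miss_rate, int pipeline_stages)
{
    /* Assume the flush penalty is roughly the pipeline depth in cycles. */
    return 1.0 + branch_frac * miss_rate * pipeline_stages;
}

int main(void)
{
    double branchy = 0.25;   /* integer/business code: ~25% branches */
    double smooth  = 0.05;   /* 3D/scientific code: few branches     */
    double miss    = 0.05;   /* predictor ~95% accurate              */

    printf("branchy code: 10-stage CPI %.3f, 20-stage CPI %.3f\n",
           cpi(branchy, miss, 10), cpi(branchy, miss, 20));
    printf("smooth  code: 10-stage CPI %.3f, 20-stage CPI %.3f\n",
           cpi(smooth, miss, 10), cpi(smooth, miss, 20));

    /* Performance ~ clock / CPI, so the deeper design only wins if its
     * clock advantage exceeds its CPI disadvantage. */
    printf("clock speedup the 20-stage part needs on branchy code: %.2fx\n",
           cpi(branchy, miss, 20) / cpi(branchy, miss, 10));
    return 0;
}

With those numbers, the 20-stage design needs about 11% more clock just to break even on branchy code, but only a couple of percent more on branch-light code, which is the whole tradeoff in a nutshell.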
In the end, the length of the pipeline is an engineering decision. If you will be running very branch-heavy code, the instructions-per-cycle (IPC) penalty of the long pipeline will outweigh the clock-speed advantage. If you are running code with fewer branches, the clock-speed advantage of the longer pipeline outweighs the disadvantage of the reduced IPC. If you consider that the type of code the P4 was designed to excel at (games, media, 3D) has a low percentage of branches, then Intel’s decision to go with a long pipeline makes sense. Note that the Pentium 4’s performance issues don’t just stem from its long pipeline. It’s got other limitations, namely a crappy 387 FPU, limited decode bandwidth, and small caches. The G5 has a very long pipeline too, and still manages to maintain more even performance than the Pentium 4.
if having more pipeline stages is better, why doesn’t everyone do it?
if having five pipeline stages is better than having none/one [depends on how you look at it], it should imply that the more pipeline stages, the merrier... right?
i.e. if the circuitry that comprises each pipeline stage can be broken down further into several sub-stages [each new stage being shorter w.r.t. the original],
and assuming the latency between successive stages is not taken into account, or is too insignificant to affect the calculation;
it seems one could take even more advantage of a higher number of stages;
but why doesn’t everyone do it? Technology can only advance,
feature sizes can only go down and clock speeds can only go up; we won’t have a situation where we have to abandon an existing, working technique because it can’t be pushed any further;
also does this have any connection to the instruction size?
i.e. does the decision to use a given number of pipeline stages depend on whether it’s a RISC or CISC processor?
cheers
ram
I forgot to mention one thing. It’s not just branches that introduce empty pipeline stages. Consider what happens when you have instructions A and B right next to each other in the stream, and B depends on the result of A. A goes into the pipeline, but B can’t be put into the pipeline until A is complete. That again causes a pipeline bubble, which impacts the performance of the longer pipeline more than it impacts the performance of the shorter pipeline. That’s why the Pentium 4 has such a huge reorder buffer: it has to have lots of other instructions to choose from to put into the pipeline while waiting for instruction dependencies to be resolved.
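You can actually see the dependency effect from plain C; here’s a rough illustration I’m adding (nothing P4-specific, and the exact numbers will vary by compiler and CPU): summing with one accumulator creates a chain where every add waits on the previous one, while splitting the sum into four independent accumulators gives the out-of-order core work to fill the bubbles with.

/* Rough illustration of dependency chains vs. independent chains.
 * The array is small enough to stay in cache, so the timing difference
 * comes from instruction dependencies, not memory. Compile with a modest
 * optimization level (e.g. gcc -O1) so the loops aren't auto-vectorized. */
#include <stdio.h>
#include <time.h>

#define N    4096
#define REPS 100000

static double data[N];

int main(void)
{
    double s = 0.0, a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    clock_t t0, t1;
    long i, r;

    for (i = 0; i < N; i++)
        data[i] = 1.0;

    /* One accumulator: every add waits on the result of the previous add. */
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            s += data[i];
    t1 = clock();
    printf("1 accumulator : %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Four accumulators: four independent chains the out-of-order core
     * (and its reorder buffer) can keep in flight at once. */
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i += 4) {
            a += data[i];
            b += data[i + 1];
            c += data[i + 2];
            d += data[i + 3];
        }
    t1 = clock();
    printf("4 accumulators: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (s + a + b + c + d) > 0.0 ? 0 : 1;  /* keep the results live */
}

The four-accumulator version usually comes out noticeably faster, even though it executes the same number of adds, because the reorder machinery has independent chains to keep in flight.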
if having five pipeline stages is better than having none/one [depends on how you look at it], it should imply that the more pipeline stages, the merrier... right?
—
Um, no. Engineering is almost never like that. A five-stage pipeline allows you to clock the CPU much higher than a one-stage pipeline. But you have to balance that against the fact that it causes reduced performance on branch-heavy code. It’s a trade-off.
also does this have any connection to the instruction size?
—
No. The G5 is a long-pipeline CPU with fixed-length 4-byte instructions. The G4 is a short-pipeline CPU with variable-length instructions. The Athlon XP is a medium-pipeline CPU with variable-length instructions.
i.e. does the decision to use a given number of pipeline stages depend on whether it’s a RISC or CISC processor?
—
Not really. First, there is no such thing as RISC vs CISC anymore. The MIPS was the closest thing you ever had to a pure-RISC CPU. The PowerPC is pretty un-RISC-y, and modern x86 chips consist of RISC-y cores wrapped in an x86 shell. Now, x86 CPUs tend to have slightly longer pipelines because they have additional decode stages that need to translate x86 instructions to the internal format. Of course, some RISC CPUs (the G5) also have such stages.
there is one more issue to be considered;
your comment assumes that IPC is the same for both... but Anandtech’s article says that it is easier to increase the clock frequency for longer pipelines than it is for shorter ones;
please correct me if I’m wrong, but
also to be considered is the fact that in most cases a failed prediction won’t have to start with a blank pipe; you might only have to roll back a certain length... this is also explained in the article;
cheers
ram
The only way to make CPUs go much faster is to break completely out of the “let the CPU find the parallelism” mantra that the entire industry is stuck in.
If programmers learn how to program in parallel languages like Occam, or any CSP-like language (which can be C-like, or an HDL), then many opportunities open up to CPU designers that go far beyond what Intel/AMD can extract from purely sequential programs.
Note that x86 CPUs have gotten 30x faster over the last 10 years (THG), comparing a 100MHz Pentium to a 3GHz P4, and the clock has gotten 30x faster too. But didn’t anyone notice that the number of transistors needed also went up by probably 100x? Duh, we paid twice. If the entire transistor budget were used to build 100 P100s on one chip, still at 100MHz, it might seem possible to beat the 3GHz P4, with plenty of MHz headroom still left to claim. But the x86 is not the right architecture to support closely coupled processes by any stretch, barely even one process either.
If CPUs included process communication and scheduling in HW, then whenever a hazard is expected to occur, they could just switch processes. It really isn’t necessary for any CPU to ever wait, unless there really are no more threads to run when hazards occur.
Now pipelining can be used for what it is best at: splitting what would be long cycles, i.e. 64-bit adds, into 32- or even 16-bit slices. The pipeline can then get longer, because following opcodes will not be allowed to enter if they are dependent. All pipelines should be running near the most critical delay limit, which is usually the cache. Arithmetic and logicals can usually be broken into smaller parts.
Instead of spending X transistors on branch prediction, register renaming, forwarding, etc., I would rather see them spent on process communication & scheduling, and no register file at all (cached local workspaces instead, as in the T9000 etc.).
Instead of ever faster DDR DRAM and ridiculous, unusable bandwidth that I never get to see on my Athlon, I would like to see 20ns-cycle RLDRAM as a Level 3 cache; it fits in very nicely between sub-ns CPU cycles and the effective 100ns random-access speed of most DRAM, and is available in 32MByte chunks, way larger than teensy 256K-1M L2 caches.
And the best part is it can be done with lower clock speeds and with many more CPUs working in parallel. But only if programmers get off their rear ends and think more like HW engineers.
In the end, ever faster CPUs become embarrassing when used in a PC whose HDs only handle a few dozen true random accesses per second. I just love how the industry advertises numbers that are completely unobtainable except in peculiar benchmarks.
During bootup, W2K uses a good part of a trillion opcodes (a few per cycle) and gets nothing done.
End of comment
I’m tempted to get a PC; I haven’t had a new one in 7 years. I just don’t want to shell out money for a new G5 right now, so a PC might be a cheaper alternative for me.
No, I’m not assuming that the IPC is the same for both. The IPC for a 20-stage pipeline is always lower than the IPC of a 10-stage pipeline. That’s why it’s a tradeoff:
Speed = Instructions_Per_Clock * Clock_Rate
A longer pipeline raises Clock_Rate but lowers IPC. That’s why it’s a tradeoff. Depending on the type of code you want to optimize for, the decrease in the first term can outweigh the increase in the second term, resulting in a net loss of speed. Or it could be the other way around.
As for the misprediction penalty, I’m pretty sure you can’t just roll back a certain number of instructions, but must flush the entire pipeline. Consider how it works: you speculatively execute the instruction stream right after the branch instruction. The branch instruction completes in 20 clock cycles, and the result says that you predicted wrong. Since all 19 of the instructions behind the branch instruction were from the wrong branch target, the 19 cycles spent on those instructions were all wasted.
What I don’t understand is why Intel doesn’t take advantage of the huge speed boost a larger L1 cache produces. AMD gets a lot of its speed from that. Anyway, I wish Intel would settle on a socket! They switch too damn much.
JJ
Intel had a hard time persuading software makers to recompile their programs with SSE/SSE2 support. Just what force could persuade everyone to throw away their code and write it in a new language? I don’t think anything short of aircraft carriers would work.
@Phuzzi
Not happy with the review sites? Try:
http://www.hardocp.com
They don’t review nearly as much as THG, but the reviewers _really_ know what they are talking about.
The Pentium 4’s “smart” trace cache is much better suited to its long pipeline than the Athlon’s L1 cache. The trace cache is actually estimated to be quite large, about 100KB. It holds 12,000 u-ops, and if you assume that each instruction is 32 bits like the average RISC instruction, that’s 48KB just for instructions. Add in the additional overhead necessary to support the trace cache’s functionality, and the 100KB number makes sense. As for the tiny data cache, it’s smaller because that makes it easier to make it low-latency. The Athlon’s data cache has a 50% higher latency than the P4’s data cache.
well, if you are still using OS 9, I would say…OK, but if you are choosing between getting a G5 and this, I would go where the best OS is for your taste.
that is where it is for me…I have not been in the clock speed market since the Athlon hit the 1GHz mark 5 years ago…now, any CPU over 1 GHz is fine for me, and even 800 is ok as long as I have the RAM.
It is easier to up the clock for a longer pipeline. But you need to remember that a higher clock dissipates more heat. That’s a huge factor for cooling and laptop battery life.
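For anyone wondering why, the usual rule of thumb is that dynamic power scales as C·V²·f, and higher clocks generally need higher voltage too. A quick sketch with made-up voltage numbers (purely illustrative, not real Northwood/Prescott figures):

/* Rough dynamic-power scaling: P ~ C * V^2 * f.
 * Clocks and voltages below are hypothetical, just to show the trend. */
#include <stdio.h>

int main(void)
{
    double f1 = 3.0, v1 = 1.30;   /* hypothetical baseline clock/voltage */
    double f2 = 4.5, v2 = 1.45;   /* hypothetical faster part            */

    /* Same switched capacitance assumed, so relative power is
     * (V2^2 * f2) / (V1^2 * f1). */
    printf("relative power: %.2fx\n", (v2 * v2 * f2) / (v1 * v1 * f1));
    return 0;
}

So a 1.5x clock bump can easily cost close to 2x the power, before you even count leakage.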
When you have a branch misprediction you can flush the pipe once the misprediction is determined. I don’t know enough about the P4 to say whether that determination is made in the last pipeline stage or not. If it is made earlier then you wouldn’t have to flush the whole pipe. But practically speaking you’ll probably need to go grab the new destination from memory (or hopefully cache) so you’re still waiting for a bit.
It’s not a question of throwing away old code, but more a matter of new code that is naturally parallel to some degree being written in CSP-like languages. Several C/C++ dialects that support Occam’s Par, Alt and Seq primitives, such as HandelC, SystemC, PrecisionC and others, and even JavaCSP and JavaPP, are available, and of course all the Hardware Description Languages are already there, though with a much smaller HW audience.
In many cases only small parts of a program need to be rearchitected to remove artificial sequencing. Of course you can parallel-program to death; it won’t make any real difference if it all runs on one CPU, even HT-enabled, and right now real multi-CPU systems are woefully expensive because they are not based on the message-passing model. An N-way x86 box costs way more than N times one single-CPU unit because these chips were never really intended to do that.
I know that after the Transputer’s initial success with Occam as its native language, all(?) UK university CS students took courses in Occam, which probably put them off more than it helped because it was so lexically weird. It was more assembler-like than high-level. Also, the Transputer family was not updated in a timely way, so the HW to run parallel programs started to fall behind the sequential stuff we have today. The Transputer appeared when silicon still had another 100x of performance up its sleeve, but that time has really already run out. Intel & co. are really using transistor counts more than frequency to get the HW to fake the performance.
I think it’s also a matter of momentum: Intel has the ball rolling at a rate that seems fast to many people, certainly for mere desktop use. But for really compute-challenging problems (bioinformatics etc.), it is woefully inadequate.
Intel’s SSE changes are minuscule compared to what is needed to exploit true multithreading. Most new CPUs are gradually getting the threading part but not the process communication & scheduling part, so all one can do is run multiple unrelated processes, which also split the cache (ouch), which often cancels out any gains made with HT. Transputer-like CPUs would most likely have sibling processes sharing the cache, so at least they would be pulling together at the same time, hiding the latencies without really thrashing the cache.
Those who want to do parallel programming in the C/Fortran world might also be using MPI or PVM, which share a lot of the ideas of Occam (message passing and so on), but these look pretty poor to me compared with the possibility of processes switching in 0 or <20 cycles versus OS-level swaps costing thousands of cycles.
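For anyone who hasn’t seen the style, here is a bare-bones sketch (my own toy example, not Transputer code and not any particular library’s API) of Occam-style channel communication done in plain C with pthreads; two processes are connected only by a blocking rendezvous channel:

/* Toy one-slot rendezvous channel: an Occam-ish "ch ! x" / "ch ? x" in C.
 * Purely illustrative; a real CSP library or a hardware scheduler does
 * this far more cheaply. Compile with e.g. gcc csp_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int value;
    int full;               /* 1 when a value is waiting to be received */
} channel;

static void chan_send(channel *ch, int v)   /* ch ! v */
{
    pthread_mutex_lock(&ch->lock);
    while (ch->full)
        pthread_cond_wait(&ch->cond, &ch->lock);
    ch->value = v;
    ch->full = 1;
    pthread_cond_broadcast(&ch->cond);
    pthread_mutex_unlock(&ch->lock);
}

static int chan_recv(channel *ch)           /* ch ? v */
{
    int v;
    pthread_mutex_lock(&ch->lock);
    while (!ch->full)
        pthread_cond_wait(&ch->cond, &ch->lock);
    v = ch->value;
    ch->full = 0;
    pthread_cond_broadcast(&ch->cond);
    pthread_mutex_unlock(&ch->lock);
    return v;
}

static channel ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

static void *producer(void *arg)
{
    int i;
    for (i = 0; i < 10; i++)
        chan_send(&ch, i * i);              /* send squares: 0, 1, 4, 9, ... */
    chan_send(&ch, -1);                     /* end-of-stream marker */
    return arg;
}

int main(void)
{
    pthread_t t;
    int v;
    pthread_create(&t, NULL, producer, NULL);
    while ((v = chan_recv(&ch)) != -1)      /* consumer process */
        printf("got %d\n", v);
    pthread_join(t, NULL);
    return 0;
}

Every send/receive here goes through a mutex and possibly an OS-level reschedule, i.e. the thousands-of-cycles overhead I was complaining about; a Transputer-style hardware scheduler did the equivalent rendezvous in a handful of cycles.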
Only time will tell. I can’t see Intel pushing the clock another 30x and the transistor count another 100x just for another 30x speed bump. I can see low-cost, slower, cooler parallel CPUs reappearing because they just make more sense, even for a desktop user.
For another glimpse of how Intel is gradually getting it and coming back to the Transputer way of doing things, look out for PCI Express, a point-to-point PCI replacement that does the same thing over serial 2.5GHz differential pairs versus the large connector we are all used to. The only difference in a future large PCI network is that Intel insists on only one CPU being at the root of the PCI tree, where the Transputer would have a CPU in every node.
The good stuff always gets reinvented a few times before finally taking over!
Intel screwed up BAD
why not just enable the X86-64 part of the processor?
then they could compete with AMD (ya right!)
that is the ONLY reason SO MUCH of the die is “disabled”
not a good design IMHO
8-way OPTERON here I come!!!~
It’s now less expensive to have a pair of very high speed buffers than a bunch of slower buffers. The same applies to connectors.
To keep signal integrity (impedance matching, differential current pairs), high-speed buses mandate point-to-point connections.
These point-to-point connections lead toward parallel systems (link-based, like Transputers years ago), even for desktop computing. Instead of a common bus shared by:
– Network / Video / Main CPU / HD controller
you can build crossbar switches and intelligent peripherals allowing truly simultaneous operation between, for example, Network <-> Video and HD <-> CPU, and, more than that, several parallel CPUs.
There is much work on the software side to handle this.
You should try to read the x86-secret.com review. I know it’s in French, but they made some interesting finds, like Prescott’s performance being slow not because of the longer pipeline but because of the cache latency (slower than Northwood), and some transistors not being used right now (like HyperThreading in the first P4). The review is far better than the Anandtech review.
RE: Rayiner Hashem
I agree with your comments on the RISC set. I used to work on the Power line of boxes. Ugh. I would add, though, that the Alpha seemed to be a little better than MIPS as a pure RISC CPU, the key factor being that the Alpha outlawed any instructions that would kill the cache in a multi-CPU environment.
On benchmarking tests:
The Alpha is known for a quirk where, if you insert NOPs after a branch and before the next branch, it will speed the code up (sort of a manual pipeline reset). Has anyone tried this on an Intel chip?
For the chip engineers out there: is there a maximum-performance calculation that can be run for a chip? Sort of a perfect academic design for a chip, so that the efficiency per clock cycle of a real chip could be measured against the ideal. I was thinking that a perfect chip should be able to perform one mul per 4 clock cycles (load, load, mul, store), and then you compare how a given chip does against that.
Donaldson
hmm... I’m still not clear on one thing;
if both the shorter and longer pipelines suffer the same [in real time, not clock cycles] due to a branch misprediction, or any other reason for which the whole pipeline has to be refreshed..
we can ignore that for a moment;
assume the pipe is full;
now since at the end of each clock cycle, the operation performed by the last pipeline stage is completed and thus completes one instruction, it is free to process the same stage for another instruction.
thus at every cycle, one instruction is completed, after the pipeline is full!!;
in which case, instructions per cycle should be the same!!
and since cycles/sec is higher for a longer pipeline, its performance should be higher
====================================
acknowledging another post…
yes, power consumption is a factor; I completely forgot about that, thank you;
cheers
ram
hmm... I’m still not clear on one thing;
if both the shorter and longer pipelines suffer the same [in real time, not clock cycles] due to a branch misprediction, or any other reason for which the whole pipeline has to be refreshed..
—
They don’t. If you have to flush the whole pipeline, the 20-stage CPU pays 20 clocks and the 10-stage CPU pays 10 clocks. But you can’t necessarily clock a 20-stage CPU 2x as fast as a 10-stage CPU. So the 20-stage CPU pays more *in real time* than the 10-stage CPU.
now since at the end of each clock cycle, the operation performed by the last pipeline stage is completed and thus completes one instruction, it is free to process the same stage for another instruction.
—
One clarification: since the pipeline is linear, the last stage is free to process the *next* instruction in the pipeline, not any other instruction. The problem comes not at the end of the pipeline, but at the beginning. Say the pipeline is full. As each instruction is completed, another is put into the pipeline. However, assume you put a branch instruction into a full pipeline. In 20 clock cycles, the full pipeline will be emptied, and 20 instructions will be completed. But what happens for the *next* 20 clock cycles? Since which instructions get put at the beginning depends on the result of the branch instruction, as long as the branch instruction is in the pipeline, no others can be put in behind it. So in the next 20 clock cycles, only the branch instruction is processed, because the stages behind it in the pipeline are all empty.
From the looks of things, although the Prescott has 11 more stages than the Northwood, due to some secret unknown tweaks or whatnot that Intel has done to the core, it more or less performs the same if not faster at 3.2GHz. Like the reviews say, and I believe this will happen, once the Prescott moves up to 3.4GHz it should have a pretty good percentage gain over the Northwood P4EE, even though the EE has a bigger cache than the Prescott.
I think people hear 31-stage pipeline and freak out. This new chip should perform really well once it moves over to its new home of Socket 775. We saw this with the original P4s also, plus the fact that much of the core is secret and probably unused at this point says something.
Intel could unlock something in a few months and we could all be left with our mouths hanging open wondering how the chip just got faster all of a sudden.
Anything is possible, and since Intel is keeping quiet about how Prescott’s insides work, we don’t have the whole picture.