“A Brief Look at the PowerPC 970” explains what the new IBM CPU is all about and how it stands against the x86 competition today and a year from now (release time). Another article is titled “When is PowerPC Not PowerPC?”. On ExtremeTech you will find “AMD Tips Opteron Benchmarks”. There are two articles at EETimes: “Intel describes billion-transistor four-core Itanium processor” and “Intel to debut 90-nm ‘Banias’ processor in 2H ’03”. Of embedded interest: “MemoryLogix to disclose ‘586 core’ for SoC applications”, a CPU to compete with ARM.
That info about the XFire thing is quite a relief. Apparently, even though Opteron is kinda-NUMA-ish, AMD implemented a method to read from other processors quickly, so in a two way system, one proc can read from another’s memory at 7.02 GB/sec. That means little need to design NUMA-friendly OS kernels.
Power, PowerPC, Amazon… so many architectures, so much money to pay for consultants. SPARC is easy: V7, V8 and V9… 32-bit, 32-bit and 64-bit… all binary compatible.
Well, I just hope AMD doesn’t delay the Opteron any more. It could be amazing if it comes out middle of next year, but we never know. It seems middle of the line in terms of IPC and clock: more clock than the PPC 970, POWER4 and Itanium 2, and much higher IPC than the Pentium 4.
If the PPC 970 is all Apple has to look forward to, things don’t look really good for them. It’s nice that they have a better chip, but it may be much too late for them. If they just focus on their strengths and not on BOGUS photoshop filter benchmarks, they’ll do fine.
The Itanium article reminds me a lot of POWER4, because isn’t that kinda what IBM already did with POWER4? In the meantime, Itanium 2 might finally find some market share.
I can’t wait to try compiling Gentoo on a clawhammer!
First, I refuse to use the absurd GigaProcessor Ultra-Lite name. Henceforth, the GPUL is redubbed the P4L. That said, the P4L seems to basically vindicate Intel’s basic design strategy.
1) It maps PPC to its own internal instruction set, allowing IBM to change the innards of the CPU without affecting compatibility. Heh heh, RISC comes crawling back…
2) It doesn’t try to be super-wide like Itanium. If speculation is correct and the P4L does have 2 AltiVec units, then it will indeed be a little wider than the P4, but it’s not a giant difference. This recognizes the fact that most desktop code just doesn’t parallelize that well, and most compilers (especially GCC) aren’t cut out to auto-vectorize programs.
3) It features an extremely long pipeline. Depending on which PPC chip the ExtremeTech article is talking about, the P4L’s pipeline will be anywhere from 15 (3x as long as the G4) to 21 (3x as long as the G4e) stages long. This recognizes the fact that a lot of media desktop style code is straightforward processing (for example, a bunch of matrix multiplies, in the case of 3D) and isn’t as branch-heavy as most integer code. Because integer performance isn’t really a number one priority on desktops these days (most integer-using apps run fast enough), this tradeoff allows IBM to jack up the clock speed (which accelerates media stuff greatly) without many side-effects. Given IBM’s great fabrication technology (which, unlike Motorola’s, is comparable to Intel’s), I’d bet that the P4L will scale quite well to high clock-speeds. There might be some initial disappointment with the performance of the 1.8 GHz models, but after IBM makes sure the design is solid, I think they will be able to speed it up, just like Intel did with the P4. This long-pipeline design realizes that clock-speed is king, at least for desktop style code.
4) It has fairly good memory bandwidth. The 450MHz DDR bus is good for about 6.4 GB/sec after messaging overhead is accounted for. This compares well to the 4.2 GB/sec of bandwidth in a P4 and the 6.4 GB/sec+ bandwidth in an Opteron. However, the P4L bus isn’t as flexible, because it is composed of two unidirectional 32-bit links instead of one bi-directional 64-bit link. Thus, you can only get 3.2 GB/sec in any one direction. Also, the Opteron’s Hypertransport SMP mesh allows bandwidth to scale better with more processors.
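The bandwidth figures in point 4 can be checked with a little arithmetic. The sketch below assumes the bus moves data twice per clock (DDR) over two 32-bit one-way links, as the post describes; the overhead fraction is not a published number, it is simply what the quoted 6.4 GB/sec implies relative to the raw rate.

```python
def raw_bandwidth_gb_s(clock_mhz, transfers_per_clock, width_bits):
    """Raw link bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# One 32-bit unidirectional link at 450 MHz, double data rate.
per_direction_raw = raw_bandwidth_gb_s(450, 2, 32)

# Two such links, one in each direction.
total_raw = 2 * per_direction_raw

# Figure quoted in the post after messaging overhead.
quoted_total = 6.4

# Overhead implied by the quoted figure (an inference, not a spec value).
implied_overhead = 1 - quoted_total / total_raw

# Usable bandwidth in any single direction.
per_direction_effective = quoted_total / 2

print(f"raw per direction: {per_direction_raw:.1f} GB/s")
print(f"raw total:         {total_raw:.1f} GB/s")
print(f"implied overhead:  {implied_overhead:.1%}")
print(f"usable one-way:    {per_direction_effective:.1f} GB/s")
```

On these assumptions the raw rate is 3.6 GB/sec each way, so the quoted 6.4 GB/sec total corresponds to roughly 11% lost to messaging, and the 3.2 GB/sec one-way limit follows directly from splitting the bus into two unidirectional halves.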
5) At ~40 watts, it uses almost as much power as a modern x86 chip. No more sissy embedded CPUs in here!
“1) It maps PPC to its own internal instruction set, allowing IBM to change the innards of the CPU without affecting compatibility. Heh heh, RISC comes crawling back…”
You mean like the way in which CISC admitted its inefficiencies and started to incorporate more RISC technology?
(Mr Hashem is a forum troll)
“3) It features an extremely long pipeline. Depending on which PPC chip the ExtremeTech article is talking about, the P4L’s pipeline will be anywhere from 15 (3x as long as the G4) to 21 (3x as long as the G4e) stages long. This recognizes the fact that a lot of media desktop style code is straightforward processing (for example, a bunch of matrix multiplies, in the case of 3D) and isn’t as branch-heavy as most integer code.”
The implication here is that the PPC 970 (GPUL) has a penalty for branch mispredicts, etc.
What you’re forgetting, don’t understand or simply wish to put out of mind is the fact that IPC is *increased* from the current G4. The 970 will now fetch 8 instructions per clock, and retire 5 per clock.
The current G4 IIRC fetches either 3 or 4 per clock.
The 970 will also have a very, very nice branch prediction scheme. The POWER4 uses a total of 3 branch predictors to the Intel P4’s one. The 3rd table weighs the comparative performance of the first two tables to achieve the highest possible correct branch prediction.
In addition, the PowerPC architecture includes a static branch prediction bit for branching instructions, which allows the compiler to “hint” to the processor the likely branch; the x86 architecture has no equivalent feature.
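The three-table arrangement described above is usually called a tournament predictor. The following is a minimal generic sketch of that idea, not the POWER4 or 970 design: the table sizes, the 2-bit counters, and the indexing by `pc % size` are all assumptions chosen for illustration.

```python
def bump(counter, up):
    """Saturating 2-bit counter update, clamped to the range 0..3."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    """Generic three-table scheme: local table, global table, chooser."""

    def __init__(self, index_bits=10):
        self.size = 1 << index_bits
        self.local = [1] * self.size    # indexed by branch address
        self.global_ = [1] * self.size  # indexed by recent branch history
        self.chooser = [1] * self.size  # >= 2 means "trust the global table"
        self.history = 0                # last index_bits outcomes, newest in bit 0

    def predict(self, pc):
        i = pc % self.size
        local_pred = self.local[i] >= 2
        global_pred = self.global_[self.history] >= 2
        return global_pred if self.chooser[i] >= 2 else local_pred

    def update(self, pc, taken):
        i = pc % self.size
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.global_[self.history] >= 2) == taken
        # The chooser only moves when exactly one of the two predictors was right.
        if local_ok != global_ok:
            self.chooser[i] = bump(self.chooser[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.global_[self.history] = bump(self.global_[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) % self.size


def run(pattern, pc=0x400):
    """Feed a list of outcomes for one branch; return per-branch correctness."""
    predictor = TournamentPredictor()
    results = []
    for taken in pattern:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


# A strictly alternating branch defeats a lone 2-bit counter,
# but the history-indexed table learns it and the chooser switches over.
alternating = [n % 2 == 0 for n in range(100)]
correct = run(alternating)
print("correct in last 50 branches:", sum(correct[50:]))
```

The point of the third table is visible in `update`: it is trained only on cases where the two predictors disagree about the outcome, so each branch gravitates toward whichever predictor has been more reliable for it.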
In short, branch misprediction will occur less often with the 970 for the above reasons. In addition, the “tripling” of the G4 pipeline in the 970 is still shorter than Intel’s 20-stage P4.
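The tradeoff being argued here (longer pipeline versus better prediction) can be put into the standard textbook cost model, where each mispredicted branch adds a fixed number of stall cycles. Every number below is hypothetical and picked only to show the shape of the argument; none of them are measured figures for the G4, 970, or P4.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are added.

    base_cpi        -- CPI with perfect branch prediction
    branch_fraction -- fraction of instructions that are branches
    mispredict_rate -- fraction of branches predicted wrongly
    penalty_cycles  -- cycles lost per misprediction (grows with pipeline depth)
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical short pipeline with a mediocre predictor.
short_pipe = effective_cpi(1.0, 0.20, 0.10, 7)

# Hypothetical long pipeline with a better predictor.
long_pipe = effective_cpi(1.0, 0.20, 0.04, 16)

print(f"short pipeline: {short_pipe:.3f} CPI")
print(f"long pipeline:  {long_pipe:.3f} CPI")
```

With these made-up inputs the deeper pipeline comes out slightly ahead per clock, which is the claim in the post: a large enough drop in the mispredict rate can cancel the larger per-miss penalty. Different inputs can just as easily flip the result, so only real benchmarks settle it.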
IBM has *always* been conservative about what not-quite-ready chips will do as far as clock, and benchmarks. I expect “Real World” performance to be quite good.
Coupling all this with a quick move to a .09 process shows me that this 970 chip has legs. Another thing…
(Mr Hashem is a forum troll)
Even at around 40 watts, this chip would run cooler and cheaper than a comparable x86 chip. That said, even if it does suck power like there’s no tomorrow, Motorola and IBM’s lower-powered PPCs are not going away.
If Apple does use this processor, it does not by any stretch of the imagination mean Apple will be dumping Motorola as a supplier of chips. In fact, if the 970 pushes the G4 down into the rest of the line, including the iBook, Motorola will likely end up selling Apple more processors than they are right now.
Check it out:
http://slashdot.org/comments.pl?sid=42486&cid=4462733
Either GB is the slashdot user Visigothe, or he/she enjoys cutting and pasting between online forums and pretending that it’s his/her own words. Now that we’ve got that out of the way…
Rayiner Hashem:
That said, the P4L seems to basically vindicate Intel’s basic design strategy.
1) It maps PPC to its own internal instruction set, allowing IBM to change the innards of the CPU without affecting compatibility. Heh heh, RISC comes crawling back…
Gill Bates:
You mean like the way in which CISC admitted its inefficiencies and started to incorporate more RISC technology?
(Mr Hashem is a forum troll)
Actually, that’s PRECISELY WHAT HE SAID. IBM’s approach vindicates Intel’s approach of decomposing CISC into smaller pieces, because that’s exactly what they are doing to support the 3 different ISAs that the P4L will incorporate. This chip isn’t going to be RISC, despite what you might think, because the three ISAs involved mean that it will have over 350 instructions.
I think that your inability to read, combined with your cutting and pasting of other people’s (informed and insightful) words as your own (ignorant and stupid) words, makes YOU the troll, holmes.
And Gill claims
One thing I have not read yet is whether the 970 “altivec” SIMD will support 64-bit floating point. The current G4 altivec unit supports only 32-bit floating point, which makes it useless for many scientific and engineering applications.
I’m a little disappointed to see the 40-watt power requirement. One advantage of the PPC chips, especially for laptop and for multiprocessor 1U rack mount units, has been their low power consumption. For comparison, current P4s at higher clock rates require >70 watts, which is a problem for 1U rackmount and laptop machines. Maybe the .09 process can get the power requirements back down to the 15-30 watt range of the current G4s.
“One thing I have not read yet is whether the 970 “altivec” SIMD will support 64-bit floating point. The current G4 altivec unit supports only 32-bit floating point, which makes it useless for many scientific and engineering applications.”
Doesn’t need it. The POWER4’s FPUs will be equally fast.
“I’m a little disappointed to see the 40-watt power requirement. One advantage of the PPC chips, especially for laptop and for multiprocessor 1U rack mount units, has been their low power consumption. For comparison, current P4s at higher clock rates require >70 watts, which is a problem for 1U rackmount and laptop machines. Maybe the .09 process can get the power requirements back down to the 15-30 watt range of the current G4s.”
They’ll maybe get it down, but I doubt they’ll get it to that level – then again, given IBM’s engineers…
The lower wattage is, however, a big advantage, especially given that P4 and Hammer will both be significantly higher by the time the 970 is launched.
However, MPR, who run the Microprocessor Forum, have just released their latest x86 report (got a few $000?) and one thing it predicts is that Intel will run into serious heat problems in 2004. IBM, however, won’t until much later. One of the small advantages of RISC may actually turn out to be its greatest.
I expect something similar to the P4 story:
At launch P4 was actually slower than the top P3 chips, however Intel rapidly moved to a new process and speeds have been soaring upwards ever since. I fully expect the 970 to appear at over 2GHz within 6 months of launch in which case it will quite possibly give IBM a clear lead over x86.
The only news story that interests me is that one about Banias. Oh well…
1) IPC is not necessarily increased. Issue bandwidth is increased. The chip can issue 8 instructions per clock cycle. Even in code that runs entirely out of L1 cache and consists of exactly 2 integer, 2 floating point, 2 load/store, and 1 comparison each cycle (because that maps directly to the available execution units), the processor will still not get 8 instructions per cycle, because a) the 8th unit is a branch unit and branching each cycle will kill you, and b) the processor can only retire 5 instructions per cycle. The 8 instruction issue bandwidth is there to allow the processor to have bursts of activity without issuing becoming a bottleneck, not for sustained 8 instructions per cycle computing.
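Point 1 amounts to saying that sustained throughput is capped by the narrowest stage. A toy simulation makes this concrete. It is heavily idealized (no branches, no cache misses, every in-flight instruction is ready to retire immediately), and the `window` size of 100 in-flight instructions is a hypothetical value, not a figure from the articles.

```python
def simulate(front_end_width=8, retire_width=5, window=100, cycles=1000):
    """Return average instructions retired per cycle in an idealized pipeline.

    front_end_width -- instructions that can enter the window per cycle
    retire_width    -- instructions that can leave the window per cycle
    window          -- maximum instructions in flight (hypothetical)
    """
    in_flight = 0
    retired_total = 0
    for _ in range(cycles):
        # The front end can only fill whatever space is left in the window.
        fetched = min(front_end_width, window - in_flight)
        in_flight += fetched
        # Retirement is capped by its own width and by what is in flight.
        retired = min(retire_width, in_flight)
        in_flight -= retired
        retired_total += retired
    return retired_total / cycles


sustained = simulate()
narrow_front_end = simulate(front_end_width=3)

print("8-wide front end, 5-wide retire:", sustained)
print("3-wide front end, 5-wide retire:", narrow_front_end)
```

Even in this best case the 8-wide front end sustains only 5 per cycle; the extra width just fills the window faster, which is the "burst" role described in the post.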
2) It’s true that the P4L has more available resources than the G4, but whether that actually increases IPC depends on the number of cycles between branches and how well the code can be made parallel. In comparison to a P4, the P4L has fewer available integer units, but more available floating point units (especially if an Altivec unit is included).
3) The Pentium 4 also has a very good branch prediction unit. In fact, Intel is famous for the quality of its branch prediction. Also, the Power4L has nothing comparable to the trace cache, which can help cut the cost of branches quite significantly. I’d be willing to say that IBM might have a better overall BPU, but only benchmarks will tell, and it’s almost certain that it won’t completely make up for the long pipeline.
4) A G4e’s pipeline is 7 stages. 7 x 3 = 21 stages, or 1 more than the current P4.
I’m not trolling here. I’m pointing out some design features of the chip and how they confirm Intel’s design decisions with the Pentium 4. The Pentium 4, despite what all the PPC-heads seem to think, is a wonderful chip. Even the Power4, with all its resources and 128 MB of cache (which helps a *lot* on the specfp), is less than 50% faster than the leading Pentium 4 in floating point. For a mass market chip that costs less than $500, that’s quite an achievement. For desktop code, “fast and narrow” simply seems to be a good idea. There is nothing wrong with IBM realizing that as well. As for the P4L, I pointed out that it looks like quite an interesting chip, and with separate AltiVec units, it fixes some of the P4’s architectural shortcuts (single FPU, handling both FP and SIMD) and, if IBM can ramp up the clock-speed, could become quite competitive.
Now, as for SIMD double precision floating point, I hope the P4L has it too. The P4 can do double precision SIMD (two doubles per clock cycle). With the P4L’s two FPUs, the P4L can also do two doubles per clock cycle. Since the P4L probably can’t match the P4’s sheer clock-speed, it will be a lot slower than the P4 for scientific apps. Allowing the AltiVec unit to do double precision SIMD would alleviate that problem.
Harbinjer, you’re from St. Cloud? =)
When I hear that cpus issue 8 & retire 5 or so ops per cycle, I always get really excited, like I want one of those, NOT. My humble 1GHz Athlon made claims like that although I am not sure what the issue/retire no’s are but def >>1.
Well I often test this out by finding my fav innermost loops in C, push the optimiser to the floor & time the damn thing running out of context in a test bench a couple of billion times. Guess I had the wool pulled over my eyes coz I usually see IPC values closer to 1-1.1, sometimes I think it might have touched 2 on occasion like I saw that UFO last night, but I usually assume the K7 toddles along at just 1 IPC. Now the much higher no mentioned in the ref material no doubt makes up for the inevitable branch mispredicts & cache misses, so without 2 or 3 datapaths, the avg IPC might be somewhat <1.
The Alpha also used to make this claim, but studies showed it also averaged around 1.5 IPC IIRC, so I don’t feel too bad about it.
If I’m not mistaken about this 970 chip, I don’t think it was designed with Apple in mind. I believe it’s running an original ppc core, as opposed to the g series cores (barring what they have done to make this new entrant a 32/64 hybrid). If I was outright lying, then BeOS and a few other OS’s would run on the G series chips (even on an eval board, which is non “apple proprietary”) without any modifications whatsoever.
But hey.. the original tip off was the name “POWERPC 970”, a name reminiscent of “PowerPC 604”, but I must digress. If it is a true ppc, then, it was designed for purposes I previously stated almost specifically.
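For anyone wanting to repeat this kind of measurement, the IPC figure falls out of three numbers: how many instructions the loop executed, how long it ran, and the clock rate. The sketch below uses invented inputs; in practice the instructions-per-iteration count has to come from reading the compiler’s assembly output, and the timing from the test bench.

```python
def measured_ipc(instructions_executed, elapsed_seconds, clock_hz):
    """Average instructions per cycle over a timed run."""
    cycles = elapsed_seconds * clock_hz
    return instructions_executed / cycles


# Hypothetical run: 2 billion iterations of a loop whose body compiles to
# 6 instructions, taking 11.0 seconds on a 1 GHz CPU.
iterations = 2_000_000_000
instructions_per_iteration = 6  # assumed; read this off the disassembly
ipc = measured_ipc(iterations * instructions_per_iteration, 11.0, 1.0e9)

print(f"measured IPC: {ipc:.2f}")
```

These made-up inputs land at about 1.09, in the same 1 to 1.1 range reported above, far below any 8-issue or 5-retire headline number.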
<One thing I have not read yet is whether the 970 “altivec” SIMD will support 64-bit floating point. The current G4 altivec unit supports only 32-bit floating point, which makes it useless for many scientific and engineering applications.
Doesn’t need it. The POWER4’s FPUs will be equally fast.>
I read that the 970 might come with two “altivec” SIMD units (but IBM calls them something different). Are you saying that they are not necessary, that the two FPUs will be the same speed? Or are you saying that if they support 64-bit floating point, they would be slower, and the two FPUs would be as fast?
Another related question is whether the 970 FPUs will support the 64-bit floating point merged multiply-add instruction. Some PPC chips support it, some only emulate it. If they do support it, then each FPU will deliver two floating point operations per clock cycle. Combined with the branch segment in the same instruction, this combination would allow single-instruction loops, each performing 4 floating point operations, and the max GFLOP rating would be 4 times the clock rate.
This is just for the FPUs not the SIMD units. The SIMD units would produce either 8 or 16 floating point operations per clock (if they are like the current G4 SIMD), but I still don’t know if they support 64-bit floating point or only 32-bit floating point.
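The peak figures in the last two paragraphs follow from multiplying units by lanes by operations per lane. The sketch assumes a multiply-add counts as two floating point operations and that a 128-bit AltiVec register holds four 32-bit floats; the 1.8 GHz clock is the launch speed mentioned earlier in the thread. Whether the 970 really has two SIMD units is still speculation, so both cases are shown.

```python
def peak_flops_per_clock(units, lanes, flops_per_lane_op):
    """Theoretical peak floating point operations per clock cycle."""
    return units * lanes * flops_per_lane_op


def peak_gflops(flops_per_clock, clock_ghz):
    """Theoretical peak GFLOPS at a given clock rate."""
    return flops_per_clock * clock_ghz


MULTIPLY_ADD = 2  # one multiply plus one add per instruction

# Two scalar FPUs, each issuing one 64-bit multiply-add per clock.
fpu_peak = peak_flops_per_clock(units=2, lanes=1, flops_per_lane_op=MULTIPLY_ADD)

# One or two SIMD units, four 32-bit lanes each.
simd_one_unit = peak_flops_per_clock(units=1, lanes=4, flops_per_lane_op=MULTIPLY_ADD)
simd_two_units = peak_flops_per_clock(units=2, lanes=4, flops_per_lane_op=MULTIPLY_ADD)

print("FPU peak per clock:       ", fpu_peak)
print("SIMD peak, one unit:      ", simd_one_unit)
print("SIMD peak, two units:     ", simd_two_units)
print("FPU peak GFLOPS at 1.8 GHz:", peak_gflops(fpu_peak, 1.8))
```

This reproduces the "4 times the clock rate" figure for the FPUs and the "8 or 16" figure for SIMD. These are ceilings only; sustained rates depend on whether memory and cache can keep the units fed.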
Are you saying that they are not necessary, that the two FPUs will be the same speed?
Yes.
A post above mentioned that without 64 bit vector instrs the 970 would be useless for Sci/Tech work and x86 will outgun it.
This is only true, however, if the algorithms in question are vectorisable and the processor has sufficient bandwidth to keep the vector unit fed. If not, the P4 will fall back to its standard FPU, which is actually pretty weak.
Rayiner Hashem: I’m not trolling here. I’m pointing out some design features of the chip and how they confirm Intel’s design decisions with the Pentium 4.
I have to agree with you here. However, I think PPC-heads (and AMD zealots, like me 🙂 find it hard to accept that P4 is actually a good processor because the first generation of it sucked. Way too expensive, slower than P3s…
Of course now, if I had the money, I would rather buy a top end P4 than anything else. But the most I can afford is an Athlon XP 2000+. (yeah, I’m no longer an AMD freak of nature, but I was).
anonymous205: I read that the 970 might come with two “altivec” SIMD units (but IBM calls them something different).
It may be similar, but I wouldn’t bet my money on it. Altivec (or Velocity Engine for you Apple employees 🙂 has 162 instructions, while this one has 160. It may sound close enough, but in reality, it could be as different as day to night. I guess for now we would never know.
<Are you saying that they are not necessary, that the two FPUs will be the same speed?
Yes.>
How so? The G4 SIMD can initiate 8 32-bit floating point operations each clock cycle. How is the FPU unit going to be as fast for 32-bit floating point?
<A post above mentioned that without 64 bit vector instrs the 970 would be useless for Sci/Tech work and x86 will outgun it.>
If it was my post, then I did not say that the 970 would be useless for sci/tech work without 64-bit SIMD, I said that the 32-bit altivec unit in the current G4 is useless for a large fraction of such work because these applications require 64-bit floating point. The G4 FPU still does pretty well with scalar floating point code, but a 64-bit FP SIMD unit would do much better. I would hope for 8 64-bit FP operations per clock, of course, like the current 32-bit FP unit achieves.
<This is only true, however, if the algorithms in question are vectorisable and the processor has sufficient bandwidth to keep the vector unit fed.>
I agree with this. The scientific and engineering applications I have in mind do vectorize well, but I do not know if the 970 SIMD unit(s) will support 64-bit FP arithmetic or if the memory bandwidth (cache) is sufficient to sustain a 64-bit FP SIMD unit.