“IBM likes to go on and on about the transaction processing power and I/O bandwidth of its System z mainframes, but now there is a new and much bigger kid on the block. Its name is the Power Systems IH supercomputing node, based on the company’s forthcoming Power7 processors and a new homegrown switching system that blends optical and copper interconnects. The Power7 IH node was on display at the SC09 supercomputer trade show last week in Portland, Oregon, and El Reg was on hand to get the scoop from the techies who designed the iron. This server node is the heart of the 20 petaflop ‘Blue Waters’ supercomputer being installed at the University of Illinois.”
In terms of HPC, if you've got the money, I get it.
In terms of transaction processing, however, what makes Power so important? Is it a lack of maturity in clustering and partitioning, legacy apps that need to run faster (instead of, you know, being rewritten), or…?
It uses quite a lot of power. It is easy to throw more and more power at a problem; the difficult thing is to build an efficient solution under constraints. The new Fujitsu eight-core SPARC Venus delivers the same GFLOPS, clock for clock. The Venus has maybe half the clock speed, but consumes much less power: the Power7 uses 200W or more, while the half-clocked Venus uses 50W. Double the Venus clock speed and it would still use only about 100W, yet deliver the same GFLOPS.
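Back-of-the-envelope sketch of that claim, assuming dynamic power scales roughly linearly with clock at a fixed voltage (P ≈ C·V²·f); in practice the voltage usually has to rise along with the clock, so this is optimistic. The wattage figures are just the ones quoted above, not measurements:

    # Toy model: dynamic power scales linearly with frequency at fixed voltage.
    # The 50W figure is the one claimed above, not a measured number.
    def scaled_power(base_watts, base_ghz, target_ghz):
        """Scale power with clock, everything else held equal."""
        return base_watts * (target_ghz / base_ghz)

    print(scaled_power(50, 2.0, 2.0))   # Venus at half clock:   50.0 W
    print(scaled_power(50, 2.0, 4.0))   # Venus clock doubled:  100.0 W (vs 200W+ for Power7)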
But the biggest drawback is that the CPU is clocked much higher than RAM. That big difference in speed penalizes cache misses severely.
Studies from Intel show that a typical 2.5GHz x86 server idles 50% of the time UNDER FULL LOAD. This is due to cache misses: if the L2 cache does not have the data, the CPU has to wait for RAM to deliver it. The CPU is far too fast compared to RAM, and the higher the CPU is clocked, the greater the penalty.
A 2.5GHz CPU gives 50% utilization under full load. A 4GHz Power7 gives maybe 25-33% utilization under full load? There will be far more waiting for data with such a highly clocked CPU.
There is NO way to increase utilization, short of the cache having extrasensory perception and delivering data before the CPU even knows it needs it. Psychic CPUs are impossible.
So a 4GHz Power7 under full load maybe corresponds to a 1GHz CPU that never suffers from cache misses. A server workload with thousands of clients is too big to fit into an L2 cache; the cache would need to be several GB. A 1GHz CPU is quite bad, in my opinion.
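A minimal sketch of that reasoning, with an assumed 100ns round trip to RAM and an assumed miss rate picked so the 2.5GHz case comes out near the 50% Intel figure; the exact numbers are made up, the point is only that the same memory latency costs more cycles at a higher clock:

    # Toy model: utilization = compute cycles / (compute + stall cycles).
    MISS_PENALTY_NS = 100.0    # assumed round trip to RAM
    MISSES_PER_INSTR = 0.004   # assumed, picked to give ~50% at 2.5GHz
    BASE_CPI = 1.0             # ideal cycles per instruction, no misses

    def utilization(clock_ghz):
        stall_cycles = MISSES_PER_INSTR * MISS_PENALTY_NS * clock_ghz
        return BASE_CPI / (BASE_CPI + stall_cycles)

    for ghz in (1.0, 2.5, 4.0):
        print(f"{ghz} GHz -> {utilization(ghz):.0%} utilization")
    # 1.0 GHz -> 71% utilization
    # 2.5 GHz -> 50% utilization
    # 4.0 GHz -> 38% utilization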
BTW, Sun's Niagara SPARC CPU reaches 90% utilization. This is due to its unique construction; no other CPU reaches numbers that high. It is achieved because, as soon as a thread hits a cache miss, the core switches to another thread in a single clock cycle and keeps working while it waits for the L2 data. The core never idles; it always has other threads to work on. Normally, switching threads takes a very long time, but the Niagara does it very fast and can stay busy all the time. This is revolutionizing, and the reason a 1.4GHz Niagara corresponds to a normal 3GHz CPU under full load: a 3GHz CPU reaches maybe 50% utilization, which is 1.5GHz of effective work, just like the Niagara. But the Niagara has lots of cores running lots of threads; in the best circumstances, one core can execute 8 threads simultaneously.
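A toy illustration of why that works, assuming a made-up miss latency and run length per thread; a single-threaded core stalls for the whole miss, while a Niagara-style core just picks another ready thread:

    # Toy model: a core that switches hardware threads on every cache miss.
    MISS_LATENCY = 20            # assumed cycles waiting on RAM
    RUN_LENGTH = 5               # assumed instructions issued before each miss

    def busy_fraction(hw_threads):
        """Fraction of cycles the core retires an instruction."""
        work_per_round = hw_threads * RUN_LENGTH    # cycles of useful work per round
        thread_cycle = RUN_LENGTH + MISS_LATENCY    # one thread's work plus its wait
        # The core only idles if no other thread has work ready.
        return work_per_round / max(thread_cycle, work_per_round)

    print(f"1 thread : {busy_fraction(1):.0%}")   #  20% busy
    print(f"8 threads: {busy_fraction(8):.0%}")   # 100% busy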
I was wondering how long I would have to wait until a Sun fanboy showed up…
Big FLOPS and IPS numbers you can throw around don’t relate to real work.
All those threads are well and good, but then you get less performance out of plain in-order integer code. You also sacrifice latency for throughput with Niagara, making it less desirable for some uses. That's what I'm asking about: what makes such a fast, huge CPU better in the case of transaction processing, where you can't utilize all those number-crunching units?
They might not relate to your work but there’s plenty of people who do need high FP numbers.
POWER is sold into a number of markets, each with different requirements, you’ll find there’s different features present for the different markets. There’s also things that simply don’t exist in x86 land like Decimal Floating Point and RAS (Reliability, Availability, Serviceability) stuff which a big government dept or a bank will want but isn’t much use for anyone else.
As for transaction processing, this isn’t exactly a weak point of POWER, it is often the leader (for the large boxes).
These sorts of features aren't unique to POWER; all the big iron vendors have similar things.
“They might not relate to your work but there’s plenty of people who do need high FP numbers.”
Being able to theoretically reach some arbitrary number, in an ideally coded program (which your problems may not map to), with ideally sized data sets, is largely marketing. My point was not about needing that kind of capability, but that such numbers on spec sheets are not the same as harnessing the hardware to get close to them with real code.
OK, if it is _so_ powerful, as powerful as the unreleased Power7, cheap, and efficient, why are there only 2 (two) SPARC machines in the Top500 list (quad-core SPARC)?
4GHz Power7 leads to maybe 25-33% utilization under full load?
Where did you get this number? The P7 has 4 HW threads & 32MB of on-chip cache, so utilization should be pretty good.
This is revolutionizing, and the reason a 1.4GHz Niagara corresponds to a normal 3GHz CPU under full load.
A slow SPARC in-order core @ 1.4GHz will never catch an OOO x86 core, regardless of the amount of "fanatic power".
"OK, if it is _so_ powerful, as powerful as the unreleased Power7, cheap, and efficient, why are there only 2 (two) SPARC machines in the Top500 list (quad-core SPARC)?"
The reason is that the SPARC Venus is also not released yet. Another reason is that the Venus is clocked lower than the Power7. If you clock the Venus equally high, it will give the same performance while using half the wattage. My professor said about IBM: "they use a steam engine, and if they need more performance, they just tack on another steam engine", implying that their solutions are neither efficient nor beautiful nor clever. Just throwing more resources at the problem is IBM's approach. Like government. Instead I like new, radical, efficient approaches. New clever solutions, like Sun's: look at ZFS, DTrace, Niagara, SMF, etc., and compare them to other solutions. New, clever, revolutionizing. They challenge old solutions.
Another reason: you should not rely on the Top500. That list says nothing about the power of the CPU. Look at position no. 6: Blue Gene. It uses 750MHz PowerPC CPUs!!! Does this mean that the 750MHz chip is the sixth fastest in the world? No. The list says nothing about per-CPU performance. Instead, supercomputer CPUs are chosen because they consume less power, among other merits. Power consumption is a big concern for supercomputers. Study supercomputers more, and you will see they face other difficulties.
“4GHz Power7 leads to maybe 25-33% utilization under full load?
Where did you get this number? The P7 has 4 HW threads & 32MB of on-chip cache, so utilization should be pretty good."
I only guessed, which is evident if you reread my statement: I use the word "maybe" and finish with a question mark. However, I KNOW that Intel says a 2.5GHz x86 server idles 50% of the time under FULL load. Under heavy load it idles half the time, because of cache misses. The gap between RAM and CPU speed is so big that when the CPU needs data, it has to wait forever. What happens if you crank the CPU speed up even more? Then utilization drops further, from 50% down to 33% or even lower. This is simple logic. The only way to increase utilization would be if the cache could foresee the future and deliver data beforehand; otherwise utilization drops.
This is for SERVER workloads, where the working set is too big to ever fit into cache. Imagine thousands of clients and the load they generate: they take turns and do different things, and you also need to cache AIX and other kernel code. The cache would need to be several GB to hold all of that. Instead, the cache on a server is seldom reused; it swaps client data in and out all the time, because the clients access different data all the time, and performance degrades severely. This is also common sense, if you have studied CPUs. Or do you have any objections?
It is also true that the Niagara does NOT stall on cache misses the way other CPUs do: it works on another thread while waiting for data from RAM. So it can keep utilization as high as 90% or even more! That is the reason a 1.4GHz Niagara is three times faster than a 4.7GHz Power6 on certain server benchmarks. See the Siebel v8 benchmarks in white papers from IBM and from Sun and compare: one Sun T5440 with four 1.4GHz Niagaras is twice as fast as three IBM Power 570 servers with 14 CPUs in total, using 4.7GHz Power6 chips.
"This is revolutionizing, and the reason a 1.4GHz Niagara corresponds to a normal 3GHz CPU under full load.
A slow SPARC in-order core @ 1.4GHz will never catch an OOO x86 core, regardless of the amount of 'fanatic power'."
Bullshit. The 1.4GHz Niagara holds many world records right now. Don't you know? See below. How can it hold many world records if it "never catches" the other CPUs? And how could it be so fast at 1.4GHz if it were not using new solutions and new technology?
http://www.sun.com/servers/coolthreads/t5440/performance.xml
One company migrated 251 dual-CPU Dell Xeon Linux servers running 700 instances of MySQL down to 24 Sun T5440 machines. How could that happen if the Niagara were not extremely fast for some workloads?
The Power6 is like a Porsche: it transports 2 people in 10 minutes. The Niagara is like a big, slow bus: it transports 100 people in 30 minutes. Which vehicle transports 10,000 people fastest? For big server loads, the Niagara owns. For small, few-threaded work, Power6/7 owns. Even Sun admits this.
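The arithmetic behind the analogy, using the numbers as given (and ignoring return trips, which would hurt the Porsche even more):

    # Throughput vs latency, numbers straight from the analogy above.
    people_to_move = 10_000
    porsche_minutes = (people_to_move / 2) * 10     # 2 people per 10-minute trip
    bus_minutes     = (people_to_move / 100) * 30   # 100 people per 30-minute trip
    print(porsche_minutes, bus_minutes)             # 50000.0 vs 3000.0 minutes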
Duuuuude, where did you get those utilization numbers from? Given the amount of cache, and the aggressive nature of the predictors in the P7, you are like 1 order of magnitude off in your estimates of utilization.
Those things get upwards of 90% utilization under full load.
Hurrrr. I wish someone would do a new Power/PowerPC desktop machine.
x86 and x86_64 are faster, more efficient, and offer more overall hardware choices. I could see wanting ARM desktops, but why PPC?
For the heck of it.
Which is why the PPC G4, which was very low-power and efficient, kicked the arse of Intel and AMD in performance-to-power-consumption ratio? Yes, the Core 2 and Athlon 64 were faster in terms of raw computing power, but they sucked up a hell of a lot more energy too. I've yet to see an x86 chip that can match the power-to-performance ratio of the PPC; the Atom is a joke by comparison.
An ARM desktop might be nice for some uses, especially when the Cortex A9 comes around. Current ARM chips, however, seem a bit under-powered for what many people use their desktops for. Current ARM chips would be ideal for netbooks or other small devices: they're powerful enough to handle mobile computing and yet use very little power, making battery life amazing.
I could see a quad A9, with a nice GPU tacked on (a new PowerVR something-or-other?), being good enough for an overwhelming majority of desktop needs (as long as Windows is not involved).
The x86 chips that will beat it will be the next-generation SFF-oriented ones (the Atom is around even, though it needs Intel's 45nm process to manage that), largely because the only new general-purpose PPCs out there are whatever Freescale works on.
For every good solution you can think of, Intel can throw billions of R&D dollars at the problem and make it use x86 instead. AMD will then follow, with their own features that vaguely look like something an Alpha had, or that a future Alpha was planned to have. Rinse and repeat.
All the really important features of current CPUs, AMD had first. Multi-core? AMD had it before Intel. On-chip memory controller? AMD had it before Intel. AMD had HyperTransport years before Intel had CSI (and that's just on Nehalems). They even broke 1GHz first, back in the day.
AMD may be behind right now, but for most of this decade, Intel has been following AMD.
You have to spend $150+ on an Intel CPU to get a general-case improvement. Even in areas where power is expensive (idle power is much better with the i5), they aren't looking shabby at all. S3 helps alleviate the idle power problem, too.
It’s much more than Intel and AMD have vested interests in x86, to the point that nothing else has been able to keep a good foothold. ARM could prove to be proper competition; but with Apple, and not IBM, truly backing it, PPC was dead, regardless of any technical merit chips built using it may have or have had.
Most consumers have a vested interest in x86, in that most of their apps run on x86. They just don’t know it.
You haven’t been paying attention, then.
Steve Jobs said, in the 2006 MacWorld keynote, that Intel’s Core Duo had 4x the performance/watt of the PowerPC chips they were using.
The numbers he gave place the G4 doing a tad better than 1/4th the performance/watt, and the G5 doing a little worse than 1/4th.
It’s a matter of perspective. As the clock speeds ramped up, the older versions do lack. However, the newer versions were quite impressive (like in the Mac Mini, later Macbooks, and now Freescale SoCs)…but the performance/watt, much like the Atom, is predicated on keeping the speed and voltage down, and thus the performance ceiling low. If you hunt benchmarks on the ‘net, newer Macbook G4s are neck and neck with Atoms, using chips that are much older, and physically much bigger.
The G5 (970) was competitive, but not spectacular in any way, and it wasn't worth it for IBM to try to keep it up as a part for Apple. It was really good in terms of throughput, but the K8 put the smackdown on it in normal tasks (where "smackdown" can be defined as "this runs 5-10% faster on Linux/A64 vs. OS X/G5, doesn't use much more power, and is way cheaper"), and it went downhill from there. Ironically, beefing up the G4 (not based on a Power CPU), as Freescale did, might have been a better path to take, in hindsight.
Also, Apple cooked half their benchmarks when they made the move. Many of them were quite a ways off from what actual users could get. It's not that x86 chips are bad, but the G4 was quite a piece of work, and might still be in PC-like devices, updated with new names, if we had a fairer market.
Except that is not the reality. The G4 pipeline was rather poorly partitioned, and the OOO structures were pretty mild. Both severely limited the scaling of the design. And that is exactly what happened.
A G4 is in no way, shape, or form comparable to a Core 2, even on a clock-for-clock basis.
When did I say it was comparable to a Core 2? I didn’t. I said it was comparable to an Atom. That is, a several-year-old 90nm CPU, which was a tweaked version of one from several years before that, being competitive with a current power-efficient CPU from Intel.
With updates in the areas that bottlenecked it, I see no reason to believe that a modern part derived from it (not just a shrink and speed bump) would not be able to beat the Atoms by a significant margin in every way but x86 compatibility.
When it comes to general integer performance, you won’t beat Intel and AMD.
No it wouldn’t. The G4 was getting trounced in terms of power/performance by the Pentium M, which was its contemporary in terms of fabrication technology (the G4 was behind in architectural generation though).
There is a reason why Freescale has been in so much trouble for a while now.
…and how an old PPC design gets Freescale in so much trouble, I don’t know. It seems more business-related.
This is going way off on tangents, and I'm not disagreeing with you one little bit about specific pieces of hardware that exist (hypothetical thoughts are hypothetical). You're reading subtly different things into what I'm trying to communicate, and the what-ifs aren't helping either, I guess. As much as I'm putting forth what-ifs, you're moving the targets around.
Apple likes to stretch the truth, massage configurations to make benchmarks look how they want, etc. 4x for a Core Solo (not Core 2) is quite a bit off. Maybe not for a Core Duo. Maybe not with old video vs. new video. And so on. A perfect comparison is difficult, and Apple is happier that way. 4x perf/watt does not jibe.
They were not comparing like with like. They were comparing a high end desktop chip with a mobile chip, a bit of a dodgy comparison.
As for the G4, same thing, they were comparing a new Intel with an old G4.
They also completely forgot to include PA Semi's chip, which is understandable because it would have completely destroyed their argument: it had better perf/watt than the Core 2.
There is a word for this sort of thing, it’s called marketing.
He wasn’t talking about desktop chips. Apple’s Core Duo debut was in laptops, so it was a like with like comparion.
And even if Steve Jobs stretched the truth about performance, as he frequently does, the performance/watt is still significantly higher.
I’m not sure about how the G4 compared to Core2 in performance/watt, as comparisons between the two haven’t really been made anywhere, as far as I can tell. When Apple switched to x86, they started with the Core Duo, an earlier chip. That chip has much better performance/watt than the G4.
Intel’s very strict philosophy for the original Core line of chips was 2% performance increase for 1% power consumption increase. The Intel engineers were very strict about that rule, and due to it’s success, all Intel x86 cpus that followed were designed with the same principles.
There is a rumour spreading that this monster will power the PlayStation 4!
http://www.n4g.com/News-434125.aspx