“The Cell chip isn’t just for gaming – IBM is using it to power its next generation of blade servers, and we have the pictures to prove it. IBM has showed off a prototype blade server based on the Cell processor in action this week. The company demonstrated the Cell blade server running visualisation software to display real-time three-dimensional video footage of a beating heart at the CeBIT show in Hannover.”
This sort of thing really gets my blood pumping
IBM is using it to power its next generation of blade servers
If that was true, Sun and others could be rubbing their hands, because the Cell is useless for most of the server market. Theoretical flops are totally irrelevant for running databases, serving web pages or filtering email.
Surely IBM aren’t that stupid, although the same can’t necessarily be said of ZDnet “journalists”.
And where did they get the idea that there’s a PPC970 (aka G5) core in the Cell?
If that was true, Sun and others could be rubbing their hands, because the Cell is useless for most of the server market. Theoretical flops are totally irrelevant for running databases, serving web pages or filtering email.
True, very true; if they came out and said, "this is a Cell blade, but it's only suitable in these areas [markets]," then sure. But if they're going to market it as a general-purpose, all-round blade, it's going to disappoint. Couple that with the electricity-sucking power of these processors and the mountain of heat given off, and I doubt that most companies would want their costs of operation to go through the roof simply on the basis of some wild promises by an IBM spokesperson.
And where did they get the idea that there’s a PPC970 (aka G5) core in the Cell?
eh <waves hand around> it kinda does, but a very stripped-down, no-frills PPC970 core – well, more correctly, everything is just based on a *VERY* basic POWER core, and depending on the product, things are added and removed as required; the whole 'POWER' architecture is designed in such a way that you can pick 'n mix the parts you want for a given scenario/market situation.
In the case of the Cell, it appears they've taken some old ideas from RISC, some new ideas, and mashed them together; if they're going to market it as a CPU for a given scenario, no problem, but this processor will fall flat on its ass if it is marketed as a general-purpose one.
Couple that with the difficulty of properly optimising software for its unique and quirky design: even if all were going to plan, it's not going to deliver if the software which runs on top can't exploit the full power of the processors.
Sun has their own Niagara, but at the same time, they've realised that there is a time and place for each processor; some have their benefits in a given situation and in others there are better alternatives; it's about marketing the right product to the right people.
eh <waves hand around> it kinda does, but a very stripped-down, no-frills PPC970 core – well, more correctly, everything is just based on a *VERY* basic POWER core,
The chip uses the POWER instruction set, but otherwise is no more like a PPC970 than a Pentium is like a Pentium Pro. In fact, they are completely different philosophies of processor design — the Cell PPE is a narrow in-order core, and the PPC970 is a wide highly OOO core.
“Couple that with the electricity-sucking power of these processors”
You can probably turn off the SPEs one by one, and scale down the PPE, in low-workload situations. It might even work in a portable computer, with aggressive power saving. I'm sure BeOS would fly on Cell. (BeOS won't, of course, but Haiku might. Tapping the power of the SPEs, in Haiku, is left as an exercise for the reader.)
Tapping the power of the SPEs, in Haiku, is left as an exercise for the reader.
How so? Has it been touched by the distributed computing fairy or something?
BeOS/Haiku isn’t much more multi-threaded than any other OS these days. And no amount of multi-threading will do anything about the problem of actually distributing code and data onto the SPEs with their smallish local memories.
You’re jumping the gun.
(1) The single PPE would be plenty of power for any non-bloated desktop OS (e.g. BeOS), if VMX is used.
(2) Figuring out how to integrate use of the SPEs in -any- desktop OS (except Linux, if you're happy with Spufs) remains "an exercise for the reader".
(1) The single PPE would be plenty of power for any non-bloated desktop OS (e.g. BeOS), if VMX is used.
But then you could just as well use the G4, because it’s cheaper, more energy-efficient, and probably even a bit faster than the Cell’s PPE.
(2) Figuring out how to integrate use of the SPEs in -any- desktop OS (except Linux, if you're happy with Spufs) remains "an exercise for the reader".
Low-level “integration” a la Spufs may be fairly straightforward, but it doesn’t help much at all with actually utilising the SPEs.
All that Spufs provides is a way to run code on the SPUs and exchange data with them. It does nothing about the fact that programs have to be specially (re)designed and (re)written to deal with the local memory model.
It doesn’t even virtualise the SPEs, i.e. if you ask for SPE number 3, and that’s already being used by another program, then you’re out of luck, even if all other SPEs are unused.
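For the curious, here is roughly what the host side of that looks like with the libspe-style interface that sits on top of spufs. The function names and signatures are written from memory and may be slightly off, and hello_spu is just a placeholder for an SPE program embedded with embedspu, so treat this as a sketch rather than gospel:

#include <stdio.h>
#include <libspe2.h>                        /* userspace wrapper over spufs */

extern spe_program_handle_t hello_spu;      /* placeholder: SPE ELF embedded via embedspu */

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);   /* claim one SPE context */
    if (!ctx) { perror("spe_context_create"); return 1; }

    spe_program_load(ctx, &hello_spu);      /* copy the SPE image into its local store */

    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);  /* blocks until the SPE program stops */

    spe_context_destroy(ctx);
    return 0;
}

Notice there is nothing here about deciding which work goes on which SPE, fitting it into 256KB of local store, or overlapping DMA with compute; all of that is still entirely on you.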
this is a Cell blade, but it's only suitable in these areas [markets], then sure. But if they're going to market it as a general-purpose, all-round blade, it's going to disappoint.
It's easy to think that, but from what I'm hearing it does well even in areas it wasn't designed for. It's pretty clear a Cell isn't going to like some sorts of algorithms / data structures; some processors are better with these, but you'll find no processor is actually good at them. If data is placed randomly around memory, your CPU – any CPU – is going to be sitting idle most of the time.
The first generation was designed for the PS3, so I wouldn't expect it to be good at everything; a later generation could be quite different, though.
Couple that with the electricity-sucking power of these processors and the mountain of heat given off
No numbers have ever been published for power consumption; the only numbers that exist in the public domain are estimates from early last year. They weren't that bad (80W @ 4GHz) and a lot could have (and probably has) changed since then.
@Parent post
If that was true, Sun and others could be rubbing their hands, because the Cell is useless for most of the server market. Theoretical flops are totally irrelevant for running databases, serving web pages or filtering email.
No, they won't rub their hands. I think POWER6 will give them enough of a headache. And POWER6 is supposed to be a direct competitor to Niagara. Cell is a completely different story. Its mission is to fill in where overly complex math (at least for current CPUs) is involved, nothing else.
IBM will have two kinds of blades in the market in 2007. POWER6 and Cell.
Now guess this. One is going to rock on databases and such but suck at math compared to Cell, while Cell is going to suck on databases and such but rock at math.
Put together a complex setup where POWER6 (or Niagara) serves data to Cell from databases, and Cell actually does the complex math to produce results.
@kaiwai
True, very true; if they came out and said, "this is a Cell blade, but it's only suitable in these areas [markets]," then sure. But if they're going to market it as a general-purpose, all-round blade, it's going to disappoint.
Strange IF here. IBM clearly pointed out it is a special-purpose CPU from the very first moment. Find one IBM Cell presentation where they don't make this clear enough. This is clearly bashing IBM without any reason.
Couple that with the difficulty to properly optimise software for its unique and quirky design, even if all were going to plan, its not going to deliver if the software which runs ontop can’t exploit the full power of the processors.
You don’t need to optimize all of it. Cell was not designed for that. As soon as you get out of running service on one computer only you might start to understand this. Cell is a number crunching monster, not hammer for any purpose. IBM was clear as day from he moment one about his special abilities and disabilities.
eh [waves hand around] it kinda does, but a very stripped down, no frills PPC970 core – well, correctly, everything is just based off a *VERY* basic POWER core, and depending on the product, things are added and removed as required; the whole ‘POWER’ achitecture is designed in such a way that you can pick ‘n mix the parts you want for a given scenario/market situation.
Now, tell me one thing. Why would you need PPC core if Cell is what you need. PPC core is nothing but a basic distributor for Cells SPUs. And SPUs are the reason why one would use Cell.
Using Cell as basic POWER makes just as much sense as commiting suicide from happyness after you won the lotery.
Sun has their own Niagara, but at the same time, they've realised that there is a time and place for each processor; some have their benefits in a given situation and in others there are better alternatives; it's about marketing the right product to the right people.
??? What you described is IBM's logic, not Sun's. IBM is the one with two completely different products (POWER5-6 and Cell) here. Sun is not; they just market Niagara as the "world's best hammer", nothing else, just like they do with Java and Solaris (read the next part before bashing back without reason).
I'm not saying Niagara sucks. It rocks, and its price rocks even more. Now that Ubuntu runs on top of it I'm seriously thinking about ordering two T1000s with fiber (no, not for desktop purposes). And I seriously hope to employ at least one Cell blade as a special service processor in the back somewhere in the future.
Again, I'm not saying that Java or Solaris sucks, just that there are many purposes where Java or Solaris doesn't fit. In fact "where Java sucks" completely describes my needs, but they are special and very uncommon, so I don't apply them as general logic, only as personal bias and personal needs. And "where Solaris does not rock" describes my personal preference and the fact that Solaris is not working with Xen yet (no, zones are a long way from a satisfactory solution for me).
The marketing logic you've been describing (being the best hammer in every way, or, better said, untruth in marketing) is Sun's way, not IBM's.
Couple that with the difficulty of properly optimising software for its unique and quirky design: even if all were going to plan, it's not going to deliver if the software which runs on top can't exploit the full power of the processors.
The same thing was said about moving from DOS to Windows, SSE, SMP, multicore… So, your point would be? Just as there are now SSE optimizations, there will be Cell optimizations (the Octopiler), and just as the only good SSE optimization is hand-crafted, the only good Cell optimization will be like that too. Where is the difference?
/*speaking as one who will definitely try out Cell, so you could say I’m biased*/
If one sees a reason why applying and optimizing is needed, he will do it; if not? Well, he didn't need that feature anyway (in Cell's case it will probably be one of the worst decisions of his life; he would fare much better with POWER or Niagara, and probably for a lower price). But if you don't see the reason, that shouldn't make the rule for everybody.
I for one definitely see the use of Cell in my services.
Besides that, I thought Cell doesn't even run the current flock of applications. It's not like it's another x86-compatible CPU.
That's why they made a CPU 20x faster than the current Pentium 4: because they made it for something that doesn't need to run applications from 10 years ago, and they put in the newest and greatest technology so that applications made for it run well.
“Besides that, I thought Cell doesn't even run the current flock of applications. It's not like it's another x86-compatible CPU.
That's why they made a CPU 20x faster than the current Pentium 4: because they made it for something that doesn't need to run applications from 10 years ago, and they put in the newest and greatest technology so that applications made for it run well.”
WTF???
Besides that, I thought Cell doesn't even run the current flock of applications. It's not like it's another x86-compatible CPU.
??? It is a PPC. How could it be x86 compatible? And yes, it runs the current flock of PPC apps, just not as fast as it would if it were a regular POWER5.
That's why they made a CPU 20x faster than the current Pentium 4: because they made it for something that doesn't need to run applications from 10 years ago, and they put in the newest and greatest technology so that applications made for it run well.
No, they made a 20x faster CPU because they designed it for a special purpose: number crunching.
It is easier to do one thing well than to do many things.
“Couple that with the difficulty of properly optimising software for its unique and quirky design: even if all were going to plan, it's not going to deliver if the software which runs on top can't exploit the full power of the processors.”
As someone who works on a PS3 devkit daily I can say without any hesitation that you are an idiot.
Gee, who to believe… you and your drivel… or IBM, Sony, Toshiba, and the 200+ other aerospace, defense, medical and media companies that are in the process of either building Cell systems or working with IBM to migrate their systems to Cell?
The Broadband Engine in the PS3 is an utter dream to code for. It is what I and every other console engineer that I used to have discussions with five years ago dreamed of when the end of the MHz race was on the horizon.
I hope you realize that console programmers live in a subspace completely separate from the regular programming universe. Console programmers just discovered C a few years ago, and think nothing of breaking out the assembler on occasion.
Regular programmers live with a million layers of software between them and the hardware. We write performance-intensive simulations in interpreted languages and don’t give it a second thought. We get pissy if we don’t have a GC, and would probably faint at the sight of PowerPC assembly.
My point is that just because you as a console programmer think it's easy to code for doesn't mean it is. I know lots of EEs who'd think Cell is a dream to code for because it beats writing custom signal processing logic on an FPGA. However, this niche of programmers is relatively small compared to the whole software universe. Undoubtedly, Cell will find a nice home in such markets (which is exactly what IBM wants), but I think people in these markets who find Cell impressive tend to overestimate its general appeal.
Nobody is stopping you from working in high level languages. In fact, HL languages are easier for automatic parallelization.
What we need is more real work done on parallel computers, or parallelism will be forever stuck in the research realm…
Well, Cell doesn't exactly make that easy. Making an auto-parallelizing compiler for a high-level language (heck, even one for C) is the work of several PhD dissertations. To my knowledge, nobody has made one yet. Making an auto-parallelizing compiler for a high-level language that can also deal transparently with Cell's local memory architecture and crappy integer and branch performance is pretty much a pipe dream.
It seems you are not well informed about parallelization. There are a lot of specific compilers and languages for parallel computers.
The big problem resides in parallelizing low-level languages like C and C++. Any language using pointers is a true nightmare for automatic parallelism extraction. Even a language like Java uses a lot of serial structures like linked lists, and it's hard to parallelize. The only sane way with those languages is massive threading, which is a very weak method because it does not care about data locality (the REAL problem with any parallel system).
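To make that concrete, here is a minimal C illustration of the difference (nothing Cell-specific, just the aliasing and dependence problem being described):

struct node { int value; struct node *next; };

/* Pointer chasing: iteration i cannot start until iteration i-1 has loaded
 * 'next', the trip count is unknown, and the compiler cannot prove the nodes
 * don't alias each other.  Automatic parallelism extraction gives up here. */
int sum_list(const struct node *n)
{
    int sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Flat array: known trip count, no pointer aliasing, independent elements.
 * Trivial to split across threads, SPEs or SIMD lanes. */
int sum_array(const int *a, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += a[i];
    return sum;
}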
BTW, your statement about the Cell being weak in integers and branching is quite funny. You can say that when you compare a current SPE with an idealized SPE implementation.
In raw numbers, a Cell blows any other micro out of the water in both measures. Remember that we are talking about 9 cores at 3 GHz! And each one is capable of executing 1 integer instruction per cycle, and the instruction density is quite good. Cell instructions are triadic, and the operations are quite sophisticated.
You said: “In fact, HL languages are easier for automatic parallelization.”
I know of some compilers that can parallelize C and FORTRAN code in very primitive ways, and of languages for parallel computer systems, but I have yet to encounter a compiler that auto-parallelizes code in a general manner. As far as I’m aware, the technology just isn’t there yet. Could you provide a link to a system that does what you claim?
BTW, your statement about the Cell being weak in integers and branching is quite funny. You can say that when you compare a current SPE with an idealized SPE implementation.
In raw numbers, a Cell blows any other micro out of the water in both measures. Remember that we are talking about 9 cores at 3 GHz!
I'd take 2 highly OOO cores at 3 GHz over 9 narrow in-order cores at 3 GHz. Let's look at each component:
SPE: 2-issue, in order, 18-cycle branch misprediction penalty, no dynamic branch prediction. 2-cycle simple integer latency, 3-cycle taken-branch penalty, and 6-cycle latency to the local store — the closest thing it has to an L1 cache.
PPE: 2-issue, in order, 21-cycle branch misprediction penalty, small branch prediction resources. 2-cycle simple integer latency, 5-cycle latency to L1 cache, 31 (!) cycle latency to L2 cache. God knows how big of a latency to memory!
The lack of dynamic branch prediction alone will kill the SPE on HLL code. Such code has a branch every 4-5 instructions, and if a substantial number of them are not statically predictable, the SPE will spend all its time handling mispredicted branches. Now, Itanium gets good integer performance without dynamic branch prediction, but it has features like predication that the SPE's "deep and narrow" philosophy rules out, and a branch misprediction penalty about a third as high.
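A back-of-envelope figure (my own estimate, not a measurement): with a branch every 5 instructions and, say, 10% of those mispredicted, that is 0.2 × 0.1 × 18 ≈ 0.36 extra cycles per instruction. A 2-issue core's best case is 0.5 cycles per instruction, so branch penalties alone would push it towards 0.86 CPI, nearly doubling the execution time before the load-use latency even enters the picture.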
The high latency to local store will do the rest. HLL code is usually full of loads/stores, and with no OOO to cover the load-store latency, and no architectural tricks like Itanium’s ALAT to allow speculative loads, you’ll just have to depend on prefetching, which is mostly useless on integer code.
The original Itanium got about half the integer IPC of the best highly OOO designs. It was a far more sophisticated chip than the PPE, and in a completely different league than the SPE.
All that aside, I doubt any of this is news to IBM. Clearly, Cell is meant for a particular niche, and clearly IBM compromised the design to best serve that niche. I just wish everyone expecting Cell and designs like it to replace the current generation of OOO general purpose cores would realize what IBM does.
PS) Don't get me wrong, I was pretty enthused about Cell too. However, the more the numbers came out (especially the latency and branch penalty figures), the more it became clear that IBM had pretty severely compromised integer performance in order to be able to fit 9 cores on a single die at 3+ GHz.
I don’t get how many people are arguing about this as if the Cell was a replacement for your desktop CPU.
I am so sick and tired of people whining about the Cell just because the magical compiler fairy doesn't make its power accessible to them in their scripting language of choice.
Get it straight already:
The Cell is not your mom's microprocessor.
Besides the 1 instruction per cycle per SPE peskanov mentioned, each SPE is capable of SIMD with its 128-bit-wide registers (for example, 4×32-bit operations), so it can actually do a hell of a lot more work than your run-of-the-mill 1 inst / cycle.
That's one 3 GHz PowerPC core with no OOO to speak of and eight 3 GHz SPEs that can run the same operation on four 32-bit numbers at once.
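As a rough sketch of what that 4-wide integer work looks like, here is the idea expressed with GCC's generic vector extension rather than the actual SPU intrinsics (so the types are stand-ins, but the shape is the same):

/* One add below performs four 32-bit adds, roughly what a single SPE
 * SIMD instruction does on its 128-bit registers. */
typedef int v4si __attribute__((vector_size(16)));   /* 4 x 32-bit lanes */

void add_arrays(v4si *dst, const v4si *a, const v4si *b, int nvec)
{
    for (int i = 0; i < nvec; i++)
        dst[i] = a[i] + b[i];        /* 4 integer adds per iteration */
}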
Yes, the Cell lacks OOO functionality, but it is a raw, steaming speed monster when used right.
No, you can’t drop your lame-ass messy crap code onto a Cell and expect it to outperform your desktop processor with some large percentage of its silicon dedicated to OOO.
Yes, you can write code that runs _insanely_ fast on Cell.
No, you might not be able to do it with Visual Basic 7.
So what you have is a powerhouse of a processor that requires you to exercise some skill when programming it.
Yes, it might require you to learn more than the "–optimize-for-cell" compiler switch. Boohoo.
Saying that it will suck at databases and such is not necessarily true, because you might write a database server that relies particularly on the Cell's strengths and come up with something that completely obliterates the competition.
If not, then don’t use the Cell for that task.
People are constantly arguing that it will be inferior to OOO cores at different tasks, just because they are thinking in the context of them not having to do any work for the benefit.
Many of these could likely be solved if you rethink the algorithms for the Cell's architecture.
Guess what? Sometimes huge steps forward come at a cost. And yes, the Cell is a huge step forward – if you hire a programmer that isn't afraid of writing a bit of assembly or C and maybe rethink that inner loop so it doesn't have 12 branches, then you have unparalleled computing power at what will quickly become a ridiculous price.
Oh but wait… if it doesn’t come with a compiler switch and a visual studio wizard, then it’s worthless, right?
Oh but wait… if it doesn’t come with a compiler switch and a visual studio wizard, then it’s worthless, right?
Not worthless, but certainly not generally applicable. The fact of the matter is that in most fields, programmers are more expensive than microprocessors. Making the programmer work harder so you can design a simpler processor is the wrong tradeoff in such circumstances. Where I used to work, we threw dual 4GB Xeon servers at a simulation until it ran fast enough. It was much cheaper to do that than to get a couple of developers to spend a week making the code run faster. A lot of places are like that.
Now, as I said, there are niches where Cell makes sense. If you’re designing a HUD for a fighter jet, sticking a bunch of big computers in there isn’t an option, and Cell is a great thing. But programming is becoming more generalized, not less. This is becoming true even in the gaming industry — games are becoming very complicated, and game budgets are approaching the size of small movies. Even in this traditional bastion of “to the metal” programming, easier-to-program designs are becoming more important. Nintendo might just very well carve a nice little niche for itself, simply because its Revolution console will be an order of magnitude easier to program than the XBox 360 or PS3.
Has anybody here heard of the Octopiler? I saw a report on /. about how this compiler is going to allow C programmers to parallelize and pipeline optimize their code to run on SPEs fairly transparently. (If you don’t mind a little interactive optimization from the programmers’ perspective.)
I know of some compilers that can parallelize C and FORTRAN code in very primitive ways, and of languages for parallel computer systems, but I have yet to encounter a compiler that auto-parallelizes code in a general manner. As far as I’m aware, the technology just isn’t there yet. Could you provide a link to a system that does what you claim?
Maybe you are talking about OpenMP (for C/C++ languages)? Anyway, in FORTRAN it’s quite usual to implement parallelization of big matrix operations.
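For reference, this is the sort of annotation-driven parallelism OpenMP gives you in C. It is stock OpenMP, nothing Cell-specific; whether any toolchain maps it onto SPEs is a separate question:

#include <omp.h>

/* The pragma asserts that the iterations are independent; the runtime then
 * splits them across however many threads are available (build with -fopenmp). */
void scale(double *dst, const double *src, double k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}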
Languages with automatic parallelism extraction: NESL, LUCID, GLUT, Scout, several flavours of FORTRAN, some visual languages and several more.
Languages for multiple processors based on concurrency: Java, Erlang (I don't like this approach by the way, but the model used should be possible on a PPE) and some database query languages which use distributed computing.
There exists a vast body of research on parallelism, mostly from the nineties. Take a look at the old IEEE transactions on parallel computing.
PPE: 2-issue, in order, 21-cycle branch misprediction penalty, small branch prediction resources. 2-cycle simple integer latency, 5-cycle latency to L1 cache, 31 (!) cycle latency to L2 cache. God knows how big of a latency to memory!
What's the problem with these numbers? These are quite common for >3 GHz CPUs, except the 2-issue width. Which is not so bad, as issue != IPC, and IPC < issue (always).
SPE: 2-issue, in order, 18-cycle branch misprediction penalty, no dynamic branch prediction. 2-cycle simple integer latency, 3-cycle taken-branch penalty, and 6-cycle latency to the local store — the closest thing it has to an L1 cache.
The local memory is not an L1, obviously. You can NOT miss an access to local memory; you never get a penalty. And it is quite big, bigger than most L1 caches. This model has been used in many consoles (PS1, PS2, Dreamcast, GBA, Xbox360) and I have programmed for it several times. It's extremely effective.
The 18-cycle misprediction penalty is the same as or smaller than a Pentium 4's (except you are getting 8 CPUs for the price of one).
Latencies are a bit larger than in other >3 GHz micros, but not by much.
The lack of dynamic branch prediction alone will kill the SPE on HLL code. Such code has a branch every 4-5 instructions, and if a substantial number of them are not statically predictable, the SPE will spend all its time handling mispredicted branches. Now, Itanium gets good integer performance without dynamic branch prediction, but it has features like predication that the SPE's "deep and narrow" philosophy rules out, and a branch misprediction penalty about a third as high.
When I talk about HLL, I talk about languages tailored to create parallelizable programs without doing low-level grunt work. Languages which support structures like arrays and sets natively.
I am not talking about languages with run time type checking and things like that, holy jesus!
Anyway, I would like to remind you that the SPE does, in fact, carry a branch prediction mechanism, although it's not the usual branch cache. It uses special instructions to keep track of the interesting branches (compiler-wise).
Also, the SPEs have a large number of registers, so speculative branching can be implemented easily in software (execute both branches, use only the valid branch results). This mechanism is being implemented for the Xenon processor right now (Xbox360).
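For anyone wondering what "execute both branches, use only the valid results" means in practice, here is the usual branch-free select trick in plain C. The SPE has a hardware select instruction for exactly this; the sketch below is the generic, portable version:

/* Branch-free select: compute both candidate results unconditionally, then
 * pick one with a mask.  No branch, so nothing to mispredict; the price is a
 * few extra ALU operations. */
static int select_int(int cond, int a, int b)
{
    int mask = -(cond != 0);           /* all ones if cond is true, else all zeros */
    return (a & mask) | (b & ~mask);
}

/* Equivalent to: return (x > limit) ? limit : x;  but with no branch at all. */
int clamp_to_limit(int x, int limit)
{
    return select_int(x > limit, limit, x);
}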
The high latency to local store will do the rest. HLL code is usually full of loads/stores, and with no OOO to cover the load-store latency, and no architectural tricks like Itanium’s ALAT to allow speculative loads, you’ll just have to depend on prefetching, which is mostly useless on integer code.
mmm? Prefetching is useless in integer code? What are you talking about? Prefetching is useful in any code which can request data beforehand, especially sequential accesses.
Maybe you mean useless in code with frequent branching…
The original Itanium got about half the integer IPC of the best highly OOO designs. It was a far more sophisticated chip than the PPE, and in a completely different league than the SPE.
Itanium is a fiasco. Its computation/$ ratio is ridiculous. I am not interested in this behemoth as a reference, really.
All that aside, I doubt any of this is news to IBM. Clearly, Cell is meant for a particular niche, and clearly IBM compromised the design to best serve that niche. I just wish everyone expecting Cell and designs like it to replace the current generation of OOO general purpose cores would realize what IBM does.
OOO is a silly patch for a bad software habit. Wasting 40 million transistors to execute 3 instructions/cycle (on a good day) is anti-economical.
Only a vicious market allows this kind of situation, but there are people who will pay anything to have their copy of "Word" run 30% faster than on their previous computer.
PS) Don't get me wrong, I was pretty enthused about Cell too. However, the more the numbers came out (especially the latency and branch penalty figures), the more it became clear that IBM had pretty severely compromised integer performance in order to be able to fit 9 cores on a single die at 3+ GHz
I respectfully disagree.
A serious compromise is having to translate x86 code (with several extensions) in an already large pipeline.
What IBM has done with the Cell makes sense in efficiency terms. But I don't expect many people will appreciate it.
The number of people who have worked seriously in non-x86 environments is increasingly small.
And the same can be said about parallelism.
And curiously, believe it or not, the problems the Cell will face are very similar to the ones a multicore x86 will face: memory synchronization, memory bandwidth, the speed of concurrency mechanisms, and the inability of current languages to automatically parallelise the code.
Languages with automatic parallelism extraction: NESL, LUCID, GLUT, Scout, several flavours of FORTRAN, some visual languages and several more.
The things you mention strike me as being a lot like APL. They don’t extract parallelism from sequential code, but provide constructs that can be parallelized without resorting to a threading model. I think those designs have a lot of merit, but what I was talking about was more along the lines of IBM’s Octopus compiler. However, I don’t know how well Cell will deal with them. Context-switching the SPE is rather slow, so Cell was really meant for something more like an agent-oriented or producer-consumer model.
What's the problem with these numbers? These are quite common for >3 GHz CPUs, except the 2-issue width. Which is not so bad, as issue != IPC, and IPC < issue (always).
The numbers are horrible! I don't know where you got the idea that these are common for 3 GHz CPUs. The only ~3GHz processor with numbers that bad is the P4, and at least its L1 cache latency and simple integer latency are good. Not to mention the fact that it's highly OOO and has a giant branch predictor table.
Other ~3GHz CPUs, like the Opteron and G5 have much better numbers. The Opteron’s branch misprediction latency is 11 cycles, the L1 cache latency is 3 cycles, the L2 is 15 cycles. The G5’s branch misprediction is 15 cycles, the L1 cache is 3 cycles, the L2 is 13 cycles. Both have much bigger branch predictors.
The local memory is not an L1, obviously. You can NOT miss an access to local memory; you never get a penalty.
The 6-cycle penalty is for load-to-use. Since there is no L1 cache, that’s the minimum latency for any memory operation, and is thus comparable to the L1 latency in a cached architecture. Even IBM’s docs admit that this penalty is quite bad for integer code.
This model has been used in many consoles (PS1, PS2, Dreamcast, GBA, Xbox360) and I have programmed for it several times. It's extremely effective.
None of those things are meant to run integer code with any speed, and at least some of them have an L1 cache with less than a 6-cycle latency!
The 18-cycle misprediction penalty is the same as or smaller than a Pentium 4's (except you are getting 8 CPUs for the price of one).
Yes, but at least the Pentium 4 has enormous branch prediction resources to mitigate it. On an SPE, you have 3 cycles up front for ANY branch, and if it's not statically predictable, you pay 18 cycles every single time. Something like AI code will spend all its time paying branch penalties.
When I talk about HLL, I talk about languages tailored to create parallelizable programs without doing low-level grunt work. Languages which support structures like arrays and sets natively.
I am not talking about languages with run time type checking and things like that, holy jesus!
So why did you reply to my original post? The whole point of my post was that the Cell might very well be nice to program if you’re used to working with the metal, but makes no sense for the type of programming most people are doing. Even people who program in C spend most of their time writing rather abstract code. On an SPE, even something basic like a virtual function call in C++ becomes an extremely expensive operation.
mmm? Prefetching is useless in integer code? What are you talking about? Prefetching is useful in any code which can request data beforehand, especially sequential accesses.
Maybe you mean useless in code with frequent branching…
When I say integer code, I don’t mean a triangle rasterizer. I mean conventional integer programs (what you’d find in SPECint). That means lots of branches and random load/store.
OOO is a silly patch for a bad software habit. Wasting 40 million transistors to execute 3 instructions/cycle (on a good day) is anti-economical.
Transistors are cheap. Billable programmer hours aren’t.
I respectfully disagree.
A serious compromise is having to translate x86 code (with several extensions) in an already large pipeline.
The overhead of x86 translation is ~5% on modern CPUs. A few million transistors here or there, who cares?
problems the Cell will face are very similar to the ones a multicore x86 will face: memory synchronization, memory bandwidth, the speed of concurrency mechanisms, and the inability of current languages to automatically parallelise the code.
They'll face it in an entirely different way. Multi-core x86 will be able to attack the problem with high single-thread performance cores (so you have far less parallelism you need to extract), and with the benefit of truly high-level languages. Cell programmers will have to deal with a primitive in-order architecture in addition to all the concurrency headaches. x86 programmers will get to work with a shared memory model. Cell programmers will have to deal with the obscure local memory model. The basic problems will be similar, but x86 programmers will get a lot more help from the hardware in dealing with them.
The things you mention strike me as being a lot like APL. They don’t extract parallelism from sequential code, but provide constructs that can be parallelized without resorting to a threading model. I think those designs have a lot of merit, but what I was talking about was more along the lines of IBM’s Octopus compiler. However, I don’t know how well Cell will deal with them. Context-switching the SPE is rather slow, so Cell was really meant for something more like an agent-oriented or producer-consumer model.
Yes, PPEs are no good for multithreading; but threading is not a very good parallel computing model, as it does not scale well.
The numbers are horrible! I don't know where you got the idea that these are common for 3 GHz CPUs. The only ~3GHz processor with numbers that bad is the P4, and at least its L1 cache latency and simple integer latency are good.
I don’t think the numbers are bad at all. Latency is higher, yes, but throughput is the same (1 cycle).
And talking about latencies: for example, a P4 needs 2 cycles to read from L1 while a PPE needs 6 cycles. However, an L1 cache miss in a P4 invalidates a large section of the pipeline, losing dozens of instructions (like in a branch misprediction), while the PPE does not have this problem.
The PPE only stalls seriously when waiting for external memory access (through DMA). This is not very different to the P4, which will stall for HUNDREDS of cycles when waiting for main memory.
Other ~3GHz CPUs, like the Opteron and G5 have much better numbers. The Opteron’s branch misprediction latency is 11 cycles, the L1 cache latency is 3 cycles, the L2 is 15 cycles. The G5’s branch misprediction is 15 cycles, the L1 cache is 3 cycles, the L2 is 13 cycles. Both have much bigger branch predictors.
So? 18 cycles vs. 11/15. What's so terrible? …and PPE units don't rely on static branch prediction. More about that later.
The 6-cycle penalty is for load-to-use. Since there is no L1 cache, that’s the minimum latency for any memory operation, and is thus comparable to the L1 latency in a cached architecture. Even IBM’s docs admit that this penalty is quite bad for integer code.
Yes; but the throughput is still 1 cycle, remember.
None of those things are meant to run integer code with any speed, and at least some of them have an L1 cache with less than a 6-cycle latency!
You are making a drama out of something that isn't one. Yes, latency is higher, but not SO much higher that it makes a big difference against other micros.
By the way, when I programmed on the PS1, I used the scratchpad (1 KB if I remember correctly) in an A* algorithm. I think this is a good example of what you call "integer code". The point is that the programmer knows far better than the simple L1 replacement mechanism which data is going to be reused and which data will not. A compiler with profiling information could also make good use of a local memory of fixed size.
Fast local memory is very good; the problem is you need to tailor your routines to its size.
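A sketch of the tailoring being described, in plain C: the classic double-buffering pattern, where chunk i+1 streams into one buffer while chunk i is processed from the other. The dma_get()/dma_wait() helpers are hypothetical stand-ins for the real MFC calls (mfc_get plus tag-group waits on the SPE), and process() is whatever your kernel is:

#define CHUNK 4096                      /* sized to fit comfortably in local store */

static float buf[2][CHUNK];

void dma_get(void *local, unsigned long remote, unsigned bytes, int tag);  /* hypothetical */
void dma_wait(int tag);                                                    /* hypothetical */
void process(float *data, int count);

void run(unsigned long remote_base, int nchunks)
{
    int cur = 0;
    dma_get(buf[cur], remote_base, sizeof buf[0], cur);          /* fetch chunk 0 */

    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                                     /* start streaming chunk i+1... */
            dma_get(buf[nxt], remote_base + (unsigned long)(i + 1) * sizeof buf[0],
                    sizeof buf[0], nxt);

        dma_wait(cur);                                           /* ...wait only for chunk i */
        process(buf[cur], CHUNK);                                /* compute while i+1 arrives */
        cur = nxt;
    }
}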
Yes, but at least the Pentium 4 has enormous branch prediction resources to mitigate it. On an SPE, you have 3 cycles up front for ANY branch, and if it's not statically predictable, you pay 18 cycles every single time. Something like AI code will spend all its time paying branch penalties.
Ok, so let's clear this question up.
PPEs are not based on static branch prediction.
PPEs use a special mechanism for dynamic branch prediction, software-aided. There are special instructions, "hint for branch", which allow you to dynamically change the default path to take on any branch (the instructions are hbr/hbra/hbrp/hbrr).
Using a register as a counter, any program can dynamically predict its own branches. In fact, this system allows for more sophisticated branch prediction than the common hardware ones.
Of course this system also has its costs and its limits, but it's quite far from being "static prediction".
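To make the "register as a counter" idea concrete, here is a toy version in portable C: a 2-bit saturating counter tracking one hard-to-predict branch. On a real SPE the predicted direction would then be fed to a hint-for-branch (hbr) placed well ahead of the branch, a step that has no portable C equivalent, so only the bookkeeping is shown:

/* Toy software branch predictor: the counter lives in an ordinary variable
 * (i.e. a register), and its top bit is what you would hint to the hardware. */
static unsigned counter = 2;              /* 0..3; values >= 2 mean "predict taken" */

int predict_taken(void)                   /* the direction you would hint with hbr */
{
    return counter >= 2;
}

void record_outcome(int taken)            /* call after the branch actually resolves */
{
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}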
So why did you reply to my original post? The whole point of my post was that the Cell might very well be nice to program if you’re used to working with the metal, but makes no sense for the type of programming most people are doing. Even people who program in C spend most of their time writing rather abstract code. On an SPE, even something basic like a virtual function call in C++ becomes an extremely expensive operation.
Because I believe C/C++/Java are not good languages for multiprocessing, and I believe special languages (high-level ones) must be developed to really make use of computers with dozens or hundreds of processors. And because I believe it's better to improve software and make more efficient hardware, instead of wasting millions and millions of transistors on executing a few poor instructions.
In other words, because I believe the "leave it to the compiler" philosophy is the best one, and the Cell is a good example of this philosophy.
When I say integer code, I don’t mean a triangle rasterizer. I mean conventional integer programs (what you’d find in SPECint). That means lots of branches and random load/store.
I have been programming integer code for most of my professional life and I can swear to you most of the heavy-duty algorithms are highly sequential in memory addressing, or have good locality. I have nothing to say about branching because I don't think the PPE is so badly served in that area, as I explained before.
Transistors are cheap. Billable programmer hours aren’t
Taking into account how many programs are done in thousands of lines of C++ or Java when a few hundred lines of LISP or SQL would be enough, that's kind of ironic.
I am a professional programmer, and I don't buy that myth. Produce good hardware and market it well, and the software will come after it.
The overhead of x86 translation is ~5% on modern CPUs. A few million transistors here or there, who cares?
This is a misrepresentation of the problem, and maybe even an urban legend. Certainly it does not look that way in the pictures of Yonah I have seen.
Why don't you call Intel and ask them how much money went into the translator this time? It was fully reworked, by the way. And this money is money we all pay, you know? There is a reason why Intel and AMD need 10 times more engineers than IBM or ARM to design a CPU.
They'll face it in an entirely different way. Multi-core x86 will be able to attack the problem with high single-thread performance cores (so you have far less parallelism you need to extract), and with the benefit of truly high-level languages. Cell programmers will have to deal with a primitive in-order architecture in addition to all the concurrency headaches. x86 programmers will get to work with a shared memory model. Cell programmers will have to deal with the obscure local memory model. The basic problems will be similar, but x86 programmers will get a lot more help from the hardware in dealing with them.
The compiler deals with in-order execution and the new branch prediction system; the programmer does not. The main problem code-wise is the explicit local memory system.
The usual concurrency problems are common to both. In performance terms, bandwidth will be a problem for both of them also.
I would like to see high-level languages deal with all the low-level Cell details, as I said.
In fact I'm working on a small one myself. I think it's a far more interesting architecture than, let's say, a dual-core x86 or PowerPC, which has no real muscle in parallelism terms.
I don’t think the numbers are bad at all. Latency is higher, yes, but throughput is the same (1 cycle).
The original point of my post was to point out that the Cell will be quite weak on traditional integer codes. Statements like "latency is higher but throughput is the same" don't really address my points. Traditional integer codes live and die by latency. For a lot of the code I write, I'd rather have half the memory bandwidth and half the latency than the other way around.
You are making a drama out of something that isn't one. Yes, latency is higher, but not SO much higher that it makes a big difference against other micros.
The latencies are up to three times higher! In deeply pipelined processors running integer code, L1 latencies are important enough that Intel often uses tiny L1 caches (e.g. the Northwood P4's 8KB L1) just to get 2-cycle latency versus 3-cycle latency. Even in IBM's own presentations, they point out that the 6-cycle load-use hurts for traditional integer programs.
Ok, so let's clear this question up.
PPEs are not based on static branch prediction.
PPEs use a special mechanism for dynamic branch prediction, software-aided. There are special instructions, "hint for branch", which allow you to dynamically change the default path to take on any branch (the instructions are hbr/hbra/hbrp/hbrr).
1) What magic compiler is going to insert these instructions? Because we’re most assuredly not talking about writing assembly code in 2006…
2) Only one active branch hint that needs to be issued 11 cycles in advance? Branch histories tracked in precious registers that you already need to unroll your loops and mitigate the 6-cycle load-use to LS? That’s like what, effectively a 32-entry BHT? This mechanism isn’t a replacement for branch prediction, it’s just a way to use fancy high-level features like “functions” without always mispredicting.
Because I believe C/C++/Java are not good languages for multiprocessing, and I believe special languages (high-level ones) must be developed to really make use of computers with dozens or hundreds of processors.
Back to the very beginning of this loop, my point was that making a compiler that can effectively make use of hundreds of processors is hard enough — making one that can do that while generating good code for a processor as primitive as Cell is another thing entirely. Intel has spent billions on Itanium, and still doesn't have compilers that can consistently get good performance out of its (sophisticated) in-order core. The overall problem is just too damn complicated.
And because I believe it's better to improve software and make more efficient hardware, instead of wasting millions and millions of transistors on executing a few poor instructions.
I’m talking about improving software. Good software is general, abstract, easy to maintain, and quick to develop. If it runs fast enough, then great, otherwise — transistors are a dime a million (literally).
I have been programming integer code for most of my professional life and I can swear to you most of the heavy-duty algorithms are highly sequential in memory addressing, or have good locality.
The vast amount of research into the subject contradicts your observation. Code that happens to use integers is not “typical integer code”. Typical integer code is business applications, databases, compilers, schedulers, AIs, etc. GCC is the canonical “difficult”, typical integer program. These applications most assuredly do not have highly sequential memory addressing. At least, not in the (abstract, maintainable) way they are usually written. You don’t have to take my word for it — look up the execution statistics for the programs in SPECint. There is a lot of data out there about the execution profiles of these typical integer programs. Some of the statistics are scary — a branch every 4.5 cycles (for ‘li’ from SPECint 95), branch prediction accuracy of as low as 94% even with sophisticated predictors, loads that need to be serviced in one cycle up to 62% of the time.
Why don't you call Intel and ask them how much money went into the translator this time? It was fully reworked, by the way. And this money is money we all pay, you know? There is a reason why Intel and AMD need 10 times more engineers than IBM or ARM to design a CPU.
I don't see the connection. AMD's design team has never been that large. Intel's last huge project (the P4) ended up a lot less successful than its most recent small project (the P-M). Intel's latest Core architecture is from its relatively small Israel design team. One of the most theoretically simple ISAs in existence (Intel's EPIC) also has one of the largest design teams building processors for it. Intel and AMD do generally employ more engineers than IBM (though the POWER5 team wasn't exactly small at 300+), but that's because they tend to rework their architectures more often (more competitive market), and because they do more aggressive circuit-level design and use less automation.
AMD has made some excellent presentations about the relative cost of x86 in its "x86 everywhere" campaign: http://developer.amd.com/assets/WinHEC2005_x86_Everywhere.pdf
It rightly points out that of all the design parameters you can vary to change the performance of your processors, the ISA is actually pretty far down on the list.
The compiler deals with in-order execution and the new branch prediction system; the programmer does not.
The compiler is too primitive to deal with it transparently. The programmer has to take care to write simple code (stuff that can be easily digested by loop unrollers and whatnot), avoid abstract constructs, and then profile to make sure the generated code really is good. On most modern x86 processors, you can get pretty decent performance with very straightforward code.
I would like to see high-level languages deal with all the low-level Cell details, as I said.
Here's my final two cents (you can buy 200,000 transistors with that). I haven't been using computers that long in the grand scheme of things. Even in that short time, I've seen processors go from in-order, single-scalar cores to massively out-of-order superscalar ones. At the same time, I've seen software progress — well, I haven't seen the state of software progress one bit. Today's 'state of the art' software is stuff that was dreamed up in the 1980s. Perhaps there is hope that one day we'll have compilers that can take a "Parallel Python" program and distribute it over 1000 Cell-like cores. However, I'm not holding my breath.
The latencies are up to three times higher! In deeply pipelined processors running integer code, L1 latencies are important enough that Intel often uses tiny L1 caches (e.g. the Northwood P4's 8KB L1) just to get 2-cycle latency versus 3-cycle latency. Even in IBM's own presentations, they point out that the 6-cycle load-use hurts for traditional integer programs.
There is no argument to address here. The only point in discussion is how much it hurts performance against a 3 GHz P4, for example.
I have been programming in the low-level field for many years, with a special emphasis on optimization, and I say a decently programmed SPE is not slower than 50% of the very expensive P4 on general code. This happens because the stalls accessing far memory (L2/main) matter more than other stalls like branch misses. And an SPE is equal to or faster than the P4 on vector code, mostly due to the extremely good permute operation and the big register bank.
1) What magic compiler is going to insert these instructions? Because we’re most assuredly not talking about writing assembly code in 2006…
All the state-of-the-art compilers have capabilities to profile and get statistics about branching.
You underestimate IBM and overestimate the power of branch caches.
2) Only one active branch hint that needs to be issued 11 cycles in advance? Branch histories tracked in precious registers that you already need to unroll your loops and mitigate the 6-cycle load-use to LS? That’s like what, effectively a 32-entry BHT? This mechanism isn’t a replacement for branch prediction, it’s just a way to use fancy high-level features like “functions” without always mispredicting.
The hint does not need to be issued 11 cycles in advance. The hint DOES have an effect after 11 cycles, which is a different thing. This is not very different from a branch cache. A branch cache needs previous records to predict anything, and a measure of a branch takes some cycles to store the record and make this record effective.
And no, the SPE mechanism is not equivalent to a 32-entry BHT. You can use as many registers/memory locations as you want to store branch records. It's software-driven.
BTW, those "precious registers" you are talking about number 128; one of the biggest register banks ever implemented. This is a modern CPU, not an x86.
I know you don't like the Cell, but these misrepresentations are starting to get ridiculous. The SPE is not based on static branch prediction. Period.
Back to the very beginning of this loop, my point was that making a compiler that can effectively make use of hundreds of processors is hard enough — making one that can do that while generating good code for a processor as primitive as Cell is another thing entirely. Intel has spent billions on Itanium, and still doesn't have compilers that can consistently get good performance out of its (sophisticated) in-order core. The overall problem is just too damn complicated.
Guess what, the problem with Itanium is trying to get good performance from non-parallel-friendly languages like C.
At the beginning of this loop, I was advocating writing compilers for languages better suited to the problem at hand.
Call me when you accept that most popular languages like C++/Java don't scale well for real multiprocessor systems (like the Cell, or an 8-core x86 computer).
I’m talking about improving software. Good software is general, abstract, easy to maintain, and quick to develop. If it runs fast enough, then great, otherwise — transistors are a dime a million (literally).
If you don't care about performance, why do you waste your time spreading incorrect information about a system designed exclusively to provide the best performance possible with the minimum budget?
Some of the statistics are scary — a branch every 4.5 cycles (for 'li' from SPECint 95), branch prediction accuracy of as low as 94% even with sophisticated predictors, loads that need to be serviced in one cycle up to 62% of the time
Ok; a Huffman algorithm has a lot of branching, or any tree iterator. So what? This split between integer/float code is silly. Blending two layers of RGB pixels is pure integer, data-sequential, branch-free code. So?
The code which breaks data locality is code which makes heavy use of indexing or pointers. Call it "integer code" if you want, but it is confusing. Also there is code which branches a lot based on floating-point comparisons.
The SPE is as bad with this code as any modern CPU. It can do speculative execution, or branch hinting, and hope for the best. That's it. As for data locality, instead of prefetching it has an advanced queue for DMA requests.
AMD has made some excellent presentations about the relative cost of x86 in its "x86 everywhere" campaign: http://developer.amd.com/assets/WinHEC2005_x86_Everywhere.pdf
That's AMD propaganda. Show me an x86 running efficiently at 500 MHz, 300 milliwatts, like any ARM micro.
Some day transistor integration will make it possible, but that's not the question. The question is that the industry could spend its money more wisely by using reasonable designs, and we would all win.
The compiler is too primitive to deal with it transparently. The programmer has to take care to write simple code (stuff that can be easily digested by loop unrollers and whatnot), avoid abstract constructs, and then profile to make sure the generated code really is good. On most modern x86 processors, you can get pretty decent performance with very straightforward code.
If you want, I will paste here a simple, naive piece of code that will put your x86 on its knees, stalling continuously for chunks bigger than 200 cycles each. Just ignore data locality and see what happens with "straightforward code".
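Presumably he means something along these lines (my sketch, not his code): a random-order pointer walk over a working set far larger than the caches, where every iteration is a dependent load that misses, so even a big OOO x86 spends most of its time stalled on memory:

#include <stdlib.h>

#define N (16 * 1024 * 1024)            /* far bigger than any 2006-era cache */

/* Build a random permutation and walk it as a chain: each load depends on the
 * previous one and lands at an effectively random address, so almost every
 * access is a cache (and TLB) miss. */
size_t *build_random_chain(void)
{
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    return next;
}

size_t chase(const size_t *next, size_t steps)
{
    size_t p = 0;
    while (steps--)
        p = next[p];                    /* dependent, cache-missing load */
    return p;
}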
Here’s my final two cents (you can buy 200,000 transistors with that).
Unfortunately, with 200,000 transistors I cannot have any decent x86. But I will accept an ARM/MIPS/SH instead, thanks.
I haven't been using computers that long in the grand scheme of things. Even in that short time, I've seen processors go from in-order, single-scalar cores to massively out-of-order superscalar ones. At the same time, I've seen software progress — well, I haven't seen the state of software progress one bit. Today's 'state of the art' software is stuff that was dreamed up in the 1980s. Perhaps there is hope that one day we'll have compilers that can take a "Parallel Python" program and distribute it over 1000 Cell-like cores. However, I'm not holding my breath.
OOO execution is much older than you think; take a look at history. History will tell if it's a good idea. From my POV there is no advantage in solving a problem with hardware if it can be solved in software without serious penalties.
Today's software is a disaster, yes, but that's hardly a hardware problem. I don't see x86 software being better than MIPS software because "it's easier to program" or something like that.