http://www.blachford.info/computer/Cells/Cell6.html
The author of the last Cell article posted some corrections and a rebuttal to the rant…
-b
It’ll be interesting to see if it actually lives up to all the hype. Personally I’m thinking no, on the basis that over the last thirty years nothing has revolutionised computing instantly like these are promising to.
Intel and AMD aren’t just going to sit there while these stomp all over them; either they’re thinking about some sort of competitor or (more likely) they don’t think they’ll pan out as predicted.
Time will tell.
What I found interesting, and I guess I had missed it in previous articles, is the feature the Cell processor has of actually parallelizing a program and spreading it across any Cell device. So when I’m encoding a movie on my computer, my television, my PlayStation 3, and my toaster (with wireless adapter! haha) could all be adding their cycles to the job. That would be amazing: parallel computing available to the common person.
I think people are taking Cell somewhat out of context. If you compare Cell’s 1 teraflop performance to a P4’s few gigaflops, then it does seem like it’s claiming to revolutionize computing. But consider that modern GPUs can do 50-100 gigaflops, or 10-20x as fast as most CPUs. Something 10x faster than that, arriving well over a year from now and manufactured on a much better process with hand-optimized circuits*, doesn’t seem so unbelievable.
*) Hand-optimized in the sense that CPU circuits are usually laid out by hand, to allow higher clock speeds, while GPU circuits are usually laid out by a computer, which limits their clock speed.
Cell clearly makes a very sophisticated building block for distributed grid computing too. “A hypothetical Cell processor with eight of these APUs could achieve 32 BOPS and 32 gigaFLOPS at only 250MHz,” writes Tom. Or a teraflop at 1GHz.
The math is clearly off here. That’s superlinear scaling there! The quoted numbers come from somewhere else. Cell is supposed to achieve 32 gigaflops per APU at 4GHz. I don’t know quite how that works (4GHz x 4 FPUs = 32 gigaflops?), but they might be counting vector ops as more than one operation (but with 128-bit vectors, that’s 64 gigaflops per APU!). A standard Cell setup is supposed to have 32 APUs (4 PEs with 8 APUs each), which is 32×32 = 1024 gigaflops.
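(A quick Python sanity check on the scaling, mine and not anything from the article: scaling the quoted 250MHz figure linearly with clock gets nowhere near a teraflop at 1GHz.)

    # If 8 APUs really did 32 GFLOPS at 250MHz, scaling linearly with clock:
    gflops_at_250mhz = 32
    gflops_at_1ghz = gflops_at_250mhz * (1000 / 250)
    print(gflops_at_1ghz)   # 128 GFLOPS, nowhere near 1000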
This isn’t instant; symmetric multiprocessing is quite old. This is more like the perfection of Beowulf:
Affordable supercomputing with many processors.
So it wouldn’t be overnight even if it did win overnight. And RISC didn’t take too awfully long before Intel revised their designs to be more load-store.
>The math is clearly off here. That’s superlinear scaling there!
Is a bit suspect yes…
As for the 32 GigaFlops (which comes directly from the patent):
4GHz X 4FPUs = 16 GFLOPS but… There’ll most likely be a Multiply-Add instruction which is counted as 2 hence 32 GFLOPS.
>A standard Cell setup is supposed to have 32 APUs
>(4 PEs with 8 APUs each), which is 32×32 = 1024 gigaflops.
That’s mentioned in the patent but I think it’ll be far too big right now to be economical. I don’t think we’ll see that until 45nm.
Ah! That makes sense! But who counts an FMAC as two instructions, but a 128-bit vector MUL or ADD as 1? The CPU manufacturers all count a 128-bit vector operation as 4 operations. With regards to 4 PEs being too big, we’ll see. I don’t think it’s too big, given how simple the APUs are. Let’s say, conservatively, that each APU is half the size of a G5. That’s about 30m transistors. That makes about a billion transistors for the APUs, and maybe another 200m more for the PUs. 1.2bn transistors is quite a bit less than the 1.7bn that Intel’s Montecito is projected to have. PS3 material? Maybe not, but I wouldn’t be surprised to see a workstation-class machine from IBM with that setup.
How much of that 1.7bn transistors is cache? A lot is my guess, and cache transistors are much easier to make.
No duh it is parallel. Graphics/media is a perfect application for consumer parallel processing, but the only thing that will make adding 1+1 together faster is a faster processor. Cell technology is simply going to make distributed computing a reality… that is it. Although the technology seems really awesome, nothing lives up to the hype… and that is a good thing, since it will drive the cost down after they realize it is not the revolution they think it is.
>Ah! That makes sense! But who counts an FMAC as two instructions,
>but a 128-bit vector MUL or ADD as 1?
>The CPU manufacturers all count a 128-bit vector operation
>as 4 operations.
1 32-bit operation per FPU gives 4 operations per cycle (8 for a multiply-add).
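To make the counting concrete, here’s a rough Python sketch; the 4 FPUs per APU, the 4GHz clock and the 4 PEs × 8 APUs configuration are the patent figures quoted above, and counting a multiply-add as two operations is the assumption being discussed, not an official number.

    # Back-of-the-envelope peak FLOPS for the rumoured full Cell configuration.
    # All inputs are figures quoted from the patent/articles above, not measurements.
    CLOCK_HZ      = 4e9   # 4GHz target clock
    FPUS_PER_APU  = 4     # four 32-bit FPUs, i.e. one 128-bit vector unit
    OPS_PER_FMADD = 2     # multiply-add counted as two operations
    APUS_PER_PE   = 8
    PES_PER_CELL  = 4

    per_apu  = CLOCK_HZ * FPUS_PER_APU * OPS_PER_FMADD   # 32 GFLOPS
    per_chip = per_apu * APUS_PER_PE * PES_PER_CELL      # ~1 TFLOP
    print(per_apu / 1e9, "GFLOPS per APU")
    print(per_chip / 1e12, "TFLOPS per chip")

Counted that way, the 32 GFLOPS per APU and the 1 teraflop figure for a 4-PE part both fall out; count a 128-bit vector FMADD as a single instruction instead and everything drops by a factor of 8.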
>With regards to 4 PEs being too big, we’ll see. I don’t think it’s too big, given how simple the APUs are.
>Let’s say, conservatively, that each APU is half the size of a G5. That’s about 30m transistors.
>That makes about a billion transistors for the APUs, and maybe another 200m more for the PUs.
>1.2bn transistors is quite a bit less than the 1.7bn that Intel’s Montecito is projected to have.
Logic transistors are different from cache transistors, so they’re not directly comparable.
In any case I think the APUs are going to be much smaller; I’m expecting them to be very simple, with no out-of-order execution or anything like that. I also expect a very stripped-back instruction set, as I’d imagine they’d have problems getting AltiVec to 4GHz+.
Actually, I believe the transistors in the cache and logic gates are the same CMOS transistors. It’s just that cache transistors are easier to pack densely, because of the simple and regular wire routing between them. But the point is well taken; your point about the APUs being much smaller is probably accurate.
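For what it’s worth, the earlier back-of-the-envelope transistor budget works out like this in Python (the 30m-per-APU and 200m-for-the-PUs figures are the guesses from the post above, not published numbers):

    # Rough transistor budget for the hypothetical 4-PE / 32-APU Cell,
    # using the guessed sizes from the discussion above.
    APU_TRANSISTORS = 30e6    # guess from above: about half the size of a G5
    PU_TRANSISTORS  = 200e6   # guess: the PUs plus glue logic
    NUM_APUS        = 32      # 4 PEs x 8 APUs

    total = NUM_APUS * APU_TRANSISTORS + PU_TRANSISTORS
    print(total / 1e9, "billion transistors")   # ~1.16bn vs the ~1.7bn quoted for Montecito

And as pointed out above, much of Montecito’s count is cache, which packs far more densely than the APUs’ logic would, so the comparison is looser than the raw numbers suggest.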
“We’ve managed to glean some inside tidbits. A single Cell chip is expected to be capable of surpassing 250 billion floating point operations, or 250 gigaflops, per second”
“Each Cell chip will have between eight and ten separate processing cores on one piece of silicon (a final decision is pending)”
“Sony and IBM have announced plans for a workstation combining multiple Cells that, acting in concert, will reach 16 trillion flops, ranking alongside the world’s top ten supercomputers. It will be aimed at engineers and Hollywood animators. This figure is “probably a p.r. exaggeration,” the Cell engineer says, but future workstations containing racks of 32 chips will be able to attain this speed.”
http://news.yahoo.com/news?tmpl=story&u=/fo/20050127/bs_fo/e77d9f1f…
I think the main point of Cell computers is to be able to upgrade your computer’s speed by buying another one and adding it…
In the case of a PS3, when it reaches its maximum gaming performance you just link another one and you can play the next generation of games.
Just imagine the market you’ve just created!!!!
Of course, the other point about the Cell architecture is that it’s expressly designed not to _let_ you encode that movie if the MPAA doesn’t want you to.
The speculation is pretty rampant. “Neo” at Macsimum News seems to think a Cell-based Mac could be shown off next week. Meanwhile The Register gives a fairly even-handed review of the situation. However, considering IBM’s current direction, I assumed they would push Linux on Cell, an assumption The Register does not seem to share; they go as far as calling Linux “stoneage technology”.
Regardless, I have to say I’m mixed. I’d love to see Intel and Microsoft go away, but I’m not sure I view hardware-based DRM as a saviour. And besides, it seems there would be little, besides ability (smile :) , it’s a joke), stopping MS from porting their apps to Cell.
-b
It all sounds great, but in practice I wonder what it is going to be like to develop for a system of this type? Not to mention having network latency to contend with; it will only be useful for some of the slower, CPU-intensive stuff.
Linux is stoneage technology. And so are WinXP & Mac OS X. Win9x & the old Mac OS were bone-age technology.
I don’t think Linux is appropriate for Cell. Linux (and UNIX in general) is really designed to schedule lots of threads across lots of identical processors. They have complex systems for virtual memory, etc. Cell won’t be structured that way. Actually, I think a message-passing architecture would be far more suitable for Cell. They already have outlined some sort of RPC mechanism to submit code and data to APUs, so I think IBM has already thought of that.
ISSCC is starting today, so here are my predictions:
APUs will contain local memory.
APU ISA is highly stripped-back Über-RISC, not AltiVec.
Internal caches for APU-APU transfer (possibly within the APUs).
PU will be POWER5-based (AltiVec probably not included).
Peak performance just below 300 GFLOPs
Closer to 100 GFLOPs in normal usage.
Stream processing may get close to theoretical max.
FFTs will go like a bat out of hell.
So will Linpack.
I am also expecting them to be stupidly expensive at launch, but for prices to drop rapidly.
I don’t predict Apple using them as a main CPU anytime soon, but I expect they’ll be used as co-processors in PowerMacs within a year.
I *do not* predict having to make any significant changes to my article 🙂
>I don’t think Linux is appropriate for Cell. Linux (and UNIX in general) is really designed
>to schedule lots of threads across lots of identical processors. They have complex systems
>for virtual memory, etc. Cell won’t be structured that way.
The PU is a standard PowerPC so it should run Linux no problem. Linux itself will not run on the APUs but there should be a layer on top for utilising them.
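Purely as a guess at what that layer might look like, here’s a toy Python sketch of the submit-and-collect model the patent’s APU RPC mechanism seems to imply. Every name in it (ApuJob, ApuPool, submit) is invented for illustration; none of this is a real IBM API.

    # Hypothetical sketch of a message-passing layer on top of the APUs.
    # Linux runs on the PU; the APUs get handed self-contained jobs (code + data)
    # and return results, rather than being scheduled like ordinary CPUs.
    # All names here are made up for illustration.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ApuJob:
        kernel: Callable[[bytes], bytes]   # stand-in for the APU code
        data: bytes                        # input that would go into APU local memory

    class ApuPool:
        """Toy stand-in for the runtime layer that farms jobs out to APUs."""
        def submit(self, job: ApuJob) -> bytes:
            # A real layer would DMA the code and data into an APU's local store,
            # start it, and signal the PU when the result is ready.
            return job.kernel(job.data)

    # The PU-side program never touches an APU directly:
    pool = ApuPool()
    print(pool.submit(ApuJob(kernel=lambda d: d.upper(), data=b"frame 42")))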
pojo: if the application is designed for it AND the bottleneck is pure floating point performance, not network bandwidth.
For example, Pixar should be salivating over CELL right about now.
Here is some useful “Cell” information on Wikipedia:
http://en.wikipedia.org/wiki/Cell_chip
It is concise and has a number of links at the bottom, including to the patents (#6,526,491).