IBM, Sony and Toshiba have jointly ported Linux to the Cell processor, the 4GHz multi-core PowerPC chip they co-developed. The Cell CPU is slated to ship in Sony’s PlayStation 3 next spring, but is likely to appear before that in workstations, embedded computing devices, and supercomputers. The Cell’s Linux port includes a 64-bit PowerPC Linux kernel, along with a filesystem that abstracts the Cell’s independent vector processing units so that the Linux kernel can make use of them. The companies hope their Cell Linux port will be merged into the next mainstream Linux kernel release, 2.6.13.
Fedora running on Cell processors from IBM
http://www.martianrock.com/?p=83
Cell is like a vector processor. I like this type of processor better than the general type because of graphics and other difficult computing jobs.
http://www.blachford.info/computer/Cells/Cell4.html
http://www.gpgpu.org/
Now any hardware vendor can port Linux/GCC to its new chip and instantly get a plethora of apps working on its new platform. So hardware vendors can now innovate without being limited by software/compatibility issues.
I think the most important thing for them is to create C libs for various things (XML parsing, matrix calculation, image decoding/transformation…) using the SPEs in parallel. Sort of Cell-LowLevel-Libs. Only this way will they enable app writers to fully utilize the Cell architecture. Otherwise it is no more than a PowerPC. I don’t think developers would hack in Cell assembler to make it faster for a few users.
It is time to port OpenSolaris to the Cell too!
Yeah… let’s see you purchase an IBM server with Cell Processors first to run OpenSolaris on it.
I don’t like the filesystem design. I would have preferred a more generic “coprocessor” API.
We have OpenGL for GPUs, and it would be nice to have “OpenPL” for ALUs/DSPs etc., with functions like MatrixXXX, FFT, etc. (no need for the filesystem layer). There is a rough sketch of what I mean at the end of this comment.
The OpenPL driver could monitor function usage, and load a given algorithm into more SPUs if needed – and so on…
A process should not be allowed to “own” an SPU.
Have I missed something???
(Sorry about my English.)
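A minimal sketch of what such an “OpenPL” might look like; every name below is invented for illustration, no such library exists:

    /* Hypothetical "OpenPL" coprocessor API: a sketch only.
     * None of these names exist anywhere; they just illustrate a
     * driver that owns the SPUs and schedules work onto them, so
     * no process ever "owns" an SPU directly. */
    #include <stddef.h>

    typedef struct oplContext oplContext;   /* opaque, like a GL context */

    oplContext *oplCreateContext(void);     /* driver picks the SPUs itself */
    void        oplDestroyContext(oplContext *ctx);

    /* High-level operations: the driver can monitor usage and load a
     * hot algorithm onto more SPUs behind the caller's back. */
    int oplMatrixMultiplyf(oplContext *ctx,
                           const float *a, const float *b, float *out,
                           size_t n);                    /* n x n matrices */
    int oplFFTf(oplContext *ctx,
                const float *in, float *out, size_t len);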
I think the most important thing for them is to create C libs for various things (XML parsing, matrix calculation, image decoding/transformation…) using the SPEs in parallel. Sort of Cell-LowLevel-Libs. Only this way will they enable app writers to fully utilize the Cell architecture. Otherwise it is no more than a PowerPC. I don’t think developers would hack in Cell assembler to make it faster for a few users.
I would rather see general gcc optimizations; hand-asm-coded libs would be nice too.
Open source is so cool: with the port done, you could be running KDE/GNOME in no time. This is probably also one of the reasons Microsoft is pushing .NET, as it is the only way to provide platform independence while still being closed source.
Well, let’s see what Linus and the other kernel hackers have to say. I’m quite confident that it won’t get into the mainline as long as it doesn’t fit well.
I don’t like the filesystem design. I would have preferred a more generic “coprocessor” API.
That’s the Unix standard, unfortunately: you open a device up and then dump API calls through the device.
Personally, I think people should ditch the “overload the FS as a generic directory system” concept and push a generic directory system that can overlay a file system instead…
people should ditch the “overload the FS as a generic directory system” concept and push a generic directory system that can overlay a file system instead…
Plan 9 has FS as a generic directory system, so it must be right 🙂
Can the Cell processor be emulated on x86?
“…with the port done, you could be running KDE/GNOME in no time. This is probably also one of the reasons Microsoft is pushing .NET, as it is the only way to provide platform independence while still being closed source.”
Running, yes; taking advantage of the Cell architecture, no. Cell is a different beast, and I have to agree with BLCBlob that “the world” needs a new “OpenWhateverL” to address this. It would also work quite nicely with current SSE/AltiVec and give developers the benefits of specialized vector hardware without having to optimize in assembly. Cell needs something else as well; I guess scheduler support wouldn’t hurt, instead of a filesystem design (scheduling instructions to SPEs on Cell is not my forte, so I might be way out of the ballpark here, though).
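As a taste of what that buys you even today, here is vector addition from plain C using the real SSE intrinsics (xmmintrin.h); AltiVec has an equivalent (altivec.h), and an “OpenWhateverL” would hide even that choice:

    /* Add four packed floats with SSE intrinsics: vector hardware
     * driven from C, no hand-written assembly needed. */
    #include <xmmintrin.h>

    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);             /* load 4 unaligned floats */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* out[i] = a[i] + b[i] */
    }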
It never was about platform independence for MS. It was about giving developers what they needed while taking care of compatibility/security issues. MS has an excellent piece of software in Visual Studio, and MS DID like what Java provided (which is what got the whole Sun vs. MS shebang started). MS were/are not allowed to ship Java with Windows according to the settlement (yes, it was MS’s fault for using their own add-ons, so let’s not get into that debate). Because of that they sorta had to invent a Java clone that was sufficiently different, improving where Java was lacking, yet still a bit more C/C++-like (yes, it is my opinion that C# leapfrogged the Java language (sans the libs, of course) and has actually improved Java at a faster rate than would have happened otherwise; win-win). Now, to stop developers from going to Java they had to supply a rich framework (that is, after all, the biggest strength of Java, even though it’s getting a bit too big), and so .NET was born. Rotor was more a proof of concept to get C# through ECMA, IMHO, as MS have not moved an inch towards making .NET more open afterwards. I think Mono and Portable.NET are doing excellent work, though.
That’s the Unix standard, unfortunately: you open a device up and then dump API calls through the device.
But this is no standard device we’re talking about, so to me it sounds as if this would give awful performance. Going through the driver layer sounds baaaad in this case. I may be wrong, though, as I’m not a kernel hacker.
Bye bye Apple… we didn’t need you on PPC…
This is really sweet news!
Sure it can be emulated. But it will be even slower than PearPC
“Bye bye Apple… we didn’t need you on PPC…
This is really sweet news!”
Please don’t start reacting stupidly! Although the Cell seems to be an interesting approach for specific-purpose computing, everyone has to remember that the PPE inside the Cell is an in-order design; this means that it will run branch-heavy code much more slowly than a G5 or any other out-of-order chip. The Cell is not gonna run any general-purpose application faster than any other chip; it will even run them much more slowly. That’s a matter of fact; that’s by design.
If the Cell were the chip screamer everyone seems to think it is, Apple would have chosen it instead of going through a transition to Intel, especially knowing that Apple has developed many strong APIs (the Accelerate framework) that allow developers to do vector computing for scientific applications, image processing, and so on. We could imagine them porting those APIs to the Cell… as the SPUs are vector units.
Apple did not choose the Cell because of this in-order design, which makes the processor not very suitable for general-purpose computing: many applications do not get any benefit from vectorizing the code, and they will run slowly on an in-order design.
>Sure it can be emulated. But it will be even slower than PearPC
Better than nothing. PS3 will arrive in 2006.
How compatible is the Cell with old PowerPC or POWER technology, and with PearPC?
The IBM guy at LinuxTag presented Linux on a Cell and how they did the port…
Because open source programmers profit from others’ work too. They give me something, I will give them something. Because it’s fun to do what you like to do. Because…
Although the Cell seems to be an interesting approach for specific-purpose computing, everyone has to remember that the PPE inside the Cell is an in-order design; this means that it will run branch-heavy code much more slowly than a G5 or any other out-of-order chip.
It isn’t “branch code” per se that hurts in-order designs.
In-order designs are perfectly capable of branch prediction and speculative prefetching, predication, etc.
OOO processors are better than in-order processors at dealing with data dependencies (i.e. they don’t need so much complexity in the compiler) and highly variable latency loads. Although these days even OOO is not able to cope with memory latency – hence we are seeing an emergence of multi-threading.
IA64 is an in-order processor, and until POWER5 it was probably the fastest general purpose processor money could buy (and still is for FP intensive work).
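To make the data-dependency point concrete, here is a toy C example (mine, not from any Cell documentation): an OOO core can overlap iterations of the first loop by itself, while an in-order core needs the compiler, or the programmer, to break the serial chain as in the second.

    /* One serial dependency chain: every add waits on the previous one. */
    float sum_serial(const float *x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Two independent chains the scheduler can run in parallel
     * (assumes n is even, for brevity). */
    float sum_split(const float *x, int n)
    {
        float s0 = 0.0f, s1 = 0.0f;
        for (int i = 0; i < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        return s0 + s1;
    }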
Plan 9 has FS as a generic directory system, so it must be right 🙂
Plan 9 has some pretty good ideas in there; shame it’s just not catching on all that much outside of Plan 9… maybe once everyone’s gotten on board with the metadata ideology in Mac OS X 10.4, BeOS and WinFS (whenever it ships), that’ll catch on.
But this is no standard device we’re talking about, so to me it sounds as if this would give awful performance. Going through the driver layer sounds baaaad in this case. I may be wrong, though, as I’m not a kernel hacker.
Depends how the devices are used. On the first PS2 Linux distribution they had libps2, which was pretty nasty; then some developers made SPS2 ( http://playstation2-linux.com/projects/sps2/ ), which was a lot better and much more efficient, so it works out pretty well. If you code badly, expect performance to suffer accordingly. Having bad tools doesn’t help either.
If you need raw speed after that, you can boot into no OS at all and access the hardware directly to get maximum speed. Homebrew developers prefer that (i.e. a to-the-metal coding style). Development time suffers if you’re inexperienced, of course…
Of course it does! As far as I could see, the in-order PPE implements a much simpler branch prediction design, and of course you can be almost sure that code with a lot of branching will run slowly.
The IA64 architecture is quite a different approach. Instead of doing heavy work (branch prediction, speculative prefetching…) at the level of the CPU to ensure out-of-order execution, IA64 instead relies on the compiler for this task. Even before the program is fed into the CPU, the compiler examines the code and makes the same sorts of decisions that would otherwise happen at “run time” on the chip itself. Once it has decided what paths to take, it gathers up the instructions it knows can be run in parallel, bundles them into one larger instruction, and then stores it in that form in the program—hence the name VLIW, or “very long instruction word.”
So we are not talking about the same thing with IA64. It does not mean that IA64 cannot execute instructions out of order; it actually does…
And another point about this PPE: the core is only capable of issuing 2 instructions per cycle to the execution units (I guess that corresponds to the two threads), which is far below a G5, which can issue up to 8 instructions per cycle… That’s a much less superscalar-capable core, and be sure that any general-purpose app will suffer from it…
You guys don’t think IBM is working on a more general-purpose version of the Cell to be used in servers, and is simply using the gaming consoles as its testing ground, do you???
Of course it does! As far as I could see, the in-order PPE implements a much simpler branch prediction design, and of course you can be almost sure that code with a lot of branching will run slowly.
In-order and out-of-order refer to maximising the use of processor resources (such as registers) with respect to instruction scheduling. Branch prediction is a separate issue; whether the CPU is IOE or OOE isn’t likely to affect branch prediction that much from what I can tell. (Unless there’s some heavy, complex interaction with the scheduler, which isn’t a problem on an IOE chip.)
To be honest, OOE is both a good and a bad thing for code. It is good in that sloppy code, or code optimised for an earlier model CPU, can be rescheduled for faster speeds, especially if there are extra hidden execution units tucked away (à la Intel/AMD), but it’s murder on chip real estate, power, complexity and debugging. A friend of mine has to code at least 5 different versions of his assembly for different chips to figure out which executes the fastest, due to the OOE differences per chip.
When you switch to IOE, as you say, the load is shifted to the compiler to schedule the instructions properly. In the case of Cell, since there’s no historic baggage and the chip is well defined, this isn’t an issue. Future revisions may become problematic if programmers slack off and refuse to reschedule their code per processor (not a problem for OSS, but a big problem for closed-source code; not a big deal for the PS3, however). Debugging also becomes a lot easier, as there’s no ambiguity about resource use. Examples such as Transmeta and Itanium show that it’s possible, but it needs a lot of developer support to get right.
That’s a much less superscalar-capable core, and be sure that any general-purpose app will suffer from it…
Well, that may be the case – I’m looking forward to getting my hands on a Cell with Linux to tinker with and see how it compares. There are other factors besides how scalable the core is that will probably balance it out quite nicely in practice, theory aside.
Branch prediction is already helpful when you are processing instructions in a pipeline.
A friend of mine has to code at least 5 different versions of his assembly for different chips to figure out which executes the fastest, due to the OOE differences per chip.
Are you sure that’s not due to different instruction set versions (e.g. MMX, SSE/2/3) or different cache sizes?
While clever scheduling could make a huge performance difference on an in-order design like the original Pentium, the instruction windows on today’s OOO designs are sufficiently big that scheduling is a non-issue.
It is good in that sloppy code can be rescheduled for faster speeds
That’s a slur.
Code isn’t “sloppy” just because it hasn’t been specially scheduled for a particular processor implementation. Quite the contrary in fact, the lack of scheduling contortions means it uses fewer registers and is more readable (e.g. in a debugger).
All can be emulated; another matter is the speed. Probably you need to wait at least until a 32 GHz x86 appears to emulate the 4 GHz Cell processor at full speed.
All can be emulated; another matter is the speed.
And memory.
Probably you need to wait at least until a 32 GHz x86 appears to emulate the 4 GHz Cell processor at full speed.
Only if you make full use of the Cell’s SPEs. With binary translation, current x86s shouldn’t have much trouble emulating the Cell’s PowerPC core at full speed.
Besides, x86 development isn’t gonna stop at dual-core.
I don’t like the filesystem design. I would have preferred a more generic “coprocessor” API.
The filesystem abstraction is the lowest level of the overall architecture. That makes sense, given that’s how UNIX treats pretty much all hardware — as an object in FS space. Now, that doesn’t mean you have to use the filesystem abstraction directly. Something like your OpenPL would be layered on top of the filesystem abstraction, with the former handling the details of parallelizing algorithms and the latter handling the details of uploading stuff to the hardware.
This is actually how OpenGL works, btw, on *NIX systems.
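From user space that ends up as ordinary file I/O, roughly like this (the path and file layout here are invented for illustration; the real interface is whatever the spufs patches define):

    /* Sketch of the "SPU as files" idea: uploading a program is just
     * open() and write(). The /spu path below is hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char program_image[4096] = { 0 };  /* dummy SPU program */

        int fd = open("/spu/myctx/mem", O_WRONLY);  /* one file per context */
        if (fd < 0)
            return 1;
        write(fd, program_image, sizeof program_image);
        close(fd);
        return 0;
    }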
Are you sure that’s not due to different instruction set versions (e.g. MMX, SSE/2/3) or different cache sizes?
This is definitely just plain-Jane 80x86 code, made to run on the Pentium and above. Cache sizes do have a small impact, though. For example, to test effectively you need to load up the L1/L2 cache, run the same code a few thousand times, and then you get a pretty accurate picture of how well it runs. (Oh yeah, and you need to run it across many versions of the chip.)
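That measurement pattern looks roughly like this in C (my sketch; __rdtsc() is a real x86 intrinsic on gcc/clang, and the body being timed is a stand-in):

    /* Warm the caches, then time many runs with the timestamp counter. */
    #include <stdio.h>
    #include <x86intrin.h>

    static volatile int sink;          /* keeps the loop from being optimized away */

    static void code_under_test(void)  /* stand-in for the code being measured */
    {
        int s = 0;
        for (int i = 0; i < 256; i++)
            s += i;
        sink = s;
    }

    int main(void)
    {
        enum { RUNS = 10000 };

        code_under_test();             /* warm-up pass: load L1/L2 first */

        unsigned long long start = __rdtsc();
        for (int i = 0; i < RUNS; i++)
            code_under_test();
        unsigned long long cycles = __rdtsc() - start;

        printf("%.1f cycles per run\n", (double)cycles / RUNS);
        return 0;
    }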
While clever scheduling could make a huge performance difference on an in-order design like the original Pentium, the instruction windows on today’s OOO designs are sufficiently big that scheduling is a non-issue.
For most people that would be accurate. However, when speed and execution timing are important (i.e. real-time OSes, emulation), the OOO design makes it very, very difficult to determine what the fastest code is for any one type of architecture, as it varies from chip to chip depending on what code was executed previously, what resources are in use (or about to be used), L1/L2/L3 cache loading, and how well the chip runs the code timing-wise.
For example, the Athlon has a shorter, more effective pipeline for FP than the Pentiums, so scheduling can be tightened up a bit in general, while Pentium scheduling needs to handle extra delays. The shifter in the P4 is also different (it executes the same, though), so the extra delay needs to be accounted for relative to an Athlon.
In practice most people would simply throw on the -O2/-O3 and -mcpu/-mtune flags when recompiling code (say, OSS), but if you have a commercial product, then unless you provide different versions per chip (not always viable), having code that runs as well as possible on each chip is sometimes a worthwhile investment.
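For instance, building one binary per chip from the same source with gcc’s real tuning flags (the output file names are just examples):

    gcc -O2 -mtune=pentium4  -o app-p4     app.c
    gcc -O2 -mtune=athlon-xp -o app-athlon app.c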
Anyhow, on an IOE chip there is no room for ambiguity, as all the work is thrown onto the compiler to figure this out. So it remains to be seen how “clean” the Cell chip will stay once a few variants start coming out that tweak it up a bit; we’d be back to the same old game as above if they make major changes.
Best to email for further details if you wish to discuss further.
That’s a slur. Code isn’t “sloppy” just because it hasn’t been specially scheduled for a particular processor implementation. Quite the contrary in fact, the lack of scheduling contortions means it uses fewer registers and is more readable (e.g. in a debugger).
Depends on your point of view. If I wanted the fastest code possible, easy-to-read, maintainable assembly code would be considered sloppy/bad (if handwritten by a beginner assembly programmer, this would be doubly true). If I wanted the most straightforward code, then the opposite would be true. In this day and age of people claiming faster, better, quicker all the time, simplicity usually loses out first. (Yet they never seem to grasp that faster is sometimes not better, and complexity is not good.)
To make my personal position clear, I prefer the clearest, most readable code possible, in all things. (I’ll be using Ada/Delphi for most projects where I can these days). It’s only the complexity of the tools and hardware we are usually forced to use that makes us compromise on that principle.
The ONLY reason Apple did not choose the Cell processor is this: what happens to Apple when Cell goes belly-up, just like the PowerPC did? Cell probably won’t be in production 5-8 years down the road, once the PS3 chips are no longer needed.
There is only one chip architecture that has been around in some form or another since before Apple even had Macintoshes. x86 chips are here to stay; they will always run x86 code.
By going with x86, Apple is removing the problem of finding suitable chips for its computers. Apple only wants to have to change chips once. They want to get it right the first time, and they are making the right choice by going with the most popular chip type.
“What happens to Apple when Cell goes belly-up, just like the PowerPC did?”
That is quite the accomplishment: not only is the “PPC has died” premise false (making the Cell –> PPC part a non sequitur), but the “Cell != PPC” assumption is also false.
Thanks for the smile and chuckle.
They called it “the 4GHz multi-core PowerPC chip.” This is the first time I read that it was a PPC chip. Did Apple just shoot itself in the foot? If not, give me a gun, someone, PLEASE, I will shoot Apple’s foot. Not to get off topic, but how much longer will Apple support the old PPC OS 10.4 before dropping it?