“If it’s the number of cores that gets computers excited these days, then IBM may have its hands on the ultimate processor. Together with Rapport, a Silicon Valley startup, the company previewed the Kilocore1025, a processor with a total of 1025 cores that promises not only to boost processing speed but also to operate at low power levels.”
I'll bet that as soon as that CPU is available, IBM will have a version of Linux running on it. If this thing becomes popular, there will be one more market dominated by Linux, as I don't see MS having something ready soon, if they're even interested.
PS: The title is short 1,000 cores…
Hey, you hardcore Atari and C64 demo coders (me included)! Your skill-conserving wait on the demoscene is finally over. Payday is comin'!
"Hey, you hardcore Atari and C64 demo coders (me included)! Your skill-conserving wait on the demoscene is finally over."
Speaking of which, today I saw the movie Tron playing live on an Atari 800XL. The movie was converted to 16 shades of greyscale, but it played from Mr Atari's hacked hard disk, on a computer with a 1.67 MHz processor.
Now imagine a Beowulf cluster of those…aka 1024 8-bit computers
You can see the digitized Matrix at the bottom of this page:
http://mr-atari.com/
It took him 3 hours to decode the video (without the sound!) on an Atari emulator running at 5000% speed.
I think the title should read “…1025 Cores”
They want their processors back.
8-bit cores? What are you going to do with 1025 NESs (except play a hella big game of FF1)?
You need to know the history of computers, and supercomputers, to truly understand this. A lot of the things brought to the mass market have a long history in high-performance computing.
This 1024+1 multicore chip is quite different from symmetric multicore chips such as Niagara, the Pentium D, and the X2, and different from Cell too. The 1024 minicores are probably not even meant to execute code or run threads, but instead to work on data. They're probably interconnected in a network, perhaps a torus, or whatever, so each can do a little work (perhaps byte-sized; a byte is 8 bits, get it?) and pass the data on to the processor next to it. The network enables very powerful data manipulation with very little power consumption. The PowerPC core is obviously the controller.
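Since this is all speculation anyway, here's a purely illustrative C sketch of the "do a little work and pass it on" idea: a ring of byte-wide processing elements, each applying one tiny operation per cycle and handing its byte to its neighbour. None of the names or operations come from IBM or Rapport; it's just a toy model.

/* Toy model of the speculated design: a ring of byte-wide processing
 * elements (PEs). Each cycle, every PE applies one tiny operation to
 * its byte and hands the result to its neighbour. Purely illustrative;
 * the real Kilocore interconnect and instruction set are not public. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N_PE 16                      /* small ring instead of 1024 PEs */

typedef uint8_t (*pe_op)(uint8_t);   /* one byte in, one byte out */

static uint8_t add_three(uint8_t b) { return (uint8_t)(b + 3); }
static uint8_t xor_mask(uint8_t b)  { return b ^ 0x5A; }

int main(void)
{
    pe_op   ops[N_PE];
    uint8_t regs[N_PE] = {0}, next[N_PE];

    for (int i = 0; i < N_PE; i++)        /* alternate two trivial ops */
        ops[i] = (i % 2) ? xor_mask : add_three;

    regs[0] = 42;                         /* inject one byte of "data" */

    for (int cycle = 0; cycle < N_PE; cycle++) {
        for (int i = 0; i < N_PE; i++)    /* each PE: compute, then pass on */
            next[(i + 1) % N_PE] = ops[i](regs[i]);
        memcpy(regs, next, sizeof regs);
    }
    printf("byte after one trip around the ring: %u\n", (unsigned)regs[0]);
    return 0;
}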
Since your imagination seems to be better than mine :) (no insult intended; read my previous comment), here's the question.
Wouldn't this just be Cell tech applied differently? Smaller cores, in larger quantity: while Cell uses 8 large SPUs, this would use 1024 small ones.
So, just to check whether I understood correctly: the +1 core would act like the Cell's PPU and control the rest. Second, there would be pipelines (a network, torus, whatever) between them, like there are between the Cell SPUs.
Well, if this design allowed code-controlled grouping (networked instructions, so to speak) of the 8-bit cores (Cell supports this in some way; you can pipeline all 8 SPUs together), it would basically mean an X-bit monster chip. But I think this is just my pipe dream and I got lost somewhere along the way. By the way, I can't figure out how large calculations would be done on 8-bit cores if this feature isn't there.
As I already said once, I lack the imagination to figure this one out :)
I’m no CPU guru, and I couldn’t find much information on the webpage, so I can only speculate.
I believe the Kilocore is different from Cell in that there is no local RAM dedicated to each 8-bit core. The single-cycle reconfigurability also seems unlike Cell. While you can pipeline the Cell SPUs, you can expect each SPU to do a heck of a lot more processing on a data stream than these 8-bit cores, which (if my intuition is right) might do as little as a single operation on a single byte before passing it on to another 8-bit core. Cell's SPUs, by contrast, typically have to operate on much larger chunks of data for DMA transfers to be economical. I haven't looked closely at Cell inter-SPU transfers, so I'm guessing it's DMA only, which means some kind of large-ish latency compared to 1-cycle shuffling of single bytes.
I could have it all wrong, so feel free to correct me!
"8-bit cores? What are you going to do with 1025 NESs (except play a hella big game of FF1)?"
You could process a heckuva lot of text files really fast.
With heavy multitasking, i.e. if you run at least 12 applications (with 4 SIMD units, that is single instruction, multiple data, and with enough HDD feed), the OS, not the CPU power, will be the bottleneck. The OS might crash under the load and the number of interrupts it encounters every second. Loading and unloading applications from RAM also becomes a serious issue if the OS services don't guarantee that it is smooth, efficient, and leaves nothing behind in RAM (i.e. no leaking).
Few OSs will be able to support extreme multitasking without stability issues.
The blueprint showed a central PowerPC that was complemented by 1024 (that is one thousand and twenty four) 8-bit “processing elements” on a single and – according to IBM – low-cost die.
Now… as much as I love IBM and POWER, what could one do with 1024 8-bit cores? This goes beyond my imagination.
Although "25 GB operations/second at well under a single watt of power" smells much more of embedded devices than of PCs.
"Although '25 GB operations/second at well under a single watt of power' smells much more of embedded devices than of PCs."
I have no clue what they mean by 25 GB ops/s, but if they mean 25,000 MIPS, it would be comparable to the latest AMD Athlon FXs. Of course, those Athlons run 64-bit instructions, so if you divide 25 by 4 to get 32-bit instructions, it would still seem comparable to a modern-day processor.
http://en.wikipedia.org/wiki/Million_instructions_per_second
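Spelling that arithmetic out (on my reading that "25 GB operations/second" means 25 billion byte-wide operations per second; the announcement doesn't actually say):

\[ \frac{25 \times 10^{9}\ \text{8-bit ops/s}}{4\ \text{8-bit ops per 32-bit op}} \approx 6.25 \times 10^{9}\ \text{32-bit-equivalent ops/s}. \]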
And although I have absolutely no idea what they mean by "the processor will enable a user to view streaming live- and high-definition video on a low-power, mobile device at five to 10 times the speed of existing processors" (view HD movies at 10x fast-forward? Or even 10 at once?), the idea that this chip can do high-definition video means it might be an interesting mobile solution, although it would require a vastly different way of programming.
Apple announces a return to PPC architecture….
Some of the satellite-imagery processing I was doing 20 years ago might be a start.
Soon you'll be able to do it on your cell phone.
It only takes a few 8-bit instructions at a time to simulate a 32-bit instruction.
So here's the deal: if a computer can run each macro-instruction as several micro-instructions, then it can translate 32/64/128-bit instructions into their equivalent 8-bit instructions and cascade them horizontally, producing the same result as a G4/G5-series instruction set while using less power than the same work would take on the original G4-series processor. Best of all, with that many 8-bit cores running at once, there would probably be no speed penalty involved.
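Here's a minimal C sketch of that cascading idea: one 32-bit addition built out of four 8-bit additions with carry propagation. It only illustrates the principle; the names and structure are mine, not anything from the Kilocore documentation.

/* Sketch of the idea above: building one 32-bit add out of four 8-bit
 * adds with carry propagation, the way a narrow core (or a row of them)
 * could emulate a wider instruction. Illustrative only. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t add32_via_8bit(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry  = 0;

    for (int byte = 0; byte < 4; byte++) {
        uint8_t  a8  = (uint8_t)(a >> (8 * byte));
        uint8_t  b8  = (uint8_t)(b >> (8 * byte));
        unsigned sum = a8 + b8 + carry;        /* one 8-bit add per step   */
        result |= (uint32_t)(sum & 0xFF) << (8 * byte);
        carry   = sum >> 8;                    /* carry into the next byte */
    }
    return result;                             /* overflow past bit 31 dropped */
}

int main(void)
{
    uint32_t a = 0x12345678, b = 0xDEADBEEF;
    printf("%08" PRIX32 " + %08" PRIX32 " = %08" PRIX32 " (native %08" PRIX32 ")\n",
           a, b, add32_via_8bit(a, b), a + b);
    return 0;
}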
As for the remark about C64 and Atari demo coding: I agree those were more efficient designs back then, and I would have included the Amiga, even though it was a 16-bit machine running a 32-bit instruction set.
As some people on this thread have pointed out, such processors are mainly useful for crunching vast amounts of raw data in fairly simple ways. In principle, a chip like this is suited to the kinds of tasks performed by something like computational RAM: http://www.ece.ualberta.ca/~elliott/cram/
Now, the two designs are completely different: the IBM processor is byte-oriented and separate from memory, while the CRAM processing elements are bit-serial and integrated with the RAM. In principle, however, both designs should be suitable for similar tasks. Personally, I think that if you're going to simplify the processor down to an 8-bit design, integrating it with RAM is a far better approach, because of the vast increase in memory bandwidth such a design allows. 1024 8-bit cores would be far more impressive if each could have full-speed access to a RAM chip than if they all shared a conventional 5-10 GB/s bus.
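A rough version of that bandwidth argument, using the 5-10 GB/s figure above purely for illustration:

\[ \frac{10\ \text{GB/s shared bus}}{1024\ \text{cores}} \approx 10\ \text{MB/s per core}, \]

whereas a byte-per-cycle PE clocked at, say, 100 MHz (a made-up figure) would want on the order of 100 MB/s to stay busy, so per-core or in-RAM memory access wins by a wide margin.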
At
http://www.kilocore.com/overview.html
On the applications page they compare against an ARM7 at only 50 MHz and claim 10x the throughput, and against a PIII at 1.8 GHz they claim a 3x improvement, but the power savings are far greater.
The founders, though, seem to have some interesting backgrounds; remember Andrew Singer, of Think C / Lightspeed C?
This is a MIMD design; there is a pic of the 16×16 array, which uses 3.5 million transistors, so these PEs are pretty light: say 14K transistors per PE, which is basically doable with a real RISC design, especially if the datapath is 8 bits wide (presumably it's still a 32-bit architecture that just takes 4+ cycles per opcode). It doesn't say what the instruction set of these PEs is; perhaps it is a PPC subset after all. They're definitely much simpler than the Cell PEs.
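For what it's worth, that per-PE estimate checks out, assuming the 3.5 million transistors refer to the 16×16 array shown in the picture:

\[ \frac{3.5 \times 10^{6}\ \text{transistors}}{256\ \text{PEs}} \approx 1.4 \times 10^{4}\ \text{transistors per PE}. \]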
One of the advantages of going to high-cycle-count PE designs is that they hide latency very well, and the total demand on external memory is greatly reduced. These should be good for codecs, DSP, crypto, etc. (lots of ops per data fetch), but not much use for general-purpose threaded programs, I bet.
Ironically, it's not so difficult to cram almost 256 PEs onto the largest (and very expensive) FPGAs, but the power used would be enormous by comparison, so this thing may well have lots of applications where power consumption really counts, but not where it doesn't.
As for the '80s, I wonder how this thing is programmed: whether they learned anything from the Transputer and occam, or whether they're reinventing things, and how much (or what) they got from CMU.
Just a few years ago we were in the middle of the "megahertz myth" craze.
Marketing was telling us that the more MHz, the better, "forgetting" that a) clock rate and computational power are not linearly related, because of the various unavoidable bottlenecks in the architecture, and b) efficiency per clock cycle is not taken into account at all when looking at pure MHz.
Now the commercial guys are possessed by a similar craze: the "core myth".
Going back to the basics of computation theory, anyone must admit that 10 men cannot dig a hole 10 times faster than a single man, and 100 men will dig a hole neither 100 times faster than a single man nor 10 times faster than 10 men.
These xxxx-core architectures that Sun and IBM are developing are no longer (cost/power/etc.) efficient for *general purpose* computing, since 1000 cores will compute many non-parallelizable tasks SLOWER than a single complex core, and will compute the parallelizable tasks with a variable advantage over a more powerful single core, though nearly always very far from linear proportion to the (hyper)number of cores.
Ultimately, those designs are bad because they flatten the chip onto a single core design for both parallelizable and non-parallelizable tasks, which in fact need very different approaches (and designs) to be solved efficiently.
It would be better to have a more complex, faster-clocked design for computing non-parallelizable tasks, coupled with a radically different chip design for parallelizable tasks, which plainly need a wide pipeline and a very simple instruction set.
You are totally missing the point of parallel processors and parallel processing.
In the fully sequential world there are a couple of problems: the Frequency Wall, which finally hit Intel, and the Memory Wall, for which no one seems to think there is much of a solution other than ever more cache. When all the world's computer architects push single-threaded performance to the max, you tend to get the same complex solutions that are extremely disproportionate in cost for what little computing they produce. Over 30 years we have gone from 3 µm to 60-90 nm, giving about a 1000x improvement in transistor count per area; clock speeds also went from 3 MHz to 3 GHz, i.e. 1000x, along with 100x bigger dies. Do we see single-threaded CPUs 100,000,000x faster than the 6502 of the '70s? No, we don't, and we never will. In fact, today's P4s are only about 30x faster than the first Pentium 100 in general-purpose code, somewhat faster with MMX/SSE, but far from the 10^8.
In the fully parallel world you get the opposite effects: the Frequency Wall is traded for a Power Wall, and most of these high-N core designs clock 2-10x slower than single-core designs. At the logical extreme, FPGAs can be considered the ultimate parallel computing device; they run at typical maximum frequencies of 150-300 MHz, but their computation units work in a bit-wise fashion, and you can have several hundred thousand LUTs on a die.
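For anyone who hasn't seen it, the textbook CMOS relation behind that frequency-for-cores trade (nothing here is specific to the Kilocore) is

\[ P_{\text{dyn}} \approx \alpha\, C\, V^{2} f, \]

so N simple cores clocked at f/N, ideally at a lower supply voltage as well, can deliver the same total operation rate as one complex core at f while burning less energy per operation, since the voltage term is squared and simple cores need far fewer transistors per op.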
Highly threaded parallel designs can completely eliminate the Memory Wall effect; instead you have a Thread Wall: you must run a decent number of threads to reach peak performance. There are quite a few researchers out there who would laugh at this '100 men are not 100x faster' argument. It all depends on the problem you are trying to solve. It doesn't even matter if threaded CPUs are mostly idle, does it? Most x86 and PPC chips are mostly idle too; which is more efficient at being idle?
One thing that can easily be said for all parallel designs is that they get much closer to the higher efficiencies that Moore's Law suggests are possible, i.e. N transistors at F Hz burning W watts can do far more work than trying to serialize the whole darn computation through one bottleneck. In the extreme case it is probably possible to put several thousand processors on a die today, but nobody will do that, because there is no way for the package to have enough pins to bring enough memory bandwidth into it. With local caches, however, some of that bandwidth demand can be met.
The real problem lies with programmers that have no clue about concurrency!
"There are quite a few researchers out there who would laugh at this '100 men are not 100x faster' argument."
Please don't misinterpret my point: the logic of Amdahl's law is formally proven and pretty peacefully accepted; "100 men are not 100x faster" is not an "IMHO" statement, it's one of the founding laws of computing science, unless proven false.
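For reference, the law being invoked: if a fraction p of the work parallelizes and 1-p does not, then on N cores the speedup is

\[ S(N) = \frac{1}{(1-p) + p/N} \;\le\; \frac{1}{1-p}, \]

so no number of cores, 1024 included, can beat 1/(1-p) on a partly serial workload.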
“It all depends on the problem you are trying to solve.”
And this is my second point: regardless of whether the architecture was built as a general purpose CPU (able to solve complex tasks and be programmed in a satisfactory way), the architecture is no longer *general purpose* oriented when such a number of cores is taken into account.
I mean: a 1000+ core architecture, as you correctly agree, is good for some problems and not for others. In other words, it's a complete waste for non-parallelizable problems, and an unavoidable partial waste for any problem lying between totally parallelizable and totally non-parallelizable.
That is the reason why general purpose CPU designs tend to rely not on many cores but rather on complex ones with good IPC, while processors for specific tasks or specific homogeneous kinds of data (from DSPs to GPUs to dedicated math coprocessors) tend, on the contrary, to rely on simpler hardware with higher degrees of parallelism: this is because those processors are built with a specific set of problems in mind, to be solved more efficiently than with a generic CPU design.
In other words, I'm not criticizing the usefulness of parallelization; I'm pointing out that the very idea of a general purpose CPU is quite a stretch if the design is unavoidably biased toward being dramatically more efficient on a certain subclass of problems. That's the exact opposite of the "general purpose" CPU concept!
I would favour a more asymmetric architecture, with some complex cores for general purpose use and a lot of simple, specialized cores with a high degree of parallelism for the math-heavy fields where it's needed.
“The real problem lies with programmers that have no clue about concurrency!”
The real problem is the math: simply put, what is not parallelizable is not parallelizable!