Actually, there's more on processors today, but the processor in this article is so out of the ordinary that it deserves its own item. "Niagara has eight processing engines – called cores – each able to simultaneously execute four instruction sequences called threads. It's neither the first multicore processor nor the first to employ multithreading, but it embraces both ideas more aggressively than competing chips from IBM, Intel and AMD."
MTA is a good way to lower the memory wall effect on the processor side, but they could do better by attacking the wall directly. Design a threaded memory system that has massive, fully random issue throughput for small bits and pieces all over the entire address space rather than fewer huge cache-line bursts, and the MTA processor design can just eat that up. When you have such a memory, the MTA processors can be radically simplified even further, as I have seen. When all highly random memory cycles appear to be a few instruction slots rather than the 500 cycles he mentions, you can get a lot more than 8 cores on a chip and the throughput starts to get back closer to Moore's law for DRAM density (there's a clue there).
Massive numbers of threads are the trade-off for massive numbers of occasional huge wait states; I'll take the light threads any day!
I will venture there will come a day when OoO, superscalar, especially EPIC, VLIW, and even branch prediction will be set aside in favor of throughput MTA on both processor and memory. It's the memory cycles that cost; instructions have been free for a long time now, but locked up in these idiotically serialized designs.
transputer guy
Isn’t that what the scout was for?
I don't believe Niagara has scout. I also don't think scout specifically addresses the issues raised by the OP either (I could be wrong, I don't pretend to understand this fully). Niagara's attempts to mitigate some of the memory problems are having high-bandwidth, (relatively) low-latency memory, switching threads every clock, and parking on a miss. It also has a 12-way set-associative level-2 cache.
It doesn't have scout. Sun is partnering with Fujitsu for applications Niagara is not suited for right now. "Rock" will have scout, but I think that is only out in 2008.
Sun and most other companies, and almost all CPU EEs in the business, are complacent: as long as conventional SDRAM sits on a mobo, it's like a Fort Knox of 1Gbit, but it takes 40-60ns at the chip level to get any random piece of data out. For some reason, an Athlon or P4 going through the cache misses, the TLB misses, then the two-process-generations-older northbridge to get to DRAM raises that to 300ns or so, i.e. ~1000 cycles. If that happens on even 2% of memory cycles, well, do the math.
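To make that math concrete, here is a rough stall-cycle sketch in C. The 2% miss rate and ~1000-cycle penalty are the figures above; the memory-references-per-instruction rate and ideal CPI are my own assumptions:

```c
/* Rough stall math for the "2% of memory cycles" point above.
 * The miss rate and penalty are the post's ballpark figures; the
 * memory-references-per-instruction rate is an assumed value. */
#include <stdio.h>

int main(void) {
    double base_cpi     = 1.0;    /* ideal cycles per instruction (assumed) */
    double mem_per_inst = 0.3;    /* memory references per instruction (assumed) */
    double miss_rate    = 0.02;   /* 2% of references go all the way to DRAM */
    double miss_penalty = 1000.0; /* ~300 ns at ~3 GHz, per the post */

    double stall_cpi = mem_per_inst * miss_rate * miss_penalty;
    printf("effective CPI = %.1f vs %.1f ideal, i.e. a %.1fx slowdown\n",
           base_cpi + stall_cpi, base_cpi, (base_cpi + stall_cpi) / base_cpi);
    return 0;
}
```

With those assumptions the stall term alone adds ~6 cycles per instruction, a ~7x slowdown from a 2% miss rate.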
There is nothing intrinsically slow about DRAM if it's implemented well. Micron has a 576Mb RLDRAM2 that can complete a fully random access in 20ns per bank, but much, much better is that it can do 8 concurrent threaded accesses, starting one every 2.5ns and going to <2ns next year with RLDRAM3 (as long as banks don't collide). That is actually better than any L2 and many L1 caches that are out there. It costs about 2x your typical DRAM. NOT 1000x, 2x. It's just standard DRAM tweaked for more speed, a bigger bit cell, SRAM-type packaging and a streamlined interface. Its DDR I/O is similar to DDR3, but it's the issue rate that matters. High latency, high bandwidth is there just for the cache junkies. Lower the latency and the need for caching drops dramatically.
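A quick back-of-the-envelope on those RLDRAM2 figures (the numbers are the ones quoted above, not checked against a datasheet):

```c
/* Issue throughput implied by the RLDRAM2 numbers quoted above:
 * 20 ns random access per bank, 8 banks in flight, a new access
 * started every 2.5 ns as long as banks don't collide. */
#include <stdio.h>

int main(void) {
    double access_ns = 20.0; /* per-bank random-access time */
    double issue_ns  = 2.5;  /* interval between new accesses */
    int    banks     = 8;    /* concurrent threaded accesses */

    double per_bank_Mps = 1e3 / access_ns;      /* 50 M accesses/s per bank */
    double sustained    = per_bank_Mps * banks; /* 400 M/s if banks stay busy */
    double issue_limit  = 1e3 / issue_ns;       /* 400 M/s issue ceiling */
    printf("%d banks x %.0f M/s = %.0f M random accesses/s (issue cap %.0f M/s)\n",
           banks, per_bank_Mps, sustained, issue_limit);
    return 0;
}
```

Both limits land at about 400M fully random accesses per second, which is what makes the part attractive for a threaded memory system.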
An MTA processor that uses MTA DRAM can easily hide the remaining 20ns, which now looks like 20-40 or so clocks but maybe only 5-10 opcode slots (on 4-way MTA), not 500 cycles. But for some reason CPU guys would rather use crap DRAM and then invent 1000 complicated technologies which really don't work to get around the artificially imposed problem of using current DRAMs. So let's keep doing more of the same. It doesn't help that most are also locked into the n-level paging architecture from the 60s. That needs to go too.
If high-issue-rate DRAM can best SRAM caches, the question is why use caches at all. Well, they can still be useful for further reducing bandwidth load, even on a superb RLDRAM memory system.
Besides OoO, SS, BP and VLIW, big fat serialized SRAM caches should also go. SRAM caches can be designed that also have issue rates several times their apparent latency, which allows for a 3-level memory all using MTA. L1 would then be 2ns, 1MB or so, but otherwise highly bank-interleaved with randomized distribution. L2 as per RLDRAM. L3 as per SDRAM for remote storage, but with a far lower issue rate.
transputer guy
John Jakson
Hey everyone here seems to be called John
The problem isn’t the hardware, it’s the software. There are a lot of algorithms that are more dependent on bandwidth than latency (many graphics algorithms), and there are a lot of algorithms that are just plain non-parallelizable (many logic/analysis algorithms).
Current CPUs are designed to run current software. Current software has a hard time scaling to three or four cores, much less dozens of cores. A very wide, slow chip like Niagara sounds great in theory, but it just won't run (very well) a lot of the code people need to run.
A very wide, slow chip like Niagara sounds great in theory, but it just won't run (very well) a lot of the code people need to run.
I assume you are not referring to web-facing business apps. Many of those do scale fairly well, and are the target market, I think.
Obviously apps that were meant to run on PCs won’t.
Yes, I’m not referring to such apps. I was replying to Transputer Guy’s post.
Current CPUs are designed to run current software. Current software has a hard time scaling to three or four cores, much less dozens of cores. A very wide, slow chip like Niagara sounds great in theory, but it just won't run (very well) a lot of the code people need to run.
I would argue that the bias is more toward current software being designed to run on current CPUs than vice versa. It's also interesting that you seem to have pretty much made up your mind about Niagara? It's never, ever been claimed that the performance of a single thread is going to be that hot. It's never been positioned as a general-purpose CPU. What you do get is the ability to schedule across 32 hardware strands crunching through a lot of work simultaneously. Don't forget these will be 1U and 2U systems which are also very power efficient – effectively 2.5W/thread. Lots of software (anything task- or transaction-based – web apps, Java, databases) will run pretty well on this out of the box.
I would argue that the bias is more toward current software being designed to run on current CPUs than vice versa.
Hardly. Software has a life measured in decades; CPUs have a life measured in years. Consider the CPUs we use today: Pentium 4, G5, Athlon 64. Their designs date from the late 1990s. In comparison, many major software packages have code dating from the 1980s. It has been true for a very long time now that software leads and hardware follows.
It's also interesting that you seem to have pretty much made up your mind about Niagara?
Yes, in that I don't particularly consider it all that relevant for the kind of code I write. However, I'm sure it has its uses. I just think that CPUs of its ilk won't replace highly OoO superscalar designs, as Transputer Guy seems to.
The reason I see OoO becoming irrelevant is that it takes many, many 100Ks of gates to build even a halfway decent serialized CPU design, with another couple million transistors of cache to cover the always-growing memory wall. As clock frequency doubled, cache size was supposed to go up 4x to make things seem even; that never happened.
Such complexity simply doesn't scale in clock frequency; wires can only be so long before limiting what can be done per clock. Very simple MTA CPUs (with MTA memory) can inherently clock a few times faster, limited only by highly localized wiring constraints. They are not chained down to big, fast L1/L2 SRAM (although Niagara is, since its memory system is not MTA).
In my simulations, cache misses all the way out to DRAM on just 1-2% of accesses, due to linked-list chains and hash-table refs, drop these OoO perf levels back to where very simple MTA designs sit all the time, per thread. OoO only peaks several times faster for the L1 bursts that sit between the misses. I.e. register and high-locality instructions are more or less free; wait states are very expensive.
In the architecture I am developing, the MTA memory delivers up to 200M fully random memory slots per second. Any number of relatively slow MTA processor elements can consume that issue bandwidth: lots of slower ones or fewer faster ones. At about 80 threads (20 PEs), all the bandwidth is consumed, for PEs that each only use 10K gates, i.e. 200K gates plus their tiny local R and I caches. The max instruction throughput is 3000 MIPS for a miserable 300MHz FPGA clock. With a real ULSI process the design scales to 10x that frequency; for each Transputer that's around 30 GIPS, rather than the more typical 500M-1G ops I see on my XP2400. Again, register instructions are free; memory issues come out of a fixed MMU budget. Being a Transputer, the whole scheme scales again, although links limit memory sharing as they do on all multis. Since each Transputer MMU + N PEs is on the order of 1mm of silicon, you can imagine how many can fit on a good-sized chip. There have been multiprocessors with on the order of 460 CPUs before, but those went out of their way to not be unified in their memory. Lots of local CPU bots with teeny private memory isn't attractive if you have to manage that fine grain yourself. But if the memory pool looks common, that's very different.
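A small sanity check on how those figures hang together; every input below is taken from the paragraph above, and the ratios are just arithmetic:

```c
/* How the claimed figures relate: 200 M random memory slots/s feeding
 * 20 four-way-threaded PEs at 300 MHz for ~3000 MIPS total. */
#include <stdio.h>

int main(void) {
    double mem_slots = 200e6;  /* MMU issue budget, slots/s */
    double mips      = 3000e6; /* claimed instruction throughput */
    int    pes       = 20;     /* processing elements, 4 threads each */
    double clock     = 300e6;  /* FPGA clock */

    printf("instructions per memory slot: %.0f\n", mips / mem_slots);
    printf("per PE: %.0f MIPS, i.e. ~%.2f IPC at %.0f MHz\n",
           mips / pes / 1e6, mips / pes / clock, clock / 1e6);
    return 0;
}
```

That works out to about 15 instructions per memory slot and roughly 0.5 IPC per PE, which is the sense in which register instructions are "free" and the MMU budget is the real limit.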
One should think instead of fixed SRAM cycles, and allow MTA to run several (about 8) times faster than that, rather than trying to get the L1 cache down to 2 cycles for a slow OoO design. 8/2 comes out to 4, which nicely fits 4-way threading. The threads come out for free until memory throughput is exhausted. But all this is predicated on other things I can't disclose completely here. I actually use 8-way threading to further cut logic costs and halve the memory wall again.
As for SW being ancient, that's a good point I make all the time too. W2K probably has a lot of code dating back to the early 90s or older, as do Linux, Unix, etc. The code might get recompiled more often to be freshly reoptimized for current CPUs, but still, all the algorithm decisions were made before the first compile. Now in the late 80s and early 90s, linked lists and hashes were a good idea, since the memory wall had only just started to rise. Today, if one looks at those algorithms, they must be suffering greatly under the burden of the poor-locality tax.
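For anyone who hasn't felt that locality tax directly, here is a hypothetical micro-benchmark pair (names and structure are mine, not from the post) contrasting pointer chasing with a flat array walk:

```c
/* Illustration of the "poor locality tax": chasing list nodes scattered over
 * a large address space defeats caches, while the same sum over a flat array
 * streams through them. Hypothetical example, not the poster's code. */
#include <stddef.h>

struct node { long value; struct node *next; };

/* Each iteration's address depends on the previous load: potentially one
 * DRAM round-trip per element on a conventional (non-MTA) machine. */
long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Sequential addresses: cache lines and prefetchers hide most of the latency. */
long sum_array(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both loops do the same arithmetic; the difference is entirely in how the addresses are generated, which is exactly the property an MTA memory system is meant to make irrelevant.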
Imagine what you could do if all far memory accesses appeared to be around 2-4 minimal opcode slots. Not worrying about caching completely changes how you go about programming. But it can only be done with MTA on CPU and memory. There is no memory wall solution for non-MTA designs, only future disappointments.
Hardly. Software has a life measured in decades; CPUs have a life measured in years. Consider the CPUs we use today: Pentium 4, G5, Athlon 64. Their designs date from the late 1990s. In comparison, many major software packages have code dating from the 1980s. It has been true for a very long time now that software leads and hardware follows.
Not wishing to be pedantic, but your original point was that the development process was CURRENT software controlling CURRENT hardware design. I was pointing out that software written NOW is just as influenced by CURRENT hardware. There was no mention of any longevity.
But for some reason CPU guys would rather use crap DRAM and then invent 1000 complicated technologies which really don't work to get around the artificially imposed problem of using current DRAMs. So let's keep doing more of the same. It doesn't help that most are also locked into the n-level paging architecture from the 60s. That needs to go too.
How much do these super-fast RAM chips cost? If the cost is too prohibitive to productize, what is the point?
Let me answer that question for you. The RLDRAM2 chips are still in production.
Let's see, CPU guys should design based on products that aren't even available yet and are ridiculously expensive.
The whole point of having a multilevel memory architecture is to keep the most expensive, fastest memory small and close to the CPU. Mainly for cost purposes.
We need faster chips. New materials. Not more cores.
Wrong, Anonymous.
Multicore is, for now, the way to more powerful systems.
At the same time, research into new materials is still important.
Wrong, Tanner.
Multicores are slightly faster if you program specifically for them.
We need to move away from silicon and copper/aluminum.
Machines built with chips like this are primarily targeted to be servers at enterprises. And enterprises still buy a lot of commercial software, like WebSphere and Oracle. Last I checked, IBM treats each core as one "processor" when it comes to software licensing. Oracle is similar. If this practice does not change, Sun will not have a future with these machines. On the other hand, why would IBM change their licensing policy to save Sun? Heck, IBM would not release WebSphere for Solaris x86 until Sun threatened not to renew IBM's J2EE license.
price/performance
I wouldn't use any chip unless it can justify its cost with benchmarks. I don't care if it's SPARC, Power, AMD, Intel, MIPS, Cell, whatever. Show me the numbers.
If this chip is so fast, where are the benchmarks?
price/performance
I wouldn't use any chip unless it can justify its cost with benchmarks. I don't care if it's SPARC, Power, AMD, Intel, MIPS, Cell, whatever. Show me the numbers.
If this chip is so fast, where are the benchmarks?
There are not any published benchmarks because the chip isn't officially released yet. Do your price/performance calculations also include power consumption, HVAC and real estate costs? (Niagara is an order of magnitude better than its competitors when you consider perf/watt.)
Niagara is an order of magnitude better than its competitors when you consider perf/watt
Prove it..
that’s all I ask..
You’ll just have to wait and see.
Wrong, Tanner.
Multicores are slightly faster if you program specifically for them.
We need to move away from silicon and copper/aluminum.
You can't make a broad statement like 'faster' and then say chips should be made of unobtanium instead of Cu/Al. Define faster. Faster at what?
Machines built with chips like this are primarily targeted to be servers at enterprises. And enterprises still buy a lot of commercial software, like WebSphere and Oracle. Last I checked, IBM treats each core as one "processor" when it comes to software licensing. Oracle is similar. If this practice does not change, Sun will not have a future with these machines. On the other hand, why would IBM change their licensing policy to save Sun? Heck, IBM would not release WebSphere for Solaris x86 until Sun threatened not to renew IBM's J2EE license.
You raise an excellent point. Basically, companies that license software need to change because everyone is going multicore (including Intel). We are already seeing traction in this area with Oracle allowing virtualization technologies such as zones to act as licensing barriers. I would be surprised if some negotiation wasn't in place to count Niagara cores differently – it isn't a traditional design after all (even by multicore standards).
A Micron FAE quoted me RLDRAM as 2x the price of regular DRAM; that's not ridiculous, that's a fxxxing bargain for 20x the throughput for threaded CPU designs. Typically the special, slightly faster gaming DRAMs already cost quite a bit more than regular parts. The real idea is to get the CPU guys to use RLDRAM for L2 cache, which can in some cases also be the full L3 memory. They are not widely available outside the networking industry; chicken and egg. They can't be put on DIMMs though; signal integrity means they have to be on the board next to the processor, but that could change.
I also said 3 levels is fine: SRAM interleaved like RLDRAM at the L1 level, followed by RLDRAM, then by larger, cheaper but much less performant SDRAM. Hierarchy is still good. RLDRAM is just very much more effective than L2 SRAM; 32-64MB is way better than 1 or 2MB, right? But to go this route needs some things I won't divulge here. RLDRAM looks like only 40 or so cycles of latency at 2GHz, but looks like 10 or so for 4-way MTA, where the new PA PPC sees 22 cycles on L2.
Sure, it would be nice to move away from Si, Al, Cu, but that won't happen either. Ironically, MTA actually allows the innermost core to run at faster clocks again when matched with MTA memory, so the 4N threads each end up with nearly the same performance as a single complex threaded design with the same memory. Threads will eventually be free, limited only by the memory issue throughput.
As for parallelization, it's a question of how you approach things. If all memory operations have a very low wall, then one uses the algorithms that have lower costs rather than higher, i.e. N log N better than N², and refer to Knuth for details. But for some algorithms that use highly cache-unfriendly patterns, things can and do flip around, especially for lists and hashes that are far beyond 20 bits of addressing. When you see simulations of how a threaded processor-memory system performs, you can see possibilities just not practical on current HW: an RLDRAM-only system can support up to 4 Gops worth of operations for the usual 10 ops per load/store.
For instance, radix sort should be, say, O(4N) time vs O(N log N) for quicksort, but if the random accesses don't match the cache, they break. A radix sort that does it in 2N (two passes) should be 2x faster still, but try building 64K bins; that will be much slower than 4N (four passes) with 256 bins. On a processor designed for threads on both sides, the 2N radix actually is 2x faster, since random addresses actually help in that design, and all the old algorithm analyses are correct again.
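A sketch of that trade-off as a plain LSD radix sort where the radix width is a compile-time knob; this is illustrative code of my own, not the poster's design:

```c
/* LSD radix sort with the radix width as a knob, to show the bins-vs-passes
 * trade-off: RADIX_BITS = 8 means 256 bins and 4 passes over 32-bit keys,
 * RADIX_BITS = 16 means 64K bins and 2 passes. On a cache-based machine the
 * 64K-bin histogram and scatter thrash; with cheap random access the 2-pass
 * version should win, which is the point being made above. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8u                 /* try 16u for 64K bins, 2 passes */
#define BINS       (1u << RADIX_BITS)
#define MASK       (BINS - 1u)

void radix_sort(uint32_t *a, size_t n) {
    uint32_t *tmp   = malloc(n * sizeof *tmp);
    size_t   *count = malloc(BINS * sizeof *count);

    for (unsigned shift = 0; shift < 32; shift += RADIX_BITS) {
        memset(count, 0, BINS * sizeof *count);
        for (size_t i = 0; i < n; i++)           /* histogram this digit */
            count[(a[i] >> shift) & MASK]++;
        size_t pos = 0;                          /* exclusive prefix sum */
        for (size_t b = 0; b < BINS; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)           /* scatter: the random part */
            tmp[count[(a[i] >> shift) & MASK]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(count);
    free(tmp);
}
```

The scatter loop is the part whose addresses are effectively random; on a cache machine it is what breaks the wide-radix version, and on a threaded memory system it is what becomes cheap.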
Also, in the days of transputing, before many pundits were born I suspect, people looked for ways to parallelize at a time that was maybe premature, before desktops did what they do today and also before the memory wall had started. Si still had another few-hundred-fold of performance left that wasn't obvious at the time, 20MHz to 3GHz. Well, today the only known constant is how fast memories can be built. Bigger SRAMs generally run much slower than smaller ones, but in the end DRAM and SRAM are both limited by communications at the device architecture level and by how much concurrency can be organized.
Threading is the only way forward; unused threads can always idle, since power/heat follows work. Current CPUs don't idle very well.
transputer guy
A Micron FAE quoted me RLDRAM as 2x the price of regular DRAM; that's not ridiculous, that's a fxxxing bargain for 20x the throughput for threaded CPU designs. Typically the special, slightly faster gaming DRAMs already cost quite a bit more than regular parts. The real idea is to get the CPU guys to use RLDRAM for L2 cache, which can in some cases also be the full L3 memory. They are not widely available outside the networking industry; chicken and egg. They can't be put on DIMMs though; signal integrity means they have to be on the board next to the processor, but that could change.
OK, your original post talked about DRAM, when you really were attacking the SRAM used for caches.
BTW, the CPU guys who designed Niagara were doing exactly what you mentioned RLDRAM solves, in the CPU. No matter how much you thread the cache architecture, memory will always be slower than the CPU unless something changes drastically. And if you don't have multiple threads in the CPU, what exactly is going to fill the MTA memory threads, one thread?
Big business apps like Oracle will still not benefit as much as radix and quicksort do in the architecture you mentioned.
New materials like silicon germanium.
Tunneling quantum transistors.
Optical/photonic computing is the future.
The Intel ISA needs to be dumped. We need to use some form of vector processing. We need better compilers for this. They also use a lot less energy! The Intel ISA is killing computing. It's an energy hog!
Check out
http://64.233.161.104/search?q=cache:AYJB8JRL88gJ:www.multires.calt…
The Intel ISA is killing computing. It's an energy hog!
BS. The cost of supporting x86 is a few million transistors in the corner of the chip. It was a big deal back in the heyday of RISC, but as transistor budgets get bigger, the circuitry required for x86 compatibility becomes more and more negligible.
Wrong!
http://www.theinquirer.net/?article=25496
Wrong!
http://www.theinquirer.net/?article=25496
What's your point? That piece of baseless speculation has been proven spectacularly wrong by Intel's actual announcements. Nicholas Blachford may be good at throwing together some buzzwords, but I'm not sure he really understands what he's talking about there.
RTFA.
They are moving away from the ISA. Rayiner said that doesn't matter. Intel obviously thinks it does.
POINT MADE.
1) Complete speculation does not mean that Intel is moving away from the ISA. Moreover, if Itanium isn't dead yet, it will be soon. EPIC is a nice bit of mental masturbation for the hardware guys, but it doesn't serve the needs of the software, and as such deserves its failure.
2) I never said the ISA doesn't matter, I said that x86 isn't the kind of boat-anchor it's made out to be. EPIC isn't an attempt to keep x86 from holding back the CPU, but rather an attempt to explore a different model of instruction set. Obviously, if you want a VLIW instruction set, you aren't going to use x86. However, that's no weakness of x86 — you wouldn't use PPC or SPARC either. The original claim was that x86 is holding back CPUs in the context of power usage. It's not. And frankly, using the Itanium as your counterexample is rather ironic, given its massive power usage!
And you read this article to find out what Intel is really doing:
http://arstechnica.com/news.ars/post/20051026-5485.html
Note the appraisal of your great Inquirer article in the first paragraph.
D’oh! Wrong link.
http://arstechnica.com/news.ars/post/20050823-5232.html
Wrong about what exactly? And how does completely off-the-wall speculation from TheInquirer prove it?
A multicore CPU can be of use to a desktop user in a different way than just running a multithreaded app. Remember that on your average Windows desktop you have at least a firewall and antivirus software running in the background while doing your other stuff.
With multiple cores these tasks can be spread out over different cores, meaning that the OS doesn't have to spend as much time switching tasks on the cores.
The time for the actual switching is totally negligible. Getting the new task’s working set into the caches can matter though.
But the tasks you mentioned as well as your usual desktop programs (browser, office suite, even MP3) do not fully utilise a current single-core CPU anyway, so multi-core won’t make a difference there.
It only gets interesting when considering applications that actually do have a CPU bottleneck, e.g. video encoding, games, compilers and such like.
You may note that while I got the date of an announcement wrong, I didn’t give a specific time frame for when I expected Intel to use a VLIW type architecture in an x86. I still expect them to do this at some point, even if it’s 5 years away.
Oddly enough, I'm writing an article on multi-core processing and was planning on adding a follow-up as part of it; this debate has given me some ideas on things I should cover.
—
To answer rayiner’s point:
The instructions can be converted using a very small block but there’s an entire x86 state machine (registers, their behaviour etc.) which this won’t work on. Every x86 has to somehow implement this state machine and this has important implications for the design of the entire processor.
Sun and IBM have both gone back to in-order designs because they can; it won't hurt their performance that much. x86 can't do this because going in-order loses the rename registers and that'd cripple their performance.
If Intel want to build tens or hundreds of cores per processor, the power savings will have to be drastic indeed. Keeping x86 in hardware restricts the options they have to do this.
"If Intel want to build tens or hundreds of cores per processor, the power savings will have to be drastic indeed. Keeping x86 in hardware restricts the options they have to do this."
Indeed.
Rayiner pointed out, "I said that x86 isn't the kind of boat-anchor it's made out to be."
but only if you have 1 or maybe 2 behind a relatively large cache.
If one wanted a lot more x86 cores, the boat-anchor effect comes back in a hurry in several ways, no matter how they are implemented.
And if you really want lots of cores and/or threads to explore highly concurrent programming with HW support for that, then x86 is all obstacles, but $ still talk.
transputer guy
Rayiner pointed out, "I said that x86 isn't the kind of boat-anchor it's made out to be."
but only if you have 1 or maybe 2 behind a relatively large cache.
Instruction set architecture is fairly orthogonal to the memory system design, so I don’t see what your point is.
If you’re really talking about out-of-order vs in-order, well, in-order processors are significantly more sensitive to cache misses.
If one wanted a lot more x86 cores, the boat-anchor effect comes back in a hurry in several ways, no matter how they are implemented.
Why? If x86 costs X% of die space in a one-core design it’s still gonna cost X% of die space in an n-core design.
I didn’t give a specific time frame for when I expected Intel to use a VLIW type architecture in an x86.
Really? May I remind you what you actually wrote in your article?
AT NEXT WEEK'S Intel developer forum, the firm is due to announce a next generation x86 processor core. The current speculation is this new core is going to be based on one of the existing Pentium M cores. I think it's going to be something completely different.
…
Based on the various comments and actions of Intel, as well as other companies, I think Intel is preparing to announce a completely new VLIW processor which uses software to decode x86 instructions and order their execution.
Are you now trying to claim that your VLIW ramblings had nothing to do with the architecture announced at the IDF? In that case, you should be a lawyer rather than a tech journalist.
Are you now trying to claim that your VLIW ramblings had nothing to do with the architecture announced at the IDF? In that case, you should be a lawyer rather than a tech journalist.
That’s not what I wrote:
NB: You may note that while I got the date of an announcement wrong…
The announcements at IDF were for Merom / Conroe which are due in 2006.
I was speculating about an announcement of an architecture due late 2007. I didn't specify this in the story, however; I should have given a time frame.
I still expect them to go this route but it could be many years away yet.
You really should have been a lawyer.
Your article was talking specifically about the IDF announcement; there was no indication that you actually had some later time in mind.
I still expect them to go this route but it could be many years away yet.
Maybe if .net really takes off, thereby reducing the dependency on the x86 ISA.
The instructions can be converted using a very small block but there’s an entire x86 state machine (registers, their behaviour etc.) which this won’t work on.
The instructions coming out of the x86 decoder are much like RISC instructions, and I don't see how the state machine they work on is gonna be a great deal more complex compared to other processors'.
Sun and IBM have both gone back to in-order designs because they can; it won't hurt their performance that much.
Actually it hurts quite a lot (in terms of single-thread performance). And it hurts even more if you don’t recompile and optimise your software for any particular in-order design you’re using.
x86 can’t do this because going in-order loses the rename registers and that’d cripple their performance.
You do have a point there, but remember that x86-64 reduces that problem significantly.
If Intel want to build tens or hundreds of cores per processor, the power savings will have to be drastic indeed.
Hence the Merom core. The problem with Netburst was that it was developed with almost total disregard for power considerations.
And you can always fit more cores by reducing the clock rate, because a reduction in clock rate reduces power consumption more than it reduces performance. Sun is of course already doing that with the fairly slowly clocked Niagara.
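The reason that trade works out is the usual dynamic-power relation, roughly P ∝ C·V²·f, plus the fact that a lower clock usually allows a lower supply voltage. A tiny illustrative calculation (the 30%/15% scaling factors are assumptions, not Niagara numbers):

```c
/* Dynamic power scales roughly as C * V^2 * f, and a slower clock usually
 * allows a lower supply voltage, so power falls faster than performance.
 * The 30% / 15% scaling factors below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double f_scale = 0.70; /* clock reduced by 30% */
    double v_scale = 0.85; /* assume voltage can drop ~15% at that clock */
    double p_scale = v_scale * v_scale * f_scale;

    printf("relative performance ~%.2f, relative power ~%.2f, perf/watt ~%.2fx\n",
           f_scale, p_scale, f_scale / p_scale);
    return 0;
}
```

Under those assumptions a 30% slower core uses roughly half the power, i.e. a ~1.4x perf/watt gain per core, which is what lets you pack more of them into the same envelope.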
The tradeoff between smaller slower in-order cores vs. bigger faster out-of-order cores remains the same no matter by how much you multiply their numbers.
Simpler cores probably give you higher overall performance in theory, but Amdahl’s law and the realities of software development and backwards compatibility mean that out-of-order execution is worth the effort.
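A quick Amdahl's-law check on why that is; the serial fractions and the 3x single-thread penalty for a simple core are illustrative assumptions, not figures from this thread:

```c
/* Amdahl's-law sketch of the trade-off: 32 simple strands, each assumed
 * 3x slower single-thread than one big out-of-order core. */
#include <stdio.h>

static double amdahl_speedup(double serial_frac, double cores) {
    return 1.0 / (serial_frac + (1.0 - serial_frac) / cores);
}

int main(void) {
    const double slow_core = 1.0 / 3.0;  /* relative single-thread speed */
    const double serial[]  = {0.05, 0.20, 0.35, 0.50};

    for (int i = 0; i < 4; i++)
        printf("serial %2.0f%%: 32 slow cores ~%.1fx vs 1 fast core at 1.0x\n",
               serial[i] * 100.0, slow_core * amdahl_speedup(serial[i], 32.0));
    return 0;
}
```

With these numbers the wide, slow chip wins handily at a 5% serial fraction but falls behind a single fast core once roughly a third of the work is serial, which is why single-thread performance still matters for general-purpose code.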
If you’re really talking about out-of-order vs in-order, well, in-order processors are significantly more sensitive to cache misses.
Hence Niagara. We seem to have come full circle.