During his keynote address at San Francisco’s Moscone Center, Otellini unveiled the company’s next-generation, power-optimized micro-architecture for future digital home, enterprise, mobile, and emerging market platforms aimed at a new category of converged consumer devices.
cool computing???? nothing new. I think AMD will rule.
-2501
So it IS Pentium M after all. That’s sad.
hey rayiner, can you provide your comments on this?
i remember a long time back you led a good discussion on pipeline design and out-of-order execution..
will be good to see that back instead of some mindless ranting about osx and all that
cheers
ram
The new CPUs are built from scratch, using knowledge gained from the Pentium M architecture experiment.
It features bits and pieces from both the Pentium M and NetBurst, as well as new things. All in all it is not a Pentium M derivative, even if it sounds like that.
Not quite, they’re adding a 4th execution unit, it seems. Overall, it’s probably pretty similar though. They do add 64-bit and VT (Vanderpool, is that the Xen stuff?), and it’s properly dual core.
Vanderpool/VT refers to the hardware extensions Intel is adding to aid virtualisation. Previously, x86 was horrible to fully virtualise; VT makes it easier and faster.
And yes, VT support was contributed to Xen by Intel. Today at IDF, Windows running under Xen was demonstrated (using VT for full virtualisation), alongside Linux guests running natively on Xen (paravirtualisation).
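For anyone who wants to poke at their own machine: the CPU advertises VMX support through CPUID (leaf 1, ECX bit 5). A minimal sketch in C, assuming GCC/Clang on x86; note the BIOS can still leave VT disabled even when the CPU reports the bit.

/* Minimal sketch: detect Intel VT (VMX) support via CPUID.
 * Assumes GCC/Clang on x86; CPUID leaf 1, ECX bit 5 is the VMX flag.
 * The BIOS may still have VT locked off; that is a separate MSR check. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 1 not supported");
        return 1;
    }

    if (ecx & (1u << 5))                 /* bit 5 of ECX = VMX (Vanderpool) */
        puts("VT/VMX reported by this CPU");
    else
        puts("no VT/VMX support reported");

    return 0;
}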
…I’m fairly happy with almost everything except the dreaded FSB design. Why, Intel, why? Why do you keep hanging onto old technology? Why can’t you integrate the memory controller onto the CPU? What is the problem, Intel?
You’ve seen what it can do to latency and overall performance in general… I really can’t comprehend their decision on this. They must have had a very good reason to stick with an FSB implementation. Either way it makes me sad, to say the least… 🙁
Why can’t you integrate the memory controller onto the CPU? What is the problem, Intel?
Integrating the memory controller into the CPU has its downsides. It is not a panacea. You can ask almost any game developer or CPU analyst who does low-level performance analysis about the penalties.
An on-die memory controller will give you reduced latency at the cost of bandwidth. Having an on-die memory controller also raises your costs and restricts a manufacturer who sometimes has to maintain compatibility between different processor socket designs.
Integrating the memory controller into the CPU has its downsides. It is not a panacea. You can ask almost any game developer or CPU analyst who does low-level performance analysis about the penalties.
An on-die memory controller will give you reduced latency at the cost of bandwidth. Having an on-die memory controller also raises your costs and restricts a manufacturer who sometimes has to maintain compatibility between different processor socket designs.
O.K., that is true… but now let’s look at it this way: even AMD, a company at least 10x smaller than Intel, managed to pull it off (integrating the memory controller into their CPUs). So if AMD could do it, why can’t Intel? I’m extremely confident that if Intel put their minds to it, they could elegantly implement an integrated memory controller. I just don’t understand the reasoning behind the decision (was it really that expensive to implement, is compatibility so high on Intel’s priority list, etc.).
O.K., that is true… but now let’s look at it this way: even AMD, a company at least 10x smaller than Intel, managed to pull it off (integrating the memory controller into their CPUs). So if AMD could do it, why can’t Intel? I’m extremely confident that if Intel put their minds to it, they could elegantly implement an integrated memory controller. I just don’t understand the reasoning behind the decision (was it really that expensive to implement, is compatibility so high on Intel’s priority list, etc.).
AMD managed to pull it off, but not without the disadvantages I mentioned. Integrating the memory controller raised their costs and has kept them behind in memory bandwidth compared to Intel, which is part of the reason why Intel systems are still favored for bandwidth-intensive activities such as video/audio encoding, etc.
Additionally, because of the integrated memory controller AMD did gain some advantage in dual-core performance, but they also have a disadvantage in that both processors have to share 6.4 GB/s of potential memory bandwidth, whereas Intel, since they do not have an on-die memory controller, can have each CPU use the full potential bandwidth.
Yes, I believe it’s possible that they (Intel or AMD) will eventually overcome most of the disadvantages, but it isn’t there yet. I would rather see cost-effective processors that perform well than ones that are idealistically better but still have all the disadvantages I listed.
(I say this as an Athlon64 2800+ owner)…
but they also have a disadvantage in that both processors have to share 6.4 GB/s of potential memory bandwidth, whereas Intel, since they do not have an on-die memory controller, can have each CPU use the full potential bandwidth.
I call Fudulent statement.
Now prove me wrong.
The cores do share the same memory controller, but the K8 isn’t particularly bandwidth-constrained, and that sharing is a side effect of putting two cores behind one memory controller, not of moving the controller on-die. The K8’s multi-core strategy is better than the P4’s. Benchmarks run with varying amounts of memory bandwidth typically show fairly modest performance gains, and the games angle is a tad fishy, because PC game engines are not typically multithreaded with kernel threads, though a few use coroutines and some make limited use of multiple threads.
I call Fudulent statement.
Now prove me wrong.
I wish people would stop calling things “FUD” which are not. I was not posting “fear, uncertainty, and doubt”. Even if I was posting something incorrect, which I do not believe I was, it would be an “incorrect statement”, not “FUD”.
See this diagram as proof:
http://images.anandtech.com/reviews/cpu/amd/athlon64x2/preview/AMDa…
As you can see a dual-core setup for the Athlon64-x2 shares one memory controller.
Now, this is not the case for SMP systems, just dual core.
Yes, I realise that Intel’s dual-core chips will be sharing a single memory controller on the motherboard. However, each processor has its own full-bandwidth path to the memory controller instead of sharing a single one. That makes some difference.
Yes, I realise that Intel’s dual-core chips will be sharing a single memory controller on the motherboard. However, each processor has its own full-bandwidth path to the memory controller instead of sharing a single one. That makes some difference.
How so? This is not a sarcastic or rhetorical question, mind you… it’s an honest request for explanation from someone who is not into CPU design at all.
From a naive point of view, a bandwidth bottleneck is a bandwidth bottleneck, whether it happens on-die or on the motherboard. Two threads running on different cores will in both cases need to reach through the memory controller, all the way down the path to the RAM, to do their work, and in both cases they share the bandwidth. Again, from my naive point of view, does the lower latency of an on-die memory controller mean it reacts ‘faster’ to changes in bandwidth occupation, thus optimizing the flow? I don’t really know. Please explain.
How so? This is not a sarcastic or rhetorical question, mind you… it’s an honest request for explanation from someone who is not into CPU design at all.
I really don’t remember or know at the moment. What I do know is this: AMD has consistently scored lower in bandwidth-heavy benchmarks, and articles/reviews that I’ve read have stated this is a trade-off between the low latency of an on-die memory controller and the bandwidth of a motherboard-based one. It’s not unique to dual-core systems. There is some advantage that Intel seems to have.
What I do know is this: AMD has consistently scored lower in bandwidth-heavy benchmarks, and articles/reviews that I’ve read have stated this is a trade-off between the low latency of an on-die memory controller and the bandwidth of a motherboard-based one. It’s not unique to dual-core systems. There is some advantage that Intel seems to have.
The only disadvantage on-die memory controllers have is that they cannot easily be upgraded to the latest memory technology speeds.
The reason AMD chips score lower on bandwidth-intensive tests is that they have a lower-bandwidth memory subsystem, i.e. DDR at 400 MHz, aka PC3200, whereas the Pentium D systems use DDR2 at 533 MHz and 667 MHz.
The bandwidth disadvantage is not inherent in on-die memory controllers; it is a side effect of not being able to change the controller quickly enough, due to prohibitive costs.
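To put rough numbers on that, here is the back-of-the-envelope arithmetic (theoretical peaks per 64-bit channel, ignoring FSB and controller overheads):

/* Back-of-the-envelope peak bandwidth for the memory technologies mentioned
 * above: transfers/s x 8 bytes per 64-bit channel, theoretical figures only. */
#include <stdio.h>

static double peak_gbs(double mega_transfers, int channels)
{
    return mega_transfers * 8.0 * channels / 1000.0;   /* GB/s */
}

int main(void)
{
    printf("DDR-400 (PC3200), single channel: %.1f GB/s\n", peak_gbs(400, 1));
    printf("DDR-400 (PC3200), dual channel:   %.1f GB/s\n", peak_gbs(400, 2));
    printf("DDR2-533, dual channel:           %.1f GB/s\n", peak_gbs(533, 2));
    printf("DDR2-667, dual channel:           %.1f GB/s\n", peak_gbs(667, 2));
    return 0;
}

The 6.4 GB/s dual-channel DDR-400 figure is the number quoted earlier in the thread; whether a platform can actually sustain it is another matter.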
This article provides some performance numbers between the new AMD dual cores and the Intel Dual Core offerings…
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2484&p=1
Although it is true that the memory controller is shared between two cores and has less access to memory bandwidth (since it’s plain old DDR instead of DDR2) this doesn’t seem to matter all that much!
Notice in the article that even in the multimedia areas, where Intel used to reign supreme, the 3800+ X2 is faster than the Intel Pentium D 830.
But where the integrated memory controller really shines is in SMP systems. Now, instead of 2-8 processors sharing the same 533 MHz (or 800 MHz) bus to the memory controller, each processor has its own dedicated memory controller and direct link to the memory (through its HyperTransport connection).
There are tradeoffs, and memory bandwidth isn’t everything. I think AMD has made good choices when it comes to their memory subsystem design, and I feel that Intel needs to catch up.
Although it is true that the memory controller is shared between two cores and has less access to memory bandwidth (since it’s plain old DDR instead of DDR2) this doesn’t seem to matter all that much!
And that’s because in most applications the big bugbear is latency, not bandwidth.
You can always get extra bandwidth by widening buses or increasing clock frequencies, but unfortunately dynamic memory has shown itself quite resistant to an actual speedup in terms of latency.
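A crude way to see the latency point for yourself is a pointer-chasing loop: every load depends on the previous one, so extra bandwidth buys you nothing. A rough sketch (the array size, chase count, and use of rand() are arbitrary choices, not a proper benchmark):

/* Pointer-chasing sketch: dependent loads expose memory latency, since no
 * amount of bus bandwidth lets the next load start before the previous one
 * returns. Assumes a POSIX system for clock_gettime. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (16 * 1024 * 1024)   /* 16M entries (~128 MB), way past any L2 */
#define CHASES 20000000L            /* 20 million dependent loads */

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle over all N slots, so the chase
     * can't settle into a short cached loop (assumes RAND_MAX >= N). */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t p = 0;
    for (long i = 0; i < CHASES; i++)
        p = next[p];                        /* each load waits on the last */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

    printf("average load-to-load latency: %.1f ns (p=%zu)\n",
           ns / (double)CHASES, p);
    free(next);
    return 0;
}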
“An on-die memory controller will give you reduced latency at the cost of bandwidth.”
I admit I’m a slouch at systems architecture… for a computer engineering student. However, I don’t understand how putting the memory controller on-die either limits or adds cost to memory bandwidth relative to bridged designs. Do chip interconnects cost more than pins? Does the FSB or bridge clock offer more bandwidth in some way?
Your comment leaves me puzzled and looking for an explanation. I’ll mod you up for making my mind do that thing where it thinks.
Techreport has posted some more details of the microarchitecture:
http://www.techreport.com/onearticle.x/8695
A 14-stage pipeline instead of the 12 stages in the PPro through the Pentium M. Four instructions per cycle can be issued instead of three. Multiple cores share the L2 cache. Hyper-Threading is missing but could come later.
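How much that extra issue width buys depends entirely on the instruction-level parallelism in the code. A purely illustrative sketch, not tied to any particular chip: the first loop is one long dependency chain that no amount of issue width can speed up, while the second exposes several independent chains a wide out-of-order core can overlap (build with low optimization, or the compiler will fold the loops into closed forms):

/* Illustrative only: a serial dependency chain vs. independent work.
 * The first loop's adds each need the previous result, so a 3-wide or
 * 4-wide core gains little; the second loop's four accumulators give the
 * out-of-order scheduler independent adds to issue side by side. */
#include <stdio.h>

#define ITERS 100000000LL

int main(void)
{
    volatile long long seed = 1;

    long long a = seed;
    for (long long i = 0; i < ITERS; i++)
        a += i;                               /* one long dependency chain */

    long long b0 = seed, b1 = seed, b2 = seed, b3 = seed;
    for (long long i = 0; i < ITERS; i += 4) {
        b0 += i;                              /* four independent chains */
        b1 += i + 1;
        b2 += i + 2;
        b3 += i + 3;
    }

    printf("%lld %lld\n", a, b0 + b1 + b2 + b3);
    return 0;
}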
So they expect to hit 10 cores in 2010, right?
I guess the Cell chip will carry 40 cores by then.
Well done, Apple.
So they expect to hit 10 cores in 2010, right?
I guess the Cell chip will carry 40 cores by then.
Well done, Apple.
You’ve got it all wrong. It’s not the quantity of the cores but the quality that counts. If those 40 cores on a Cell CPU can only process 1/8th of what a 10-core Intel CPU can do, then they’ve still got a lot of catching up to do. Besides, the two aren’t really comparable anyway, as (this is greatly simplified) the Cell is an in-order Power arch and Intel’s is an out-of-order x86 arch.
OK I’ll bite (or should I say, “sting”):
Currently, it is all about the “quality” of the core(s) (in your parlance) rather than the quantity, because 99% of software is written like people think they think. That is, in a linear fashion, branch-heavy and monolithic. I call this the “queen bee model.” I think our brains really work like a beehive, with one queen bee and thousands(?) of worker bees that do her bidding. We think we think like a queen bee because we can only express our “train of thought” in terms of the perspective of the queen bee. This is why today’s software is incapable of providing the insight and “intelligence” of the human brain.
Hardware comes before software. It is not like the chicken and the egg. This is the way it works. Humanity is developing the hardware of the future. The first incarnation is the Cell (not hexagonal only because of fabrication technology), which conveniently fits into my analogy by representing the “beehive model.” This model requires that we program the hardware like the hive works, not like the queen bee sees things happening. This can be implemented in code, in the compiler, in firmware, and/or even in the hardware itself.
Then it comes to pass that improvements to the quality of each worker bee pale in comparison to the quantity of worker bees in the hive in terms of improving the overall performance of the system. The advantage that we have over the bees–well, I think we have many, but here is just one–is that we can have multiple queen bees in our hive, each controlling an optimal quantity of worker bees, and they will cooperate with one another. This adaptation has not (…yet) taken hold in the social systems of modern bees. Thus we can manage scalability issues and avoid adding undue complexity to our queen bee processing elements (like OOO execution, for example).
If I’m right (and by extension hundreds of other brilliant people), the beehive model of computing will usher in an era where computers can truly think like humans can. And we will do it by putting quantity before quality.
I would like to apologize to any anaphylactics out there, for whom this post must have been terrifying…
Lots of improvements… but it doesn’t look too much like a “new architecture”… It’s more like the evolution from Pentium III to Pentium M…
…But until we have the real thing for testing, I can change my mind (probably won’t, but anyway…)…
The lack of HT is no surprise, as they’re going for multiple hardware cores now, but they have to run… AMD already has a really good dual-core design and several plans for quad and octo cores with its next socket…
I just hope they both have competitive products (currently, AMD has better overall cost vs quality) and that both companies’ multicore processors become cheaper than the current prices!
“currently, AMD has better overall cost vs quality”
I like AMD, and my next CPU purchase will be AMD, but I don’t know about price/performance anymore. It seems like the magnetic field of the CPU price/performance planet is suddenly flipping.
So the engineers are back in charge at Intel. Great to see that the silly GHz chase is finally coming to an end.
Only shame is it’s still another year until all this stuff actually becomes available. Let’s see how AMD can use this window of opportunity.
And as for Mr Nicholas Blachford of “the Inquirer” fame: eat your words. No sign whatsoever of VLIW or binary translation or anything else of your great predictions. Just plain old and proven out-of-order execution.
What about the next-gen batteries??? They are all talking about better consumption and blah, blah, blah… but nothing about the development of the next generation of batteries, which could also help. Battery technology hasn’t really evolved, and it is a big factor to consider as well.
-2501
Transmeta has been preaching what Intel mentioned today. So why is this revolutionary??? I don’t see anything special. All I see is Intel taking the same path, but now they are carrying the flag. Nothing special.
-2501
look @ them…..
http://www.anandtech.com/tradeshows/showdoc.aspx?i=2505
this is a great move.. processors are already powerful enough.. they need to address power consumption. just ‘cos some people’s PCs are bogged down by spy/adware doesn’t mean their hardware is to blame. apple’s os x will fly on these chips
Hannibal has posted his take on the new microarch:
http://arstechnica.com/news.ars/post/20050823-5232.html
He points out how the execution core is quite similar to the PPC970 (aka G5).
And the recent Inquirer article gets a fitting appraisal in the first sentence.
I’d buy a mobile workstation with a quad-core processor to run my high-end 3D/2D apps on. As for Hyper-Threading, it would be interesting to see if Intel offers this or something similar.
…turned out to be incorrect.
Why?
Well, it seems Intel has cut the “average” power, not the TDP. The TDP seems to remain in the Pentium M range.
If they want to do an 8-core design they’ll need to cut TDP in half; if they want to go to 16 cores they’ll need to cut it far more aggressively.
Whenever Intel or AMD do this, I fully expect the methods I speculated on to be used.
Actually, I think even with a 16-core system, your speculation is still a SWAG (Silly Wild-Assed Guess), based on what they’ve already demonstrated with their ultra-low-power version: after all, if they can make what (IIRC) is already a dual-core processor run on 0.5 watts, even at less than the maximum speed of the other processor variants currently announced, why would they want or need that complication? 8 x 0.5 = 4 watts, and, worst case, 16 cores x 0.5 watts = 8 watts for the chip.

The biggest immediate problem is the size of the die needed to go to 16 cores, along with the size of the cache: it takes an awful lot of data to keep 16 cores doing useful work, and memory speeds have not come even close to keeping up with CPUs’ ability to saturate bandwidth. Sure, you can increase the front-side bus width to 256 bits, and with 16 cores, chances are you might have a bit of latency for each one going out of the cache, but if you actually have enough to keep 16 cores busy doing *useful* work (not an idle system thread), you’ll *still* be bandwidth-starved, because that’s only 16 bits *per memory-bus cycle* per core on average, with the multiplier likely to be an absolute minimum (with currently available RAM) of at least 3-4 CPU cycles per single FSB cycle, not counting the latency of the memory controller(s).

It would be an interesting trick to schedule threads/tasks such that they all run within the L1/L2 caches of the chip, so that those that are almost purely computation-bound leave those that are almost purely memory-bound a useful bit of bus bandwidth for their data and instructions, and this is *before* doing the insane thing of requiring all the translated VLIW code to be stored somewhere and pumped in and out. In short, what Transmeta did, while it has some pluses, simply can’t practically be scaled without using several memory buses, including one purely for translated code, because the processor would be sipping data through a narrow straw.
Then, when you consider that a 4K page of code is not likely to translate 1:1 into a 4K page of translated code in the VLIW format, it becomes much too hairy to do effectively in a combination of hardware and software, requiring all OSes to account for this weird architectural brainfart to achieve something that *might* be more power-efficient, perhaps more die-efficient, perhaps faster overall… but only with a huge infrastructure that defeats the whole purpose in the first place!
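For what it’s worth, the 16-bits-per-core figure above works out like this; the 800 MT/s bus rate is just an example value I picked, not anything Intel has announced:

/* Worked numbers for the bandwidth-starvation argument above: a 256-bit
 * front-side bus split across 16 cores. The 800 million transfers/s rate
 * is an assumed example, not a real product spec. */
#include <stdio.h>

int main(void)
{
    const int    bus_width_bits  = 256;
    const int    cores           = 16;
    const double transfers_per_s = 800e6;     /* assumed effective FSB rate */

    double bits_per_transfer_core = (double)bus_width_bits / cores;
    double total_gbs              = bus_width_bits / 8.0 * transfers_per_s / 1e9;

    printf("bits per bus transfer per core: %.0f\n", bits_per_transfer_core);
    printf("total bus bandwidth:            %.1f GB/s\n", total_gbs);
    printf("average per-core share:         %.1f GB/s\n", total_gbs / cores);
    return 0;
}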
“So they expect to hit 10 cores in 2010, right?
I guess the Cell chip will carry 40 cores by then.”
Cell has only ONE real core. The other 8 cores are only for multimedia.
It’s like Apple having a G4 with 1 PowerPC core and 8 AltiVec units.
Intel offers 10 REAL cores.
2010 is 35 dog years into the future. (IANADL)
Multicore or not, Cells can be combined too, today, and for 10 Cells that would mean 10 PPEs and 70 or 80 SPEs. That would absolutely blow your socks off, if your OS of choice can keep them busy. I’m not saying it would be low-watt, but neither would an Intel 10-core design be, today.
Multicore or not, Cells can be combined too, today, and for 10 Cells that would mean 10 PPEs and 70 or 80 SPEs. That would absolutely blow your socks off, if your OS of choice can keep them busy.
The OS can’t help you much there. With their separate instruction sets and local memories the SPEs have to be treated as exclusive resources.
So an application can ask for a number of SPEs, the OS checks that there are enough left and says “ok, here you go”, and then it’s entirely up to the application to put them to good use, whereby you can forget about things like pthreads.
At least that’s how Linux-on-Cell works at the moment. If you can come up with something better, I’d guess IBM and Sony would be very interested.
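To make that division of labour concrete, here is a rough sketch of the flow. The spe_* functions below are hypothetical stubs made up for illustration, not the real libspe API; the point is only that the OS hands out SPEs as whole units and the application drives them itself.

/* Hypothetical sketch of the Cell programming model described above.
 * The spe_* functions are stand-in stubs, NOT the real libspe calls. */
#include <stdio.h>

typedef int spe_handle_t;                       /* placeholder handle type */

/* Stub primitives so this compiles; a real system would provide these. */
static spe_handle_t spe_request(void) { static int n; return n < 8 ? n++ : -1; }
static void spe_load_program(spe_handle_t s, const char *elf)
    { printf("SPE %d: load %s into local store\n", s, elf); }
static void spe_run(spe_handle_t s)       { printf("SPE %d: start\n", s); }
static void spe_wait_done(spe_handle_t s) { printf("SPE %d: finished\n", s); }

#define WANTED 4

int main(void)
{
    spe_handle_t spes[WANTED];
    int granted = 0;

    /* 1. Ask the OS for whole SPEs: it grants them or it doesn't; it does
     *    not time-slice or schedule other work onto them for you. */
    while (granted < WANTED && (spes[granted] = spe_request()) >= 0)
        granted++;
    printf("got %d of %d SPEs\n", granted, WANTED);

    /* 2. From here on it's the application's job: put an SPE binary into
     *    each local store and hand each SPE its own slice of the work. */
    for (int i = 0; i < granted; i++) {
        spe_load_program(spes[i], "worker_spu.elf");   /* hypothetical name */
        spe_run(spes[i]);
    }

    /* 3. No pthreads-style abstraction: wait for each SPE explicitly. */
    for (int i = 0; i < granted; i++)
        spe_wait_done(spes[i]);

    return 0;
}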
http://www.businessweek.com/technology/content/aug2005/tc20050823_0…
At least now all the folks snarking about how the Intel-based Macs will be 32-bit machines are going to shut up.