“Many people in the industry assumed that Itanium had a low – and poor – profile among end users. That was what the folks at IDC assumed until recently, when they surveyed 500 members of their Enterprise Server Customer Panel. The results were somewhat surprising, they said. Not only was there a high level of awareness among the users – more than 80 percent knew of the platform – but that their intent to buy an Itanium system was fairly strong. About 24 percent of those polled said they had bought at least one Itanium system, though only 13 percent of non-HP users had done so. However, more than a third of all participants said they were highly likely to buy an Itanium system within the next 12 to 18 months.”
Intel.
Is it sad when companies try to use herd marketing to increase the sales of their products?
My own, rather less formal, survey among commercial buyers and users of high-performance computing tells me the exact opposite. Most big oil companies – with the exception of BP – are dead set *against* Itanium.
They are standardizing on AMD64, as it were. HPC servers used to be dual-CPU, but are now frequently quad-CPU systems – again, most of them AMD64-based.
Dual and quad Opteron-based servers are not HPC – I'd put them in the low/mid server range, department level. And in the *true* HPC space (16-128 way), Opteron-based solutions are currently commercially non-existent.
> Dual and quad Opteron-based servers are not HPC
Doesn’t that depend upon whether you are talking cluster or SSI? I’d say 128 dual or quad Opteron systems in a cluster would definitely be HPC.
A different animal than an Altix, but certainly HPC.
HPC space (16-128 way), Opteron-based solutions are currently commercially non-existent.
Two links.
http://www.appro.com/product/main.asp
http://www.cray.com/products/xt3/index.html
The fact is the itanium is a viable modern CPU architecture among many old and moldy designs. Give it some more time and it will catch its breath.
Roughly how many years of dev. time and how many billions in dev. costs will it take to catch its breath? Not that anyone really cares anymore… I think most people gave up on it some time ago.
The fact is the itanium is a viable modern CPU architecture among many old and moldy designs. Give it some more time and it will catch its breath.
Itanium was designed as competition to POWER and SPARC; both have since pushed their performance up several notches – Niagara from Sun and POWER6 from IBM – which begs the question: where does Itanium fit into the equation when, so far, the only vendor with 120% commitment to the platform is HP with HP-UX, plus Red Hat, whose support owes more to its relationship with Intel than to any well-argued business case?
With IBM pushing their POWER architecture forward in ways never seen back when Itanium first appeared, and SPARC being pushed via the Niagara I/II architecture – and given Itanium's so far abysmal price/performance – one really has to wonder how much these so-called 'people surveys' reflect reality. It's all very nice for people to know of a platform and report 'strong purchase intent', but the same could be said about *ANY* vendor.
They are quickly reaching the limits of the given architecture – hence the reason IBM is starting to push Cell. Itanium is new and fresh and has some room to flex; not to mention that, from my understanding, an OS hasn't been written to take full advantage of Itanium's full feature set. HP-UX v2 is basically a port, same with Red Hat – not fully optimized. x86 is long in the tooth, and 64-bit x86, albeit somewhat new, won't last.
Alpha was great technology owned by a bunch of dumbasses (Digital) who couldn't market it properly. Itanium is in a much better position.
Well, that's what Itanium apologists (this isn't directed at you) keep saying; there comes a time to just admit to customers, or potential customers, that the idea seemed great on paper but in reality never delivered, due to its complexity.
Personally, I think what really killed it was the cancellation of Solaris for Itanium, coupled with Microsoft's decision to position the Itanium version of Windows as a narrow niche product that wouldn't have all the bells and whistles of the x86-64 version.
Personally, if I were Intel, I would have gone with an ISA that had a chance – the microarchitecture you can change, modify, and turn upside down if you want, but once you've chosen your ISA, you're stuck with it.
SPARC would have been a good one – don't take Sun as the benchmark for what can be accomplished. An Intel microarchitecture with the SPARC ISA (along with VIS) bolted on top, coupled with Intel's raw economies of scale, could have turned out to be quite a nice product. Throw that on an EFI motherboard and, Bob's your uncle, you'd have a processor with an open-standards ISA, 64-bit from the ground up, well known, with good compilers available and a large ecosystem of software vendors. It would have been a massive win-win situation.
Well, keep in mind that Itanium wasn't really meant to be anything more than HP's replacement for its PA-RISC line, and in that capacity it's doing okay. Any other sales or hardware builds outside of that are for Intel to make, really.
Well, keep in mind that Itanium wasn't really meant to be anything more than HP's replacement for its PA-RISC line, and in that capacity it's doing okay. Any other sales or hardware builds outside of that are for Intel to make, really.
True, but at the same time both Intel and HP have a vested interest in it moving beyond being a little HP hobby horse – ultimately, they haven't yet made their money back on all the hype and investment surrounding it.
The prices of Itanium equipment are still far too high compared with what Sun is providing – and that's before you even get to price/performance.
What HP needs is investment in pushing high-end workstation tools onto Itanium/HP-UX, making HP-UX a more attractive target – hell, why not make cheap Itanium machines available with a free version of HP-UX to encourage developers, both commercial and open source (the latter of which a lot of technical workstation users rely on), to start treating HP-UX on Itanium as a first-class target rather than the current situation?
an OS hasn’t been written to take full advantage of Itanium’s full feature set.
What features would those be? I don’t think the OS can do much for the Itanium; it’s the compiler that has to do all the hard work of exploiting instruction-level parallelism. The OS is more critical on architectures that rely on thread-level parallelism, e.g. Sun’s Niagara.
…should really love this architecture (although I’m referring to those who actually like the architecture of x86), because the premise of the chip is based on the concepts that made x86 able to compete with RISC chips. First of all, it takes out-of-order operation to the extreme by explicitly stating which instructions to execute immediately: what compilers can only “suggest”, you can actually state. Secondly, while many say that x86’s CISC instruction set is like gzip-ing your machine code, well, with Itanium that potential is even greater. The problem is that if you can’t find parallel instructions you get stuck with no-ops… which however aren’t the size of a full instruction so they are smaller than a RISC no-op.
Really, the problem is with compilers, and I believe C/C++ specifically. If everything is just memory and you can access any data at any time, you cannot guarantee that doing two operations simultaneously won’t change the end result of the program.
Really, I think the industry needs a safer natively compiled language along the Oberon-2 line… something for when you want speed but don’t need endless pointer hacks for that last 1/1000th of efficiency.
The x86 was never designed with out-of-order execution in mind. It wasn’t until the Pentium Pro that Intel sold an x86 processor with out-of-order execution. While the idea behind EPIC is to obtain greater amounts of ILP, there is no architectural similarity between the approaches of a modern x86 processor and an Itanium 2. Bundles in EPIC are explicitly scheduled by the compiler statically and executed in-order, which works out in special cases, but sucks a nut on data-driven execution. It also makes the binaries much more sensitive to the target processor and compiler. Binaries compiled for IA64 are quite large because of the size of bundles and encoding constraints.
IA64 would benefit from something akin to Dynamo.
Secondly, while many say that x86’s CISC instruction set is like gzip-ing your machine code, well, with Itanium that potential is even greater.
What? The Itanium encodes only three instructions in one 128-bit bundle, so its code density is even worse than standard RISCs with their fixed 32-bit encodings. And it’s load-store as well, thus losing out further compared to x86.
The problem is that if you can’t find parallel instructions you get stuck with no-ops…
Actually you can have sequential instructions within one 128-bit bundle, but only certain combinations of instruction types are allowed, so no-ops are indeed needed occasionally. Furthermore, branches can only address bundles, so you get an average of one no-op at the end of every basic block.
which however aren’t the size of a full instruction so they are smaller than a RISC no-op.
Wrong. They’re the same whopping 40-something bits.
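To put rough numbers on the density point, here is a minimal C sketch using the documented IA-64 bundle layout (a 5-bit template plus three 41-bit slots per 128-bit bundle); the x86 figure is only a typical average, not a measurement:

    #include <stdio.h>

    /* Rough code-density comparison. IA-64 packs three 41-bit
     * instruction slots plus a 5-bit template into each 128-bit
     * bundle, so even a useful instruction costs ~42.7 bits of
     * fetch bandwidth, and a no-op slot costs exactly the same. */
    int main(void) {
        const double bundle_bits = 128.0, slots = 3.0;
        printf("IA-64: %.1f bits per slot (41-bit instruction + share of template)\n",
               bundle_bits / slots);                          /* 42.7 */
        printf("Fixed-length RISC: 32 bits per instruction\n");
        printf("x86: variable length, roughly 24-32 bits on average\n");
        return 0;
    }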
I think a Cray XT3 would be considered a “true” supercomputer. That uses the Opteron chip as its CPU. Red Storm, #6 in the last Top 500 chart, has 10,880 of those tied together, to be specific. #10 on that list is another XT3 with about half as many processors. AMD64 is indeed represented in true HPC applications.
What a puff piece – this is in direct contrast to all the cancellation stories over the last few years.
I have little faith in EPIC, or VLIW, or predication and everything that Itanium stands for, but if you look at the giant die pics, its cost structure isn’t as bad as the much smaller and cheaper x86 would suggest. The dies are almost entirely covered by repairable cache RAM blocks, so the yield isn’t as low as the size would suggest, which means Intel can probably make as many as anyone wants and sit it out.
My own view of the future of efficient HPC computer architecture is diametrically opposite, much more in line with Niagara and even Tera MTA.
Are you telling me that almost 20 percent of the members of the Enterprise Server Customer Panel have never heard of Itanium? That should indeed dismay Intel, considering all the directed marketing effort over the last 8 or 9 years!
Seriously, does IDC think all of HP’s PA-RISC customers will migrate to Itanium? I really doubt that. IDC has been wrong before on this subject. Very, very wrong.
What else do HP users have to look forward to? HP has sold off all its other chips and has sworn allegiance to Intel.
So their choices are to look forward to the mediocre performance of Intel's x86 chips, which have been tested and benchmarked to be slower than AMD's solution. At least the vaporware promises of Itanium allow them to dream of the day they have the best-performing chip.
Who else besides HP-UX users could look forward to a chip that keeps shedding components and its support for 32-bit applications, as noted at:
http://uadmin.blogspot.com/2006/01/itanium-another-step-closer-to-d…
Perhaps if they strip out enough functionality they can increase the caches enough to get decent performance to match the current breed of Pentium 4 chips.
Who else besides HP-UX users could look forward to a chip that keeps shedding components and its support for 32-bit applications, as noted at:
http://uadmin.blogspot.com/2006/01/itanium-another-step-closer-to-d…
Amazing.
1. HP-UX is designed for PA-RISC and IA64; HP-UX users couldn’t care less about x86 compatibility.
2. Discussed many times. The hardware x86 compatibility module was removed from the Itanium die because software emulation is faster now, and this fact – being completely irrelevant to the imaginary “Itanium death” – is actually good for IA64.
Perhaps if they strip out enough functionality they can increase the caches enough to get decent performance to match the current breed of Pentium 4 chips
Go to spec.org and try to find a chip faster than Itanium.
“The fact is the itanium is a viable modern CPU architecture among many old and moldy designs. Give it some more time and it will catch its breath.”
Well, great architecture, poor economics. Just like Alpha. Great performance was not enough for Alpha, and Itanium will suffer the same fate. Without support from the enterprise software vendors – SAP, PeopleSoft, Oracle, Microsoft, etc. – you can forget about enterprise purchases.
Until Itanium provides clear, long-term value for larger businesses, they will keep purchasing x86-64 servers.
The Alpha was pretty competitive price-wise in its time. DEC had a whole host of problems that sank it, which just took the platform with it. Compaq certainly wasn’t the company to take the architecture anywhere. I’d say comparing the Alpha and the Itanium 2 on those grounds is giving the Itanium 2 too much credit.
“The x86 was never designed with out-of-order execution in mind. ”
Not only that, but the Itanium itself is an in-order processor, which suggests that the original author of the comment regarding the out-of-order-ness of x86 and IA64 hasn’t a single clue about what out-of-order is….
Not only that, but the Itanium itself is an in-order processor, which suggests that the original author of the comment regarding the out-of-order-ness of x86 and IA64 hasn’t a single clue about what out-of-order is….
True – I read the post a few times and I did a “what tha!?”.
It is only a relatively recent development that RISC-based processors like SPARC and POWER have included OOE – the idea of RISC was to keep the core as simple as possible, push the heavy lifting over to the compiler, and let the clock speed sort the rest out.
When the idea was floated, you could say Itanium was RISC taken to the absolute extreme of reductionness (yes, a nice GWB-ism when required) – the compiler did the work and the CPU was kept simple.
The reality is, however, that the theory and the practical real-world work that needs to be done in business never quite line up, and thus compromises needed to be made – OOE was added, for one thing.
What Itanium needs is volume: a push for technical workstation software, workstations sold at the same price as a high-end Opteron from a name-brand company, and a good number of operating systems to provide the necessary flagship platform on which third-party vendors can base their applications.
I had many “is this opposite day?” moments reading his post, and it was only through self-restraint that I didn’t write five pages of text. Specifically, I had to bite my tongue on programming language features that are especially disadvantageous to static analysis, which Oberon-2 certainly doesn’t cure.
“Well, great architecture, poor economics. Just like Alpha.”
The Alpha’s economics weren’t as poor as you suggest… the Alpha had, in fact, two problems: when Compaq bought DEC, they weren’t sure what to do with it, and then HP came along and bought Compaq, axing the Alpha in favor of their PA-RISC and Itanium iron.
If things had played out differently for DEC, we wouldn’t have the Alpha leading the server market, but it would be giving SPARC a run for its money.
OK, this is not the first time IDC has claimed to have done a survey whose results seem very fictional, so that really doesn’t surprise me. What does surprise me is that Intel just won’t give up on the agonizing platform. Oh, and please don’t start arguing against the truth: yes, almost everybody who is somebody in the IT industry knows about Itanium, but that doesn’t change the fact that it was never popular. Opterons, though weaker, are the choice of many IT departments today. Plus, if you want more power you can just get more CPUs, and it really doesn’t cost you that much more. As far as really big iron goes, I don’t see what Intel believes it can do to take customers away from IBM and Sun.
Sorry Intel but you missed the train about 3 years ago.
What does surprise me is that Intel just won’t give up on the agonizing platform.
I’d say x86 is the agonizing platform; sooner or later this 30-year-old architecture will hit the scalability and performance wall.
Plus if you want more power you can just get more CPUs and it really doesn’t cost you that much more
You can’t just keep adding more processors forever; otherwise we would all be running clusters of 486s.
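To put that in perspective, here is a minimal C sketch of Amdahl's law; the 95% parallel fraction is an assumption chosen purely for illustration:

    #include <stdio.h>

    /* Amdahl's law: overall speedup on n processors when only a
     * fraction p of the work can be parallelised. */
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        const double p = 0.95;                         /* assumed parallel fraction */
        printf("  8 CPUs: %4.1fx\n", speedup(p, 8));   /* ~5.9x  */
        printf(" 32 CPUs: %4.1fx\n", speedup(p, 32));  /* ~12.5x */
        printf("128 CPUs: %4.1fx\n", speedup(p, 128)); /* ~17.4x */
        return 0;
    }

Even with 95% of the work parallelised, the serial 5% caps the speedup at 20x no matter how many processors you add.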
I’d say x86 is the agonizing platform; sooner or later this 30-year-old architecture will hit the scalability and performance wall.
The x86’s demise has been predicted for at least 20 years, so you’re gonna have to come up with some more convincing evidence for your assertion.
The x86 ISA has been extended and adapted so often and successfully that the scalability argument is just silly. And if x86 is so bad, why does nobody, including Intel themselves, manage to beat it (and not just for special applications) at the same transistor budget?
x86 may not be pretty, but it certainly does the job. And with its compact code it’s actually quite well suited to today’s requirements, where memory bandwidth and latency are much more important than the size of the instruction decoder.
Besides, what exactly is an “agonizing platform”?
The x86’s demise has been predicted for at least 20 years, so you’re gonna have to come up with some more convincing evidence for your assertion.
And so was the end of lithographic technology predicted; it still holds out, but that doesn’t prove it will not reach its practical/physical limitations at some point. It WILL.
Technology/engineering will always find its way around, but that doesn’t mean it’s the best way. Transition costs and compatibility are really the key terms in this issue, so the industry always tends to postpone such gigantic transitions (LCD vs. CRT, etc.).
The x86 ISA has been extended and adapted so often and successfully that the scalability argument is just silly.
You think so? Just look at the figures showing real performance gains over the past decade. You’ll be surprised how the curve flattens due to different factors; x86 just happens to be one of them.
And if x86 is so bad, why does nobody, including Intel themselves, manage to beat it (and not just for special applications) at the same transistor budget?
And who would beat that mammoth application base, with its software developers? Like I said, compatibility is really a key issue here.
x86 may not be pretty, but it certainly does the job. And with its compact code it’s actually quite well suited to today’s requirements, where memory bandwidth and latency are much more important than the size of the instruction decoder.
Now, you don’t have a clue what you’re talking about, do you? Bandwidth is always opposed to latency, and the instruction decoder is just a way to save bandwidth at the expense of latency. Furthermore, it limits the CPU’s ability to process data by delaying and limiting the number of instructions fed to its pipelines. Out-of-order execution just makes things worse when it comes to a prediction miss (pipeline flush). It’s not that simple, you know.
Just look at the figures showing real performance gains over the past decade. You’ll be surprised how the curve flattens due to different factors; x86 just happens to be one of them.
The slowdown is mostly due to semiconductor technology; in particular, the 90nm step didn’t quite deliver what was expected. Also, Intel went wrong with Netburst, and everyone is fighting power trouble. You’ll have to explain further what the instruction set has to do with any of that, with particular reference to why Itanium and others aren’t doing any better.
And who would beat that mammoth application base, with its software developers?
I was talking about raw performance, and to allow for volume I had only said at the same transistor budget, rather than at the same price. Granted, higher volume allows for more development effort too, but then again Itanium had all the development resources it could have asked for.
Now, you don’t have a clue what you’re talking about, do you?
I do actually. But how about staying on topic?
Bandwidth is always opposed to latency
No, it’s not, they’re pretty much orthogonal. Simply by widening your connections you can always add extra bandwidth (at a cost), without necessarily affecting latency.
You’re right though in that more compact code doesn’t directly help with latency. But it does reduce instruction cache requirements, which means you can fit more in the cache, which means fewer cache misses, which effectively result in lower latency.
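The standard way to quantify that effect is the average memory access time model; a tiny C sketch, with all cycle counts and miss rates made up purely for illustration:

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty.
     * Denser code means more instructions per cache line, hence a
     * lower I-cache miss rate and a lower effective fetch latency. */
    int main(void) {
        const double hit = 2.0;      /* L1 I-cache hit, cycles (assumed)   */
        const double penalty = 20.0; /* extra cycles to reach L2 (assumed) */
        printf("5%% miss rate: %.1f cycles average\n", hit + 0.05 * penalty); /* 3.0 */
        printf("2%% miss rate: %.1f cycles average\n", hit + 0.02 * penalty); /* 2.4 */
        return 0;
    }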
Furthermore, it limits the CPU’s ability to process data by delaying and limiting the number of instructions fed to its pipelines.
I take the point about the delay, but how many decode pipeline stages could something like an Opteron really save with a simpler instruction encoding. One? Two? And note that that only becomes significant when there’s a mispredicted branch.
And I don’t see how it limits the number of issued instructions. The Opteron does three per cycle. Intel’s new architecture will do four, and it remains to be seen whether there actually is enough instruction-level parallelism to make that worthwhile.
Out-of-order execution just makes things worse when it comes to prediction miss (pipeline flush)
Worse than what, in-order? I don’t agree with that, but what’s that got to do with the instruction format?
The slowdown is mostly due to semiconductor technology; in particular, the 90nm step didn’t quite deliver what was expected. Also, Intel went wrong with Netburst, and everyone is fighting power trouble. You’ll have to explain further what the instruction set has to do with any of that, with particular reference to why Itanium and others aren’t doing any better.
I was referring to x86 in general. Intel and AMD are taking different approaches to the same issue, yet the performance difference between their (price-)competing CPUs is around a couple of percent for most applications. Sure, lithographic technology is one of the biggest limitations, but x86 can’t scale that well any more either. It’s not my “competent” opinion, rather a well-known concern. That’s why we need SIMD instruction extensions every two to three years, just to (partially) circumvent architectural limitations.
I was talking about raw performance, and to allow for volume I had only said at the same transistor budget, rather than at the same price. Granted, higher volume allows for more development effort too, but then again Itanium had all the development resources it could have asked for.
I’m not following you. I was referring to the obstacles to radical architecture changes. You can’t compare apples with oranges: a CPU architecture has to be supported with an OS and software base, and that support is the most difficult thing for any newcomer in the industry to achieve.
No, it’s not, they’re pretty much orthogonal. Simply by widening your connections you can always add extra bandwidth (at a cost), without necessarily affecting latency.
Yes, but engineering is about finding a way around real-world restrictions and limitations, not brute-forcing things to death. I thought we were being reasonable in this discussion, not getting wild and fancy – like you said, staying at the same transistor budget. Nevertheless, you can’t drive too many signal lines fast in parallel without jeopardising signal integrity. That’s why everything is being serialised as much as possible.
You’re right though in that more compact code doesn’t directly help with latency. But it does reduce instruction cache requirements, which means you can fit more in the cache, which means fewer cache misses, which effectively result in lower latency.
No it doesn’t. Out-of-order execution implies inevitable misses and flushes, and when they occur you have to fetch and decode new instructions, which doesn’t help latency. That being said, the caching algorithm becomes much more important with out-of-order execution. That’s why Intel’s P4 caches micro-ops instead of whole x86 instructions.
I take the point about the delay, but how many decode pipeline stages could something like an Opteron really save with a simpler instruction encoding. One? Two? And note that that only becomes significant when there’s a mispredicted branch.
And I don’t see how it limits the number of issued instructions. The Opteron does three per cycle. Intel’s new architecture will do four, and it remains to be seen whether there actually is enough instruction-level parallelism to make that worthwhile.
It’s not just about getting rid of decoding stages, though that’s not insignificant. Current x86 CPUs are 3-way superscalar, which means they have 3 distinct execution pipelines and are thus equipped with an “appropriate” instruction decoder, which supposedly decodes 3 macro (x86) instructions per cycle. But that is only theory. In reality that number is much closer to 2.5, which obviously implies that the instruction decoder IS a limiting factor.
So, let’s sum up:
1) an increasing number of instructions every 2-3 years (which also requires software optimisations)
2) an increasing number of pipelines
3) application/user demands shifting to multi-threading (TLP)
Having that in mind, it’s not strange that Intel and AMD are increasing the number of cores rather than the number of pipelines. Can you imagine what an engineering nightmare an 8-way superscalar processor would be to design? :O Just ask IBM.
In a couple of years, when x86 engineers are struggling to fit 8 massive x86 cores on a single die, Itanium will pack several dozen of them into the same transistor count. Not bad, huh?
Worse than what, in-order? I don’t agree with that, but what’s that got to do with the instruction format?
Everything. Are you familiar at all with the VLIW/EPIC way of doing things?
I’m not following you. I was referring to the obstacles to radical architecture changes.
OK, I’ll try again. Itanium had a large chunk of Intel’s impressive development resources thrown at it, yet it did not yield performance results that are greatly superior to x86. It’s better at some things, but worse at others, even though it has huge and expensive caches.
The problems of architectural change come on top of that.
Out-of-order execution implies inevitable misses and flushes.
Yes, but what are you trying to say, that in-order execution doesn’t have those?!?
The whole point of out-of-order execution is that it can continue executing other instructions where an in-order processor gets held up by cache misses or other long-latency operations.
And there’s nothing that either approach can do about mispredicted branches.
In reality that number is much closer to 2.5, which obviously implies that the instruction decoder IS a limiting factor.
No, that isn’t obvious at all. While there are certain restrictions on instruction decoding, the main limiting factor is the available amount of instruction-level parallelism, and that doesn’t change whether instructions are scheduled at runtime or compile-time.
what’s that got to do with the instruction format?
Everything. Are you familiar at all with the VLIW/EPIC way of doing things?
Yes. You could have an explicitly parallel instruction set with variable-length instructions. Variable-length vs fixed-length and in-order vs out-of-order are orthogonal issues.
OK, can we now take a break from zealotry, at least for a moment? It’s strange (well, not that strange) how every KDE or Intel news item draws comments about GNOME or AMD being better, or the other way around.
I do realize that people get aggravated by Intel’s (arguably) overpriced technology and their marketing (megahertz) stunts in the PC segment, and you are free to boycott their products, but it’s plain wrong to ditch a technology/argument on that basis alone. AMD happens to have been our saviour in recent years, bringing down Intel’s monopoly, but hey… don’t get any wild ideas… they would act exactly as Intel does if they were in that position. After all, it’s all about money.
Opterons are great at what they do and they are a pain in Intel’s butt, which is a good thing for consumers, but we are talking about something different here. IT will eventually depart from x86(_64) as we know it, and hopefully AMD will have its own solution when the time comes, so leave that aside.
Regarding EPIC (IA64), it certainly has great (technological) potential in the long term, but as has already been concluded many times, it suffers from poor industry support and bad timing (it debuted in a time of recession and the rise of clustering technology).
Itanium has had its rough days, and they are by no means over. Nevertheless, I think it’s not ready for a museum, not by far. I’m not saying that VLIW is the only way to go for IPC, but it sure seems reasonable considering recent multi-core trends and rising power consumption. I mean, its in-order instruction_decoderless approach does gain in simplicity compared to “N-way superscalar out-of-order multi-threaded” monsters like Power6 and recent x86_on_steroids.
I guess time will tell.
I mean, its in-order instruction_decoderless approach does gain in simplicity
True, but then it loses it all again and more on the need for extra caches, because compiler scheduling cannot be as good as out-of-order runtime scheduling, due to the unpredictability of today’s complex memory hierarchies.
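A concrete way to see that unpredictability point is the contrast below – a minimal C sketch (the structures are hypothetical, purely for illustration) of code whose load latencies a compiler can schedule around versus code where it cannot:

    #include <stddef.h>

    /* Array sum: addresses are known in advance, so a static (compile-time)
     * scheduler can hoist or prefetch the loads and hide their latency. */
    long sum_array(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Pointer chasing: each load's address depends on the previous load,
     * and whether it hits L1, L2 or DRAM is unknowable at compile time.
     * An out-of-order core at least keeps other independent work in flight;
     * a purely static schedule has to assume a fixed latency and often stalls. */
    struct node { long value; struct node *next; };

    long sum_list(const struct node *p) {
        long s = 0;
        while (p) {
            s += p->value;
            p = p->next;
        }
        return s;
    }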
True, but then it loses it all again and more on the need for extra caches,
Yes, that’s one of its drawbacks, but it will become less significant in the future, as the number of cores per die builds up.
because compiler scheduling cannot be as good as out-of-order runtime scheduling, due to the unpredictability of today’s complex memory hierarchies.
Static scheduling has its problems, like non-deterministic latencies, but so does hardware/dynamic scheduling, in the form of poor ILP and cache management (thrashing). Compile-time scheduling can give the hardware lots of hints about a program’s structure, and in EPIC’s case can actually schedule some branch predictions far before they would occur. Furthermore, EPIC unbundles branches into three stages (compare, prepare-to-branch, and the actual branch), allowing more sophisticated static/compiler as well as dynamic/hardware scheduling, thus reducing both exposed and hidden latencies. It also incorporates programmatic data cache management, concurrent branching, speculative loads, etc.
EPIC is in essence a great conceptual architecture, it’s Itanium as an implementation which is at stake here.
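To make the predication point above concrete, here is a minimal C sketch of the kind of if-conversion an EPIC compiler can perform; the predicate is simulated with an ordinary integer, since real IA-64 predicate registers are a hardware feature:

    /* Branchy original: the hardware has to predict which way it goes. */
    int max_branchy(int a, int b) {
        if (a > b)
            return a;
        return b;
    }

    /* If-converted form: both candidates are evaluated and a predicate
     * selects the result, so there is no branch to mispredict. On IA-64
     * the compiler would emit both sides under complementary predicate
     * registers; here the selection is just arithmetic. */
    int max_predicated(int a, int b) {
        int p = (a > b);             /* predicate: 1 or 0 */
        return p * a + (1 - p) * b;  /* select without branching */
    }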
OK, I’ll try again. Itanium had a large chunk of Intel’s impressive development resources thrown at it, yet it did not yield performance results that are greatly superior to x86. It’s better at some things, but worse at others, even though it has huge and expensive caches.
Actually, you are comparing apples with oranges again; you can’t make comparisons like that. In the long term, EPIC has a lot of advantages over x86 out-of-order superscalars, and Itanium is just one implementation of the concept, targeting the big-iron market.
Who knows, maybe in 10 years AMD will come up with a great general-purpose VLIW/EPIC architecture, beating Intel with its own weapon. It wouldn’t be the first time, would it? :p
Yes, but what are you trying to say, that in-order execution doesn’t have those?!?
The whole point of out-of-order execution is that it can continue executing other instructions where an in-order processor gets held up by cache misses or other long-latency operations.
And there’s nothing that either approach can do about mispredicted branches.
The whole point is about who makes the scheduling decisions, and how. And which instructions does an out-of-order architecture keep processing, and based on whose decision? Hardware scheduling and prediction have their practical limits, and that applies to cache management as well. I don’t see involving the compiler in the whole process as a bad thing whatsoever. The two (hardware and compiler scheduling) are not mutually exclusive, you know. EPIC can have a hardware scheduler as well as out-of-order execution, but it’s not required, which can simplify things in some designs.
No, that isn’t obvious at all. While there are certain restrictions on instruction decoding, the main limiting factor is the available amount of instruction-level parallelism, and that doesn’t change whether instructions are scheduled at runtime or compile-time.
Uuuuuhhh! Now that was a hard mistake. Itanium has the ability to exploit compile-time optimisations, rather than making an educated guess (speculating) about where a branch is going to land. EPIC utilises compile-time scheduling to detect and group instructions that can be parallelised, as well as branches that can be precomputed, hiding their latencies. That is the whole point of EPIC.
Yes. You could have an explicitly parallel instruction set with variable-length instructions. Variable-length vs fixed-length and in-order vs out-of-order are orthogonal issues.
You are missing the point. Please read more on the issue before getting involved in such discussions.
In conclusion: engineering is about making compromises in order to achieve the best results for a given application. Like every engineering design/solution, IA64 has its strong points but also its weaknesses, which you pretty much failed to address.
I’m not saying that Itanium is an almighty CPU, nor that I would buy one if I had to choose among the available offers, but does it have a future? I don’t know, but sure, why not. It certainly has lots of potential, or at least that’s been stated by eminent experts (university professors).
P.S. I’m not a CPU architect, nor do I consider myself competent enough to make hard statements about engineering approaches and solutions, but I see that many people have no problem commenting on something they don’t have a clue about.
EPIC is in essence a great conceptual architecture,
Yes that’s what some people keep on saying, yet when you look at it in detail it’s just another point in the design space which is faster on predictable code and weaker on unpredictable code.
it’s Itanium as an implementation which is at stake here.
Yeah right, those stupid Intel engineers – they just didn’t have the talent, time, or money to get the most out of the architecture.
but so does hardware/dynamic scheduling, in the form of poor ILP and cache management (thrashing).
Where do you think ILP actually comes from? ILP is an inherent property of a program that is determined by the data and control dependencies within it. Dynamic scheduling is actually very good at extracting ILP, because it does not have to rely on predictions based on incomplete data, but instead schedules operations whenever their operands actually become available.
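A minimal C illustration of that point – the ILP is fixed by the dependence structure, regardless of whether a compiler or an out-of-order core does the scheduling:

    /* Serial dependence chain: every add needs the previous result, so
     * neither a static nor a dynamic scheduler can overlap them (ILP ~ 1). */
    long chain(long x) {
        x += 1;
        x += 2;
        x += 3;
        x += 4;
        return x;
    }

    /* Independent operations: all four adds can issue in the same cycle
     * on a wide enough machine, however the scheduling is done (ILP ~ 4). */
    long independent(long a, long b, long c, long d) {
        a += 1;
        b += 2;
        c += 3;
        d += 4;
        return a + b + c + d;
    }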
And what the heck has cache management got to do with it? Cache prefetch and bypassing schemes can be and are used with both approaches.
The whole point is about who makes the scheduling decisions, and how. And which instructions does an out-of-order architecture keep processing, and based on whose decision?
An out-of-order architecture makes its scheduling decisions based on the actual, rather than the unreliably predicted, availability of the operands that an instruction depends on. Here’s a very good article on the Opteron:
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture…
Itanium has the ability to exploit compile-time optimisations, rather than making an educated guess (speculating) about where a branch is going to land.
You’ve got an irrational belief in compiler optimisations. What exactly do you think compile-time branch prediction is if not a more or less educated guess?
Run-time branch prediction can at least rely on the recent history of the actual program, whereas the compiler has to rely on static heuristics alone. Profiling can of course help with that, but it requires extra time and realistic test data.
In any case you end up tying your compiled program to a particular implementation of the architecture, because any changes in latencies will require recompiling for optimum performance.
Yes that’s what some people keep on saying, yet when you look at it in detail it’s just another point in the design space which is faster on predictable code and weaker on unpredictable code.
Actually, all code is somewhat predictable; it is only a matter of how well. That said, EPIC should do a much better job at predicting and rescheduling all sorts of branches than any hardware/dynamic scheduler. And the people (researchers) who state so are not limited to those on Intel’s payroll. I personally am not comfortable arguing about that in detail.
Yeah right, those stupid Intel engineers – they just didn’t have the talent, time, or money to get the most out of the architecture.
It’s not just about that. There are many other issues involved in the success of any given architecture. Leaving them out of scope, you could just as well state that the Pentium Pro had a terrible architecture.
Where do you think ILP actually comes from? ILP is an inherent property of a program that is determined by the data and control dependencies within it. Dynamic scheduling is actually very good at extracting ILP, because it does not have to rely on predictions based on incomplete data, but instead schedules operations whenever their operands actually become available.
And what the heck has cache management got to do with it? Cache prefetch and bypassing schemes can be and are used with both approaches.
EPIC doesn’t forbid the use of a dynamic scheduler (which can override the static one); it just uses static scheduling where applicable. And cache management has a lot to do with it, because the compiler has much better awareness of data types. Predicting at which level some data resides opens a whole new issue regarding assumed versus actual latency, but let’s not go that far.
An out-of-order architecture makes its scheduling decisions based on the actual, rather than the unreliably predicted, availability of the operands that an instruction depends on. Here’s a very good article on the Opteron:
You’ve got an irrational belief in compiler optimisations. What exactly do you think compile-time branch prediction is if not a more or less educated guess?
Run-time branch prediction can at least rely on the recent history of the actual program, whereas the compiler has to rely on static heuristics alone. Profiling can of course help with that, but it requires extra time and realistic test data.
As I said, EPIC can use both static and dynamic scheduling, and the belief in compile-time predictions is not mine alone, and it certainly is not irrational. I guess you could compare it with one-pass vs. n-pass MPEG-4 encoding: you are genuinely limited in what you can do in one pass (real time), and data type (un)awareness is an issue of its own.
Dynamic branch prediction has its strong points but also its limitations. IA64 just tried to address them, rather than throwing the baby out with the bathwater.
You are arguing about issues that can’t be settled like this, and certainly not by me and you. It’s rather an academic and scientific issue, which has been addressed in many studies and PhD theses.