“This will end up being one of the world’s worst investments, I’m afraid.” – David House, former Intel chief of corporate strategy, speaking in the early 1990s. I’ve been fascinated by microprocessors for years and have been following the Merced debacle since 1994, when HP and Intel announced they were getting together to build some amazing new technology.
It turned out this amazing new technology was really just an enhanced VLIW, a technique that had failed in the 1980s because the compilers it required were far too processor-intensive to be of any use at the time. However, two companies had specialized in the field, and HP had since acquired their expertise.
One group of people did get VLIW to work though – the Russians. It transpired that engineers from both Sun and HP had visited a Russian company called Elbrus, which had been ahead of the West’s technology for years. When they returned, both companies promptly started VLIW projects. Sun later canned its project and the leaders went off to start a company called Transmeta; HP eventually merged its project with Intel’s.
The Russians kept on working and later announced a very high speed but apparently invisible CPU called the Elbrus 2K.
There was loads of hype around the super-secret Merced project, and many expected it simply to roll over the server market. Such was the expectation, in fact, that both HP and SGI scrapped their own CPU lines (PA-RISC and MIPS Rx000) in favor of Merced.
DEC, in fact, were so worried they tried to foul things up for Intel by taking them to court. This ended with a deal in which Intel bought a fab (chip factory) and DEC’s StrongARM team, and Intel also took over manufacturing the Alphas for DEC.
It was only later that rumors started to appear that the Merced project wasn’t going quite to plan. Performance was not as expected, and HP and SGI promptly backtracked and “extended” their CPU lines.
At one point it was questionable whether Merced would ever be released as a CPU at all. In the end it was quietly downgraded to a “development” platform and released under the Itanium brand, which The Register re-christened Itanic. The real attack on the market would only come with the 2nd-generation McKinley CPU.
While the idea behind Merced was very good, Intel’s implementation was very bad. Intel screwed up badly during development and, instead of a low-cost, ultra-fast CPU, ended up with a massively complex and hideously expensive beast which didn’t deliver the expected stellar performance. This after spending $1 – $2 billion and delivering it 3 years late.
The Itanium family may in fact never reach the lofty performance originally advertised. It relies on achieving a high number of instructions per cycle, and this has proved remarkably difficult in practice. In the PowerPC world, Motorola actually dropped back to 2 integer units because the 3rd unit was used so little in previous designs.
The 2nd-generation McKinley will increase performance significantly, but it relies in part on a massive cache and high bandwidth for its performance. Notably, McKinley was started at HP, not Intel. It has, however, the IBM POWER4 to contend with, and the Alpha 21364 – which will also have a large cache – is on its way; the latter is expected to take the performance lead when it appears.
Ironically, the Itanium line may also suffer the same problems as the x86 it was meant to replace: the ISA is not what it could be. Tech columnist Paul DeMone’s comment was “The IA64 instruction set architecture was designed by a committee” [1]. In addition, an article on the Elbrus site compares their approach with that of Intel and Transmeta. They are not exactly impressed with the Intel approach, stating that changing the CPU hardware design will reduce efficiency considerably – a known problem in the VLIW world.
Is McKinley going to work, or will Intel be a bit player in the world of mid- and high-end servers? How well the CPU performs is only one factor in that equation – Sun, which leads the server market, never seems to have leading-edge CPUs. The Alpha was / is the performance leader, yet it is destined to die out in the next few years while its engineers go to work for Intel.
But just in case, Intel are said to be working on an x86-64 compatible CPU.
References:
[1] Summary and Conclusion of “Making a Mountain Out of a Molehill” by Paul DeMone, in the Silicon Insider column.
About the Author:
Nicholas Blachford is a Software Engineer / Architect from Northern Ireland who lives in Amsterdam, Holland. He is interested in computers, microprocessors and all sorts of generally techie things. He’s written open source audio
software on the Amiga and BeOS, and is currently trying to write a geek comedy. There’s even a home page here. Apart from that he really needs to get out more. Nicholas can be reached via email at [email protected].
Can anyone explain to me why Compaq is currently totally committed to moving Tru64 Unix and OpenVMS away from the Alpha CPU and onto Itanium?
I can’t understand why Compaq is making such a huge commitment to Itanium and dumping Alpha if IA64 is so slow?
I can’t imagine it being just for some sort of political reason. Sun at one time had Solaris ported to Itanium but they abandoned it because of the lousy performance of the CPU.
“Historically, Intel has this remarkable ability to charge a factor of eight for a performance boost of two in microprocessors.”
they are really bad and only after the money. performance of a PII is almost identical to a PIII to a P4. the L1 and L2 cache are all that seems to matter!! let’s not kid ourselves with crappy comparisons and charts!!!
does the 64 bit hammer require special support from microsoft? will it require a separate build of windows XP????
For XP to take advantage of the 64 bit part of the chip, yes. To use the 32 bit stuff, no. I’d imagine you could run 64 bit programs on the 32 bit XP though.
Itanium and Transmeta are both VLIW.
Transmeta has these advantages over Itanium:
(a) lower power consumption
(b) lower cost
(c) faster x86 emulation.
In theory, it should be possible to program directly to the
Transmeta Crusoe processors, bypassing the x86 code-morphing software, and get a native VLIW performance boost.
Perhaps AMD or Apple should consider purchasing Transmeta and building their next-generation processor around it.
I’ve asked this before… has ANYONE ever been able to crack open the RISC engine that drives the current x86 lines? The hardware is in there, and I’m sure Intel must have a way of testing that section, so I suspect there is a back door *somewhere*. That would make for very interesting performance enhancements to existing x86 operating systems.
P
Before I get to that RISC comment…
“I’d imagine you could run 64 bit programs on the 32 bit XP though.”
I would imagine the opposite. The *hammer processors are able to execute legacy (there is no sweeter word to use when describing x86) x86 code by behaving exactly like it, in 32 bit mode. In x86-64’s 32 bit mode I don’t think you have access to the extra registers, which makes it _exactly_ like Protected Mode on the 386-P4/Athlons. This means you would not be able to execute 64 bit code (in 32 bit mode) without some type of emulation. And if that code is for IA-64, you are just plain screwed on the *hammer, because I’m 99.9% sure they are not binary-compatible chips.
Now about RISC and x86. The 386 up to even the newer chips like the Athlons and P4s have always been considered CISC processors. I’m not sure about Intel chips, but I know that AMD made an effort to reduce the number of instructions. The way they did that was by having a RISC-like subset of instructions, and then all the others. If an app could get away with using just the “RISC” subset it would take fewer cycles; a non-“RISC” instruction would be punished.
Example:
dec ecx
jnz label
would be the “RISC” translation of:
loop label
Raptor, he’s referring to the way Intel maps the CISC instruction set to RISC instructions internally on the latest Pentiums. I’m guessing there is no way to bypass the CISC instruction decoder. The existence of a uniform instruction format for the RISC microcode is extremely unlikely.
lipstick lesbian, the Itanium has a LOT more going for it than just VLIW (even though the VLIW in Itanium is a complete waste unless you’re using a compiler that optimizes for explicit parallelism). I agree the Itanium sucks, for different reasons, but Transmeta has a looooong way to go to compete with the raw power of the Itanium.
Compaq have dropped the Alpha for one simple reason: they understand making boxes, not CPUs.
For all of DEC’s marketing stupidity, they understood how to build CPUs. Compaq bought DEC for their box production, and things like Alpha were an unwanted bonus.
As soon as Compaq was given the opportunity to get rid of the things it just couldn’t cope with, it did.
As far as exposing the RISC core of x86 CPUs goes, it’s not as easy as you’d like. Remember that CISC processors used to have little ROMs (microcode) that actually controlled the internal workings of the CPU, with even more basic instructions than what you find in RISC ISAs.
Calling them RISC cores is a misnomer; they are an application of techniques first developed in RISC CPUs to a CISC CPU, and the internal instructions reflect that.
If you are really interested then try to dig up some docs on the NexGen Nx586, which let programs run in RISC mode. You might even be able to find a motherboard based on one of these chips and do some programming.
How long will it take before the notion settles in (known for years by computer scientists and engineers) that the MHz of your CPU plays only a partial role in the speed of a system?
Yes, Sun has the lead even with slower CPUs, because they know how to optimise ALL their IO and balance it with the CPU speed.
I worked for almost two years on a system which had a 33 MHz CPU and 16Mb of RAM, yet it served 20 people with a 4Gb relational database at reasonable response times. I am talking MINICOMPUTER here. The reason this system could do that was that all IO was handled by separate IO processors instead of the main processor. If the PC architecture allowed something like that, it would provide additional power at much lower MHz rates.
The only reason Intel tries to stay ahead of the pack with increasing complexity is that they want to sell CPUs and cut off competitors. They want to make it possible to use their CPUs for every function in the PC. When did Intel think about multi-media extensions? Philips had proposed an architecture for a special multi-media processor to provide audio and video capabilities besides the CPU in the PC architecture. Intel did not want that, so they introduced MMX. Philips’ idea died a very rapid death.
Yet offloading IO tasks to specialised processors or specifically programmed IO processors takes a whole burden off the central processor, allowing it to run the OS more efficiently.
Jurgen
> Yet, offloading IO tasks to specialised processors or
> specifically programmed IO processors takes a whole
> burden off the central processor, allowing it to run the
> OS more efficiently.
Heh, you’re describing Jay Miner’s concept for the Amiga technology…
-> many different specialised processors = Amiga’s “custom chips” !!
so please please explain this to me: will AMD be at the mercy of Microsoft to release a special build for the x86-64, produced only by AMD????
please please answer the question. it is driving me crazy.
Have you seen the TTA architecture?
i STILL think THIS is what’s gonna cut it!
TTA: http://www.byte.com/art/9502/sec13/art1.htm
TTA: http://einstein.et.tudelft.nl/~heco/move/move-project/
The MS Win32 SDK docs now include constants for x86-64 CPU types when querying the system version. I bet something is in the pipeline.
See this quote from
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sy…
”
wProcessorArchitecture
Specifies the system’s processor architecture. This value can be one of the following values:
PROCESSOR_ARCHITECTURE_UNKNOWN
PROCESSOR_ARCHITECTURE_INTEL
Windows NT 3.51: PROCESSOR_ARCHITECTURE_MIPS
Windows NT 4.0 and earlier: PROCESSOR_ARCHITECTURE_ALPHA
Windows NT 4.0 and earlier: PROCESSOR_ARCHITECTURE_PPC
64-bit Windows: PROCESSOR_ARCHITECTURE_IA64
64-bit Windows: PROCESSOR_ARCHITECTURE_IA32_ON_WIN64
64-bit Windows: PROCESSOR_ARCHITECTURE_AMD64
”
What’s curious is the IA32_ON_WIN64. Wonder what that means.
P
Maybe it’s used by the Windows-on-Windows system. This would need to have an IA32 entry point but bind to the WIN64 API.
This is nasty actually: you’d need to translate stack frames, which would be non-trivial if you pass pointers to structures that contain pointers, e.g.
// Easy, no translation required
typedef struct
{
    DWORD dwVal;
} APISTRUCT3, *LPAPISTRUCT3;
// Need to translate lpData
typedef struct
{
    DWORD dwVal;
    LPVOID lpData;
} APISTRUCT2, *LPAPISTRUCT2;
// Need to build a 64 bit version of Struct
typedef struct
{
    LPAPISTRUCT2 Struct;
} APISTRUCT1, *LPAPISTRUCT1;
APISTRUCT1 ApiStruct1;
APISTRUCT2 ApiStruct2;
No, if you pass APISTRUCT3 to a Windows API call, no translation is required; you just zero-extend the pointer on the stack.
If you pass APISTRUCT1 it’s a real pain: you’d need to copy the APISTRUCT2 structure it points to into a temporary buffer (translating the lpData pointer as you go), change the pointer to point there, and then do the reverse after the API call returns.
Plus you’d need to be able to switch to 64 bit mode and back, which may take time.
Given the name and place in the list, I would guess that this applies to Itanium instead of Hammer. Note that IA-32 is Intel marchitecture for what everyone else calls x86. Also, 32-bit mode on Hammer was designed specifically to run 32-bit x86 Windows programs without any modifications.
BTW, that also means that you can’t run 64-bit programs on 32-bit Windows because the extra bits and extra registers just aren’t visible unless the OS switches to 64-bit mode. Although it would be possible to create a 32-bit version of Windows that does this, there would be no point.
Digital and Compaq have never been able to figure out how to make Alpha popular enough to be profitable. That’s the fundamental reason Compaq isn’t going ahead with EV8. That means they had three choices for what to do with OpenVMS:
1) drop it altogether
2) port it to another platform
3) bleed money until the corporation is insolvent
Surely the first choice was the best for Compaq’s customers.
It also isn’t hard to understand why Compaq chose Itanium. The possible choices were SPARC, MIPS, POWER and Itanium. Neither SPARC nor MIPS offers exceptional performance on the type of applications Alpha is good at. POWER would be an option, but IBM is a competitor and Compaq already has a working relationship with Intel. Besides, with the Alpha engineers working for Intel, future versions of Itanium are bound to have better performance.
There are two reasons Transmeta doesn’t expose the underlying VLIW engine in Crusoe. First, they don’t have to convince anyone to port to Crusoe. Second, they can change the underlying architecture and code generator without having to recompile applications. It also allows them to optimize the architecture for emulation / interpretation of x86 code instead of as a general purpose processor.
If Transmeta can survive and prosper they will eventually adopt x86-64 because that architecture is a lot more friendly for translation into RISC-like code than x86. There are enough registers to avoid storing temporary values on the stack and they are large enough to avoid most multi-precision arithmetic.
Whether they ever expose the underlying engine will depend on their experience. If they find that a particular architecture works well (so they aren’t changing it all the time), that there is a fundamental cost to x86 translation, and that Crusoe becomes popular enough, then they might. However, I doubt that will happen. Since they have made a virtue out of a problem (x86 compatibility), there just isn’t much upside to exposing the engine.
Oops, I meant “the second choice”, not the first. Sigh.
Several stages in the Athlon pipeline are spent decoding x86 into RISC-like micro-ops; that way the Athlon can run legacy x86 code.
there are like 50+ million Athlons out there.
wouldn’t it be faster to have a compiler that exposes the underlying Athlon RISC engine, cutting several stages off the pipeline and hence reducing mispredict penalties?
in other words, the underlying Athlon RISC engine has many registers and a shorter pipeline, so why not have a compiler produce code in micro-ops instead of x86?
it should allow the Athlon to run cooler and faster.
I liked the Russians’ approach to this problem. They have a fast VLIW core, but not exposing it means you have to perform code translation, and this slows you down. However, if you do expose the internals you are left with an architecture which is very hard to change, as changes will break compatibility (this is why it is hard to make big changes to Itanium).
The Russians have a very clever way of solving this, which is to break the problem in two. You ship a binary based on instruction set A (a hardware-independent list of instructions), then you recompile this into instruction set B, which is specific to the CPU. You only recompile once, so it could be done at install time, but you end up with a binary optimised for the specific CPU and no need for so much complexity in the hardware – so it’s faster as well.
As for Transmeta using x86-64, they already have a license:
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~1…
In other news, in a timely announcement, SGI backed up my story 🙂
http://www.theregister.co.uk/content/53/24039.html
P.S. There is one correction to the original article:
it was Nick Tredennick of Microprocessor Report who thought up “Itanic”, not The Register.
The idea of distributing intermediate-code files that are compiled on installation isn’t new. I came up with this idea in 1992 and then discovered it was already old-hat.
The problem with this approach is that it assumes that the source code is platform independent. It is amazing how much source code isn’t. The elegance of the Transmeta approach is that it detects the platform dependencies and emulates around them.
As for Transmeta and x86-64: licensing x86-64 technology isn’t the same thing as delivering it, but I expect that they will for the reasons given.
The Athlon engine is purely internal and there is no reason for AMD to expose it even if they could. The big selling point for Athlon is compatibility with legacy applications. If AMD exposed the engine they would have a RISC chip, but one that isn’t compatible with any existing processor architecture and which doesn’t have any software. Take it from someone who’s been there / done that, there are better ways to throw away money.
You also need to understand that the real problem with decoding multiple x86 instructions is to identify the start of each instruction. That’s hard because you have to (partially) decode each instruction to identify the one after it. (This is one of the primary advantages of fixed-size instructions in architectures like PowerPC or Alpha.) Once you’ve found the instructions, decoding them is comparatively easy. Athlon “cheats” by remembering where the start of each instruction is. This is almost as efficient as remembering the decoded instructions (as Pentium 4 does), but a heck of a lot easier.
One of the reasons I promote this dual approach is that it offers (a) a smooth transition and (b) long-term performance gains.
MDRONLINE estimates a 2% compounded penalty for using the x86 ISA – one that would be remedied by a switch to either RISC or VLIW.
let’s look at Apple as a case example. Apple started on CISC, the 68k line. they then switched to PowerPC, with a 68k emulator. now they are almost 100% native RISC.
continually extending x86 may be challenging. one of x86’s problems is its small register set (8).
let us suppose that AMD allowed a compiler to code directly to its RISC engine.
AMD could claim fast x86 execution. this would be like Apple’s PowerPC emulating 68k. they could also claim speed improvements in certain applications written specifically for the RISC core in micro-ops. this would be like writing native code for Apple’s PowerPC.
at some point, if there is the demand, AMD could offer CPUs that are very powerful and run only the RISC core. this would reduce die size, heat and transistor count, and improve performance.