The Itanium may not have been much of a commercial success, but it is interesting as a processor architecture because it is different from anything else commonly seen today. It’s like learning a foreign language: It gives you an insight into how others view the world.
The next two weeks will be devoted to an introduction to the Itanium processor architecture, as employed by Win32.
It’s too bad that the Itanium didn’t have enough market share to fund its own R&D over the long term, though I suspect the problems it was trying to solve in software rather than hardware were still more effectively solved in hardware.
Registers are expensive. The instruction bundles give poor utilization of available execution resources, and the code has low instruction density even when optimally bundled.
I think SPARC was ample demonstration that rotating register windows just don’t work in practice. They’re expensive to implement, not only in terms of register silicon real estate, but also because managing spills and context switching adds unnecessary overhead and complexity. Anyone who doubts that should try to follow the register window management code in a kernel and get their head around all the corner cases it introduces.
Alpha aced it with a simple architecture that was easy to implement, to the point that they could do the hand optimization that really enabled massive performance, something that is probably not feasible with more complex architectures.
As much as I hate x86, I’m glad Intel has moved on from Itanium. And hats off to them: they’re doing a great job at every power and performance level.
No idea if it will end up working or not in the long run, but the guys designing the Mill at least figured out a clever way to get most of the benefits of rotating registers in what looks like a pretty orthogonal way with few corner cases.
After reading this (and having read quite a lot about the Mill), the Mill sounds more or less like Itanium with all the design-by-committee craziness removed and a focus on the core strength of the design – high ILP without all the OOO hardware overhead.
I would really like to see it in silicon one day… It may not work, but it sure is ambitious.
http://millcomputing.com/
galvanash,
I know Itanium gets a lot of frowns. It performed poorly on benchmarks. It needed help from the development community to build software for it, which never came. While the Itanium had some impressive capabilities on paper, in the real world it was destined to emulate x86…horribly. And to boot it never reached the economies of scale needed to make it even remotely affordable to hobbyists. So when AMD64 came along with native support for x86-32, Itanium instantly became history.
From a technical perspective I still think rotating registers can be a good idea: once you wrap your head around them, they are a more elegant abstraction for building stack-based programming languages than conventional registers. They could even solve some tricky aliasing problems responsible for bottlenecks on x86.
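To make that concrete, here’s a minimal Python sketch of the rotating idea – a hypothetical simplification of my own, not Itanium’s actual register stack engine semantics: a call just rotates the base of a ring of physical registers so the caller’s outputs become the callee’s inputs, with no copying.

```python
# Toy model of a rotating register file, loosely in the spirit of the
# Itanium register stack. Sizes and semantics are simplified assumptions.

class RotatingRegisters:
    def __init__(self, size=16):
        self.regs = [0] * size   # physical registers, used as a ring
        self.base = 0            # physical offset of the current frame

    def _phys(self, r):
        return (self.base + r) % len(self.regs)

    def read(self, r):
        return self.regs[self._phys(r)]

    def write(self, r, value):
        self.regs[self._phys(r)] = value

    def call(self, outputs):
        # Rotate so the caller's output registers become the callee's
        # inputs starting at r0 -- no copying, just a new base.
        self.base = self._phys(outputs)

    def ret(self, outputs):
        # Rotate back. In real hardware, spills/fills to a backing
        # store happen when the ring wraps around live frames.
        self.base = (self.base - outputs) % len(self.regs)

rf = RotatingRegisters()
rf.write(4, 42)        # caller puts an argument in its r4
rf.call(outputs=4)     # callee now sees that value as its r0
assert rf.read(0) == 42
```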
Anyways thanks for the link, I’ll have to read up on it some time!
I agree, but read up on this part in particular about the Mill:
http://millcomputing.com/docs/belt/
First off, my understanding is admittedly spotty; I don’t really know anything about CPU design, it’s just a hobby interest. That said, the Mill is designed such that the concepts of register windows (to optimize calls/returns) and rotating registers (to create software-pipelined loops) are kind of irrelevant to it, because it doesn’t actually have registers (in the conventional sense) – it gets most of the same advantages that both of those features bring to the table just by virtue of its design.
It is an exposed-pipeline architecture that uses “the belt” as a conceptual model for storing values. There are no program-accessible registers; addressing is instead done temporally. It’s like a stack, but different in that you don’t have to pop things off – you can reach into the stack and grab any item (up to a point). It’s sort of like a stack crossed with a fixed-length FIFO buffer.
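If it helps, here’s a toy Python model of that “stack crossed with a fixed-length FIFO” idea – purely illustrative, and the belt length of 8 is my assumption (real Mill family members vary): results drop at the front, operands are named by how recently they dropped, and old values silently fall off the far end.

```python
from collections import deque

class Belt:
    def __init__(self, length=8):
        # maxlen makes the oldest value fall off the far end automatically
        self.slots = deque(maxlen=length)

    def drop(self, value):
        self.slots.appendleft(value)   # every new result lands at b0

    def __getitem__(self, pos):
        return self.slots[pos]         # b0 is newest, b1 next, ...

belt = Belt()
belt.drop(3)                   # some earlier result -> b0
belt.drop(4)                   # now 4 is b0 and 3 has aged to b1
belt.drop(belt[0] + belt[1])   # "add b0, b1" drops its result at b0
assert belt[0] == 7
```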
Anyway, I say conceptual because it tries very hard to never actually have to store anything. What it does is rewire the bypass network before every issue cycle so that the output latches on the producer functional units are directly wired to the input latches on the consumer ones – i.e. if an add (1 cycle) is followed by a multiply that uses it, in the stage before issue of the multiply its input latch is wired to the add’s output latch. The value isn’t really “stored” at all; values just fall through the bypass network and end up where they are supposed to be. Long-lived values are carried through the bypass network until there is nowhere left to put them, and then they are spilled to scratchpad, but only if you need them again later.
Conventional architectures use named registers as the programming model, and then resort to heroics in order to make things go faster. Most of this magic is not visible to the compiler; it just tries to make optimal use of registers – the rest is up to the hardware.
The Mill is the other way around: it will (if the code can be scheduled as such) always produce an optimal path for data to travel, resulting in the lowest latency and lowest power use. The compiler basically just tries to orchestrate the best use of the bypass network until it can’t anymore, then it resorts to spilling/filling into the scratchpad when it has run out of latches and has no other choice. Even then it has a spill/fill unit that deals with those cases pretty well (3-cycle spill-to-fill latency).
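Here’s a rough Python sketch of that spill/fill decision as I understand it – the names, the belt length, and the simplification of recording the spill at the moment a value would be lost (a real compiler would schedule it while the value is still reachable) are all mine, not the Mill’s actual algorithm:

```python
BELT_LEN = 8                       # assumed belt length; real members vary

def schedule(ops, needed_later):
    """ops: list of (result, [operands]); needed_later: set of value
    names that are used again after this block."""
    belt, code = [], []

    def drop(name):
        belt.insert(0, name)               # every new result lands at b0
        if len(belt) > BELT_LEN:
            victim = belt.pop()            # the oldest value falls off
            if victim in needed_later:
                code.append(f"spill {victim}")  # keep it in scratchpad

    for result, operands in ops:
        for o in operands:
            if o not in belt:              # it fell off the belt earlier
                code.append(f"fill {o}")   # 3-cycle scratchpad fill
                drop(o)                    # a fill drops onto the belt too
        code.append(f"{result} = op({', '.join(operands)})")
        drop(result)
    return code

ops = [("a", ["x"]), ("b", ["a"]), ("c", ["b"])]
print(schedule(ops, needed_later={"x"}))
# ['fill x', 'a = op(x)', 'b = op(a)', 'c = op(b)'] -- no spills needed here
```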
Virtually all CPUs use a bypass network in various forms to make things go faster, but the Mill is the first one I have seen where the bypass network essentially is the programming model instead of the other way around. It has “registers” (in the form of the scratchpad), but you only use them when you can’t schedule optimally.
It seems to me that this makes much more sense than programming to a model that stores things in named locations, only to have the actual hardware running it go to great heroics (OOO/register renaming/register rotation/register windows/etc.) to avoid doing what the software is telling it to do (i.e. avoid copying things around, reorder operations, etc.).
Anyway, it’s a wonderfully interesting ISA, especially if you’re a software guy, because it was in large part designed by a software guy.
galvanash,
While not a problem for custom code in embedded systems, hardcoding latencies is a weakness for general-purpose binaries. Ideally we’d get away from distributing software as hardcoded machine code and instead have a very fast, transparent software layer that could always produce optimal machine code for the current system. This would even allow old software to immediately take advantage of new hardware.
Even superscalar architectures have this problem to an extent. The CPU target the developer’s compiler uses when creating the binary determines how optimal the software is on end users’ CPUs, and in many cases it will be wrong. Optimality suffers when we have hardcoded binaries.
The choice of architecture should not be coerced by the availability of software, which is why I wish more software were distributed in portable formats, like Java/ART/Dalvik/LLVM.
http://millcomputing.com/docs/compiler/
Mill binaries are not distributed with hardcoded latencies… It pretty much works exactly as you just described (although honestly I don’t know how “fast” it actually is).
tl;dr – The Mill compiler produces IR (genASM), and binaries are distributed in IR format. The specializer takes this and produces actual ready-to-run binaries (conASM), doing the bulk of the scheduling in the process.
Only the specializer needs to be concerned about latencies, and specialization is only performed when the binary is actually loaded on the target machine.
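As a toy illustration of why that split works, here’s a hypothetical Python sketch of a load-time specializer: the distributed IR carries no latencies, and the same IR gets a different schedule on each family member just by swapping the latency table. The numbers and names here are made up, not real Mill figures.

```python
# Per-model latency table, looked up on the target machine (assumed values).
LATENCIES = {"add": 1, "mul": 3, "load": 5}

def specialize(ir):
    """ir: list of (dest, op, src_names). Returns (dest, issue_cycle)
    pairs honoring data dependencies under this model's latencies."""
    ready = {}                    # value name -> cycle it becomes available
    schedule = []
    for dest, op, srcs in ir:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        schedule.append((dest, start))
        ready[dest] = start + LATENCIES[op]
    return schedule

ir = [("a", "load", ["p"]),
      ("b", "add",  ["a", "a"]),
      ("c", "mul",  ["b", "b"])]
print(specialize(ir))   # [('a', 0), ('b', 5), ('c', 6)] on this model;
                        # the same IR schedules differently elsewhere
```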
galvanash,
They’ve thought of everything
Seriously, it seems like a cool project with an emphasis on a strong, consistent design. I like that they’re taking interesting new approaches to things. I’m not sure it can ever become mainstream, but I am certainly interested in following it.
Thanks! I admin some HP-UX 11i boxes running on Itanium (obviously)… I think big HP systems are the only “real” market for this architecture.
In practice, the only great advantage of Itanium over other RISC systems like SPARC or POWER is that you can run Windows on it…
For example, if you have an Itanium Superdome, you can run HP-UX, Windows, and Linux natively without adding “special” hardware. Obviously you cannot run Windows on SPARC or POWER… so it’s an interesting feature if you’ve got mission-critical systems running on Windows.
BTW I’ve never seen Windows on Itanium, only HP-UX… in fact running Windows on ultra hyper expensive RISC boxes sounds crazy…
The new Superdomes have Xeons in them, so you can run Windows and Linux on them without the need for Itanium (which, although still officially alive, has not seen an update since 2013).
Yep, sadly there’s no HP-UX for Xeon yet…