Transitive’s translation software will be used to let software from rival RISC processors run on Intel’s Itanium and Xeon server processors. The partnership is designed to make it easier for customers to scrap competitors’ gear in favor of Intel-based systems. “With this relationship with Intel, Intel is funding development and providing us access to engineers so we can accelerate the development of processor-operating system combinations,” Transitive Chief Executive Bob Wiederhold said in an interview.
Of course, Intel misses the point. Why do I run RISC processors and workstations? Because they draw LESS POWER. Hey INTEL, pull your head out of your butt and make processors where I don’t have to have my own power substation.
http://www.anandtech.com/tradeshows/showdoc.aspx?i=2711&p=4
http://www.anandtech.com/tradeshows/showdoc.aspx?i=2713
[i]Of course, Intel misses the point. Why do I run RISC processors and workstations? Because they draw LESS POWER. Hey INTEL, pull your head out of your butt and make processors where I don’t have to have my own power substation.[/i]
What’s your platform & application and what power savings have you measured?
Of course, Intel misses the point. Why do I run RISC processors and workstations?
This allows companies that have made large investments in the iTanic architecture to justify staying with it for upgrades, because they can migrate their other applications to a virtual or translated environment on new Xeon or iTanic hardware. It saves face for those who don’t want to acknowledge that a choice they made didn’t necessarily live up to the promises of the HP-Intel hardware sellers.
… thinking that Itanium is a RISC processor? I thought that the term RISC was about the instruction set (which is the IS in RISC), but who knows.
Of course, Intel misses the point. Why do I run RISC processors and workstations? Because they draw LESS POWER. Hey INTEL, pull your head out of your butt and make processors where I don’t have to have my own power substation.
What impact does the instruction set have on power consumption? I must say I’m confused…;)
Well, IA64 has some similarities to RISC, but it differs in that it relies on the compiler explicitly exposing instruction-level parallelism to the CPU. A typical RISC CPU will do dynamic reordering of instructions and execute them in parallel to get the best runtime performance; Itanium relies on the compiler grouping the instructions that can execute in parallel. In principle you save runtime overhead and fabrication expense by not having to do dynamic scheduling. They call it “EPIC”, or Explicitly Parallel Instruction Computing. It’s like an evolution of VLIW (Very Long Instruction Word).
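To make the “compiler does the grouping” part a bit more concrete, here’s a toy C fragment (purely illustrative, not real compiler output): the first three statements are independent, so an IA-64 compiler can put them in one instruction group and a wide enough Itanium can issue them in a single cycle; the next three form a dependency chain, so the compiler has to separate them with stops and they execute serially no matter how wide the machine is.
[code]
/* Toy illustration of explicit parallelism at the source level.
 * (Hypothetical example, not actual compiler output.)              */
void example(int *a, int *b, int *c, int x)
{
    /* Independent statements: candidates for a single parallel group. */
    a[0] = x + 1;
    b[0] = x * 2;
    c[0] = x - 3;

    /* Dependency chain: each line needs the previous result, so the
     * compiler must place stops between them and they run one by one. */
    int t = a[0] + b[0];
    t     = t * c[0];
    c[1]  = t;
}
[/code]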
This development is kinda neat: if the Transitive software can produce optimised IA64 code (and maybe optimise further during execution based on runtime feedback), you should be able to get pretty good performance. In principle, you might get better performance for some apps than if you statically compiled them, since the runtime parallelism will be apparent to the optimiser.
if the Transitive software can produce optimised IA64 code
Big if. Compiling into RISC and then translating back again will have lost quite a bit of valuable high-level information for the optimiser. Additionally, the optimisation effort has to be balanced against startup time.
(and maybe optimise further during execution based on runtime feedback)
Does the Itanium provide useful feedback? Or would the code need to be automatically instrumented and profiled? That would be rather difficult and time-consuming.
What the optimiser would really like to know is the delay of each memory access, so that it can schedule instructions accordingly.
Yeah, optimising the unannotated binary code is rather unfortunate. I was planning to do a research project on inserting annotations into compiled binaries which would allow high-level information to be preserved for runtime optimisers. Unfortunately, I didn’t get funding for it. Stuff like Java bytecode is meant to be good for this sort of thing, because it preserves more high-level information, but then you get additional overheads.
In terms of useful feedback, I seem to recall Itanium including quite a sophisticated performance monitoring module. In any case, the optimiser might also want to insert its own instrumentation.
Have you seen the Dynamo dynamic optimiser work? It was originally written for HPPA, but I think it’s been ported to x86. It was actually an *emulator* for HPPA, which JITed hot code paths and optimised them further at runtime. The neat thing about runtime optimisations is that you can try all sorts of things that wouldn’t be possible at compile time. The bad thing is that you may slow things down (Dynamo detects this and just goes back to normal execution, so for decent length runs the hit is minimal).
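In case anyone wants a feel for what “JIT the hot code paths” means in practice, here’s a very rough C sketch of the counter-based trick Dynamo-style systems use. All the helper names (interpret_block, jit_trace, run_trace) are invented placeholders for illustration, not Dynamo’s actual interfaces:
[code]
#include <stdint.h>
#include <stddef.h>

/* Rough sketch of Dynamo-style hot-path detection.  Interpret code
 * normally, count how often each branch target is reached, and once
 * a target gets "hot" build an optimised trace starting there and
 * run that instead of interpreting.  (Helper functions below are
 * hypothetical placeholders.)                                       */

#define HOT_THRESHOLD 50
#define TABLE_SIZE    4096

static unsigned hit_count[TABLE_SIZE];
static void    *trace_cache[TABLE_SIZE];        /* NULL = not compiled yet */

extern uint64_t interpret_block(uint64_t pc);   /* hypothetical interpreter   */
extern void    *jit_trace(uint64_t start_pc);   /* hypothetical trace builder */
extern uint64_t run_trace(void *trace);         /* hypothetical trace runner  */

void execute(uint64_t start_pc)
{
    uint64_t pc = start_pc;
    for (;;) {
        size_t slot = (size_t)(pc >> 2) % TABLE_SIZE;  /* crude hash, collisions ignored */

        if (trace_cache[slot] != NULL) {
            pc = run_trace(trace_cache[slot]);  /* fast path: run the optimised trace */
        } else if (++hit_count[slot] >= HOT_THRESHOLD) {
            trace_cache[slot] = jit_trace(pc);  /* target got hot: compile a trace */
        } else {
            pc = interpret_block(pc);           /* cold code: keep interpreting */
        }
    }
}
[/code]
The real thing is obviously much more involved (trace selection, linking traces together, and the bail-out back to normal execution when the traces aren’t paying off), but that’s the basic shape of it.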
Delay of memory access is probably a little tricky… but OTOH it can probably get cache miss statistics from the hardware performance counters. It can also attempt to insert advanced loads, and remove them if they don’t provide a speedup.
It really could get quite interesting, but doing it on IA64 is going to hurt one’s head. This is all beginning to sound a bit like the dynamic JIT / optimisation firmware used by Transmeta CPUs, though, so it should be possible to do it reasonably well. A lot of people I speak to tend to say “dynamic translation? That’s what Itanium should have used all along!”.
Whilst on the dynamic theme, I’m really surprised that the performance of VLIW/EPIC hasn’t been properly exploited by Java/.NET – given the nature of using a virtual machine, you’d think it would be the ideal environment; that’s the whole reason for wanting to be as far removed from the machine as possible and getting the virtual machine to do all the nasty things.
As for RISC; I don’t quite follow what you mean, but what I can say is that many of the very early RISC processors didn’t use OOE and the other fancy things that RISC processors today use; back then the designs were very academic about what went in and what was left out. RISC vendors are a little more pragmatic in their approach now – it isn’t about RISC vs. CISC, it’s about delivering the best processor for the jobs computers need to undertake, not only now but into the future.
Yes, I agree about the JVM / .NET. It’s an ideal situation for them to do dynamic optimisation: the bytecode has high-level information in it, which can help with optimisation; and the CPU should benefit well from dynamic optimisation, since it doesn’t do dynamic scheduling itself. Transmeta actually did have a Java bytecode interpreter for the Crusoe.
Regarding the RISC vs EPIC. RISC processors (in or out of order) typically handle the task of scheduling concurrent execution themselves. They’ll do this by sometimes inserting nops into the pipe, whether they’re in-order or out of order. But they’ll resolve dependencies between instructions so that operands are available when they need to be.
In VLIW, this is really the compiler’s job: the compiler groups operations into Very Long Instruction Words, containing several instructions that can safely be issued concurrently into the machine’s pipelines. Each cycle, the CPU retrieves an instruction word from memory, splits it into its component instructions and stuffs them all into their respective pipelines. This exposes the number of pipelines to the compiler, so it’s not so good for forwards compatibility. Also, much of the code will be nops: when it’s not possible to fill all the pipes, the compiler has to insert null operations for the idle ones, so code density and cache behaviour suffer.
EPIC is really quite similar, but it uses a better encoding (IA64 also has some really scary architectural features I’m going to avoid talking about here!). Instructions are in 128-bit bundles, each holding 3 instructions and some extra flags. On a higher level, instructions are in “groups” which can execute in parallel. Since groups are larger than bundles, they may cross bundle boundaries. The flags are used to describe to the CPU where group boundaries fall in the instruction stream. As with VLIW, the compiler is still specifying explicitly to the hardware which instructions may execute in parallel on any cycle. However, the CPU is now more responsible for deciding which pipelines to use. This encoding saves on inserting (so many) nops in the code and avoids exposing the microarchitecture to the compiler (so much).
For instance: a small Itanium with few functional units could be built to fetch a bundle at a time, issue everything before any group boundary within that bundle, then issue the rest the next cycle. An Itanium with many more pipelines could fetch several bundles and issue instructions from all of them, as long as they fell within the same group. The compiler makes sure that the CPU can always issue _any number_ of instructions from a group, _in parallel_ – the CPU doesn’t have to find parallelism for itself, but it has the option of not using the full parallelism the compiler has identified if it doesn’t have enough functional units. The idea is that if compilers always make groups as large as possible, re-optimisation won’t be necessary as the CPUs get bigger.
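For the curious, here’s a rough C sketch of what that 128-bit bundle layout looks like if you unpack it by hand – just the 5-bit template plus three 41-bit instruction slots, assuming the usual little-endian byte order. Decoding what the template and slots actually mean is a much bigger job, so treat this as an illustration rather than a reference:
[code]
#include <stdint.h>

/* Rough sketch of unpacking an IA-64 instruction bundle.  A bundle
 * is 128 bits: a 5-bit template followed by three 41-bit slots.
 * The template says which unit types the slots go to and where the
 * stops (group boundaries) fall.                                    */
typedef struct {
    uint8_t  template_field;   /* bits 0..4                    */
    uint64_t slot[3];          /* bits 5..45, 46..86, 87..127  */
} Bundle;

static Bundle unpack_bundle(const uint8_t raw[16])
{
    /* Assemble the 128 bits as two little-endian 64-bit halves. */
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 8; i++) {
        lo |= (uint64_t)raw[i]     << (8 * i);   /* bits 0..63   */
        hi |= (uint64_t)raw[i + 8] << (8 * i);   /* bits 64..127 */
    }

    const uint64_t mask41 = (1ULL << 41) - 1;
    Bundle b;
    b.template_field = (uint8_t)(lo & 0x1F);
    b.slot[0] = (lo >> 5) & mask41;
    b.slot[1] = ((lo >> 46) | (hi << 18)) & mask41;  /* straddles both halves */
    b.slot[2] = (hi >> 23) & mask41;
    return b;
}
[/code]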
The philosophy is a bit RISC-like in the end; making the common case fast, and pushing complexity out of hardware if it can be done as fast in software. But it’s quite different in practice to the CPUs that are typically called “RISC”.
Does that make sense? Sorry if I’ve told you stuff you already know, just trying to explain what I meant earlier.
Hey I thought they were CISC.
AIUI, Intel’s processors haven’t been pure CISC since the Pentium Pro. From that point on, CISC x86 code has been run through a decode layer that converts it into the micro-ops understood by the RISC-ish hardware underneath. The cost of that extra decoding was offset in part by moving the cache off the motherboard and onto the CPU.
So, where are all the comments asking “When will OS X Tiger Server run on Itanium”? I am just fooling around, but I’m still wondering where the “When will OS X Server run on Itanium” comments are.
You’re it, and I’m the first crap follow-up.
Seriously, though. Why would anyone want to run Mac OS X (any kind) on Itanium? I suppose they will be making an XServe Intel-something eventually, but I don’t think they’ll be adding a third architecture into the mix: PowerPC (32/64), x86 (32/64) *and* Itanium.
Granted, an Itanium iMac would be awesome. Sugarcube iceberg. Sweet.
If this does go ahead, will we see a translation layer on Solaris x86 so one can run Solaris SPARC binaries? Given how lazy some companies are about porting their software to Solaris x86 <stares with an angry face at Adobe>, the best thing would be to try and work around the problem.
http://www.transitive.com/customers_sgi_success.htm
It’s available for the Prism platform, so it does work, from MIPS to IA-64. Not that I’ve seen it in action, that is…
– simmoV