“Intel’s Developer Forum (IDF) for Spring 2006 has a rather obvious focus: the announcement of Intel’s next generation microarchitecture. Intel’s new core will almost certainly leapfrog arch-rival AMD for integer and commercial computing performance. That alone should make this IDF one of the most exciting in the last several years.”
The inclusion of a memory disambiguation mechanism that improves the ability to reorder loads and stores, while preserving the x86 memory-consistency model, is intriguing. The mechanism should improve the performance of highly optimized codes that fit into cache; some commercial video-game codes may fit the bill, and we all know how much commercial video games are used as benchmarks of performance.
Memory disambiguation is indeed a very interesting feature. In fact, it’s perhaps more interesting for how it interacts with the rest of the core than what it brings to the table by itself. Reordering memory operations could be useful, as you say, for highly optimized codes that fit in cache, but its potential extends far beyond that. The memory disambiguator should be useful in most codes that have a substantial number of loads/stores.
One of the primary benefits of memory disambiguation is that it allows you to take better advantage of a more aggressively OOO core. Such a core is better at hiding the effective latencies of caches and memory, but there are diminishing returns on performance as the OOO window is made larger. One of the factors that causes these diminishing returns is the fact that the core often has to make conservative assumptions about the legal execution order to preserve memory consistency. Such assumptions cause the core to stall even when it has plenty of instructions in its window that could legally execute under the optimistic assumption.
NGMA’s alias predictor and attendant pipeline fixup machinery reduce that problem. They allow the core to predict when loads/stores won’t alias (most won’t), and keep executing instructions in such cases. A wrong prediction would cause a pipeline flush, but as the article points out, the predictor is quite accurate. The alias predictor is likely one of the key reasons why Intel was able to achieve a ~17% speedup on integer code over the P-M’s already top-notch per-clock integer performance. It would have been difficult to achieve that result just by making the core wider and deeper.
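To make the predict-then-fix-up idea concrete, here is a minimal toy sketch in Python. It is not Intel’s actual design, whose details have not been published as far as I know: the table size, the saturating-counter scheme, and the names `AliasPredictor` and `run` are all assumptions made up for illustration. Each load gets a small confidence counter; the load is hoisted above earlier stores only when the counter is saturated, and a wrong guess counts as a pipeline flush.

```python
# Toy model only: a per-load saturating counter that guesses whether a load
# can safely be hoisted above earlier stores whose addresses are unresolved.
# Table size and threshold are hypothetical, not taken from any real CPU.

class AliasPredictor:
    def __init__(self, entries=256, max_count=3):
        self.entries = entries
        self.max_count = max_count
        self.table = [0] * entries  # start conservative: no confidence yet

    def _index(self, load_pc):
        return load_pc % self.entries

    def predict_no_alias(self, load_pc):
        # Hoist the load only when confidence is saturated.
        return self.table[self._index(load_pc)] == self.max_count

    def update(self, load_pc, aliased):
        i = self._index(load_pc)
        if aliased:
            self.table[i] = 0  # a real conflict resets confidence
        else:
            self.table[i] = min(self.max_count, self.table[i] + 1)


def run(trace, predictor):
    """trace: list of (load_pc, aliased) pairs. Returns (hoisted, flushes)."""
    hoisted = flushes = 0
    for load_pc, aliased in trace:
        if predictor.predict_no_alias(load_pc):
            hoisted += 1
            if aliased:
                flushes += 1  # wrong guess: flush and replay
        predictor.update(load_pc, aliased)
    return hoisted, flushes


# One load that aliases with an earlier store exactly once in 16 executions.
trace = [(0x40, False)] * 10 + [(0x40, True)] + [(0x40, False)] * 5
hoisted, flushes = run(trace, AliasPredictor())
print(hoisted, flushes)
```

In this toy trace most executions of the load get hoisted and only the single aliasing case pays for a flush, which is the trade the post describes: since most loads don’t alias, occasional replays are cheap compared with always waiting.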
If you read some of the research out there on highly OOO cores, much of it takes good memory disambiguation for granted. The alias predictor is much like the branch predictor in this regard — good ones allow for a much deeper OOO window than would otherwise be possible.
Edited 2006-03-15 19:02
I think that, at this time, it is prudent to be careful about praising Intel’s memory disambiguation mechanism. It occurred to me that I’ve come across very little information about how their mechanism actually works. However, I’d wager that Intel’s memory disambiguation mechanism, regardless of its ability, is likely to be of little benefit to codes where data accessed by the processor is frequently not found in cache, simply because the latency of a not-in-cache memory access is in the hundreds of cycles.
This is true, and to tell the truth there is no beating a good, low-latency integrated memory controller for such codes. However, the nice thing about memory disambiguation is that it allows the OOO core to effectively hide a substantial amount of that latency. Moreover, for codes where cache is effective, the OOO core can usually hide the latencies of L1 and L2, making the average latency of memory operations relatively small.
That said, it’s better to not think about it as “Intel’s memory disambiguation”. The technique is widespread in theoretical work, and Intel surely isn’t the first one to implement it. However, I think Intel deserves some credit for implementing it aggressively in a mainstream processor. They seem not to be too afraid of taking new theoretical techniques and bringing them to market. For example, an Intel guy co-authored the first substantial paper on the trace cache, and Intel brought it to market only about five years afterwards. That’s a level of synergy with the research world that deserves some praise.
Gaming performance does seem to be better. See here…
http://www.anandtech.com/tradeshows/showdoc.aspx?i=2713&p=2
Opteron is the way to go! Even Intel seems to think so: http://theinquirer.net/?article=30325
😉 Before half of you get your panties all up in a knot, realize that it’s a joke. Kthx.
I thought integer wasn’t as important as floating point performance these days?
It goes back and forth, it seems. When the P4 was designed, CPUs still did T&L, so floating-point performance was quite critical. These days, the CPU spends a lot of time shoveling data to the GPU, so the performance of the program logic (integer code) is more important.
I think in the end balance is key. There are many examples of processors that didn’t live up to their potential in one area because they were held back by another. The PPC970 is a great example: way too much FPU and not enough integer horsepower to feed it. The K6 was a good example of the opposite: great integer performance, held back by a non-pipelined FPU.
When Intel outperforms AMD, I’ll believe it when I see it. Until then it’s just hyperbole based on the premise that if you repeat something often enough people will think it’s true.
This thing is crying out for SMT (aka “hyperthreading”) to keep all those execution units busy. With its wider core it’ll have approximately the same number of instruction slots to fill as the higher-clocked Pentium 4.
Maybe, maybe not. SMT isn’t exactly free (10-15% die space, probably more for power), and it’s not a lot of benefit for the target market.
10-15% die space
That’s in terms of actual core die space, i.e. excluding L2, right?
probably more for power
Why? The implementation on the Pentium 4 may have been rather power hungry, but that doesn’t mean that it has to be. Conroe uses a lot of power and clock gating, and I’d have thought that would be applicable to the SMT circuitry too, so you’d only pay for what you actually use.
and it’s not a lot of benefit for the target market.
For dual cores you pay a 100% die space penalty for a 60-90% performance improvement, whereas with SMT you pay 10-15% for something like 20-40%. A single-core Conroe with SMT would provide great value for money.
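The comparison above can be worked through as a quick back-of-the-envelope calculation. This sketch only uses the rough percentages quoted in this thread, not measurements, and the helper name `gain_per_area` is made up for illustration. It also glosses over the question raised earlier of whether the 10-15% figure is relative to the core alone or the whole die.

```python
# Rough performance gained per unit of extra die area, using only the
# ballpark percentages quoted in the thread (not measured data).

def gain_per_area(perf_gain_pct, area_cost_pct):
    """Percent performance gained per percent of extra die area."""
    return perf_gain_pct / area_cost_pct

# Dual core: 100% extra area for a 60-90% gain.
dual_core = (gain_per_area(60, 100), gain_per_area(90, 100))

# SMT: 10-15% extra area for a 20-40% gain.
# Worst case pairs the low gain with the high cost; best case the reverse.
smt = (gain_per_area(20, 15), gain_per_area(40, 10))

print("dual core: %.2f to %.2f" % dual_core)
print("SMT:       %.2f to %.2f" % smt)
```

On these numbers, even the pessimistic SMT case returns more performance per unit of area than the optimistic dual-core case, which is the point being argued. It says nothing about power or design effort.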
I think they left out SMT because Conroe already beats the Athlon anyway and they had to get it to market as quickly as possible. SMT can still give them a boost later on.
It is a good bet that SMT will resurface in Intel’s line-up in the near future given that they have a good handle on it — at least as much of a handle as anyone else — and because it is an aftermarket car-part of sorts in the CPU world. At this time, it is possible that the absence of SMT in their current design is due to SMT not being feasible and/or practical, but it is more likely that SMT was simply too much of a pain in the ass, and that implementing it would have pushed the shipping time.