“Sun announced Niagara 2 the other day, an evolution of the older Niagara 1, now called the UltraSPARC T1. From the 10000-foot view, it all looks quite familiar, but once you delve into the details, it quickly becomes apparent that almost everything has changed.”
Some benchmarks of Niagara 1:
http://tweakers.net/reviews/657/6
http://tweakers.net/reviews/657/8
Niagara 2 should do much better, but the Niagara CPUs also have great virtualization features; check these:
http://blogs.sun.com/ash/resource/flashdemos/64-ldoms-on-t2.html
http://blogs.sun.com/ash/resource/flashdemos/linux-ldom.html
While I'm a Sun partisan, I can't stand teasers; I hate them immensely. When you announce, you had better be able to ship soon, but I haven't even seen a date. Until the systems are orderable, I could not care less how great the chip is going to be.
They should be announcing revenue release of actual systems within the next couple of weeks, with shipments before the end of the year.
I’ve got a T5220 (the T2000 “replacement”) alpha unit in a lab at work. It’s a pretty fantastic box; we’re seeing at least double the performance of the T1 processor so far.
A major difference from the previous Niagara 1 is that Niagara 2 has an FPU per core; previously all the cores competed for a single shared FPU.
Many people, including myself, who did test evaluations for standard server apps (DNS, SMTP, LDAP) found that the Niagara 1 based systems performed much worse than Sun Opteron systems, because the apps were not sufficiently matched to the Niagara 1 hardware. I wonder if this changes with Niagara 2.
Certainly virtualisation would benefit if each VM instance can have a core plus FPU.
Performance of the T1 processor really depends on the workload you throw at it. Every one of those you listed, we’ve thrown at it, and it has walked away hands down the winner. We’re replacing the majority of our NetBackup infrastructure with T2000s, because they kicked everything else’s rear ends up and down the evaluation track (Dell PE2950 and HP DL585 running either Windows or Linux). In one of our web-based application services, we’re replacing V1280s with T2000s and seeing performance *increases*. For LDAP, we’ve seen 3:1 better performance over a V440 class system.
Haven’t had a chance to play with LDom on the T5220, so I can’t comment on how it changes the performance. However, considering that very few server workloads are actually that FP intensive, it rarely matters.
It sounds so good that it leaves readers wondering what they'll be able to do to evolve this puppy, besides increasing the clock speed (which in practice is likely stuck being tied to the FSB/RAM speeds) and adding more cores (which, again, may not matter much if the system is already I/O pegged).
Perhaps in theory, with enough transistor budget, they might be able to add some out-of-order execution in there, but using in-order cores in such numbers appears far more practical and efficient for server purposes, so I'm wagering they'll not bother; there would go the power efficiency, among other things.
Well, the main differences between the T1 and T2 are more threads per core (not more cores), an FPU per core, 10G ethernet on die, crypto on die.
Coming up next is multi-socket support (codenamed Victoria Falls), which will end up with 128 or 256 threads/system (2 or 4 sockets, 64 threads/socket).
OOX could potentially be added, as could a deeper pipeline, I suppose. They could bring more onto the die as well, such as a larger cache.
Since the UltraSPARC Tx architecture trades latency for bandwidth, like GPUs do, increasing the clock speed doesn't make sense if you are already saturating the memory subsystem. As long as enough bandwidth is available you enjoy almost linear scaling with clock speed, but the scaling completely flattens once you reach the saturation point.
It doesn't make sense to add OoO execution to such an architecture simply because it doesn't need it. The UltraSPARC Tx is inherently optimized for throughput; all the latencies (memory stalls, branches, non-single-cycle instructions, etc.) are covered by switching threads. OoO execution would make the core significantly more complex with little or no benefit for such an architecture. Look, for example, at how 2-way dispatch has been implemented in the T2. A core cannot execute two instructions from one thread; it executes two instructions from two different threads, each picked from one of the two thread groups. This eliminates any need for intra-thread dependency checking in the pick stage, and while it doesn't improve single-thread performance it increases throughput significantly.
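To make the "latencies are covered by switching threads" point concrete, here is a toy sketch (entirely my own model, not Sun's actual pipeline; the miss rate and latency numbers are made up) of why a barrel of hardware threads keeps a simple in-order core busy while a single thread would spend most of its time stalled:

```scala
// Toy model only: the numbers and structure are assumptions, not the T2 pipeline.
object ThroughputToy {
  val MemLatency = 40   // assumed cycles lost on a cache miss
  val MissRate   = 0.2  // assumed fraction of instructions that miss

  // Count instructions retired in `cycles` ticks with `nThreads` hardware threads:
  // each cycle the core picks the first thread that is not stalled.
  def retired(nThreads: Int, cycles: Int): Int = {
    val readyAt = Array.fill(nThreads)(0)   // cycle at which each thread becomes ready again
    val rng     = new scala.util.Random(42)
    var done    = 0
    for (cycle <- 0 until cycles) {
      (0 until nThreads).find(t => readyAt(t) <= cycle).foreach { t =>
        done += 1
        readyAt(t) = if (rng.nextDouble() < MissRate) cycle + MemLatency else cycle + 1
      }
    }
    done
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(1, 4, 8))
      println(s"$n thread(s): ${retired(n, 100000)} instructions retired in 100000 cycles")
}
```

With one thread the core idles through most of every miss; with four or eight, some other thread almost always has work ready, which is the whole bet the Tx design makes.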
As for OOX, Sun is adding a so-called scout thread. It runs ahead of the main execution, analyzing the program and prefetching data.
“As for OOX, Sun is adding a so-called scout thread. It runs ahead of the main execution, analyzing the program and prefetching data.”
No, that’s for the “Rock” processor, Sun’s compute-oriented CPU line (the successor to the SPARC64 CPU being co-developed with Fujitsu), though I’m sure some successful features from the two lines will cross over eventually.
You might think that this is irrelevant for general-purpose computing, but I would not be surprised to see something like this showing up in designs from AMD and Intel in a few years.
Of course, to really use such an architecture for desktop tasks a new approach to multithreading is required. Something like transactional memory (which the Niagara 2 also supports), or functional programming, which avoids the issue of mutable state altogether.
Traditional multithreading primitives like monitors are much too difficult to use.
How on earth would a program completely avoid the issue of “mutable state” altogether? All software does (eventually) is modify state with one bit of code and, sooner or later, do computations based on that state or do something with it; if you have no “mutable state” there’s no point in executing code at all. While that may not be visible to the developer using a functional language, the interpreter/compiler eventually gets down to banging bits and branching based on them.
Maybe functional languages are the next wave of the future, and perhaps the next relatively short-lived fad (like Pascal, etc. for the mainstream) that goes out of style and leaves a fair amount of legacy code behind. After all, it just may be that people on average won’t go for functional languages and the thought patterns they require, just like a lot of people won’t readily grok object-oriented languages.
Is that a rhetorical question? The program is a function that transforms the old state into the new state:
state[t+1]=f(state[t])
This model of computation is Turing complete, yet at no point does f have to modify the old state.
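In code that looks something like this (a minimal sketch in Scala; the State type and the step function f are invented purely for illustration):

```scala
// A toy machine state: a program counter plus an accumulator.
final case class State(pc: Int, acc: Int)

// f builds a new state from the old one; the old state is never modified.
def f(s: State): State = State(s.pc + 1, s.acc + s.pc)

// Running the program is just iterating f; every intermediate state stays valid.
val trace = Iterator.iterate(State(0, 0))(f).take(5).toList
// List(State(0,0), State(1,0), State(2,1), State(3,3), State(4,6))
```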
Of course at some point bits are being modified. The compiler does all kinds of stuff behind the scenes, like register allocation.
That does not mean that a developer should have to worry about that.
Lazy evaluation might be a fad. But the concept of mathematical functions without side effects is much older than computers. It is not likely to go out of style any time soon.
We are talking about people who want to utilize a massively multithreaded CPU, not average computer users writing batch files or excel macros.
The alternatives for writing parallel algorithms are fine-grained locking, transactional memory and functional programming.
The only approach that allows transparent use of massively parallel CPUs is FP. Transactional memory will have better performance in some very rare situations, and fine-grained pessimistic locking is slow and very difficult to get right.
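To illustrate what "transparent" means here (just a sketch; the chunk size and the use of Scala Futures are my own choices, not the only way to do it): because the work below has no side effects, it can be split across threads with no locks and no risk of getting a different answer.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// A pure function: no shared state, no side effects.
def square(x: Int): Long = x.toLong * x

val sequential = (1 to 1000000).map(square).sum

// Split the range into chunks and evaluate them on separate threads.
val chunks   = (1 to 1000000).grouped(250000).toList
val parallel = Await.result(
  Future.sequence(chunks.map(c => Future(c.map(square).sum))).map(_.sum),
  Duration.Inf
)

assert(sequential == parallel)   // same result no matter how the work is scheduled
```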
Functional programming has very nice theoretical properties, and you can make certain assurances that you can't with languages such as C. A nice outcome is that it turned out to be very applicable to massively parallel computation.
The downside is that memory is consumed fairly rapidly (because you don’t write over previous values – you keep creating new memory stores). Though these days memory is cheap and plentiful…
In my experience, this is not the case in general. Most object oriented code contains a lot of “defensive copying”.
For example, an object that holds an internal mutable array may not expose a reference to that array to the outside world, since that would break encapsulation. Instead it must make a copy even if the user of the array is only reading.
In a functional language you can safely expose a reference to all internal data structures if you want the outside world to see them.
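A small Scala sketch of the difference (the class names are made up):

```scala
// With a mutable internal structure, every accessor has to copy defensively.
class MutableScores {
  private val scores = scala.collection.mutable.ArrayBuffer(1, 2, 3)
  def snapshot: Seq[Int] = scores.toVector   // copy, or callers could observe later mutations
}

// With an immutable structure, the internal reference can be handed out as-is.
class ImmutableScores(private val scores: Vector[Int] = Vector(1, 2, 3)) {
  def view: Vector[Int] = scores             // no copy needed; nobody can change it
}
```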
Besides, modern memory allocators/garbage collectors are incredibly fast at allocating and disposing of short-lived objects. Not as fast as stack allocation, but quite close.
There are some situations where some kind of mutable state can result in a significant performance improvement. Manipulation of small areas of large bitmaps might be one example.
But there are ways to do this without violating referential transparency (uniqueness typing or monads). And if you really need mutable state you can always choose a non-pure functional language such as Scala or OCaml that discourages, but allows, mutable state.
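For instance, in Scala the defaults push you toward immutability, but mutation is still there if you insist (a trivial sketch):

```scala
val xs = List(1, 2, 3)      // collections are immutable by default
val ys = 0 :: xs            // "modifying" the list builds a new one; xs is untouched

var total = 0               // mutable state is still available via var...
for (y <- ys) total += y    // ...but it is opt-in rather than the default
```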
How on earth would a program completely avoid the issue of “mutable state” altogether?
A pure functional program has no mutable state at the language level. Logically, objects are never modified — instead, new objects are created in response to computations. Obviously registers are mutated at the implementation level, and memory is “mutated” as it’s reclaimed by the GC and used to store new objects*, but language-level transformations, the kind that can be used to extract parallelism, aren’t restricted by those implementation details.
That said, I don’t know how suitable a threaded architecture is for functional-level parallelism. The cost of starting new threads of computation is probably too much. Functional programs seem more well suited to very wide superscalar designs, perhaps even dataflow machines.
Maybe functional languages are the next wave of the future, and perhaps the next relatively short-lived fad
Functional languages have been around for more than 40 years. Much of the formal theoretical underpinnings of CS are based on functional models of computation (the lambda calculus). They’re not going anywhere.
*) Initializing writes are not considered mutation. It helps to make the distinction between mutation (writes to memory that’s “live”, from the perspective of the GC), and initialization (writes to memory that’s “dead”).
When the source comes out I’m planning on trying to make a consumer-minded form of this, because, frankly, there needs to be a consumer form of SPARC sometime!
Can't wait to see some benchmarks on this monster. Wonder if it will get a port of Windows 🙂