Sun’s latest Niagara and Rock details have reached El Reg, and they confirm that the hardware maker is up to some very ambitious stuff. First off, Sun looks set for the imminent release of its first Niagara II-based servers – the T5120 and T5220 systems. Customers will see 1U and 2U boxes, respectively, each with one of the ‘Niagara II’ or (more formally) UltraSPARC T2 chips. It looks like the eight-core, 64-thread chip will arrive at 1.5GHz.
omfg…
0_o
Browser: Mozilla/4.0 (compatible; MSIE 6.0; Windows 95; PalmSource; Blazer 3.0) 16;160×160
How long before you add one of those beasts to your menagerie, helf?
Where I could see that being utilized in a non-server space currently is with neural networks, if someone really wants to get into them. That, or ray-tracing: POV-Ray, AFAIK, doesn’t currently subdivide a scene amongst threads in tiles, but it can be used to render animations with one thread per frame, and in that case this would be a great machine to play with (rough sketch below). Of course, most people don’t feel the need to do computer animation to that degree at home…
Of course, for other computer geeks, this would be an interesting system (2048 hardware threads?) to run a true microkernel OS in…
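Going back to the one-thread-per-frame rendering idea, here’s roughly what I mean – a minimal POSIX-threads sketch where render_frame() is just a hypothetical placeholder for whatever actually produces a frame; the point is only that N independent frames map naturally onto N hardware threads:

#include <pthread.h>
#include <stdio.h>

#define NFRAMES 64   /* e.g. one per hardware thread on an 8-core/8-thread T2 */

/* Hypothetical placeholder: pretend this invokes the renderer for one frame. */
static void render_frame(long frame)
{
    printf("rendered frame %ld\n", frame);
}

static void *worker(void *arg)
{
    render_frame((long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid[NFRAMES];

    for (long f = 0; f < NFRAMES; f++)
        pthread_create(&tid[f], NULL, worker, (void *)f);
    for (long f = 0; f < NFRAMES; f++)
        pthread_join(tid[f], NULL);
    return 0;
}

Compile with something like gcc -O2 frames.c -lpthread; on Solaris you may need -pthreads instead, depending on the compiler (treat that flag as an assumption).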
Well, I don’t know about Helf, but I’d sell my house for a couple of those ;^)
I’m thinking you’d pretty much have to. Along with your car. Maybe your wife and first-born child too 😉
hell, running a microkernel on that may even get some on-par speed out of the kernel
Browser: Mozilla/4.0 (compatible; MSIE 6.0; Windows 95; PalmSource; Blazer 3.0) 16;160×160
Sorry, but I just have to hit this on the head; why is it that I need to know what browser and platform a person is posting from?
I think it’s some retarded thing built into the OSNews comment system that automatically appends the useragent of mobile devices to comments. While I can understand their desire to keep stats on mobile devices, it is rather pointless to post the useragent on the end of every comment.
Yes, OSNews inserts it. But why? I’ve always wondered too.
Is it as powerful as a 150 HP Case steam tractor with 4000+ pounds of torque at zero RPM? 😛
Fascinating concept, but it will be interesting to see if it gets traction in the market. From what I can tell, the Niagaras haven’t exactly gone gangbusters yet.
As enterprise consolidation projects and virtualisation gain traction, I’d expect this sort of hardware to do the same.
On what basis is that assumption being made? Sure, Sun doesn’t run around shouting every time they make a sale, but I am sure, given the sales volume so far, that things are going well.
It’s going to take a while for Sun to get back on track – they made the first good move, getting rid of Scott, and now their focus is on products and addressing customer needs rather than senselessly bashing Microsoft.
I haven’t been able to find any sales numbers on the T1s, but even in Sun’s blogs I don’t see much about it, nor do I see Sun enthusiasts going crazy about them (they seem primarily focused on all things [Open]Solaris). In the general market, I haven’t seen any deployment, nor have I seen anything in the non-Sun sphere regarding them. Anecdotal, to be sure, but it still seems that it hasn’t given any existing platform (non-Sun or Sun) a run for its money. Yet, at least. It may be an idea ahead of its time.
I’d say it’s a deliberate decision on Sun’s part to let the operating system itself sell the hardware – if they started going on about their hardware they’d risk zapping any possible momentum out of OpenSolaris.
They’ll mention the new hardware but you’ll find that most of the emphasis will be on the operating system and what it brings to the customer as a whole package rather than just singling out the hardware for special attention.
Is that why Sun has this site:
http://cooltools.sunsource.net/
It takes more than the OS and nice hardware to make an application perform well.
Hence the reason I said it’s the ‘whole package’ – it’s more than just hardware and operating system.
Unfortunately, it’s not that simple. I just went through a three-month debacle with a customer who said they were experiencing performance problems with their application on our hardware. Until they brought in an outside consultant who deemed that the hardware we were using was not the issue, they insisted that we jump through hoops to fix their problems. As it turns out, the performance problem was due to a complex security policy they had created without taking into consideration how much of a performance hit it would cause.
While Sun sells products and services and can assist in these cases, it is also up to the people who write and maintain applications to assist in the performance improvement process. The OS and hardware vendor can only go so far.
While I certainly can’t vouch for the total sales numbers of the Sun systems with the Niagara processors in them, I can say that my company has 120 Sun T2000 servers in production right now, each of them with an eight-core Niagara processor in it. I don’t know how many T2000s we have that aren’t customer-facing, just the 120 in our datacenters.
They run pretty darned well too!
I read about it in one of this month’s Linux magazines at a local Barnes & Noble. If anyone can find it, there was an interesting article about the potential power of stream processing with GPUs. You have to use special compilers to optimize the code, but the benefits were in the 2x and 3x range with a single GPU. With a good computer and two graphics cards, you may not have to mortgage the house to get good computing power.
Since it’s still probably not very fast per thread (the 64-thread one, that is – or any of them, really – compared to anything else similarly specced or with as many sockets), I don’t care very much that each core can run multiple threads if each of them runs very slowly anyway.
This is the result of running sysbench on MySQL on a “try and buy” T2000 I had for testing. This is without the benefit of using Sun’s CoolTools or the tweaks mentioned by Luojia Chen in this article:
http://developers.sun.com/solaris/articles/mysql_perf_tune.html
sysbench v0.4.4: multi-threaded system evaluation benchmark
No DB drivers specified, using mysql
WARNING: Preparing of “BEGIN” is unsupported, using emulation
(last message repeated 39 times)
Running the test with following options:
Number of threads: 40
Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using “BEGIN” for starting transactions
Maximum number of requests for OLTP test is limited to 10000
Threads started!
Done.
OLTP test statistics:
queries performed:
read: 140196
write: 50070
other: 20028
total: 210294
transactions: 10014 (404.07 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 190266 (7677.38 per sec.)
other operations: 20028 (808.14 per sec.)
Test execution summary:
total time: 24.7827s
total number of events: 10014
total time taken by event execution: 981.9074
per-request statistics:
min: 0.0402s
avg: 0.0981s
max: 1.1433s
approx. 95 percentile: 0.1733s
Threads fairness:
events (avg/stddev): 250.3500/1.53
execution time (avg/stddev): 24.5477/0.04
This is with MySQL on the root disk (mirrored) and MySQL compiled with --prefix=/usr/local/mysql(version) --enable-thread-safe-client and sysbench compiled with defaults using gcc 3.4.3 that ships with Solaris 10.
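For anyone who wants to try something similar: the numbers above came from a stock sysbench 0.4 OLTP run. From memory the invocation was roughly the following – treat the exact flags as an assumption and double-check against sysbench --help:

sysbench --test=oltp --mysql-user=root --mysql-db=sbtest prepare
sysbench --test=oltp --mysql-user=root --mysql-db=sbtest --num-threads=40 --max-requests=10000 run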
Do I think the performance could be better? Yeah, I could probably tweak a few more transactions out of it. The bottom line is the CPU wasn’t even breathing hard to produce these results; sar -u data from the test is below (columns: time, %usr, %sys, %wio, %idle):
17:05:00 21 9 0 70
17:10:00 25 9 0 67
17:15:00 25 9 0 67
17:20:00 26 8 0 65
My testing got cut short by an outside event, but in any case the results proved to me that the UltraSPARC T1 is a serious processor. In the buy we made for updating one of our networks I asked for two T1000s for web and database servers. In any future purchases we make, these machines will be at the top of our list for consideration.
Do I think the T2 CPUs will perform even better? You bet! Just because they don’t run at 3+ GHz doesn’t mean they don’t perform well.
I don’t know how that compares to anything else, but the size is impressive!
Why wouldn’t you be impressed by many-but-slow threads? The answer is simple:
Every CPU will have cache misses, and therefore spends a lot on cache logic and increased die size, etc. Studies by Intel show that a normal server CPU sits idle around 60% of the time even under full load, because of cache misses.
Sun deals with this ancient problem in another way in its new Niagara CPUs. A thread runs until it hits a cache miss, and then the core immediately switches to another thread in ONE clock cycle, which a normal CPU cannot do. Therefore a Niagara idles around 5% or less. Therefore the Niagara at 1.4GHz easily outperforms dual Opterons at 2.5GHz in threaded apps, like web servers etc. For a few threads, the Niagara sucks. But with many threads, it excels. Which benchmarks and reports confirm.
The Niagara has generated quite a few sales, and generates lots of money for Sun.
This machine, however, has 64 threads per core. I fail to see how SMT can improve the situation much here.
I mean, you can already keep Niagaras busy with the current number of threads. With 64, most of the threads are going to be waiting for a cache miss from the other threads.
That 64 thread hardware is being released in the future… just like increasingly threaded software is being deployed in the future?
OpenSolaris Xen development is active. One might expect a focus on virtualization when 2048 threads ship
Hopefully the new hardware will support Logical Domains (LDoms) like the UltraSPARC T1’s do now:
http://www.sun.com/bigadmin/hubs/ldoms/
Niagara II has 8 cores and 8 threads per core. Because the Niagara II core is twice as wide as the Niagara core, they doubled the number of threads per core from 4 to 8, maintaining a ratio of 4 threads per integer execution unit.
Sun deals with this ancient problem in another way in its new Niagara CPUs. A thread runs until it hits a cache miss, and then the core immediately switches to another thread in ONE clock cycle, which a normal CPU cannot do.
Is that what Intel’s Hyperthreading chips do? Switch between 2 threads in one clock cycle?
It’s conceptually similar technology, used for the same end effect.
But it is better implemented in Niagara: instead of being an afterthought, it has been Niagara’s modus operandi from day one.
Some questions I have around the new SPARC architecture:
– To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.
– How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.
– With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
– How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?
I don’t know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.
I’m not from Sun, but at least two of your questions are not very useful: the answer to both is ‘it depends’.
>how much will software architecture and design have to change?
Obviously software must be threaded to use this kind of computer efficiently (which is also true for multicore CPUs). Some problems are already ‘embarrassingly parallel’, so no design change is needed; others have to be recoded from scratch.
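For what it’s worth, here is a trivial sketch of the ‘embarrassingly parallel’ case in plain C with POSIX threads (nothing Niagara-specific, just an illustration): each thread sums its own slice of an array with no shared state, so the only design decision is how to split the work across however many hardware threads you have.

#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 64            /* e.g. one per T2 hardware thread */

static double data[N];
static double partial[NTHREADS];

static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;

    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;           /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}

(In real code you would pad the partial[] slots to avoid false sharing, but that is beside the point here.)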
>With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
Well, once memory latency is not the bottleneck anymore (due to thread interleaving), then the next bottleneck is memory bandwidth, or CPU usage, or I/O, etc.
It’s not possible to answer your question in a general way, as it depends on the cache usage of the code: if it has high locality, the CPU becomes the bottleneck; otherwise it’s memory bandwidth.
>I’m not from Sun, but at least two of your questions are not very useful: the answer to both is ‘it depends’.
While I certainly agree that the short answer to both of these questions is “it depends,” I disagree about their usefulness.
The first could have been expressed “How and how much…”; it was intended as an opening to draw out the general guidelines and principles involved (beyond “write concurrently as much as possible”). Creating a large number of concurrent threads might perform well with this many cores but perform very poorly on other architectures (including other Sun platforms).
Those of us who must design code to perform efficiently on all supported platforms, with a minimum of “#ifdef” code, will need to understand how to do this, preferably without having to discover it all independently.
In addition, the cost of mutexes relative to other operations may change, and understanding this matters too; alternate atomic primitives that work better here might be available, e.g., a single “spinlock” instruction that could trigger a thread switch while waiting.
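To make the “relative cost” point concrete, here is the kind of deliberately naive measurement harness I have in mind: a pthread mutex versus a simple test-and-set spinlock built on the GCC __sync atomic builtins (so it assumes a GCC new enough to have them, 4.1 or later; timing calls are omitted). Whether spinning or blocking wins on a chip like this is exactly the guidance I would like from Sun; this is just a sketch, not a recommendation.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static volatile int lock_word = 0;   /* 0 = free, 1 = held */
static long counter = 0;

static void spin_lock(volatile int *l)
{
    while (__sync_lock_test_and_set(l, 1))   /* atomic test-and-set */
        ;                                    /* busy-wait; no backoff */
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);
}

static void *bump_mutex(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mtx);
        counter++;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void *bump_spin(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock_word);
        counter++;
        spin_unlock(&lock_word);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[8];

    /* time each phase separately (gethrtime() on Solaris, say); omitted here */
    for (int i = 0; i < 8; i++) pthread_create(&t[i], NULL, bump_mutex, NULL);
    for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
    for (int i = 0; i < 8; i++) pthread_create(&t[i], NULL, bump_spin, NULL);
    for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expect %d)\n", counter, 2 * 8 * 100000);
    return 0;
}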
The second question, on memory scalability, gets to the interface between cache and TLB registers, since larger memory can result in greater stress on TLB registers and/or TLB miss handling (which could be a TLB/cache operation without accessing memory). It is also connected to selection of new threads in case of cache misses; it is better, for example, if the CPU schedules a new thread with a low probability of needing immediate TLB register loads, which would otherwise make the switch take more than a single cycle.
Generally, I/O is not in this scope, other than the need to flush caches to memory before the I/O occurs; I don’t see new issues here (although that could be a blind spot on my part).
These are both the sorts of questions that lead to dissertations as responses. If folks at Sun have already done that research and written the dissertations (or equivalent), that greatly adds to the value of these new CPUs.
They are also the sort of questions I have run into in writing highly concurrent portable Unix software (specifically a main-memory DBMS with very small latency restrictions in soft real time), which is why I thought them worth asking.
For a glimpse of the problems this will cause, read about Azul Vega programming. They have 384 threads today, and Java programs have to be modified in non-trivial ways (e.g. lock striping) to use all those threads. I suspect such optimizations will degrade performance on more conventional systems.
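For anyone who hasn’t run into the term: “lock striping” just means replacing one hot lock with an array of locks selected by hashing the key, so operations on unrelated keys rarely contend. A rough, generic C sketch of the idea (not Azul’s or Sun’s code, just an illustration):

#include <pthread.h>
#include <stdint.h>

#define NSTRIPES  64                   /* ideally >= number of hardware threads */
#define NBUCKETS  (NSTRIPES * 1024)    /* toy shared table */

static pthread_mutex_t stripe[NSTRIPES];
static long bucket_value[NBUCKETS];

static void init_stripes(void)
{
    for (int i = 0; i < NSTRIPES; i++)
        pthread_mutex_init(&stripe[i], NULL);
}

/* Pick a lock from the key instead of one global mutex, so updates to
 * different keys can proceed in parallel on different hardware threads.
 * Because NSTRIPES divides NBUCKETS, two keys that share a bucket always
 * share a stripe, so each bucket is still protected consistently. */
static void update(uint64_t key, long delta)
{
    unsigned s = (unsigned)(key % NSTRIPES);

    pthread_mutex_lock(&stripe[s]);
    bucket_value[key % NBUCKETS] += delta;
    pthread_mutex_unlock(&stripe[s]);
}

static void *hammer(void *arg)
{
    uint64_t base = (uint64_t)(long)arg;

    for (uint64_t i = 0; i < 100000; i++)
        update(base + i * 7, 1);
    return NULL;
}

int main(void)
{
    pthread_t t[8];

    init_stripes();
    for (long i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, hammer, (void *)i);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);
    return 0;
}

As the article about Azul suggests, the tuning cuts both ways: on a box with a handful of cores the extra locks are mostly overhead, which is why such optimizations can degrade performance on more conventional systems.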
Thanks, Wes. There is certainly some overlap here — Sun is suing Azul Systems over IP issues. They have avoided (it appears) one problem, however; by limiting themselves to Java VMs, they avoid all the address space management issues I raised.
That lawsuit was settled some time ago – see http://www.theregister.co.uk/2007/06/20/sun_azul_stock and http://www.theregister.co.uk/2007/06/19/sun_azul_settle
– To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.
That’s where virtualization comes in. You can run many smaller virtual machines with fewer number of threads per OS instance. You can consolidate many boxes into a 1U or 2U server.
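Concretely, with LDoms on today’s T1 systems, carving out a small guest domain looks roughly like this (webguest is just an example name, and the subcommands are quoted from memory of the 1.0 manager – treat them as approximate and check the ldm man page):

ldm add-domain webguest
ldm add-vcpu 8 webguest
ldm add-memory 4G webguest
ldm bind-domain webguest
ldm start-domain webguest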
– How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.
Each core has a shared L1 cache and there is a shared L2 cache for the whole socket. This is on an UltraSPARC T1. Each core has a TLB shared by the threads in the core. Each thread looks like a SPARC CPU to the OS; address spaces are only shared if the MMU partition ID is the same for a TLB entry. Each core runs 4 threads, and threads are switched on a long pipeline stall, like a cache miss or a TLB miss.
– With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
Don’t follow.
– How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?
See above.
I don’t know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.
I work for Sun and on the CMT CPUs and LDoms virtualization technology. We are very open about our CMT processors, so much so that we even open-sourced the design and RTL.
http://www.opensparc.net/
You can download the RTL source code for the T1 processor and get all the specifications at the opensparc page.
It depends on load and type of software being run.
For example, a typical app or web server will scale linearly with the number of available hardware threads. Maya or Word will not. That’s why Niagara is a server CPU and marketed as a throughput processor.
Yes, that’s true. That’s why Niagara has massive memory bandwidth and extreme number of pins to implement that wide bus.
However, never forget the principal idea of this design. CMT and SMT are used not to improve the performance of a single unit of work, but to improve the useful usage of memory. The idea is to run multiple units of work and idle any of them when its data is unavailable. By the next time the stalled thread is dispatched to run, enough time should have passed that the data is surely available. As memory is the slowest part of the system, this design should lead to better overall performance compared to a conservative design with similar memory throughput.
Niagara wasn’t the first piece of hardware built around these ideas. Cray had a barrel-style CPU with 256 threads (SMT) and no cache.
For example, a typical app or web server will scale linearly with the number of available hardware threads.
Only if you have a huge number of concurrent requests and no lock contention, both of which are questionable assumptions.
That’s true. I certainly won’t deny that.
However, as both Niagara and Niagara II have only 8 cores, the whole limitation is more theoretical than practical – in practice, at any given moment there will always be more than 8 threads ready to run.
And truthfully, even 2k+ requests at any given moment seems like next to nothing for a typical web or app server in production, and locking in those systems usually happens on database writes – quite a rare event in those environments.
Isn’t Fortress the language that Guy Steele, one of the inventors of Scheme, is working on? It will be interesting to see how Scheme/Lisp influence Fortress.