Japan’s Tsubame supercomputer was ranked 29th-fastest in the world in the latest Top 500 ranking, with a speed of 77.48 TFlops (trillion floating-point operations per second) on the industry-standard Linpack benchmark. Why is it so special? It uses NVIDIA GPUs. Tsubame includes hundreds of graphics processors of the same type used in consumer PCs, working alongside CPUs in a mixed environment that some say is a model for future supercomputers serving disciplines like material chemistry.
The cooling of that setup must be a delight.
Well, the GPUs only take something like 4 racks. I assume they're using another 4 racks for the host computers. Eight to ten racks is not bad at all for a system that puts out 77 TFlops.
4 racks? Are you kidding?
“Tsubame itself – once you move past the air-conditioners – is split across several rooms in two floors of the building and is largely made up of rack-mounted Sun x4600 systems. There are 655 of these in all, each of which has 16 AMD Opteron CPU cores inside it, and Clearspeed CSX600 accelerator boards.
The graphics chips are contained in 170 Nvidia Tesla S1070 rack-mount units that have been slotted in between the Sun systems. Each of the 1U Nvidia systems has four GPUs inside, each of which has 240 processing cores for a total of 960 cores per system.”
More info on http://www.pcworld.com/article/155242/.html?tk=rss_news
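For a rough sense of scale, the totals below are just my own multiplication of the numbers quoted above, not figures from the article:

```python
# Totals implied by the quoted specs (my arithmetic, taking the figures at face value).
sun_nodes = 655        # Sun x4600 systems
cores_per_node = 16    # Opteron cores per x4600

tesla_units = 170      # NVIDIA Tesla S1070 1U units
gpus_per_unit = 4      # GPUs per S1070
cores_per_gpu = 240    # streaming-processor cores per GPU

print(sun_nodes * cores_per_node)                    # 10480 Opteron cores
print(tesla_units * gpus_per_unit)                   # 680 GPUs
print(tesla_units * gpus_per_unit * cores_per_gpu)   # 163200 GPU cores
```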
They have 170 Tesla boxes, each a 1U rack unit, and a standard rack is typically 42U.
Simple math says those 170 units would fit in roughly four racks.
I'm not sure they packed them that densely, but I was referring to the GPUs alone, which is indeed a remarkable density of computing.
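For what it's worth, the back-of-the-envelope rack math looks like this (assuming standard 42U racks and counting only the Tesla units, with nothing else in those racks):

```python
import math

# Rack count needed for the Tesla units alone (42U racks assumed).
tesla_units = 170    # 1U each
rack_height_u = 42

racks = tesla_units / rack_height_u
print(racks)              # ~4.05: four racks (168U) come up 2U short of 170
print(math.ceil(racks))   # 5 racks if packed solid with nothing else in them
```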
NVIDIA's GPUs really fall flat on double precision. The NVIDIA Tesla system advertises 1 TFlop SP and 80 GFlops DP. That means 64-bit doubles run at only 8% the speed of 32-bit floats, and 32-bit floats are just about totally worthless for general scientific processing. An 8-core Power Mac two years ago could churn out ~40 GFlops DP, and that was two Intel generations ago…
The new Cell successor from IBM looks far more interesting, but it seems IBM isn't interested in selling it (the same mistake DEC made).
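To make the ratios in the comment above explicit (these are the commenter's quoted figures taken at face value, not independently verified):

```python
# Ratios implied by the quoted throughput figures.
tesla_sp_gflops = 1000.0    # advertised single-precision throughput
tesla_dp_gflops = 80.0      # advertised double-precision throughput
powermac_dp_gflops = 40.0   # claimed for the 8-core Power Mac two years prior

print(tesla_dp_gflops / tesla_sp_gflops)     # 0.08 -> the "8%" figure
print(tesla_dp_gflops / powermac_dp_gflops)  # 2.0  -> quoted Tesla DP is ~2x the quoted Power Mac DP
```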
Please refrain from making blanket statements like “single-precision FP is worthless”; it turns out that for some applications it is enough, which I'm sure is what these machines are being targeted at.
It was my understanding that NVIDIA was bringing native doubles with their latest hardware, but the boards used here were announced in 2007. The vendor ClearSpeed, for its part, says its boards support both single and double precision through their API.
Not all scientific applications require 64-bit floating point support. Some algorithms are resilient enough that 32-bit FP is sufficient, even with the restricted rounding modes. And where you do need 64 bits, you can mix them with 32-bit floats in the parts of the computation where the extra precision isn't necessary, and remember that all that memory bandwidth (141.7 GB/sec) is still available. Your 8% figure only describes raw execution resources; it says nothing about actual application performance, which of course depends entirely on the application. Considering supercomputers are not exactly cheap, they must have taken this into account somewhere.
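As one concrete illustration of mixing precisions, here is a minimal iterative-refinement sketch in NumPy: the expensive solve is done in 32-bit (the fast path on these GPUs), and a few cheap 64-bit residual corrections recover double-precision accuracy. This is a generic textbook technique shown purely for illustration, not something taken from the Tsubame software stack.

```python
import numpy as np

# Mixed-precision iterative refinement (illustrative only):
# solve in float32, then correct the answer with float64 residuals.
def solve_mixed(A, b, iters=5):
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in float64
        dx = np.linalg.solve(A32, r.astype(np.float32))  # cheap float32 correction
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500)) + 500 * np.eye(500)  # well-conditioned test matrix
b = rng.standard_normal(500)
x = solve_mixed(A, b)
print(np.max(np.abs(A @ x - b)))  # residual near float64 round-off
```

In a real code the float32 factorization would be computed once and reused instead of re-solving each iteration; re-solving here just keeps the sketch short.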
True. Cell is still very expensive. IBM can't compete with COTS NVIDIA solutions: since the Cell in the PS3 doesn't support DP, the volumes of the DP-enabled Cell must be quite small.
True, but the majority (if you ask me for a number, I'd say >95%) of scientific applications that require a supercomputer will need 64-bit floating point support.
Be careful: pulling numbers like that out of one's derriere is a bit counterproductive 🙂
Remember that this is a hybrid system. It also has 10,480 Opterons. Parts of the job which require higher precision can use the Opterons, while parts which can get by with lower precision can use the GPUs. Also, as I understand it, GPUs are really great for vector processing but really suck at most everything else. So they can be viewed as one more resource that the supercomputer application programmer has at hand, to be applied at his or her discretion just like any other resource.
Actually, Cell SPEs can do DP, but also with a big hit in performance. The newer IBM PowerXCell boosts the double precision: apparently a single Cell with 8 SPEs could only do 14 GFlops DP, while the PowerXCell can do 102 GFlops with 8 SPEs…
In fact, a supercomputer built out of these PowerXCell chips was (and maybe still is?) the top supercomputer in the world.
http://www.ppcnux.com/?q=node/7144
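Breaking the quoted Cell numbers down per SPE (again, just the figures as stated above, not verified):

```python
# Per-SPE double-precision throughput implied by the quoted figures.
cell_dp_gflops = 14.0         # original Cell, 8 SPEs
powerxcell_dp_gflops = 102.0  # PowerXCell, 8 SPEs
spes = 8

print(cell_dp_gflops / spes)                  # 1.75 GFlops DP per SPE
print(powerxcell_dp_gflops / spes)            # 12.75 GFlops DP per SPE
print(powerxcell_dp_gflops / cell_dp_gflops)  # ~7.3x improvement in DP
```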
What sucks is that Cell should absolutely not be so expensive. Proof: Sony, which claims it takes no loss on hardware costs, sells the PS3 for ~$399, and that's with a hell of a lot more hardware in the box than just the Cell.
IBM wants their 10,000% markup, that’s all.
Interesting to see people agree with this statement.
OK, the thing is this: there are computations that require double precision, and then there are some which don't. Now guess which this machine is made for. Also, the machine doesn't consist of GPUs only. By the way, if you look at the performance of NVIDIA GPUs in the Folding@home project, you'll find that they easily outdo even the Cell. Again, it depends on what you want to do with it. I hope you don't believe the issue escaped them before they built the thing…
Just in case anyone was wondering:
http://tinyurl.com/6998vg