At the Hot Chips 2010 conference, IBM announced its upcoming z196 CPU, which is really, really fast. How fast? Fastest-chip-in-the-world fast. Intended for Z-series mainframe computers, the z196 has a clock speed of 5.2GHz. Measuring just 512 square millimeters, the z196 is fabricated on 45nm PD SOI technology and contains almost one and a half billion transistors. My… processor is bigger than yours.
Interesting, what happened to Intel’s 9.x GHz (overclocked) from … last year (?)
they burnt out trying to keep up with Crysis…
…is that it can heat a small apartment on its own in the winter.
You don’t need a 5.2 GHz z196 to do that.
I’ve got a 17.7 MHz 7490 (S/390 Microprocessor Complex, in this case, a Microchannel card inside of a PC Server 500, other versions are PCI) that does that just fine at heating my apartment. (Actually, it’s the ridiculous RAID array and horrendously inefficient PSUs that are responsible for most of the heating.)
And a zEnterprise could heat a HOUSE, not just an apartment.
With the rise of cloud computing, I wonder whether we’ll see a massive resurgence in mainframes and big iron boxes? The great thing about these massive mainframes is not only their grunty processors but also the massive bandwidth for moving huge amounts of data around, and their rock-solid reliability.
I doubt it, I think it’s cheaper, more flexible and makes more sense to use a cluster of small processing nodes, rather than a single large system. I think the reason why there is still a market for large mainframes is due to legacy software that cannot function in a distributed fashion.
Actually it’s probably the other way around.
The mainframes don’t need an interconnect (or it’s built in and fast, however you look at it). Thus they would probably be very good at pushing data around.
IBM mainframe sales actually are going up, part of the reason is because people don’t want to rely on Sun anymore since it was bought by Oracle.
I hear a mainframe is also very good at running many Linux machines. It uses less power than a cluster of machines and is fully redundant as well.
It’s just that not everyone can pay a million to run a ‘few’ Linux servers.
Not really. The whole point of cloud computing is a lack of centralization, and centralization is exactly what a mainframe implies.
Mainframes are good for mission critical processing, and IBM has a bit of a monopoly in certain transactional markets.
Actually, Cloud Computing does not imply how the hardware is designed (centralized or otherwise). Cloud Computing implies that resources are virtual and can exist anywhere.
And Mainframes (especially System Z) are very well prepared for that… System Z and its predecessors have had virtualization built into them for decades…
You put in 2 System Z’s (the second one for geographic redundancy), connect them together, and if one fails, the other can pick up the load as if nothing happened…
Huh? Cloud computing has everything to do with decentralization of resources! That is the whole point.
“IBM mainframe sales actually are going up, part of the reason is because people don’t want to rely on Sun anymore since it was bought by Oracle.”
This has nothing to do with satanism, does it?
Who are the biggest cloud companies now, and how are they doing it?
Google
Amazon
Microsoft
Facebook
Twitter
I believe all of them are using the multiple-cheap-servers route. They aren’t all forthcoming with their methodology or the reasoning behind their architecture, but that leads me to think that mainframes don’t look like they’ll have much of a future in cloud computing.
Of course, I trust my own predictions in technology less than I do the weatherman’s daily forecasts. But if I had to make a bet, I wouldn’t bet on clouds in big iron.
Doubtful. The cost is very high. That’s part of why companies like, say, Google, don’t even consider them for their grunt work. If I were coming up with something scalable today, I’d be looking at many cheap boxes, and handling fault tolerance with a proxy layer that caches workloads until they are sent back complete, with as much testing of data correctness as seemed reasonable (for instance, if scaled out enough, send each workload to two machines and verify CRCs on their results before returning them).
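Roughly something like this toy Python sketch (purely my own illustration; the names and the worker functions are made up, not anything a real cloud actually runs):

import zlib

def dispatch(workload, workers, retries=3):
    """Send the same workload to two workers and accept the result only
    when both copies hash to the same CRC32 (the 'proxy layer' idea above,
    sketched very loosely)."""
    for _ in range(retries):
        a, b = workers[0], workers[1]              # two cheap boxes
        result_a, result_b = a(workload), b(workload)
        if zlib.crc32(result_a) == zlib.crc32(result_b):
            return result_a                        # both agree, so trust it
    raise RuntimeError("workers kept disagreeing; re-queue the workload")

# stand-in for whatever the cheap boxes would actually compute
def toy_worker(payload: bytes) -> bytes:
    return payload.upper()

if __name__ == "__main__":
    print(dispatch(b"some workload", [toy_worker, toy_worker]))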
There will be a continued demand for big iron (sometimes you’ll just need throughput…and maybe need to run legacy code that isn’t x86), but I don’t think anyone will be moving towards them in any large numbers. Atom and Bobcat derived servers are much more the future, even if you treat cloud computing as a fad.
Atom and Bobcat are the future for non-critical applications out there. IBM’s superb big iron will still be needed where precision is paramount.
Last time I checked, Google does not run its accounting on its cloud systems, because they need precision.
No matter how fast your processor is, you can only go as fast as the slowest path in your system. In most cases the slowest path is RAM. The case where a faster processor makes a big difference is when your data set can fit entirely in the CPU’s cache. However, as these processors are targeted at large database systems, there will be a lot of access to RAM, hence the super-fast 5.2GHz CPU will spend a lot of time just waiting.
I don’t know what tricks IBM engineers did to avoid RAM latency, I’m just speculating on the most obvious bottleneck such a system will have. The more you raise CPU speed, the closer you start approaching the point of diminishing returns. The future is parallel processing, and not just instruction level parallelism, but all the way up the stack to operating systems and user applications.
It’s IBM. I’m sure they planned for memory latency.
Not really, the whole dataset does not have to fit in cache.
Technically, the slowest element in most computing systems is either the disk or the network interface.
Very efficient branch prediction subsystems, and nicely tuned caches can boost the utilization of the processor significantly, without having to “fit the whole dataset in cache.” Things like simultaneous multithreading also help improve resource utilization.
Believe it or not, IBM has excellent architectural teams… so the system will be fairly tuned to keep the processor “utilized.”
Well that depends on the locality of data and your cache size. If you have large cache and good data locality, then this will hide memory latency. This is what IBM did:
“A 4-node system is equipped with 19.5MB of SRAM for L1 private cache, 144 MB for L2 private cache, 576MB of eDRAM for L3 cache, and massive 768MB of eDRAM for a level-4 cache.”
That is a lot of cache over many levels. You can probably fit a large portion of your working data set into all that cache memory.
Where did you get your numbers? 12MB of L1? Whaaaa?
From IBM’s specs:
L1 Instruction Cache = 64KB
L1 Data Cache = 128KB
L2 Shared I+D Level = 1.5 MB
The numbers you quoted for the L1, for example, amply exceed the total transistor budget for the whole chip quoted by IBM.
As I said, things like finely tuned branch predictors make as much of an impact as huge caches. In fact, there start to be diminishing returns for most cache sizes after a few megabytes.
Look at the linked Yahoo News article: http://news.yahoo.com/s/zd/20100824/tc_zd/253938
Why would I need to read a 3rd party article, by a non-technical writer, when I was getting the specs directly from IBM?
Do you guys even comprehend the impossibility of a single chip, on that 45nm process, having those cache sizes?
See the specs from IBM directly, or if you need a 3rd party check wikipedia:
http://en.wikipedia.org/wiki/IBM_z196_(microprocessor)
16 MB of L1 would be idiotic, since it could actually make things like context switches very costly, with so much local data to flush in and out.
Actually, the biggest issue with large L1 is not flush-on-context switch.
The bigger the cache the bigger the index tables (assuming that they are not using direct mapped cache), which in turn, increases the latency.
As a result, L1 caches tend to be small and extremely fast, with bigger and slower caches down the pipeline until you reach the relatively slow main memory.
– Gilboa
You are correct; I was just giving an example of why really large L1 caches are not only useless but become problematic, touching on the fact that these cores are used with SMT, in which case the context in the L1s actually has an effect.
Oh, OK, I didn’t notice the part about SMT.
– Gilboa
First of all, remember that Mainframes are designed to push I/O ops at unparalleled speeds. Their main focus has always been I/O performance, not processing speed. That is why mainframe processors are not used in their supercomputers; they are just not designed for raw calculation. Add to that the fact that these machines are bundled with some fast storage units, and you get ultra-low wait times for data.
I’ve actually seen how these machines perform, Oracle RAC isn’t a contender when comparing the I/O heavy DB operation performance these machines can achieve with DB2.
And what exactly does that have to do with the post pointing out that the previous posters reported the wrong cache sizes?
Yes, you’re right, the specs I quoted do look a bit bogus. I just copied them from the first article I found on the Internet.
This has been discussed a billion times already. Some things just don’t scale well across multiple cores, if they scale at all. As an example, for physics simulations, interactions can become a major nuisance when you have parallel processing in mind (they are a nuisance for all kinds of calculations, anyway).
So along with the current trends towards hundreds of low-performance processor cores, making individual cores faster is still a good thing for some problems, as long as the bus bottleneck and some relativistic issues concerning the size of electronic circuitry can be worked around ^^
What doesn’t scale well across multiple cores? Give me a few examples. The signals fired by the human brain are pretty slow compared to what computers can do; however, the brain does massive amounts of processing, because everything is wired in parallel with billions of connections.
If you can solve the parallel problem first, getting the individual processing units running at a faster rate will be the trivial task.
A first issue related to multicore is that if the input of task N in an algorithm depends on the output of task N-1, you’re screwed. This prevents many nice optimizations from being applied.
A purely mathematical example: prime factorization of integers from 1 to 10000.
First algorithm that comes to mind is…
For N from 1 to 10000
..For I from 1 to N
….If I divides N then
……Store I in divisors of N
This algorithm can be scaled across multiple cores quite easily (just split the first for loop). But in order to waste a lot less processing power when N grows large, we may be tempted to use this variation of the algorithm…
For N from 1 to 10000
..For I from N-1 down to 1
….If I divides N then break
..Add I to divisors of N
..Add divisors of I to divisors of N
…which can’t be scaled across multiple cores because it relies on the order in which the Ns are enumerated!
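To make that concrete, here is a rough Python sketch of both approaches (my own illustration, not exactly the pseudocode above): the first version farms independent Ns out to a process pool, while the second must walk the Ns in order because each result reuses an earlier one.

from multiprocessing import Pool

def factorize_independent(n):
    """Version 1: trial division from scratch for each N.
    Every N is independent, so the range 1..10000 can be split
    across cores in any order (embarrassingly parallel)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def factorize_memoized(limit):
    """Version 2: the factorization of N reuses the factorization of a
    smaller number (N divided by its smallest prime factor), so results
    must be produced in increasing order of N -- the dependency described
    above."""
    factors = {1: []}
    for n in range(2, limit + 1):
        d = 2
        while n % d != 0:
            d += 1
        factors[n] = [d] + factors[n // d]   # depends on an earlier result
    return factors

if __name__ == "__main__":
    with Pool() as pool:                       # version 1 parallelizes trivially
        parallel = pool.map(factorize_independent, range(1, 10001))
    sequential = factorize_memoized(10000)     # version 2 must run in order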
You just narrowed down to a basic leaf algorithm. I was talking about larger problems, i.e. where each task can be broken down into smaller sub-tasks, and then perhaps those sub-tasks can be broken down into smaller units.
Sure, there are some basic algorithms that are difficult to parallelise; however, the world is full of problems that can be broken down into smaller units.
You’re right, that should work for problems where several power-hungry and independent tasks occur simultaneously, like gaming (where you simultaneously handle graphics, sound, AI, physics, and more), as the Cell’s success illustrates.
The brain is not very good at precise calculation of physical processes (and some of those calculations are not friendly to multicore).
Then how can one shoot a basketball? The brain is very well adapted at simulating physics.
So well adapted that for a long time it believed the Sun revolved around the Earth…
Before science was widespread, that seemed like a pretty good idea. There was no reason to believe otherwise. It’s not like universal gravitation is obvious – we only ever see things falling toward the Earth.
And technically, space isn’t absolute. The Sun really is orbiting the Earth just as much as the Earth is orbiting the Sun. We only consider the Sun to be the center because it is bigger.
No, the Earth is orbiting the Sun, and not at all the other way around. This is because the barycenter of the Earth-Sun system is contained within the Sun. In order to say the Sun orbits the Earth, the barycenter would have to be located within the Earth.
If the barycenter was at a point that was between the two bodies, but not within either, it would be a binary system. This is the situation with Pluto. Instead of having a moon, Charon really makes Pluto-Charon a binary system, as their barycenter is external to either body.
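A quick back-of-the-envelope check in Python, using rough textbook masses and separations (approximate values only):

def barycenter_offset(m_primary, m_secondary, separation):
    # distance of the two-body barycenter from the centre of the primary
    return separation * m_secondary / (m_primary + m_secondary)

# approximate masses (kg) and separations (m)
SUN, EARTH = 1.99e30, 5.97e24
PLUTO, CHARON = 1.31e22, 1.59e21

print(barycenter_offset(SUN, EARTH, 1.5e11))    # ~4.5e5 m: deep inside the Sun (radius ~7e8 m)
print(barycenter_offset(PLUTO, CHARON, 2.0e7))  # ~2.1e6 m: outside Pluto (radius ~1.2e6 m)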
from experience?
imprecisely. (only to the precision needed)
Also, the game is designed/evolved around the players’ abilities and the demands of entertainment (though I don’t know exactly *how* that influences the mind’s ability to make the musculoskeletal system, etc., shoot baskets).
Still, it takes quite a lot of precision. First your brain has to measure the distances to the nearest centimeter using the information from your eyes. Then you have to figure out exactly how much to flex each of the muscles involved. Then your brain has to actually send all that information to your muscles. And your muscles have to follow it accurately. All in less than a second. It only seems simple because it is natural to us.
Try creating a robot that can do that. Even with near perfect physics simulation, you’d have a very hard time.
Mainframes traditionally had relatively slow processors, coupled to memory that was more than able to keep up with it…
It’s low-end machines with faster processors that are severely bottlenecked by memory speeds, because buyers look at the processor speed and don’t consider other aspects of the system…
I would be extremely surprised if IBM hasn’t designed a suitably fast memory system to go with this processor.
Wow, while everyone else went on with scientific reasoning about why RAM is or isn’t fast…
The slowest path of computing at this point in time isn’t RAM, which is probably the second slowest; the slowest is through the hard drive!
RAM keeps getting faster and faster. Hard drive technology really hasn’t changed all that much since the beginning.
It’s sad that it has been the bottleneck in real speed for so long. If only SSD technology were FAR cheaper than it is. Then maybe we could finally start pushing the limits of the PCI(e/x) and memory buses.
Mainframes tend to have terabytes of RAM for just that reason; even my school’s compute servers that students use have 64GB of RAM and 8 cores. Also, wasn’t it just the other week that a group demonstrated sorting and archiving 1TB of data in a minute? http://science.slashdot.org/story/10/07/27/2231219/Data-Sorting-Wor…
I believe there was a graph I once saw that showed a strong correlation between the speed of AI computations and the size of CPU cache. Such computations don’t benefit much from faster access to data on an HD but do benefit greatly from data that can be accessed quickly from cache. What I’m trying to say is 1ms vs 100ms is still slow compared to 5-10ns or less.
No, you are all missing the point.
It is not that a computer is only as fast as its slowest component. The whole point of computer architecture is to make the common case fast. Veeeeery different.
So yeah, boot times may not have progressed that much, since they are constrained by the speed of the disk subsystem, which indeed is quite slow. But booting up is not the common case, is it? Running code is. And for the most part, modern computers tend to utilize their processors rather well; try running a modern game on an old P3. Things like databases are obviously more sensitive to I/O, but the machines used to run them are not necessarily comparable to a modern single-user PC.
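The standard way to put numbers on “make the common case fast” is Amdahl’s law. A quick illustrative calculation in Python (the percentages are made-up examples, not measurements):

def overall_speedup(fraction_affected, component_speedup):
    # Amdahl's law: overall speedup when only part of the workload gets faster
    return 1 / ((1 - fraction_affected) + fraction_affected / component_speedup)

# If disk I/O is only 5% of the time, even an infinitely faster disk buys ~1.05x;
# speeding up the 95% common case by just 2x buys far more.
print(overall_speedup(0.05, 1000))   # ~1.05
print(overall_speedup(0.95, 2))      # ~1.9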
As someone explained on the internet: this CPU is really slow. Two arguments:
The predecessor, the z10 Mainframe CPU, is really slow. A fully configured z10 Mainframe with 64 CPUs gives 28,000 MIPS. This equals 437 MIPS per CPU.
This new IBM z196 CPU the article talks about is 50% faster than the z10 CPU, which means the new z11 Mainframe gives 42,000 MIPS in total with 64 of these z196 CPUs. So this “new million-dollar fastest CPU in the world” z196 gives ~650 MIPS.
(A) If you emulate a Mainframe on an 8-socket Intel Nehalem-EX server, you get 3,200 MIPS:
http://en.wikipedia.org/wiki/TurboHercules#Performance
This equals 400 MIPS per Nehalem-EX. But remember, software emulation is 5-10x slower. In reality, the Nehalem-EX gives 2,000-4,000 MIPS. Hence, you need 10 Nehalem-EX CPUs to reach 40,000 MIPS, on par with the new z11 Mainframe, which gives 42,000 MIPS.
Hence, any modern x86 CPU is 5-10x faster than this z196 CPU. You need 10 Intel Nehalem-EX to match 64 of the IBM z196.
(B) A Linux developer who ported Linux to IBM Mainframes said that 1 MIPS equals 4 MHz of x86:
http://www.mail-archive.com/[email protected]/msg18587.html
This means a 2GHz Intel Nehalem-EX with 8 cores has 16 GHz in total. This equals 4,000 MIPS. Hence you need 10 Nehalem-EX to reach 40,000 MIPS.
We have two independent experts saying that any modern x86 is 5-10x faster than this z196 Mainframe CPU, which runs at 5 GHz and has 376 MB of cache (L1 + L2 + L3). And still, any old x86 with half the Hz and one tenth of the cache is much faster than this z196 CPU that costs a million dollars.
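If you want to check the arithmetic yourself (the MIPS figures and the 5-10x emulation penalty are the assumptions above, not measured numbers), here it is in Python:

# z10: 28,000 MIPS across 64 CPUs (figure quoted above)
z10_per_cpu  = 28_000 / 64          # ~437 MIPS per CPU
z196_per_cpu = z10_per_cpu * 1.5    # "50% faster" -> ~656 MIPS per CPU
z196_machine = z196_per_cpu * 64    # ~42,000 MIPS for a 64-CPU machine

# Nehalem-EX: 400 MIPS under emulation, assumed 5-10x slower than native
native_low, native_high = 400 * 5, 400 * 10     # 2,000-4,000 MIPS per chip
chips_needed = z196_machine / native_high       # ~10 chips to match 64 z196 CPUs
print(z196_per_cpu, z196_machine, chips_needed)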
You should read up on what mainframes are used for, where their strength lies, and the reasons for using them. Otherwise, your popping up and comparing plain CPU power is just ridiculous.
PS: You’ll notice that the “fastest” is the clock frequency here, not MIPS or some other measure.
Yes, because everything you read on the internet is true, isn’t it?
People have been saying desktops are faster than mainframes since the ’80s. But big companies like banks still use mainframes; there’s usually a good reason for that.
It’s somewhat pointless comparing a desktop CPU to a mainframe CPU. The desktop only has a few cores and that’s it. A mainframe has a whole set of different types of cores. It’s pretty much a distributed system of dedicated processors, with the CPUs being only one of many types of core. If you want to compare performance you need to compare against ALL of the cores in the mainframe.
As someone else said, mainframes are really designed for transaction processing with huge levels of I/O. A decked-out z196 goes up to 288GB/second of bandwidth. That’s 8x the memory bandwidth of your 8-core Xeon!
—
As for the cache sizes, they are:
L1
64KB Instructions
128KB data
L2
1.5MB / core
L3
24MB / shared across 4 cores
L4
192MB shared across 24 cores
That’s per “book”; you can add up to 4 books.
…and they are backwards compatible with code written in the 1960s.
More info here: http://www.redbooks.ibm.com/redpieces/pdfs/sg247832.pdf
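For what it’s worth, multiplying those per-core/per-book figures out (assuming 24 cores per book and a fully configured 4-book machine, as listed above) gives roughly the system-wide totals quoted earlier in the thread, which is why they looked impossible for a single chip:

cores_per_book, books = 24, 4
l1_mb = (64 + 128) / 1024 * cores_per_book * books   # ~18 MB of L1 across the whole system
l2_mb = 1.5 * cores_per_book * books                  # 144 MB of L2
l3_mb = 24 * (cores_per_book // 4) * books            # 576 MB of L3 (one 24 MB slice per 4 cores)
l4_mb = 192 * books                                    # 768 MB of L4
print(l1_mb, l2_mb, l3_mb, l4_mb)

So those big numbers describe a maxed-out system, not a single die.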
If the given explanation is wrong, and if you claim this IBM Mainframe z196 CPU is really fast, then where is the error in the reasoning? Please point the errors out. If you claim that the IBM Mainframe CPUs are not slow but fast, please point out the errors so that I understand why you are correct. Or do you mean I should just trust you, just because you work at IBM?
Short recap of the arguments showing why this z196 CPU is slow. This IBM z196 Mainframe CPU gives ~650 MIPS, according to IBM. Let us relate the number 650 MIPS to an x86 CPU:
(A) Intel Nehalem-EX gives 400 MIPS under software emulation. Software emulation is 5-10x slower. If we ported the Mainframe code to x86, the Nehalem-EX could actually execute code worth 2,000-4,000 MIPS. Hence, x86 is much faster.
(B) 1 MIPS equals 4 MHz of x86. An 8-core 2GHz Nehalem-EX has 16 GHz in total. This equals 4,000 MIPS. Hence x86 is much faster.
So where are the errors in (A) and where are the errors in (B)? Please point them out.
Also, you claim that one z196 does not have up to 376 MB of cache? So, not half a GB of cache?
Actually, IBM doesn’t allow benchmarking those machines; they’ll sue you if you do. From experience I can tell that his calculation using simple math is about right. Those machines are ridiculously slow when you look at the price. There are really only two reasons for using those machines.
One is RAS, which Sun, HP and Nehalem-EX can offer as well, at a much lower price point and usually with more performance.
The other is when you’re running old code. Find someone who’ll port your old COBOL applications to something… new. When you’ve found him, ask what he wants for doing it. In that case the mainframe really is cheaper, and IMO those are the only reasons for using these machines.
Sun and HP have been bragging about the poor performance mainframes deliver for years; they definitely did so for a reason.
Bragging alone just doesn’t help against IBM’s sales tactics, which are currently being investigated by the FCC and others, though.
Can I get two of those with a dual socket mini-ATX motherboard? Eat your heart out Maximum PC…