At the AMD press event at Computex, it was revealed that these new processors would have up to 32 cores in total, mirroring the 32-core versions of EPYC. On EPYC, those processors have four active dies, with eight active cores on each die (four for each CCX). EPYC, however, has eight memory channels, while AMD’s X399 platform only supports four. For the first generation this meant that each of the two active dies had two memory channels attached – in second-generation Threadripper this is still the case: the two newly active dies do not have direct memory access.
I feel like the battle for the highest core count at the lowest possible price while still maintaining individual core clock is really the new focus for Intel and AMD. My only hope is that this will spur better and easier parallelisation in software so that we can all benefit from this battle.
Rust is one language working on better parallelism. It’s a pretty fun environment to work in, and it even compiles to WebAssembly.
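For instance, scoped threads in recent Rust (stable since 1.63) let you fan work out across cores without `unsafe` or reference counting – a minimal sketch, with an arbitrary worker count:

```rust
use std::thread;

// Sum a slice in parallel with scoped threads. The borrow checker
// guarantees the workers can't outlive `data`, so no Arc, no unsafe,
// and no copies of the input.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    if data.is_empty() {
        return 0;
    }
    let chunk = (data.len() + workers - 1) / workers; // ceiling division
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    assert_eq!(parallel_sum(&data, 8), 500_500);
    println!("ok");
}
```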
I remember back in the Flash days there was this thing called PixelBender, which was like a mid-level (C-like) shader platform. It ran primarily on CPUs, but was built from the ground up to make parallelism through multiple cores and vectorization easy. It was killed along with Flash, when mobile CPUs took over the world and Adobe decided to get out of the app platform game.
It was easier than GPU based shaders at the time for a couple of reasons (main memory access, more expressive language at the time, etc.). I wonder if we’ll see more of those kinds of efforts again now that high CPU counts with big vector processing units are popular again.
Try Erlang/Elixir or DrRacket…
Does this mean that UIs like Gnome 3, KDE, and Windows 8+ will finally run as well as the UIs and their software did on the 8-bit Atari, Apple II, and C64 computers?
Everything that makes Intel break out in a cold sweat is a win for the consumer.
Intel may be in even worse shape to compete against this than most people suppose, after that horrible parlor trick Intel tried to pull two days ago.
There is pretty good evidence piling up that this “new” 28-core 5 GHz CPU that Intel presented at Computex was in fact a mere Xeon from the existing lineup (likely a Xeon Platinum 8180) with an unlocked multiplier, overclocked to 5 GHz using a 1 HP, 1700 W Hailea HC-1000B chiller to cool it down to 4 °C. And they quietly omitted that from their demonstration.
Anandtech is running a story on this (Tom and Linus too) and quite frankly, it really smells bad: https://www.anandtech.com/show/12907/we-got-a-sneak-peak-on-intels-2…
Hi,
To me, what smells bad is the power consumption numbers that both Intel and AMD are throwing around – the only thing they’re missing is a wide slide-out tray to insert the uncooked pizza.
It doesn’t make any sense from a “performance per watt” perspective. Any software that actually benefits from 28 (or 32) cores must be “very parallelizable” and is therefore likely to get similar performance with twice as many cores at half the frequency and a quarter of the power consumption.
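The back-of-the-envelope reasoning here is that dynamic CPU power scales roughly as P ∝ C·V²·f, and on a given process voltage tracks frequency, so per-core power goes roughly with f³. A sketch of the arithmetic (the scaling law is idealized – real voltage curves flatten out near the minimum – and the core counts are just the ones from the thread):

```rust
// Idealized dynamic-power model: P ∝ cores * f^3, because P ∝ C*V^2*f
// and V scales roughly linearly with f. Halving f cuts per-core power
// to ~1/8, so doubling the core count at half the clock lands near 1/4
// the total power for the same aggregate throughput.
fn relative_power(cores: f64, freq: f64) -> f64 {
    cores * freq.powi(3) // freq is relative to baseline (1.0)
}

fn main() {
    let baseline = relative_power(28.0, 1.0);      // 28 cores, full clock
    let wide_slow = relative_power(56.0, 0.5);     // 56 cores, half clock
    let ratio = wide_slow / baseline;
    assert!((ratio - 0.25).abs() < 1e-9);
    println!("{}", ratio); // 0.25
}
```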
What I really want to see is a dual socket motherboard with a “2 cores at 4.5 GHz and 80W” chip in one socket, and a “64 cores at 1 GHz and 80W” chip in the other socket. With some scheduler and power management tweaks you’d end up with all of the benefits (for both parallelizable and unparallelizable loads) at a fraction of the power consumption.
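The scheduler tweak could be as simple as a placement rule like this toy sketch – the core counts and the policy itself are made up, purely to illustrate the idea:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Pool {
    Fast, // e.g. 2 cores @ 4.5 GHz
    Slow, // e.g. 64 cores @ 1 GHz
}

// Toy placement policy for the hypothetical asymmetric box:
// work that can't spread across cores goes to a fast core while one
// is free; highly threaded work spills onto the sea of slow cores.
fn place(task_threads: usize, fast_cores_free: usize) -> Pool {
    if task_threads == 1 && fast_cores_free > 0 {
        Pool::Fast
    } else {
        Pool::Slow
    }
}

fn main() {
    assert_eq!(place(1, 2), Pool::Fast);  // single-threaded: fast core
    assert_eq!(place(32, 2), Pool::Slow); // parallel job: slow cores
    assert_eq!(place(1, 0), Pool::Slow);  // fast cores busy: spill over
    println!("ok");
}
```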
– Brendan
I mostly agree, except that separating them onto two sockets is arbitrary… a big.LITTLE architecture doesn’t require that at all; in fact, it would be detrimental.
There is reasoning behind the way things are: your 64-core chip would probably be memory-bandwidth starved, and your 2-core 80 W part should really only be about 24–35 W realistically, and would be wasting/under-utilizing whatever memory channels it was connected to.
What you’d really want is 1-2 fast cores per die, and a ton of slow small cores. That way Threadripper would have 4-8 fast cores, and 128-256 lightweight cores.
The drawback to this architecture is that it is hard to program, similar to CELL… and even multithreaded workloads scale better if you have better single-threaded performance.
Hi,
What made Cell hard is that it wasn’t a single/shared physical address space (each SPE had its own private RAM and the PPE had to transfer data between “main RAM” and each SPE’s private RAM). This meant that you couldn’t (e.g.) spawn a few normal threads and let the scheduler worry about it; you had to design software specifically for it.
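As an illustration of the difference (channels standing in for DMA here – this is an analogy, not what the Cell SDK looked like):

```rust
use std::sync::mpsc;
use std::thread;

// Cell-style model: the "SPE" thread only ever sees data that has been
// explicitly copied into its private store. The channel sends stand in
// for the DMA transfers -- that explicit copy in and copy out is
// exactly what you can't avoid, unlike a normal thread that simply
// reads shared memory in place.
fn spe_sum(main_ram: &[u32]) -> u32 {
    let (dma_in, spe_rx) = mpsc::channel::<Vec<u32>>();
    let (spe_tx, dma_out) = mpsc::channel::<u32>();
    let spe = thread::spawn(move || {
        let local_store = spe_rx.recv().unwrap();       // "DMA" into local store
        spe_tx.send(local_store.iter().sum()).unwrap(); // "DMA" result out
    });
    dma_in.send(main_ram.to_vec()).unwrap(); // an explicit copy, not a borrow
    let result = dma_out.recv().unwrap();
    spe.join().unwrap();
    result
}

fn main() {
    assert_eq!(spe_sum(&[1, 2, 3, 4]), 10);
    println!("ok");
}
```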
– Brendan
1) recent big.LITTLE CPUs can use all cores at the same time.
Not only ARM. Esperanto Technologies is working on a heterogeneous system with wide, slow cores paired with narrow, fast cores.
2) SPEs can communicate with each other and read/write any memory without the assistance of PPE. The important rule of CELL programming – don’t touch PPE caches.
Brendan,
I think scalability considerations will ultimately force software to work with non-shared address spaces anyway. Cell processors were likely ahead of their time. I predict that model will be far more prevalent as we keep adding more cores, since shared memory and cache coherency don’t scale well as we continue adding cores. Coherency logic imposes its own latencies. The benefits of having a shared address space are largely nullified with such large numbers of cores. At some point it’s better to optimize memory performance for locality.
We’ll probably see a hybrid approach where cores get divided up into clusters where cores within a cluster share memory but with no shared memory between clusters. Not that normal consumers will care about any of this, such high core counts are mostly targeting industrial/commercial applications.
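That hybrid maps fairly naturally onto today’s abstractions – threads within a cluster share memory, and clusters only exchange small result messages. A toy sketch with made-up cluster sizes:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Toy hybrid: cores inside one cluster share their data via Arc;
// the only cross-cluster traffic is a small result message per cluster.
fn clustered_sum(clusters: u32, cores_per_cluster: u32) -> u32 {
    let (tx, rx) = mpsc::channel();
    for _ in 0..clusters {
        let tx = tx.clone();
        thread::spawn(move || {
            // Memory shared only within this cluster.
            let local: Arc<Vec<u32>> = Arc::new((0..100).collect());
            let workers: Vec<_> = (0..cores_per_cluster)
                .map(|core| {
                    let local = Arc::clone(&local);
                    // Each "core" sums an interleaved subset of the data.
                    thread::spawn(move || {
                        local
                            .iter()
                            .skip(core as usize)
                            .step_by(cores_per_cluster as usize)
                            .sum::<u32>()
                    })
                })
                .collect();
            let cluster_sum: u32 =
                workers.into_iter().map(|w| w.join().unwrap()).sum();
            tx.send(cluster_sum).unwrap(); // the only inter-cluster message
        });
    }
    drop(tx); // close the channel so the receiver loop terminates
    rx.iter().sum()
}

fn main() {
    // 2 clusters x 2 cores; each cluster sums 0..100 => 2 * 4950.
    assert_eq!(clustered_sum(2, 2), 9900);
    println!("ok");
}
```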
Well of course it is. These second-gen high-core-count chips are always crippled Xeons where they cut off the legs for ECC memory and disabled some of the cache.
So, sort of Pentium 3 Coppermine 1.13 GHz all over again?
Pardon the title. I understand that for a single application to really make use of 32 cores, it must be written in a really parallel way. I get that.
But how about just the reality that a typical computer tends to run many applications, and each of those might have many different threads?
The more cores you have, the more you can spread those applications across them. To me, that’s the biggest benefit I’ve noticed as I’ve gone from dual core to quad core: just the smoothness of the performance, and less ‘bogging down’ of the system.
Yamin,
For normal consumer uses, this quickly hits a point of diminishing returns. Four cores is reasonable, but at 8, 16, or 32 you’ll have way more cores than active tasks. Having large numbers of cores is certainly useful for niche tasks like raytracing, but IMHO these CPUs are not intended for consumer use cases.
They make a lot of sense in the data center, though, where many completely independent requests are coming in constantly and we want to service them as quickly as possible. The important question is whether having one server with a high number of cores can be faster/cheaper than having many servers with a lower number of cores. Consider also that in many data centers space is at a premium. On the other hand, in many data centers you are limited in the amount of power per rack, such that having a 1U or 2U is less important than fitting in a fixed power budget. Ultimately all of these factors will determine whether there will be strong demand for these CPUs.