In this review we’ve covered several important topics surrounding CPUs with large numbers of cores: power, frequency, and the need to feed the beast. Running a CPU is like the inverse of a diet – you need to put all the data in to get any data out. The more pie that can be fed in, the better the utilization of what you have under the hood.
AMD and Intel take different approaches to this. We have a multi-die solution compared to a monolithic solution. We have core complexes and Infinity Fabric compared to a MoDe-X based mesh. We have unified memory access compared to non-uniform memory access. Both are pushing hard on frequency, and both are battling power consumption. AMD supports ECC and more PCIe lanes, while Intel provides a more complete chipset and specialist AVX-512 instructions. Both are competing in the high-end prosumer and workstation markets, promoting high-throughput multi-tasking scenarios as the key to unlocking the potential of their processors.
As always, AnandTech’s the only review you’ll need, but there’s also the Ars review and the Tom’s Hardware review.
I really want to build a Threadripper machine, even though I just built a very expensive new machine a few months ago (custom watercooling is pricey), and honestly, I have no need for a processor like this – but the little kid in me loves the idea of two dies fused together, providing all this power. Let’s hope this renewed emphasis on high core and thread counts pushes operating system engineers and application developers to make more and better use of all the threads they’re given.
…hopefully you can even compile something without the system melting (like Ryzen does).
More use? No. Better use? Yes. Programs should definitely be thread-agnostic and thus structured (layered) for usage patterns like work-stealing queues, etc.
The problem is that once NUMA enters the picture, it becomes much more difficult to be thread-agnostic. A generic thread pool doesn’t know what memory each work task is going to touch, for example.
dpJudas,
I agree, multithreaded code can quickly reach a point of diminishing returns (and even negative returns). NUMA overcomes those bottlenecks by isolating the different cores from each other’s work, but then obviously not all threads can be equal, and code that assumes they are will be penalized. These are intrinsic limitations that cannot really be fixed in hardware, so personally I think we should be designing operating systems that treat NUMA nodes as clusters instead of scheduling them as ordinary threads. And our software should be programmed to scale in clusters rather than merely with threads.
The benefit of the cluster approach is that software can scale to many more NUMA cores than pure multithreaded software. And without the shared memory constraints across the entire set of threads, we can potentially scale the same software across NUMA nodes or across additional computers on a network.
This is really why I still use MPI for parallel execution even when running on a single node. This approach also has the added benefit of scaling up to small computer clusters without much extra effort.
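For anyone curious, the skeleton of such a code is tiny; something like this (a hypothetical, stripped-down example showing only the MPI boilerplate, not one of my actual codes):

/* mpi_skeleton.c - build with mpicc, run with e.g. mpirun -np 8 ./mpi_skeleton.
   Each rank is a separate process, so on a NUMA box its memory stays local,
   and the same binary also runs unchanged across a small cluster. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* stand-in for this rank's slice of the real work */
    double total = 0.0;

    /* only explicit messages ever cross rank (and hence NUMA/node) boundaries */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", size, total);

    MPI_Finalize();
    return 0;
}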
I mostly write engineering simulation codes though, so I’m pretty sure this does not make sense for entire classes of program.
FortranMan,
Oh cool, I’d really like to learn more about that. I’ve played around with inverse kinematic software and written some code to experiment with fluid dynamics, but nothing sophisticated. I’ve long wanted to try building physical simulations with a GPGPU, though obviously it requires a different approach!
Most of my codes are simulations of experiments; tens to hundreds of thousands of simulations are run of the same setup with slight variations in input to help predict the uncertainties expected in the actual experiment. These are usually done in the design phase of an experimental project.
I have also written programs to approximate solutions to partial differential equations on various grids in parallel and serial, using shared memory, distributed memory, and even hybrid architectures. It is interesting stuff, albeit very challenging.
FortranMan,
Yeah, it sure blows my line of work out of the water; the businesses I get work from don’t offer intellectually interesting or challenging work. You’re lucky!
It seems you want to get the worst of both worlds in order to not get the benefits of NUMA.
You can simply pin threads if you’re that concerned with NUMA latencies. Otherwise let the scheduler/mem controller deal with it.
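On Linux that’s roughly the following (an untested sketch; the core numbers are made up and depend entirely on the machine’s topology, check lscpu or numactl --hardware first):

/* Pin the calling thread to the cores of one NUMA node (assumes Linux/glibc). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical layout: cores 0-7 live on NUMA node 0. */
static void pin_to_node0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    /* From here on the scheduler only runs this thread on node 0, so its
       first-touch allocations land in node-0 memory as well. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}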
tylerdurden,
That’s the problem: the operating system scheduler/memory controller CAN’T deal with it.
If you had 32 cores divided into 8 NUMA clusters, the system would start incurring IPC overhead between NUMA clusters at just 5+ threads. You can keep adding more threads, but the system will be saturated with IO across NUMA boundaries.
To scale well, software must take the NUMA configuration into account. IMHO using separate processes is quite intuitive and allows the OS to effectively manage the NUMA threads. It also gives us the added benefit of distributing the software across computers on a network if we choose to. But obviously you can do it all manually, pinning threads to specific cores and developing your own custom NUMA-aware memory allocators, or you could allow the OS to distribute them by process; either way achieves a similar result. Personally I’d opt for the multiprocess approach, but you can choose whichever way you want.
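For the “do it manually” route, libnuma is the usual tool; roughly something like this (a sketch only, error handling omitted, the node number purely illustrative, link with -lnuma):

/* Keep a worker process and its working set on the same NUMA node. */
#include <numa.h>

#define NODE 0
#define BUF_SIZE (64UL * 1024 * 1024)

int main(void)
{
    if (numa_available() < 0)
        return 1;                      /* kernel without NUMA support */

    numa_run_on_node(NODE);            /* restrict this process to node 0 */
    void *buf = numa_alloc_onnode(BUF_SIZE, NODE);   /* its memory too */
    if (!buf)
        return 1;

    /* ... node-local work on buf goes here ... */

    numa_free(buf, BUF_SIZE);
    return 0;
}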
Sure. But remember, the whole point of NUMA is not to incur IPC overhead.
I think, if I understand you correctly, you’re viewing threads as basically being at the process level. There, sure, message passing makes sense, since you’re not dealing with shared address spaces. But that’s not what NUMA is trying to deal with.
You only have issues with NUMA when you have a very shitty memory mapping, when every core is referencing contents in another chip, but those pathological cases are rare.
tylerdurden,
Not really, you might have 32 threads waiting for a socket operation, but you don’t have much control over which thread will actually receive the operation, and so naive code could very easily end up in a pathological case where the data structure it needs is attached to the wrong NUMA node, which will kill performance.
In naive MT code, it’s very common for all threads to block on shared synchronization primitives, but in a NUMA configuration this produces a pathological case, since all the cores will constantly have to perform IO across NUMA boundaries to check the mutexes/semaphores.
I think that solving the pathological cases will ultimately require MT algorithms that resemble the multi-process design anyway, so IMHO it makes sense just to start there and not have to reinvent the wheel.
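To make that concrete, the difference is roughly the one below (a toy sketch, not real code; assumes Linux with libnuma, link with -lpthread -lnuma):

/* One global lock bounces its cache line across every NUMA node on each
   acquire; a lock (and work queue) per node mostly stays in local cache. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>

#define MAX_NODES 8

/* Naive pattern: every worker on every node fights over this one lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* NUMA-friendly pattern: one lock per node, each guarding node-local work. */
static pthread_mutex_t node_lock[MAX_NODES];

static pthread_mutex_t *my_node_lock(void)
{
    int node = numa_node_of_cpu(sched_getcpu());   /* node of the current CPU */
    return &node_lock[node];
}

int main(void)
{
    for (int i = 0; i < MAX_NODES; i++)
        pthread_mutex_init(&node_lock[i], NULL);

    /* workers would take my_node_lock() instead of global_lock */
    (void)global_lock;
    return 0;
}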
I think the problem is that you’re seeing “Threads” as full processes, not as the fine-grained streams of execution that NUMA deals with.
tylerdurden,
Haha, but there is no problem. It’s not a matter of definition, it’s strictly a matter of the software compromises necessary to make NUMA scale well. Oh well.
NUMA systems have scaled to over 1K cores; I’d say that’s good scalability.
tylerdurden,
The latter link is a good read on the topic.
I think we keep missing each other’s point. I’m referring to NUMA in the traditional architectural sense of the term, which deals with intra-process parallelism. You keep referring to supposed issues/solutions at the inter-process level.
tylerdurden,
Sorry, I believe we’re referring to the same thing, but maybe bringing up the similarity to processes was a step too far without adequately covering the limitations of NUMA first.
Do you agree that code optimized for locality will perform much better on a NUMA system, and that code which isn’t optimized for locality will cause more overhead?
Conventional multithreading on SMP systems treats all threads as identical, but do you agree that on a NUMA system they are not identical, and that performance will degrade when a thread operates on memory physically attached to a different CPU than the one executing it?
Yes, NUMA degrades when a data item is located on a different node. But you’re missing the point of NUMA as a programming model.
The solution you’re proposing seems, literally, like the problem that NUMA was trying to solve originally. Which is why I was getting confused.
tylerdurden,
We can build NUMA systems with hundreds or thousands of cores, but NUMA has not “solved” the scalability issue with regards to hundreds of cores sharing the same memory. Of course NUMA enables existing SMP code patterns to continue working, but we hit a wall and the performance degrades significantly if we use them with very large thread counts.
NUMA only scales at massive thread counts when we give up shared memory and switch to algorithms that enforce locality. In doing so, each NUMA region can operate at full performance independently from all others.
It turns out this is extremely similar to running independent computers in a cluster; however, NUMA does have an advantage in that explicit IO across NUMA boundaries can be done much more efficiently than over a conventional packet-switching network. For example, passing messages directly into memory buffers on the target CPU saves us all the work of encapsulation/fragmenting/exponential backoff/acknowledgement/re-transmission normally involved with network IO. Conventional network clusters don’t hold a candle to NUMA.
I did not imply NUMA solved all the scalability issues. I’m simply saying that NUMA is scalable.
I think we keep diverging because I’m talking about NUMA at the architectural and single-address-space programming level. And if I understand correctly, you keep referring to coarse process-level multithreading.
Perhaps I’m focusing more on HW-NUMA and you on SW-NUMA.
In any case, just like anything else, NUMA works well with certain workloads/applications and is atrocious on others.
Cheers.
tylerdurden,
Cheers to you as well
To me, that just means the OS should open up a way for a process to say “this bunch of threads/tasks/contexts should be clustered together”, and the software can say “these work units are of type X”, and the OS can schedule them appropriately. Something like Erlang’s lightweight processes?
kwan_e,
Sure, you could bundle some threads together, then write code such that those threads avoid sharing memory or synchronization primitives with other bundles, and then make sure network sockets are only accessed by threads in the correct bundle associated with the remote client. This is all great, but it should also sound very familiar! We’ve basically reinvented the “process”.
Pretty sure lightweight processes a la Erlang aren’t processes. Context switching between OS processes is much more expensive than between those lightweight processes.
And also, why not have multiple levels of automated task management? The top level is the process, but why stop at one level? OS-level processes are there for security purposes, and one could argue that putting other responsibilities onto that one abstraction is inefficient.
kwan_e,
I understand why you like this design pattern, and it would be perfectly fine to use MT across the cores within a NUMA region, but unfortunately this pattern doesn’t scale well across NUMA boundaries. This is what I was getting at before: the more you optimize MT code to reduce the overhead on NUMA, the more you end up engineering something that acts like a process.
As you know, NUMA scalability comes by compromising equal access to the global address space and resources. Designing software around NUMA locality is just a suggestion; of course you are welcome to engineer things however you like.
Yes. Something that acts like a process but is not one. Particularly not one which is as heavy. Just because something acts like something else doesn’t mean they’re the same or have the same costs. As I said before, the main reason for processes is security – making sure that processes don’t step on each other’s memory. Not every logical process requires that kind of separation from the others, and thus does not necessarily need to map to an OS process.
And as I also said before, the intermediate level of abstraction should probably be provided by the OS or some architecture aware library. So programs don’t need to reinvent the wheel because it’s already done.
kwan_e,
Take a look at the clone syscall in Linux.
https://linux.die.net/man/2/clone
Creating a new thread or a new process uses very similar kernel paths. The difference is that a new process gets its own address space, while a thread shares the parent’s (the CLONE_VM flag). In terms of keeping NUMA regions separate, using processes is not a hack or a shortcut; a separate address space is exactly the pattern we need to follow to scale well on NUMA.
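A rough sketch of the two calls side by side (using the glibc clone() wrapper, error checks omitted, assumes Linux on x86-64 where stacks grow downward):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int worker(void *arg)
{
    printf("child running with %s\n", (const char *)arg);
    return 0;
}

#define STACK_SIZE (1024 * 1024)

int main(void)
{
    char *stack1 = malloc(STACK_SIZE);
    char *stack2 = malloc(STACK_SIZE);

    /* "thread": CLONE_VM makes the child share the parent's address space */
    clone(worker, stack1 + STACK_SIZE,
          CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, "a shared address space");

    /* "process": same call minus CLONE_VM, so the child gets its own copy */
    clone(worker, stack2 + STACK_SIZE,
          SIGCHLD, "a separate address space");

    while (wait(NULL) > 0) {}    /* reap both children */
    return 0;
}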
Again, I get that you don’t want to agree with me here, but I’d like for you to put forward a specific objection that we can discuss.
This push for parallelization over raw speed is great for certain use cases. For an average desktop user though, you really don’t need more than 4 cores with 2 threads apiece. In fact, outside of certain use cases such as simulations, virtualization, and multimedia editing, it’s pretty serious overkill. Most desktop apps aren’t heavily multi-threaded because it makes no sense to do so. You don’t need a Twitter client or a mail reader to use more than a few threads, and even then it’s I/O parallelization you need, not computational parallelization, and that doesn’t require more cores, just better coding.
For games, yeah, it would be nice if they better utilized system resources, but many of them are already pretty heavily parallelized; they just do most of the work on the GPU. For a case like that where you’re doing most of your processing on the GPU, it doesn’t make sense to use more than a few threads on the CPU. You don’t need to parallelize input processing or network access internally, and just like with regular desktop apps, disk access doesn’t need more cores to be more parallel.
The reality is that most stuff other than games that benefits from parallelization already does a pretty good job of it, especially on non-Windows systems. Modeling, CAD, and simulation software has been good at it for decades now. Virtualization, emulation, and traditional server software has also been pretty good at this for years. Encoding software (both multimedia and conventional data compressors) could be a bit better, but for stream compression there are limits to how much you can parallelize things from a practical perspective, and that’s about the only thing. In fact, I’d argue that the push for more cores over higher speed is a reflection of the demands of modern software (think machine learning), not the other way around.
Aside from all that, while I would love a Threadripper system from a bragging rights perspective, in all seriousness I don’t need one. I upgraded the system that I would put this in from a Xeon E3-1275 v3 to a Ryzen 7 1700 about a month after launch, and that alone was enough that I don’t need any more processing power. The system in question is running anywhere between 10 and 30 VMs, plus 8 BOINC jobs, Syncthing, GlusterFS (both storage and client), and distcc jobs for a half dozen other systems, and despite all that I still have no issue rebuilding the system (I run Gentoo) in a reasonable amount of time while all that is going on. In fact, the only issues I have that aren’t just minor annoyances are all related to AMD not releasing full datasheets for Zen, and thus would be issues with a Threadripper CPU too.