In this review we’ve covered several important topics surrounding CPUs with large numbers of cores: power, frequency, and the need to feed the beast. Running a CPU is like the inverse of a diet – you need to put all the data in to get any data out. The more pie that can be fed in, the better the utilization of what you have under the hood.
AMD and Intel take different approaches to this. We have a multi-die solution compared to a monolithic solution. We have core complexes and Infinity Fabric compared to a MoDe-X based mesh. We have unified memory access compared to non-uniform memory access. Both are pushing hard on frequency, and both are battling power consumption. AMD supports ECC and more PCIe lanes, while Intel provides a more complete chipset and specialist AVX-512 instructions. Both are competing in the high-end prosumer and workstation markets, promoting high-throughput multi-tasking scenarios as the key to unlocking the potential of their processors.
As always, AnandTech’s the only review you’ll need, but there’s also the Ars review and the Tom’s Hardware review.
I really want to build a Threadripper machine, even though I just built a very expensive new machine a few months ago (custom watercooling is pricey), and honestly, I have no need for a processor like this – but the little kid in me loves the idea of two dies fused together, providing all this power. Let’s hope this renewed emphasis on high core and thread counts pushes operating system engineers and application developers to make more and better use of all the threads they’re given.
…hopefully you can even compile something without the system melting (like Ryzen does).
More use? No. Better use? Yes. Programs should definitely be thread-agnostic and thus structured (layered) for usage patterns like work-stealing queues, etc.
The problem is that once NUMA enters the picture, it becomes much more difficult to be thread-agnostic. A generic thread pool doesn’t know what memory each work task is going to touch, for example.
dpJudas,
I agree, multithreaded code can quickly reach a point of diminishing returns (and even negative returns). NUMA overcomes those bottlenecks by isolating the different cores from each other’s work, but then obviously not all threads can be equal, and code that assumes they are will be penalized. These are intrinsic limitations that cannot really be fixed in hardware, so personally I think we should be designing operating systems that treat NUMA nodes as clusters instead of scheduling them as ordinary threads. And our software should be programmed to scale in clusters rather than merely with threads.
The benefit of the cluster approach is that software can scale to many more NUMA cores than pure multithreaded software. And without the shared memory constraints across the entire set of threads, we can potentially scale the same software across NUMA nodes or across additional computers on a network.
This is really why I still use MPI for parallel execution even when running on a single node. This approach also has the added benefit of scaling up to small computer clusters without much extra effort.
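For anyone curious, the skeleton of such a code is tiny; something like this (a hypothetical, stripped-down example showing only the MPI boilerplate, not one of my actual codes):

/* mpi_skeleton.c - build with mpicc, run with e.g. mpirun -np 8 ./mpi_skeleton.
   Each rank is a separate process, so on a NUMA box its memory stays local,
   and the same binary also runs unchanged across a small cluster. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* stand-in for this rank's slice of the real work */
    double total = 0.0;

    /* only explicit messages ever cross rank (and hence NUMA/node) boundaries */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", size, total);

    MPI_Finalize();
    return 0;
}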
I mostly write engineering simulation codes though, so I’m pretty sure this does not make sense for entire classes of program.
FortranMan,
Oh cool, I’d really like to learn more about that. I’ve played around with inverse kinematic software and written some code to experiment with fluid dynamics, but nothing sophisticated. I’ve long wanted to try building physical simulations with a GPGPU, though obviously it requires a different approach!
Most of my codes are simulations of experiments; tens to hundreds of thousands of simulations are run of the same setup with slight variations in input to help predict the uncertainties expected in the actual experiment. These are usually done in the design phase of an experimental project.
I have also written programs to approximate solutions to partial differential equations on various grids in parallel and serial, using shared memory, distributed memory, and even hybrid architectures. It is interesting stuff, albeit very challenging.
FortranMan,
Yeah, it sure blows my line of work out of the water; the businesses I get work from don’t offer intellectually interesting or challenging work. You’re lucky!
It seems you want to get the worst of both worlds in order to not get the benefits of NUMA.
You can simply pin threads if you’re that concerned with NUMA latencies. Otherwise let the scheduler/mem controller deal with it.
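On Linux that’s roughly the following (an untested sketch; the core numbers are made up and depend entirely on the machine’s topology, check lscpu or numactl --hardware first):

/* Pin the calling thread to the cores of one NUMA node (assumes Linux/glibc). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical layout: cores 0-7 live on NUMA node 0. */
static void pin_to_node0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    /* From here on the scheduler only runs this thread on node 0, so its
       first-touch allocations land in node-0 memory as well. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}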
tylerdurden,
That’s the problem: the operating system scheduler/memory controller CAN’T deal with it.
If you had 32 cores divided into 8 NUMA clusters, the system would start incurring IPC overhead between NUMA clusters at just 5+ threads. You can keep adding more threads, but the system will be saturated with IO across NUMA boundaries.
To scale well, software must take the NUMA configuration into account. IMHO using separate processes is quite intuitive and allows the OS to effectively manage the NUMA threads. It also gives us the added benefit of distributing the software across computers on a network if we choose to. But obviously you can do it all manually, pinning threads to specific cores and developing your own custom NUMA-aware memory allocators, or you could allow the OS to distribute them by process; either way achieves a similar result. Personally I’d opt for the multiprocess approach, but you can choose whichever way you want.
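For the “do it manually” route, libnuma is the usual tool; roughly something like this (a sketch only, error handling omitted, the node number purely illustrative, link with -lnuma):

/* Keep a worker process and its working set on the same NUMA node. */
#include <numa.h>

#define NODE 0
#define BUF_SIZE (64UL * 1024 * 1024)

int main(void)
{
    if (numa_available() < 0)
        return 1;                      /* kernel without NUMA support */

    numa_run_on_node(NODE);            /* restrict this process to node 0 */
    void *buf = numa_alloc_onnode(BUF_SIZE, NODE);   /* its memory too */
    if (!buf)
        return 1;

    /* ... node-local work on buf goes here ... */

    numa_free(buf, BUF_SIZE);
    return 0;
}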
Sure. But remember, the whole point of NUMA is not to incur IPC overhead.
I think, if I understand you correctly, you’re viewing threads as basically being at the process level. There, sure, message passing makes sense, since you’re not dealing with shared address spaces. But that’s not what NUMA is trying to deal with.
You only have issues with NUMA when you have a very shitty memory mapping, when every core is referencing contents in another chip, but those pathological cases are rare.
tylerdurden,
Not really, you might have 32 threads waiting for a socket operation, but you don’t have much control over which thread will actually receive the operation, and so naive code could very easily end up in a pathological case where the data structure it needs is attached to the wrong NUMA node, which will kill performance.
In naive MT code, it’s very common for all threads to block on shared synchronization primitives, but in a NUMA configuration this produces a pathological case, since all the cores will constantly have to perform IO across NUMA boundaries to check the mutexes/semaphores.
I think that solving the pathological cases will ultimately require MT algorithms that resemble the multi-process design anyway, so IMHO it makes sense just to start there and not have to reinvent the wheel.
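To make that concrete, the difference is roughly the one below (a toy sketch, not real code; assumes Linux with libnuma, link with -lpthread -lnuma):

/* One global lock bounces its cache line across every NUMA node on each
   acquire; a lock (and work queue) per node mostly stays in local cache. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>

#define MAX_NODES 8

/* Naive pattern: every worker on every node fights over this one lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* NUMA-friendly pattern: one lock per node, each guarding node-local work. */
static pthread_mutex_t node_lock[MAX_NODES];

static pthread_mutex_t *my_node_lock(void)
{
    int node = numa_node_of_cpu(sched_getcpu());   /* node of the current CPU */
    return &node_lock[node];
}

int main(void)
{
    for (int i = 0; i < MAX_NODES; i++)
        pthread_mutex_init(&node_lock[i], NULL);

    /* workers would take my_node_lock() instead of global_lock */
    (void)global_lock;
    return 0;
}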
I think the problem is that you’re seeing “Threads” as full processes, not as the fine-grained streams of execution that NUMA deals with.
tylerdurden,
Haha, but there is no problem. It’s not a matter of definition, it’s strictly a matter of the software compromises necessary to make NUMA scale well. Oh well.
NUMA systems have scaled to over 1K cores; I’d say that’s good scalability.
tylerdurden,
The latter link is a good read on the topic.
I think we keep missing each other’s point. I’m referring to NUMA in the traditional architectural sense of the term, which deals with intra-process parallelism. You keep referring to supposed issues/solutions at the inter-process level.
tylerdurden,
Sorry, I believe we’re referring to the same thing, but maybe bringing up the similarity to processes was a step too far without adequately covering the limitations of NUMA first.
Do you agree that code optimized for locality will perform much better on a NUMA system, and that code which isn’t optimized for locality will cause more overhead?
Conventional multithreading on SMP systems treats all threads as identical, but do you agree that on a NUMA system they are not identical, and that performance will degrade when a thread operates on memory physically attached to a different CPU than the one executing it?
Yes, NUMA degrades when a data item is located on a different node. But you’re missing the point of NUMA as a programming model.
The solution you’re proposing seems, literally, like the problem that NUMA was trying to solve originally. Which is why I was getting confused.
tylerdurden,
We can build NUMA systems with hundreds or thousands of cores, but NUMA has not “solved” the scalability issue with regards to hundreds of cores sharing the same memory. Of course NUMA enables existing SMP code patterns to continue working, but we hit a wall and the performance degrades significantly if we use them with very large thread counts.
NUMA only scales at massive thread counts when we give up shared memory and switch to algorithms that enforce locality. In doing so, each NUMA region can operate at full performance independently from all others.
It turns out this is extremely similar to running independent computers in a cluster; however, NUMA does have an advantage in that explicit IO across NUMA boundaries can be done much more efficiently than over a conventional packet-switching network. For example, passing messages directly into memory buffers on the target CPU saves us all the work of encapsulation/fragmenting/exponential backoff/acknowledgement/re-transmission normally involved with network IO. Conventional network clusters don’t hold a candle to NUMA.
I did not imply NUMA solved all the scalability issues. I’m simply saying that NUMA is scalable.
I think we keep diverging because I’m talking about NUMA at the architectural and single-address-space programming level. And if I understand correctly, you keep referring to coarse process-level multithreading.
Perhaps I’m focusing more on HW-NUMA and you on SW-NUMA.
In any case, just like anything else, NUMA works well with certain workloads/applications and is atrocious on others.
Cheers.
tylerdurden,
Cheers to you as well
To me, that just means the OS should open up a way for a process to say “this bunch of threads/tasks/contexts should be clustered together”, and the software can say “these work units are of type X”, and the OS can schedule them appropriately. Something like Erlang’s lightweight processes?
kwan_e,
Sure, you could bundle some threads together, then write code such that those threads avoid sharing memory or synchronization primitives with other bundles, and then make sure network sockets are only accessed by threads in the correct bundle associated with the remote client. This is all great, but it should also sound very familiar! We’ve basically reinvented the “process”.
Pretty sure lightweight processes a la Erlang aren’t processes. Context switching between OS processes is much more expensive than between those lightweight processes.
And also, why not have multiple levels of automated task management? The top level is the process, but why stop at one level? OS-level processes are there for security purposes, and one could argue that putting other responsibilities onto that one abstraction is inefficient.
kwan_e,
I understand why you like this design pattern, and it would be perfectly fine to use MT across the cores within a NUMA region, but unfortunately this pattern doesn’t scale well across NUMA boundaries. This is what I was getting at before: the more you optimize MT code to reduce the overhead on NUMA, the more you end up engineering something that acts like a process.
As you know, NUMA scalability comes by compromising equal access to the global address space and resources. Designing software around NUMA locality is just a suggestion; of course you are welcome to engineer things however you like.
Yes. Something that acts like a process but is not one. Particularly not one which is as heavy. Just because something acts like something else doesn’t mean they’re the same or have the same costs. As I said before, the main reason for processes is security – making sure that processes don’t step on each other’s memory. Not every logical process requires that kind of separation from the others, and thus does not necessarily need to map to an OS process.
And as I also said before, the intermediate level of abstraction should probably be provided by the OS or some architecture aware library. So programs don’t need to reinvent the wheel because it’s already done.
kwan_e,
Take a look at the clone syscall in Linux.
https://linux.die.net/man/2/clone
Creating a new thread or a new process uses very similar kernel paths. The difference is that a new process gets its own address space, while a thread shares the parent’s (the CLONE_VM flag). In terms of keeping NUMA regions separate, using processes is not a hack or a shortcut; a separate address space is exactly the pattern we need to follow to scale well on NUMA.
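A rough sketch of the two calls side by side (using the glibc clone() wrapper, error checks omitted, assumes Linux on x86-64 where stacks grow downward):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int worker(void *arg)
{
    printf("child running with %s\n", (const char *)arg);
    return 0;
}

#define STACK_SIZE (1024 * 1024)

int main(void)
{
    char *stack1 = malloc(STACK_SIZE);
    char *stack2 = malloc(STACK_SIZE);

    /* "thread": CLONE_VM makes the child share the parent's address space */
    clone(worker, stack1 + STACK_SIZE,
          CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, "a shared address space");

    /* "process": same call minus CLONE_VM, so the child gets its own copy */
    clone(worker, stack2 + STACK_SIZE,
          SIGCHLD, "a separate address space");

    while (wait(NULL) > 0) {}    /* reap both children */
    return 0;
}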
Again, I get that you don’t want to agree with me here, but I’d like for you to put forward a specific objection that we can discuss.
This push for parallelization over raw speed is great for certain use cases. For an average desktop user though, you really don’t need more than 4 cores with 2 threads apiece. In fact, outside of certain use cases such as simulations, virtualization, and multimedia editing, it’s pretty serious overkill. Most desktop apps aren’t heavily multi-threaded because it makes no sense to do so. You don’t need a Twitter client or a mail reader to use more than a few threads, and even then it’s I/O parallelization you need, not computational parallelization, and that doesn’t require more cores, just better coding.
For games, yeah, it would be nice if they better utilized system resources, but many of them are already pretty heavily parallelized; they just do most of the work on the GPU. For a case like that where you’re doing most of your processing on the GPU, it doesn’t make sense to use more than a few threads on the CPU. You don’t need to parallelize input processing or network access internally, and just like with regular desktop apps, disk access doesn’t need more cores to be more parallel.
The reality is that most stuff other than games that benefits from parallelization already does a pretty good job of it, especially on non-Windows systems. Modeling, CAD, and simulation software has been good at it for decades now. Virtualization, emulation, and traditional server software has also been pretty good at this for years. Encoding software (both multimedia and conventional data compressors) could be a bit better, but for stream compression there are limits to how much you can parallelize things from a practical perspective, and that’s about the only thing. In fact, I’d argue that the push for more cores over higher speed is a reflection of the demands of modern software (think machine learning), not the other way around.
Aside from all that, while I would love a Threadripper system from a bragging rights perspective, in all seriousness I don’t need one. I upgraded the system that I would put this in from a Xeon E3-1275 v3 to a Ryzen 7 1700 about a month after launch, and that alone was enough that I don’t need any more processing power. The system in question is running anywhere between 10 and 30 VMs, plus 8 BOINC jobs, Syncthing, GlusterFS (both storage and client), and distcc jobs for a half dozen other systems, and despite all that I still have no issue rebuilding the system (I run Gentoo) in a reasonable amount of time while all that is going on. In fact, the only issues I have that aren’t just minor annoyances are all related to AMD not releasing full datasheets for Zen, and thus would be issues with a Threadripper CPU too.