So far in this series, I have looked in broad terms at how the CPU cores in Apple silicon chips work, and how they use frequency control and two types of core to deliver high performance with low power and energy use. As I hinted previously, their design also relies on specialist processing units and co-processors, the subject of this article.
↫ Howard Oakley
Another excellent read from Howard Oakley.
SIMD instructions on Apple are not a new thing at all. Before ARM, during the PowerPC era, there was AltiVec, which I happened to have written a report about in my senior year of college.
As the article suggests, most people do not interact with NEON or other SIMD instructions directly. This is because they are either not directly supported by the compiler (so one has to use intrinsics or fall back to assembly), or, even when they are supported, they are much more difficult to design for.
Look, for example, at this AVX implementation of the log function inside PyTorch. That is 140 lines, and it covers only a single case (8 floats at a time):
https://github.com/pytorch/pytorch/blob/7f81563e5e247f65909084a9b5c5b9ee21fac76d/aten/src/ATen/native/cpu/avx_mathfun.h#L89
In high-performance computing, you’d see many similar examples for each permutation of input shape and target hardware (AVX, AVX-512, NEON, SSE, and so on), with a fallback to basic C++ for compatibility.
That kind of skill is very rare even in large companies, and most compilers cannot do more than optimize very common patterns. (Even a simple dot product, which is just a multiply-add loop, usually requires manual attention.)
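To give a feel for it, here is a minimal sketch (not taken from the PyTorch code above; the function name and structure are my own illustration) of what a hand-written AVX path plus a plain C++ fallback typically looks like for that dot product:

```cpp
// Minimal sketch: a dot product with an AVX path and a scalar fallback.
// Real libraries add AVX-512, NEON, SSE variants and handle alignment and
// tails more carefully; names here are illustrative only.
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>

float dot_product(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // load 8 floats (unaligned)
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));  // multiply, then accumulate
    }
    // Horizontal sum of the 8 partial sums.
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, acc);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3]
              + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    // Scalar tail for the remaining (n % 8) elements.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#else  // Portable fallback in plain C++.

float dot_product(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#endif
```

And that is the easy target: a production version would also need an AVX-512 path, a NEON path, and so on, each tested separately.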
sukru,
Naturally it depends on the compiler. I haven’t tested this in many years, but I did test AVX auto-vectorization at one point. GCC did not do well, and in fact it was really difficult to get it to use vector opcodes effectively. If you didn’t deliberately craft your code in such a way that GCC could recognize the pattern, it wouldn’t work. But Intel’s ICC compiler did a much better job at optimizing.
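For illustration, this is roughly the kind of loop shape you have to hand the compiler to have any chance of auto-vectorization (a sketch, not my original test code): no aliasing, unit stride, no early exits.

```cpp
// Sketch: a loop shape that GCC/Clang auto-vectorizers handle well.
// Build with -O3; on GCC, -fopt-info-vec reports which loops were vectorized.
// __restrict promises no aliasing between dst and src, and the simple
// unit-stride index with no early exits is the pattern the vectorizer recognizes.
#include <cstddef>

void scale_add(float* __restrict dst, const float* __restrict src,
               float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Deviate from that shape even slightly (pointer aliasing, a conditional break, an irregular stride) and, at least back when I tested, GCC would silently fall back to scalar code.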
I’d be interested in doing these tests again to see where they stand today and if things have changed in terms of automatic code optimization. In the future, I have little doubt that AI technology will be able to optimize code beyond the level of human experts.
Alfman,
Using AI is probably very reasonable. I have tried a simple session with ChatGPT, and the results were very promising:
https://chat.openai.com/share/9014e891-44ff-4f8c-a665-12740dd93c2b
I’ll be honest, I don’t have personal experience with AVX, so I cannot vouch for the code. However, that is the point: an expert, along with a capable AI, can generate a significant amount of code in a reasonable time. And you really need to be an expert to do this right with specialized SIMD hardware.
sukru,
TBH I don’t think ChatGPT is the right kind of AI for code optimization. ChatGPT learns by example, and while there’s nothing wrong with that, I think we can do much better.
Well, let’s put it this way: you really need to be an expert to beat chess grandmasters. AI is able to do that through reinforcement learning, where it trains itself instead of being taught. This has several advantages, in that it can even learn & optimize strategies that are original and/or unknown to its creators. Instead of using reinforcement learning to play chess or Go, a fitness function could rank code performance and “win” by having a more optimal algorithm. IMHO this will ultimately beat human expertise. It’s just a matter of computing power.
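To make the idea concrete, here’s a minimal sketch (my own illustration, not an existing tool) of the fitness side: candidate implementations of the same routine get ranked by correctness first and measured runtime second; a search or RL loop would then mutate the winners and re-rank.

```cpp
// Sketch of a fitness function for code candidates: rank implementations of
// the same routine by correctness first, then by measured runtime. A search
// or RL loop would keep mutating candidates and re-ranking; only the ranking
// step is shown here.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using DotFn = std::function<float(const float*, const float*, std::size_t)>;

struct Candidate {
    std::string name;
    DotFn fn;
    double seconds = 0.0;  // fitness: lower is better, if correct
    bool correct = false;
};

int main() {
    std::vector<float> a(1 << 20, 1.5f), b(1 << 20, 2.0f);
    const float expected = 1.5f * 2.0f * static_cast<float>(a.size());

    std::vector<Candidate> pool = {
        {"naive", [](const float* x, const float* y, std::size_t n) {
             float s = 0.0f;
             for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
             return s;
         }},
        // ...candidates generated by the search would be appended here...
    };

    for (auto& c : pool) {
        const auto t0 = std::chrono::steady_clock::now();
        const float got = c.fn(a.data(), b.data(), a.size());
        const auto t1 = std::chrono::steady_clock::now();
        c.seconds = std::chrono::duration<double>(t1 - t0).count();
        c.correct = std::fabs(got - expected) < 0.01f * expected;
    }

    std::sort(pool.begin(), pool.end(),
              [](const Candidate& l, const Candidate& r) {
                  if (l.correct != r.correct) return l.correct;  // correct first
                  return l.seconds < r.seconds;                  // then fastest
              });

    for (const auto& c : pool)
        std::printf("%-10s %s  %.4f s\n", c.name.c_str(),
                    c.correct ? "ok   " : "WRONG", c.seconds);
}
```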
Alfman,
Chess and computation are very different problems.
Practically speaking, chess is “tractable”, while computation, specifically Turing-complete systems, is “intractable”.
This comes from the fact that no algorithm can decide whether a program will HALT (i.e., terminate) for a given input. That implies we cannot build a general algorithm that understands all possible other programs.
That brings us back to this AI: while it can perform better than most people (as shown here by writing AVX code, which I currently cannot), it can by definition never go beyond full human capabilities, only approximate them. Unless, of course, we come up with a new paradigm, and hence a new theory of computation.
Anyway… I think I went too far off on a tangent…
sukru,
Go’s fan-out properties make it “intractable” as well: it has too many possible states to enumerate them all. Yet reinforcement learning has still proven to be a valid approach in beating humans. Keep in mind you don’t have to have a perfect solution to beat humans, as we humans are very imperfect. Beating humans is more tractable than being perfect.
I personally feel the halting problem has several major flaws when used as a justification for what software can or can’t do. One is that the contradiction fundamentally relies on constraining the halting oracle to two states, even though I argue the math allows for an oracle that recognizes the existence of a third state such that no contradiction exists. Secondly, people wrongly transfer the proof, which is about infinite Turing machines, to finite state machines where the proof does not logically apply. Unlike real machines with finite state, a Turing machine is a theoretical construct with infinite state, and the proof depends on this. And thirdly, the proof is not absolute: it asserts there exist some cases that cannot be solved, but it does not assert that no cases can be solved. In fact, when talking about most real software, the halting problem is solvable and the contradiction does not apply, because most real software doesn’t take its own code as input in order to contradict itself. Real software often CAN be mathematically proven to halt or not halt for a given input, in spite of the halting problem.
Even if you wanted to accept the halting problem as a blanket justification, you need to realize the oracles aren’t just about the limits of algorithms but the limits of metaphysics and intelligence itself. Replacing the oracles with humans does not solve the logical contradiction; it’s still there.
So with this in mind, it does not follow that “AI can by definition never be beyond full human capabilities”.
Personally I enjoy these kinds of discussions, but I know not everyone does.
Alfman,
We can go on to discuss the fundamental limits of the human mind and of computation, and we can speculate whether a modern-day computer with seemingly infinite “external” memory (i.e., the Internet) is a better analogue of a Turing machine.
But yes, eventually we have to stop somewhere. It was fun though.
It is the opposite; most people actually interact with the SIMD extensions a lot, since they are used by A LOT of system code these days. They are just not aware of it.
Most people are users, not developers, so I feel the language was a bit weird in that regard 😉
FWIW, autovectorization in modern compilers does a pretty good job if people at least organize their data sets properly within the code. But for most programmers, you just need to use the proper libraries (especially Boost) and be done with it, as they take care of most of the vectorization and data-parallel processing “nastiness” for you.
One of the nice things about NEON and especially the newer Arm data-parallel extensions (SVE/SVE2) is that they are far more programmer/compiler friendly than AVX, as they are sort of vector length/width agnostic. That is very nice from a programmability standpoint, as you don’t need to do all the nasty differentiation of “widths” you have in x86 land.
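For instance, here is a minimal sketch of what that length-agnostic style looks like with the Arm SVE C intrinsics (SVE being the scalable follow-on to fixed-width NEON); the same source works whether the hardware implements 128-bit or 2048-bit vectors. It needs an SVE-capable target, e.g. `-march=armv8-a+sve`.

```cpp
// Sketch of a vector-length-agnostic loop using the Arm SVE intrinsics (ACLE).
// The hardware vector width is queried at run time (svcntw() = number of
// 32-bit lanes), so the same code runs unmodified on any SVE implementation,
// with the predicate handling the loop tail instead of a separate scalar loop.
#include <arm_sve.h>
#include <cstdint>

void scale_add_sve(float* dst, const float* src, float k, int64_t n) {
    for (int64_t i = 0; i < n; i += static_cast<int64_t>(svcntw())) {
        svbool_t pg = svwhilelt_b32(i, n);        // predicate masks the tail
        svfloat32_t vs = svld1_f32(pg, src + i);  // predicated loads
        svfloat32_t vd = svld1_f32(pg, dst + i);
        vd = svmla_n_f32_x(pg, vd, vs, k);        // dst += k * src
        svst1_f32(pg, dst + i, vd);               // predicated store
    }
}
```

Contrast that with x86, where AVX, AVX2, and AVX-512 each get their own hand-written variant of the same loop.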
Xanady Asem,
That is true. Compilers are adding more optimizations, along with implementing them in common libraries / core stuff like string processing: https://www.phoronix.com/news/Glibc-More-AVX-512-October-2022
But at the same time, when you really need to push the envelope, that specialized knowledge is still required, and maybe even more so today.
If you look at “high performance” applications, all(*) of them have specialized vector code in some form or another, in varying volumes.
ffmpeg (multimedia library):
https://github.com/search?q=repo%3AFFmpeg%2FFFmpeg%20avx&type=code
tensorflow (machine learning, in addition to the pytorch examples I shared before):
https://github.com/search?q=repo%3Atensorflow%2Ftensorflow%20avx&type=code
postgresql (database):
https://github.com/postgres/postgres/blob/e5bc9454e527b1cba97553531d8d4992892fdeef/src/include/port/simd.h#L46
(* Interestingly, Godot, which is a game engine, does not seem to have any.)
And we can expand the list to include many other systems with performance critical sections.
And this will get even more important as GPUs become that “co-processor”. Today, CUDA, MPS, and OpenCL are already being used in many projects. Fortunately, most of the time generic functions are shared as “kernels”, which can be reused across many applications.
For example, TensorRT from NVIDIA basically showcases their hardware and APIs: https://github.com/NVIDIA/TensorRT
Again, yes, most developers already use them without knowing. But those who actually know how to use them directly are really sought after these days.
Good to see more articles delving into modern SoC architecture details. A lot of people are still stuck on the old “scalar” core paradigm as the main defining element of a modern CPU (especially when people fret about ISAs). We’ve long since moved into the SoC age.