Why is Apple’s M1 chip so fast?

Thom Holwerda 2020-11-30 Apple 53 Comments

On Youtube I watched a Mac user who had bought an iMac last year. It was maxed out with 40 GB of RAM costing him about $4000. He watched in disbelief how his hyper expensive iMac was being demolished by his new M1 Mac Mini, which he had paid a measly $700 for.
In real world test after test, the M1 Macs are not merely inching past top of the line Intel Macs, they are destroying them. In disbelief people have started asking how on earth this is possible?
If you are one of those people, you have come to the right place. Here I plan to break it down into digestible pieces exactly what it is that Apple has done with the M1.

It’s exciting to see x86 receive such a major kick in the butt, but it’s sad that the M1 is locked away and only a very, very small number of people will get to see its benefits.

About The Author

Thom Holwerda

Follow me on Mastodon @thomholwerda@exquisite.social

53 Comments

2020-11-30 8:33 pm
hdjhfds
but it’s sad that the M1 is locked away and only a very, very small number of people will get to see its benefits.
And the prices will rise by a benjamin franklin portrait with each successive generation of the new architecture. The current price levels are just the bait to attract doubters.
Anyway, it’s always good to have competition. I hope the x86 camp will feel the obligation for improvement, and I hope other ARM vendors will follow the suit.
2020-11-30 8:59 pm
sj87
Reminds me about the early days when investing big bucks into overpriced top-notch hardware was simply pain stupid. Technological advancements at the time meant that hardware priced at $20,000 was not worth $2,000 a couple years later.

2020-11-30 11:32 pm
Earl C Pottinger
In some cases it only took a year to lose it’s value.
2020-12-01 9:19 am
Bill Shooter of Bul Platinum Prime
Yeah I had idiot friends that took out education loans for workstations that were obsolete before the year was over. They then paid off that obsolete workstation over the next ten years.
2020-12-03 2:01 pm
Brisvegas
Back in the mid 80s my brother’s (government) employer was upgrading their HP workstations every six months and replacing them every 12 months.

2020-11-30 9:05 pm
franko
I am very happy with my Raspberry Pi 4 servers for $50

2020-11-30 11:49 pm
Anonymous
That’s nice?

2020-12-01 1:59 am
DeepThought
Intel/AMD will see customers switching to Arm based machines w/o big complaints. So I wonder if they are not already trying to develop a new ISA and simply provide their version of an x86 emulator for the transition. Sure, they will need Microsoft to steer the train (not only jump on).
Intel/HP tried it with IA64, but maybe back then the time was not ready (and the technology).

2020-12-01 10:20 am
rastersoft
I had a funny idea/mental masturbation: since the problem is, as it is explained in that article, that the x86 architecture has a variable-size instruction set, and that doesn’t allow an efficient OutOfOrder system because you can’t analyze as many instructions as desired in a single step to keep full the system with micro-ops… What if a different instruction set, with a fixed length, is implemented in the same architecture? That could be done just by adding a new mode and only a new instruction decoder (which is a very small part in the silicon) which output the same type of micro-ops currently being used in the current x86 processors, and everything else could be maintained. Let’s say that they use the AARCH64 instruction set. Of course, that wouldn’t be a true ARM because, although the instructions themselves would be the same, the other bits of the architecture (like segmentation/pagination, exceptions, virtualization modes, etc) would be the same than in x86 and x86-64. But that would allow to just add an instruction decoder in parallel with the current one, and depending on the current mode, send instructions to the corresponding one. And since the new decoder uses fixed-size instructions, it could be much more efficient and fetch more than four instructions at a time. This would allow an easy transition because the same processor would work with both instruction sets, with the x86 ones being executed as fast as today, but with the new one it would be much faster. Also would allow mixing old and new code, because the operating system would switch from one decoder to another in function of the instruction set used in each process. Also, it would allow to take advantage in compilers, because they already exist for that instruction set (and because the architecture itself, the “other bits” that I commented, would affect only to the kernel and drivers, not to user space programs). Of course, using AARCH64 would mean licensing problems, so maybe using the RISC-V instruction set would be better.
But, of course, since I’m not a CPU designer, it is probable that this would have a lot of hidden problems…

2020-12-02 12:15 am
BlackV
Wouldn’t that be an over complication of already complicated x86 arch?
IFAIK right now instructions are relatively short in 16-bit or 32bit modes depending on processor mode – prefix is necessary for 32bit instructions in 16bit mode and for 16bit instructions in 32bit mode. For 64bit mode for all 64bit instructions separate prefix is always necessary since default operation size is 32bit.
Normal RISC instructions are 32bit width. For x86 arch it would mean either mean new operation mode which decode RISC instructions directly, or additional prefix and 40bit+ width instructions.
But I guess encoding is not really important – since processors would either need separate decoders for new instructions or even more complicated already present decoders.

2020-12-02 5:16 am
rastersoft
Yes and no… I mean: it wouldn’t be a full ARM/RISC-V architecture, but only an instruction set working over the current x86-64 architecture. Let’s say that we take all the x86-64 instruction set and map each one into an equivalent ARM/RISC-V instruction. Or even develop a fixed-size instruction set that accommodates each x86-64 instruction, but all the other things would be exactly like the current x86-64 architecture (segmentation and pagination, virtualization…). Thus only a new decoder would have to be added, which is a quite small part of the silicon, which would emit exactly the same micro-ops than the x86-64 instructions, but everything else would remain exactly the same.
And even in a future, compatibility with old x86-64 instruction set could be removed, if desired. Just like x86-64 allows to build processors without 32bit compatibility.

2020-12-02 6:49 am
BlackV
I’m not sure it is good idea to pull all x86 legacy along to new ISA. After all main space eaters in x86 ISA are various full sized immediate arguments and numerous addressing modes opposed to RISC’s much shorter immediate arguments,outright simple addressing and clear separation between LD/ST and ALU.

2020-12-02 4:32 am
jgfenix
RISC-V has variable lenght instructions but done the right way.

2020-12-02 5:17 am
rastersoft
Yes, I know: each 32bit word can have one 32bit instruction or two 16bit instructions. That still can be considered fixed-length, because there are never instructions that cross a 32bit boundary.

2020-12-01 3:30 am
oiaohm
https://www.eenewsembedded.com/news/64bit-risc-v-core-can-operate-5ghz
The reality is the new Apple M1 may not be that fast or power effective. The idea of a 4Ghz processor that you can run of a AA battery for a decent runtime would have seemed impossible a few years ago.
2020-12-01 7:45 am
FriendBesto
Even though, currently, the only financial threat for AMD/Intel is that of people switching to Apple, they still have a strong incentive to work on dealing the OoO problem.
People just won’t be inclined to hand over thousands for barely-noticeable incremental upgrades, knowing that a technology exists that can produce way more performance / Hz. And the M1 is just a first-generation preview of what is possible.
I, personally, don’t plan to switch to Apple, but I will also defer upgrading my current x64 platform.
2020-12-01 8:16 am
jgfenix
I wonder if a x64-only cpu would see a noticeable improvement. The x64 architecture is a cleanup of the x86 architecture after all.

2020-12-01 10:28 am
rastersoft
The article says that the big problem is the variable-size instruction set, and, AFAIK, that has been kept in both x86 and x86-64 architectures.

2020-12-01 3:33 pm
Xanady Asem
The article was written by someone who had little understanding of architecture.
The “decoder” in x86 stopped being an issue 2 decades ago. Most modern x86 cores, from the PII/Athlon era on now, are basically decoupled architectures, where the fetch engine is decoupled from the execution engines.
There overhead of decoding x86 is basically “noise” at this point, in terms of overall transistor count and execution inefficiencies.

2020-12-01 7:09 pm
Alfman verbose=1
javiercero1,
The “decoder” in x86 stopped being an issue 2 decades ago. Most modern x86 cores, from the PII/Athlon era on now, are basically decoupled architectures, where the fetch engine is decoupled from the execution engines.
There overhead of decoding x86 is basically “noise” at this point, in terms of overall transistor count and execution inefficiencies.
Obviously you have to concede this depends on the workload, In the ideal case instructions would be decoded once and then cached, but not all workloads are ideal and cache is quickly divvied up by all the processes and the operating system too. If you’ve got a lot of micro-instruction cache misses (ie either a program where the code>cache or multiuser systems running several different workloads at once) the x86 microcode decoder becomes a bottleneck. IIRC with intel chips, instruction decoding can take 1-4 clocks depending on the instruction.
This is less of a problem on ARM CPUs. ARM still requires dedicated prefetching due to the compact thumb instruction set, but unlike x86 there’s no need for microcode in the same sense that x86 has it since architectural instructions map directly to microinstructions. For this reason the pretecher on ARM can be made more efficiently while using fewer transistors and less power. Granted prefetching transistors may not be as numerous as those in a CPU’s execution units, but regardless engineers can still put them to better use like more caching.
The ISA can limit optimization possibilities in different ways too, consider byte alignment. ARM saves on transistors by not supporting arbitrary byte alignments where x86 does allow it. Even though most data access is already aligned, the ability to support unaligned access creates more complexity for x86 CPUs. The truth is that x86 was designed for a time when saving every byte mattered, not when 4 & 8 byte integers and pointers would be the norm. x86 isn’t optimal for modern computing. I don’t blame intel’s engineers all those decades ago, they lacked our experience and had no idea x86 would still be a mainstay of computing so many decades later.

2020-12-02 1:29 am
DeepThought
“This is less of a problem on ARM CPUs. ARM still requires dedicated prefetching due to the compact thumb instruction set”
Armv8-A cores (can) have three decoders, AArch32, Thumb-2 (T32) and AArch64.
Makes me wonder if the M1 still contains AArch32 and T32 support.
2020-12-02 6:36 am
Xanady Asem
Y’all are looking at CPU architecture from a standpoint of a textbook of the early 90s.
Just about anything that does out-of-order breaks down instructions into micro-ops, which is just a form of microcode if you want to look at it. Given the budget in terms of other resources like multi port register files, huge instruction windows, reorder buffers, etc, etc. A lot of the “issues” you are focusing on represent such a tiny amount in the overall transistor budget.
There are some inefficiencies here and there, but with a fully decoupled architecture, like most x86 or 64-bit ARM, you basically solve most of this stuff in the fetch engine. And from there on, the execution units, both x86 and ARM look remarkably similar.
Honestly, if I were to say what’s the biggest differentiator between x86 and ARM… it is not RISC vs CISC, or whether or not the instructions are of variable/fixed length, but rather the memory models.
x86 has a very strong memory ordering, whereas ARM is far more relaxed.
In fact, it seems that one of the things that makes the M1 so efficient at emulating x86, over other ARM variants, it’s that Apple apparently added support for x86-like memory ordering in HW (which is usually absent in ARM)
2020-12-02 6:43 am
rastersoft
But that’s exactly what the article says: that although the instruction decoder and the micro-ops execution pipeline are fully decoupled, the problem is that the decoder can’t keep filled the pipeline because separating the instructions is a nightmare, and trying to decode more than four instructions per clock doesn’t worth it, while in ARM it is possible to decode much more instructions per clock, thus being more able to keep filled the pipeline.
2020-12-02 1:34 pm
Alfman verbose=1
javiercero1,
Just about anything that does out-of-order breaks down instructions into micro-ops, which is just a form of microcode if you want to look at it. Given the budget in terms of other resources like multi port register files, huge instruction windows, reorder buffers, etc, etc. A lot of the “issues” you are focusing on represent such a tiny amount in the overall transistor budget.
It doesn’t negate what the article or I said though. For an identical transistor budget in both x86 and ARM, you can fit more/longer prefetchers for an ARM ISA. The complexity of x86 has a scalability cost. The cons of the x86 prefetcher are mitigated with u-op caching. however for workloads that don’t fit in the u-op cache, the execution path with have to wait for the prefetcher, which leads to ARM having better performance and/or lower power consumption.
To the extent that you use more transistors for parallel x86 prefetching (ie because it’s a small percentage of the overall CPU die), that still gives ARM the advantage because all else being equal and given the exact same transistor budget as an x86 ARM can have more decoder parallelism and/or a larger u-op cache.
Honestly, if I were to say what’s the biggest differentiator between x86 and ARM… it is not RISC vs CISC, or whether or not the instructions are of variable/fixed length, but rather the memory models.
x86 has a very strong memory ordering, whereas ARM is far more relaxed.
In fact, it seems that one of the things that makes the M1 so efficient at emulating x86, over other ARM variants, it’s that Apple apparently added support for x86-like memory ordering in HW (which is usually absent in ARM)
I heard that apple added hardware support in M1 to aid in x86 emulation too, although I didn’t know the details of what that entailed. What you say makes sense.
2020-12-02 2:21 pm
Xanady Asem
rastersoft Alfman
Again, neither of you) realize you are beating an architectural dead horse so to speak. That problem was solved over 20 years ago!
It’s now that there’s an ARM core that can compete in IPC and frequency with an x86 counterpart. Both intel and AMD have figured out how to keep their pipelines filled long time ago.
And again, compared with other overheads from out of order processing, what you all of you are concentrating (the decoding of variable vs fixed length insts) is a non-issue.
For example, the M1 gets a tremendous boost from the huge L1, which in terms of transistors is orders of magnitude more… than the cost in decoding overhead from x86, which has a better L1 cache behavior for the same size.
Out of order is a great equalizer, because neither architecture has an intrinsic “leg” up over the other in terms of performance limiters.
2020-12-02 2:35 pm
Xanady Asem
Edit to add:
You’re focusing on the worst case for x86, which is a very long instruction that wreaks all sorts of havoc on the decoder. And somehow limits the ability for x86 to have multiple decoders in parallel (which is nonsense, but let’s go with it).
Well, turns out that complex instruction once decoded folds into several uOps which do a lot of work on the execution engine.
Wheras ARM may require a lot of fixed length instructions, to generate the same amount of uOp density to keep it’s pipelines “busy,” as that one single x86 large instruction. Now ARM is at a higher risk of a cache miss, which is far more catastrophic (like orders of magnitude more cycles) than whatever overhead x86 paid for decoding the multi-cycle instruction.
Incidentally, this is why M1 has basically perhaps the largest L1 cache in history.
Neither architecture has much of a leg up over the other. High performance x86 have had no problem achieving tremendous levels of IPC, which points out at the decoder not being as much of a limiter as some of you think.
2020-12-02 5:18 pm
Alfman verbose=1
javiercero1,
Again, neither of you) realize you are beating an architectural dead horse so to speak. That problem was solved over 20 years ago!
I don’t know why you keep saying this was solved over 20 years ago because looking back at year 2000 intel was telling developers to avoid instructions that translated to complex microcode to avoid stalling the pipeline.
https://studylib.net/doc/14553594/ia-32-intel-architecture-optimization-reference-manual
Maximize the flow of μops into the execution core. IA-32 instructions which consist of more than four μops require additional steps from microcode ROM. These instructions with longer μop flows incur a delay in the front end and reduce the supply of uops to the execution core. In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently μops can be packed into the trace cache. Where possible, it is advisable to select instructions with four or fewer μ ops. For example, a 32-bit integer multiply with a memory operand fits in the trace cache without going to microcode, while a 16-bit integer multiply to memory does not.
Ergo, it was a very real bottleneck.
We can move forward in time to 2011, the complex instruction bottleneck still exists in Sandy Bridge and the u-op cache has an 80% hit rate, which means that there were still out of order pipeline stalls 20% of the time.
https://www.realworldtech.com/sandy-bridge/4/
Bring it to today, the u-op cache is still too small to hold complex programs in order to avoid stalling the pipeline. AMD’s own engineers in 2020 suggested just changing the cache heuristics to improve the u-op hit rate in order to reduce delays and power consumption.
https://www.microarch.org/micro53/papers/738300a160.pdf
Our experiments on a x86 simulatorusing a wide variety of benchmarks show that CLASP improvesperformance up to 5.6% and lowers decoder power up to 19.63%.When CLASP is coupled with the most aggressive compactionvariant, performance improves by up to 12.8% and decoderpower savings are up to 31.53%.
Far from being “solved”, finding ways to mitigate poor x86 prefetch performance is an area of ongoing research and optimization. The prefetcher is still relevant to the architecture’s power & performance despite out of order execution cores. And sure, the solution may be just to throw more transistors at the problem, which I don’t deny. But it poses a scalability problem for x86 and for the same amount of transistors/complexity in the prefetch unit a simpler ISA like ARM can get that much more work. It really is advantageous to have a simpler ISA.
I understand your rational for abstracting OoO u-op cores from the ISA, but it doesn’t mean we can or should deny the pain points caused by x86 ISA. If you want to disagree, then so be it, but I don’t think this ought to be as controversial as you are making it.
2020-12-02 5:30 pm
Alfman verbose=1
Self:
I don’t know why you keep saying this was solved over 20 years ago because looking back at year 2000 intel was telling developers to avoid instructions that translated to complex microcode to avoid stalling the pipeline.
Minor correction, the intel optimization documentation I linked to is dated 2005.
2020-12-02 9:24 pm
Xanady Asem
The document you’re linking it’s about the Penitum 4, a 20+ yr old microarchitecture, and it was a PITA to optimize for because it had a very long pipeline with a very narrow integer and FP execution engines. Sure, the trace cache had a 4-wide uOp limit, so anything wider than that is microcoded and it’s going to take an extra cycle at least, but once it is in the cache you’re golden.
In a Sandy Bridge system you can still have a bunch of uOps in the buffer AND get a cache miss, so you wouldn’t necessarily have to stall 20% of the time.
And lastly, just because a problem is solved it does not mean improvements upon those solutions are not still possible. 30% improvement in power is great, but then we’re talking about a structure that takes single digits of the overall dynamic power budget for the core.
The point is that the x86 decoder is not as unyielding and intractable limiter to performance some of the posters seem to think of it being.
2020-12-03 12:21 am
Alfman verbose=1
javiercero1,
The document you’re linking it’s about the Penitum 4, a 20+ yr old microarchitecture, and it was a PITA to optimize for because it had a very long pipeline with a very narrow integer and FP execution engines. Sure, the trace cache had a 4-wide uOp limit, so anything wider than that is microcoded and it’s going to take an extra cycle at least, but once it is in the cache you’re golden.
No kidding, but you opened yourself up to that when you said it had already been solved 20+ years ago. Intel’s own product documentation proves otherwise. 🙂
Anyways I’m very glad that you can agree with me (and the documentation) that the u-op cache is the way to avoid the costly x86 prefetcher. Of course it’s great so long as your code is actually contained in the u-op cache, otherwise the x86 prefetcher enters the critical path. That’s the rub, the x86 prefetch expense creeps in when your code is too large as clearly reported by the AMD paper I linked to.
Of course I agree that engineers can always solve the performance problem by throwing more transistors at it, but obviously this creates a tradeoff between performance and power efficiency. This creates challenges for x86 in microarchitectures like atom and silvermont where energy efficiency is paramount. They don’t have the budget for more transistors to help prefetch x86 instructions. Even with the reduced demands of limited execution pipelines, the complexity of x86 instructions with too many prefixes/escape bytes can still lead to execution stalls, which is a direct consequence of the ISA unfortunately 🙁
The point is that the x86 decoder is not as unyielding and intractable limiter to performance some of the posters seem to think of it being
Ok, but I never claimed it was intractable, I only pointed out that x86 decoding is inherently less efficient to prefetch than ARM even if everything else in the CPU is equal.
2020-12-03 1:15 am
Xanady Asem
You’re seeing the prefetching process as the end of it. So yes, there some corner cases in the ISA may be harder to decode there. On the other hand, on the other end of the fetch engine, at the front end of it, RISC ISAs put more stress on the i-Cache BW. So whatever RISC gaveth you in the decode stage, the instruction memory hierarchy taketh away.
Again: Microarchitecture and ISA were decoupled long ago. Now, in a out-of-order superscalar world, it’s just a bunch of execution engines doing uOps. Turns out that neither RISC nor CISC had any particular advantage over the other in terms of how those uOps come to be ready to be executed.
The fetch engine on a high end ARM core, like the ones in the M1, puts a whole lot of pressure on the i-Cache. Which is why they’re huge in this design. Whereas x86 puts more pressure on the logic within the fetch engine itself.
And it turns out, that there’s not much difference in between the two approaches.
Atoms were just bad designs. AMD has some great power/performance specs with their latest cores, so there’s nothing inherently stopping an x86 vendor to target similar power/performance envelopes as the M1. Inte; can’t this time, because they’re way behind in their fab node. But I could see AMD doing a quad core Zen3 in TSMCs 5nm which could be within a close envelope to the M1’s.
2020-12-03 4:23 am
Alfman verbose=1
javiercero1,
You’re seeing the prefetching process as the end of it. So yes, there some corner cases in the ISA may be harder to decode there. On the other hand, on the other end of the fetch engine, at the front end of it, RISC ISAs put more stress on the i-Cache BW. So whatever RISC gaveth you in the decode stage, the instruction memory hierarchy taketh away.
It’s not “the end of it”, but in an apples to apples comparison where the OoO are on equal technological footing, it is objectively worse to have a complex ISA up front. Prior to the thumb instructions, the encoding was less efficient, not so much anymore. As I said at the get go, the x86 made more sense when it debuted decades ago and bytes were really expensive, but now days we’re running 32 and even 64bit programs the benfits are leaning more and more towards ARM. Most engineers would agree x86 more of liability and we wouldn’t do it this way again given the choice.
Again: Microarchitecture and ISA were decoupled long ago. Now, in a out-of-order superscalar world, it’s just a bunch of execution engines doing uOps. Turns out that neither RISC nor CISC had any particular advantage over the other in terms of how those uOps come to be ready to be executed.
You are not providing any new information here, everyone already knows it’s decoupled. But that doesn’t address the very real and documented inefficiencies of x86 prefetch. Look if you don’t want to admit it, fine, but then I don’t think anything of value can come out of this discussion so let’s just agree to disagree.
2020-12-03 7:29 pm
Xanady Asem
x86 still makes sense in a world of “32bit and 64bit programs” because we still have to deal with memory hierarchies. And CISC ended up having a much better iCache behavior.
Furthermore, I keep repeating myself, because I don’t think you understand that the x86 decoupled out of order speculative architectures are able to mitigate, significantly, a lot of the inefficiencies you are pointing out.
All things being equal, in a modern OoO design, there’s not that much difference in “efficiency” between CISC and RISC decoding. In x86, you have added complexity in the prefetch buffers to deal with the long instruction corner cases, and in ARM you move that complexity into the fetch engine to deal with the increased instruction bandwidth and the higher miss rates it causes.
Again, the x86 decode hasn’t been a limiter to performance in a very long time, and overall in a modern x86 core, you’re talking about structures that take less than 4% of the power budget.
2020-12-03 8:22 pm
Alfman verbose=1
javiercero1,
x86 still makes sense in a world of “32bit and 64bit programs” because we still have to deal with memory hierarchies. And CISC ended up having a much better iCache behavior.
Furthermore, I keep repeating myself, because I don’t think you understand that the x86 decoupled out of order speculative architectures are able to mitigate, significantly, a lot of the inefficiencies you are pointing out.
I’ve already responded to you about this. Anyways if you don’t feel like addressing front end issues, that’s your prerogative, but there’s nothing more to talk about. We have to agree to disagree.

2020-12-01 10:27 am
rastersoft
Please, can an admin delete that trash? Those are two pron/scorts webs >:-(
2020-12-01 9:59 pm
cpcf
There are some fairly large generalisations going on here and in the article, if only things were so simple we could all be out there building our own FPGA based GP SoCs with eye popping performance improvements! There will probably be a big data related answer, the sort of thing that only the owner/operator of the OS can identify.
Of course, on the counter argument, increased speed comes at increased risk. @Franko mentions being happy with RaspberryPi 4 servers, but it would be very interested to hear just how demanding the server applications / workload may be because I’d never claim that a Raspberry Pi 4 is reliable by commercial GP server standards, in fact far far from it. So the full answer is neither speed or scale.
2020-12-02 1:02 am
Anonymous
The ability to build a custom system and tune things and factor out overhead is really interesting. I’m not at all attracted to closed platforms and forced obsolecence and don’t remotely have the money to buy into Apple even if I wanted to.
I’m a bit ticked off by having a laptop which is pretty functional while Intel has abandoned the CPU socket in mine so no further upgrades are possible. It strikes me as an expensive waste to lob out an entire laptop for want of a single part so I’m sticking with it as long as I can. I don’t play games which helps but as art programs begin to demand GPU acceleration APIs mine doesn’t support this places anything really new out of reach.
2020-12-02 11:37 am
ndrw
The whole ARMIntel architecture part of the story is red herring. Yes, in low performance implementations ARM architecture is indeed better but in high performance ones (and M1 is definitely one of them) the difference is irrelevant. In a typical superscalar CPU almost all silicon and power consumption goes into OoO logic and caching. M1 microarchitecture is one of the best, if not the best in the industry, but if it was just about the microarchitecture we would be talking about single digit percent improvements (like with IntelAMD). The real efficiency and performance gains are elsewhere:
– Heterogeneous architecture. Conventional wisdom is that hardware implementations are not worth the effort outside well defined tasks because in a few years software implementations catch up in performance and cost of maintaining additional code paths is simply too high. However, (1) Apple is less affected by this issue because of their grip on software ecosystem, (2) performance gains may be small/moderate but power efficiency improvements are potentially huge. Essentially, Apple could replace most inner loops and real time features with hardware implementations.
– Bringing data and code closer. CPU’s main function is to flexibly combine data and code. The problem is, it has to first bring data and code together but their are increasingly farther away. Caching helps to recover most of the performance losses but it does so by further increasing power consumption. Placing 16GB of RAM next to the CPU immediately relaxes these constraints and improves memory latency and power consumption. For lower-end machines this is far more important than RAM capacity.
– Software APIs. Not unique to Apple but, again, they have an advantage of vertical integration. Basically, all high performance APIs should now operate on data collections (think Matlab or map-reduce). Iterating over collections and calling API functions on single elements is terrible for performance due to memory locality and caching.
I like how they are thinking outside of the box and are making the most of their market position. I am not a fan of their business model, though – I’d rather use a less efficient machine that can run software of my choice. I would also be interested in seeing how they are planning to handle backward compatibility in the future – will new versions of their systems and apps be falling back to software on older machines?

2020-12-02 3:22 pm
Alfman verbose=1
ndrw,
– Heterogeneous architecture. Conventional wisdom is that hardware implementations are not worth the effort outside well defined tasks because in a few years software implementations catch up in performance and cost of maintaining additional code paths is simply too high. However, (1) Apple is less affected by this issue because of their grip on software ecosystem, (2) performance gains may be small/moderate but power efficiency improvements are potentially huge. Essentially, Apple could replace most inner loops and real time features with hardware implementations.
It would help to talk about specifics to make sure everyone’s on the same track. A lot of PCs already use accelerated hardware after all. The big difference is that the M1 moves a lot of that stuff onto the CPU, but I’m unsure how much of a difference this makes. The biggest accelerator of them all is the GPU, but it seems the shared memory is actually a con for the M1. It performs better than other iGPUs, but that’s hardly a surprise as these were always considered very low performance in the PC space. I’d like to see more data on this if you have it, but I still think discrete GPUs are going to have the performance advantage for a long time.
– Software APIs. Not unique to Apple but, again, they have an advantage of vertical integration. Basically, all high performance APIs should now operate on data collections (think Matlab or map-reduce). Iterating over collections and calling API functions on single elements is terrible for performance due to memory locality and caching.
It’s been almost two decades since I’ve used it at university, but Matlab was very slow even compared to naive C algorithms at the time. I guess they’ve upped their game?
– Bringing data and code closer. CPU’s main function is to flexibly combine data and code. The problem is, it has to first bring data and code together but their are increasingly farther away. Caching helps to recover most of the performance losses but it does so by further increasing power consumption. Placing 16GB of RAM next to the CPU immediately relaxes these constraints and improves memory latency and power consumption. For lower-end machines this is far more important than RAM capacity.
Well, as with anything in physics there’s an implicit tradeoff. Here putting everything together should make it possible to improve latencies, but onboard RAM will make the CPU run hotter and may make it harder to clock up due to much higher power densities of this approach. Limited scalability is the downside of this approach. I’d like to see at least 32GB/64GB, but as long as it’s on the CPU that would probably come at the cost of CPU performance. I think a hybrid model would be neat though: primary ram would be external (and be able to support arbitrary capacities as PC users normally expect), and the ram on the CPU would be usable as a very large and fast L1/L2 cache. Obviously this kind of thing already exists in the CPU space, but the M1 shows that a cache of several gigabytes would be possible. To the extent that tradeoffs allow it, I’d prefer to have less, but much faster cache over 16GB of CPU memory/cache, say 1GB of high performance cache, it could be a huge win.
I like how they are thinking outside of the box and are making the most of their market position. I am not a fan of their business model, though – I’d rather use a less efficient machine that can run software of my choice.
Same here. I like that they’re bringing more hardware competition. PC hardware hasn’t had much competition in a very long time. But at the same time I think it’s dangerous to have corporations locking owners out of their own hardware. 🙁
I would also be interested in seeing how they are planning to handle backward compatibility in the future – will new versions of their systems and apps be falling back to software on older machines?
I don’t know whether apple’s said anything on the subject? For historical reference rosetta 1’s software emulation was kept around for about 5 years.
en.wikipedia.org/wiki/Rosetta_%28software%29
And the G5 power macs that were new in 2004/2005 had OS updates through 2009, about 5 years.
en.wikipedia.org/wiki/Power_Mac_G5
en.wikipedia.org/wiki/Mac_OS_X_Leopard
Historically microsoft hasn’t had the same incentive to push new hardware, so this sort of thing isn’t usually on the radar for windows users. Same for linux.

2020-12-03 9:22 am
ndrw
GPUs are poorly suited for off-loading random bits of code to them, mainly because of communication overhead between CPU and GPU. IGPUs are better at that but then they perform worse as stand-alone GPUs, mainly because of thermal limits and memory bandwidth constraints.
I was thinking more about a dedicated hardware – DMA, DSP (convolution, FFT), whole decoders/encoders/compressors. They have almost zero hardware cost (reuse dark silicon). The cost is in writing and maintaining software. Performance gains are typically not seen as worth this cost but if you are after power efficiency that’s a different story. Power savings in the order of 1000x for a given function are not unusual.
Matlab
Two decades ago we were in completely different computing world, optimized for single-threaded computing and CPU cycles were still the king. Now it is all about parallelism and feeding data in&out efficiently. Matlab/numpy/Fortran are doing that quite well, assuming you don’t fall back to custom for loops.
RAM
>16GB would be nice but it is not essential for portable devices. OTOH DDR buses and extra cache controllers can use a lot of power (more than RAM itself) and add significant amount of latency. I think Apple has done the right choice here. RAM is virtualized at the system level anyway so in the future they may back it with on-board DDR modules or NVMe storage, that’s one place where moving from hardware to software may be beneficial.
Apple, Microsoft, Linux
Most of the new features depend on Apple having control over their software platform, including 3rd party closed-source apps. This is aimed squarely at Microsoft. Anyone wanting an efficient laptop for travel would have to buy an Apple product. Microsoft could come up with their own M1 SoC and write a custom version of Windows/Office for it, or negotiate access to Apple’s SoC and subject to their terms (“embrace, extend, …”).
I would love to see Linux on such M1 laptops (I don’t care about power efficiency of M1 desktops). It could work quite well given the opensource nature of the Linux ecosystem but I’m pretty sure Apple will not allow that to happen. Realistically, we are better off targeting mobile/server ARM SoCs and waiting a bit for performance to converge on desktop specs. Raspberry Pi 4 already has 64b CPU and 8GB of RAM, the problem is in its pitiful IO performance and CPU/GPU optimized for low cost mobile applications. Boards like HiKey 970 or DragonBoard 820c would be much more competitive if someone’s decided to put more work into them.

2020-12-03 5:44 pm
Anonymous
You are correct to a point. OpenGL introduced the concept of bundling multiple API calls into one pre-processed call which sped things up a lot by avoiding multiple calls. This was in tandem with the OpenGL concept of unified memory which is where in the case of PCs (not SGI machine which had genuine unified memory) more memory on the GPU card provided a bigger buffer for storing the result of these operations.
Power efficiency is a good way of measuring code efficency. Back in the day when hardware was limited code efficency was important. The fact that too many lazy coders or code frameworks rely on throwing cheap hardware at the problem is shifting the cost to the end user. When this extra taxing of processing resources or increase carbon footprint adds up it can be orders of magnitude more expensive than the cost of rewriting the software to be more efficent. This may make the software cheaper (although this is debateable) from the point of view of a developer a higher cost is passed on to the end user and society. This has to be paid for and because the cost doesn’t magically disappear the argument really isn’t about how much but who pays for it.
As for new hardware such as the M1 being more efficent so using less power what Apple have also done i a somewhat indirect way is the kind of trick Mcirosoft got rich on. They are forcing end users to pay more money for yet another piece of software and provide an indirect bribe to developers not to fix their old software but gouge their users for a new version which in some cases may just be a tweak and a recompile while charging the full price as if it is a new and original piece of work.
2020-12-03 6:04 pm
Alfman verbose=1
ndrw,
GPUs are poorly suited for off-loading random bits of code to them, mainly because of communication overhead between CPU and GPU. IGPUs are better at that but then they perform worse as stand-alone GPUs, mainly because of thermal limits and memory bandwidth constraints.
Yeah, there’s always tradeoffs. Hopefully apple doesn’t rope itself into the “unified” memory model, because that will have scalability implications down the line. Today’s GPUs from Nvidia and AMD support the exact same method of accessing system memory directly from the GPU that apple is using in the M1. This unified memory model is simple to enable and simple to use just by calling the right allocator:
https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
This is very flexible, but the reason we don’t use it exclusively in GPU applications is that it dramatically increase contention of shared resources. IMHO what the M1 has today is fine for competing against other iGPUs, which have low expectations today, but in the long run they’ll have to consider a dedicated GPU option to compete with high end GPUs.
I was thinking more about a dedicated hardware – DMA, DSP (convolution, FFT), whole decoders/encoders/compressors. They have almost zero hardware cost (reuse dark silicon). The cost is in writing and maintaining software. Performance gains are typically not seen as worth this cost but if you are after power efficiency that’s a different story. Power savings in the order of 1000x for a given function are not unusual.
Can you elaborate on what kind of hardware you are thinking of? How would it work and what would you use it for? Because I think GPUs are already very well suited for offloading convolutions, FFT, NN and the like. You can make hardware that is very specific to every task, but then you risk the hardware being too niche and less useful. To me the balance that GPUs achieve is quite impressive.
16GB would be nice but it is not essential for portable devices. OTOH DDR buses and extra cache controllers can use a lot of power (more than RAM itself) and add significant amount of latency. I think Apple has done the right choice here. RAM is virtualized at the system level anyway so in the future they may back it with on-board DDR modules or NVMe storage, that’s one place where moving from hardware to software may be beneficial.
All modern CPUs have an integrated memory controller these days, has apple published anything to suggest theirs is different other than the fact that their RAM is sitting on the CPU? I’d like to see it if you’ve got a link.
As for the amount needed, it depends on what it is used for. A lot of people use “laptops” for “desktop” use cases so IMHO “portable” is not necessarily a good reason to justify limited capacities. Having the GPU share ram means you have less for applications. I expect that apple will eventually acknowledge the needs of power users with a future CPU that has better specs for GPU/ram/multithreading. The big question is whether putting all of this on a single CPU is scalable. To me that’s a big unknown. Since everything is packed so closely together, upgrading any one of the these subsystems can potentially put the entire chip past it’s thermal limits. If that is the case then I still think a large high speed cache backed by more traditional dedicated components could still be more scalable.
I’m definitely curious to see how this pans out in the long run…
Most of the new features depend on Apple having control over their software platform, including 3rd party closed-source apps. This is aimed squarely at Microsoft. Anyone wanting an efficient laptop for travel would have to buy an Apple product. Microsoft could come up with their own M1 SoC and write a custom version of Windows/Office for it, or negotiate access to Apple’s SoC and subject to their terms (“embrace, extend, …”).
To be honest I don’t see this as being an advantage at all (other than for the purposes of vendor locking of course). There’s very little in terms of software functionality that cannot be duplicated on other devices. By it’s very nature software is meant to abstract the hardware.
I would love to see Linux on such M1 laptops (I don’t care about power efficiency of M1 desktops). It could work quite well given the opensource nature of the Linux ecosystem but I’m pretty sure Apple will not allow that to happen. Realistically, we are better off targeting mobile/server ARM SoCs and waiting a bit for performance to converge on desktop specs. Raspberry Pi 4 already has 64b CPU and 8GB of RAM, the problem is in its pitiful IO performance and CPU/GPU optimized for low cost mobile applications. Boards like HiKey 970 or DragonBoard 820c would be much more competitive if someone’s decided to put more work into them.
Totally agreed 🙂

2020-12-02 7:20 pm
Anonymous
The new Playstation 5 is adding tiered GPU capability. There is the first GPU plus a second daisychained on with a cloud GPU capability daisychained on behind this. If you squint at the Playstation 5 and Apple M1 CPU it looks like they are reinventing SGI workstations and the BBC Micro. No new ideas really just a shuffling around of the bits.
2020-12-03 4:12 am
bassbeast
Are you really asking this? Because its a console, duh. No seriously, that is what it is, its a console just like Xbox whatever and PS5 . They control the hardware, they control the software, they control the network and I/O and every other piece so they can optimize the living hell out of it that you simply cannot do with an OS like Windows or Linux where you have to support hundreds of CPUs, GPU, I/O chips, etc.
Its the same reason the previous Xbox whatever (can someone seriously call Redmond and tell them to just stop with the dumb names? They are getting as bad as Intel with the naming) and PS4 can still play mainstream games in 2020 while a mainstream PC built at the same time would struggle, you can do a heck of a lot of optimizing when you know what every single circuit is gonna be and have full control of every transistor.
And I’m sure some are gonna say “Then why wasn’t the Intel ones super fast?” and the answer is right in the question…Intel. Apple didn’t have any control over Intel, nor any control over what Intel shipped to them other than picking the SKU. They couldn’t ask for custom graphics in the die or specific I/O or any of that and they also I’m sure wanted to keep their options open in case they flaked out on them like IBM did with PPC, which is of course exactly what happened when Intel couldn’t do 10nm followed by 7nm. There was no point in doing really hardcore optimizing when they had no idea from one revision to the next what Intel was gonna do, now that they have full control that is no longer an issue.

2020-12-03 5:29 pm
Anonymous
I’m discussing system architechture concepts not business decisions or specific implementations. This was very clear in my comment,.

2020-12-03 9:39 am
cb88
The fact is it isn’t that fast… when comparing the M1 vs x86 the benchmarks being used are microbenches… and on top of that we are talking about mobile to mobile comparisions and there aren’t any 5nm APUs yet, and next generation 7nm APUs are set to likely be about 20% faster in single thread… than the M1 in a month or two.

2020-12-03 3:11 pm
bassbeast
Uhhh…just in case you missed the memo Gamernexus reported about 3 days ago you can pretty much give up seeing any new AMD APUs,GPUs or in fact the new Ryzen 5s, until mid 2021 at the earliest. What little supply that is currently in the chain is all you are getting as TSMC has reported AMD’s entire allotment of 7nm fab is going to the Xbox whatever and PS5. This makes sense as I’m seeing those go for nearly $2k a pop thanks to scalpers and Sony and MSFT are breathing down AMD’s neck for more units to deal with the holiday demand.
So while I agree that WHEN the new chips become mainstream they will probably stomp this thing from the looks of it those chips are gonna be as rare as hen’s teeth for probably the better part of a year. Heck there is a bit of a scandal brewing right now as it has also been reported that Nvidia has been selling truckloads of brand new RTX 3080s directly to scalpers as they are in such demand for mining the scalpers were willing to give Nvidia triple MSRP per unit, and supplies are starting to get so tight in my area I’m seeing Ryzen 3600s going for prices the Ryzen 7 3700x was getting just a year ago.
I’m just glad I built my Ryzen 3600 early in the year as from the looks of things when it comes to computers its gonna be as bad as during the cryptomining craze with waaay too many people wanting the gear and too few units to go around so Apple may get to be the only game in town for awhile, at least until the insane desire for consoles quiets down.

2020-12-03 6:57 pm
Alfman verbose=1
bassbeast,
I’m just glad I built my Ryzen 3600 early in the year as from the looks of things when it comes to computers its gonna be as bad as during the cryptomining craze with waaay too many people wanting the gear and too few units to go around so Apple may get to be the only game in town for awhile, at least until the insane desire for consoles quiets down.
I’m obviously seeing the same thing. I was happy that I bought my last system before the tariffs kicked in and now…sheesh. Between the new taxes and lack of supply prices have gone crazy.
It’s funny that some people were rushing to sell their used rtx 2080 ti for $500 when they learned about the prices for RTX 3000. Yet now their used cards are selling for ~ $1000 on amazon. A lot of them jumped the gun because they thought they’d be able to own the new cards, haha.
https://wccftech.com/nvidia-geforce-rtx-2080-ti-used-market-flooded-after-rtx-3080-rtx-3070-unveil/
I’m not tempted by the 3070, which has much less ram, smaller bus width, and performs slower than 2080 ti in the cuda benchmarks I care about. The 3080 could be a tempting upgrade, but as you’ve pointed out you have to win the F5-lottery or buy it at exorbitant prices from a scalper.
I think these hardships are going to persist for a while. The tech industry is not in a good place to have so much dependence on a single supplier. It’s ironic that the big name manufacturers are getting more competitive with their products, yet they’re all becoming inceasingly reliant on TSMC to supply their chips. TSMC is the new chipzilla.
https://www.tweaktown.com/news/75679/nvidia-will-shift-over-to-tsmc-for-new-7nm-ampere-gpus-in-2021/index.html

2020-12-07 3:46 am
bassbeast
Yeah seriously glad I built mine when i did as I ended up with a nice gaming PC for just $650 and today it would easily got me a grand plus. Its truly insane that the entire tech world is completely at the mercy of a single company, I mean what happens if tensions flare up and China decides to just take Taiwan? Or a natural disaster strikes? CPUs, GPUs, consoles, mobile chips, so many devices would be SOL if that one company gets taken out that it would make the flood in Thailand that crippled HDD production look like a slow weekend.

2020-12-03 5:33 pm
Anonymous
You are correct that M1 isn’t fast and it is not fast by any objective measure. What does give M1 its speed is architecural arrangemtns which increase throughput. From an end user point of view they are essentially equivalent so the experience is that M1 is fast when its not. it’s the same with GPUs. GPUs 9and earlier graphic processing techniques) are not fast. They are just built in a way which avoids bottlenecks or spreads the load for the specific use case.

2020-12-03 6:36 pm
Brisvegas
I had to laugh when I read about the “$4000” iMac being blown away. Last time I checked $4000 gets you a 32 core Threadripper CPU, 128GB of RAM and a Nvidia 2080 in the PC world. That will utterly humiliate the M1 while still running at idle
2020-12-05 4:19 am
smashIt
After skimming the article, I have the distinct feeling that the autor has no clue.
CPUs not having IO-capabilities? WTF?
Did he ever look at specifications of CPUs for the last couple of years?
Ryzens even have SATA and USB integrated into the CPU.
2020-12-14 3:32 am
mitteltiefen
Apple is no1
https://google.com
google.com