Hackaday recently published an article titled “Why x86 Needs to Die” – the latest addition to the long-running RISC vs CISC debate. Rather than x86 needing to die, I believe the RISC vs CISC debate needs to die. It should’ve died a long time ago. And by long, I mean really long.
About a decade ago, a college professor asked if I knew about the RISC vs CISC debate. I did not. When I asked further, he said RISC aimed for simpler instructions in the hope that simpler hardware implementations would run faster. While my memory of this short, ancient conversation is not perfect, I do recall that he also mentioned the whole debate had already become irrelevant by then: ISA differences were swept aside by the resources a company could put behind designing a chip. This is the fundamental reason why the RISC vs CISC debate remains irrelevant today. Architecture design and implementation matter so much more than the instruction set in play.
↫ Chips and Cheese
The number of instruction sets killed by x86 is high, and the number of times people have wrongly predicted the death of x86 – most recently, after Apple announced its first ARM processors – is even higher. It seems people are still holding on to what x86 was like in the ’80s and early ’90s, completely forgetting that the x86 we have today is a very, very different beast. As Chips and Cheese details in this article, the differences between x86 and, say, ARM, aren’t nearly as big and fundamental as people think they are.
I’m a huge fan of computers running anything other than x86, not because I hate or dislike the architecture, but because I like things that are different, and the competition they bring. That’s why I love POWER9 machines, and can’t wait for competitive non-Apple ARM machines to come along. If you try to promote non-x86 ISAs out of hatred or dislike of x86, history shows you’ll eventually lose.
CISC ISAs have dense programs; RISC tends to have low-density programs… RISC then ends up implementing instruction compression, and you are back at square one, complexity-wise.
Thom Holwerda,
I like healthy competition too.
The arguments in favor of simpler ISAs aren’t strictly wrong; there are tangible benefits to simpler architectures. However, there are caveats that preclude a straightforward conclusion:
1. When one party has a money/fab advantage, as Intel has, alternative ISAs may have a difficult time winning any contests regardless of their technical merit.
2. Many ISA-related bottlenecks (x86’s expensive instruction prefetch/decode properties in particular) can be offset simply by throwing more transistors at the problem. This may involve adding more cache, more parallel decoders, etc., all of which is proven to work. However, there is one metric where the additional transistors can hurt an architecture badly: energy efficiency. The very same transistors used to convert CISC instructions for a faster-performing RISC-like core implicitly need more power.
I think these points are borne out in empirical evidence from industry. When it comes to power efficiency (and related cooling requirements), x86 is disadvantaged compared to competing ARM processors. This is perhaps the one metric that x86 can never win by adding more transistors. Perhaps more significantly though, Apple has shown that Intel is vulnerable on the performance front too when a competitor is able to produce ARM CPUs on cutting-edge fabs.
Of course this doesn’t strictly mean Intel/x86 are going away; there are still very strong network effects in play. Most of us still use x86 for various reasons that have absolutely nothing to do with the architecture. Frankly, the market for commodity ARM devices leaves a lot to be desired. x86 still has the best compatibility, flexibility, economies of scale, etc. Regrettably, generic Linux support on ARM still sucks.
Oops, comment is in the wrong spot.
Indeed. AMD had a tough time competing with the same ISA because of Intel’s fab advantage. Intel forgot this, and lost their fab advantage.
True. Efficiency isn’t something x86 has usually prized. There have been times when it’s been important – Intel’s Atom and Pentium M are two examples – but it usually gets sacrificed on the altar of marketing needs and sales numbers.
That might be changing though, or at least we’re at another point where efficiency is a priority. The big news is AMD and Intel both releasing chips with “efficiency” cores (again, in Intel’s case) – some in a big.LITTLE arrangement, and some in chips that are 100% efficiency cores.
The two approaches are very different. AMD’s Zen 4c cores are compacted versions of its Zen 4 cores, while Intel’s Gracemont cores are a distinct design from its big performance cores.
Flatland_Spider,
I agree. Efficiency is rather important for mobile. The efficiency cores help, but obviously perform much worse. While there are always engineering tradeoffs involved, these are somewhat amplified on x86 because of its additional complexity compared to ARM.
Yes, it’s interesting to compare the different approaches. In the long run though, I expect more intensive applications to rely less on the CPU; GPUs are simply more scalable. Of course, the problem is that we have all this software built around sequential high-level languages. More developers will either need to pick up GPGPU programming, or compilers could take an evolutionary jump by compiling conventional sequential software into GPGPU-compatible binaries. That would really be revolutionary.
We could really do with a new Grace Hopper.
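Something in this direction already exists in compiler toolchains. As a minimal sketch (assuming an OpenMP-offload-capable compiler such as clang or gcc built with GPU offloading; the file and function names are just illustrative), a sequential-looking C loop can be handed to a GPU with a single directive:

```c
/* gpu_axpy.c -- a sequential-looking loop that an OpenMP-offload-capable
 * compiler can turn into a GPU kernel. Illustrative sketch only. */
#include <stdio.h>
#include <stdlib.h>

void axpy(float a, const float *x, float *y, int n) {
    /* Ask the compiler to map the arrays to the device and spread the
     * iterations across GPU teams/threads. Without offload support the
     * pragma is ignored and the loop runs sequentially on the CPU. */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1 << 20 };
    float *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    axpy(3.0f, x, y, N);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    free(x); free(y);
    return 0;
}
```

Built with something like `clang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda gpu_axpy.c`, the loop body becomes a GPU kernel; without the offload flags, the exact same source runs as a plain sequential loop.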
On a comparable node, the power-efficiency difference between ARM and x86 cores at a similar performance level is negligible.
Instruction decode hasn’t been a major limiter to performance in a very long time.
Xanady Asem,
That’s true in some ways, but when it comes to code density the ISA still matters if the software doesn’t fit in cache. And while in principle you can add tons of cache, that costs energy. A better-optimized ISA clearly helps here.
That is why RISC machines have traditionally tended to need slightly larger I-caches next to the core.
Also, code density is not a metric that correlates strongly with power consumption.
Xanady Asem,
Not in and of itself, but if the CPU designer’s goal is to offset the performance overhead incurred by a slower and more complex prefetcher, then it takes more transistors with a higher duty cycle, which does translate into more power – versus a more efficient ISA that requires fewer transistors to be activated less often.
Of course the natural way to mitigate front-end overhead is with a micro-op cache, but that also uses power and is only so large in practice – a few thousand micro-op slots versus megabytes of L3 cache.
https://www.anandtech.com/show/14514/examining-intels-ice-lake-microarchitecture-and-sunny-cove/12
https://www.anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x-and-ryzen-5-7600x-review-retaking-the-high-end/8
Naturally it depends on what you’re running, but a program that doesn’t fit in the micro-op cache will have to rely on the prefetcher and decoders working continuously, which clearly translates into more power if they are more complex and/or have to process longer code. In the case of x86, both may be true. Ideally we’d fix these sources of overhead, but in a way x86 is “too big to fail”, haha.
That all depends on what you mean by “efficient”
RISC encoding is more “efficient” in terms of simplified decoder logic, but it adds more pressure on the I-cache, which in turn requires more DDR transactions, which are expensive in terms of energy consumption and latency. And modern high-performance RISC machines also have their own micro-op caches, since they too break instructions down into smaller internal-representation packets to execute out of order.
Whereas x86 puts more pressure on the decoder, indeed. But it also requires fewer instruction-fetch transactions for a similar compute kernel. Also, prefetching has been part of the x86 ISA/architecture since the original 8086.
But in the end, optimizing something that represents a tiny share of the power budget in a modern core, compared to the rest of the out-of-order structures, doesn’t really make much difference in the grand scheme of things.
Xanady Asem,
Not necessarily. We can probably agree that the micro-op cache becomes more important for ISAs with worse density, but it’s not automatically true that the more complex ISA has the higher density.
Yes, although most of that pressure is probably already absorbed by the L1/L2/L3 caches. Regardless, I agree that a more dense ISA requires less memory and fewer transactions.
The problem with x86 is that it’s not an efficient encoding despite being complex.
You agree it depends on the workload?
As the computation becomes more intense with heavy math in tight loops, large vector processing, etc., it’s easy to see why the percentage going towards decoding becomes smaller, but the converse is also true. Workloads that consist of long streams of basic instructions (like scripting/PHP/SQL) can increase the load on the instruction decoder. In the absolute worst-case scenario the computation might have to block waiting for instructions to decode, with the decoder doing most of the work. Workloads that experience this will end up suffering much more from decoder overhead, both in terms of energy and performance.
I said scripting thinking of a server running compiled PHP, but I should be careful with the way I say it, because a small script interpreter could actually fit into the uop cache and not place a load on the CPU’s instruction decoder. Interpreted scripts are only “data” from the CPU’s perspective.
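As a toy illustration of the two extremes being described (a sketch, not a benchmark; the function names are made up), here is roughly what the contrast between a uop-cache-friendly hot loop and a long straight-line stream of simple instructions looks like in C:

```c
/* front_end_extremes.c -- toy contrast between the two workload shapes
 * discussed above. Illustrative only, not a benchmark. */
#include <stdio.h>

/* Hot loop: a handful of instructions executed over and over. After the
 * first pass the front end can serve it from the micro-op cache. */
long tight_loop(long seed, long n) {
    long acc = seed;
    for (long i = 0; i < n; i++)
        acc = (acc ^ i) * 3 + 1;
    return acc;
}

/* Straight-line stream: the same arithmetic "unrolled" by the preprocessor
 * into hundreds of distinct instructions with no reuse, so the decoders see
 * fresh instruction bytes the whole way through. */
#define STEP(x)   do { acc = (acc ^ (x)) * 3 + 1; } while (0)
#define STEP4(x)  STEP(x); STEP((x) + 1); STEP((x) + 2); STEP((x) + 3)
#define STEP16(x) STEP4(x); STEP4((x) + 4); STEP4((x) + 8); STEP4((x) + 12)
#define STEP64(x) STEP16(x); STEP16((x) + 16); STEP16((x) + 32); STEP16((x) + 48)

long straight_line(long seed) {
    long acc = seed;
    STEP64(0); STEP64(64); STEP64(128); STEP64(192);   /* 256 unique steps */
    return acc;
}

int main(void) {
    printf("%ld %ld\n", tight_loop(1, 1000000), straight_line(1));
    return 0;
}
```

At -O2 both compile to simple integer code, but the first presents the front end with the same few instructions millions of times, while the second forces it to keep fetching and decoding new instruction bytes for every step.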
cb88,
That could be true in theory, but as a blanket generalization it’s flawed because it assumes CISC ISAs are optimally encoded. It’s technically possible to design a CISC architecture that is not dense at all. Unfortunately, x86 features like instruction prefixes increase decode complexity while simultaneously hurting instruction density.
I’ll pull some examples from an old discussion…
https://www.osnews.com/story/136297/the-complex-history-of-the-intel-i960-risc-processor/#comment-10430953
I thought the difference was largely explained away by AMD64 being 64-bit.
Flatland_Spider,
It’s true that AMD64 code is quite a bit larger than 32-bit x86 code. But when comparing 64-bit x86 to 64-bit ARM, I usually find that the ARM output is smaller. x86 prefixes can end up using whole bytes just to convey a couple of bits of information. If the world weren’t so addicted to the x86 ISA, in principle a more optimal ISA could simplify prefetching and reduce latency while also improving code density.
Without knowing the details of the compiler, optimization level, what is linked, etc., those numbers are not really normalized, so we don’t know whether the size difference in the executables is due specifically to inefficiencies of the ISA or to a bunch of other factors.
E.g. the x86_64 binaries could include several code paths for different x86_64 revisions, like different SIMD support levels (from SSE to AVX, for example), whereas ARM may only support NEON (at least in the case of OSX).
That would be my theory too. Though once you are using AVX instructions and up, the x86 instructions are also longer than ARM instructions.
The really long instructions are rarely used though.
A big chunk of the computing kernels between scalar architectures tend to use the same mixture of instructions (e.g. the percentage distribution between branching ops, load/store ops, and compute ops tends to be remarkably constant).
Different architectures’ ABIs do things differently when generating a binary, in terms of what is linked statically vs. dynamically, for example. And how they set up threads, handle memory-ordering management, save register state, etc. And as I said, different architectures will generate different numbers of code paths to support different releases, functional units, etc. Some architectures’ ABIs will even go as far as supporting “dummy” speculative threads to warm up caches. And the standard libraries all have different sizes depending on the architecture. Never mind stuff like Boost (or even Metal compute in OSX).
The only way to compare properly is to create a fully controlled kernel specified in pseudocode, and then implement it either in assembler or in a very strict subset of C (mainly not using any of the standard libraries), using the same toolchain and optimization settings.
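For what it’s worth, a minimal sketch of that kind of controlled kernel in a strict C subset (no standard library, no I/O; the file name and build commands below are just an example setup, not anything from the article):

```c
/* saxpy_kernel.c -- freestanding compute kernel for code-size comparisons. */
void saxpy(float a, const float *restrict x, float *restrict y, long n) {
    /* The same loop, built for each target with identical flags, so the
     * emitted .text mostly reflects how each ISA encodes the same work. */
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Building it with, say, `gcc -O2 -ffreestanding -c saxpy_kernel.c` on x86_64 and `aarch64-linux-gnu-gcc -O2 -ffreestanding -c saxpy_kernel.c` for ARM, and then comparing the `.text` sections with `size` or `objdump -d`, strips most of the per-architecture ABI and runtime baggage discussed above out of the comparison.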
Surprised to hear support for x86, given that almost all the patents are owned by Intel and how controlling of x86 Intel has been while partnered with Microsoft.
That’s never been something most people care about. Some people care, but Intel can safely ignore them.
Don’t the patents expire?
Yup. Patents expire after 20 years, which means that all patents relevant to the original AMD64 as released in 2003 have expired.
Yeah, but this is why every few years AMD and Intel generate new revisions to X86_64 😉
Also, copyright now is a 100 year deal or so. And the x86 microcode (and firmware) falls under copyright protection laws.
That being said, nothing stops organizations from emulating x86 in software. So you can do something similar to what Transmeta did and implement your own ISA, then just run a small translation kernel that converts x86 binaries to your internal ISA. Apple is doing a glorified version of that with Rosetta, running virtual x86 cores on demand. The application software, and even some system software, is none the wiser. And if you tune your underlying microarchitecture to support the emulation use case, you can get decent speed (like Apple does with the M-series running x86 code).
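As a rough sketch of the translation-cache idea (a toy with an invented guest “ISA”; real translators like Transmeta’s CMS or Rosetta 2 obviously decode actual x86 and emit real host machine code):

```c
/* toy_bt.c -- skeleton of a Transmeta/Rosetta-style dynamic binary
 * translator: look up the guest program counter in a translation cache,
 * translate on a miss, then execute the translated block. The "guest ISA"
 * and the canned translation below are invented purely for illustration. */
#include <stdint.h>
#include <stdio.h>

#define TCACHE_SLOTS 4096
#define REG_PC 15   /* pretend guest register 15 is the program counter */

typedef void (*host_block_fn)(uint64_t *regs);

struct tcache_entry {
    uint64_t guest_pc;        /* address of the guest basic block      */
    host_block_fn host_code;  /* translated host code for that block   */
};
static struct tcache_entry tcache[TCACHE_SLOTS];

/* Stand-in for the expensive part: a real translator would decode the guest
 * instructions at guest_pc and emit host machine code. Here every block
 * "translates" to the same canned function, which increments r0 and either
 * advances the guest PC or halts (PC = 0). */
static void canned_block(uint64_t *regs) {
    regs[0] += 1;
    regs[REG_PC] = (regs[0] < 10) ? regs[REG_PC] + 4 : 0;
}
static host_block_fn translate_block(uint64_t guest_pc) {
    (void)guest_pc;
    return canned_block;
}

static host_block_fn lookup_or_translate(uint64_t guest_pc) {
    struct tcache_entry *e = &tcache[(guest_pc >> 2) % TCACHE_SLOTS];
    if (e->host_code == NULL || e->guest_pc != guest_pc) {  /* miss */
        e->guest_pc = guest_pc;
        e->host_code = translate_block(guest_pc);
    }
    return e->host_code;                                    /* hit */
}

int main(void) {
    uint64_t regs[16] = {0};
    regs[REG_PC] = 0x1000;                 /* guest entry point */
    while (regs[REG_PC] != 0)              /* dispatch loop */
        lookup_or_translate(regs[REG_PC])(regs);
    printf("guest r0 = %llu\n", (unsigned long long)regs[0]);
    return 0;
}
```

The point of the structure is that the expensive translate step runs once per guest block; the hot path is just a table lookup followed by a jump into already-translated host code, which is how the translation cost gets amortized.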
Do not confuse the amd64 arch & ISA (yes, x86 will die in the next ten years; even Debian will drop support) with the actual chips: we’re all basically running very advanced RISC CPUs that sort of emulate the amd64 specs in hardware. I believe the last true incarnation of CISC x86 was the Intel 486 (SX/DX). The Pentium was a different beast altogether. Then came AMD’s Opteron, the Itanic sank, and a new world was born.
God I LOVE history! 🙂
Not sure about that. My memory is bad, but I recall the Pentium Pro and newer being more of a shift away from a purely CISC design.
The RISC/CISC distinction became meaningless, precisely after the Pentium Pro.
What some people think makes modern intel chips “RISC” has nothing to do with RISC at all.
That is, things like superscalar execution, out-of-order execution, branch prediction, caches, pipelining, SMT, etc. have nothing to do with RISC or CISC instruction encoding. And the vast majority of transistors in modern cores go into implementing those features, which is why modern cores look basically alike when targeting a specific performance bracket.
Erm.
How a modern 80×86 CPU works is that the front end starts with CISC instructions, then adds dependency information to them, adds “program order” information to them (so “out-of-order” execution can be made to emulate “program order” at retirement), adds other hints to them (e.g. which logical CPU they came from), replaces “architecturally visible register names” with internal registers, and sometimes uses “micro-op fusion” to turn two original instructions into a single more complex thing; the final result is “significantly more complex than CISC” micro-ops that have nothing even slightly “reduced” or “RISC” about them whatsoever.
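Purely as an illustration of the kind of bookkeeping being described (field names invented here; real micro-op formats are proprietary and considerably messier):

```c
/* A made-up sketch of the metadata a front end attaches to each micro-op;
 * the fields mirror the items listed above, nothing more. */
#include <stdint.h>
#include <stdbool.h>

struct micro_op {
    uint16_t opcode;          /* internal operation, not the x86 opcode    */
    uint8_t  dest_preg;       /* renamed physical destination register      */
    uint8_t  src_preg[2];     /* renamed physical source registers          */
    uint32_t rob_index;       /* program-order tag used at retirement       */
    uint8_t  logical_cpu;     /* which SMT thread the op came from          */
    bool     fused;           /* produced by fusing two adjacent x86 insns  */
    bool     waiting_on_src;  /* dependency tracking for the scheduler      */
};
```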
Sadly; RISC was always a synonym for “academics who can’t handle the complexity necessary to compete, who need to double the clock frequency to compensate for needing more instructions to do the same work” and when those stupid academics found out that you can’t increase clock frequency forever (because nobody wants to have their face melted off by the waste/heat) they started trying to redefine what RISC means (to “load-store architecture” or “fixed length instructions” or … anything except “reduced instruction set”). One of the many creative redefinitions from stupid people who failed is that RISC means “not microcoded”; which is an extremely misguided attempt to pretend that something that almost never happens (e.g. maybe once in several million instructions) is an important defining characteristic.
Mostly; as someone who claims to love history, you probably shouldn’t be spreading deluded fabrications as some kind of alternative to history.
Then came AMD’s Opteron, the Itanic sank, and a new world was born.
8086 was CISC. Pentium made CISC more complex. There was a massive amount of progress (introduction of SIMD, multi-core, hyper-threading, power management, performance monitoring, …) which made the CISC even more complex. Then AMD’s Opteron moved the memory controller onto the same chip as the CPUs (leading to lower latency RAM, higher performance, and NUMA scalability for servers); but they also extended the old 32-bit stuff to 64-bit while changing nothing fundamental (all still “more complex than CISC”). During that time Itanium sucked and died because lots of software developers wouldn’t pay for absurdly over-priced and unobtainable test machines/workstations (and a computer without software is worthless); but it’d be foolish to think this failure had anything to do with “more complex CISC” beating “EPIC that is simpler than CISC”.
Those stupid academics taking the time to research, develop, model, and study quantitatively microarchitectural matters in a peer reviewed fashion! That’s silly!
It’s more like, “beginners with no experience trying to cram 10 years of work into the 6 months at the end of their degree, while being led by a hairy dude that failed to get a real job and learnt everything they know from a textbook that was 10 years too old when they read it 10 years ago; while they ignore everyone with actual real-world experience because they think $X of student debt (spent mostly on accommodation and beer) means that what they actually learnt is worth $X.”
Sounds like a you problem, good luck.
People can, but they need to flood the market with cheap boards which people can easily work on. RPi, for all their problems, have this figured out. People need to be able to afford dev boards, and the dev boards need to be supported to allow anyone and everyone to work with them.
Power still has this problem. They’re designed for mainframes and IBM servers. Raptor can sell boards, but they’re expensive.
The companies in charge of those ISAs were vertically integrated hardware companies who wanted to charge eye watering prices for their products. The x86 crew was willing to sell to anyone, and they got lucky with the BIOS accidentally becoming a standard.
x86 was not as much of an accidental standard as people think.
x86 was a bit like the “open” architecture of its time during the 8086 to 80286 period (and the 8080 before it). Intel was THE microprocessor company, as they had more or less created the market. So there was a relatively larger ecosystem around Intel ISAs – in terms of development kits, compilers, OSes, etc. – than around just about any other micro by the late 70s.
There were also quite a lot of second sources selling their own 8086/80186/80286 clones, which in some cases had their own extensions/improvements and/or were faster than the Intel part. This in turn created a very efficient supply chain around that specific architecture. And having multiple vendors also led to x86 parts being significantly cheaper and more abundant than most of their competitors.
The reason why IBM chose Intel over Motorola (even though the M68000 was arguably a better CPU than the 8086) was that it was a much quicker path to market in terms of design complexity (lots of documentation/support for Intel, less so for Motorola), lots of second sources guaranteeing supply volume and low prices, and a strong software ecosystem for ramping up OS and application development.
During the 70s and a big chunk of the 80s, the 80×86 was almost the de facto “open” CPU standard (with obvious caveats compared to what we now consider an “open” CPU standard).
The cloning of the BIOS was indeed another accidental benefit that further entrenched x86.