One interesting aspect of a computer’s instruction set is its addressing modes, how the computer determines the address for a memory access. The Intel 8086 (1978) used the ModR/M byte, a special byte following the opcode, to select the addressing mode. The ModR/M byte has persisted into the modern x86 architecture, so it’s interesting to look at its roots and original implementation.
In this post, I look at the hardware and microcode in the 8086 that implements ModR/M and how the 8086 designers fit multiple addressing modes into the 8086’s limited microcode ROM. One technique was a hybrid approach that combined generic microcode with hardware logic that filled in the details for a particular instruction. A second technique was modular microcode, with subroutines for various parts of the task.
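As a rough illustration of what that byte encodes, here is a minimal sketch in Python (not taken from the article; the helper name and output format are just for illustration, while the field split and the 16-bit addressing table follow the standard 8086 encoding):

```python
# A minimal sketch of how a ModR/M byte splits into its three fields and which
# 16-bit addressing mode the memory forms select.

RM_BASES = ["BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"]

def decode_modrm(byte):
    mod = (byte >> 6) & 0b11   # 00/01/10 = memory operand, 11 = register operand
    reg = (byte >> 3) & 0b111  # register number (or opcode extension)
    rm  = byte & 0b111         # base/index combination for the memory operand
    if mod == 0b11:
        return mod, reg, f"register {rm}"
    if mod == 0b00 and rm == 0b110:
        return mod, reg, "[disp16]"            # direct-address special case
    disp = {0b00: "", 0b01: "+disp8", 0b10: "+disp16"}[mod]
    return mod, reg, f"[{RM_BASES[rm]}{disp}]"

print(decode_modrm(0x46))  # (1, 0, '[BP+disp8]')
```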
This is way above my pay grade, but I know quite a few of you love this kind of writing. Very in-depth.
I love Ken’s articles, he’s doing a whole series of in-depth 8086 dissection. It gives a fascinating look into what makes the 8086 tick.
This actually shows how well designed the x86 instruction set is. The instructions are short but versatile, allowing many common uses by default while still covering corner cases.
From the article:
This is talking about two additional bits in the instructions themselves: a direction bit (D) that lets instructions work as either [MEM], REG or REG, [MEM], and a width bit (W) that selects 8- or 16-bit operands (see the sketch below).
And that same mode-register-memory encoding is still used in modern x86-64 as well:
https://en.wikipedia.org/wiki/VEX_prefix, but extended to allow many more registers (16 base + SIMD).
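To make the D and W bits above concrete, here is a tiny sketch (the helper name is just for illustration; the bit positions follow the published 8086 MOV/ALU encodings):

```python
# Illustrative only: the D (direction) and W (width) bits sit in the low two bits
# of many 8086 opcodes, such as the register/memory MOV and ALU forms.
def describe_opcode_bits(opcode):
    d = (opcode >> 1) & 1  # 0: REG field is the source, 1: REG field is the destination
    w = opcode & 1         # 0: 8-bit operands,          1: 16-bit operands
    return ("reg -> r/m" if d == 0 else "r/m -> reg", 8 if w == 0 else 16)

print(describe_opcode_bits(0x89))  # MOV r/m16, r16 -> ('reg -> r/m', 16)
print(describe_opcode_bits(0x8B))  # MOV r16, r/m16 -> ('r/m -> reg', 16)
```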
sukru,
x86 certainly achieved the goals its engineers set out for it, but the original engineers never intended for x86 to last as long as it has. While x86 has been continually extended over the decades, it requires instruction prefixes and a multitude of CPU modes to make up for its original shortcomings. This increases capabilities, but it’s not without cons. Evolving this way has created unnecessary bloat and complexity, worse code density, a need for larger caches, increased instruction decode latency, fewer parallel instruction decoders (compared to, say, Apple’s M1), more power consumption, etc… I don’t blame any of these problems on Intel’s original engineers, but at the same time I think they’d agree x86 has grown long in the tooth. We’re throwing lots of money, transistors, and power budget at x86 just to keep software compatibility, but arguably sticking with x86 has imposed a long-term opportunity cost on the microprocessor industry.
I appreciate many of the technical advantages of modern architectures like ARM, and I’m extremely impressed with the progress ARM is making in terms of efficiency and performance. The dilemma, for me anyway, is that the experience of supporting ARM products can be so much worse than x86.
Alfman,
True, if x86 were designed today, it would be a much cleaner architecture; 64-bit ARM, by comparison, is cleaner.
However,
https://www.reddit.com/r/intel/comments/skmgae/intel_i9_13900k_raptor_lake_s_die_annotations/
The instruction decoder is only a very small part of newer Intel dies. And after translation, the CPU uses RISC-like instructions internally. So, even though there is some “cruft”, I am not sure it has a major effect. (Except for the decode queue length, and being unable to scale down to mobile.)
Anyway, ARM manufacturers still insist on being incompatible with one another. As long as that is the case, x86 will stay relevant. (Can I install Windows on an M2, or Linux on an SQ2?)
sukru,
Well, I’d say that die area is not actually what makes x86 instruction preprocessing costly. It’s really the instruction decode latency and limited instruction decode depth that we’re combating rather than die area. Consider a hypothetical extreme to clarify this point: a CPU that spends 0.01% of its die area on prefetch and has a 100-cycle delay, versus a CPU that spends 10% of its die area on prefetch with only a 1-cycle delay. All else being equal, the instruction latency and decode depth may factor far more into bottlenecks than die area.
The obvious solution is to add more uop cache to compensate for instruction decoder bottlenecks, but this cache is extremely limited, and to whatever extent you manage to increase the uop cache on x86, a simpler ISA could be given the same amount of cache and still hold the advantage. Also, die area isn’t strictly proportional to energy consumption. AVX integer & floating point units take up a lot more transistors & die area than prefetch logic, but you have to consider duty cycle. A prefetch unit under full load may end up being responsible for a greater percentage of power consumption than its die footprint would otherwise suggest.
Vector optimized applications are probably the best case scenario for x86, but on the flip side there are lots of scenarios like databases and web hosting that don’t typically make good use of things like AVX and on top of this there’s extremely high context switching, which puts more pressure on the decode units.
I know. Still, I’ve wanted to replace my x86 servers with ARM servers for decades. Besides the OS support problems, the consumer grade ARM CPUs & boards available to the public haven’t been up to the task. For a long time all server grade ARM chips were being custom fabbed for amazon/google/etc. Thankfully ARM vendors are slowly showing up, but I’m feeling price shock next to commodity x86 servers – it feels weird that ARM carries such a premium price. So it still seems like affordable ARM servers may be a ways away for me.
Oh, I did manage to order a rock64 this week. As a 6 year old SBC with 2gb ram, it’s slow and not really what I wanted. But seeing as it’s out of stock again and I couldn’t get anything else I should be happy I got anything at all, haha. I was close to buying one of the mini pcs we were talking about before if I couldn’t get anything else.
Alfman,
This used to be $600, just last year:
https://shop.solid-run.com/product/SRLX216S00D00GE064H09CH/
And yes, decoding depth is a problem. But that is a general issue with CISC architectures. I am not sure it can be solved as well as with a fixed-size RISC instruction set.
sukru,
Yeah, also add the chassis, power supply, memory, disks, etc…
I built a 24-core Xeon rack mount for a client; I’d like the chance to build an ARM server that ups the ante. Something like a 64-core Graviton, but AFAIK those kinds of ARM CPUs are not available outside of custom fab requests.
https://en.wikipedia.org/wiki/AWS_Graviton
Agreed. This didn’t use to matter much in the original days of the IBM PC, but the variability of instruction lengths isn’t ideal today, when we want to fetch several instructions in parallel and it’s difficult to find out where they actually start.
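A toy sketch of that point (made-up instruction lengths, just to show why variable-length encoding makes parallel fetch harder):

```python
# Toy illustration: with a fixed 4-byte encoding, the Nth instruction starts at 4*N,
# so several decoders can start in parallel. With a variable-length encoding you only
# know where instruction N+1 starts after working out the length of instruction N.
def fixed_length_starts(count, size=4):
    return [i * size for i in range(count)]

def variable_length_starts(lengths):
    starts, offset = [], 0
    for length in lengths:   # hypothetical per-instruction lengths (1..15 bytes on x86)
        starts.append(offset)
        offset += length
    return starts

print(fixed_length_starts(4))                # [0, 4, 8, 12]
print(variable_length_starts([1, 3, 6, 2]))  # [0, 1, 4, 10]
```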
@ sukru
You’re correct; a lot of people are still stuck back in the 80s/90s when it comes to CPU architecture, so they regurgitate a lot of old “memes” from the ancient usenet flame wars between nerds. They don’t understand how massively things have changed in the past two decades, especially with the introduction of aggressive out-of-order archs. Decode hasn’t been a major limiter in ages, as dynamic HW schedulers and predictors have absorbed most of the penalties/deficiencies in instruction encoding/programming model/register budgets/etc., plus complementary SW approaches like thread-level speculation, soft pipelining, etc.
The A in ISA is silent these days. Microarchitectures are for the most part decoupled from the final interface to the programmer. Regardless of RISC or CISC, very little of the underlying architecture is as exposed to the programmer as it was back 40 years ago, when ISAs had a stronger role in defining/exposing the actual architecture.
The Firestorm core in the Apple M1, for example, was a more “complex” architecture than the contemporary Zen 3 uArch: the ARM fetch-and-decode block in Firestorm is significantly larger in area/resources than the x86 front end in Zen 3. In fact, Zen was initially developed with the ability to execute either x86 or ARM instructions, which should give an idea of how decoupled the ISA has become from the uArch.
Also, the more proper way of seeing the internal instructions is as nano-ops, as both modern high-performance x86 and RISC (POWER, SPARC, ARM) uArchs break instructions down internally to feed them between the Fetch and Execution engines.
These underlying architectures are superscalar/OOO/etc, which a lot of people seem to equate with RISC. But that is not the case, as x86 for example implemented some of these arch concepts earlier than competing RISC products.
A lot of people love to rant on x86 by throwing around stuff they heard somewhere. But it’s actually a very complete architecture, and it supports more programming models than most of its competitors. The whole “bloat” argument tends to look nonsensical when we’re dealing with an issue that takes <3% of the overall transistor budget. It’s actually remarkably low overhead in order to keep compatibility with a 50-year-old SW library.
javiercero1,
We understand what a microarchitecture is, but it’s still a relatively small state machine, and unless you’re fortunate enough to have your software (plus OS) fit inside that state machine, it’s going to be affected by the ISA’s decode latency and parallelism. That’s when you appreciate a simpler ISA.
Nobody is saying a subpar ISA is the end of the world for x86, but it simply wouldn’t make sense to choose x86 today if we had the opportunity to start over with the benefit of hindsight and without the need for backwards compatibility. The benefits of consistent instruction sizes are clear.
@ Alfman
Thank you for letting me know that you do not understand what a microarchitecture is.
javiercero1,
Obviously that’s not true, but once again you are more focused on grudges and diminishing others than the topic at hand. Don’t you get tired of that? I like discussing technology, it’s why I’m here, but this whole “I’m better than you” facade seems petty to me. Care to actually articulate a fair response to my points?
javiercero1,
I am not sure saying Alfman “does not understand” is the right thing.
We are all here to discuss technology, and I can see you are knowledgeable, as is Alfman, but we all have our biases and favorites.
Blah, blah, blah. Here comes the usual Alfman argumentum ad victimum.
This is like the nth time I’ve tried to tell you that instruction encoding (and decoding, for that matter) hasn’t been a limiter in DECADES. You are basically going on about a long-division trick from high school algebra in a conversation about quantum electrodynamics. But you don’t realize that, because you know so little about the matter that you do not realize how little you know.
When it comes to x86, the decoder is not as complex as you think compared to the rest of the uArch, especially in terms of area and power budget. The corner-case instructions that drop to uCode, which you obsess about, are very rare, so they rarely even register as a performance reduction. And the common cases are well understood and were optimized eons ago.
Trace and uOp caches absorb most of the decode latencies, and the rest are taken care of by the speculative out-of-order elements. A RISC decoder has to transfer complexity elsewhere in order to realize the same bandwidth of nano-ops being served to the Execution Engine. So the area for the internal decoder caches in x86 translates to similar SRAM budget increases for the I-Cache and Branch Predictor, and a larger number of ports for the prefetcher registers. There is no free lunch to get the same retire bandwidth.
Plus in either case, regardless of instruction encoding, you end up having to break the program interface instructions into internal nano-ops. So the vast majority of a modern high performance microarchitecture is not defined by the instruction encoding at all. The ISA in a modern processor family is just a standardized interface to the programmer, to maintain binary compatibility.
I am actually interested in discussing technology as well. I do CPUs for a living, so I am really interested in having educated exchanges with people who are also interested in the matter and who either have actual insights I can learn from, or are interested in learning in case my information is of interest to them.
You’re not one of them. All you do is retard the conversation with your “solutions” and “insights” about problems you don’t understand, and which were solved DECADES ago. If I had any interest in what you have to say, I’d just go to the usenet archives and waste my time reading some flamewar from 30 years ago.
javiercero1,
The problem is that x86 instructions incur complexity which increases decode latency and limits the prefetch depth that could otherwise be obtained with a circuit of similar complexity. A simpler ISA could decode more instructions faster and more efficiently, with less latency.
Maybe you don’t care about that, but my point is that it’s not ideal to have to deal with such needless complexity if there weren’t a backwards compatibility reason to do so.
That’s exactly what I’ve said above; I’m glad we agree on it. If the software has long-running loops that fit entirely in said caches, then yes, that helps mitigate decode latencies, but for larger stretches of code the ISA decoder obviously has to remain in the critical path. Considering the rather limited size of the uop cache, I’d say ISA prefetching bottlenecks can still affect performance regardless of the microarchitecture.
https://en.wikipedia.org/wiki/Alder_Lake
“μOP cache size increased to 4K entries (up from 2.25K)”
A simple ISA can use all the same microarchitectural back-end designs as a complex ISA, so I would assert there’s no microarchitectural reason to prefer an ISA like x86, with its complex, variable instruction sizes, given the opportunity to start from a clean slate with a more optimal ISA.
Ah, you were doing so well. We can and should have many interesting discussions, but what’s it going to take for you to get over this mean streak? What good does it do anyone, including yourself, to act this way? We disagreed over something once, BFD, get over it!
I would say the opposite. The x86 base is very poorly designed, as shown by arm being faster per clock cycle ever since 1981 onwards.
The ARM arch had several shortcomings, and being 26-bit (32 minus 6) did not help, but it made code execution so much faster that Intel could not compete. aa and aalc as well as ax* are all VASTLY faster per watt than anything Intel has at the moment.
Apple was the third investor, after the UK government and *undisclosed*, and is still, according to law, using the original Acorn patents.
How would that work? There was no ARM in 1981!
NaGERST,
I am not sure that is true.
First, real-life benchmarks show that in many (but not all) cases the M2/M1 Pro uses more power for the same task compared to recent Intel and AMD chips:
https://www.techspot.com/review/2499-apple-m2/ . And there is a node (5nm) advantage.
Additionally, we can observe that large datacenters have not completely switched to ARM, even for their own internal codebases. If there were a consistent low-power advantage, they would have switched before everyone else. (No legacy code, and electricity is the most expensive cost component.)
sukru,
That’s an interesting point you bring up, however I question the author’s approach…
I don’t think isolating the power consumption over and above idle is necessarily fair; I think it’s more fair to compare the CPU’s total power consumption. I appreciate that maybe the author didn’t have a practical way to accurately measure this in isolation from other components, but still, the approach taken could bias against a truly more efficient processor. Frankly, I would have preferred to see whole-system power consumption across the board (in which case I suspect Apple’s M1 ARM advantage might be more substantial).
Also, I wish he had run the numbers on how many watt-hours were used to complete the full benchmarks, not just watts. If one CPU uses 2X the watts of another CPU but completes the job in only 1/3 the time, then that CPU could technically be considered more efficient despite drawing twice the watts.
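To spell that out with made-up wattages (a back-of-the-envelope sketch, not measured data):

```python
# Hypothetical numbers, just to separate energy (watt-hours) from instantaneous watts.
# CPU A draws 50 W and finishes the benchmark in 60 minutes.
# CPU B draws 100 W (2x the power) but finishes in 20 minutes (1/3 the time).
energy_a = 50 * (60 / 60)    # 50.0 Wh
energy_b = 100 * (20 / 60)   # ~33.3 Wh
print(f"CPU A: {energy_a:.1f} Wh, CPU B: {energy_b:.1f} Wh")
# CPU B completes the same job on roughly a third less energy despite the higher draw.
```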
I think more datacenters might elect ARM servers if they weren’t so niche. As you surely know by now, I’m certainly interested in ARM servers but way more x86 kit is readily available and at commodity prices.
“The x86 base is very poorly designed, as shown by arm being faster per clock cycle ever since 1981 onwards.”
This is not even wrong.
What does “faster per clock cycle” even mean? You can’t measure speed “per clock cycle”. Also, “ARM being faster” is not a measure of x86 failure. The x86 was not poorly designed, as Ken’s articles clearly show; its designers were working within the constraints of the day. ARM was designed almost a decade later (not in 1981!), when new technologies were available and RAM was much cheaper. Also, ARM wasn’t “26-bit”; it was 32-bit but had a 26-bit address space (and the address space is never used to describe the bitness of a CPU). And we all know that Intel could compete just fine: when the ARM was released, Intel had the 80386, which was a perfectly fine CPU for its time. Oh, and “faster per watt” is just as nonsensical as “faster per clock cycle”. Maybe don’t start ranting about stuff you clearly know nothing about?
jalnl,
People can and do look at instructions per cycle, which is what I think NaGERST is referring to. It’s an important metric for superscalar cores and a sensible thing to talk about in principle. But we do have to be careful when comparing different architectures, because one architecture’s instructions might be doing more work than another architecture’s…
https://en.wikipedia.org/wiki/Instructions_per_cycle
I’m not really sure why you’re saying the concept is nonsensical. NaGERST’s post lacks specificity, and I’d definitely advise him to cite evidence with benchmarks for performance claims so that everyone can be on the same page. But nevertheless, performance per watt is not nonsensical at all; it’s actually very important for portable devices and even data centers.
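For reference, both metrics are simple ratios; a sketch with made-up numbers (none of these figures come from any real chip):

```python
# Made-up counter readings, just to pin down the two metrics being discussed.
instructions_retired = 8_000_000_000
cycles               = 2_000_000_000
ipc = instructions_retired / cycles              # 4.0 instructions per cycle

benchmark_score = 1000.0                         # some benchmark's units of work per run
energy_joules   = 250.0                          # package energy used for that run
perf_per_watt = benchmark_score / energy_joules  # work per joule == (work/sec) per watt
print(ipc, perf_per_watt)
```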