It’s well-known that you should measure the performance of your code, and not rely only on the opcode’s “cycle counts”.
But how fast is an IBM PC 5150 compared to a PCjr? Or to a Tandy 1000? Or how fast is the Tandy 1000 HX in fast mode (7.16 MHz) compared to the slow mode (4.77 MHz)? Or how fast is a nop compared to a cwd?
I created a test (perf.asm) that measures the performance of different opcodes and ran it on different Intel 8088 machines. I ran the test multiple times just to make sure the results were stable enough. All interrupts were disabled, except the Timer (of course). And on the PCjr the NMI was disabled as well.
There’s no point in any of these benchmarks, but that doesn’t make them any less interesting.
The only link in the article points to a large assembly file. Not the kind of source I was looking for 🙂
https://retro.moe/2018/03/04/performance-of-the-8088-on-pc-pcjr-and-…
Thanks. Interesting reading 🙂
… And now the article is updated as well.
To the core, basics. Interesting.
Noting that x86 hasn’t changed much. If one could make the instructions a bit more like a high-level language, a lot of DSP operations, say, would be clear candidates for combining several clocks into one. A 4-pole filter (one input, one output) could be combined into a one-clock operation as well. JSR could take arguments. And of course so could commonly used functions, even larger things that have just one input and one output. And this could even be done in parallel.
...one would not need C much at all.
C indeed seems to be a “mr. fix it”, for masochistic assembly language.
Edited 2018-07-01 06:41 UTC
Take into account the 680x0 or SH-4 assembly languages. OK, Intel was first on that market, hence made mistakes in engineering choices that the others didn’t, but the x86 “legacy” lasted a bit too long; even the 32-bit “protected” mode introduced with the 386 was a pain to use.
Looking at the key lines in the assembler code, e.g.:
times 1000 nop
what Ricardo is really measuring is the bus bandwidth of some early 8088-based PCs, not the performance of individual instructions.
This is because the 8088/8086 is basically designed as a pair of blocks, the Bus Interface Unit (BIU) and the Execution Unit (EU), and on the 8088 the EU is in some cases faster than the BIU can feed it.
The BIU prefetches instructions while the EU is busy, but each prefetch takes 4 cycles. On an 8086, the BIU can usually keep up with the EU because a 16-bit fetch takes 4 cycles and two single-byte instructions take 6 cycles to execute. So the BIU has a 6-byte buffer (4 bytes on the 8088) to store instructions while the EU is busy. This made sense in the late 70s, when highly microcoded processors were still in vogue, because the EU was expected to be slower.
But on the 8088, the BIU can’t keep up: fetching at 4 cycles per byte is slower than executing at 3 cycles per simple instruction, and for two-byte instructions it’s just as bad. That’s not the case for instructions that read or modify memory, because effective address calculations are wonderfully slow on the 8086.
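To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is not taken from perf.asm; the function name and the "slower of the two units wins" rule are my own simplification, and it ignores the prefetch queue warm-up, DRAM refresh and wait states.

```python
# Rough model (an assumption, not a cycle-exact simulation): a long run of
# identical instructions is limited by whichever is slower, the EU executing
# them or the BIU fetching their bytes.
BUS_CYCLES_PER_BYTE_8088 = 4      # 8-bit bus: one byte per 4-cycle access
BUS_CYCLES_PER_BYTE_8086 = 4 / 2  # 16-bit bus: two bytes per 4-cycle access


def cycles_for_run(count, size_bytes, eu_cycles, fetch_cycles_per_byte):
    """Estimate cycles for `count` back-to-back copies of one instruction."""
    eu_total = count * eu_cycles
    fetch_total = count * size_bytes * fetch_cycles_per_byte
    return max(eu_total, fetch_total)


# `times 1000 nop`: a nop is 1 byte and nominally 3 cycles.
nop_8088 = cycles_for_run(1000, 1, 3, BUS_CYCLES_PER_BYTE_8088)
nop_8086 = cycles_for_run(1000, 1, 3, BUS_CYCLES_PER_BYTE_8086)

print(nop_8088)  # 4000 -> fetch-bound, about 4 cycles per nop
print(nop_8086)  # 3000 -> execution-bound, the nominal 3 cycles per nop
```

Under this model the 8088 result depends only on how long a bus access takes, which is why a run of nops ends up measuring the memory system rather than the instruction.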
And this is what makes the benchmarks fairly flawed. In real 8086 code there is a mixture of simple and complex instructions. Memory-accessing instructions occur about 39% of the time, and indexed addressing modes account for 67% of those, so indexed memory accesses make up about 26% of all instructions (0.39 × 0.67 ≈ 0.26).
https://www.cis.upenn.edu/~milom/cis501-Fall05/papers/X86-appendix-D…
[Page 17]
The 8086 takes 6 to 9 cycles to compute the effective address for these modes, which is enough time for another 2 instruction bytes to be read in. Hence whenever a memory instruction reaches the buffer, the buffer will tend to fill up while it executes:
http://stanislavs.org/helppc/instruction_timing.html
ADD CX,[BX+100] => buffer = [000000 1 1] [01 001 111] [0110 0100] [xxxxxxxx]
So, by the time the instruction has been processed, another two bytes will have been read in, which is likely either a pair of single-byte, 3-cycle instructions or a double-byte, 3-cycle instruction. Either way, the next instruction will execute at full speed.
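For anyone who wants to check the bytes of that ADD by hand, here is a small Python sketch of the encoding. The helper name and lookup tables are mine, and it only handles the one form used above (16-bit register destination, single base or index register, 8-bit displacement). The direction bit is set because the register is the destination.

```python
# 16-bit register codes used in the ModRM `reg` field.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
# ModRM `r/m` codes for single-register memory operands (with mod = 01).
RM16 = {"si": 4, "di": 5, "bp": 6, "bx": 7}


def encode_add_reg_mem_disp8(dest_reg, base, disp):
    """Encode ADD r16, [base+disp8] as opcode, ModRM and displacement bytes."""
    if not -128 <= disp <= 127:
        raise ValueError("displacement must fit in a signed byte")
    opcode = 0b000000_1_1  # ADD, d=1 (reg field is the destination), w=1 (word)
    modrm = (0b01 << 6) | (REG16[dest_reg] << 3) | RM16[base]  # mod=01: disp8
    return bytes([opcode, modrm, disp & 0xFF])


encoded = encode_add_reg_mem_disp8("cx", "bx", 100)
print(encoded.hex(" "))                       # 03 4f 64
print(" ".join(f"{b:08b}" for b in encoded))  # 00000011 01001111 01100100
```

So the instruction occupies three of the four queue bytes on an 8088, leaving one slot for whatever comes next.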
This is why Ricardo observes that multiplies and divides take the same time regardless of which computer is used – the buffer will completely fill up before the instruction has finished.
This means 8088 coding is very much an art:
http://www.jagregory.com/abrash-zen-of-asm/#chapter-3-context
http://www.jagregory.com/abrash-zen-of-asm/#chapter-4-things-mother…
http://www.jagregory.com/abrash-zen-of-asm/#chapter-5-night-of-the-…
Slightly faster though, from 133MHz Pentium, over IDT WinChip, Transmeta Crusoe and Efficeon, G4 Cube, G5, PS3, MacBooks, etc. https://www.youtube.com/watch?v=LMSIng6v-LU
That’s the question I have from this. What design attribute of the PCjr made it dramatically slower than the others?
The base RAM (64KB) is shared with the video. That changes the access time from 4 cycles (standard for the ROM and expansion RAM) to 6 cycles. So when running from the base RAM, every bus access takes 50% longer, which cuts bus throughput by about a third.
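As a quick sanity check on those numbers, here is the arithmetic in Python. The clock value is the usual 14.31818 MHz / 3 figure for a 4.77 MHz 8088; the function name is made up for the example, and it assumes one byte per bus access with no refresh overhead.

```python
CLOCK_HZ = 14_318_180 / 3  # about 4.77 MHz


def bytes_per_second(cycles_per_access, clock_hz=CLOCK_HZ):
    """Peak bytes per second on an 8-bit bus moving one byte per access."""
    return clock_hz / cycles_per_access


fast = bytes_per_second(4)  # ROM and expansion RAM, per the comment above
slow = bytes_per_second(6)  # PCjr base RAM shared with the video

time_increase = 6 / 4 - 1           # extra time per access
throughput_drop = 1 - slow / fast   # lost bandwidth

print(f"{fast / 1000:.0f} KB/s vs {slow / 1000:.0f} KB/s")  # 1193 KB/s vs 795 KB/s
print(f"{time_increase:.0%} longer per access")             # 50% longer per access
print(f"{throughput_drop:.0%} less throughput")             # 33% less throughput
```

Since fetch-bound code like a run of nops scales directly with access time, this is about the slowdown you would expect to see for it on a PCjr running from base RAM.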