I gave a talk in 2007 that explained 8088 Corruption in detail, and in that talk I explained that displaying FMV using CGA in graphics mode would be impossible. This is because CGA graphics mode uses 8x the amount of video memory that 8088 Corruption was handling. Even a simple calculation assuming 24fps video reveals that the amount of data needing to be updated per second (24fps * 16KB = 384KB/s) is outside of the IBM PC’s capability: CGA RAM can only be changed at a rate of 240KB/s, and most hard drives of the era operate at roughly 90KB/s. It sure felt impossible, so that’s why I said it.
This is amazing. I also have no idea under which category to file this, but I settled for this one.
This reminds me of something I did back in college, encoding video across a serial connection to an ancient laptop (http://myoldcomputers.com/museum/comp/ibm_ps_2_L40sx.htm). I hooked it up to a video camera and my roommate was quite surprised to see himself on the laptop. Ah, those were the days when computer science was for fun rather than for work.
Anyways, kudos to this guy for getting video to work on the 8088!
The art of squeezing the most from hardware is lost.
Why does a computer with GHz processors, wide-bandwidth memory, even faster graphics RAM, and hardware codec assist… still stutter? Because the software that sits (I said sits!) on this hardware is bloatware.
Hardware is cheaper than programmer’s time.
That’s often stated as a blanket statement, but of course it depends. Spending additional money for a developer to optimize an inefficient piece of code can be more expensive up front, but on the other hand inefficiency can lead to higher recurring costs in general (for example, a dedicated server is now needed whereas more optimal code could have run on a much cheaper VPS).
The other consideration is the asymmetry of scale between software developers and users: the developer’s time may be expensive, but the collective time/expense incurred by the thousands of people using the software can be even higher, such that it would have made more sense to optimize the software.
As an aside, it would be very interesting to find the correlation between software efficiency and greenhouse emissions
I agree 100% with your arguments; in fact, I hate bloatware and unoptimized software… but the truth is, nobody cares about optimization because hardware is cheap and writing good code isn’t considered worth the effort.
And this isn’t only a commercial-software problem… even in non-commercial open source projects, adding features is always more important/valuable than optimizing code.
It greatly depends on the type of software being developed.
At least in my field (Network security DPI), trying to shove 100+ Gbps in and out of a COTS server requires a lot of micro-optimization and even, *gasp*, hand-written asm code.
Granted, back when I started working in software development (90-something) the average developer was far more knowledgeable and capable than the average developer today. (Back when men were men and wrote their own device drivers.)
– Gilboa
One piece of hardware may be cheaper, but slow software distributed over millions of computers increases everyone’s cost.
Because you didn’t have malware and zero days back then? Not to mention all the security focused processes, sandboxes, hell even memory management. Back then you were running bare metal and if something fucked up? Oh well, hope you got the OS floppies and none of them are corrupt!
Folks can wax nostalgic about the so-called “good old days”, but I’ll take modern multitasking systems that don’t easily corrupt or get infected any day of the week, thanks ever so.
Personally I *really* miss messing endlessly with autoexec.bat and config.sys in order to create 5 billion different configurations that would work with whatever game or program you needed to run. Oh, and just for fun add QEMM and Stacker to the mix.
Damn, those were the good old days, eh?
Oh, the things we did as kids just for a few extra sprites in Wing Commander.
“Bloatware” is a handy keyword to filter out “Woo” in discussions pertaining to technology.
Dude, how bad do you have to screw up your OS installation to get stutter?
The linked article starts off with the author stating he’s been fascinated by digital video for decades, explains what he said in 2007 about not being able to do FMV on CGA, and then goes on to describe the two breakthroughs he discovered that made it possible for him to declare himself wrong.
Those breakthroughs are: frame deltas and buffered disk I/O.
By his own statements he brings his decades of digital video experience into question.
This is more the result of a common illness: usually only the latest things get studied, while the older and/or most basic things are forgotten, leading to the rediscovery of years-old techniques as new revolutions.
I was still impressed; it’s not easy getting an 8088 to do much. That said, had he really not realized that you don’t need to update the whole screen every time? That’s basic graphics programming; no one sane would ever tell you to update the whole thing at all times. I don’t know, that part just surprised me.
I disagree (respectfully!). This looks to me like the classic difference between realtime oldskool demo programming and the pre-emptive stochastic programming most people are more familiar with now.
If you have cycles to spare on a multitasking system, updating only as much as you need to makes sense. However, if your demo requires a consistent 50 FPS update with 100% CPU utilisation, you have to take the worst case scenario of full screen redraw as your default case.
Most classic demos, just like most games today (except FMV), always do a full-screen redraw (nowadays the effort is handled by the graphics card anyway).
At any rate, even though it’s a reinvention of modern video compression, I was also still impressed.
I think you’re confusing things. There was no way of making full-screen redraw the default case, as the CGA RAM couldn’t be written fast enough; it was an actual physical limitation, so he had to resort to partial writes only, and that’s exactly what I was surprised about: the fact that he had not realized before that you don’t even need to write the whole thing in a single go.
Okay, I better understand what you mean, and in this context I agree it might seem obvious to apply partial deltas in order to allow video rendering. I guess what I mean is that demos are generally more about consistent, low peak timing than about reducing the average across frames (i.e. overall optimisation). As such, the sort of techniques used for video rendering might not necessarily be those you’d expect to work well for demos.
Either way, what Jim Leonard has done is nothing short of amazing in my opinion!
But that was not basic graphics programming in the 8088 days. Back then, we still had computers that hooked up to TVs, and that /was/ a whole screen refresh, such as TVs still do to this day. So to be in the mindset that a full screen refresh was needed isn’t such a farfetched thing.
Still, a neat read.
Sorry, I wasn’t clear in what I wrote. Obviously deltas have been around since the 1980s, and any sane person would use them. The problem to be solved this time around was how to represent and play back deltas within the very limited amount of time and processing power available. The reason I did entire memory moves in the first production was because (I thought) there wasn’t enough CPU time to perform even two branches per delta. This time around, there still isn’t, so I resorted to outputting code to avoid branches.
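To make that concrete: the generated code for a frame ends up as straight-line stores with the offsets and pixel words baked in as immediates, vaguely along these lines (an illustrative sketch only, not my actual encoder output; it assumes ES already points at CGA RAM and the direction flag is clear):
; one compiled frame delta, emitted by the encoder
mov di, 0120h ; first changed run in the frame
mov ax, 0AA55h
stosw ; write one changed word
mov di, 1FC2h ; next changed run: three identical words
mov ax, 0F0Fh
mov cx, 3
rep stosw
ret ; back to the player's per-frame loop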
Trixter,
After considering deathshadow’s input and the data from this link, it would seem the high bus overhead per instruction can significantly offset much of the benefit of using a more efficient encoding.
http://www.smallapple.net/blog/blog-2013-03.html
Using a tighter stream format might still be better assuming the disk is a bigger bottleneck, but I don’t know if the assumption is valid.
FMV would work, but it all comes down to how much data we have to throw away at the encoder to keep within the runtime constraints of the decoder. You’ve answered this question using an x86 data stream, but I’m still curious if a more compact data stream would have been better/worse. This is probably not something I’ll be able to find out without an implementation, and I don’t even have an 8088 computer to test on.
I looked into tighter stream formats, and even with Terje’s optimization of my RLE loop, compiled code output was still the tightest due to more than half of the typical list of changes consisting of deltas 1-2 bytes long with the (carefully chosen) source material I was using.
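To give a rough feel for why (a back-of-envelope illustration, not numbers from my encoder): a tiny delta can be emitted as a single direct store, which costs about the same number of stream bytes as a data record would, but needs no parsing, loop or branch at playback time. Assuming ES points at CGA RAM:
mov word ptr es:[1A40h], 0C3Ch ; 7 bytes of code, writes the one changed word
; the equivalent data record a player loop would have to parse:
; DW 1A40h, 0001h, 0C3Ch ; 6 bytes of data, plus per-delta loop/branch overhead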
I could always be wrong; if anyone finds errors in my calcs when I release the source in a few weeks, I’d love to hear about them.
Gosh no, that’s trivial. No, actually the breakthroughs were realizing ordered dithering would greatly increase visual fidelity without significantly increasing the number of deltas, and outputting code instead of data (the encoder is a compiler).
Trixter,
It’s neat that you are here! I’m curious, did you try using a conventional data format first, or did you take that route to scratch an itch (ie simply wanted to try it)? I’m not terribly familiar with CGA bit planes, but it seems that something like this might have worked without using an exotic x86 binary format.
event_loop:
; Todo: disk handling…
; Todo: keyboard handling…
; Todo: audio handling
; Todo: timing
next_span:
; [DS:SI] -> input stream pointer – 2
; [DS:BX] -> last byte in stream – 2 – max_span_size
; [ES:DI] -> output video buffer pointer
; AX is assumed to already hold the fill value used by the RLE store below
mov DI, [SI+2] ; DI = span position
mov CX, [SI+4] ; CX = number of words to copy from input stream
add SI, 6 ; [DS:SI] -> CGA pixel data
rep movsw ; copy CX words from input stream to screen
mov CX, [SI] ; CX = number of times to repeat AX
rep stosw ; repeat AX value on screen CX times
; [DS:SI] -> next span in input stream – 2
cmp SI, BX ; more data in stream?
jb next_span ; keep going while SI is still inside the stream
jmp event_loop
So the input data would look like this:
DW 0000h, 0001h, 0000h, 1FFFh ; set the whole 16KB screen to black (1 word copied + 1FFFh words repeated)
DW 0010h, 0002h, 1234h, 5678h, 0000h ; set eight pixels at position 10h to 1h, 2h, 3h, 4h, 5h, 6h, 7h, 8h
…
Maybe we could save a couple of cycles with unrolling. Now bear in mind this algorithm allows for arbitrary combinations of RLE and pixel copying in every span. This is probably unnecessary flexibility, so by grouping together common span lengths we could eliminate repetition of the length field, reduce the size of the data and save a few more instructions (a decode loop for that layout is sketched after the example below).
DW 0003h, 0002h ; 3 spans, 2 words of data each
DW 0000h, 1111h, 1111h ; 1st span at 0000h, data=1111h, 1111h
DW 0100h, 2222h, 2222h ; 2nd span at 0100h, data=2222h, 2222h
DW 0200h, 3333h, 3333h ; 3rd span at 0200h, data=3333h, 3333h
…
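A decode loop for that fixed-length layout might look something like this (an untested sketch, just to illustrate; it assumes DS:SI points at the frame data, ES at CGA RAM, and the direction flag is clear):
lodsw ; AX = number of spans in this frame
mov dx, ax
lodsw ; AX = words of pixel data per span
mov bp, ax
next_fixed_span:
lodsw ; AX = destination offset of this span
mov di, ax
mov cx, bp ; fixed span length, no per-span length field
rep movsw ; copy the span into video RAM
dec dx
jnz next_fixed_span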
A data representation like the one above would probably be more compact than x86 binary code. Another idea would be to allow a span to reuse bytes in the data stream that were previously sent and are still in RAM, which should be fairly inexpensive “compression” even on an 8088. A pattern that repeats over and over (especially vertical objects) wouldn’t need to be resent.
I consider everything I said above fairly “obvious”, so I assume you probably thought of all that already. However, a (possibly) novel idea would be to use probabilistic encoding. To consider the above encoding probabilistically, one might precompute the distribution of span lengths. The decoder would then always expect A spans of length X, B spans of length Y, C spans of length Z, etc. The data stream could omit span lengths entirely since the decoder would already have the distribution. The encoder would have to be written very differently: I have A spans of length X, B spans of length Y, C spans of length Z, how can I best arrange them to maximize the quality of the decoded image? This could save lots of DISK bytes while adding almost no decoder CPU overhead!
You could take this to another level and make the distributions themselves dynamic as different parts of the video might render better with different distributions…?
I have no idea which is the bigger bottleneck on the 8088 though, disk or CPU? Back when Stacker was around, it was said to make I/O faster because disk throughput was a bigger bottleneck than the CPU; however, this was many CPU models after the 8088.
Anyways, it’s a neat mental challenge and it’s cool that you followed through.
Uhm… why are you using offset calcs instead of LODSW? Once the BIU is figured into things doing a LODSW (1 byte, 16 EU clocks) followed by MOV DI,AX (2 bytes, 2 EU clocks) works out to 21 clocks; depending on how much time is free to fill the BIU that MOV DI,[SI+2] could take up to 29 clocks…. and if you did it for both, you could skip the ADD.
lodsw ; AX = span position from the stream
mov di,ax ; DI = destination offset in CGA RAM
lodsw ; AX = word count
mov cx,ax
rep movsw ; copy CX words from the stream to the screen
Would likely be faster… Though you seem to have a value in AX for christmas knows what.
Though I’d probably make frames progressive… and only use a byte for the increment and loop since one byte can be like 3 scanlines. Then you could use one word read for both increment and blitcount… and probably be faster since that 16 clocks for the lodsw would give time to fetch the next two instructions, resulting in the entire operation likely having 0 BIU impact other than iteration.
; CH would always be 0
lodsw ; AL = increment, AH = blit count for this block
mov cl, ah ; CX = count (CH is already 0)
xor ah, ah ; AX = increment only
add di, ax ; skip ahead in video RAM
rep stosw ; write AX to the screen CX times
I’d probably also have a block count at the start of the data stream. I might have to take a stab at implementing something similar. Given I’ve actually written a FLV decoder, I’ve got a pretty good grasp of the basics of doing this in 256 color efficiently.
Chasing down a hoodoo there; Born on the BIU… Down on the BIIIUUU…
deathshadow,
Good, a critique!
The reason I didn’t use LODSW is twofold:
A) It would require another instruction to move the value to a new register (as you illustrated). On a 386, the single mov is definitely faster, but I think that’s true on the 8088 too.
Here is a source:
http://tic01.tic.ec-lyon.fr/~muller/trotek/cours/8086/index.html.en
lodsw ; 16 clocks
mov di,ax ; 2 clocks
mov di, [si+x] ; 8+EA / 12+EA – not certain which?
add si, 6 ; 4 clocks
I just noticed that this takes fewer clocks on the 8088
add si, dx ; 3 clocks
So it still seems beneficial to use the movs, but maybe you could convince me otherwise.
B) In the example code earlier I needed the value in AX for run-length-encoding, LODSW would have clobbered the AX register.
I considered that, but it requires more instructions to move the values into CX. However, you may be right that the bus speed was so terrible on the 8088 that it should be prioritized over CPU cycles. Nevertheless, it still makes sense to use the smallest machine code possible, since every instruction byte is going to add slow bus cycles just to read the instructions.
http://www.smallapple.net/blog/blog-2013-03.html
I started x86 assembly programming on the 486, where bus latencies were much less pronounced due to caching.
I’m not too interested in personally writing software for old hardware, but I do appreciate it as a lost art form.
The 386 is a whole different ballgame than an 8088 / 8086. The 4-cycle-per-byte penalty AND the BIU (bus interface unit) prefetch queue make things a bit… different.
On both the 8088 and 8086 there are two main components — the execution unit (EU) that runs opcodes and the BIU that fetches them; the BIU provides a small prefetch queue (4 bytes on the 8088, 6 on the 8086). During any EU time that isn’t accessing memory the BIU tries to fill that queue… and that’s a real game changer on what opcodes mean.
For example:
mov di, [si+2]
On the 8088 that’s 8 clocks PLUS the “effective address” (EA) calculation, which for Index (SI) + Displacement (2) is 9 more clocks… for 17 total; BUT:
It’s 4 bytes… which, depending on what was running before this point, could be as much as 16 clocks to fetch. If you have this as the first thing inside a loop (which it typically would be) you’re looking at the full 16… for as much as 32 clocks total… and only eight of those clocks are free for BIU fetch.
lodsw – 1 byte, 17 clocks; 8 of those clocks are memory, leaving nine clocks free to fetch two bytes.
mov di, ax — 2 bytes, 2 EU clocks, fetched for free
4 clocks fetch + 17 EU + 2 EU = 23 clocks.
Though the actual results can vary WILDLY based on what code is before and after it.
The BIU offsets the 4 cycle penalty in a number of cases, but it makes “clock counting” extremely difficult. In many cases you’re better off alternating 1 or 2 byte long EU opcodes with multi-byte fast executing to try and keep the BIU full.
Part of why XOR AX, AX is way better than MOV AX,0 — saves two bytes and possibly as much as 9 clocks on the 8088. OR AX,AX too, for testing zero or the sign bit, gets an equal boost.
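For instance (typical encodings; the exact bytes depend on the assembler):
mov ax, 0 ; B8 00 00 = 3 bytes (4 if the generic C7 form gets emitted)
xor ax, ax ; 31 C0 = 2 bytes, and it sets the flags for free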
Good rule of thumb: 1 byte opcodes are rock-and-roll, 4 byte opcodes should be used when and only when needed, and if possible NEVER at the start of an area where you are looping. Simply changing the source order can have a MASSIVE impact on the 8088 and 8086.
Whole different world from later x86 processors where you don’t have the bus penalty or EA calculations.
deathshadow,
Ok, very insightful info. I do see how quickly 4 cycles per instruction byte adds up. Unfortunately the lodsw solution would still require more instructions than you counted above to save the value of ax, which was important for the RLE encoding used earlier. But in cases where AX doesn’t matter, then lodsw does look better even with an additional mov.
Since immediates are so expensive, it would make sense to throw them in unused registers instead (Those mov instructions would become the 2 byte variants instead of the 4 byte ones).
Then the add could be pushed forward. Also if we stored 2/4 inside of bx/bp, then we could get by without using the 4 byte versions of mov/add/etc.
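Something like this is what I have in mind (a rough sketch; register choice and byte counts are just from memory):
; set up once, before the frame loop
mov bx, 2 ; constants the hot loop would otherwise re-read as immediates
mov bp, 6
; ... inside the loop ...
add si, bp ; 2-byte reg,reg form instead of the 3-4 byte add si,6
mov cx, bx ; likewise 2 bytes instead of the 3-byte mov cx,2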
All in all, whether any of this would be worth doing in the first place depends on whether we can decode the data faster than the disk can hand it to us. This is really the big question for me, which I don’t have an answer to with respect to the 8088. If we’re stuck waiting for the disk anyway, then the CPU might as well be doing more sophisticated decompression.
Good rule of thumb — don’t use AX for long term storage of ANYTHING; the accumulator is so much more efficient for short operations.
Sometimes you’re even better off shoving it to memory with PUSH. The loss in clocks is often made up by accumulator operations like and, or, xor, add, sub — which are all one byte shorter using al/ax than using other registers.
Leaving it sitting there to rot in the middle of some other operation? You want that, that’s DX or BX’s job.
… and your instinct on immediates is correct: if you have unused registers free and an immediate inside a loop, get that value into a register… like BX and DX.
Oh, and on disk access, you’d HATE the Jr…
A little of both 🙂
Yes, and unfortunately the system’s memory and processing speed are too slow for what you suggested. I posted a simpler loop in part 2 of the writeup and even that was too slow for what I was trying to achieve.
There is a compromise I thought of after the entire thing was complete: Generate data describing only runs, then another set of data describing only copies, then you don’t have to branch within the update loop. I might investigate this at a future date, but in all honesty the best gains come from better preprocessing of the data (or outright cheating, such as leaving every other scanline blank).
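Roughly, the player’s update loops for that would look something like this (just a sketch, I haven’t implemented or timed it; assumes DS:SI points at the frame data, ES at CGA RAM, and the direction flag is clear):
lodsw ; AX = number of runs in this frame
mov dx, ax
or dx, dx
jz no_runs ; a frame could have no runs at all
do_run:
lodsw
mov di, ax ; destination offset in CGA RAM
lodsw
mov cx, ax ; run length in words
lodsw ; AX = fill value
rep stosw ; paint the run
dec dx
jnz do_run
no_runs:
lodsw ; AX = number of copies
mov dx, ax
or dx, dx
jz frame_done
do_copy:
lodsw
mov di, ax ; destination offset
lodsw
mov cx, ax ; count of literal words that follow
rep movsw ; copy them straight to the screen
dec dx
jnz do_copy
frame_done: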
The answer is “all of the above + memory access time” since it takes 4 cycles to read a byte on 8088.
I have very fond memories of Stacker and other products; I own a copy of all of them (including the two Stacker hardware ISA boards). They sped up disk reads on 386s and above; on 12MHz+ 286s it was roughly a wash. Below 12MHz it was slower, hence the hardware boards.
Way back in 78 or 79 I was barely a teenager and taking an Apple ][ BASIC programming course held at my local Computerland, and at the end of the lesson they showed us a 4-5 second movie clip that played back in the Apple’s low res graphic mode.
I remember that it was a man turning and saluting, but obviously it was very blocky. The video capture had been done at UC Berkeley, and they then wrote a program to playback the video.
It blew me away as a kid, and as an adult — thinking about the state of technology at the time — I am doubly impressed now.
Also impressive is FMV on the C64 (just 1 MHZ): https://www.youtube.com/watch?v=Yo7uXaV6Q1s
Note that this is the full 320×200 in 16 colours! — the C64 only natively supports 160×200 in colour; the hackers managed to abuse the scanline so much they could get a screen mode the C64 never had!
In the absence of a HDD, a 16MB RAM expansion is used to store the video.
Well-optimized 6502 code could generally meet or beat 4.77MHz 8088 code. Same data path width (8 bits), and the 6502 is a more efficient processor.
MHz is not always an apples to apples comparison…
Neither are apples: http://rainierfruit.com/retail/img/size-apples.jpg
… sorry
Well-optimized 6502 definitely beats 8088 in some cases, probably because the 8088 takes 4 cycles to read a byte, whereas 6502 only takes 1 cycle IIRC. That means, apples to apples, the 8088 is running at “1.2 MHz” (14.31818MHz clock divided by 12).
This is very impressive, and gave me an idea:
A competition where the longer you take to complete the challenge the more throttled back the computer you get to run the program on becomes.
The challenge is to make a program of a given complexity and technical proficiency within one week and 12 hours. The processor speed and memory on the PC you get to run it on (a remotely controlled virtualised PC) is halved every 18 hours.
So you start with a single 2GHz core with 2GiB of memory (no 3D cards as that’s too complex) on a Monday morning. If you complete your program late Tuesday then it has to run at 1GHz/1GiB. If you fail to complete the program until noon on the following Monday then your program has to run at 125MHz/128MiB.
On the Tuesday after all programs are reviewed according to the original specification.
Alternately there could be no strict limit on the number of days. The computer keeps halving the speed and memory until the last player gives up.
Would that be a little analogous to Bitcoin’s hashing and transaction-chain verification, with increasing difficulty — both challenges you hope maybe no one beats/‘wins’… mmm.
Wasn’t expecting to see it, but way cool that it was noticed.
Trixter is something of a legend in the retrocomputing community, always pushing the boundaries of what the hardware can do.
In terms of playing with video modes, a bunch of us have been trying to push the CGA to see just how far it can go… When I started playing with 160×100 in 16 colors for my own little Pac-man clone
http://www.deathshadow.com/pakuPaku
Trixter’s advice and experience helped a lot in getting the corners tacked down… in fact it wouldn’t run as smoothly as it does without his help and the help of others on the Vintage Computer Forums.
Right now over there a number of interesting projects for CGA are going on… Like an 80×50 16 color game:
http://www.vintage-computer.com/vcforum/showthread.php?42849-MagiDu…
I’m playing with 80×50 for the MDA to make another pac-man game, just because the old IBM 5151 / text-only MDA card never got a whole lot of loving from developers.
I often get asked why any of us bother writing new software for the old systems, and really there are three good reasons:
1) The challenge of it; the narrow hardware target and pushing it to its limits actually feels like you’re doing something — as opposed to today where it’s “slap a bunch of libraries and engines together”
2) It makes you a better programmer; in working on ‘narrow targets’ you learn about balancing efficiency of code size vs. efficiency of execution, and this makes you better qualified to make choices on how to do things on modern targets.
3) It’s FUN! What? You never heard of fun?!?
Though his ‘new’ method is kind of a laugh, since it’s how FLV and animated GIF already work internally.
deathshadow,
I didn’t know there were people here who did that, interesting the things we learn about each other!
I used to enjoy the challenge as well; it’s sad, but typical software shops today don’t really appreciate these skills at all.
I always look back fondly at my early little projects, then I wonder where it all went wrong since joining the real workforce. I find the work completely boring compared to what I could do. I think there are many of us who are overqualified for the web stuff we can get hired to do.
Well, some of us didn’t get the memo.
https://www.youtube.com/watch?v=G66KL-hxxKI
I love his reviewing style…
It goes with the ‘get it out the door now, to hell with what it costs later’ attitude; people don’t care if it’s written properly, works properly, is useful to users, they want it done yesterday and to hell with how much it rapes them long-term.
Can’t say it’s surprising in the “credit mentality” world, where dumbasses like President Odumba calls credit the “life blood of the economy” — I’m with Peter Schiff on that, it’s not the life blood, it’s CANCER.
Pay more later for what you can’t afford now; such a brilliant gameplan, and it’s creeping into every aspect of society — which is why halfwit dumbass shortcuts like “Frameworks” are all the rage no matter what half-assed crap the result is. jQuery, BluePrint, Bootstrap, YUI, grids, Google “Web whateverthehelltheyrecallingit” — they’re all at best crutches for the inept, at worst pissing all over the usefulness of every website they are used on!
Hence why on web development forums you have people asking “why is my site slow” or “why is my site penalized by google” — when the answer is “it’s two megabytes in 100+ separate files”; and naturally the dipshits chime in with their “doesn’t everyone have broadband now?” lame excuses.
That’s why it’s called work, not happy happy fun time. No matter how enthusiastic you are about something or how much you enjoy it, you turn it into work, and it sucks the fun right out of it faster than a brown log in a swimming pool.
Though having worked an average of three jobs a year between 1989 and 2002, I came to the conclusion that anyone who says they enjoy their job is either not actually working at that job, full of ****, or suffering from severe brain damage.
Which is why I think so many people who program by day, have side projects like these at night; if it wasn’t for people doing it for fun on the side, I’d be willing to bet most open source projects would have died off ages ago.
There’s what we want to do, and then there’s what needs to be done; right now we have the problem of an overeducated workforce that ‘expects’ to be able to get their dream white-collar job, but it ends up being a case of too many chiefs, not enough Indians — we’ve attached a stigma to the very thing that made developed nations strong: blue collar work! Doing the work that needs to be done as opposed to doing no work and collecting a paycheck for sitting at a desk all day playing Farmville.
… then we wonder why the economy is in the tank.
deathshadow,
I know getting into politics won’t possibly end well. However, I agree that credit has been a horrible development for society. While ostensibly it makes things “more affordable”, especially high-price items like cars and houses, in reality the ability to pay more with credit has affected the dynamics of supply and demand such that prices have gone up along with credit. This hurts the purchasing power of those who don’t believe in credit, and we become dependent on credit to make purchases. Of course the net effect of “credit” is self-evident – a net transfer of wealth from the lower classes to the upper classes through interest on loans (i.e. 50% of mortgage payments going towards interest is a huge wealth transfer).
Well the trouble is, at least in my experience, a false expectation from the outset whereby everyone enters the tech field aspiring to do what very few end up doing. You work on all these cool challenging projects at university only to discover it’s nothing like the jobs afterwards. Maybe I was particularly naive, but there are still lots of institutions pumping students into computer science for jobs that we’re often overqualified to do.
I imagine you are right, most open source projects get done for the love of the craft.
BTW “not enough Indians” sounds racist.
Anyways, you’ll probably agree very strongly with the ideas Mike Rowe (from the Dirty Jobs television show) talks about in this interview:
https://www.youtube.com/watch?v=qzKzu86Agg0
There are categories on OSNews? I’ve been coming here since ca. 2001 and never even noticed.
Yeah… My thoughts exactly. Although, I guess the icons are indicative of the category. There used to be a link to all of them which had some really neat logos for obscure operating systems.
Sadly, there aren’t that many anymore.
Edit: also I think it should have been in the DOS category.
2nd Edit: Oh, here it is:
http://www.osnews.com/topics/