I have been curious about data compression and the Zip file format in particular for a long time. At some point I decided to address that by learning how it works and writing my own Zip program. The implementation turned into an exciting programming exercise; there is great pleasure to be had from creating a well-oiled machine that takes data apart, jumbles its bits into a more efficient representation, and puts it all back together again. Hopefully it is interesting to read about too.
This article explains how the Zip file format and its compression scheme work in great detail: LZ77 compression, Huffman coding, Deflate and all. It tells some of the history, and provides a reasonably efficient example implementation written from scratch in C.
One for the ages. Articles like this don’t get written every day.
I’m a bit shocked the author would actually re-implement old software like pkzip, but it is an interesting article that brings back memories. There were so many different archiving formats and tools in the BBS days.
I’ve kept backups of these programs. I probably don’t have any archive files to use these on any longer, but I distinctly remember using many of these back in the day.
ace
arc
arce
arj
gzip
lha
lharc
mksarc
pak
pkarc
pkzip
rar
rearj
shrink
wuudo486
wz
zoo
You’d go download some archive over kermit, ymodem, and zmodem and need these extraction programs. Things have really consolidated since then with a few winners taking over most of the market.
Now that’s a list that brings back memories! I’m also reminded of all the special/private C64 compressors there were, written by various coders and groups of the day. Using certain custom packers was almost a status symbol because it meant you either had a copy yourself, or you knew someone who did. Same with the music packers & players as well.
The Amiga world still uses the lha/lzh formats but pretty much everyone else uses zip, 7z, or tar+gz/xz these days.
tidux,
I’m curious, does the Amiga have support for modern compression formats? The compression ratios have gotten a lot better, but the algorithms require a lot more ram to support larger dictionary sizes. I once had to add a few hundred megabytes of ram on a small VPS just in order to compress and decompress archives without swapping. How much ram can you even get when you max out an Amiga computer?
Alfman,
Amigas are a bit finicky with RAM, as they have many different types of RAM, all operating at different speeds. You have superfast “Chip RAM”, which is soldered to the board; fast “Fast RAM”, which is generally added on an expansion board; and slower “pseudo-fast RAM”. It’s all a bit of a mess, and I don’t know the full details, but some RAM is only accessible by the CPU, whereas some can be accessed by the custom chipsets too.
Generally, an A1200 is limited to 8MB “fast RAM”, but can take 128MB with an accelerator.
Modern PPC “Amigas” are much more standardised, and generally are limited by the 32-bit address bus of the PPC processors they use, so 4GB.
If you’re running 64-bit AROS on x86-64, your limit is whatever your motherboard limit is.
So to be blunt, it depends on what hardware you’re running. Real 68k Amigas are limited to <256MB RAM, PPC Amigas are limited to 4GB.
Though TBF, PPC Amigas aren’t Amigas (and if the market had not degenerated into “rip off every last customer as much as we can”, it would have gone x64 (or x86 previously), or ARM, a long long time ago). The Pi would make a great Amiga with a native OS. So as an Amiga lover you are better off with wholly FPGA or software emulation these days anyway. (Let’s face it, even Apple got out of PPC far too late.)
Why yes, I do still have real Amigas. But those are just for fun!
Hopefully you also realise all Amigas are 32-bit and the 16-bit thing was just marketing. How many cycles it takes to read from ram is not the definition of a system for any sane person.
Chip Ram is not and never was “superfast” – it’s the memory shared with the custom chips and was basically twice the bus rate of a stock 7MHz 68000 so that the 68000 could take one cycle, and the custom chips would have unrestricted access to the other. Certain features of the chip set could steal the 68000 cycle for better speed, like the blitter, or if you needed the cycles for video fetch beyond 16 colors in low-res, or 4 colors in hi-res. Most people called chip ram “slow ram” because it was tied to a base 68000 cycle, no matter what CPU you had (say, a 68020 at 20MHz, or whatever).
The next kind of memory was called Bogo Ram, or sometimes Slow Ram. This was the memory in the belly slot of an A500. This was controlled and accessed through the chip set, but could NOT be used for video or sound data. The design made adding memory to the A500 cheap and easy, but limited the speed to exactly the same as chip ram.
Then there was Fast Ram, which was controlled and accessed directly by the CPU. It was as fast as the CPU, and never had to wait on chip set data access, which meant no slowdowns due to DMA contention in higher color modes. It could be on an accelerator card, or in a Zorro II or Zorro III slot. Zorro II limited the size and speed of such ram to no more than 6MB and no faster than a standard 68000 bus cycle (but it COULD be asynchronous for slightly faster cycles). Zorro III memory cards could be much bigger and faster. Zorro III allowed for 32-bit widths (unlike the Zorro II 16-bit width), and faster cycle rates (up to 30MHz IIRC).

The kickstart rom would recognize and configure up to 256 MB of Fast RAM on an accelerator or Zorro III ram cards without an init rom, and larger cards could have an init rom to configure more ram. 32-bit accelerator cards could also use an init rom to handle more than 256 MB of fast mem. In all original Amigas, the final limit for 32-bit machines was a hair under 2GB of ram, as the address space with A31 set was reserved for future use by Amiga. That’s the official limit. Clearly, that never came about as Amiga went under, and the people who bought the assets never did anything, so that extra 2GB of space was free for use for anything by third parties, making the limit just a hair under 4GB of ram.
However, then you run into the same problem as old 32-bit Mac computers, and old 32-bit PCs: is the software 32-bit clean? With a 32-bit address bus and 32-bit cpu operations, now you have to deal with 32-bit signed or unsigned operations. Using 32-bit signed operations on 32-bit pointers can lead to incorrect addresses in that high 2GB area. This is called “dirty” addressing, and many programs had this problem on ALL 32-bit computers of the time. For 32-bit pointers to correctly access a full 32-bit address bus, you must use unsigned operations on pointers. This would be “clean” addressing, and Apple futilely tried educating programmers on this issue back when they moved to a full 32-bit NuBus architecture. Because of this, pretty much all older OSes restrict access to the top 2GB of address space. Amiga did. New(er) versions of Windows did not, allowing properly written programs to access more than 2GB of ram, but they had a special flag in the executable file to indicate the program was 32-bit clean. This is the LAA (large address aware) flag you may have heard about.
Most older Macs (68K or PPC) were limited to a max of 1GB of ram for various reasons. 3rd party accelerators could go beyond that.
JLF65,
I’m impressed by your knowledge of these amigas 🙂
Do you mind going into more detail about what you mean above though? Arithmetically, when using two’s complement notation, signed and unsigned operations are identical; it doesn’t even matter if you use the right variable type in the compiler, since they use the same CPU opcodes anyway. Consider the output of this program…
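Something along these lines, as a minimal sketch (the hex constants are just arbitrary 32-bit values picked for illustration):

    #include <stdio.h>
    #include <inttypes.h>

    int main(void)
    {
        /* The same two bit patterns, once typed as signed and once as unsigned. */
        int32_t  a_signed   = (int32_t)0xC0000000;  /* reads as -1073741824 when signed */
        int32_t  b_signed   = 0x10000000;
        uint32_t a_unsigned = 0xC0000000u;
        uint32_t b_unsigned = 0x10000000u;

        /* Two's-complement addition produces identical bits either way. */
        printf("signed:   %08" PRIX32 "\n", (uint32_t)(a_signed + b_signed));
        printf("unsigned: %08" PRIX32 "\n", a_unsigned + b_unsigned);
        return 0;
    }

Both lines print D0000000.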
This is obviously the cool part about two’s complement: negative values are interpreted differently, but they are calculated in exactly the same way as positive values, even to the point where it doesn’t matter whether the CPU or compiler knows they are negative or not. So with this in mind, I don’t understand what you were saying here. There may be OS/kernel limits imposed on userspace applications that prevent the full address space from being available, but as far as I know this was pretty arbitrary and had nothing to do with the “sign” bit.
A long time ago, 16bit software would expect reads & writes past the end of memory to wrap around back to 0x0. For compatibility’s sake IBM added a mechanism to the keyboard controller to mask the A20 address line to emulate the old behavior, and you’d lose access to some ram. But there was no similar “feature” for 32bit.
The table in this link highlights the purpose of LAA on windows.
https://helloacm.com/large-address-aware/
It allows 64bit versions of windows to give 32bit programs 4GB of memory instead of the normal 2GB that they used to have under 32bit versions of windows.
Someone here asks why not enable it by default…
https://stackoverflow.com/questions/2288728/drawbacks-of-using-largeaddressaware-for-32-bit-windows-executables
I am not strongly convinced by the first answer. The next answer says this:
The implication here is that there’s some code that does something like this…
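Roughly this hypothetical pattern (get_buffer is a made-up name, just to sketch the kind of check being described):

    #include <stdint.h>
    #include <stdlib.h>

    void *get_buffer(size_t size)
    {
        void *buffer = malloc(size);

        /* Buggy "did the allocation fail?" check: casting the pointer to a
         * signed integer (historically a plain int, which is pointer-sized on
         * a 32-bit system) and testing <= 0 treats any address with the top
         * bit set (anything at or above 2GB) as a failure.
         * The correct test is simply buffer == NULL. */
        if ((intptr_t)buffer <= 0) {
            return NULL;
        }
        return buffer;
    }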
I’ve never seen code like this, typecasting a pointer into an int is neither common nor good practice, but it still technically works. The bug here is testing for ptr<=0 rather than ptr==0 (ie null). Honestly I don’t think I’ve ever seen this in real code. This code would clearly break on platforms such as linux that had a different 3GB user+1GB kernel allocation, but I can’t deny it’s in the realm of possibility that some old windows programs had this bug, so apparently this is why windows required a flag to increase past 2GB. Although I will say I don’t believe many 32bit applications and libraries in the wild have this bug.
At some point, x86 processors used the MMU to be able to map memory outside the 4GB space addressable by a 32-bit address so that it COULD be addressed by a 32-bit address. So you had say, 16GB of ram to split between processes that could access 2GB of ram normally. The other 2GB of space was reserved for mapping shared ram and kernel ram. LAA allowed the OS to limit the shared and kernel memory so that a process could access almost 4GB of ram. In any case, this meant that you could have processes using real ram to the max instead of virtual ram that would necessitate waiting on ram to load from the hard drive. With falling ram prices along with increasing ram density, this quickly led to virtual ram being less an issue for computers.
As to 32-bit “clean” addressing, yes, simple operations on signed values can preserve the full address, and give the proper address. The issue generally comes from casting – pointers tend to be unsigned, so casting an int to a pointer or the reverse could lead to undefined operations if the value was negative. Compilers of the era might handle this differently. If you were using Manx C or SAS C or K&R, you’d have to know a bunch of low-level undefined rules for things like that, so older programs might have #ifdef sections for the differences. Example: suppose your pointer was 0xC0000000; now it got cast as an int to store in an integer variable; the pointer is too big to store in an int, and the compiled code might throw an error at runtime, it might set the variable to the largest positive number an int can hold, or it might ignore the issue altogether (which would ironically lead to the code working). Things have gotten better in standardizing C, but back in the heyday of the Amiga/PC/Mac, everyone had to be a little different. 🙂
JLF65,
I’m very familiar with the 32bit x86 architecture and I wrote my own little x86 operating system back in the day. 32bit x86 code could always access 4GB of space from the start. There isn’t an architectural limit for memory under 4GB and in fact a popular mode of operation in 32bit dos applications at the time was “real flat mode” which can access the full 4GB. PAE was needed to access more than 4GB provided you use multiple selectors (although these have been phased out and have no modern equivalent in 64bit). All limits beneath 4GB on x86 were imposed by the OS and not by the x86 architecture. You probably know this already, but I’m just reiterating it so we’re on the same page.
I’m far less knowledgeable of other non-x86 architectures though.
Yeah, I looked up what this does on windows. It allows an application to allocate more of the 32bit address space. I never used it though and before you mentioned it I didn’t even know it existed, so thanks! I’ll keep it with all my other trivia that nobody outside of here cares about, haha.
But “negative values” don’t affect two’s complement arithmetic; you get the right binary bits no matter which way you cast it. The interpretation of a “sign bit” is different, such as asking if it’s negative as illustrated in the last post, but the ALU produces the same bits either way. You can prove this to yourself by doing the math by hand in binary.
Yeah, supposing your compiler uses 16bit ints and 32bit pointers, obviously I have to agree that casting a pointer to an int will result in truncation of all pointers representing addresses >= 64k. Such code wouldn’t even be “17 bit clean”, much less “32bit clean”.
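For instance, with int16_t standing in for that compiler’s 16-bit int (a contrived illustration, not output from any real Amiga-era toolchain):

    #include <stdio.h>
    #include <inttypes.h>

    int main(void)
    {
        /* A 32-bit address above 64K, squeezed into a 16-bit signed int. */
        uint32_t addr   = 0x0001F000u;
        /* The value doesn't fit, so the conversion is implementation-defined;
         * in practice the upper 16 bits are simply lost. */
        int16_t  as_int = (int16_t)addr;

        printf("address: %08" PRIX32 "\n", addr);              /* 0001F000 */
        printf("as int:  %04" PRIX16 "\n", (uint16_t)as_int);  /* F000 */
        return 0;
    }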
LZ4 with squashfs and overlayfs is a fantastic combination to mount compressed documentation, to be used by zeal or from devdocs on linux.
SquashFS + Network Block Device (nbd) make for a much more performant alternative to NFS, as well. 🙂 At least for read-only filesystems like /, /usr, and the like; for diskless workstations booting off the network.
phoenix
I use squashfs with xz compression for my linux distro, no complaints there. However a long time ago I wanted something like nbd or iSCSI to boot up virtual machines on a different server than the one with the disk image. I found nbd to be extremely unstable & unreliable. The long term goal was to use nbd to facilitate VM migration, but the linux kernel would become erratic and lock up once it connected to an nbd server, so I had to give up on it. I’m thinking of trying it again if it is more stable now.
I thought rar rather than 7z occupied the second spot. But I guess it varies from industry to industry, country to country.
You forgot sit from the Macintosh
Kochise,
Ah, yes there’s probably a couple missing that I never encountered in DOS.
IBM PCs were so dominant when I was growing up and going to university that I had never actually seen an apple/mac store in person. Who knows, if I had, then maybe I’d be a mac fanboy with praises for steve jobs, haha. I doubt it though, because I’ve never put bill gates on a pedestal even though I’ve used many microsoft products. They’re just people with flaws like everyone else. I’ve always been more interested in the technology than the people. Looking back now, I think I would have been very intrigued by commodore even though I did not experience it at the time.
I doubt that, DOS and Windows are so much more convenient to use and tweak. The file extension is enough to select the proper application to open the file, unlike on a Mac where it is file metadata.
I think “was” is the correct verb tense there. Ever since Mac OS X, file extensions have been the correct way to identify files. For a while the metadata method was still supported for backward compatibility, however, which did lead to some rather odd issues at times when you did change the application to open a certain file type but one file in particular did not obey, due to the creator metadata being set on it. I think it was in 10.6 that the metadata support was removed completely, however don’t quote me on that part.
In the DOS days I used arj because it gave better compression than zip.
I liked jar in DOS (not to be confused with the .jar file format for Java), which was made by the same people who made arj and was intended to be its successor. The compression ratio, considering the time period, was unparalleled. It never really caught on though.
…bunzip2, jpeg2000, djvu, FractalView, various triangle and fan tile texture companding and normalizing things…
All pretty essential. Then there are encryptions and whether those would accept fallbacks in streams…
Things have really been hustled over by anticrypto factions and a pretty bogus federal certification for crypto.
As for the market…yeah, it’s not through fixing things somehow; I wonder if somewhere there’s a student ready to stand up and recite all the standard formats and compressions used in single-cell nucleosome sampling, GIS, post-hadoop dimensionalizing of big data…
I once wrote a GIF file encoder and decoder based only on the specification of the GIF format from CompuServe and Terry Welch’s IEEE article describing the LZW algorithm. In Fortran. Took a bit of head-scratching to figure out how LZW actually worked, given that the description of the algorithm, even though complete, was rather succinct.
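For anyone curious, the dictionary-building loop at the heart of LZW is surprisingly small. Here is a rough sketch in C rather than Fortran, leaving out GIF’s variable-width code packing and its clear/end-of-information codes; the emitted codes are just printed as integers:

    #include <stdio.h>
    #include <string.h>

    #define MAX_CODES 4096  /* 12-bit code space, as in GIF */

    /* Each dictionary entry is (code of its prefix string, appended byte).
     * Codes 0..255 implicitly stand for the single bytes themselves. */
    static int dict_prefix[MAX_CODES];
    static unsigned char dict_byte[MAX_CODES];
    static int dict_size;

    /* Return the code for the string (prefix + byte), or -1 if unknown. */
    static int dict_find(int prefix, unsigned char byte)
    {
        for (int i = 256; i < dict_size; i++) {
            if (dict_prefix[i] == prefix && dict_byte[i] == byte)
                return i;
        }
        return -1;
    }

    static void lzw_encode(const unsigned char *data, size_t len)
    {
        if (len == 0)
            return;
        dict_size = 256;

        int prefix = data[0];
        for (size_t i = 1; i < len; i++) {
            unsigned char c = data[i];
            int code = dict_find(prefix, c);
            if (code >= 0) {
                /* The string so far plus c is already known; keep extending. */
                prefix = code;
            } else {
                /* Emit the code for the longest known string, then remember
                 * that string plus c as a new dictionary entry. */
                printf("%d ", prefix);
                if (dict_size < MAX_CODES) {
                    dict_prefix[dict_size] = prefix;
                    dict_byte[dict_size] = c;
                    dict_size++;
                }
                prefix = c;
            }
        }
        printf("%d\n", prefix);  /* flush the final pending code */
    }

    int main(void)
    {
        const char *msg = "TOBEORNOTTOBEORTOBEORNOT";
        lzw_encode((const unsigned char *)msg, strlen(msg));
        return 0;
    }

Running it on that classic test string, the first few symbols come out as literal byte codes, and the repeated “TO”, “BE”, “OR” and longer substrings come out as single codes of 256 and up.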