Users of DOS or older versions of Windows will invariably have stumbled upon a quirk of Windows’ file name handling at some point. File names that are longer than 8 characters, have an extension that isn’t exactly 3 characters long, or aren’t upper-case and alphanumeric are (in some situations) truncated to an ugly shorter version containing a tilde (~) somewhere. For example, 5+6 June Report.doc will be turned into 5_6JUN~1.DOC. This is a relic of the limitations of the FAT file system as used in DOS and in pre-NT versions of Windows.
So far, nothing new. This article, however, delves deeper into a special aspect of this relic: a built-in checksum function that, up until now, was undocumented.
The version of the algorithm he reverse engineered was written for Win7. Prior to that it’s a different version without the magic primes he’s referring to. We did this work to reduce hash collisions, which end up being very expensive as directories get large (and the probability of collisions increases.)
Most of the strange instructions he’s looking at are really a result of the abs() function. The original source is a lot smaller (and a lot smaller than the previous version.)
Good to know somebody really cares about the minutiae that occupy our time…
Do I understand correctly that you are actually familiar with the code he was reverse engineering? Wow that’s funny.
I recently succeeded in reverse engineering parts of Dell’s proprietary racadm tool to reveal the undocumented instructions it was using to communicate with DRAC hardware. (Anyone have the source code, by any chance?)
I started with a simple strace, then dug in deeper with gdb, then wrote a custom wrapper function for ioctl to intercept and log the commands. I didn’t really think to document my steps as an article; I’m not sure if that kind of thing would interest anyone here. Anyway, this kind of article interests me.
Count me among the interested.
It’s “documented”, alright, numerous places. That guy needs to learn how to google…
http://www.maverick-os.dk/FileSystemFormats/VFAT_LongFileNames.html
timby,
What you are referring to in your link is different. Sure, it describes the LFN structure in the FAT directory, but this article dives into a much more subtle detail: the arbitrary short file names chosen to represent the long file names. That topic doesn’t seem to be discussed in your link at all (i.e. there isn’t even a single “~” in it).
It’s an arbitrary hash and doesn’t really matter (all that matters is that the short file name be unique), but this author decided to reverse engineer it anyway. This is very esoteric stuff.
Other than compatibility with ancient DOS applications, what’s the point? Whenever I install a Windows system, the first thing I do is disable 8.3 filename creation in the registry.
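(For reference, the registry tweak being described is the NtfsDisable8dot3NameCreation value; on Win7 and later the same thing can be done with fsutil, globally or per volume:)

```shell
:: Global switch: 1 disables 8.3 name creation on all NTFS volumes
reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v NtfsDisable8dot3NameCreation /t REG_DWORD /d 1 /f

:: Same via fsutil, plus the Win7+ per-volume form
fsutil behavior set disable8dot3 1
fsutil 8dot3name set C: 1
```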
I never run any DOS or 16 bit apps, and I don’t think I’ve ever run into any issues.
Yeah, of course, like every Windows user does. And anyway, the algorithm is trivial and implemented by other systems just fine. So the 8.3 compatibility kludge doesn’t affect anybody anymore.</sarcasm>
I wonder how the Linux (and others) implementations compare.
IIRC, the Linux vfat driver generates garbage for short file names, as Microsoft has been flexing its patent muscle on their method for storing both short and long file names.
The way around the patent is to store only short, or only long, but not both.
The vfat driver will pick which is best based on the chosen file name. However, there is the potential of incompatibilities when the 8.3 and the LFN differ.
There is also the msdos driver, which enforces strict 8.3 file names, avoiding those issues.
Why?
A few reasons:
a: Creating 8.3 names takes up extra space on the disk, and slows down disk access (more stuff in the disk catalog to search)
b: I always liked the NT-based OSs (starting with NT 3.5), but despised DOS/Win3.x and their file names, so I just don’t want anything whatsoever to do with them. I never run DOS apps, so why do I need DOS compatibility?
c: 8.3 names are just plain ugly, and I never want to see them.
Edited 2015-06-11 13:33 UTC
In that case, why use FAT at all? The only use case for that filesystem is cross-OS compatibility. If you’re willing to compromise compatibility by disabling short file names, you might as well use a modern filesystem to begin with.
He’s actually talking about short filename support in NTFS. You cannot disable short filename support in VFAT for obvious reasons (and on VFAT it doesn’t have much performance benefit, for obvious reasons).
Edited 2015-06-11 16:31 UTC
Yes, this is only on NTFS; I don’t think it’s even possible to disable short names on VFAT.
NTFS is actually a very good file system; I only wish OS X had a file system nearly as good.
I’ve never cared enough to do it, but it was all the rage for windows optimization for a while.
https://support.microsoft.com/en-us/kb/121007
It’s a bit of an oversimplification to conclude that 8.3 names are required for DOS/Win16 compatibility. A DOS app can only read or write 8.3 named files, but short name generation is all about creating hidden names for files that aren’t themselves 8.3; so the DOS app can read its own files just fine. If you save a file from that app, it can open that file. If a DOS installer installs a DOS app, the app can read its own binaries.
AFAICT there are two historic reasons for 8.3 name generation:
1. As others on this topic have implied, VFAT requires an 8.3 name in order to ensure disks can move between long file name supporting systems (Win95) and other FAT compliant systems (DOS.) DOS can’t see the long names, but the disk can still be functional.
2. Win95 introduced the document centric interface, so you could install 16-bit Word, create a document on the desktop, give it a long name, then ask Word to open it. But since Word can’t see the long name, it needs to be given an autogenerated short name. Making this scenario work requires an even more evil hack, known as the tunnel cache (https://www.osronline.com/article.cfm?article=22 .) This beast means that Word can delete the file, create one with the same short name, and the long name is revived from outer space.
So, you’re reading this, and think those reasons are gone, and short names should be too? It turns out that when something gets added for one reason, creative people find other interesting reasons for it.
1. Short names must be ANSI/ASCII (i.e., not Unicode). So if you write an app and don’t support Unicode fully, getting the short file name served as shorthand for an ANSI file name. And if that functionality isn’t there, no name can be given that the app will understand.
2. Windows has a path limit known as MAX_PATH. Paths longer than 260 characters can be created, but most of the system can’t operate on them. In order to operate on as many paths as possible, short names are used to partially lift this limit. So if you delete a file from CMD, it can use the short name to do so.
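(You can see this mechanism from CMD yourself; the second path below is hypothetical, just to illustrate substituting a generated short component to shorten an over-long path:)

```shell
:: List both long and generated 8.3 names in a directory
dir /x "C:\Program Files"

:: A short component can stand in for a long one (hypothetical example)
del "C:\PROGRA~1\SOMEAP~1\report.txt"
```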
3. Along the same lines, apps came to use short names as a compaction scheme. So if you install Office with short names enabled, the registry will be populated with short names. It would work just fine with long names, but that would mean more space in the registry.
So where does that leave us? Well, ANSI is becoming less of an issue day by day, so the transition away from short name generation is well underway.
1. In Win7+ this became a per-volume setting.
2. In Win8+ newly formatted volumes have it disabled. (It is reenabled as part of OS setup for OS volumes.)
3. Hopefully this can get more targeted in future. “C:\Program Files” aka “C:\PROGRA~1” et al are a bit problematic from a compatibility point of view, although 64-bit Windows has two of these without consistent naming, so hopefully this one will stop being an issue sometime also.
I’d just like to say that that’s an incredibly stupid limitation that I keep bumping up against more and more often. I can’t believe MS hasn’t removed that limitation since it’s not even a limitation of NTFS.
I know it’s not a limitation of NTFS, but I think the limitation exists because some Win32 API file/path functions took structs with fixed buffers sized to MAX_PATH.
MacMan,
I don’t really see what the problem is? Just change the constant, recompile, and …. oh wait, windows. Never mind
LOL, funny.
You do however have to give Microsoft credit for taking binary and backward compatibility seriously. That’s why, even though it’s a lot easier developing on Linux, it’s ORDERS OF MAGNITUDE easier releasing on Windows: just compile a binary and you’re pretty much guaranteed that it will work everywhere on Windows.
Linux on the other hand, OMG, you have to have like 15 different VMs for each different flavor, create 15 different repositories, 15 different package managers, for fuck’s sake, debs, rpms, pkgs, then telling users how to add repositories. Oh, crap, what if they don’t have root access??? Then you have to make another 15 statically built binaries that run from the user dir.
As great as the dev environment is on Linux (shell, clang, gcc, unix cmd line tools, etc.), the distro zoo is an unmitigated disaster. Is it any wonder devs are reluctant to target Linux?
MacMan,
I actually agree that it could be better; however, I don’t think your conclusion is sound. For example, most developers are reluctant to target Apple computers as well, even though I’m positive you will agree that Apple has the least fragmentation of all PCs. Surely market share plays a big part in a developer’s decision (at least for me it does).
Edited 2015-06-14 06:40 UTC
MacMan,
Tongue in cheek humor, I did not actually expect a response or dialog.
Or, you know, use a language which makes self-contained programs rather easy to create.
Maybe the problem isn’t Linux….
Edited 2015-06-14 13:21 UTC