Windows was an early adopter of Unicode, and its file APIs have used UTF‑16 internally since Windows 2000 (it was UCS‑2 in the Windows 95 era, when the Unicode standard was still a draft, but that’s another topic). Using UTF‑16 means that filenames, text strings, and other data are stored as sequences of 16‑bit units. For Windows, a properly formed surrogate pair is perfectly acceptable. However, issues arise when string manipulation produces isolated or malformed surrogates. Such errors can lead to unreadable filenames and display glitches, even though the operating system itself can still open and execute the files correctly. But we can also create such names deliberately, as shown below.
↫ Zafer Balkan
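As a minimal sketch of the trick (my example, not the author’s code; the filename is made up), the wide Win32 API will happily create a file whose name is not valid UTF‑16:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0xD800 is a high surrogate with no matching low surrogate, so
           this name is not valid UTF-16. NTFS stores names as raw 16-bit
           units, though, so CreateFileW accepts it anyway. */
        const wchar_t *name = L"lone-\xD800-surrogate.txt";

        HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, NULL,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
            return 1;
        }
        CloseHandle(h);
        return 0;
    }

The resulting file shows up with a garbled name in anything that insists on well-formed UTF-16, which is exactly the failure mode described above.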
What a wild ride and an odd corner case. I wonder what kind of odd and fun shenanigans this could be used for.
My understanding is that Windows accepts unpaired surrogates because they decided that was the least painful way to be backwards compatible with NTFS filesystems created when UCS-2 didn’t impose any special restriction on those codepoints.
It is a “damned if you do, damned if you don’t” situation, and I don’t think there would be any right answers.
The solution could have been implemented in chkdsk, though. It could have scanned older disks, possibly at OS upgrade time, and let users choose a behavior: “keep as-is”, “replace with best match”, “downgrade to ASCII”…
Or something similar.
Unicode filenames were always a bad solution to a non-problem; this is just one more illustration of why.
Minuous,
It’s easy for you to say if your language is Latin-based, but otherwise it’s kind of unfair to insist filesystems can’t represent your language.
I think the Linux implementation is the simplest: treat filenames as bytes and let higher-level applications interpret what they mean in terms of characters.
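A tiny illustration of that byte-oriented model (my sketch, not from the article; the filename is made up), which works on any mainstream Linux filesystem:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* 0xFF can never appear in valid UTF-8, yet the kernel accepts it:
           a filename component may contain any byte except '/' and NUL.
           What the bytes "mean" is left to userspace. */
        int fd = open("not-utf8-\xff.txt", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        close(fd);
        return 0;
    }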
My objections to unicode have more to do with mission creep; hijacking unicode for emojis with colors etc. went too far. Now what you see is entirely dependent on the device you use. IMHO the scope should have been limited to natural alphabets, and emojis should have used a markup language that doesn’t pollute the unicode namespace.
We used to get a codeset along with file names to tell us how to display them.
As much as I loathe the overuse of emoticons, this has unfortunately become an evolution of how people communicate. I have already come across the use of emoticons in a story book, and it was mildly infuriating to realize that this is the direction our language is going. Emoticons are used to express thoughts and feelings that would otherwise take up a lot of text, and they help convey a message for a lot of people. I know many people with autism and ADHD, and many of them (but not all of us) think emojis are a blessing.
One argument for including emojis in unicode is that nothing gets lost when copy/pasting a message that uses them. Also, text-to-speech software can implement a way of describing the emojis for better accessibility, which would be a lot harder if emojis were images without a proper ALT description. I am still of the opinion that it is feature creep that seems to be getting out of hand with the ever-growing variety of fringe cases, but I do understand why this is happening, and I think it’s too late to stop it by now.
Personally, I preferred ASCII emoticons in IRC chatrooms (although those are hard on text-to-speech too) and the pre-flash/HTML5/javascript web. Unfortunately we’re way past the good old days of the internet.
Titanius Anglesmith,
I’m not saying there isn’t a place for emojis, but they don’t belong in the unicode layer, especially as emojis keep gaining more variations of the same thing and new attributes for different color variations etc. All of this would be OK for a dedicated standard, but honestly it’s becoming a markup language, and it’s ill-fitted for unicode.
That’s what I’m saying too. We took this path, but technically it would have been better for unicode to reject these and to get a purpose-built markup language that would have been more flexible and extensible to boot. There was an article on osnews some months ago about how unicode is already getting broken today because different editors/viewers that can otherwise handle unicode are confused by the complex sequences used by emojis. The hypothetical consistency benefits of adding emojis everywhere through unicode are already invalid – the output you get depends on your software/editor/browser/etc.
Rather than arbitrarily declaring tons of code points for emoji variations and attributes in unicode, a proper markup language could have made variations explicit, while screen readers would have an easier time working with them without being tripped up by attributes they don’t understand. It would be far easier to view/edit emoji properties with a proper markup language. And it would be easier to make emojis future-proof: a markup language could add poses/etc. without breaking clients that don’t understand the new attributes.
We made the wrong choice, but I agree it may be too late to fix and now we’ll probably face the problems for a long time.
Care to elaborate on why you think it was a non-problem, or why Unicode was a bad solution?
Without Unicode, a lot of people weren’t able to name their files in their own native language, or if they could, they could not easily be shared across different regions with different languages.
I would agree that UTF-16 is not as good an encoding as UTF-8, but that’s only with the power of hindsight. Windows NT development started in November 1989; UTF-8 wasn’t announced until January 1993, more than three years later. And UTF-8 would not necessarily fix this specific problem – it is still possible to encode UTF-8 incorrectly, so Windows might still have the same kinds of issues displaying or handling it.
UTF-8 has the exact same issue as UTF-16 here. The problem in this case is that what counts as valid was retroactively restricted after Windows had already allowed it.
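To make that concrete (my example, not from the thread): the classic “overlong” encoding is the UTF-8 counterpart of an unpaired surrogate. Strict decoders must reject it, while lenient ones let it through:

    #include <stdio.h>

    int main(void)
    {
        /* 0xC0 0xAF is an "overlong" two-byte encoding of '/' (U+002F).
           Today's UTF-8 definition forbids it, but early, lenient decoders
           accepted it. Decoding it naively: */
        const unsigned char overlong[] = { 0xC0, 0xAF };
        unsigned cp = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F);
        printf("decodes to U+%04X ('%c'), yet is not valid UTF-8\n", cp, cp);
        return 0;
    }

Lenient handling of overlong forms caused real security bugs (path-check bypasses, for instance), which is why the definition of valid UTF-8 was tightened after the fact, much like the surrogate rules were.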
To make good filenames, unicode needs to be heavily normalized. The HUGE problem with that is that normalization is language-specific. So we went from having to pass a codeset along with an encoded name to now needing to pass a language and region along with a Unicode-encoded name to know how to handle it.
For one simple example: if you are searching for files with ‘ä’, do you search for the single code point ‘ä’, or for the sequence of two code points, ‘a’ followed by a combining diaeresis? At least this one isn’t language- or region-specific, unlike a case-insensitive search for ‘i’.
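In byte terms the split looks like this (a sketch of mine; UTF-8 shown for brevity, but the same precomposed/decomposed ambiguity exists in UTF-16):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The same visible character, two different encodings: precomposed
           U+00E4 vs. 'a' followed by U+0308 COMBINING DIAERESIS. */
        const char *nfc = "\xC3\xA4";  /* "ä" as one code point */
        const char *nfd = "a\xCC\x88"; /* "ä" as base + combining mark */

        printf("byte-identical: %s\n",
               strcmp(nfc, nfd) == 0 ? "yes" : "no"); /* prints "no" */
        return 0;
    }

A filename search that compares raw bytes will treat these as two different names, even though they render identically.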
This does seem to be a legitimate problem. It looks like they got around Unicode normalisation needing to know language/locale by simply not using Unicode normalisation:
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation:
> There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs.
How do other operating systems/filesystems deal with this? On macOS, it looks like HFS+ used to do Unicode normalisation, but APFS does not. And on Linux filesystems like ext4, it seems to be optional but not standard.
djhayman,
I didn’t even know it was an option in Linux. I thought filenames were always bytes, but I guess ext4 tried to implement unicode features a few years ago.
https://lwn.net/Articles/784124/
I would have voted to relegate this to userspace, since I don’t like the idea of the kernel being tethered to unicode versions. Anyway…
I would say this quote could be a bit misleading. It may not perform “normalization”, but it does perform case checks. In other words, it’s actually not just an opaque sequence of WCHARs; those WCHARs do get interpreted by the file system. They should not have called them opaque, but I think what they meant is that the encoding doesn’t affect the path length.
Windows uses UTF-16, but purely hypothetically, if it used UTF-8, I wonder whether two code points representing the upper and lower case of the same character could have different encoded lengths. This seems extremely unlikely, but is it technically possible for unicode to do that? If so, you could end up in a scenario where a file name fits the max length constraints in one case but not the other.
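It is technically possible, and real pairs exist: U+0131 (Turkish dotless ‘ı’) takes two bytes in UTF-8, while its uppercase form is plain ASCII ‘I’, just one byte. A quick check (my sketch, not tied to anything Windows actually does):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+0131 LATIN SMALL LETTER DOTLESS I is two bytes in UTF-8;
           its uppercase mapping is plain ASCII 'I', one byte. */
        const char *lower = "\xC4\xB1"; /* "ı" */
        const char *upper = "I";

        printf("lower: %zu bytes, upper: %zu bytes\n",
               strlen(lower), strlen(upper));
        return 0;
    }

Pairs exist in the other direction too: U+023A (Ⱥ) is two bytes in UTF-8 but its lowercase U+2C65 (ⱥ) is three. So a hypothetical byte-counting limit really could accept one case of a name and reject the other.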
Unlike long filename support, it wasn’t done in a backwards-compatible manner. Therefore many platforms and programs cannot open these files at all, depending on e.g. which calls are used. What good are filenames in your native language if they can’t be opened?
Still not sure I understand the problem here.
Regarding platforms that couldn’t open these files, we’re talking about Windows NT being a brand new operating system, and NTFS being a brand new filesystem. No other platforms would have been able to read NTFS when it was brand new, but this has nothing to do with Unicode – Microsoft could have just as easily created NTFS as an ASCII filesystem, and still no other platform would have been able to read it.
Regarding programs that couldn’t open these files, all native Windows NT programs would have been fine (assuming the developers didn’t introduce bugs), and all legacy 16-bit Windows and DOS programs would have seen the 8.3 short ASCII filename anyway, so it was 100% done in a backwards-compatible manner.
Either way, it’s been 32 years since Windows NT was released, so if there are still any platforms or programs that are unable to open files with Unicode filenames, that’s their fault.
The issue is not NTFS per se but rather the fact that e.g. fopen() won’t work for such filenames; _wfopen() has to be used instead.
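Concretely (a sketch assuming the MSVC runtime; the filename is made up):

    #include <stdio.h>

    int main(void)
    {
        /* fopen() takes a narrow string in the process's ANSI code page,
           which typically cannot express U+2665 at all; the wide variant
           takes UTF-16 and has no such limitation. */
        FILE *f = _wfopen(L"\u2665.txt", L"rb");
        if (f == NULL) {
            perror("_wfopen");
            return 1;
        }
        fclose(f);
        return 0;
    }

Any portable code written around fopen() has to be ported to the wide API (or wrapped) before it can reach such files.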
Minuous,
That’s an interesting problem that didn’t cross my mind.
Unfortunately it doesn’t work this way on Windows; the nice solution would have been for fopen() to support UTF-8 on Windows like it does on Linux/macOS/etc., increasing compatibility in the process.
https://github.com/ocornut/imgui/issues/917
I think UTF-16 support was an early selling point for Windows, but today it actually feels like a kludge compared to platforms that used the UTF-8 transition to extend the ASCII set.
Stack Overflow answers suggest a few workarounds to get programs that don’t support UTF-16 to open the files anyway, like copying the files or using GetShortPathName…
https://stackoverflow.com/questions/23285759/fopen-file-name-with-utf8-string-in-windows
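Sketching the GetShortPathName idea from those answers (a hypothetical helper of mine; it only works when 8.3 short-name generation is enabled on the volume):

    #include <windows.h>
    #include <stdio.h>

    /* Open a Unicode-named file with plain fopen() by going through its
       8.3 short name, which is ASCII-safe when one exists. */
    FILE *fopen_via_short_name(const wchar_t *long_name, const char *mode)
    {
        wchar_t short_w[MAX_PATH];
        char    short_a[MAX_PATH];

        if (GetShortPathNameW(long_name, short_w, MAX_PATH) == 0)
            return NULL; /* no short name available */

        /* Generated short names contain only ASCII, so this narrow
           conversion loses nothing. */
        if (WideCharToMultiByte(CP_ACP, 0, short_w, -1,
                                short_a, sizeof short_a, NULL, NULL) == 0)
            return NULL;

        return fopen(short_a, mode);
    }

Copying the file to an ASCII-safe name, as the other answers suggest, avoids the dependency on short-name generation entirely.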
OK, I understand. I guess you would just have to think about what the program is doing. For example, you couldn’t do this on Windows:
> fopen("♥.txt", "rb")
So in cases like this, where the program wanted to open a hardcoded filename (e.g. a config file), it would be limited to ASCII characters. However, I think this is a more common pattern:
> fopen(argv[1], "rb")
i.e. you take the filename from somewhere else such as a command line argument. Programs like this could still work unmodified, but you would need to pass the 8.3 short filename. So in certain circumstances, it could have some of the same backwards compatibility constraints as a program that wasn’t long filename-aware.