By far, the worst part of working on beets is dealing with filenames. And since our job is to keep track of your files in the database, we have to deal with them all the time. This post describes the filename problems we discovered in the project’s early days, how we address them now, and some alternatives for the future.
There was a time in computing when English was enough and users were expected to deal with, or abide by, the English language. Even though that kept things relatively easy, it was far, far from a reasonable solution.
The fact that no language I know of can be reduced to a fundamental set that spans all languages complicates things by many orders of magnitude. National pride and fear of losing expressiveness are also at play.
After a couple of experiments we arrived at a compromise: Unicode. It is far from easy, has many quirks and tricks, and although it improved the situation, it is far from perfect (then again, “perfection” is a human yearning condemned to certain failure). For any developer dealing with internationalization it was challenging, yet way better than what existed before.
OSes needed time to adapt, language developers needed time to adapt and, down the road, application developers needed time to adapt. Unluckily, time is a precious asset that is always running low. As always happens with complicated things, no one got it right on the first shot: not the Unicode committee, not the OS developers, not the language developers and, even less, the application developers. More time “wasted”.
But here we are, a couple of years later, a bit more resigned to the compromises chosen and fighting to deal with the quirks and tricks.
Unicode is far from perfect, took more time to develop and integrate than we hoped, and presents challenges to developers and users alike, but it is a good-enough compromise. We just need to pick the building blocks: UTF-8, UTF-16 or UTF-32?
All good, except that Windows went with UTF-16 and Linux with UTF-8 for file system names and text encoding.
For those unaware of some of the difficulties any of them presents (be it UTF-32, UTF-16 or UTF-8):
– graphemes cannot be represented by just one basic “unit” in UTF-8 (1 to 4 bytes; yes, it is variable-width), nor in UTF-16 (1 or 2 unsigned 16-bit integers);
– UTF-32 is extremely wasteful (most languages really use a relatively small subset of all possible values, one that could fit in 1 or 2 bytes instead of 4);
– sorting, a “basic” need present in all languages, is not as easy as it used to be in the “good ol’ days”.
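To make the first two points concrete, here is a small Python sketch (standard library only) that counts how many encoding “units” a few sample characters need:

```python
# Count how many code units each encoding needs per character.
# (Little-endian variants avoid the BOM that plain "utf-16"/"utf-32" prepend.)
for ch in ["a", "é", "€", "🎵"]:
    units8 = len(ch.encode("utf-8"))            # 8-bit units (bytes)
    units16 = len(ch.encode("utf-16-le")) // 2  # 16-bit units
    units32 = len(ch.encode("utf-32-le")) // 4  # 32-bit units
    print(f"{ch!r}: {units8} UTF-8, {units16} UTF-16, {units32} UTF-32")
```

“a” takes 1 unit everywhere, while the musical note (U+1F3B5) takes 4 UTF-8 bytes and a UTF-16 surrogate pair, and UTF-32 spends 4 bytes even on plain ASCII.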
As I remember, a “grapheme” isn’t actually a technical term in the Unicode spec… just “grapheme cluster”, which is a somewhat unhelpful shortening of “codepoint cluster representing a grapheme”.
…and even in UTF-32, there’s no guaranteed correspondence between a code point and a grapheme/character.
The misconception that fixed-width encodings for Unicode exist is a Western-ism that will break languages with more complex systems of diacritics and other combining characters.
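A short stdlib Python illustration of the code point vs. grapheme gap, which no fixed-width encoding can paper over:

```python
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT: same grapheme, two code points

print(len(composed), len(decomposed))  # 1 vs 2 code points for one visible character
print(composed == decomposed)          # False: equal only after normalization
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Even UTF-32 needs two fixed-width units for the decomposed form:
print(len(decomposed.encode("utf-32-le")) // 4)  # 2
```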
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning…
https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/
Also, if you’re developing software…
https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-a…
(Together, those three articles are what I consider to be an excellent crash course in writing something that’s minimally likely to have glaring internationalization bugs.)
Edited 2018-05-05 10:17 UTC
True. I just did not want to complicate things even more for people unaware of the challenges; that is why I used “compromise solution” and complained about languages and tautology.
When will you guys ever give up? All this nonsense is a “solution” in search of a problem.
Get a clue. All you are doing is making a bigger mess of things in the name of cleaning them up.
All you need to do is look at how the GUI has devolved into a horrid mess on every computing platform it has ever touched…
What does this have to do with anything?
Multiple language support exists whether it’s GUI, or IME or in this case, FILENAMES, which have NOTHING to do with GUI.
Bullshit. We’ve been able to access the *filename* of files for decades now, no matter the language or operating system; if we couldn’t, you wouldn’t be able to create or read a file to begin with on *ANY* hardware.
All this nonsense basically amounts to scrambling the data within the file structure in ways that make it inaccessible without jumping through a bunch of useless hoops to get at it.
It never was. Even in ASCII, sorting by numerical comparison of the bytes will NOT sort your strings alphabetically. You will get “automobile” after “Zebra”. And Latin-1 added accented vowels, but they come well after their non-accented counterparts. Of course, Unix doesn’t care and pretends to do alphabetical sorting while in fact doing numerical sorting, much like C pretends to have arrays as passable parameters to functions, when they decay to simple pointers once you pass them in.
Anyway, use tables or rules to sort Unicode.
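The byte-numeric vs. rule-based difference is easy to see in Python (a minimal stdlib sketch; real rule-based collation would use locale tables or ICU):

```python
words = ["Zebra", "automobile"]

# Numeric comparison of code points: uppercase 'Z' (90) < lowercase 'a' (97)
print(sorted(words))                    # ['Zebra', 'automobile']

# Even a simple case-folding key fixes the ASCII case, but it is a rule,
# not a property of the byte values themselves.
print(sorted(words, key=str.casefold))  # ['automobile', 'Zebra']
```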
True. I was again just simplifying the subject, especially when comparing the current situation with what it was in ASCII times.
Sorting is a kind of specialization of collation, with almost no agreement about the correct/optimal heuristics.
Not to mention that, when confronted with a directory that has filenames in Hebrew, Russian, Arabic, Greek and English, which language (character set) goes first?
What about Latin-based languages? Say you have a list of files called “Barnacle”, “Büch”, “Bye” and “Cat”. Büch is German, so does it go between Barnacle and Cat or should it come before (or after) the English words?
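This exact example can be tried in Python with the stdlib `locale` module (the `de_DE.UTF-8` locale name is an assumption; whether it is installed depends on the system):

```python
import locale

words = ["Barnacle", "Büch", "Bye", "Cat"]

# Raw code point order puts 'ü' (U+00FC) after every ASCII letter,
# so "Büch" sorts after "Bye":
print(sorted(words))  # ['Barnacle', 'Bye', 'Büch', 'Cat']

# With a German locale, locale-aware collation treats 'ü' roughly like 'u',
# putting "Büch" between "Barnacle" and "Bye".
try:
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))
except locale.Error:
    print("de_DE.UTF-8 locale not available on this system")
```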
These should be questions for the linguists to tackle, should they decide to go out on a limb and make the future for once instead of studying the past…
Why is it for the linguists to tackle? Nearly every culture has already figured out its own language’s sorting rules.
It is just the English-speaking computing world refusing to learn how other languages work.
The problem is that there is no agreement, given a list of words from different languages, about which comes first. In the past, the problem didn’t exist, but it does now.
Forget about computing. If I give you a set of books called “Barnacle”, “Büch”, “Bye” and “Cat” and tell you to sort them on a bookshelf, how would you do that? There is no common agreement, even among linguists, on how this should be done. There should be.
Why does there need to be agreement?
If I change my locale to German, then German sorting rules should apply. Pretty sure they already have their own preferences for sorting.
What’s the German preference for sorting a list of movie titles containing German, Hebrew, Japanese, and Hindi names?
Why are programmers always so obsessed with the 0.1% case that they’ll ignore the 99.9% case that is solved?
Most Germans will have only German-titled movies to worry about, whether they are originally German or translated to German.
I speak and compute in English, I’m learning French (I am Canadian; French is our second official language), and, just off the top of my head, my music collection contains artist and track names in Gaelic (Celtic music), German (classical music), Italian (classical again), Japanese (anime music), Chinese (C-pop), Korean (K-pop), Hindi or Tamil (Devanagari, Bollywood), Russian (Cyrillic, random Russian music), and Polish (songs an acquaintance introduced me to).
It has nothing to do with my being a programmer.
It has even more to do with your being a programmer. Do you really think you represent 99.9% of all cases? That’s the programmer mentality: that your use case is somehow representative when it’s in the minority. The mentality that a solution for your very niche use case should carry the same weight as the solution for the common cases.
No, but I think that, in this era of global culture flow, a not at all trivial portion of people are likely to have at least a small number of foreign songs show up in a listing they might want to sort. (Especially within the EU, where so many people are multilingual and international travel is easy.)
My issue is with you characterizing it as 0.1%.
The world has 7 billion people. Even assuming there are only 1 billion people who have the economic privilege of owning some kind of computing device capable of acquiring content in multiple languages, 0.1% of 1 billion is 1 million people.
I think 1 million people is a pretty generous estimate for the number of people for whom they have so much varied content that this multi-language sort is the problem.
And in this highly exclusive club, I would focus more on language identification and categorizing file names by language, rather than mixing them all up into one kludge. Most content management systems already categorize things by other criteria, like genre, so this isn’t a problem.
We actually have two separate issues here. Your OS doesn’t have to worry about sorting files; it just needs a unique way to deal with Unicode code points and to provide a basic collating sequence. It is a mostly solved problem that more or less follows the Unicode normalization rules. Granted, it is not your OS that provides this basic handling directly but some critical system libraries, and what they cover are really very basic needs. I guess that is what kwan_e is talking about.
Now, for user apps, and ‘ls’ is one of them, that may not be considered an acceptable solution by most users, and so there should exist options that satisfy what users expect. Indeed, it is up to the music library management system to provide sensible choices, for example by eliminating diacritics, performing ligature and grapheme cluster expansion/elimination, and defining which language should show up first when presenting the data to the user. In this case, the right order is whatever the user selects/expects to be available, and that is precisely where we don’t agree on a correct/optimal solution, because one really does not exist.
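Diacritic elimination, one of the choices mentioned above, can be sketched in a few lines of stdlib Python (a naive illustration, not a full transliteration):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose to NFD so accents become separate combining marks, then drop them.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Büch"))  # Buch
print(strip_diacritics("café"))  # cafe
print(sorted(["Büch", "Bye"], key=lambda s: strip_diacritics(s).casefold()))
```

Note this is lossy and language-blind: a German speaker may expect “ü” to sort like “ue”, which this sketch does not do.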
*nod*
People refer to that kind of “ordering by raw ASCII values/Unicode code points” as ASCIIbetical sorting or, sometimes, lexicographic sorting.
Coding Horror did a commonly-linked blog post back in 2007, arguing against that and teaching people the search keywords “natural sort” and “human sort” for sorting algorithms which are smart enough to properly sort filenames containing non-zero-padded numbers.
https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
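A minimal “natural sort” key in Python looks something like this (a common stdlib-only idiom, not the blog post’s or beets’ actual implementation):

```python
import re

def natural_key(name: str):
    # Split into digit and non-digit runs; compare the digit runs numerically.
    return [int(part) if part.isdigit() else part.casefold()
            for part in re.split(r"(\d+)", name)]

tracks = ["track10.mp3", "track2.mp3", "track1.mp3"]
print(sorted(tracks))                   # ['track1.mp3', 'track10.mp3', 'track2.mp3']
print(sorted(tracks, key=natural_key))  # ['track1.mp3', 'track2.mp3', 'track10.mp3']
```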
Surely “locale sort” should suffice? Or even “i18n sort”. “Natural” and “human” make it sound like there should be one true sort for all of humanity.
Truth be told, locale sort does not solve the situation, even though it can be viewed as an acceptable compromise.
We want things to be reproducible in a predictable way. Now, take a language with diacritic marks and sort people’s names in it. The displayed order of homonyms that differ only in their diacritics may depend on the order of insertion. Not a desired effect, hence my comment a couple of posts ago.
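One way to make such output reproducible is a compound sort key: a collation-ish primary key plus the raw string as tie-breaker. A naive stdlib sketch (accent-stripping stands in for a real collation table here):

```python
import unicodedata

def stable_key(s: str):
    base = unicodedata.normalize("NFD", s)
    primary = "".join(ch for ch in base if not unicodedata.combining(ch)).casefold()
    # Raw string as secondary key: homonyms differing only in diacritics
    # always come out in the same order, regardless of insertion order.
    return (primary, s)

names_a = ["Helene", "Hélène"]
names_b = ["Hélène", "Helene"]
assert sorted(names_a, key=stable_key) == sorted(names_b, key=stable_key)
print(sorted(names_b, key=stable_key))  # ['Helene', 'Hélène']
```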
I meant it in the sense of having a suitable name, as opposed to “natural” or “human” sort. If not “locale” sort, maybe a different name would be “culturally aware” sort.
Compared to Unicode, though, it is really easy to sort things in ASCII. A simple alphabetic sort largely amounts to checking that each character is in the range a-z or A-Z, folding case with a single bit (if the second most significant bit of the 7-bit value, 0x20, is set, the letter is lowercase), and then doing a simple numeric sort.
With Unicode, you functionally have to work in code points (instead of raw UTF-8 or UTF-16 units). Then you have to figure out which language you are dealing with (because accented letters are ordered differently in different languages), and decide how to handle grapheme clusters that do not belong to that language. Only then can you start comparing code points, and you still have to deal with the fact that many grapheme clusters have two or more valid forms (unless you normalize first, which is its own headache), as well as the fact that many widely used languages don’t keep all their code points in one contiguous block.
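The “two or more valid forms” problem is not hypothetical. In Python, two strings that render identically can land on opposite sides of a third string, depending on which normalization form they happen to be in:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "café")  # 'é' as one code point (U+00E9)
nfd = unicodedata.normalize("NFD", "café")  # 'e' + U+0301 COMBINING ACUTE

print(sorted([nfc, "cafz"]))  # ['cafz', 'café']: é (U+00E9) sorts after z
print(sorted([nfd, "cafz"]))  # ['café', 'cafz']: base 'e' sorts before z
```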
http://userguide.icu-project.org/
Just use ICU and be done with it.
ICU is huge, and we all despise bloated code and think we can do better; except that most of us, myself of course included, cannot.
http://site.icu-project.org/download/61#TOC-ICU4C-Download
For something that does Unicode properly, those binary builds seem to be a reasonable size to me.
Totally agree, hence the rest of the sentence. It had a bit of sarcasm and criticism about what we, as developers, keep falling for.