The absolute minimum every software developer must know about Unicode in 2023

Thom Holwerda 2023-10-02 General Development 8 Comments

A lot has changed in 20 years. In 2003, the main question was: what encoding is this?

In 2023, it’s no longer a question: with a 98% probability, it’s UTF-8. Finally! We can stick our heads in the sand again!
The question now becomes: how do we use UTF-8 correctly? Let’s see!

Everything you ever wanted to know about how Unicode works, and what UTF-8 does. Plus some annoying website design tricks, for which In apologise, even if it’s obviously not our site we’re linking to.

About The Author

Thom Holwerda

Follow me on Mastodon @thomholwerda@exquisite.social

8 Comments

2023-10-02 10:08 pm
Alfman verbose=1
But whatever you choose, make sure it’s on the recent enough version of Unicode (15.1 at the moment of writing), because the definition of graphemes changes from version to version. For example, Java’s java.text.BreakIterator is a no-go: it’s based on a very old version of Unicode and not updated.
This is exactly the problem with multi-character extended graphemes.
A length function that’s right one year will necessarily be wrong another after unicode updates. Not because the programmer’s implementation is faulty, but because the standard provides no way of predicting future graphemes. It’s literally impossible to know. The author suggests text length, index, substring, etc functions be updated every year, but regardless of what unicode intended it’s simply unrealistic. Even with the best intentions there will inevitably be differences due to circumstances outside of developer/user control. So personally I find it hard to blame languages or devs that would rather count individual unicode characters consistently rather than unicode graphemes that are redefined by the year.
I think languages and developers were setup to fail when unicode created graphemes with unpredictable sequences. That was a serious mistake. “How do I compute the the length of a string?” is a very reasonable question for language developers, database developers, etc. The answer of “you need to know all defined graphemes now and into the future”, is fundamentally not future proof at all. It’s no wonder so many languages don’t return the same value. I think unicode deserve most of the blame for failing to come up with a future proof solution.

2023-10-03 10:44 am
kurkosdr
If they restricted it to actual languages instead of also including yawning poop emojis, at least the graphemes would be updated only when new ancient languages or new linguistic rules were discovered, but nope, gotta have yawning poop emojis.

2023-10-03 2:12 pm
Alfman verbose=1
kurkosdr,
If they restricted it to actual languages instead of also including yawning poop emojis, at least the graphemes would be updated only when new ancient languages or new linguistic rules were discovered, but nope, gotta have yawning poop emojis.
Yeah, sticking to real languages would have made things far saner.
BTW the Unicode Consortium is an example of why committees are sometimes bad: A committee’s primary motivation is to perpetuate its existence
That’s a good point, but even to the extent that we accept all these new emojis, it really bothers me that the standard fails on so many technical levels. Future proofing could have been accomplished through simple consistency. For example the use of length prefixes could tell programmers how long a grapheme is without a dictionary.
U-grapheme-len-2 + U-code-1 + U-code-2
U-grapheme-len-3 + U-code-1 + U-code-2 + U-code-3
U-grapheme-len-4 + U-code-1 + U-code-2 + U-code-3 + U-code-4
…
This way all the emojis that have additional properties attached can always be handled by completely generic code in a genuinely future proof way even when they don’t have knowledge of what the grapheme represents.
The fact that unicode fails to deliver mathematical consistency for such basic text operations is very troubling and has implications for things like databases, file names, regular expressions, website errors, bugs and exploits in software, even consistent command parsing becomes very problematic. I’m not surprised by committees perpetuating their existence as you say, but It is just stunning to me that they failed so hard on technical grounds.

2023-10-04 3:20 am
sj87
Then you would end up with wasted space with everything non-ascii. As mentioned in the article, one design decision was to optimize data storage / transfer for the Latin alphabet. (Which includes non-ASCII letters from more complex languages.)

2023-10-04 5:47 am
Alfman verbose=1
sj87,
Then you would end up with wasted space with everything non-ascii.
All the regular unicode codes would still be represented exactly as is. There’s no length ambiguity for them since the lengths are well defined. It’s only the graphemes that present this problem. So for example, Kate and vim support these “normal” unicode characters fine…
F0 9F 8C 91
F0 9F 8C 92
F0 9F 8C 93
F0 9F 8C 94
F0 9F 8C 95
F0 9F 8C 96
F0 9F 8C 97
F0 9F 8C 98
But as mentioned in the article, every year new graphemes like this come out and they bug out applications…
F0 9F A4 A6 F0 9F 8F BC E2 80 8D E2 99 82 EF B8 8F ‍♂️
Kate interprets this as two separate characters ‍ and ♂️.
VIM interprets this as four buggy “characters” that it doesn’t know what to make of ♂️.
It’s one thing not to have the fonts needed to render symbols on screen, but having text editors and applications that aren’t even able to agree on where characters begin and end leads to bugs and inconsistencies. These are the fault of unicode failing to future proof the standard, the resulting application bugs are inevitable. Using a prefix is a very practical solution that would effectively solve the character bugs in a truly future proof way.
As mentioned in the article, one design decision was to optimize data storage / transfer for the Latin alphabet. (Which includes non-ASCII letters from more complex languages.)
Well, if you take a close look at UTF8 encoding, for better or worse it’s mathematically wasteful.
https://en.wikipedia.org/wiki/UTF-8
We should all appreciate the value that utf8 and unicode bring for interoperability, unfortunately they were quite wasteful in the design and graphemes have serious problems. Alas it seems impractical to fix these weaknesses at this point.
2023-10-04 5:53 am
Alfman verbose=1
All those unicode characters I pasted got wrecked…bah!
Firefox supports them, so I’m guessing wordpress or mysql stripped them out?
Anyway, how ironic!

2023-10-03 10:55 am
kurkosdr
BTW the Unicode Consortium is an example of why committees are sometimes bad: A committee’s primary motivation is to perpetuate its existence (which perpetuates the salaries and reputation of everyone involved). That’s all well and good if there is a going concern or when there is a need to leverage existing technologies (USB comes to mind), but Unicode should have been done 20 years ago the latest. There is no reason Unicode should need “updating” today.
One of the good things the US government did with ASCII is that they gave the relevant committee a deadline to get it done and then disbanded them. There were no “versions” of ASCII (as it shouldn’t be, just like there is no “versions” of what a two’s complement number is).

2023-10-03 10:47 am
kurkosdr
Plus some annoying website design tricks, for which In apologise, even if it’s obviously not our site we’re linking to.
I started reading on a smartphone, and was like “that’s not that bad”, and then switched to a desktop browser., and I was like “Whoa!! That’s just awful!”. Whose bright idea was to have fake cursors moving around the screen? Too bad such an informative article is hosted in a blog that’s so user-hostile as to make it unlinkable.