On Halloween this year I learned two scary things. The first is that a young toddler can go trick-or-treating in your apartment building and acquire a huge amount of candy. When they are this young they have no interest in the candy itself, so you are left having to eat it all yourself.
The second scary thing is that in the heart of the ubiquitous IMAP protocol lingers a ghost of the time before UTF-8. Its name is Modified UTF-7.
Inside UTF-7 is an even older, more powerful ghost that is still lurking EVERYWHERE…
Base64. Needed to make binaries survive buggy conversions between ASCII and EBCDIC.
And now what is EBCDIC…. No one knows for sure, except that it is an Ancient One from the time of the mainframes. But the mainframes cannot be killed; they only slumber, to return when the stars are right.
Even BASE64 is riddled with incompatibilities. Been bitten by that before!
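For anyone who hasn’t hit it yet, here’s a minimal Python sketch of the most common incompatibility: the standard and URL-safe alphabets differ in two characters, so a decoder expecting one alphabet either rejects the other outright or, worse, silently mangles it (the byte values below are just picked so the differing characters show up):

import base64

data = bytes([0xfb, 0xef, 0xff])  # chosen so the two alphabet-specific characters appear

print(base64.standard_b64encode(data))  # b'++//'  (standard alphabet uses '+' and '/')
print(base64.urlsafe_b64encode(data))   # b'--__'  (URL-safe alphabet uses '-' and '_')

# A strict standard decoder rejects the URL-safe output outright...
try:
    base64.b64decode(b"--__", validate=True)
except Exception as exc:
    print("strict decoder:", exc)

# ...while a lenient one silently discards the "foreign" characters and returns the wrong data.
print(base64.b64decode(b"--__"))  # b'' -- the payload just vanished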
I’d love to understand the rationale behind EBCDIC.
Mind you, my first computer (actually /mine/, not borrowed) didn’t support ASCII at all. The ZX81 (and its predecessor) had a character set entirely of their own invention, with only 64 characters and nothing in any “standard” place.
The very first computer I programmed didn’t have a character set at all in any meaningful sense.
The KIM-1 had 7-segment (plus a dot) LED readouts, so the “character codes” were just the patterns of on/off segments you needed to represent each character. Not all characters could be displayed; “M” and “W” were simply not possible.
So… there is this thing called BCD (binary-coded decimal), where each decimal digit is written as its own group of binary bits. Think of it like an ASCII code just for digits, one that only needs to be 4 bits wide. The advantage of BCD is that it lets you put decimal numbers into a binary computer without any rounding inaccuracies from conversion to binary, which is important in the field of financial transactions.

Immediately, computer manufacturers realised that 4 bits gives 16 possible combinations while decimal digits use only 10 of them, which left 6 combinations to use as they pleased. That was Extended BCD, aka EBCD. Soon, BCD got extended even further, with 6-bit and 8-bit codes on top of the 4-bit code. Since the damn thing “grew” instead of being standardised, it came with some major WTFs, such as several incompatible variants and gaps of undefined segments in the encoding (head over to Wikipedia for the full list of atrocities against information technology and common sense).

So the rationale is in the name: an Interchange Code for Extended Binary-Coded Decimal, for computers belonging to a certain family.
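To make the “ASCII code only for numbers” idea concrete, here is a minimal packed-BCD sketch in Python (my own illustration, not any particular mainframe’s format): each decimal digit gets its own 4-bit nibble, two digits per byte, so a hex dump of the data reads like the decimal number itself.

def bcd_encode(number: int) -> bytes:
    """Pack a non-negative integer as packed BCD, two decimal digits per byte."""
    digits = str(number)
    if len(digits) % 2:  # pad to an even number of digits
        digits = "0" + digits
    return bytes(int(digits[i]) << 4 | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

def bcd_decode(data: bytes) -> int:
    """Unpack packed BCD back into an integer."""
    return int("".join(f"{byte >> 4}{byte & 0x0F}" for byte in data))

packed = bcd_encode(1995)
print(packed.hex())        # '1995' -- the hex dump is human readable as decimal
print(bcd_decode(packed))  # 1995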
The rationale for not replacing EBCDIC with ASCII is that software using EBCDIC makes certain assumptions about what a series of bits means, and those assumptions are baked inline throughout the code. Think of how many ASCII text tools assume that a textual byte with all bits zero is a string terminator (which obviously fails in 16-bit Unicode) and you get the idea why you just can’t easily convert EBCDIC software to ASCII or UTF-8.
Edited 2018-12-01 23:28 UTC
BTW when I said replacing EBCDIC with ASCIIz I meant it for the textual part of the software.
Edited 2018-12-01 23:33 UTC
OK. I guess the connection to BCD makes some sense. The thing that REALLY irks me is the non-contiguous nature of letters.
And I think I know why that’s the case.
It was only ever a system for storing “BCD” and that only needs 10 rows on a punch card (yup, punch cards).
There’s a picture on Wiki here https://en.wikipedia.org/wiki/EBCDIC#/media/File:Blue-punch-card-fro…
The 3 top rows (unlabelled) are a kind of “zone/register select”, and a character is encoded by the absence or presence of a single hole in each column (at least for the zone and number rows).
OK, so I reckon the engineers put the blocks of characters on multiples of 16 because it’s easier to calculate in binary (possibly), but there simply are no card rows for values 10 through 15. Therefore there are gaps in the letter parts of the char sets: the last 6 positions of each block of 16 are empty – later they went back and backfilled them.
In my “ideal” universe ‘0’ is at 0x00 and the digits are immediately followed by letters. *sigh*
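If you have Python handy, you can see the gaps for yourself: it ships a cp037 codec, one of the (many) EBCDIC variants, so assuming that one is representative of the layout:

# cp037 is one of the EBCDIC code pages bundled with Python.
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789":
    print(f"{ch} -> 0x{ch.encode('cp037')[0]:02X}")

# The letters come out in three non-contiguous runs:
#   A-I at 0xC1-0xC9, J-R at 0xD1-0xD9, S-Z at 0xE2-0xE9
# with the digits at 0xF0-0xF9. So 'I' + 1 is not 'J', and 'R' + 1 is not 'S'.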
Yes, that’s the main problem with EBCDIC. Since it “grew” instead of being standardised, not only are there multiple incompatible encodings (which came into existence as the thing was extended on an as-needed basis), but the mapping of the encoding was made to serve the implementation details of the era (mainly punch cards, but who knows what else) as the first and foremost priority.
BTW there is one implementation detail from the punched-media era that has crept into ASCII: the DEL character is encoded with all 7 bits set to one, which places it on the other side of the table compared to the other control characters. On paper tape you cannot un-punch a hole, so it was agreed that punching out all the holes would be the way to delete (i.e. ignore) a character, and having DEL map to all ones made that work (the parity bit would make sure the eighth hole got punched too).
Implementation details do creep into all standards. It’s the reason the distinction between RGB 16-235 (“limited range”) and RGB 0-255 (“full range”) exists in digital video (it’s in the Nvidia control panel and the Intel Graphics control panel, in case you want to find it). When analog video was digitized by studios back in the early days of digital video, it wasn’t usually sent to consumers but kept archived or converted back to analog, so it made sense to encode the signal as-is, including the level difference between v-sync (RGB 0) and the 0% black level (which is always higher than the v-sync so the analog TV can differentiate, and was mapped to RGB 16). Unfortunately, PCs used the full RGB range for things like png and gif files and even computer-generated video, and mapped 0% black to RGB 0, so the difference exists to this day and is handled by the GPU. HDMI can carry both RGB 16-235 and RGB 0-255 (in most TV sets anyway).

Another implementation detail in the HDMI standard is the fact that 1080p 3D is capped at 24Hz. Since 1080p 3D has slightly more than double the bitrate of 1080p 2D, capping 1080p 3D at 30Hz (rather than 24Hz) would push it past the bandwidth of 1080p at 60Hz and would require TV manufacturers to change the pixel clocks in their HDMI inputs to accommodate the higher bandwidth. Instead it was decided to let TV manufacturers keep their pixel clocks. Which makes Nvidia 3D Vision not work well over HDMI 3D, because most games won’t allow a refresh rate setting of 24Hz.
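For what it’s worth, the remap itself is trivial; here is a rough sketch (8-bit, ignoring the slightly different chroma range) of what the GPU or the TV does when expanding limited range to full range, and why badly configured gear loses a little precision every time it converts back and forth:

def limited_to_full(value: int) -> int:
    """Expand a limited-range (16-235) code value to full range (0-255)."""
    scaled = round((value - 16) * 255 / 219)
    return max(0, min(255, scaled))  # clips "blacker than black" / "whiter than white"

def full_to_limited(value: int) -> int:
    """Squeeze a full-range (0-255) code value into limited range (16-235)."""
    return 16 + round(value * 219 / 255)

print(limited_to_full(16))   # 0   -- limited-range black becomes full-range black
print(limited_to_full(235))  # 255 -- limited-range white becomes full-range white
print(limited_to_full(10))   # 0   -- sub-black noise gets clipped
print(limited_to_full(full_to_limited(4)))  # 3, not 4 -- squeezing 256 values into 220 loses a little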
Implementation details are inside all standards. EBCDIC was just worse than average because there was no real standardization.
Edited 2018-12-02 21:51 UTC
BTW the reason 1080p 3D requires more than double the bandwidth of 1080p 2D, instead of just double, is that the left and right images are placed on top of each other and separated by 45 horizontal lines of black (aka 1920*45 pixels), in order to give 3D TVs the time to swap from left to right without needing any buffering. Which makes the 24Hz restriction for 1080p 3D in HDMI an implementation detail meant to serve another implementation detail.
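If you do the arithmetic with the CEA-861 timing totals (quoting them from memory here, so treat the exact numbers as approximate), you can see why 24Hz was the convenient cap: frame-packed 1080p24 lands on exactly the pixel clock that 1080p60 2D already required, so existing receiver silicon could take it as-is.

# CEA-861 totals (active + blanking), as I remember them:
h_total_1080p60 = 2200  # pixels per line at 1080p60
h_total_1080p24 = 2750  # pixels per line at 1080p24 (longer horizontal blanking)
v_total_2d = 1125       # lines per frame (1080 active + 45 blanking)
v_total_packed = 2250   # frame packing stacks two 1125-line frame periods

print(h_total_1080p60 * v_total_2d * 60 / 1e6)      # 148.5 (MHz) -- plain 1080p60
print(h_total_1080p24 * v_total_packed * 24 / 1e6)  # 148.5 (MHz) -- frame-packed 1080p24, same clock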
Edited 2018-12-02 21:58 UTC
kurkosdr,
I’m coming from a different angle, but I actually think it’s very fortunate that computers use the full 0-255 RGB range for things like png files and even HTML files, where #000000 is supposed to be black. It’s better for electrical engineers to remap the ranges on their end than for a subset of the 8-bit color range in file formats to accommodate studio voltage levels. Voltage requirements can change; just think of how confusing it would be for digital formats to accommodate electrical requirements: your PNG file is incompatible with your TV, try a different TV or use an app to re-encode your PNG.
IMHO it makes more sense for hardware to remap 0-255 RGB to the 0.39%–100% voltage levels or whatever is needed by studio standards. This isn’t a problem anywhere today, is it? If it is, do you have links so that I could read more about it?
Edit: * I have a limited electronics background, but I would think an opamp with a simple resistor network could do the trick of remapping one range to another. Do you have any insight into why they wouldn’t have remapped the full range of values from 0-255, with 0 for black?
Edited 2018-12-03 07:32 UTC
I agree that RGB 16-235 is the most idiotic thing ever, or the most idiotic thing I have ever seen at least. Much like the case of capping 1080p 3D at 24Hz, some part (that ceased to be relevant a couple of years ago) of some input (some DAC or ADC component, in the case of RGB 16-235, I guess) had to be preserved, and we are stuck with it. It’s in your GPU’s settings and it’s in DVDs, BluRays and MP4s. Don’t know about WebM, but I wouldn’t be surprised if it has it too. And “limited-range” RGB creates problems, such as compression noise that is “blacker than black” and “whiter than white” (which shows up on poorly calibrated TVs), and quality loss from converting between limited range and full range and vice versa (sometimes both ways, on some poorly configured equipment).
Edited 2018-12-03 14:19 UTC
kurkosdr,
I didn’t realize at first what you were referring to, but now I understand you are talking about this:
https://referencehometheater.com/2014/commentary/rgb-full-vs-limited…
The levels are a byproduct of the different color spectrum slices represented by the RGB and YCbCr color models.
https://www.faceofit.com/ntsc-vs-srgb/
Yeah some formats represent color in YCbCr, others in RGB, and some can do either. I’m most accustomed to PC RGB colors and like the simplicity of RGB representing the intensity of physical pixel elements in the screen. There’s a solid logic to using that, but I guess it’s a real tossup for color scientists because YCbCr is supposed to more accurately represent colors the way we perceive them.
This is an interesting topic to have in the context of UTF-7, haha!
BTW I think I found a real answer to this. Basically, early video engineers in the TV realm didn’t grasp (or didn’t want to use) the concept of metadata/headers/etc and instead wanted to encode video as a constant stream of bits, v-sync lines and all to separate the frames; and if you mapped the v-sync level to RGB 0 and 0% black to RGB 16, the generation of the v-sync lines and the correct level for 0% black happened for you automagically (sigh).
https://www.vegascreativesoftware.info/us/forum/studio-rgb-computer-…
Formats that never got near a studio in those early days (that’s ASF aka WMV) always use full range RGB 0-255.
The above theory also explains the duality between 720 and 704 horizontal pixels in the supported DVD resolutions. Since the early digital-video TV engineers presumably also stored the non-visible parts of each line in the bitstream, they never felt the need to define the width of a TV picture (in analog terms, i.e. where the picture ends and the back porch and front porch begin), so DVD had to accommodate both valid widths (the two digital widths were chosen based on a sampling-theorem best case and on compatibility with division by 16).
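Back-of-the-envelope, assuming the usual BT.601 numbers: the 13.5 MHz luma sampling rate over the portion of the line that gets digitized gives you 720, while sampling only the nominally visible ~52 µs gives you roughly 702, rounded up to 704 – and both figures happen to divide by 16, which MPEG macroblocks want.

sample_rate = 13.5e6      # BT.601 luma sampling rate, Hz
digitized_line = 53.3e-6  # roughly the slice of each line that gets digitized, seconds
visible_line = 52.0e-6    # roughly the part an analog TV would actually display

print(sample_rate * digitized_line)  # ~720 samples
print(sample_rate * visible_line)    # 702 samples -> rounded up to 704
print(720 / 16, 704 / 16)            # 45.0 44.0 -- both widths are multiples of 16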
Most RGB 16-235 video is indeed YCbCr converted to RGB 16-235, but YCbCr didn’t have a reason to use a non-zero positive value for 0% black either.
BTW YCbCr does not represent colors the way we perceive them; RGB does that. YCbCr has the advantage that it creates a grayscale representation of the video for analog TVs, and that it allows Cb and Cr to be stored at half the resolution, in width and/or height, with minimal loss of subjective picture quality (due to the way color resolution is perceived compared to luminance resolution, perhaps that’s what you meant above).

This was beneficial in analog TV because the Cb and Cr streams (actually Pb and Pr in the analog domain) have half the horizontal resolution of luminance: they are mixed together in a phase-amplitude modulation scheme and then combined with the grayscale image, essentially becoming noise of a specific frequency range within the grayscale image, with that range sitting at the higher end of the luminance signal’s frequency range. If the modulation frequency is chosen correctly, it’s invisible on most TV sets due to the way low-pass filtering works (all but the most perfect analog AV equipment acts as a low-pass filter on the signal, video or audio). The ability to reduce Cb and Cr resolution was beneficial to digital video as well, so YCbCr was kept.
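For reference, the luma/chroma split is just a weighted-sum change of basis. A rough full-range BT.601-style sketch (leaving out the limited-range offsets discussed above, and using the commonly quoted approximate coefficients):

def rgb_to_ycbcr(r: float, g: float, b: float):
    """Full-range BT.601-style conversion; inputs and outputs nominally 0..255."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # the grayscale picture a B&W set would show
    cb = 128 + 0.564 * (b - y)             # blue-difference chroma
    cr = 128 + 0.713 * (r - y)             # red-difference chroma
    return y, cb, cr

print(rgb_to_ycbcr(255, 255, 255))  # (255.0, 128.0, 128.0) -- white: full luma, neutral chroma
print(rgb_to_ycbcr(0, 0, 0))        # (0.0, 128.0, 128.0)   -- black
print(rgb_to_ycbcr(255, 0, 0))      # pure red: low luma, most of the information lands in Cr

# Because the eye resolves less chroma detail than luma detail, Cb and Cr can then be
# stored at half resolution (4:2:2 / 4:2:0) with little visible loss.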
Edited 2018-12-03 16:16 UTC
kurkosdr,
Lots of informative stuff there. I can’t believe the studios screwed up the tech so badly with range limited sRGB. That link suggests sony vegas doesn’t even support the full range of bytes for color values. If not, you are right, that’s such a stupid limitation.
That is what I meant.
Another thing that we tend to overlook is that not all monitors have historically used identical color phosphors.
http://www.labguysworld.com/crt_phosphor_research.pdf
http://www.babelcolor.com/index_htm_files/A%20comparison%20…
I vaguely remember that Apple screens used different phosphors and therefore emitted slightly different wavelengths than PCs; consequently the same RGB color values used by software and image formats would look subtly different. RGB is an impractical format for faithfully representing colors when the color primaries aren’t globally fixed. So I believe YCbCr would have been more likely to be used in high-end software that had to be calibrated across platforms and even paper mediums.
I don’t know if this is still true on modern oled, lcd, plasma, etc technologies?
While I have an appreciation for color science/theory, most of this stuff can be filed under “it’s good enough” for my daily needs, haha.
sRGB is not limited-range RGB / Studio RGB / RGB 16-235. sRGB means a completely different thing.
Edited 2018-12-03 17:31 UTC
kurkosdr,
You’re right.
Why is something that should be straightforward so complex? Then again, it’s not my domain of expertise. I’m sure plenty of people approaching computer science would question the way we do things too. “Bah, it’s stupid for things to be that way!” Haha.
Edited 2018-12-03 18:07 UTC
Most LCD TVs have completely inaccurate colours out of the box: wrong color temperature settings that give everything a blue-ish bias (to give the impression of higher brightness), as well as wrong chroma settings and special “dynamic contrast” filters that clip the blues and the reds and crush some dark detail (to give the impression of better contrast).

You see, LCD is actually a poor technology compared to Plasma and CRT, having greyish blacks and dim whites. Direct LED fixes most of the problem, but is expensive and has its own problems (halo effect). Normally, LCDs should only be used where dpi and/or power requirements have to be met, i.e. on portable screens, but since LCD is a cheap technology with high margins compared to Plasma, it is sold to consumers as the pinnacle of TV technology, and when the technology can’t deliver, cheating (i.e. clipping filters and wrong settings) is applied to give the impression of high performance to an eye that hasn’t seen better. If you really hate someone, tell them to point their spankin’ new LCD TV at the Lagom LCD test website (via a laptop or by saving the images to a USB stick) and have them compare the results with a laptop screen.

BTW my current LCD TV is an LG, which of course has wrong settings out of the box as expected, but it allows you to set everything to calibrated settings and even has a pre-calibrated picture mode, and everything looks much more natural than with the out-of-the-box settings.
LCD PC monitors are most of the time well-calibrated, but even then there are actually manufacturing differences that make a “perfect” RGB impossible, although too small to perceive without a colorimeter.
So, to conclude, RGB differences are irrelevant to most people.
Edited 2018-12-03 17:51 UTC
In these days, low power consumption is important everywhere…
Bingo. That’s exactly the problem. The conversion from RGB can produce out-of-range chroma values.
I still see this on various, usually old, TV broadcasts. Large areas, usually of red, that /should/ have texture/gradients just max out as bright red.
Ha! Yeah. I’ve worked on video cards, GPUs, video drivers, video encoders/decoders and god only knows what in the last (nearly) 30 years.
And that “limited” range is a real pain. It’s even sillier now, as little of our content requires those “sync” levels.
Of course the original video recordings were literally (literally, not figuratively) the output of a camera. Sync pulses and all.
Back to the ZX81 – the original ULA didn’t provide proper 0V level HSYNC pulses OR colour burst. Consequently an old ZX81 doesn’t play well with most modern TVs that need those levels for calibration. So, the older the TV the better.
kurkosdr,
Your post is informative; I didn’t realize this was the way EBCDIC evolved. I always found it strange when dumping VSAM files on the mainframe that numbers were encoded to be human readable.
However I do want to make one correction: BCD is only important for humans. As far as the math goes, binary numbers work fine for financial transactions. Floating point numbers exhibit rounding instability when converting between number bases and should be avoided; integer numbers, however, are exact regardless of number base. You can use what’s called fixed-point arithmetic to avoid floating point error. So long as you choose the one’s unit to represent the smallest currency unit you need, there will be no rounding inaccuracies, since every amount is then a whole number, and whole numbers are represented EXACTLY in binary (and in every other number base, including 10 for BCD).
So for example, if the smallest currency unit you want to be able to represent is a penny, then you make 0b0001=$0.01, 0b0010=$0.02, 0b0011=$0.03 and so on.
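A trivial sketch of the idea (the tax rate and the rounding rule are just made up for the example): hold everything as integer cents and only format as dollars at the edges.

# All amounts are held as integer cents, so every value is exact in binary.
price_cents = 1999  # $19.99
quantity = 3
subtotal_cents = price_cents * quantity

tax_rate_bp = 825  # 8.25% expressed in basis points (hundredths of a percent)
tax_cents = (subtotal_cents * tax_rate_bp + 5000) // 10000  # round half up to the nearest cent

total_cents = subtotal_cents + tax_cents
print(f"${total_cents // 100}.{total_cents % 100:02d}")  # $64.92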
This came up once before:
http://www.osnews.com/thread?655616
I think BCD (and hence EBCDIC) makes lots of sense given the historical context in which humans were directly programming mainframes by hand, but it’s less important today.
Edited 2018-12-03 06:33 UTC
Basically, financial institutions wanted fixed-point decimal math. I don’t know the exact details, but my guess is there are all kinds of complex accumulated-interest calculations for which it makes sense, or it was just for use by humans. Not much clue, really. Also, upon further reading, it turns out BCD was first extended to BCDIC (multiple incompatible encodings) and then to EBCDIC. Go figure.
kurkosdr,
When mainframes used punch cards, BCD enabled programmers to literally punch every decimal digit as input directly into each byte/nibble. If mainframe computers had insisted on interacting in binary, the programmers would have had to convert every number into binary for input and back from binary for output, which would have been even more unbearably inefficient and error-prone than it already was.
I wasn’t around back then, but logically, considering that BCD numbers have a 1-to-1 correspondence with binary numbers, there’s no mathematical need for BCD, and I’m pretty certain that human data-entry requirements were the driving factor for its adoption in early systems. In any case, it’s extremely rare to see BCD today except in legacy systems that continue the mainframe tradition.
There’s a surprising number of sensors that provide data in BCD.
The stars have aligned. All hail the Ancient Ones reborn!
https://www.serverwatch.com/server-news/ibm-z14-mainframe-and-power9…
https://www.forbes.com/sites/forbestechcouncil/2018/07/06/guess-what…
How come there’s no “super clever” and contrarian comment from Thom next to this quote? :-O :-O :-O What happened?
Who else remembers ISO-2022-JP and Shift-JIS?
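Both are still only an encode() away in Python, and the same three characters come out as completely different byte sequences (ISO-2022-JP even carries escape sequences to switch character sets mid-stream):

text = "日本語"  # "Japanese (the language)"

print(text.encode("shift_jis"))   # b'\x93\xfa\x96{\x8c\xea'
print(text.encode("iso2022_jp"))  # b'\x1b$BF|K\\8l\x1b(B' -- note the ESC sequences
print(text.encode("utf-8"))       # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'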
Edited 2018-12-03 16:50 UTC