In this talk, we’re going to look at how file systems differ from each other and other issues we might encounter when writing to files. We’re going to look at the file “stack” starting at the top with the file API, which we’ll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we’ll see has serious bugs that cause data loss and data corruption; and then we’ll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.
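To make that concrete, the “careful” way to replace a single file’s contents through the POSIX API looks roughly like the sketch below (an illustration with made-up names, not code from the talk); every step that gets skipped in practice is an opportunity for data loss on crash or power failure:

    #include <fcntl.h>
    #include <unistd.h>

    /* Illustrative sketch of a crash-safe "replace the contents of a file".
     * Assumes path and tmp_path are in the current working directory. */
    static int replace_file(const char *path, const char *tmp_path,
                            const void *data, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) != (ssize_t)len) { /* real code must also handle short writes */
            close(fd);
            return -1;
        }
        if (fsync(fd) != 0) {                       /* force the new data to stable storage */
            close(fd);
            return -1;
        }
        if (close(fd) != 0)
            return -1;

        if (rename(tmp_path, path) != 0)            /* atomically swap the new file in */
            return -1;

        /* The rename itself is not durable until the containing directory is synced. */
        int dir = open(".", O_RDONLY | O_DIRECTORY);
        if (dir < 0)
            return -1;
        if (fsync(dir) != 0) {
            close(dir);
            return -1;
        }
        return close(dir);
    }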
Deeply technical, but well-written and pleasant to read.
TL;DR: Guaranteeing transactional semantics under power failure conditions is difficult.
I agree with the overall assessment; however, I think the author focuses too much on file system abstractions, which are not the cause of the problem. There’s nothing wrong with the APIs or file system abstractions; the failure modes arise out of implementation details.
One of the reasons this is so complicated for software to solve is that arguably it is a hardware domain problem. Most existing hardware does a terrible job here, and solving this problem (efficiently) in software seems like a bad way to go about it. It can be done by never using write-back caching, using playback logs for every operation, and/or flushing frequently, but at a large performance cost. I think what we really need is hardware innovation here, like using supercapacitors to allow for a clean shutdown of not only the media but also in-memory caches, so that software is never put in a position where it has to compromise between performance and safety. Server-grade RAID cards can help by using a battery-backed cache to decrease the overhead of write/sync operations; however, that’s not a complete solution, and unless you’re using a PCIe/NVMe interface, it’s still orders of magnitude slower than in-memory write-back caching.
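To illustrate the software workaround I’m describing: a playback/operation log only protects you if each record is forced to stable storage before the operation is acknowledged, along the lines of the sketch below (plain POSIX calls, names are mine). That per-record flush is exactly what kills performance on hardware without a non-volatile write cache.

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: append one record to an operation log and make it durable before
     * returning. The fdatasync() is what protects against power loss, and it is
     * also what makes every operation pay for a full device cache flush. */
    static int append_record(int log_fd, const void *rec, size_t len)
    {
        if (write(log_fd, rec, len) != (ssize_t)len) /* short writes need a retry loop */
            return -1;
        return fdatasync(log_fd);
    }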
I’m not holding my breath, but I’d like to see hardware manufacturers step up their game and address all the power failure modes. To be honest though, I’m not sure ordinary consumers would be willing to pay for any of this. If it ever gets done, it will likely be relegated to servers.
Some may even question the need for any new hardware power failure features because there’s some functional overlap with UPS backup power. I guess that’s a fair point, although it would still be nice to have integrated solutions that address power failure modes internally, even for systems that aren’t hooked up to a UPS. As an external device, a UPS isn’t foolproof: if it isn’t properly connected/configured, it may simply delay the inevitable power failure. Even when a UPS is present and working, some setups rely on a human operator to do a clean shutdown, particularly if you have multiple computers plugged into a UPS with only one monitoring cable(*).
* I think it would be pretty cool if, rather than using USB/serial cables, UPSes used a trivial standard powerline protocol to broadcast the power status to all connected PC power supplies, which the operating system could detect. Although I suppose without proper authentication this could be abused; maybe the computer could be paired to the UPS like one-touch Bluetooth/Wi-Fi!
I suspect that the “hardware solution” is almost universal at this point, and there are now relatively few “file systems” in use that will ever see “power failure”. All of the mobile phones, tablets and laptops of the world already have built-in battery backup, and will shut themselves down rather than risk unexpected power outage. I expect that all of the cloud-domain servers have contingencies of various sorts, if not their own power stations. I expect that most large-enough business servers will be on UPSes. That leaves a few (million) remaining desktop PCs. Almost not worth worrying about 🙂
On the PC front, many would now be running file systems with full data journalling. I assume that NTFS does; I know that ZFS does, and I’d be surprised if APFS doesn’t. That means that you might lose some recent data during an outage (and that might be important), but the system as of five minutes ago is probably secure. That’s probably good enough for most PC users, and most of them will actually be operating on non-local files anyway.
What the filesystem/OS folk are really going to have to figure out is how to best use the possibly-coming non-volatile main memory future.
areilly,
External solutions are somewhat less reliable though. For example, one of my former employers had a standby generator backup, but its switchover time was not nearly fast enough to keep computers & servers online. They solved this by buying lots of small UPSes, literally just to keep the servers online for a couple of seconds. But sometimes these would fail, and it turns out sometimes you need an expensive large UPS to keep the servers online even if it’s only for a brief amount of time. At my house I have a 1.5kW UPS that keeps my servers running during power events, but the security cameras regularly lose power on the same UPS. I’ve also witnessed cases where a UPS works at least 90% of the time, but occasionally the computer will lose power anyway – I assume due to the AC phase at the time of power loss. 50-60Hz AC kind of sucks for modern electronics. Power supplies have to have enough capacitance to hold 100% system load at the zero-voltage crossings twice every cycle. But even if the power supply is properly rated and to spec for normal conditions, it could fail in abnormal conditions like brownouts, or behind a UPS that takes too long to detect and correct the outage.
Some of you may have heard that New York City is having major power capacity problems. Fortunately our power on Long Island hasn’t gone out, but the AC line voltage coming in is at 90V (120V nominal), presumably due to the heat wave and all the air conditioners running. Lo and behold, I discovered that for one system this was not low enough to trip one of the UPSes, but it caused an immediate shutdown of my security camera computer. I don’t know how many times it went on and off during these brownouts, but I do know that it did in fact cause data corruption.
With enough trial & error & money you can probably solve all the error modes. But a big problem is even knowing something can go wrong before it goes wrong. The conditions may be hard to recreate, although in my case I found that running a high-current power tool on the same circuit as the UPS (not on the UPS itself) causes the security system to hard crash (even though none of my other servers are affected). I ordered a better power supply and some supercapacitors in hopes of solving the security camera outages.
Full journaling may be a software solution (ext3/4 support this too, btw), but most systems don’t enable it out of the box because of the performance hit. I’m not keen on having to accept bad performance in order to solve this problem; I’d rather have a proper hardware solution 🙂
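(For the curious: on ext4 this is the data=journal mount option; an /etc/fstab entry along these lines enables it, with the performance penalty I mentioned. The device and mount point here are just placeholders.)

    # illustrative /etc/fstab entry: journal file data as well as metadata
    /dev/sdb1   /data   ext4   defaults,data=journal   0  2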
For better or worse, the complex nature of SSDs means that it’s not entirely safe to assume that data written 5 minutes or even days ago is safe. SSDs have their own internal data structures that can become corrupted just like a file system. The author alluded to this in the article. Sometimes not only does recent data get lost, the entire “disk” can get bricked – I’ve had this happen to me, and what’s scary is that even RAID doesn’t protect you from power-loss-related media corruption (unless you have a battery-backed RAID controller to perform a clean shutdown).
Yes, that has the potential to bear fruit and solve these problems in the future.