One of the advantages of ZFS is that it doesn’t need a fsck. Replication, self-healing and scrubbing are a much better alternative. After a few years of ZFS life, can we say it was the correct decision? The reports in the mailing list are a good indicator of what happens in the real world, and it appears that once again, reality beats theory. The author of the article analyzes the implications of not having a fsck tool and tries to explain why he thinks Sun will add one at some point.
ZFS doesn’t need a fsck tool because of the way it works. It is designed to always have a valid on-disk structure. It doesn’t matter when you pull the plug on your computer; ZFS guarantees that the file system structure will always be valid. This is not journaling – a journaling file system can and will leave the file system structure in an inconsistent state, but it keeps a journal of operations that allows you to repair the inconsistencies. ZFS is never inconsistent, so it doesn’t need a journal (and it gains a bit of performance thanks to that).
How does ZFS achieve that? Thanks to COW (Copy On Write) and transactional behaviour. When the file system needs to write something to the disk, it never overwrites parts of the disk that are in use; instead, it writes to a free area of the disk. After that, the file system “pointers” that point to the old data are modified to point to the new data. This operation is done atomically – either the pointer is modified, or it isn’t. So if you pull the plug, the pointers will point either to the old data or to the new data. Only one of those two states is possible, and since both are consistent, there’s no need for a fsck tool.
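To make the idea concrete, here is a toy sketch in Python – purely illustrative, not how ZFS is actually implemented: the new version of the data is written into a free block first, and only then is the single pointer flipped, so a crash at any moment leaves you with either the old state or the new one, never a half-written mix.

    # Toy illustration of copy-on-write: the "pointer" is only flipped after
    # the new block has been completely written, so a crash at any point
    # leaves it referring either to the old or to the new data.
    class ToyCowStore:
        def __init__(self):
            self.blocks = {}      # block_id -> bytes ("the disk")
            self.root = None      # the one pointer that gets updated atomically
            self.next_id = 0

        def _alloc(self, data):
            # Write the new version into a free block; old data is untouched.
            block_id = self.next_id
            self.next_id += 1
            self.blocks[block_id] = data
            return block_id

        def update(self, data):
            new_block = self._alloc(data)   # crash here: root still points to the old data
            self.root = new_block           # the atomic "pointer flip"
            # crash after this line: root points to the new, complete data

    store = ToyCowStore()
    store.update(b"version 1")
    store.update(b"version 2")
    print(store.blocks[store.root])         # b'version 2'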
Data loss is not always caused by power failures – hardware errors occur too. All storage devices fail at some point, and when that happens and the filesystem is no longer able to read the data, you would traditionally use a fsck tool to try to recover as much data as possible. What does ZFS do to avoid the need for fsck in those cases? Checksumming and replication. ZFS checksums everything, so if there’s corruption somewhere, ZFS detects it easily. What about replication? ZFS can replicate both data and metadata in different places in the storage pool. And when ZFS detects that a block (be it data or metadata) has been corrupted, it just uses one of the available replicas and restores the corrupted block from that copy (a process called “self-healing”).
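A rough sketch of the read-path idea, again in plain Python and not the real ZFS code: every block carries a checksum, a second copy may exist elsewhere, and a read that fails its checksum is served from the surviving copy while the bad copy gets rewritten.

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).hexdigest()

    class ToyMirroredBlock:
        # Two copies of one block, plus the checksum the block is supposed to have.
        def __init__(self, data):
            self.copies = [bytearray(data), bytearray(data)]
            self.expected = checksum(data)

        def read(self):
            for copy in self.copies:
                if checksum(bytes(copy)) == self.expected:
                    self._heal(bytes(copy))   # rewrite any bad copy from the good one
                    return bytes(copy)
            raise IOError("all replicas corrupted - the case a repair tool would face")

        def _heal(self, good):
            for i, copy in enumerate(self.copies):
                if checksum(bytes(copy)) != self.expected:
                    self.copies[i] = bytearray(good)   # the "self-healing" step

    block = ToyMirroredBlock(b"important metadata")
    block.copies[0][0] ^= 0xFF     # simulate silent corruption of one copy
    print(block.read())            # b'important metadata'; copy 0 has been repaired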
Data replication means dedicating half of your storage space to the replicas. Servers don’t have much of a problem with this – the people running them can just buy more disks; they have the money. But desktop users don’t like it: most of them will choose more storage capacity over reliability. Their data is often not that critical (they aren’t risking money, like enterprises do), and nowadays an increasingly big portion of it is stored in the cloud. And while disks can fail, they usually work reliably for years (and no, it’s not really getting worse – disk vendors who sell disks that lose data too easily will face bankruptcy). It’s an even bigger issue for laptops, because you can’t add more disks to them. Even in those cases, ZFS helps: metadata (the filesystem’s internal structures, the directory structure, inodes, etc.) is duplicated even on single disks. So you can lose data from a file, but the filesystem structure will be safe.
All this means that ZFS can survive most hardware/software problems (it doesn’t remove the need for a backup: your entire disk pool can still get ruined by lightning). In other words, users don’t really need the fsck tool. Self-healing is equivalent to a fsck repair operation. ZFS does have a “scrub” command, which checks all the checksums of the file system and self-heals the corrupted blocks. In practice, it’s the same process as a fsck tool that detects an inconsistency and tries to fix it, except that ZFS self-healing is more powerful because it doesn’t need to “guess” how file system structures are supposed to look; it just checks checksums and rewrites entire blocks without caring what’s in them. What about corruption caused by bugs in the file system code? The developers have largely ruled out that possibility. The design of ZFS makes such bugs much harder to introduce and, at the same time, much easier to detect. ZFS developers say that if such a bug appeared they would fix it very quickly, but that such an event “would be very rare”; only the first person to hit the problem and report it would actually suffer from it.
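That difference is easy to picture in code. A toy scrub loop (a sketch under my own simplified model, not ZFS internals) walks every block, compares it against its stored checksum, and heals from a replica when one is available – it never has to guess what a directory or inode “should” look like:

    import hashlib

    def sha(data):
        return hashlib.sha256(data).hexdigest()

    def scrub(primary, replica, checksums):
        # Walk every block, verify its checksum and repair from the replica
        # if possible. Returns the ids of blocks that could not be healed.
        unrecoverable = []
        for block_id, expected in checksums.items():
            if sha(primary[block_id]) == expected:
                continue                               # block is fine
            if block_id in replica and sha(replica[block_id]) == expected:
                primary[block_id] = replica[block_id]  # self-heal from the good copy
            else:
                unrecoverable.append(block_id)         # nothing left to heal from
        return unrecoverable

    data = {1: b"alpha", 2: b"beta"}
    copies = {1: b"alpha", 2: b"beta"}
    sums = {i: sha(b) for i, b in data.items()}
    data[2] = b"bXta"                                  # simulate bit rot in the primary
    print(scrub(data, copies, sums))                   # [] - block 2 healed from its copy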
When you look at ZFS from this point of view, the fsck clearly becomes unnecessary. And not just unnecessary – it’s even a worse alternative to scrubbing + self-healing. So why care? It’s a thing from the past, right?
But what would happen if you found a ZFS pool so corrupted that it can’t even be mounted in order to be scrubbed? “That can’t happen with ZFS”, you say. Well, that’s true – it’s theoretically not possible. But what if corruption does happen? What if you faced a situation where you needed a fsck? You would have a big problem: a corrupted filesystem and no tools to fix it! And it turns out that those situations do exist in the real world.
The proof can be easily found in the ZFS mailing lists: there are people reporting the loss of entire pools or speaking about it: see this, or this, or this, or this, or this, or this. There are error messages allocated to tell you that you must “Destroy and re-create the pool from a backup source” (with 541 Google search results for that string), and bugs filed for currently unrecoverable situations (see also the “Related Bugs” in that link). There are cases where fixing the filesystem requires restoring copies of the uberblock… using dd!
It’s not that ZFS is buggy, misdesigned or unsafe – leaving aside the fact that new filesystems are always unsafe, ZFS is no more dangerous than your average Ext3/4 or HFS+ file system. In fact, it’s much safer, thanks to the integrated RAID and self-healing. Except for one thing: those filesystems have a fsck tool. The people who use them may run a higher risk of getting their filesystem corrupted, but when that happens, there’s always a tool that fixes the problem. I’ve had integrity problems with Ext3 a few times (Ext3 doesn’t turn barriers on by default!), but a fsck run always fixed the problem. Solaris UFS users have had similar problems. ZFS, however, leaves users who face corruption out in the cold. Its designers assumed that corruption couldn’t happen. Well, guess what – corruption is happening! After a few years, we can say that those assumptions were wrong (which doesn’t mean the rest of their assumptions weren’t correct and brilliant!).
In a thread we can find Jeff Bonwick, the leading Solaris hacker and main ZFS designer, explaining a corruption problem:
“Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is fraud to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.”
Well, ZFS was supposed to allow the use of “commodity hardware”, because its design was supposed to work around all the bad things that crappy disks do. But in that message we see how cheap, crappy hardware renders the ZFS anti-corruption design useless. So you aren’t 100% safe with every kind of commodity hardware; if you want reliability, you need hardware which is not that crappy. But even decent hardware can have bugs and fail in mysterious ways. If cheap hardware is able to corrupt a ZFS file system, why should we expect that expensive hardware won’t have firmware bugs or hardware failures that trigger the same kind of corruption, or new kinds of it?
But (I’ll repeat it again) this is not a design failure of ZFS – not at all! Other file systems have the same problem, or even worse ones. Jeff is somewhat right – bad disks suck. But bad disks are not a new thing: they exist, and they aren’t going away. The current “old” file systems seem to cope with them quite well – they need a fsck when corruption is found, and that fixes the problem most of the time. The civil society that Jeff dreams we should live in doesn’t exist, and never has. The Sun/SPARC world has very high quality requirements, but expecting that kind of quality from PC-class hardware is like believing unicorns exist.
This means that if you design something for PC hardware, you need to at least acknowledge that crappy hardware exists and is going to make your software abstractions leaky. A good design should not exclude worst-case scenarios. For ZFS, this means acknowledging that disks are going to break things and corrupt data in ways that the ZFS design cannot avoid. And when that happens, your users will want a good fsck tool to fix the mess or recover the data. It’s somewhat contradictory that the ZFS developers worked really hard to design those anti-corruption mechanisms, but left uncovered the extreme cases of data corruption where a fsck is necessary.

There’s an interesting case of ZFS corruption on a virtualized Solaris guest: VirtualBox didn’t honor the sync-cache commands of the virtualized drive. In a virtualized world, ZFS doesn’t only have to cope with the hardware, but also with the software and the file system running in the underlying host. In those scenarios, ZFS can only be as safe as the host filesystem is, and that means that, again, users will face corruption cases that require a fsck.
I think that pressure from customers will force Sun/Oracle to develop some kind of fsck tool. The interesting thing is that the same thing happened to other file systems in the past. Theoretically, a journaling file system like XFS can’t get corrupted (that’s why journaling was invented in the first place – to avoid the fsck process). When XFS was released, SGI used the “XFS doesn’t need fsck” slogan as well. However, the Real World doesn’t care about slogans, so SGI had to develop one; customers asked for it. The same thing is said to have happened to NetApp with WAFL: they said they didn’t need a fsck, but they ended up needing to write one anyway.
Wouldn’t it be much better to accept the fact that users are going to need a fsck tool, and design the filesystem to support it from day one instead of hacking it in later? Surprisingly, we have a recent example of that: Btrfs. From its creator: “In general, when I have to decide between fsck and a feature, I’m going to pick fsck. The features are much more fun, but fsck is one of the main motivations for doing this work.” Not only does Btrfs have a fsck tool, the filesystem was explicitly designed to make the fsck’s job more powerful. It even has some interesting disk-format features to make that job easier: it has something called back references, which means that extents and other parts of the filesystem carry a “back reference” to the structures that refer to them. This makes the fsck more reliable at the expense of some performance (you can disable them), but it has also made it easier (or possible at all) to support some things, like shrinking a filesystem. My opinion? This approach seems more trustworthy than pretending that corruption cannot exist.
Lesson to learn: all filesystems need fsck tools. ZFS pretended to avoid one, just like others did in the past, but at some point its developers will realize that there are always obscure cases where such tools are needed, and they will build new – and probably great – ones. Worst-case scenarios always happen, and users will always want a fsck tool when those obscure cases hit. Especially enterprise users!
About the author:
Pobrecito Hablador, an anonymous open source hobbyist.
Most of your listed problems are related to the problem of buggy hardware, resulting in failed transactions.
This PSARC http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-supp… will solve this.
We don’t need no stinking fsck.
Because ZFS is unsinkable.
That, plus background scrubbing, gives you the same end result as an fsck.
No real need for a separate fsck. If needed, one can just alias “zpool scrub <poolname>” to fsck and be done with it.
The article states directly that the problems causing the corruption were related to bad hardware! Bad hardware is a fact of life, and the existence of bad hardware is the reason why some fsck-like tool is needed.
Given the BER of normal hard disks, SATA cabling and all the other components involved in the job of storing data (a fact of life, too), it’s a wonder that people are still using filesystems without checksums.
But back to your comment: you don’t fight bad hardware with an inadequate tool like fsck… scrub in conjunction with the PSARC 2009/479 transaction rollback code is a much better solution.
Most of your listed problems are related to the problem of buggy hardware, resulting in failed transactions.
That was the whole point of the article here. Bad hardware exists and is actually very very very widely used because it’s cheap.
So, ZFS might not usually need fsck or similar, but what do you do in the case where you can’t mount it? For example, the hardware has corrupted the ZFS headers, so you can’t mount your volume, and as such the self-healing and correction facilities can never run. Yes, that’s right; you need an off-line tool to get it into a state where you can mount it, i.e. fsck or similar.
You don’t even need something similar to a fsck, you just need a transaction rollback. The rest is done by scrubbing …
You don’t even need something similar to a fsck, you just need a transaction rollback. The rest is done by scrubbing …
But as said, can you do that if you can’t even mount it?
Yes, that code has just been checked in. And no, it’s not called fsck.
You want to look at the result of PSARC 2009/479 (see http://www.c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-… ). And no, it isn’t an fsck
And no it isn’t an fsck
You guys are hanging too much on the word fsck, you know? Try to read the whole post and not cling on to a single word you might not like that much. I was only talking about a way of getting the ZFS volume and/or pool into a sane state, not necessarily a tool called ‘fsck’ or similar.
As the article states quite clearly, other filesystems like NTFS, ext, XFS etc. have been dealing with the same ‘buggy’ hardware for years. Granted, they usually run on systems with far better developed and tested storage drivers for such varied hardware than Solaris will ever have, and that’s probably where quite a few of the unseen problems are happening. ZFS doesn’t seem to handle these issues well because it assumes a system that works as it expects.
I fail to see how transactions can fail and bork the system. They either succeed or they don’t. If this isn’t the case then you need to be looking at where the problem is in your own stack.
It’s highly ironic that ZFS was specifically designed and hyped by Sun to bring ‘storage to the masses’ with commodity hardware… and when there turns out to be a problem, that same ‘buggy’ hardware that Sun said you could use with confidence with ZFS gets blamed for the problems.
In that quote Jeff told us exactly why Apple can’t use ZFS, or why it can’t be used in desktop scenarios until it is optimised for the purpose. Single pools will even be quite common in large storage scenarios, as ZFS will sit on large LUNs with their own redundancy, alongside filesystems used by other operating systems.
Well yes you do, because fsck merely stands for ‘filesystem check’. All it does is make sure that the filesystem is in a state that can be used before you mount it. On different filesystems those will consist of different checks, so yes, this is a fsck for ZFS. It should be checking consistency on every mount. The only difference with ZFS is that the fsck should take far less time than on other filesystems.
I’m not entirely sure what you or a few other people around here think ‘fsck’ stands for.
…in cake recipes…
I doubt that a ZFS fsck would be able to recover the trashed file systems in those posts/bug reports. File systems like Ext2/3 have simple designs that can be easily repaired by an fsck, however ZFS is an extremely complex file system and I don’t see what an fsck would do that the file system doesn’t do already.
And what is this fsck supposed to check and fix? As soon as somebody can answer this, a fsck tool will be made, I think.
Fixing/detecting corrupted SHA256 block hashes for the deduplication feature, for one. I’ve relied on file systems in the past that worked on a similar concept.
Nothing more terrifying than learning that a block of the root fs has a hash of zero!
The fs will need to be offline and the hash values are metadata, so it falls into the ‘fsck’ category IMHO.
That opens an interesting question: what is the correct stuff, the checksum or the data? Furthermore, dedup uses the already-computed checksums of the filesystem. You don’t have to sync them to your data.
The solution to that would be parity data saved inline with the normal data, sacrificing a little bit of space. Then some small percentage of the data could be hosed yet still recovered, whether it was the data, the hash, or the parity (a rough sketch of the idea follows below).
But since server people want better drives and more backups, we cheapskates want all of the 1TB our $80 paid for, and we all want faster storage… I don’t see it happening.
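For what it’s worth, that inline-parity idea is basically what single-parity RAID schemes do; here is a tiny, purely illustrative Python example of recovering any one missing piece from the others:

    # Toy single-parity example: any one missing chunk (data or parity)
    # can be rebuilt by XOR-ing the pieces that survived.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d1, d2 = b"AAAA", b"BBBB"
    parity = xor(d1, d2)           # stored alongside the data, costing len(d1) bytes

    # Lose d2: rebuild it from d1 and the parity.
    assert xor(d1, parity) == d2
    # Lose the parity instead: recompute it from the data.
    assert xor(d1, d2) == parity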
“zpool scrub” does this. And taking the filesystem offline for it is a waste of time.
FS devs learned a few lessons. FS development became sexy again and everyone saw the need for a new filesystem.
That said, ZFS will be the pioneer with arrows in its back. Other FSs will offer more features and better performance soonish, with less resource usage and a more elegant design. (And I don’t think Oracle will stop Solaris’ negative growth rates.)
Oh … can’t wait to see sync dedup in another filesystem really ready for prime-time … let’s say in the next 5-10 years
It won’t fail that easily. I have a home server based on ZFS and even frequent power losses don’t render the data useless, as proven by 2 years of 24/7 always-online usage. Before that I had a UFS-based setup, and every sixth to eighth power loss rendered the data so useless that even fsck couldn’t fix the problem. ZFS is as solid as a rock.
Just an observation, I think 24/7 means no power loss. 😉
24/7 usually means they don’t turn them off.
First of all, a fsck doesn’t solve a lot of problems. It checks the filesystem, but not the data. It’s called fsck and not datachk for a reason. So we end up with a mountable filesystem, but the data in it… that’s a different story.
With ZFS you can tackle the problem from a different perspective. First you have to keep two things in mind (sorry, simplifications ahead): ZFS works with transaction groups, and ZFS is copy-on-write. Furthermore, you have to know that there isn’t just one uberblock – there are 128 of them (the transaction group number modulo 128 selects the uberblock used for a given transaction group).
Given these points, there is a good chance that you have a consistent state of your filesystem from shortly before the crash, and that it hasn’t been overwritten since, thanks to the COW behaviour.
So you just have to roll back transaction groups until you reach a state that can be scrubbed without errors… and you have a recovered state that is consistent and whose integrity has been validated. You only lose the last few transactions. That can’t be done with a fsck tool: you can’t guarantee the integrity of the data after the system reports back to you that the filesystem has been recovered.
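To make that a bit more concrete, here is a toy Python model of the idea (my own simplification, not the actual ZFS on-disk logic): a ring of 128 uberblock slots, filled at slot txg % 128, and a recovery routine that walks backwards from the newest one until it finds a root that scrubs clean.

    RING_SIZE = 128

    class ToyPool:
        def __init__(self):
            self.uberblocks = [None] * RING_SIZE   # each entry: (txg, root_pointer)
            self.txg = 0

        def commit(self, root_pointer):
            # Copy-on-write means older roots stay on disk until their slot is
            # reused, so the ring keeps up to 128 previous consistent states.
            self.uberblocks[self.txg % RING_SIZE] = (self.txg, root_pointer)
            self.txg += 1

        def candidates(self):
            # Most-recent-first list of states we could roll back to.
            valid = [ub for ub in self.uberblocks if ub is not None]
            return sorted(valid, key=lambda ub: ub[0], reverse=True)

    def recover(pool, scrubs_clean):
        # Try the newest uberblock first, then fall back to older ones until
        # one of them describes a tree that scrubs without errors.
        for txg, root in pool.candidates():
            if scrubs_clean(root):
                return txg, root
        raise IOError("no importable state found")

    pool = ToyPool()
    for i in range(200):
        pool.commit("root-of-txg-%d" % i)
    # Pretend the last two commits never made it to disk intact:
    broken = ("root-of-txg-199", "root-of-txg-198")
    print(recover(pool, lambda root: root not in broken))   # (197, 'root-of-txg-197')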
You may call the results of PSARC 2009/479 something like an fsck tool, but it isn’t. It just leverages the transactional behaviour of ZFS to enable other tools to do their work ( http://www.c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-… )
Just to end it here: ZFS doesn’t need a fsck tool, because a fsck doesn’t solve the real problem. ZFS needs something better, and with all the features of ZFS in conjunction with PSARC 2009/479 it will deliver something better.
Everyone who quotes that link implicitly _agrees_ with the article’s premise, which is not that zfs needs ‘fsck’ but *that zfs needs a way to fix unmountable volumes to the point where they can be imported/mounted again*, in order for the filesystem’s self healing capacities to kick in. Please try to read between the lines instead of criticising the author for using the word ‘fsck’.
First of all, I speak as someone who has already experienced what looked like a corrupted zpool, thanks to a failing drive + a bug in the SATA driver + a bug in the ZFS release of the time – a really bad and rare case.
The pool consisted of a raidz of 46 × 500 GB drives, approximately 18 TB in total.
Fscking the filesystem would have just been a nightmare. In our case restoring the data from a validated and consistent source was both the fastest and the easiest option.
If you can’t do a fast restore from a valid backup source and/or don’t have any redundancy in your storage and machines, that means you just don’t care at all about your data and your business. So I don’t see why you should ask for an fsck tool in the first place.
Lots of filesystems do a great job of maintaining their own consistency. It’s really external errors that bother a lot of us: say you drop a backup drive on the floor – I’m going to fsck it before I even attempt to mount it. It sounds like ZFS has a fsck-like scrub that can run. If it can guarantee that the drive is mountable, then you should be able to boot it up enough to run a fsck-equivalent and verify the integrity of the filesystem against external errors. This likely pushes some of the problems elsewhere: since it’s a background process on a live system, you could have other software failures if they touch broken parts of the filesystem that were assumed valid, though other filesystems have similar problems with damaged data. Fscking just makes you stop everything else when you do it; when you find problems, you find them in fsck and not when your database crashes for some unknown reason because the blocks on the disk were screwed up.
What ZFS really needs is to run under Linux, and probably Windows, and to do that it probably requires some license changes and some substantial attitude changes within Sun. Until that happens, it’s at best a bit player. The “bad hardware” problems are pretty weak as well, I can’t recall hearing NTFS devs or Ext3 devs complaining about it. Part of that is that Sun’s management needed each and every home run they could get as they shopped the company around, and for some reason they chose to roll a filesystem out with the kind of visibility that they did. Actual support will always trump hype: if it’s so perfect, then give it to the rest of the world and the rest of the world will adopt it.
“The “bad hardware” problems are pretty weak as well, I can’t recall hearing NTFS devs or Ext3 devs complaining about it.”
But you fail to notice that SUN does Enterprise storage. That is a completely different thing from commodity hard drives for Windows and Linux that don’t obey standards, as Jeff Bonwick explains. Enterprise storage has much higher requirements, and therefore you will hear complaints from Enterprise storage people. For Linux and Windows, which do not have those high demands, nor are capable of handling them, anything will do. Windows and Linux are not used in the Enterprise storage area. That is the reason you don’t hear NTFS or ext3 devs complain about it.
Here you see that Linux does not handle Enterprise Storage, according to a storage expert. Maybe he is wrong, maybe he knows more about Enterprise Storage than most people.
http://www.enterprisestorageforum.com/sans/features/article.php/374…
http://www.enterprisestorageforum.com/sans/features/article.php/374…
Regarding “that attitude change that SUN needs”, maybe you will see it quite soon, as Oracle is buying SUN. SUN is the company that has released the most open source code, and last year it ranked 30th among those who contributed the most code to the Linux kernel. We will see whether Oracle will close SUN tech and charge a lot, or continue in the same vein as SUN. But SUN was in the process of open sourcing _everything_; we have to see if Oracle will also open source everything they own.
>I can’t recall hearing NTFS devs or Ext3 devs complaining about it.
Usually ext3/4 devs are complaining about different applications (like KDE) that should do their own homework. So to speak, they don’t have any clue what they’re actually doing. When it comes to reliable filesystems, Linux is a huge disappointment. Apart from XFS, but that’s another story.
You’re not listening carefully enough, then. Linux has the same problems if a disk does not honor barriers. Even funnier, on Linux barriers don’t work at all even with properly working disks when LVM is in use. ZFS does not need Linux, but it seems that Linux does need ZFS.
Oh, and ZFS is 100% open (after all, it’s in FreeBSD and other operating systems); too bad Linux isn’t, and therefore can’t be integrated with foreign code the way others can.
Of course not. Suppose that this kind of bad hardware accounted for 1% of FS corruption. It is unlikely that anybody would even know about it, because it is lost in the noise. But now ZFS comes along and gets rid of the other 99%. Now that 1% is responsible for 100% of the corruption on a FS that is designed to have none. That’s a big deal. In reality, the bad disks are probably responsible for even more corruption on other FSs, but since you have already accepted a bit of corruption with each crash, you can’t see the difference. Remember, fsck does not get you back what you lost, and arbitrarily large amounts of data corruption can still occur. But in general the amount lost and corrupted is small, and everybody has learned to live with it. But now we have ZFS, and it guarantees consistent and complete data, possibly a few milliseconds out of date after a crash, assuming the underlying disks follow the standards. Compare that to fsck, where one file may be up to date, a bunch more are a few milliseconds behind, another is corrupted and another is deleted.
Up until now with ZFS, that 1% caused by bad hardware left the FS unusable. But with the zpool recovery support just added, in that 1% of cases you end up losing a couple of seconds of data and the file system recovers virtually instantaneously – instead of scanning and rescanning and patching to get back to an inconsistent state, with some data from a few seconds ago and some up to date, as happens with fsck.
[quote]if you want reliability, you need hardware which is not that crappy.[/quote]
Well, how do I know what hardware meets ZFS’s requirements? Does something like a ZFS HDD ‘compatibility’ list exist?
Here is hardware compatibility list with OpenSolaris:
http://www.sun.com/bigadmin/hcl/data/os/
For ZFS to play well with your hardware, I don’t know of such a list. To be truly sure which hardware plays by the rules and doesn’t break any standards, you have to buy Enterprise stuff.
Or you could try to see which components SUN’s storage servers use, and buy those components.
No you don’t. You can use any kind of disks assuming they are not broken.
Yes, I use ordinary Samsung 1TB Spinpoint drives and they work fine.
However, some of this cheap hardware does not adhere to standards. Then it can be a problem – if you do something more unusual than just plain use of the hardware. Maybe hot-swapping disks, etc. Hot-swapping disks must be supported by both the drives and the controller card, and there were some other issues too; I can’t remember them right now.
What I am trying to say is that if you just use your hardware as normal, and do not try to use unusual functionality (without confirming it first), then everything is fine. But, for instance, that crazy guy who installed OpenSolaris in VirtualBox on top of Windows XP and then created a 10TB ZFS raid – I do not consider that normal usage. First of all, running everything on top of WinXP is just a bad idea. And on top of that, VirtualBox is slightly unstable and has various unusual quirks that ZFS does not expect. That is the reason he lost his data. Many levels of fail.
If you want to use unusual functionality, first confirm it follows the standards, etc. If you use normal plain functionality, everything is fine.
You totally misunderstood. The “crappy” disks mentioned are bad. Malfunctioning. Broken. If you have such a disk, you should take it back to the shop and reclaim your money. ZFS doesn’t have any special requirements for disks.
Read this, and you will understand
http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-supp…
Two reasons:
“One: The user has never tried another filesystem that tests for end-to-end data integrity, so ZFS notices more problems, and sooner.
Two: If you lost data with another filesystem, you may have overlooked it and blamed the OS or the application, instead of the inexpensive hardware.”
… I don’t like the concept of an fsck for a completely different reason: with fsck you press your data into the form your filesystem expects. If you are lucky, it’s the same form your data was in before, but most often it isn’t.
When you rule out bit rot with checksums, cheap and crappy hardware with transaction rollback, and power failure with the ZIL and an always-consistent on-disk state, that leaves just software bugs for an fsck. But I think such problems should be handled in the filesystem itself – for example by enabling the code to read the buggy structure and fix it simply by rewriting it correctly the next time – not with a sideband tool.
The advantage: a fsck just puts the data into the expected form, whereas a bug fix in the code understands the problem and can take exactly the right steps to fix the bug in the structure, rather than just pressing it into the expected form.
BTW: I’ve written a rather long piece on this topic in my blog: http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-…
It seems to me the transactional nature of ZFS, the checksums, the ‘scrub’ command and the recently added zpool recovery feature all negate the need for a fsck utility.
It seems to me that a script called fsck.zfs containing a zpool restore and a scrub command would satisfy everyone.
Including the OP.
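For the curious, such a wrapper could look roughly like the sketch below. This is not an official tool – the name fsck.zfs is just borrowed from the Linux fsck.<fstype> convention – and it assumes the -F recovery option that PSARC 2009/479 added to zpool import, plus the ordinary zpool scrub and zpool status subcommands:

    #!/usr/bin/env python
    # Hypothetical "fsck.zfs" wrapper: a sketch of the idea above, not an
    # official tool. Assumes the zpool import -F recovery option from
    # PSARC 2009/479 and the usual zpool scrub/status subcommands.
    import subprocess
    import sys
    import time

    def run(*cmd):
        print("+", " ".join(cmd))
        return subprocess.call(cmd)

    def fsck_zfs(pool):
        # Step 1: try a normal import; fall back to recovery mode (-F), which
        # discards the last few transaction groups to reach an importable state.
        if run("zpool", "import", pool) != 0:
            if run("zpool", "import", "-F", pool) != 0:
                sys.exit("pool %s could not be imported, even in recovery mode" % pool)
        # Step 2: scrub, i.e. verify every checksum and self-heal from replicas.
        run("zpool", "scrub", pool)
        while "scrub in progress" in subprocess.check_output(
                ["zpool", "status", pool]).decode():
            time.sleep(30)
        run("zpool", "status", "-v", pool)   # report anything that could not be healed

    if __name__ == "__main__":
        fsck_zfs(sys.argv[1])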