When it comes to dealing with storage, Solaris 10 provides admins with more choices than any other operating system. Right out of the box, it offers two filesystems, two volume managers, an iSCSI target and initiator, and, naturally, an NFS server. Add a couple of Sun packages and you have volume replication, a cluster filesystem, and a hierarchical storage manager. Trust your data to the still-in-development features found in OpenSolaris, and you can have a Fibre Channel target and an in-kernel CIFS server, among other things. True, some of these features can be found in any enterprise-ready UNIX OS. But Solaris 10 integrates all of them into one well-tested package.
Editor’s note: This is the first of our published submissions for the 2008 Article Contest.
The details of the whole Solaris storage stack could fill a book, so in this article I will focus only on filesystems. There are four common on-disk filesystems for Solaris, and my goal is to familiarize the reader with each of them, and to mention a few deployment scenarios where each is appropriate.
UFS
UFS in its various forms has been with us since the days of BSD on VAXen the size of refrigerators. The basic UFS concepts thus date back to the early 1980s and represent the second pass at a workable UNIX filesystem, after the very slow and simple filesystem that shipped with the truly ancient Version 7 UNIX. Almost all commercial UNIX OSs have had a UFS, and ext3 in Linux is similar to UFS in design. Solaris inherited UFS through SunOS, and SunOS in turn got it from BSD.
Until recently, UFS was the only filesystem that shipped with Solaris. Unlike HP, IBM, SGI, and DEC, Sun did not develop a next-generation filesystem during the 1990s. There are probably at least two reasons for this: most competitors built their new filesystems on third-party code that required per-system royalties, and VxFS from Veritas was already available for Solaris. Considering that a lot of the other vendors’ filesystem IP was licensed from Veritas anyway, this seems like a reasonable decision.
Solaris 10 can only boot from a UFS root filesystem. In the future, ZFS boot will be available, as it already is in OpenSolaris. But for now, every Solaris system must have at least one UFS filesystem.
UFS is old technology but it is a stable and fast filesystem. Sun has continuously tuned and improved the code over the last decade and has probably squeezed as much performance out of this type of FS as is possible. Journaling support was added in Solaris 7 at the turn of the century and has been enabled by default since Solaris 9. Before that, volume level journaling was available. In this older scheme, changes to the raw device are journaled, and the filesystem is not journaling-aware. This is a simple but inefficient scheme, and it worked with a small performance penalty. Volume level journaling is now end-of-lifed, but interestingly, the same sort of system seems to have been added to FreeBSD recently. What is old is new again.
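UFS logging is controlled with a simple mount option; a minimal sketch, assuming a hypothetical device and mount point:
mount -F ufs -o logging /dev/dsk/c0t0d0s5 /data
# or persistently, as the mount-options field of the /etc/vfstab entry:
/dev/dsk/c0t0d0s5  /dev/rdsk/c0t0d0s5  /data  ufs  2  yes  logging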
UFS is accompanied by the Solaris Volume Manager, which provides perfectly serviceable software RAID.
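For readers who have not used SVM, here is a rough sketch of a two-disk mirror carrying a UFS filesystem, with made-up disk slices:
metadb -a -f -c 2 c0t0d0s7 c0t1d0s7   # state database replicas, required before any metadevice
metainit d11 1 1 c0t0d0s0             # first submirror
metainit d12 1 1 c0t1d0s0             # second submirror
metainit d10 -m d11                   # one-way mirror on the first submirror
metattach d10 d12                     # attach the second half; a resync starts
newfs /dev/md/rdsk/d10                # put UFS on the mirror
mount /dev/md/dsk/d10 /export/data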
Where does UFS fit in 2008? Besides booting, it provides a filesystem which is stable and predictable and better integrated into the OS than anything else. ZFS will probably replace it eventually, but for now, it is a good choice for databases, which have usually been tuned for a traditional filesystem’s access characteristics. It is also a good choice for the pathologically conservative administrator, who may not have an exciting job, but who rarely has his nap time interrupted.
ZFS
ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.
ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous “rampant layering violation” quote which has been repeated so many times. Remember, though, that this is just one developer’s aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.
Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can’t fill up the whole pool, and reservations may be changed at will. However, if you don’t want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn’t just painless, it is irrelevant.
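As a minimal sketch of that model (the disk names and sizes are made up), one pool feeds several filesystems, and space policy is just a property that can be changed at any time:
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
zfs create tank/home
zfs create tank/projects
zfs set reservation=50G tank/projects   # guarantee this filesystem some space
zfs set quota=100G tank/projects        # cap it; both settings can be changed later
zfs list                                # every filesystem reports the shared pool space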
ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
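For the paranoid administrator, the relevant commands look roughly like this, assuming a pool named tank:
zfs set checksum=sha256 tank   # stronger than the default fletcher checksum
zpool scrub tank               # verify every block while the pool stays online
zpool status -v tank           # report scrub progress and any checksum errors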
The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem’s root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
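The virtual machine case might look something like the following; the dataset names are hypothetical:
zfs snapshot tank/vm/base@golden                 # point-in-time image
ls /tank/vm/base/.zfs/snapshot/golden            # snapshots appear under the .zfs dot directory
zfs clone tank/vm/base@golden tank/vm/guest01    # writable clone, taking no space until it diverges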
These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet – it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.
SAM and QFS
SAM and QFS are different things but are closely coupled. QFS is Sun’s cluster filesystem, meaning that the same filesystem may be simultaneously mounted by multiple systems. SAM is a hierarchical storage manager; it allows a set of disks to be used as a cache for a tape library. SAM and QFS are designed to work together, but each may be used separately.
QFS has some interesting features. A QFS filesystem may span multiple disks with no extra LVM needed to do striping or concatenation. When multiple disks are used, data may be striped or round-robined. Round-robin allocation means that each file is written to one or two disks in the set. This is useful since, unlike striping, participation by all disks is not needed to fetch a file – each disk may seek totally independently. QFS also allows metadata to be separated from data. In this way, a few disks may serve the random metadata workload while the rest serve a sequential data workload. Finally, as mentioned before, QFS is an asymmetric cluster filesystem.
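A rough sketch of how that looks in the QFS master configuration file (/etc/opt/SUNWsamfs/mcf); the device names and equipment ordinals here are invented, so check the SAM-QFS documentation before copying this:
# Equipment Identifier   Eq  Type  Family Set  State
qfs1                     10  ma    qfs1        on
/dev/dsk/c2t0d0s0        11  mm    qfs1        on
/dev/dsk/c2t1d0s0        12  mr    qfs1        on
/dev/dsk/c2t2d0s0        13  mr    qfs1        on
# "mm" holds metadata, the "mr" devices take round-robin data; then build and mount it:
sammkfs qfs1
mount -F samfs qfs1 /qfs1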
QFS cannot manage its own RAID, besides striping. For this, you need a hardware controller, a traditional volume manager, or a raw ZFS volume.
SAM makes a much larger backing store (typically a tape library) look like a regular UNIX filesystem. This is accomplished by storing metadata and often-referenced data on disk, and migrating infrequently used data in and out of the disk cache as needed. SAM can be configured so that all data is staged out to tape, so that if the disk cache fails, the tapes may be used like a backup. Files staged off of the disk cache are stored in tar-like archives, so that potentially random access of small files can become sequential. This can make further backups much faster.
QFS may be used as a local or cluster filesystem for large-file intensive workloads like Oracle. SAM and QFS are often used for huge data sets such as those encountered in supercomputing. SAM and QFS are optional products and are not cheap, but they have recently been released into OpenSolaris.
VxFS
The Veritas filesystem and volume manager have their roots in a fault-tolerant proprietary minicomputer built by Veritas in the 1980s. They have been available for Solaris since at least 1993 and have been ported to AIX and Linux. They are integrated into HP-UX and SCO UNIX, and Veritas Volume Manager code has been used (and extensively modified) in Tru64 UNIX and even in Windows. Over the years, Veritas has made a lot of money licensing their tech, and not because it is cheap, but because it works.
VxFS has never been part of Solaris but, when UFS was the only option, it was a popular addition. VxVM and VxFS are tightly integrated. Through vxassist, one may shrink and grow filesystems and their underlying volumes with minimal trouble. VxVM provides online RAID relayout. If you have a RAID5 and want to turn it into a RAID10, no problem, no downtime. If you need more space, just convert it back to a RAID5. VxVM has a reputation for being cryptic, and to some extent it is, but it’s not so bad and the flexibility is impressive.
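A hedged sketch of that workflow, with a hypothetical disk group and volume:
vxassist -g datadg make datavol 200g layout=stripe ncol=4
mkfs -F vxfs /dev/vx/rdsk/datadg/datavol
mount -F vxfs /dev/vx/dsk/datadg/datavol /data
vxresize -g datadg datavol +50g                    # grow the volume and the filesystem in one step
vxassist -g datadg relayout datavol layout=raid5   # online relayout, no unmount needed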
VxFS is a fast, extent based, journaled, clusterable filesystem. In fact, it essentially introduced these features to the world, along with direct IO. Newer versions of VxFS and VxVM have the ability to do cross-platform disk sharing. If you ever wanted to unmount a volume from your AIX box and mount it on Linux or Solaris, now you can.
VxFS and VxVM are still closed source. A version is available from Symantec that is free on small servers, with limitations, but I imagine that most users still pay. Pricing starts around $2500 and can be shocking for larger machines. VxFS and VxVM are solid choices for critical infrastructure workloads, including databases.
Conclusion
These are the four major choices in the Solaris on-disk filesystem world. Other filesystems, such as ext2, have some degree of support in OpenSolaris, and FUSE is also being worked on. But if you are deploying a Solaris server, you are going to be using one or more of these four. I hope that you enjoyed this overview, and if you have any corrections or tales of UNIX filesystem history, please let me know.
About the Author
John Finigan is a Computer Science graduate student and IT professional specializing in backup and recovery technology. He is especially interested in the history of enterprise computing and in Cold War technology.
w00t! No FAT support?!?!
Key thing to note is that all future Solaris work is being done assuming ZFS as the primary file system… so everything else (for the local machine, anyway) is legacy.
The first release of OpenSolaris installs upon ZFS only, and that OS is going to be the basis of whatever Sun cobbles together to call Solaris 11, if that even happens.
Oh, Solaris 11 will happen, that you can be sure of, once OpenSolaris has had a few general availability builds and gets out there more. I would say it is a year off or so.
It would be advisable to stay on topic and edit out any snipey and unprofessional off-topic asides like the above quoted material. This article is supposed to be about “Solaris Filesystem Choices”. Please talk about Solaris filesystems.
Aside from some understandable concerns about layering, I think most “Linux folks” recognize that ZFS has some undeniable strengths.
I hope that this Article Contest does not turn into a convenient platform from which authors feel they can hurl potshots at others.
Both of those quoted sentences are factual, and I think it’s important to understand that technology and politics are never isolated subjects.
However, I understand the spirit of your sentiment. In my defense, I wrote the article both to educate and to entertain. If a person just wants to know about Solaris filesystems, the Sun docs are way better than anything I might write.
Let’s not confuse facts with speculation.
You wrote: “It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves.”
In interpretive writing, you can establish that “[ZFS] has gotten some derision from Linux folks” by providing citations (which you did not provide, actually).
But appending “… who are accustomed to getting that hype themselves” is tacky and presumptuous. Do you have references to demonstrate that Linux advocates deride ZFS specifically because they are not “getting hype”? If not, this is pure speculation on your part. So don’t pretend it is fact.
Moreover, referring to “Linux folks” in this context is to make a blanket generalization.
+1. The author of this article is clearly a tacky, presumptuous speculator, short on references and long on partisanship.
Seriously, I know I shouldn’t reply here, but in the light of the above revelation, I will. It is extremely silly to turn this into some semantic argument on whether I can find documentation on what is in someone’s heart. If I could find just two ‘folks’ who like linux and resent non-linux hype relating to ZFS, it would make my statement technically a fact. Are you willing to bet that these two people don’t exist?
Yet, would this change anything? No, it would be complete foolishness. Having spent my time in academia, I am tired of this kind of sophistry of “demonstrating facts”. I read, try things, form opinions, write about it. You have the same opportunity.
I figure that with popularity comes envy of that popularity. And with that comes potshots. Ask any celebrity. As Morrissey sings, “We Hate It When Our Friends Become Successful”.
http://www.oz.net/~moz/lyrics/yourarse/wehateit.htm
It’s probably best to simply expect potshots to be taken at Linux and Linux users and accept them with good grace. Politely pointing out the potshots is good form. Drawing them out into long flame-threads (as has not yet happened here) is annoying to others and is thus counterproductive. It just attracts more potshots.
Certainly this happens. On the other hand, who would be better than a celebrity to demonstrate the “because I am successful, I must be brilliant” fallacy we mere mortals are susceptible to. I think we would both agree that the situation is complicated.
Myself, I believe that a little bias can be enjoyable in a tech article, if it is explicit. It helps me understand the context of the situation–computing being as much about people as software.
Don’t twist words. My comments were quite obviously in reference to a particular sentence. (I’d add that I enjoyed the majority of your essay.)
Facts are verifiable through credible references. This is basic Supported Argument 101.
Good god, man. What academic world do you come from where you don’t have to demonstrate facts? You’re the one insisting that your statements are fact.
Sure you’re entitled to your opinion. But don’t confuse facts with speculation. That is all.
> …demonstrate that Linux advocates deride ZFS specifically because they are not “getting hype”?
I don’t think he said one causes the other or is a result of jealousy. He merely pointed out that Linux advocates usually receive hype rather than derision but in this case many gave derision. I fail to see the “specifically because.”
The design of ZFS is ugly and doesn’t fit in with the Unix philosophy.
UFS > ZFS.
Hands down. No debate. Don’t try, you’re wasting your time.
🙂
Could you elaborate on that? I have some concerns about mixing the fs and raid layers. But the self healing features are attractive. And the admin utilities are a dream. I only wish that I had such a nice command-line interface to manipulate fdisk/mdadm/lvm/mkfs in the Linux world. There is no reason that this could not be done. But the fact is that, in all these years, it hasn’t been done. People point me at EVMS when I speak along these lines. But EVMS really doesn’t cut it. In fact, every time I check it out, I come away wondering what it is really for, and what problem it actually solves.
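To make the comparison concrete, here is roughly what the two stacks look like from the command line, with hypothetical device names; the Linux side spans three separate tools, the ZFS side two commands:
# Linux: RAID, volume management and filesystem are separate layers
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
pvcreate /dev/md0
vgcreate datavg /dev/md0
lvcreate -L 50G -n home datavg
mkfs.ext3 /dev/datavg/home
mount /dev/datavg/home /export/home
# Solaris: the rough equivalent with ZFS
zpool create tank mirror c0t1d0 c0t2d0
zfs create tank/home     # mounted at /tank/home automatically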
Furthermore, I can imagine where plain UFS is the best solution (i.e. where ZFS would be “too much of a good thing”), for example on systems with lower resources or where extending the storage “pool” won’t happen. UFS is a very stable and fast file system (the article mentions this), and along with the well known UNIX mounting operations, it can still be very powerful. For example, FreeBSD uses the UFS2 file system with “soft updates”. But well, these settings usually aren’t places where Solaris comes into use.
But remember, kids: This doesn’t obsolete your accurate backups. 🙂
Once you have taken the time to read the zfs manpages, these utilities are very welcome. Especially the central zfs service program interface makes formatting and mounting very easy. It has advantages over the relatively static /etc/vfstab.
And nice to see that the Veritas volume manager has been mentioned in the article. IN VINUM VERITAS. See vinum(8) manpage. =^_^=
Wiser words are rarely spoken
Wow, never heard that before, that’s a great pun. Anyway, I am a fan of VxVM / FS. I heard somewhere that it may be open sourced; I hope this is true even if it sounds somewhat unlikely. VxFS on *BSD would be good stuff.
How about these: *Test* your backups from time to time. Restore an entire machine (on spare hardware!) to make sure you can. It’s easy to say ‘we take daily backups’, but not nearly as easy to take those daily backups and make them into a running system.
No one cares about backups. They only care about restoring!
And restores have the precondition of… what? Exactly: working backups. Of course you’re right, the purpose of the backups IS the restore, and that’s what people are interested in.
It’s not very complicated to back up data partition-wise. Depending on what partition it is (root, system resources, users’ home dirs) you back up more or less often, and if you’re lucky, restore is very easy (for example from tape or from optical media). I’m sure you know enough about this important topic so I don’t need to explain further. 🙂
Oh yes, you can save much time and trouble if you do your backups on /dev/null and restore from /dev/zero. 🙂
More here: http://www.rinkworks.com/stupid/cs_backups.shtml
Can’t say I disagree. The layering violations are more important than some people realise, and what’s worse is that Sun didn’t need to do it that way. They could have created a base filesystem and abstracted out the RAID, volume management and other features while creating consistent looking userspace tools.
The all-in-one philosophy makes it that much more difficult to create other implementations of ZFS, and BSD and Apple will find it that much more difficult to do – if at all really. It makes coexistence with other filesystems that much more difficult as well, with more duplication of similar functionality. Despite the hype surrounding ZFS by Sun at the time of Solaris 10, ZFS still isn’t Solaris’ main filesystem by default. That tells you a lot.
a) There are no layering violations. The Linux camp keeps claiming that because it’s implemented completely differently from how they do their stuff. ZFS works differently, period.
b) So, what’s inconsistent with zpool and zfs?
A filesystem, a volume manager, a software RAID manager and bad block detection and recovery code with functionality not unlike smartd, along with various other things, all in one codebase? That’s an (unnecessary) layering violation in anybody’s book, so saying the above isn’t going to make what you’ve written true.
Nothing. It’s about the only real advantage of ZFS.
So, having network drivers, sockets support, file descriptors, IP support, TCP support, UDP support, and HTTP support all in one codebase means the Linux kernel’s network stack is full of rampant layering violations?
Please stop parroting one Linux developer’s view. Go look at the ZFS docs. ZFS is layered. Linux developers talk crap about everything that is not Linux. Classic NIH syndrome.
ZFS was designed to make volume management and filesystems easy to use and bulletproof. What you and linux guys want defeats that purpose and the current technologies in linux land illustrate that fact to no end.
That’s just plain wrong. ZFS is working fine on BSD and OS X. ZFS doesn’t make coexistence with other filesystems difficult. On my Solaris box I have UFS and ZFS filesystems with zero problems. In fact I can create a zvol from my pool and format it with UFS.
Feel free to describe what those layers are and what they do. It certainly isn’t layered into a filesystem, volume manager and RAID subsystems.
When it’s been around as long as the Veritas Storage System, or indeed, pretty much any other filesystem, volume manager or software RAID implementation, give us a call.
I don’t see lots of Linux users absolutely desperate to start ditching what they have to use ZFS.
I’m afraid you’ve been at the Sun koolaid drinking fountain. ZFS is not implemented in a working fashion in any way shape or form on OS X (Sun always seems to get very excited about OS X for some reason) or FreeBSD. They are exceptionally experimental, and pre-alpha, and integrating it with existing filesystems, volume managers and RAID systems is going to be exceptionally difficult unless they just go ZFS completely.
So what? You’re sitting on a Solaris box. When you have HFS+, LVM, RAID and other partitions on your system and you’re working out how to consolidate them (or you’re a non-Solaris OS developer trying to work that out), give us a call.
So that means it isn’t layered…. hmmm what are you smoking. It’s different so it must be bad. I get it.
Breaking that layering was intentional because that layering adds nothing but more points of failure.
It’s like saying electric cars are broken and rampant violations because they are not powered by gas. Which is utter nonsense.
Huh?? WTF does that have to do with anything? Being easy to use has no bearing on how long something has been in the market.
Your condition will never be true. Call me when Linux has been around for as long as Unix System V has been around. Unix System V has been around since 1983. Linux since 1992. Linux will never be around longer than Unix SV unless Unix dies at a particular time and linux continues.
BTW ZFS has been around longer than ReiserFS 4. Wait but ReiserFS 4 is completely useless.
The first comment on Jeff Bonwick’s blog post that was linked in an earlier post had some guy running a 70TB linux storage solution who was waiting to dump it for ZFS.
It is just being ported and is unstable. That doesn’t mean it is impossible to port, as you claimed, because ZFS isn’t layered.
It all depends on how many resources apple wants to put into ZFS and their Business plan. Your claim was directly in relation to some rubbish quote by Andrew Morton. You then, based on ill conceived conjecture, claimed ZFS is not portable because of “rampant layering” violations. Which is just nonsense.
You can create a RAID volume and ZFS can add it to a pool. You can then create a zvol from that pool and format it with other filesysems. You can create a LVM volume and add it to a ZFS pool as long as it is a block device. You can even take a file on a filesystem and zfs can use it in a pool. You have no idea what you are talking about.
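For example, something along these lines works on a stock Solaris box (the names are made up, and file-backed pools are only sensible for testing):
mkfile 1g /export/zfile
zpool create testpool /export/zfile    # pool backed by a plain file
zfs create -V 512m testpool/ufsvol     # zvol carved out of the pool
newfs /dev/zvol/rdsk/testpool/ufsvol   # format the zvol with UFS
mount /dev/zvol/dsk/testpool/ufsvol /mnt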
http://www.csamuel.org/2006/12/30/zfs-on-linux-works
The link above is about some guys using LVM in linux with ZFS.
You should stop drinking the Anti-Sun kool-aid. Its no secret that you are an anti-Sun troll on OSnews.
WTF are you on about again? You claimed ZFS can’t coexist with other filesystems because of its design. When you have figured out basic software layering and architecture or have at least learned how to look at some HTML code give us a call.
Thanks for side-stepping it ;-).
What points of failure?
Obviously filesystems and storage management software don’t need to be proved. The point is that you’ve got lots of systems out there for storage management that people are already using, and Sun expects people to drop all that for ZFS, which does the same thing – but maybe slightly better in some areas. That’s not enough.
Yadda, yadda, yadda. This was about layering violations, wasn’t it? The reason why Linux became popular and people moved off Solaris to it was because it ran well on x86 and generally available hardware. Sun thought everyone would move to a ‘real’ OS in Solaris and run on ‘real’ hardware. They didn’t. ZFS follows in that fine tradition as it simply will not run on 32-bit systems.
I don’t use Reiser 4, and neither does anyone else.
Very scientific. Some bloke posting on someone’s blog….. I don’t find dumping a storage set up a valuable use of time, money or resources, and the cost/benefit just isn’t there. Does ZFS have tools to help interoperability and migration, or will he be doing this himself?
You cannot equate ZFS to existing storage systems and make them interoperate. If you go down the ZFS route it’s really all or nothing. If it was layered into logical units and containers then that would be possible, and it would be possible for people like Apple and FreeBSD to reuse existing code and infrastructure.
You proceeded to proudly claim that ZFS didn’t violate any layers that you would expect to see in a storage system stack (a filesystem, a volume manager and RAID containers), and then you actually admitted it:
Then you didn’t explain how ZFS was logically structured, nor did you explain these mythical points of failure.
Since you haven’t explained how ZFS is actually layered……….
You can’t just type out a bunch of words and make them true unfortunately. ZFS will simply not cooperate with existing filesystems and existing volume management and RAID systems. You can’t, for example, have the ZFS system manage existing RAID systems or volumes that FreeBSD might use, nor can Apple use the ZFS system to manage HFS+ volumes. You just end up with duplicate systems lying around.
That was the point. ZFS cannot work with existing storage systems code, and to do so will mean picking the code apart.
Have you actually ever used ZFS, or are you just a naysayer going on half-assed theory?
I guess when I’m booting OpenSolaris on my ZFS root pool into 32bit mode, e.g. to test driver code, I must be imagining it all.
Do you actually run it on a 32-bit x86 system, or do you have trouble reading? Even Sun themselves recommend you don’t run ZFS on 32-bit Solaris systems, mostly related to the large amount of memory it likes to consume for caching.
If a filesystem doesn’t run well, it doesn’t run at all.
Can you provide some proof? A link to some docs on Sun’s website where this limitation is spelled out.
If I boot the 32bit kernel, I’m in 32bit mode. Whether the machine has a 32bit or 64bit processor matters exactly squat.
And I’ve run it in VirtualBox in a VM with 512MB RAM, yet Solaris didn’t shit itself. At all.
Again, did you ever use the filesystem (under proper conditions) or are you just spewing crap? I figure the latter.
If I roll my eyes more, they’ll fall out of their sockets.
Sorry, but ‘I ran it on a VM for five minutes, and OMG it didn’t crap!’ doesn’t tell anyone anything.
Obviously you haven’t. ZFS under 32-bit systems is a well known gotcha, and it has bitten the FreeBSD people in porting, so if you don’t know that you don’t know about ZFS. ZFS needs cache and lots of it, and tends to grow unbounded with your workload without serious tuning, so you are going to need several gigabytes of memory to run it. Just because you run it for ten seconds in a VM and it doesn’t reach those limits doesn’t mean jack I’m afraid.
I just laugh at all the fan boys who are taking a still unstable and unproven filesystem and storage stack, an unfinished distribution in OpenSolaris and who think that because they’ve run the system for a couple of days they can run it in production for something.
I didn’t. I was pointing out how stupid your comment was.
Multiple, with no error checking between the layers.
Of course Sun wants everyone to use ZFS. Just like the Linux guys want everyone to use Linux or Apple wants everyone to use a Mac. Should I go on? It’s called a product and the process is marketing. Every single player in the computing industry from RedHat to some new startup is guilty of it.
Put down the crack pipe. My response was to your silly idea that something has to be around for longer than something else to be better.
Again put down the crack pipe.
Ask him. Obviously he is willing to put the time and effort into it because he finds the linux solution inadequate.
What nonsense? Give me a real world example. Apple wants to replace HFS+ and many people in the Apple community are very excited by it.
http://drewthaler.blogspot.com/2007/10/don-be-zfs-hater.html
Apple didn’t make ZFS the default in Leopard, but not because of any intrinsic limitation of ZFS’ design.
You are just hand waving. Give me some cogent technical details as to why you think it is not possible. Go into as much technical detail as you would like.
It doesn’t violate any layers because it is trying to redefine them. Are you just plain daft?
If you set out to implement something that was supposed to fit in a layer and then purposefully changed it to make it incompatible, then your claim would make sense.
ZFS was never designed to fit in that traditional layer and it was intentional because the designers thought the traditional model was broken. There is no violation.
People who love ZFS love it because ZFS doesn’t use those unnecessary layers.
ZFS has three layers: the ZPL (ZFS POSIX Layer), the DMU (Data Management Unit) and the SPA (Storage Pool Allocator). All of these have end-to-end checksumming, unlike RAID, LVM and FS stacks.
“I’ve implemented and supported a Linux-based storage system (70 TB and growing) on a stack of: hardware RAID, Linux FC/SCSI, LVM2, XFS, and NFS. From that perspective: flattening the stack is good. The scariest episodes we’ve had have been in unpredictable interactions between the layers when errors propagated up or down the stack cryptically or only partially (or, worse, didn’t). With the experience we’ve had with the Linux-based system (which, admittedly, is generally working OK), it would be hard to imagine a more direct answer to every item on our list of complaints (not only reliability, but also usability issues) than ZFS, and I think the depth of the stack is ultimately behind for the large majority of those complaints.
Unsurprisingly, I’m aiming to migrate the system (on the same equipment) to ZFS on Solaris as soon as we can manage to. ”
Here is the comment from the blog post. A lot of real world customers don’t like the stupid layers. Get it!
You haven’t explained a lot of things. Explain again in as much detail why ZFS can not coexist with other filesystems. Also what in its design makes it hard for Apple to implement it.
Yes it can. I already explained it to you and also linked to ZFS on FUSE using LVM.
It’s evident you have never used ZFS and are just hanging on Andrew Morton’s words and ranting. Let’s get technical. I am waiting for your technical explanation. Don’t just say it can’t, show me exactly why it can’t.
Rubbish! Prove it. All you have done is make claims. How about backing it up with some real examples and technical discussion?
When other filesystems like VxFS can detect hardware data corruption and not silently corrupt data, give us a call.
case in point:
http://www.opensolaris.org/jive/thread.jspa?messageID=81720
This could be anything, and could well be a Solaris driver problem that ZFS has shown up that didn’t occur elsewhere. That thread proves very little, and there was no evidence silent corruption was occurring before he decided to jack in the existing set up. It’s kind of like closing the stable door and not working out why it was open in the first place.
It proves one thing, that you don’t understand anything at a technical level.
The difference is that ZFS detected the problem and VxFS can’t. That’s a huge difference. If you can’t understand why that is a “big deal” you shouldn’t be talking about storage at all. If you work on anything related to data storage your boss should fire you on the spot.
Pulling a thread from an OpenSolaris forums with some random guy who decided to throw away an existing Veritas set up (worrying in itself) proves absolutely nothing.
You have no idea what ZFS detected. The thing is, you don’t know what caused the problem or what it is. There is no evidence at all that there was a problem on the existing system either (and Veritas has multiple ways around this kind of failure). You don’t know if moving off something to an unstable Solaris set up to use ZFS caused it, or if it’s something else. There are too many variables.
‘OMG, ZFS detected this!’ tells you nothing, nor how to fix it, nor what the problem is. The point is, you have it. Look at the guy below who gets warnings and errors using the same controller.
I take it trouble shooting isn’t in your repertoire?
If you decide to dump an existing and working storage solution and install an unstable Solaris OS derivative so you can technology wank over ZFS, I’d cut your ass off and throw it out of the window.
Some of you guys really worry me, you know?
What are the layering violations? Could someone point me toward some good links (or search terms) for the info?
I’m just curious if this is a “the tools aren’t split out into separate fs, raid, volume, disk management tools” issue or a “source code is unreadable as everything is lumped together in one big lump” issue, or what. What are the layers of ZFS on Solaris, for example, as compared to the same layers in Linux. What’s so different about FreeBSD that a single developer was able to get basic ZFS support working in under two weeks, and yet there’s still no ZFS support on Linux?
Urh, you do realise that ZFS is already available in FreeBSD 7.0, right?
Export a zvol and put whatever filesystem you want on top (okay, may only UFS is directly supported, but you can use iSCSI and such to export the zvol to other systems and then put whatever FS you want on top).
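Roughly like this with the old iscsitgt target of that era, assuming a pool named tank:
zfs create -V 20g tank/lun0
zfs set shareiscsi=on tank/lun0   # export the zvol as an iSCSI LUN
iscsitadm list target -v          # confirm the target is visible to initiators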
Errrrr, no it isn’t. It’s extremely experimental and barely functional, and on limited architectures at that. Hell, even running it on 32-bit systems will leave you with something exceptionally borked. ZFS also needs exceptional tuning to work with non-Solaris kernels. There is a huge class of hardware it simply will not run on – probably ever.
I’ve seen some people wandering around assuming that they can just run ZFS in FreeBSD, and run it in production. That’s just………scary.
Good read. There was some info that was new to me, and it gives a good overview of the topic. What I would have liked was some more information about the current status of the ZFS implementations in OpenSolaris.
Thank you. I kept the size of the ZFS portion reasonable for symmetry but much, much more could be written about it. In particular, now that the iSCSI target is in production Solaris, it would have been interesting to discuss the extensive integration of ZFS and the target.
Good point about the dev features. I just checked my fresh Solaris 10 Update 5 (latest) install and the pool version is 4. The current dev pool version appears to be 10, as of November 2007. Since then, they have changed the zpool on-disk spec to enable gzip compression, use of NVRAM devices for acceleration, quotas, booting, CIFS, and other things.
See:
http://opensolaris.org/os/community/zfs/version/10/
And cycle the number of the URL from 1 to 10.
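For reference, pool versions can be checked and upgraded from the command line; the upgrade is one-way, so older software can no longer import the pool afterwards:
zpool upgrade -v     # list the on-disk versions the running software supports
zpool upgrade        # show pools still at an older version
zpool upgrade tank   # upgrade a specific pool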
I enjoyed the article, hope to see more like it in the future.
Excellent OSNews contest article
Will we see more background articles on OSnews in the future? These would be welcome. And it would be even better if such articles were written by people with real-life experience. Although the article about Solaris filesystems is well written, some important facts are missing. A few examples:
* There is an important difference between ZFS in Solaris and ZFS in OpenSolaris. OpenSolaris has all the latest and greatest ZFS features, Solaris not yet.
* Sun does support Solaris, but does not offer support for the OpenSolaris builds. My opinion is that, with regard to the risk of hitting bugs, the OpenSolaris developer edition builds can be compared to Linux kernel release candidates.
* If there are four filesystems to choose from, this means not every filesystem is suited for every workload. The article does not mention, for example, that when using a single disk, UFS performs better for database workloads than ZFS. This is not a coincidence but is due to the fundamental characteristics of these filesystems (see also Margo Seltzer et al., File System Logging Versus Clustering: A Performance Comparison, USENIX 1995, http://www.eecs.harvard.edu/~margo/usenix.195).
Are you, sir, suggesting that I lack real world experience? Well I never…!!
But seriously, there are a few points here which are worth responding to; a careful eye picks out some of the things I could have talked about, but cut to avoid the problem of too many themes in one article:
This is true, but frankly I find it unimportant in the last year or so since most of the ZFS features I consider important are in mainline Solaris, since, say, U4. A lot of the dev stuff is either icing, like gzip compression, or far-off alpha stuff, like on-disk encryption.
Sure. I, personally, would never recommend that anyone runs a dev version of an OS in a commercial setting. It’s rarely worth the hassle.
I did allude to this. But I am not sure that the Seltzer paper is an applicable reference, because I don’t think that the designs of ZFS and LFS are close enough; i.e., ZFS is not a pure-play log structured FS and requires no cleaner. I am not even sure the FFS comparison stands, because journaling changes a lot, performance wise.
Still, the kernel of the argument is probably that ZFS and LFS both often change sequential IO to random and vice-versa, and this can cause weird performance characteristics. I think that this is true. However, experience (!) has taught me that single disk performance is rarely important, for the simple reason that almost any modern FS can max out a single disk on sequential and random. Putting it another way, if you require performance, you probably need more disks.
That’s an interesting statement. Can you tell me where I can find more information about the fact that ZFS doesn’t require a cleaner?
Check this out:
http://blogs.sun.com/nico/entry/comparing_zfs_to_the_41
By saying that it doesn’t require a cleaner, what I had in mind is that it isn’t background garbage collected and does not treat the disk as a circular log of segments that are either clean or dirty. It is more like a versioned tree of blocks. By cleaner I mean an asynchronous daemon that eventually “gets around to” cleaning segments that are dirty.
However, I am always willing to learn something. Do you think this evaluation is incorrect?
As is well known, filesystems like ZFS and the Sprite LFS write data to a huge log. This log contains both data and metadata. I’m not a ZFS expert, but how can ZFS discover unreferenced blocks without rereading the metadata that was written earlier to the log (assuming that not all metadata fits in RAM)?
The metadata and spacemaps are COW as well. No written data is live until the uberblock is written, which happens as the last thing in a transaction group. If ZFS fails to commit a transaction group, that is, the uberblock does not get written, then practically no blocks were allocated, since the new updated spacemaps aren’t referenced by the updated metadata tree, which itself is also not referenced by the current uberblock.
So in theory, there can be no unreferenced blocks, unless it happens due to bugs in ZFS. In this case, it has to walk the metadata tree and check it against the spacemaps.
…that is if I got your question right.
What does this mean then? http://developers.sun.com/sxde/support.jsp
Indiana is going to be named OpenSolaris (before that, OpenSolaris was just a code base). The developer edition (SXDE: “Solaris Express Developer Edition” built from OpenSolaris) is supported. The developer preview (aka Indiana) is just a preview.
Following up with your wording, SXDE is an OpenSolaris build.
I got what you meant, but the wording is incorrect.
I was referring to end-user support, not to support for those who want to write OpenSolaris kernel or userland code.
This is a very good overview. No hype, just enough technical details, fair and balanced
“Remember, though, that this is just one developer’s aesthetic opinion.”
It’s not just one developer’s aesthetic opinion. Layered design is superior to monolithic design.
ZFS is good, but layered ZFS would be better, for many reasons. Can ZFS do reiserfs over LVM over RAID over NFS, SMB and gmailfs? You would be surprised how some people use the technology sometimes.
Yes, ZFS is a great file system. However, it’s still got room to improve. I would like to see it GPL’ed (dual license or just GPL Solaris). I believe there are plans for that. Make it layered if it can be and rename it agrouffs.
My interpretation of the original comments were that they were about the customary layering of subsystems in the kernel, which is not something that users see or care about. In other words, I think that he meant that “ZFS does not map directly onto the storage management architecture in the linux kernel and thus would be a pain to implement in linux” and just said it in a sensationalistic way.
Yes! There is no rule saying that ZFS cannot be used over a traditional volume manager, on a loopback file (over NFS), or that a zvol cannot be formatted with a traditional filesystem like UFS. It is just not common.
It’s not just that. It’s maintainability. When features get added to the wrong layer, it means code redundancy, wasted developer effort, wasted memory, messy interfaces, and bugs that get fixed in one filesystem, but remain in the others.
It does make a difference just how many filesystems you care about supporting. The Linux philosophy is to have one that is considered standard, but to support many. If Sun is planning for ZFS to be the “be all and end all” filesystem for *Solaris, it is easy to see them coming to a different determination regarding proper layering. Neither determination is wrong. They just have different consequences.
Perhaps btrfs will someday implement all of ZFS’s goodness in the Linux Way. I confess to being a bit impatient with the state of Linux filesystems today. But not enough to switch to Solaris. I guess one can’t expect to have everything.
This is a good, balanced explanation. I think the question is whether the features provided by ZFS are best implemented in a rethought storage stack. In my opinion, the naming of ZFS is a marketing weakness. I would prefer to see something like “ZSM”, expanding to “meaningless letter storage manager”. Calling it a FS makes it easy for people to understand, but usually to understand incorrectly.
I see ZFS as a third generation storage manager, following partitioned disks and regular LVMs. Now, if the ZFS feature set can be implemented on a second generation stack, I say, more power to the implementors. But the burden of proof is on them, and so far it has not happened.
I too am impatient with the state of Linux storage management. For better or worse, I just don’t think it is a priority for the mainline kernel development crew, or Red Hat, which, like it or not, is all that matters in the commercial space. I think ext3 is a stable, well-tuned filesystem, but I find LVM and MD to be clumsy and fragile. Once ext4 is decently stable, I would love to see work on a Real Volume Manager ™.
I think you’re wrong here. I do think the Linux kernel community is aware of the desperate need for a ‘kick-ass-enterprise-ready-filesystem-like-ZFS’. A lot of people were waiting for the arrival of Reiser4, but we all know how that ended 🙂
Ext4 is just a ‘let’s be pragmatic’ solution: we need something better than Ext3.
ZFS for Linux is (besides license issues) a ‘no-go’ because of the (VFS) layering stuff.
But I think that there’s hope: BTRFS. It doesn’t sound as sexy as ZFS, but it has a lot to offer when it becomes stable and available. I’m following the development closely, and I get the idea that Chris Mason makes sure not to fall into the ‘reiser trap’ by communicating in a constructive manner with the rest of the kernel community.
Although not ready in the near future (read 2008), I personally have high expectations of BTRFS. And I believe it will become the default filesystem for many distributions when it arrives.
Regards Harry
The problem with BTRFS is that a whole team worked for several years on ZFS to get it to the current point, while BTRFS still has only one guy behind it (as far as I know). While there may be first workable prototypes pretty soon, the fleshing out of details (usually things work nice on paper, but practically prove to be crappy and need to be redesigned) and fine-tuning is going to need LOTS of time.
I hope you are right. To be clear, I think ext4 could turn out to be an excellent incremental improvement.
Yes, I hope this turns out as well, since it seems to bypass some of the MD and LVM cruft. What I find interesting about BTRFS is that it is an Oracle project, and yet better storage for unstructured data is not really in Oracle’s interest–what they would like is to get your unstructured data into their db.
Note: Not a conspiracy theory! I just think it is unusual.
Also, what weeman said
Not so much “bypassing” as “working more closely with”:
http://lwn.net/Articles/265533/
The article to which the linked comment by Chris Mason is attached is also worth a read for anyone interested in btrfs:
http://lwn.net/Articles/265257/
I don’t see many people sharing your impatience in all honesty. The software RAID subsystem within Linux is pretty good, and has been tested exceptionally well over and over again over a number of years. You need to have spent a reasonable amount of money on a pretty decent hardware RAID set up if you really want something better. That extends to ZFS as well, as software RAID by itself can only take you so far.
The only perceived problem is that you don’t get volume management, RAID and other storage management features for free in one codebase. If distributions started partitioning by using LVM by default, created userspace tools that had a consistent command line interface, as well as GUI tools that made LVM and software RAID much more visible and usable, then you’d see them more widely used on a wider variety of systems.
You’re going to have to qualify that one, because LVM and MD software RAID were stable and being used before ZFS was even a glint in Bonwick’s eye. Indeed, ZFS has yet to be proven in the same way.
You’ll probably see one come about when it becomes a requirement to do storage management on running desktop systems and other appliances. Until that day arrives those who require the features of ZFS are already using hardware RAID and some form of volume management, and beyond that, something like the Veritas Storage System.
That’s part of the fallacy I see in some of the hype surrounding ZFS. It’s useful and a nice thing to have, especially when compared with what you got in Solaris before, but its significance is overplayed. Most of what’s in there is only really useful to people with some pretty large storage arrays, and if you have something that matters to you then you’ll already be running hardware RAID of some description and volume management, and if you have money riding on it then you’ll use Veritas as they have the tools that make the thing really useful, and it’s very well proven having been around for the best part of two decades.
When we do get to a stage where desktop systems and various appliances need these storage capabilities (real home entertainment systems, whenever they actually happen) we’ll have far better and affordable solid state or holographic storage, and the vast majority of what ZFS provides that is useful will have been transparently rolled into the hardware and not even software and kernel developers will need to worry about it.
To conclude, ultimately, when it comes to storage management today, we are limited by the hardware and the storage mediums that we have. You can’t polish a turd by slapping ZFS on it, no matter how much self-healing goes on.
Clumsy: See your comments about consistent userland tools. Actually I think LVM is ok, but I am not a fan of MD’s userland tools, and I am not convinced that the separation of the two subsystems is optimal. I believe this is for historical reasons, anyway. Regardless, this is a matter of taste.
Fragile: I can crash Ubuntu 6.06 pretty easily like this:
1. LVM snap a volume
2. Mount the snap read only
3. Read the snap, write to the parent, let a couple of hours pass.
4. Unmount the snap.
5. Sleep a little while.
6. Destroy the snap.
7. Panic.
The solution? Sync between each step, 4 to 6. This is not the only weirdness I have seen, but it stands out. Humorously, I was trying to use LVM snapshots in this case to reduce backup-related downtime.
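A sketch of the sequence in plain LVM2 commands, with hypothetical volume names; inserting sync between the later steps is what avoided the panic:
lvcreate -L 2G -s -n backsnap /dev/vg0/data   # 1. snapshot the volume
mount -o ro /dev/vg0/backsnap /mnt/snap       # 2. mount the snapshot read-only
# 3. read the snapshot while writes continue to /dev/vg0/data
umount /mnt/snap                              # 4. unmount
sync                                          # the workaround
lvremove -f /dev/vg0/backsnap                 # 6. destroy the snapshot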
The problem with hardware RAID is that most of them aren’t very good, and the ones that are are really expensive. To approximate ZFS features like clones and checksumming, not to mention well integrated snapshots, you really need a NetApp or, less elegantly, block storage from EMC. The price per gig of these is about 10 times that of the dumb storage you would use with a software solution. I find it really ironic to hear that one establishment Linux perspective is “just use expensive, closed, proprietary hardware”.
And before anyone says that if you have the needs, you’ll have the budget, remember that overpriced storage is a problem with any budget, since it causes more apps to be crammed onto a smaller number of spindles with correspondingly worse performance. Plus, what is good about the little guy being stuck with a mediocre solution?
Well yer, but you hardly need to create a whole storage system to achieve that. One reasonable userland tool and multiple subsystems would do it. It probably hasn’t been done because there’s not much demand, and many people are using tools on top of the normal command line tools.
Yer. You’re going to need to allocate enough space to the snapshot to take into account divergences and changes between the original logical volume and the snapshot. Which part of that did you not understand? In reality, the logical volume should just be disabled, so I don’t know what you’ve encountered there.
This is the same for ZFS. Regardless of the space saving methods employed, snapshots and clones are not free (certainly if you’re writing to a snapshot, or you expect to keep a RO snapshot diverging from an original) as proponents of ZFS want to make out. The only time they will be is when we get Turing style storage, but then, all of our problems will be solved. Do you see what I mean about polishing a turd? To go any further you need to solve fundamental problems with storage hardware.
Humourously, I do this all the time and I don’t need an all-new file and storage system from Sun.
This debate of software versus hardware RAID has been done to death, and I’m sure you can find some adequate information via Google. ZFS is not adding anything to the debate, and software RAID in ZFS has been around an awful lot less than other software RAID implementations.
Yer, but if I need those features the data within them is worth more than any cost. Additionally, ZFS is still experimental and has been around for far less time than any of those solutions.
I’m not entirely sure you understand why people use hardware RAID, and if you don’t understand that then you’ve adequately demonstrated why ZFS is trying to solve problems that don’t really need solving. Trying to label it as ‘proprietary’ to try and create a sensible usage scenario for ZFS is pretty desperate.
I have a choice of software RAID and various hardware RAID options, some better than others, proper hardware RAID allows you to do sensible hot swapping of drives and abstracts the RAID storage away from the system. Linux and Solaris support lots of ‘proprietary hardware’, but we’re not tied specifically to any of them. That just seems like a pretty flimsy argument. “Oh, but it’s proprietary!” is the usual last resort argument Sun themselves use, which is ironic in itself considering that various roadblocks make additional implementations of ZFS difficult.
If I’m worried about universally moving RAID arrays between machines then I have software RAID for that, and it doesn’t stop me getting the data off and moving it off somewhere else.
The market for that scenario is so small as to be insignificant. Overpriced storage is certainly not a problem today. People are not going to switch to ZFS because they think they’re running out of space.
Nothing, but the problem is that the little guy has little need for all of the features available. He might use a volume manager and he might well use some form of RAID, but that’s pretty much it. He’s not going to get excited about ZFS if he’s already using that stuff, and more importantly, he won’t switch.
As I said in the previous post, when the little guy requires storage management of the kind provided by ZFS (and most of it will have to be entirely transparent) then storage hardware will have advanced far beyond any requirement for ZFS.
We get it you don’t like ZFS or just don’t plain understand it.
If you don’t like what’s been written then just let us know, or at least tell us what you’re upset about ;-).
The biggest issue with all Linux tools is that they are not designed to work together, in that they don’t harmonise the commandline options (or even the command names).
For instance, why is that you can’t just use:
“mount -t <fs> -o <options> <device> <mount point>”
to mount any filesystem? Some can be, if they have a mount.fs command installed, while others require running fsmnt commands.
Or, why can’t one just use:
“mkfs -t <fs> <device>”
to format partitions with the filesystem of choice? Some have mkfs.fs command that work that way. Others don’t and require you to run the separate command.
Or, why can’t one use:
“fsck -t <fs> <device>”
to check filesystems for errors.
Or, why can’t one use:
“ifconfig <interface> <options>”
to configure wired NICs, wireless NICs, vlans, bridges, and tunnels?
There are too many separate projects doing each little piece of the pie in their own corner of the world with very little regard for what the others are doing. This wouldn’t be so bad … if the layers between them were clearly defined, and the interfaces clearly defined, and didn’t change at the drop of a hat.
LVM, DM, and MD are crap when you are working with large filesystems. On one 64-bit Debian Lenny box, I can’t create multiple LVM volume if their total size goes over 1.2 TB. On a separate 64-bit Debian Etch box, I can create a single LVM volume that spans 7 TB, but creating a second one of 10 GB (yes GB) on the same box fails. On a third 64-bit Debian Etch box, I had to manually stitch together a bunch of 1.9 TB physical volumes into one large 10 TB volume group, but I can only partition it into smaller logical volumes if the total size of the volumes is less than 2 TB.
So much for trying to simplify things by using volume management for storage of backups and as storage for virtual machines.
I’d much rather see something useful in the server room. LVM, MD, and DM have a long way to go before that happens.
We’re using hardware RAID, working with multiple TB storage arrays, and struggling with volume management, as the current crop of Linux tools really, really, really suck. I’m very tempted to turn our VM boxes into nothing more than NFS servers so that I can run FreeBSD or OpenSolaris on them and get some useful volume/fs management tools.
Your personal anecdotes on experiences with Linux storage management tools in a narrow set of circumstances count for very little. “Oh, this happened to me with LVM once!” I’ve created an absolute ton of multi-gigabyte and multi-terabyte logical volumes, and if what you described were the case generally then I think we’d all have heard about it. The very fact that you’ve mentioned ‘Debian Lenny’ sees your credibility get shot through the head right there.
Just because LVM does this or doesn’t do that on a certain distribution, that doesn’t mean that that is the way it works generally for everyone else. I daresay ZFS will have a few bugs and quirks once it gets more widely used, and if Sun actually gets a clue you might have an outside chance of finding the problem on their web site ;-).
Well, they’re used on a very, very, very wide basis these days, and I don’t see people breaking down the doors and throwing out all their Linux boxes just to run ZFS. Just isn’t happening.
You’ll do a cost/benefit analysis of that, like everyone else, and find out that it just isn’t worth it. I don’t know what you’d gain running FreeBSD, and the less-than-perfect userland in Solaris alone is enough to turn people off.
From my brief readings on ZFS, you can add remote block devices into the zpool. You can create raidz volumes from the pool. And you can export volumes in such a way that you can put other filesystems on top. What’s so different?
You have a FS, on top of a volume, on top of a RAID, on top of storage devices.
ZFS is not just a filesystem. It is a storage management system that includes device management, volume management, raid management, pooled storage, and a 128-bit filesystem on top of all that.
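To put that in concrete terms, a minimal sketch (pool, dataset and disk names are placeholders):

zpool create tank raidz c1t0d0 c1t1d0 c1t2d0    # pooled storage and RAID in one step
zfs create tank/home                            # a filesystem carved out of the pool
zfs set compression=on tank/home                # per-dataset properties
zfs snapshot tank/home@before-upgrade           # cheap snapshot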
Please, no. Something architectural like this should not be GPL’d. It should be licensed in such a way that anyone can take it and build upon it (BSD, MIT, X11, LGPL, whatever).
The GPL is great for applications. But lower-level libraries and architectural stuff should not be GPL’d. Otherwise you get every Tom, Dick, and Harry Co writing their own (usually incompatible) version of the same thing.
How many more layers do you need?
I want to thank everyone who has said they enjoyed the article. I appreciate it, and I hope the contest yields a couple more good reads.
Awesomesauce. Very well done.
Now if OSNews could use some quality typography to present these words…
As one of the ZFS engineers explains it, regarding the “rampant layering violation” charge:
http://blogs.sun.com/bonwick/entry/rampant_layering_violation
“Andrew Morton has famously called ZFS a “rampant layering violation” because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends what the meaning of the word violate is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit — that is, changing where the boundaries are between layers — we could make the whole thing much simpler.
….
The ZFS architecture eliminates an entire layer of translation — and along with it, an entire class of metadata (volume LBAs). It also eliminates the need for hardware RAID controllers. At the same time, it provides a useful new interface — object storage — that was previously inaccessible because it was buried inside a monolithic filesystem.
I certainly don’t feel violated. Do you?”
Don’t drink too much Sun Kool-Aid.
Well, a filesystem is not a volume manager is not a RAID device. They’re successive containers for each other.
There must be a shared font of Sun anti-freeze somewhere, because if Sun believes that it eliminates the need for hardware RAID controllers then they don’t know what hardware RAID controllers are used for – which is worrying if they’re writing a new storage system.
“…[SUN] don’t know what hardware RAID controllers are used for – which is worrying if they’re writing a new storage system.”
I certainly don’t. Please enlighten me.
If the filesystem can do RAID functionality, why do you still need the hardware controller? Save for battery-backed cache, whose reliability is also disputed?
Ohoh, me three, what are they used for??
Sigh… Oh, alright then. It’s clear a lot of people haven’t done this.
For starters:
1. Reliable and supported disk hot swapping, which you just don’t get from on-board controllers. Swap out a disk and replace it and you’ll have no guarantee whatsoever that it will get seen and rebuilt unless you take down the system.
2. Hardware RAID controllers have a lot of logic to deal with disk problems themselves. With Time-Limited Error Recovery disks, or something similar, these problems can be handled by the RAID controller very well. Ordinary desktop hard drives, on the other hand, have a tendency to time out on disk failures, and this can have some strange effects on any RAID arrays above them. With different hardware, you can’t really predict it.
3. Boot limitations. Since the OS is handling the RAID, the system can’t boot from the array itself. With software RAID you need a more complex layout, and usually separate RAID-1 mirrored boot partitions.
4. Hardware RAID usually provides far better support for things like hot spares, and these things can often be changed on-the-fly on a running system.
5. While software RAID allows you to break free from the confines of one hardware controller, a hardware controller will allow you to use a RAID array with a different OS.
Like I’ve said before, the storage issues are far more hardware related than software related, and the above outweighs any advantages you can list in running ZFS as an alternative.
I know this must be very hard for you. Rest assured, there is a place waiting in heaven for you just for attempting to impart your wisdom to us.
You have some valid points (1, 3, 4). However, what are “Hardware RAID Controllers”? This is about as useful as talking about “Computers”.
If it is not EMC, NetApp, or HDS, it is probably not nearly as good as you think it is. The beauty of ZFS is that it can do a lot of what these high-end storage systems can do at a price that is much, much lower. That doesn’t mean that it is as good as these systems. But remember that Linux itself would probably never have caught on if proprietary UNIX gear had been priced accessibly: people were willing to use a system that was inferior in a lot of ways back in 2000 or so, because back then Sun was selling a 2-way server for $20,000, if I recall correctly.
On the other hand, if the average user or admin, whom you assert doesn’t need ZFS, were the determining factor, I suspect that we would be just fine with Windows 98 today. I find it really hard to understand why a guy who is obviously passionate about technology is passionate about the current mediocre solution being good enough.
Not really ;-). I’m trying to work out what the hype is about, and why everybody is miraculously going to dump all their existing storage systems because ZFS is unbelievably fantastic. I’ve yet to find any reason.
That’s absolutely great, and no one says that things shouldn’t improve but that’s a very, very limited market. It’s even more limited if ZFS cannot work with those systems, or LVM, MD or other filesystems transparently – hence the layering violation query.
Yes it would. Not only was cheap hardware in the hands of many people important (which Sun steadfastly refused to get involved in), but so was distribution. Solaris has yet to get any of that right in the form of a community of non-Sun contributors for OpenSolaris, and they are still so anal that they want you to register and answer a lot of pointless questions before you download.
I’ve spoken to so many Sun salespeople, engineers and employees who are [still] firmly entrenched in the belief that Linux (and even x86) is an upgrade path to a ‘proper’ system in Solaris [and SPARC]. I’d laugh if it wasn’t so sad.
Academia and many, many universities in particular went through this whole process with their existing Sun and Solaris kit (mostly when performance of web applications became an issue for them, or when they wanted a userspace that didn’t make life difficult for them), and I fail to see them going through it again.
It still is, except they now try and compare Red Hat’s most expensive 24-hour on call support to Sun’s e-mail only subscription for some reason. The leopard doesn’t change its spots.
I’m not saying things don’t move on. However, I don’t see people dumping all their COBOL code because you can rewrite it in something far better and cooler.
I’m not. I’m just trying to be realistic about what ZFS solves, where the problems with storage really lie and whether people are so fed up that they are screaming to move to ZFS. Just don’t see it.
Again with this lame FUD.
No, you are not. No one is claiming ZFS solves all problems. That was your strawman. Sun sells SAM-QFS, Lustre and ZFS; even Sun doesn’t think ZFS solves everything.
People don’t dump existing filesystems, operating systems and storage solutions just because ZFS is on the scene, sweetheart. If it doesn’t work with what you already have, it is of pretty limited use. That’s why Microsoft provides the ability to convert from FAT and between successive NTFS versions, why the ext3 filesystem in Linux is an evolution of ext2 and ext4 follows on from it, and why brand-new filesystems like Reiser4 that aren’t backwards compatible have a hard time getting adopted.
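To be fair, those in-place upgrade paths are pretty painless. Something along these lines, with the drive letter and device name as placeholders:

convert D: /FS:NTFS (FAT to NTFS on Windows, in place)
tune2fs -j /dev/sda1 (adds a journal to ext2, turning it into ext3)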
If that irritates you then welcome to the real world.
I left what you quoted of my comment in so that you could go back and read it again ;-). I didn’t say that people are claiming that ZFS solves all the problems at all (although sometimes I wonder). It’s ironic that is a strawman, because that’s not what was written at all – as you can see.
All I said was that the problems that ZFS purports to solve that some people think are revolutionary in some way are not all that important, and more importantly, they aren’t important enough to dump what they have and move to something new and incompatible. You should have more than enough reading matter here now to work out why that is.
“There must be a shared font of Sun anti-freeze somewhere, because if Sun believes that it eliminates the need for hardware RAID controllers then they don’t know what hardware RAID controllers are used for – which is worrying if they’re writing a new storage system.”
A hardware RAID is used for exactly the same purposes as a software implementation, except that the hardware one consumes less CPU time. On the other hand, it’s much more difficult to improve its capabilities without having to buy a new RAID card.
So, kind of like, storage devices -> zpool -> raidz -> zfs
Stop thinking of ZFS as just a filesystem, and start thinking of it as a storage management system, and you’ll start to see that it’s not all that different from what you wrote (filesystem on top of volume manager on top of raid on top of storage device).
Other than continuous background verify, faster rebuilds, hot-swappable hardware support, and lower CPU use, there’s really not a lot that hardware RAID gives you that software RAID doesn’t. And it’s a lot easier to move a disk with a software RAID volume on it between systems if hardware dies. Even if moving between non-identical systems.
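For instance, moving a ZFS pool to another box is just the following, assuming the new system can physically see the disks (pool name is a placeholder):

zpool export tank    # on the old box
zpool import tank    # on the new one, after moving the disks

And Linux md can do much the same with mdadm --assemble --scan.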
So which is it, filesystem or storage system? You just contradicted yourself.
Hmmmm, no I haven’t. Read what you quoted ;-). It seems that even fairly cheap point scoring is beyond you.
From my understanding, it’s a misconception and poor naming that led to the “layering violation” complaints. If Sun had called it ZSMS (Zettabyte Storage Management System) then no one would bat an eye. But, since they called it ZFS, everyone has got their panties in a knot.
If you look at ZFS, it’s more than just a filesystem. It’s a storage management system, that lets you add/remove storage devices from a storage pool, create raid volumes using space in that pool, create storage volumes using space in that pool, and format those volumes using a 128-bit filesystem with a bunch of nifty features. But, you can also put UFS on top of the volumes (and probably others). And export the volumes out using network storage protocols (NFS, and I think CIFS).
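Roughly, on Solaris (dataset names and sizes are made up, and I’m going from the docs for the exact device path):

zfs create -V 50g tank/vol1       # a block-device volume backed by the pool
newfs /dev/zvol/rdsk/tank/vol1    # put UFS on top of it
zfs set sharenfs=on tank/home     # export a dataset over NFS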
ZSMS gives you a nice, unified way to do the same kinds of things as the cluster-* that is MD, device-mapper, LVM, mkfs.*, *mnt, mount.*, FUSE, and who knows what other crap you’d have to use to get the same featureset. What really irks me about the Linux tools is how non-standardised they are (why are some tools named mount.<fs> while others have their own one-off names?), how out of whack with each other they are, and how obviously un-designed to work together they are.
Now, you want a nicely layered approach to storage, then have a look at FreeBSD’s GEOM: separate RAID modules (0,1,3,5), encryption, journalling, remote block device, volume management, and more, that can all be neatly stacked in any order, using standard FS tools. All with normalised commandline options, all with normalised command names, all with normalised module names, all designed to work together.
Kind of like ZFS (all that’s missing is a pooled storage module, 128-bitness, and checksumming).
And completely unlike Linux.
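To make that stacking concrete, a rough sketch (FreeBSD device names and the mount point are placeholders, and geli will prompt for a passphrase):

gmirror load                             # make sure the mirror class is loaded
gmirror label -v gm0 /dev/da1 /dev/da2   # RAID-1 mirror
geli init /dev/mirror/gm0                # layer encryption on top of the mirror
geli attach /dev/mirror/gm0
newfs -U /dev/mirror/gm0.eli             # then plain UFS on top of that
mount /dev/mirror/gm0.eli /data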
There are lots of people criticizing ZFS. I strongly suspect they haven’t tried it. Or have they?
It is like the debate about MS Word and LaTeX. It is hard to explain the advantages of LaTeX over Word, because every time they say “oh, it is possible to do that in Word too, just….” But when/if they finally try LaTeX for a while, they almost always convert to LaTeX.
LaTeX has a feeling that can’t be described. It must be experienced. The same goes for ZFS. Before you guys criticize it, maybe you should just download Solaris and try it on a spare computer for a while? If you do that, then I will take your criticism seriously.
I say “just try LaTeX for a while, then you will understand what I am talking about” – and if they don’t, their negative comments about LaTeX aren’t worth much to me.
Yeah LaTeX! It makes me smile when people say, well, Word has the equation editor.
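For anyone who hasn’t seen it, this is all the markup an equation takes (a stock textbook integral, nothing project-specific):

\begin{equation}
  \int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}
\end{equation}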
When I started writing my thesis, I asked, would it be OK to use LaTeX? They said sure, as long as you convert it to Word for submission. Life can be tough sometimes.