A series of patches was recently posted to the Linux kernel mailing list by a team of engineers from Red Hat, ClusterFS, IBM, and Bull to extend the Ext3 filesystem, adding support for very large filesystems. After a long discussion, the developers came forward with a plan to roll these changes into a new version: Ext4. LWN covered the changes, as well as the arguments put forward for a new filesystem, a few weeks back.
While they’re changing things, why don’t they go straight to a 64-bit filesystem? Why 48 bits?
Wouldn’t it make more sense to change now, rather than have to worry about it again within the next decade? Is it more complicated than changing the type in the struct from a 16-bit uint to a 32-bit ulong?
“go straight to 64-bit”
My thoughts exactly. And then I thought: why not do like the internet did with IPv6, go 128-bit and be done with it?
Probably because the amount of re-design and re-engineering involved to successfully get to 128-bit while still maintaining enough compatibility with existing systems is huge. Just look at how long it’s taken Sun to get ZFS ready.
It would be a good thing, though, to see at least 64bit support. I guess I’ll just have to keep waiting a while longer.
What would the features be?
I hope for more than just bigger sizes.
It would be nice to see some of ZFS’s features (e.g. pools and block checksums) turn up in this filesystem, but even if they can’t do that, I still can’t see anything bad coming of it. Given the improvements in ext3 we’ve seen during the past few years, I think ext4 might end up being something very, very good.
Edit: Regarding the 64-bit vs. 48-bit business, Ts’o does seem to be proposing a 64-bit filesystem (I think). Not sure why they wouldn’t just go to 64 bits, or 128 à la ZFS for that matter.
This is really inane on the part of the Linux developers. Continuing to develop the old ext2 codebase is just foolish when MUCH better filesystems are already available. XFS (formerly SGI’s filesystem for IRIX) has been in the Linux kernel for years, as has ReiserFS. Both filesystems are much nicer than any hack of the old ext2 code will ever be… Why bother updating the old Linux FS when much better FSes have been available for years?
Choice is a good thing.
Because the newer FSes aren’t all that much better, perhaps?
ReiserFS has several shortcomings compared with ext3 and ext3 has several shortcomings compared with ReiserFS.
XFS has several shortcomings compared with ext3 and ext3 has several shortcomings compared with XFS.
The perfect solution does not exist, so the sensible solution is to enhance as many FSes as possible.
And that’s exactly what they are doing.
Better? At least I never got file corruption with ext3. I tried XFS for a year and ReiserFS for about six months… They didn’t really like power outages (especially ReiserFS), because their journaling mode is writeback, which does NOT save your data in the journal (see the mount-option sketch after this comment).
Reiser4? Not mature enough. It’s nice to have speed and features, but are they really more important than your data?
I’d rather be conservative with my data. Of course, results may vary; I know people who had nothing but trouble with ext3. And although I wouldn’t mind trying XFS again (the recovery tools were great), I wouldn’t touch ReiserFS again with a 10-foot pole.
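For reference, ext3’s journaling behavior is a mount-time choice; the writeback mode criticized above is one of three documented ext3 modes. A minimal sketch (device and mount point hypothetical):

    mount -o data=journal   /dev/hda1 /mnt   # data AND metadata journaled: safest, slowest
    mount -o data=ordered   /dev/hda1 /mnt   # ext3 default: file data hits disk before metadata commits
    mount -o data=writeback /dev/hda1 /mnt   # metadata-only journaling, as in XFS/ReiserFS v3;
                                             # file contents can be stale or nulled after a crash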
Expanding the ext3 filesystem limit from 8 TB today to 1024 PB.
Why not?
Why not?
Waste of developer resources. Sure, “if people want to spend time on ext4, let them”, but it creates extra overhead for the people who build tools that interact with filesystems, too. Not to mention that going to 48 bits first instead of 64 immediately is, in my perhaps uneducated view, insanely short-sighted.
I am thinking the same thing: a move to 64-bit or 48-bit addressing would increase the overhead. I don’t think 64-bit addressing is a step to increase speed, but it would be useful for reading large filesystems. So maybe 48 bits is a compromise between speed and scalability. BTW, I thought OS X was using 48-bit addressing; I might be wrong though.
I wouldn’t say that having 1 EB (or 131072 times the current max of 8 TB) is “insanely short-sighted”.
Also, note: “Some reserved space in the ext3 superblock has been grabbed to store the high 16 bits of some global block counts.”
This means that 48 bits isn’t just an arbitrary number; it is there for a reason: compatibility (see the sketch below).
From what I can see, ext4 looks like a stable upgrade path from ext3 that needs only kernel support (no filesystem rebuild).
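To illustrate the compatibility trick described above: the low 32 bits of a block count stay where ext3 always kept them, and the extra 16 bits go into previously reserved superblock space, so old kernels still read a valid (if truncated) count. A minimal C sketch (struct and field names are illustrative, not the actual ext4 on-disk layout):

    #include <stdint.h>
    #include <stdio.h>

    struct sb_counts {
        uint32_t blocks_count_lo;  /* the original 32-bit ext3 field          */
        uint16_t blocks_count_hi;  /* carved out of reserved superblock space */
    };

    /* Old code keeps reading blocks_count_lo as before; new code
     * combines both fields into a 48-bit count. */
    static uint64_t blocks_count(const struct sb_counts *s)
    {
        return ((uint64_t)s->blocks_count_hi << 32) | s->blocks_count_lo;
    }

    int main(void)
    {
        struct sb_counts sb = { 0xFFFFFFFFu, 0xFFFF };  /* 2^48 - 1 blocks */
        printf("%llu blocks\n", (unsigned long long)blocks_count(&sb));
        return 0;
    }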
I want versioning like the VMS filesystem.
I’ll second that. The versioning in VMS is just what is needed.
While they are at it, why don’t they add the sort of things, like RMS files, that also exist on VMS?
For those who don’t know, VMS ODS-2 can handle keyed files, and you have a whole bunch of standard utilities that, for many small apps, make an RDB (e.g. Oracle) redundant.
However, taking off my rose-tinted glasses, implementing VMS-style versioning would involve a lot more than just making the filesystem handle file versions. You would also have to make all the utilities like rm, mv, etc. handle, for example, delete-by-version.
Good idea though… IMHO, probability < 1%
/s
The versioning in VMS is just what is needed.
Current operating systems (including Linux) are implementing that, and even cooler functionality, around the concept known as “filesystem snapshots”.
IIRC, VMS 7.0 came with a log-based, snapshot-capable filesystem called Spiralog. I’m not finding a huge amount of info on it on the web, aside from highly technical docs, so I don’t have a good link.
Whether or not anyone used it much I don’t know.
Isn’t that slightly different? VMS keeps versions of files: you save a change to foobar.whatever and you get foobar.whatever;1, then foobar.whatever;2, and so on. Can make for a messy home dir, though.
BTW, I’m not at all familiar with FS snapshots… please educate me.
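Short answer: a filesystem snapshot is a read-only, point-in-time image of a whole volume, not VMS-style per-file versions. On Linux the usual way to get one today is an LVM2 copy-on-write snapshot; a sketch (volume and mount names hypothetical):

    lvcreate --size 1G --snapshot --name home_snap /dev/vg0/home   # freeze a view of the volume
    mount -o ro /dev/vg0/home_snap /mnt/snap                       # browse files as they were
                                                                   # at snapshot time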
VMS-versioning-like functionality is coming to ZFS (based on what I know about VMS’s versioning):
See the examples in this thread about the new “user undo” functionality one developer is implementing:
http://www.opensolaris.org/jive/thread.jspa?messageID=39676
OK, so they’re moving to a 1 EB limit… XFS supports eight times that. I stand by my claim that they should stop putting band-aids on ext2 and start using the much more advanced code they already have.
Does XFS support persistent inodes? If not, it’s pretty much useless for folks like us who need an AFS cache. I may be wrong here, but I think the same may go for SELinux support.
Yes, it does seem that they are duplicating the work of XFS. XFS is meant for serious (i.e. > 1 TB) needs, and it is extremely stable; it’s the oldest journaling filesystem in any UNIX in use today.
If you want XFS, why do you even care what the ext3 developers do? Just use XFS. The “duplication” argument is silly; we have many cases of duplication in open source projects (KDE/GNOME), and there’s no need to repeat that discussion.
People are free to dislike XFS for technical reasons, and some do – check this opinion from the Linux VFS guru: http://lkml.org/lkml/2003/12/2/150 .
Yeah, that’s right: the FAT filesystems took a conceptually similar path in the DOS/Windows world, from FAT12 to FAT16 and mutations of it, to FAT32, mostly in the name of backwards compatibility, while keeping limits on them that weren’t justified once better technology was already available. NTFS, for instance, has (IIRC) been capable of 64-bit addressing and huge (64-bit) files since 1993, the first working release of the filesystem, journaling included even then.
The biggest pity is that Microsoft hasn’t fully documented the data structures and everything that goes with them for outsiders to use, because NTFS could readily become the standard data-exchange format interoperable between all the other OSes, just as FAT filesystems are currently the lowest common denominator.
Then again, there’s always the BeOS filesystem too, which has been 64-bit from way back, uses extents, and is journaled as well.
Both of the above-mentioned filesystems existed long before Ext4 was planned, are stable, and work well for most things; but no single OS or filesystem I’m aware of is “perfect” or “optimal” for all things, except (best case) for fulfilling ideological wet dreams. What does the history of Microsoft and the FAT mutants demonstrate as the most likely progression? They went from something very simple to describe and implement, but limited in scalability, to something radically different that is much more complicated to describe and implement but provides much richer functionality, and they completely dumped the old filesystem, while still evolving it alongside the newer one during a transition period. Observation suggests Ext4 will be relegated to much the same evolutionary role as FAT32: it will exist for a time as a backwards-compatibility mode while some other filesystem that ultimately replaces it is adopted over time. Which filesystem will that be? That’s still being decided!
You do, however, forget that ext2 and ext3 are much, much better than any variation of FAT.
FAT32 is pretty much unusable for everybody, while ext3 is quite a good choice for home users. Much of what BFS can do can be done with ext3 as well: extended file attributes as well as journaling (see the example after this comment). NTFS doesn’t do true journaling, merely quasi-journaling.
Performance-wise, BFS is a poor choice compared with ext3; however, BFS was built from scratch with journaling and extended file attributes (as well as being 64-bit from the beginning), and there is more to a filesystem than just performance.
A steady, stepwise progression (evolution) is usually a better approach than revolution, but occasionally the amount of legacy baggage becomes too great. I’m sure the Linux ext family will be completely replaced someday. For the home Linux user ext3 is the best choice, but I wouldn’t use it on large-scale servers. Reserve those for XFS or ZFS.
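As a concrete illustration of the extended-attributes point above, ext3 exposes user attributes through the standard attr tools when mounted with the user_xattr option (paths and attribute names here are made up):

    mount -o remount,user_xattr /home              # enable user extended attributes
    setfattr -n user.title -v "Holiday" photo.jpg  # attach a named attribute to a file
    getfattr -d photo.jpg                          # dump all user.* attributes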
Would you care to explain the how and why of NTFS doing “quasi-journaling” compared to BFS and Linux ext3? NTFS uses a journal for the file attributes/metadata, extended or not, just like BFS in terms of what’s written to the journal. If by “journaling” you mean writing *everything* (metadata/attributes *AND* file data), note that BFS never writes a file’s main data to the journal either.
For creating and destroying lots of small files, BFS blows chunks speed-wise, but it works very well for fast disk-to-disk transfers and streaming of large media files. A large part of ext3’s performance comes from how much the OS caches in the combined VM/filesystem cache. BeOS as of 5.03, and Zeta, have separate caches for each, and there’s a distinct bug (at least as of BeOS 5.03) that causes the filesystem cache and the VM cache to collide unless an arbitrary filesystem cache size is set, on memory configurations starting at (I believe) 256 MB (it may be 128 MB; I can’t remember, since I haven’t had a system with that little memory since early ’99). I haven’t tested it myself, but supposedly this bug no longer exists in Zeta.
BFS comes out on top of ext3 if you’re searching an entire hard drive for existing files using their metadata, due to the data structures used. BFS also can’t run out of inodes if you create more files than expected: ext3 and its predecessors have a fixed limit on the number of files on a volume, determined at format time. That limit may not be hit most of the time, and the fixed table may indeed provide greater performance than BFS in some respects, but NTFS and BFS have no inode (or equivalent structure) limit.
FAT32 is usable by everybody in the sense that it is easy to write a filesystem driver for; usable != efficient. Because of the intertwined linked-list nature of a FAT filesystem, it is inherently unscalable when large numbers of threads/processes use the filesystem at the same time. For top speed it requires (in the worst case, which is easy to run into) keeping the entire FAT for a volume in RAM, and it may require reading the entire FAT just to access a single file that is maximally fragmented across the volume, i.e. where each allocation unit is far from the last and the next link is never in the piece of the FAT you already have loaded (extra points for people who followed that; a sketch of a chain walk follows this comment). As such, FAT32 is by far the worst possible thing you could use for a busy filesystem, especially on a web server or anything else creating, destroying or modifying lots of small files. It also has a (by today’s standards) rather small maximum file size of (IIRC) 4 GB minus 1 byte (a zero-length file is valid, so you can’t use 2^32 to say “4 GB”). In addition, with current hard drive growth, a single drive will hit FAT32’s maximum theoretical volume size within two years, though for all the other reasons it hit its practical limit years ago.
All three filesystems mentioned are subject to fragmentation, regardless of what anyone claims otherwise; whether it has a serious performance impact depends on other things. FAT32 is the simplest to defragment, in terms of knowing how to go about it, but it also typically needs it more. ext3 is probably a little more difficult to write a defragmenter for, but the filesystem is simple enough in structure to make it fairly easy. BFS, with its extents and tree structures, is quite a bit more difficult, and I know of no defragmenter for it at this time, for whatever reason; perhaps because Be, Inc. never fully documented the filesystem details, even in Dominic Giampaolo’s book (he created the BFS used in BeOS), and because of the common claim that “it doesn’t get fragmented”, which can be demonstrated to be false if you work at it. ext3 is no different there in terms of fragmentation.
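A sketch of the chain-walking problem described above: reading one FAT file means chasing a linked list through the allocation table, one dependent lookup per cluster (the in-memory table here is a toy; a real driver reads it from the volume):

    #include <stdint.h>
    #include <stdio.h>

    #define FAT32_EOC 0x0FFFFFF8u  /* entry values >= this mean end-of-chain */

    /* Each hop depends on the previous one, and on a cold cache each
     * hop can mean another seek into the FAT itself. */
    static void walk_chain(const uint32_t *fat, uint32_t first)
    {
        for (uint32_t c = first; c < FAT32_EOC; c = fat[c] & 0x0FFFFFFF)
            printf("read cluster %u\n", c);
    }

    int main(void)
    {
        uint32_t fat[16] = {0};
        fat[2] = 9; fat[9] = 5; fat[5] = 0x0FFFFFFF;  /* file: 2 -> 9 -> 5 -> EOC */
        walk_chain(fat, 2);
        return 0;
    }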
Ok, EXT4. Look. EXT3 is terrible. We have the source code to:
ZFS (The ultimate)
XFS (The second ultimate)
JFS (Robust and good)
ReiserFS (Some people like it)
It is arguable that these are all better than EXT3.
Why on earth would they re-invent the wheel AGAIN?!
I’m really sick of Red Hat, a lot of the major distros, and a lot of the Linux kernel guys foisting EXT on people and extolling the virtues of EXT3. EXT3 was a hack to give EXT2 journaling.
It does work great in terms of making ext2 more durable, but it does little to help performance or scalability.
We need to stop re-inventing the wheel.
For example, on FreeBSD, why haven’t they switched to XFS? Why keep using the hack known as UFS+softupdates? I believe the XFS guys would have re-licensed it for FreeBSD had they shown interest.
Very strange.
We need to stop re-inventing the wheel.
No… Reinventing is good. What needs to stop is fixing the broken wheel over and over.
I don’t get why there’s so much resistance.
ext4 is an upgrade from ext3, not a newly designed FS.
The way I see it, it’s designed so that you upgrade your kernel and start using it on your already-deployed ext3 system (hence 48 bits and not 64; see my previous post).
So, why not?
I wouldn’t go as far as saying that ZFS, XFS, JFS and ReiserFS are better than ext3.
It definitely depends on where the FS is to be deployed.
In a home user’s system, ext3 would be a better choice than any of the other filesystems. However, if I were to deploy a filesystem in a large server park, I’d go for ZFS, XFS or JFS. ReiserFS is too new to be recommended, though it looks promising; for a smaller server ReiserFS just might be the right thing.
EXT2/3 survives today because it’s better than all the others in one important way. It doesn’t have more features, it doesn’t have more speed; it has more simplicity. It behaves predictably, it’s easy to repair, and its code is simple and light. It has fewer bugs than the others and is therefore much better suited to people who care about the reliability of their data over the features the other filesystems offer (which is pretty much everybody).
This is great news, especially the extents thing. It should make the filesystem metadata much smaller and reduce the related overhead significantly.
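For those who haven’t met extents: instead of one pointer per block, an extent records a whole contiguous run as a single (start, length) pair. An illustrative C struct (not the exact ext4 on-disk record):

    #include <stdint.h>

    struct extent {
        uint32_t logical_block;   /* first file block this extent covers */
        uint16_t length;          /* number of contiguous blocks         */
        uint64_t physical_start;  /* where the run begins on disk        */
    };

    /* One such record can describe a 128 MiB contiguous run
     * (32768 x 4 KiB blocks) that a classic indirect-block map
     * would need 32768 separate 4-byte pointers to cover. */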
I have to bring Reiser4 into this, since nobody seems to like Reiser4.
IMHO Reiser4 already has all the advanced features they plan to put in, it’s already in -mm, and AFAIK Reiser4 is now stable and faster than pretty much any other FS, at the cost of CPU power. That isn’t a bad tradeoff, considering that CPUs keep getting more powerful and sit idle most of the time anyway.
But I use ext3 because of inotify support (which hasn’t been implemented in Reiser4 yet), and I still look forward to ext4 if it will generally improve the Linux FS for all distros (all distros seem to recommend ext3, except Gentoo, where ReiserFS is just as acceptable, and better for our Portage tree, i.e. lots of small files).
Reiser4 is still too new for most people to consider, I assume. Give it time; people tend to be conservative about technology.
I’m still trying to get over the shock of seeing colored text in my console – and I’ve been seeing that for years.
Reiser4 has a couple of problems which stand in the way of it being widely adopted…
The full Reiser4 code is *very* intrusive; it basically comes with its very own VFS layer. Kernel hackers don’t like code duplication, and they are not inclined to change the other filesystems to avoid it (mainly because Reiser4 isn’t worth it). Filesystems that aren’t in the kernel are mostly irrelevant, so Reiser4 is mostly irrelevant now.
But this isn’t the end of the story… it has a totally different on-disk structure from ReiserFS. Kernel hackers like stuff that lasts, and they aren’t inclined to keep adding filesystems that are radically different with each version. The reason is simple: they become responsible for maintaining them forever… and they are still burned by ReiserFS, which was almost immediately abandoned by the Namesys people once they moved on to the Next Big Thing(tm), Reiser4.
Then, Reiser4 has all those features that look good on paper but have dubious usefulness in the real world. Sure, the desktop geeks are all excited, but they get excited about everything, whether they are going to have a use for it or not.
Since this is about storing *data*, people are very conservative. Maybe when we have solid-state disks and the hardware side becomes a solved problem, people will start to be more receptive to radically new filesystems.
I read through this on the kernel mailing list, and quite frankly, trying to put ext3 on life support by bunging on features it was never originally designed for is a mistake. Making it support 16 TB filesystems is just a terribly silly idea. And then you’ve got the perpetual thing of working out what version of ext3 you’ve got in the kernel this week.
ZFS in the Solaris world isn’t going to rock anyone’s world any time soon, but the kinds of features it has at a fundamental level are things that need to be worked toward in the long run. To do that, you need a filesystem fundamentally designed with those things in mind. That’s why there are filesystems around for this purpose, like JFS and XFS. And what do we get as an excuse? The supposed code quality of anything apart from ext3, and the number of lines of code in XFS.
As Andrew Morton is hinting, and no one is listening, ext3 is just not going to make it as a modern filesystem.
“Making it support 16 TB filesystems is just a terribly silly idea. And then you’ve got the perpetual thing of working out what version of ext3 you’ve got in the kernel this week.”
That’s 1 exabyte actually: 16 TB (from the move to unsigned 32-bit block numbers) plus 16 more bits.
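The arithmetic, assuming the common 4 KiB block size (a trivial check program):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long max32 = (1ULL << 32) * 4096;  /* 2^44 bytes = 16 TiB */
        unsigned long long max48 = (1ULL << 48) * 4096;  /* 2^60 bytes = 1 EiB  */
        printf("32-bit: %llu bytes, 48-bit: %llu bytes\n", max32, max48);
        return 0;
    }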
Do you want to know why XFS hasn’t been adopted by major distributions?
http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0639.html
If you have an active database on an XFS filesystem, and the machine crashes, your database absolutely *will* be completely destroyed, along with any other files that were being written to at the time of the crash. I have experienced this myself several times in the few months that I’ve been using XFS.
File system design is hard. File system implementation is harder. File system development impacts users’ data, so reliability is critical.
Because we abstract filesystems in such a way that the file system is responsible for efficiently managing the storage, rather than dividing the task between a file system manager and a storage manager, it is difficult to design a one-size-fits-all file system.
There’s room for maybe a dozen “styles” of file system in Linux, ranging from small, fast, low-power systems for embedded devices to the sort of mass storage envisioned by the ext4 designers.
What’s unfortunate about ext4 is that it is going to be extent-based, yet designed by people who don’t have a whole lot of experience with extent-based systems or with storage management in configurations of that size.
It would be far better to leave ext3 with the role Linus described for it, the hyper-stable, well-tested “small” file system: just keep maintaining it, and concentrate development on extending and improving one of the more modern file systems.
AFAIK the nulling of files after a crash has been fixed; it’s a non-issue now.
> AFAIK the nulling of files after a crash has been fixed; it’s a non-issue now.
Despite what they say in the XFS FAQ about it being “fixed” in 1.1, it is still an issue in the current version. I’ve heard this from numerous XFS victims who believed the FAQ.
XFS is still by far the least robust filesystem that Linux supports.
> AFAIK the nulling of files after a crash has been fixed; it’s a non-issue now.
Nope, it’s definitely still an issue, at least as of the 2.6.16 Linux kernel.
The filesystem where I can read, write, change permissions and use all the major features on Linux, BSD and Windows (and, if possible, others) will be my FS of choice…
…Sadly, most of the developer community isn’t looking for this kind of (data) freedom…
You people bashing ext3 for being a “hack”, “old”, “slow”, or “featureless” should really read this…
http://ext2.sourceforge.net/2005-ols/paper-html/index.html
Ext3 is reliable, is fast enough (saying it is faster or slower than any of the others depends on which benchmarks you choose to believe), and has all the features that matter. This is why it is the most widely used Linux filesystem.
I want my filesystems to have…
1) adequate throughput
2) the features I need (ACLs, journaling)
3) extreme reliability
4) a very mature fsck in case it gets corrupted for whatever reason
XFS, JFS and ReiserFS mostly fit the first two requirements. They can mostly fit the third requirement too, although we can find evidence of data corruption happening with all filesystems (ext3 included), mostly caused by faulty hardware, power failures at exactly the worst possible time, or (in very, very rare cases) bugs in the filesystem implementation.
It is the fourth requirement that really gives ext3 the edge… e2fsck is really good (the ext2/3 on-disk structures being designed for reliability helps too) and can recover most data out of utterly screwed filesystems (a typical recovery session is sketched after this comment).
The others have more complicated on-disk layouts, and their tools end up not being as good.
ReiserFS is the worst of the lot: a single corrupt block can trash the entire FS, and all you can do is pray for reiserfsck to be able to save the rest of the data, which doesn’t seem to be something to bet your life on, as far as anecdotal evidence goes.
If I had a big SMP machine, with expensive disks connected to battery backed RAID controllers, I would probably choose XFS. But in the real world there is something called a “budget”, which forces us to buy single or dual CPU machines, with off-the-shelf SATA drives. I choose ext3 there, case closed.
Besides, neither XFS nor JFS originated on Linux. There is a lot of cruft in their code, as others have mentioned already.
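The kind of e2fsck session referred to above typically looks like this (device name hypothetical; backup superblock locations depend on the block size, and mke2fs prints them at format time):

    e2fsck -f /dev/sdb1        # force a full check even if the fs is marked clean
    e2fsck -b 32768 /dev/sdb1  # retry using a backup superblock when the primary is trashed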
“””They can mostly fit the third requirement too, although we can find evidence of data corruption happening with all filesystems (ext3 included), mostly caused by faulty hardware, power failures at exactly the worst possible time, or (in very, very rare cases) bugs in the filesystem implementation.”””
If you actually run tests, you will find that ext3, using default settings, is far more resistant to data corruption than most of the other filesystems at their safest settings. I believe that ReiserFS v3 might match ext3 now, since it supports the same levels of journaling, including an “ordered” mode.
You will find that XFS is notably more fragile than the others, due to its data-nulling policy, which has not been fixed, despite what the FAQ says.
To be perfectly clear: “XFS will eat your data”.
ext for you and me!
Seriously, if you are talking about everyday home use, then I would say the most important thing is reliability, IMO. And I find ext3 to be the most reliable. Benchmarks will never give me a good idea of what performance increase to expect while downloading my pron, listening to oggs, posting on OSNews, forum trolling, and chatting on Yahoo Messenger.
🙂
Plenty of examples of ReiserFS troubles on the Linspire forums, but then again…
BeOS good point: a journal-based FS (I used BeOS 4.5 and 5.0 Personal Edition). Compared with Windows 9x, Me and NT4 before Win2000, BFS was very attractive to me at the time.
BFS is a database-like FS: you can find and sort files as in a database.
Bad point: lack of any security measures, such as UNIX-like permissions and ACL support.
In my opinion, FAT16 and FAT32 are problematic filesystems: if an improper shutdown or a hung computer causes data inconsistency, I have to wait and see whether there are cross-linked files or lost clusters (deleting lost clusters, or recovering lost files into FILExxxx and DIRxxxx entries). This is very annoying, and it is easy to lose important data.
Under FAT16, the two copies of the FAT table (metadata) do not self-recover if one copy develops an error (the user does not know that one of the FAT tables has gone wrong, or that FAT has switched to the backup copy) unless you run scandisk or something like Norton Disk Doctor to find and fix the problem and sync them again. I am not sure whether FAT32 has this problem.
Also: no file permissions or ACLs.
NTFS good points:
Very stable – a journal-based FS.
Native UNICODE support in filenames. This is useful for users of non-English languages such as Chinese, Japanese and Korean (I am Chinese).
Built-in permissions and ACLs (also a bad point: users have to pay more for a Professional edition of Windows).
Bad point: the 4 KB default cluster size for NTFS is too small for today’s hard disks. A larger cluster size can improve I/O performance and lower fragmentation.
But the defragmenter cannot defrag clusters bigger than 4 KB, which is quite something…
UFS2 is the FreeBSD FS (FreeBSD is my hobby OS). UFS2 is a new design (not compatible with UFS1), but shares some code with UFS1. It is a 64-bit FS and supports ACLs natively. UFS2 (like UFS1) is not a journal-based FS, but uses a different approach, soft updates, to keep metadata consistent through a power failure or OS crash.
Good point: soft updates allow most data to be written to the hard disk in async mode and only a little in sync mode. This improves FS performance without hurting data consistency after a power failure or OS failure.
Bad point: running fsck after an OS crash or power failure is still required.
Also, no UNICODE support.