The Linux 5.10 release included a change that is expected to significantly increase the performance of the ext4 filesystem; it goes by the name “fast commits” and introduces a new, lighter-weight journaling method. Let us look into how the feature works, who can benefit from it, and when its use may be appropriate.
Better file system performance is always welcome, especially when it concerns what is probably the most common file system among desktop Linux users.
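For those who want to experiment once they are running 5.10, fast commits are an opt-in filesystem feature. A minimal sketch, assuming an e2fsprogs release new enough to know the fast_commit flag (1.46 or later); the device name is a placeholder:
# enable at filesystem creation time
mkfs.ext4 -O fast_commit /dev/sdXn
# or turn it on for an existing ext4 filesystem
tune2fs -O fast_commit /dev/sdXn
# confirm the feature flag is set
dumpe2fs -h /dev/sdXn | grep -i features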
This looks very promising if the benchmarks are anything to go by. It seems surprising that after so many years in existence, there are still such dramatic performance optimizations left on the table. I look forward to deploying this!
I haven’t benchmarked it recently, but my computers had incurred ext4 performance regressions that I was able to undo with the ext4 mount flag “noauto_da_alloc”. Be aware, though, that with this flag certain access patterns (like replacing a file via rename or truncate) no longer force an immediate commit, so writes can stay deferred for longer and more data can be lost in a crash.
https://www.kernel.org/doc/html/latest/admin-guide/ext4.html
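If it helps anyone, this is roughly how I carry the flag in /etc/fstab (the UUID and mount point here are just placeholders):
# example fstab entry with the flag appended to the usual options
UUID=example-uuid  /  ext4  defaults,noauto_da_alloc  0  1
# or apply it to a running system without rebooting
mount -o remount,noauto_da_alloc /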
So I think there are still optimization opportunities.
And here I am still waiting for BTRFS to be fully stable:
https://btrfs.wiki.kernel.org/index.php/Status
RAID 5 will outright lose your data, and while RAID 1+0 works, the status page only rates it “mostly OK”, with the note that “reading from mirrors in parallel can be optimized further”; in other words, it won’t actually read the data in parallel.
I tried ZFS, and it was not for me. And as fast as ext4 + thin provisioning is, I really would like the bit-rot protection.
sukru,
Yeah, most of my stuff is still using traditional raid/lvm/ext4. Some things are annoying, but I’ve mostly gotten used to working with those. In my environment logical volumes a la LVM2 are more important than btrfs, but even so there are things that I like about btrfs/zfs. I’ve never been very comfortable with the ZFS licensing, even though I think it’s the better and more capable product. I’m a bit squeamish about “layer violations”, but I understand why they do it.
Linux mdraid is not optimal either; it would be extremely nice to have drivers that automatically balance the load, but Linux doesn’t do that and raid array performance can be disappointing. But I’ve found that using raid 10 with a different raid layout (even in place of raid 1) can help scatter reads across the disks in the set and improve throughput.
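Something like this is what I mean (a sketch, assuming the “far” layout is what does the trick; device names are placeholders):
# two disks as md raid10 with the "far" (f2) layout instead of plain raid1,
# so sequential reads get spread across both members
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sda /dev/sdb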
I don’t know if it’s possible to do something similar in btrfs?
Alfman,
According to their docs, they will not scatter the reads:
For archival / NAS, bit-rot is a real concern for me. So many of my photos have succumbed to “purple bands” over the years that I prefer to keep the auto recovery and take the performance hit.
==
As for “layer violations”, they are actually a good thing, as long as they go through proper APIs / known protocols.
For example, TRIM has been around for SSDs for a very long while. Thanks to it, the disk essentially knows which blocks are free and which are used.
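Passing that information down is a one-liner these days (assuming util-linux’s fstrim and the usual systemd timer):
# trim a mounted ext4 filesystem once, verbosely
fstrim -v /
# or just let the distro's weekly timer handle it
systemctl enable --now fstrim.timer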
And then there is NVMe, with the new DirectStorage API on Windows, which is already used on the new Xbox Series consoles.
And we will soon have even more direct access to SSD chips with “zoned storage”: https://zonedstorage.io/introduction/zns/
Overall, abstractions are useful until they are not. I have no issues with it, except of course when every filesystem tries to do its own thing.
sukru,
Yes, but you don’t need layer violations to implement trim. Every layer from the file system on down can pass trim along without file system code getting intertwined with volume management or raid functionality. In other words, trim works fine with traditional file systems and block layers, without requiring the file system to be aware of or coupled with the block layers (as it would be with ZFS, for example).
Consider the following (which resembles my setup).
6 * SATA disks -> mdadm raid -> lvm thin volumes -> QEMU VMs -> ext4fs guest FS
Consider what happens when you fill a 100GB VM and then delete 90GB. Without trim the guest FS marks the sectors as unused, but the host still has to keep space allocated for them. With trim, however, the host can free the blocks the guest no longer uses without any layer violations; the guest remains oblivious to how (or even if) the host block layers are implemented.
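The plumbing is pretty simple in a setup like mine (a sketch; paths are placeholders, and it assumes a QEMU recent enough to honor discard on virtio disks):
# host: hand the thin LV to the guest with discard passed through
qemu-system-x86_64 ... \
  -drive file=/dev/vg0/vm1,format=raw,if=virtio,discard=unmap
# guest: periodically hand unused blocks back down the stack
fstrim -v /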
I’ve debated what ZFS and btrfs might buy me in my particular setup. I do like the idea of btrfs bitrot protection, but a lot of the functionality doesn’t really belong inside the VM. I’m not sure I’d get much benefit running btrfs on the host either, though, since that’s not where I need the file system; the host mainly uses disk images (and snapshots) for the VMs. ZFS has direct support for contained block devices (zvols), btrfs doesn’t. Technically VM disk images could be stored inside the file system, but I’m not sure about the performance, and I worry about storing 100% of the VMs in one giant file system on the host. How long would fsck take? I’ve also contemplated hosting the file system on the host and then using NFS and other guest file sharing mechanisms, but every time I’ve tested this the performance was much worse than mounting file systems in the guest.
Anyways, I’d be curious how others are doing it.
Well, that’s kind of what’s happening with ZFS and btrfs. You’ve got more and more functionality like raid moving into the file system, which is one of the arguments against FS layer violations, but again I realize there are benefits to be had in doing this. So I kind of refrain from taking an absolute position on it.
Alfman,
These ebbs and flows in filesystem design will probably take some time to stabilize, and then will break again for the next cycle.
To get more meta:
There is a btrfs implementation on top of mdraid in Netgear ReadyNAS systems. But I could not find their patches / code showing how this is implemented.
https://kb.netgear.com/26091/Bit-rot-Protection-and-Copy-on-Write-COW-in-depth
And this is everywhere. For example, layer 3 switches, layer 2 routers, or even application-specific software-defined networks are really useful for specific applications.
https://en.wikipedia.org/wiki/Multilayer_switch
Back to file systems: all these discussions just expose the need for updated APIs. mdraid does a very good job of implementing RAID, but it hides valuable information. If the filesystem knows a block is wrong, there should be an API to tell the RAID layer “find the right copy, and fix the bad one”. Many other similar zfs/btrfs features will probably become standard in the future.
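btrfs can already do this internally, because it owns both layers; the workflow today looks roughly like this (the mount point is a placeholder):
# verify every copy against its checksum, repairing from the good mirror
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data
# per-device counters of errors found and corrected
btrfs device stats /mnt/data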
The BTRFS saga is incredibly sad/funny. 11 years, and the Linux ZFS killer is just kind of a novelty FS where only the basic functionality works.
I’ll probably run it as the default FS for my laptop, since a single SSD in a system dedicated to one person seems to be the extent of its capabilities, but it is in no way ready for servers. md, LVM, and XFS work too well. Not to mention ZFS is a really good storage FS, and it has picked up quite a bit of steam lately, probably due to BTRFS failing to deliver on most of its promises.
On top of the feature set being the bare minimum, the workflow is awful. ZFS has had a lot of thought put into it, and it shows. The workflow is very smooth.
ZFS too has some interesting limitations.
For example, it cannot shrink a pool:
https://askubuntu.com/questions/1198119/shrink-zfs-partition-and-increase-swap-partition-size
For enterprise use that is not a common need. For a home user it might be necessary in a pinch.
What is more, you cannot add a disk to a vdev:
https://louwrentius.com/the-hidden-cost-of-using-zfs-for-your-home-nas.html
Which is also a concern for a home NAS. So you have a 4-disk raidz and want to add a fifth? Nope, you need to back everything up, build a new array from scratch, and restore those backups.
With BTRFS you can add the fifth drive, re-balance everything, and keep RAID1 redundancy. (Yes, it handles redundancy across an odd number of disks.)
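The whole thing is two commands (a sketch; the device and mount point are placeholders):
# add the new disk to the existing filesystem
btrfs device add /dev/sde /mnt/nas
# rebalance so data and metadata are spread across all five disks as RAID1
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/nas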
Been running ext4 without the journal since forever.
Most people are under the wrong impression that the journal makes your system more resilient to power outages or crashes; it does NOT.
It does, however, make things slower and decrease your SSD’s life span.
birdie,
So basically ext2fs? Haha.
Journaling was the main difference between ext2 and ext3.
Apparently you can disable journaling in ext4fs…
https://community.wandisco.com/s/article/Using-Ext4-filesystem-for-journaling
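If I read it right, it boils down to this (the device is a placeholder, and the filesystem has to be unmounted and clean first):
# check whether the journal feature is currently enabled
tune2fs -l /dev/sdXn | grep -i has_journal
# drop the journal from an existing ext4 filesystem, then re-check it
tune2fs -O ^has_journal /dev/sdXn
e2fsck -f /dev/sdXn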
I found some very old benchmarks…
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0390131ba84fd3f726f9e24fc4553828125700bb
It would depend on the level of journaling you are using. Journaling every operation should be much more resilient, but is too expensive. Journaling metadata can at least help protect from catastrophic FS corruption: you will likely see incomplete writes, but the file system’s own structures shouldn’t get corrupted. At least that is the intent of journaling, assuming the hardware & drivers don’t have buggy write barriers.
According to the link above journaling seems to add ~2% overhead in iozone tests. Maybe that could be an approximation of the lifespan consumed by journaling too. I wouldn’t turn it off personally, but I’d be curious if you’ve got more literature on the subject.
I was talking about the defaults (data=ordered). The journalling mode you’re talking about (data=journal) decreases performance even further:
data=journal All data are committed into the journal prior to being written into the main file system. Enabling this mode will disable delayed allocation and O_DIRECT support.
The main goal of journalling is to aid FS recovery and recovery performance in case of a crash.
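For completeness, the mode is chosen per mount; spelled out in fstab it would look something like this (device and mount point are placeholders):
# the default behaviour, written out explicitly
UUID=example-uuid  /data  ext4  defaults,data=ordered  0  2
# full data journaling: safer for data, but slower, with no delayed allocation or O_DIRECT
# UUID=example-uuid  /data  ext4  defaults,data=journal  0  2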
birdie,
That’s what I figured, although in my response I took it to mean you literally used ext4 without a journal. I think nearly everyone runs with the default journaling options.
Yes, of course.
But I literally use ext4 without journal on all my PCs. 🙂
mkfs.ext4 -O ^has_journal
all the way.
birdie,
You don’t need to have the journal on the same device:
https://cubethethird.wordpress.com/2018/02/05/improving-ssd-lifetime-on-linux-what-i-learned-not-to-do/
However, you want the journal on the fastest device, so SSD wear will still be an issue. There is research on this topic: https://csyhua.github.io/csyhua/hua-tc2018-journaling.pdf . But I don’t know of any of it that has made it into actual products.
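For what it’s worth, setting up an external journal is only a couple of commands (a sketch; device names are placeholders, and if I remember right the journal device has to use the same block size as the filesystem):
# create a dedicated journal device on a small, fast partition
mke2fs -O journal_dev /dev/nvme0n1p3
# create the data filesystem pointing at that external journal
mkfs.ext4 -J device=/dev/nvme0n1p3 /dev/sda1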
birdie,
Ok, then I found your response especially confusing when you said you were talking about the defaults. Oh well.