bcachefs: a general purpose COW filesystem

Thom Holwerda 2015-08-21 Linux 20 Comments

For those who haven’t kept up with bcache, the bcache codebase has been evolving/metastasizing into a full blown, general purpose posix filesystem – a modern COW filesystem with checksumming, compression, multiple devices, caching, and eventually snapshots and all kinds of other nifty features.

I’ll admit I had to do a bit of reading to educate myself on what bcache actually is. Fascinating to see that it has evolved into a full-blown file system.

About The Author

Thom Holwerda

Follow me on Mastodon @[email protected]

20 Comments

2015-08-22 2:21 am

Alfman verbose=1
I’ve always liked the concept: use a fast block device (64GB SSD) as a cache for a slow one (2TB HD).

Dividing data between a small SSD and HD as separate volumes has always seemed like a lousy choice. Do you want to speed up the OS, or the user files, or the applications/libraries…? With SSD acting as a cache, now you can get the benefits of SSD performance for whatever data you have on your HD that you use most often, without having to come up with a partitioning strategy to divide your data between HD and SSD. This works just like RAM caching, except larger and persistent.

One aspect I really dislike about bcache is that it imposes it’s own disk format. In principal, a cache should not need to change the structure of the disk in order to cache it.

Facebook has developed an alternative called flashcache, which IMHO is a better design. It works much more like how you’d expect a cache to work (ie transparently without changing the cached disk). However it’s not available in the mainline linux kernel.

https://github.com/facebook/flashcache/blob/master/doc/flashcache-do…

A third alternative is dmcache. Theoretically this is my favorite because it’s supposed to integrate with LVM, which I find extremely useful. Managing all disks through LVM makes sense to me, especially since we already have enough layers to deal with: disk/raid/lvm and now caching. The only problem is every time I try dmcache with LVM, it is not ready yet. Hopefully some day soon…

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plai…

Edited 2015-08-22 02:34 UTC

2015-08-22 4:20 am

sergio
ZFSonLinux configuring an SSD as a cache drive for the zpool.

And the best part: you can create volumes inside the zpool and format them with the FS that you prefer (you can even export the zvols with iSCSI to a Windows or OSX machine and they will benefit of the SSD cache too).

Works like a charm and It’s rock solid… the only problem with ZFS: it’s a freaking memory hog. Really insatiable (not a Linux problem though, ZFS is a pig in Solaris and BSD too… but It’s really worth it BTW).

2015-08-22 6:40 am

Alfman verbose=1
sergio,

ZFSonLinux configuring an SSD as a cache drive for the zpool.

Good point, I could use some education on ZFS ZFS does everything, apparently. I doubt I can use it because of the licensing problems, but I find it very interesting never the less.

I don’t know if ZFS would be appropriate for what I need. Let me give you a scenario: a server with lots of customer VMs running. LVM has been perfect for me. I divide the raid up into arbitrarily sized volumes for customers. I can instantly provision new customer disks, dynamically expand disks, instantly snapshot them, etc. There are some caveats but generally it does everything I need and nothing I don’t.

The disks get mounted inside the VM, so I think that nullifies the benefits of having combined volume+file system management under ZFS. A customer could use ZFS inside the VM, but with snapshotting available at the VM level, I’m not sure what would be gained there either. Caching devices are only useful on the host, rather than inside a VM.

I can see how ZFS would be a delight with virtuoso style compartments. But what about on full virtualization platforms? Does ZFS even let you store block devices within it? If not, then I don’t imagine that storing block devices as ZFS files would yield high performance. Can you think of any advantages for ZFS?

2015-08-22 6:46 pm

rleigh
Not sure why there would be any licensing problems. The license might prevent it being merged directly into the Linux kernel due to licence incompatibility with GPLv2, but it’s still under a proper free software licence and it’s absolutely OK to use it. It is, after all, directly merged into the FreeBSD sources where the compatibility issues weren’t present.

Can you store block devices on ZFS? See: zvol http://docs.oracle.com/cd/E23824_01/html/821-1448/gaypf.html . So you could create a separate zvol per VM.

However, as for any block device the zvol is of a fixed (but changeable) size. Like an LV. As an alternative you could create a separate *dataset* per VM with suitable access permissions and then present that to the VM, for example over NFSv4 or directly bind it into the VM filesystem if that’s visible to the host. This would avoid any size issues–just use zfs quotas and this would avoid wasting as much diskspace–you don’t need to allocate the space up front and you can also deduplicate and/or compress data between different VMs on the fly.

On FreeBSD I have one dataset per jail, which means I can snapshot and rollback the state individually, and then have whatever services I need in each jail (sshd, postgres, whatever). But it’s all in a single storage pool as regular datasets/files.

I used LVM+ext[34] on RAID for many years. Experimented with and got burned by Btrfs on several occasions–it’s still terrible. And then tried ZFS on Linux, before migrating to FreeBSD (literally, swap over the discs containing the Linux zpool and run “zpool import”, job done!) The limitations of LVM become increasingly apparent as you start to appreciate what ZFS can do. But if you’re happy with LVM-style volumes, then zvols are certainly the equivalent of an LV carved out of the zpool (VG/PV-equivalent).

Regards,

Roger

2015-08-22 8:48 pm

Alfman verbose=1
rleigh,

Can you store block devices on ZFS? See: zvol http://docs.oracle.com/cd/E23824_01/html/821-1448/gaypf.html . So you could create a separate zvol per VM.

However, as for any block device the zvol is of a fixed (but changeable) size. Like an LV. [/q]

Perfect, exactly what I wanted to know!

As an alternative you could create a separate *dataset* per VM with suitable access permissions and then present that to the VM, for example over NFSv4 or directly bind it into the VM filesystem if that’s visible to the host. This would avoid any size issues–just use zfs quotas and this would avoid wasting as much diskspace–you don’t need to allocate the space up front and you can also deduplicate and/or compress data between different VMs on the fly.

I’m uncertain of what you are saying here. You are talking about using a network file system instead of a local disk within the VM?

I’ve considered something like that, it might have some merit. So far though I’ve been giving the VM a virtfs local disk that works independently of networking. I’ve done some quick testing with 9p_fs (which is kind of like a NFS over virtio channels) to expose host filesystems to the guest, but due to caching differences is slower than a virtual disk. Also linux distros can handle virtio disks out of the box and that does have some value for customers.

On FreeBSD I have one dataset per jail, which means I can snapshot and rollback the state individually, and then have whatever services I need in each jail (sshd, postgres, whatever). But it’s all in a single storage pool as regular datasets/files.

Seems like a good setup. I like containers, and bsds have always been ahead of linux in that area. When everything is under your control, I think jails/containers are preferable to full machine virtualization. For better or worse I opted for machine level virtualization because it offered more isolation for customers. I’ve been able to migrate customer machines from elsewhere very easily regardless of what they were already running. One customer is 32bit and it doesn’t really matter.

I have tried migrating some customers by reinstalling the apps and importing their data+settings, but it took some time and in the end they didn’t appreciate my efforts. Maybe there is a better way to migrate an existing installation into a container? I haven’t dug too deeply into this because it doesn’t matter with machine level virtualization.

I used LVM+ext[34] on RAID for many years. Experimented with and got burned by Btrfs on several occasions–it’s still terrible.

btrfs has been on my back-burner, I just haven’t gotten around to using it and I’ve read other warnings like yours that it’s not sufficiently mature. I don’t really know personally though.

[q]The limitations of LVM become increasingly apparent as you start to appreciate what ZFS can do. But if you’re happy with LVM-style volumes, then zvols are certainly the equivalent of an LV carved out of the zpool (VG/PV-equivalent).

I understand this completely. LVM works fine for me because my use case doesn’t need anything more fine grained than partition level snapshotting, but otherwise I agree the fine grained capabilities of ZFS are very compelling, to mention nothing of the other features like hashing for data integrity.
2015-08-22 9:28 pm

rleigh
rleigh,

Can you store block devices on ZFS? See: zvol http://docs.oracle.com/cd/E23824_01/html/821-1448/gaypf.html . So you could create a separate zvol per VM.

However, as for any block device the zvol is of a fixed (but changeable) size. Like an LV.

Perfect, exactly what I wanted to know!

[/q]

While I haven’t used zvols myself, looking at the docs it looks like you can snapshot them just like a regular dataset and then zfs send/receive/rollback, so you could use this to do online snapshot and incremental streaming backups of the snapshots of the customer VMs. The only caveat appears to be snapshotting if you change the zvol size.

One of the limitations of LVM which I found annoying, both for regular use and in my schroot(1) virtualisation tool, was the need to allocate space for the snapshot LV, and it would stop working at some future point when either it or the parent LV had sufficient writes to use up the space allocation–to be 100% reliable you had to allocate the same space as the original LV! The ZFS snapshots still take up space, but since the space is from the general pool, you don’t need to preallocate it as a fixed size up front. Though you do need to remember to delete them before the free space in the pool is exhausted by snapshots!

[q]As an alternative you could create a separate *dataset* per VM with suitable access permissions and then present that to the VM, for example over NFSv4 or directly bind it into the VM filesystem if that’s visible to the host. This would avoid any size issues–just use zfs quotas and this would avoid wasting as much diskspace–you don’t need to allocate the space up front and you can also deduplicate and/or compress data between different VMs on the fly.

I’m uncertain of what you are saying here. You are talking about using a network file system instead of a local disk within the VM?

I’ve considered something like that, it might have some merit. So far though I’ve been giving the VM a virtfs local disk that works independently of networking. I’ve done some quick testing with 9p_fs (which is kind of like a NFS over virtio channels) to expose host filesystems to the guest, but due to caching differences is slower than a virtual disk. Also linux distros can handle virtio disks out of the box and that does have some value for customers. [/q]

Yes, that’s exactly what I’m suggesting. It may or may not be workable for your situation. I just mentioned it as something worth exploring since it might have some advantages. But it’s a tradeoff; having a zvol keeps the VM data contained as a single unit without any permissions or other issues at the expense of disc space. But if the customer has paid for x amount of space, the zvol will preallocate that amount of disc, so it’s guaranteed to be available which might not be the case if the space was overcommitted to too many VMs.

[q]I have tried migrating some customers by reinstalling the apps and importing their data+settings, but it took some time and in the end they didn’t appreciate my efforts. Maybe there is a better way to migrate an existing installation into a container? I haven’t dug too deeply into this because it doesn’t matter with machine level virtualization.

This is a bit outside my experience, I’m afraid. For my uses (mainly software development and testing), I have jails and VMware images, but they all NFS mount the user data so the actual VM images and jails contain only the base OS plus any installed packages. When I have needed to copy data in, I generally copy it in via NFS/rsync/scp, but that’s obviously after it’s been set up.

Regards,

Roger
2015-08-23 7:48 pm

indieinvader
You might be interested in what Joyent is doing with SmartOS[1] and the newly resurrected “LX BrandZ”[2], which allows you to run Linux hosts on bare metal inside of Solaris Zones.

[1]: https://smartos.org

[2]: http://www.slideshare.net/bcantrill/illumos-lx
2015-08-22 9:35 pm

Alfman verbose=1
rleigh,

Regarding FreeBSD, all else being equal I think I would be a happy BSD user. Many things seem cleaner on your side of the fence. However one of the things I worry about is 3rd party support. And having built my own linux distro, I know how much having 3rd party support makes or breaks the experience. I’m very adept at supporting myself, but it can be extremely tedious and time consuming – more and more I’m favoring things that just work already.

For example, Dell’s openmanage tools are extremely useful for managing dell servers. From instrumentation, to raid to remote control, they are very handy to have. Dell provides redhat linux support, but unfortunately the packages are proprietary with no source. I find lots of posters asking about freebsd support, but as far as I can tell it doesn’t exist.

As much as this is absolutely not freebsd’s fault, to be perfectly honest it has influenced my decision making and I’m still using linux. I’m curious what a freebsd user like yourself would have to say about these sort of issues.
2015-08-22 10:04 pm

rleigh
Well, I’m a fairly recent FreeBSD user, to be honest (last 20 months). I spent the previous 15 years as a Linux user and the last 10 as a Debian developer.

I think the support issue is certainly real. Though it’s really no different than it was for Linux before IBM/Dell/etc. got on board with it. The main issue is hardware support; what’s there is good but it doesn’t support the same breadth of stuff as Linux. But that’s not to say it won’t work on most hardware; it just lacks the esoteric and very latest stuff. For my own needs, I’ve just checked the hardware compatibility when e.g. purchasing an LSI SAS controller to be sure it would work (and it did just plug and play). But this is not really much different than the situation was for Linux not so long ago.

Third party software support is definitely not as good as for Linux, particularly the Debian/Ubuntu/RHEL/CentOS distributions where packages are built by many third-parties (to varying degrees of quality). On the other hand, I was pleasantly surprised by the amount of software in the ports collection and the tools (pkg) to manage it. It’s on a par with Debian’s repository and is mostly up-to-date. This is massively improved from my previous encounters with it. Having prebuilt binaries puts it on a par with apt-get/yum. It’s still a little less polished, but it’s improved measureably in the short time I’ve been using it.

Regards,

Roger

2015-08-22 11:27 pm

sergio
sergio,

ZFSonLinux configuring an SSD as a cache drive for the zpool.

Good point, I could use some education on ZFS ZFS does everything, apparently. I doubt I can use it because of the licensing problems, but I find it very interesting never the less.

I don’t know if ZFS would be appropriate for what I need. Let me give you a scenario: a server with lots of customer VMs running. LVM has been perfect for me. I divide the raid up into arbitrarily sized volumes for customers. I can instantly provision new customer disks, dynamically expand disks, instantly snapshot them, etc. There are some caveats but generally it does everything I need and nothing I don’t.

The disks get mounted inside the VM, so I think that nullifies the benefits of having combined volume+file system management under ZFS. A customer could use ZFS inside the VM, but with snapshotting available at the VM level, I’m not sure what would be gained there either. Caching devices are only useful on the host, rather than inside a VM.

I can see how ZFS would be a delight with virtuoso style compartments. But what about on full virtualization platforms? Does ZFS even let you store block devices within it? If not, then I don’t imagine that storing block devices as ZFS files would yield high performance. Can you think of any advantages for ZFS?

Do you plan to run the VMs and ZFS in the same physical machine?

If that’s the case, I don’t recommend ZFS at all cause the performance impact of ZFS is huge.

OTOH if you run a physical machine dedicated to ZFS and export all the volumes using iSCSI to another physical machine running the VMs… then I _highly_ recommend ZFS in this scenario.

The performance is excellent and you can format the exported volumes with the filesystem that you want (VMFS for example if you use ESXi). ZFS gives you everything: redundancy, integrity checks, snapshots, ssd caching and even deduplication (killer feature for virtualization) if you have enough RAM and CPU to support it.

ZFS is simply amazing, only really high end storage boxes support all the features that ZFS gives for free. Obviously the hardware requirements are pretty high.

Edited 2015-08-22 23:29 UTC

2015-08-24 10:45 am

ggeldenhuys
Yup, ZFS has had all that and much for for years. It has been around the block a couple of times, so we know it is rock solid – stable and reliable.

I’ve used stacks of file systems over the years. I wouldn’t trust my data on anything but ZFS now.

As for the memory usage… What’s the point in having 32GB (or more) RAM and the system only ever uses 5% of it. Hell, I don’t mind ZFS or the OS itself using more – if it makes my system faster, which is exactly what ZFS does on my FreeBSD workstation.

2015-08-24 7:07 pm

Bill Shooter of Bul Platinum Prime
What system only uses 5% of 32 GB? That’s like 1.6 GB.

2015-08-25 4:52 am

Kochise
Windows 2000 or XP

2015-08-22 4:29 am

WereCatf
One aspect I really dislike about bcache is that it imposes it’s own disk format. In principal, a cache should not need to change the structure of the disk in order to cache it.

I’m not sure I understand you correctly, are you under the impression that the source disk to be cached has to be of some specific format? Or that the destination disk to be used as cache has to? If the first then you’re wrong, bcache doesn’t require the source to be of some particular format. It even says on the website itself that it is filesystem agnostic.

2015-08-22 5:40 am

Alfman verbose=1
WereCatf,

I’m not sure I understand you correctly, are you under the impression that the source disk to be cached has to be of some specific format? Or that the destination disk to be used as cache has to? If the first then you’re wrong, bcache doesn’t require the source to be of some particular format. It even says on the website itself that it is filesystem agnostic. [/q]

Yes you can put any filesystem (or LVM, etc) on a bcached block device. However this isn’t what I was referring to. To clarify, I’m speaking about the backing block device itself, before any filesystem comes into play. The backing disk needs to be specially formatted for bcache.

http://evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bc…

[q]Getting started:

You’ll need make-bcache from the bcache-tools repository. Both the cache device

and backing device must be formatted before use.

make-bcache -B /dev/sdb

make-bcache -C /dev/sdc

The “cache device” will implicitly need some special cache specific formatting in all cases, but in principal the “backing device” does not. It can be a standard raw disk that just happens to be cached. For instance I might cache an existing windows NTFS disk (not that I’d ever want to, but just as an example). After flushing the cache, I can migrate the disk back to a windows machine and have it still work flawlessly. This is not the case with bcache because the backing device is formatted for bcache. The raw disk is unusable without a conversion.

Does that clear things up?

Edited 2015-08-22 05:47 UTC

2015-08-22 8:17 am

Lennie
The reason for this is simple:

bcache is a cache for writing data.

So writes first go onto the SSD than later are written to HDD.

When you have some kind of failure you don’t want to end up mounting only the HDD and put old data online and start reading/writing old data. The system needs some way to know that the only disks the system sees contain old data (maybe because the SSD connector is flaky).

A solution does exists, you can create a bcache blockdevice without a cache. So just format the HDD before you use it normally. Then you can add SSD later when you need it.

Edited 2015-08-22 08:19 UTC

2015-08-22 7:38 pm

Alfman verbose=1
Lennie,

When you have some kind of failure you don’t want to end up mounting only the HDD and put old data online and start reading/writing old data. The system needs some way to know that the only disks the system sees contain old data (maybe because the SSD connector is flaky). [/q]

I like the LVM+dmcache approach better myself. LVM is already designed to label disks and detect those that are missing etc.

An aside: I’m not sure if this has been fixed, but bcache’s superblock can conflict with the superblock of normal file systems. It shouldn’t happen when things are working normally but just something to watch out for.

http://bcache.evilpiepirate.org/#index7h1

If a partition or disk device does not register in the cache array at boot it may be because of a rogue superblock.

…

Because devices /dev/sdc and /dev/sdd then correctly identified as “bcache” file-systems they successfully added automatically at every subsequent boot, but /dev/sda2 did not and instead required a manual register. After identifying a left-over superblock from the previous ext2 file-system at byte 1024 which “make-bcache” does not erase during format because its offset begins at byte 4096, the problem was then corrected thus:

# dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2

[q]A solution does exists, you can create a bcache blockdevice without a cache. So just format the HDD before you use it normally. Then you can add SSD later when you need it.

There’s actually a tool that resizes/moves existing ext file system data to relocate it inside a bcache container while keeping all FS pointers intact:

https://github.com/g2p/blocks/blob/master/README.md

I’m partial to the LVM approach. While I could technically stack LVM on top of bcache (which is itself on top of raid, etc), I just find that bcache disk management is kind of duplicating what LVM already does better. If you aren’t using LVM, then you gotta do what you gotta do, but since I am it just seems redundant and I’d much rather use LVM to label and manage backing volumes.

Let me just emphasize it’s just an opinion, and I’m not trying to discredit bcache. Having the choice of bcache is great, more options is better than having too few. I’d love to hear about anyone’s successful experiences with bcache.

2015-08-22 6:52 am

Kochise
… are cooking the COW. Next time will be the SOW. Dolphin and Nautilus can browse the files. Stop bitching about the animal within.
2015-08-25 5:43 pm

MrVain
ZFS has a very efficient disk cache called ARC, if you dont mind having a very small ARC, you can get away with 4GB RAM servers. This means ZFS needs to always reach for the disks, so performance is not that good with little RAM. And if processes demands more RAM, the ARC will release it upon request. Until that happens the ARC will use all available free RAM. Is this bad?

ZFS deduplication is broken. Do not use. First of all, you need 1GB RAM per 1TB disk – maybe this is why they say ZFS is a RAM hog. Second, if you do many snapshots it can takes days before you delete a snapshot which is deduplicated.

Oracle has recently bought Greenbyte which has a ZFS deduplication engine which is best in class:

http://www.theregister.co.uk/2012/10/12/greenbytes_chairman/

“….”Our VDI offer is insane.” It’s a numbers game with the dedupe technology being the joker in GreenByte’s pack. Take 5,000 virtual desktop images, full-fat clones 40GB in size and each with 2GB of swap space. That sums up to 210TB of storage. Apply GreenBytes dedupe and it gets compressed down to under 4TB. A 4TB IO Offload Engine could support 5,000 fat clones…..”

Until ZFS gets Greenbyte dedupe engine, avoid dedup. But always use compression as it will increase performance. It is faster to read 10.000 bytes and uncompress in RAM, than it is to read 20.000 bytes. If data is not compressible LZ4 will abort prematurely. So no performance loss.

Snapshots are nice. On Solarish (OpenSolaris derivatives such as SmartOS) you can install a Linux Virtual Machine in a Container, configure Linux and Oracle database, and then make a snapshot. As soon as a developer/tester needs a server, you just clone the snapshot which takes a second, and then the user gets root access to the VM. The root access is local, not global to the server. You can probably do this with FreeBSD too. The clone will read from the Master template but write all changes to a new filesystem, so it works like a kind of dedupe – but very well. This requires not much disk space, as only the delta will be saved.

And if you virtualize a 32bit VM on 64 bit hardware, you can use gobs of GB of RAM and 10Gbit nics, etc – speeding up the 32 bit VM tremendously:

http://www.theregister.co.uk/2011/08/15/kvm_hypervisor_ported_to_so…

http://www.theregister.co.uk/2013/08/27/greenbytes_latency_smash_wi…

“….Though Joyent has not released official benchmarks rating its new hypervisor, Hoffman claims some ample performance gains. With I/O-bound database workloads, he says, the SmartOS KVM is five to tens times faster than bare metal Windows and Linux (meaning no virtualization), and if you’re running something like the Java Virtual Machine or PHP atop an existing bare metal hypervisor and move to SmartOS, he says, you’ll see ten to fifty times better performance – though he acknowledges this too will vary depending on workload.

“If anyone uses or ships a server, the only reason they wouldn’t use SmartOS on that box would be religious reasons,” he says. “We can actually take SQL server and a Windows image and run it faster than bare metal windows. So why would you run bare metal Windows?”….”

Also, ZFS can use fast SSDs as cache. Have been able to do that for ages. Some also use battery backed up RAM cards for that purpose. Just add it on the fly to an existing raid, and remove it when you want. One people used a cheap hardware raid card with 512MB RAM for that purpose (as a ZIL).

.

Actually, I dont get why he bcache engineer says “the reliability of ext4 and xfs”?? These filesystems have been proven to be unreliable by researchers. Researchers have also proven that the only safe filesystem today, is ZFS. Here are the research papers:

https://en.wikipedia.org/wiki/ZFS#Data_integrity

Edited 2015-08-25 17:45 UTC