The latest blog entry from Steven Sinofsky about Windows 8 describes the Storage Spaces functionality. From the blog entry it seems Windows 8 is getting something ZFS-like. Storage Spaces can be created from the command line via PowerShell, or in the Control Panel for those who prefer a more mouse-friendly interface.
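For reference, a rough sketch of what this looks like from PowerShell, based on the Storage cmdlets in the Windows 8 / Server 2012 previews (the friendly names "Pool1" and "Data" are just placeholders):

    # List the physical disks that are eligible to be pooled
    Get-PhysicalDisk -CanPool $true

    # Group them into a new storage pool
    New-StoragePool -FriendlyName "Pool1" `
        -StorageSubSystemFriendlyName "Storage Spaces*" `
        -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

    # Carve out a thinly provisioned, mirrored space that can be larger
    # than the physical capacity currently in the pool
    New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "Data" `
        -Size 10TB -ProvisioningType Thin -ResiliencySettingName Mirror

The resulting space then shows up as an ordinary disk that you initialize and format with NTFS as usual.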
Not terribly unique by any means.
I particularly dislike the fact that 10TB is reported available, even if you only have, say, 1 TB available… I prefer to know the true state of my storage capabilities and utilization as that prevents issues down the road.
That aside, this is just a fancy form of RAID with vendor lock-in and no additional perks to mention.
In the end, this is more like setting up dynamic disks in Windows as it stands now… something I actually prefer over BIOS-driven software RAID (except you can’t use it for boot volumes).
Wonder how much money was wasted on this…
–The loon
looncraz,
“I particularly dislike the fact that 10TB is reported available, even if you only have, say, 1 TB available… I prefer to know the true state of my storage capabilities and utilization as that prevents issues down the road.”
Hmm, that’s peculiar. It sounds like their “thin provisioning” is really an overcommit feature, where more resources are promised than are available.
My own opinion is that overcommitting is a very bad idea if, when we call the operating system's bluff, it cannot be recovered from gracefully (i.e. applications die through no fault of their own). It will be interesting to learn how these Windows 8 FS pools handle overcommits.
In the obvious case of create/read/write calls, they can just return error codes, but it gets much more complicated when memory-mapped data manipulation fails (consider that Linux's answer to memory overcommit is the "out-of-memory killer").
Well, overcommit is possible, but it's a pretty niche feature for big old stupid DBs in server environments where you're better off announcing a giant data partition and letting it be initialized at once rather than bringing it down for maintenance every time you need to expand it later.
The more important feature of thin provisioning is that pool space is allocated on the fly to the individual volume requesting it.
Basically, on a 1TB drive you're going to have C:, D: and E: volumes each reporting 1TB available, with their free space shared, and you wouldn't need to worry about running out of space or wasting it on some volumes. But you will still have 1TB in total, of course. And I'm not sure if Windows will be able to reclaim and reallocate unused space.
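If the preview's Storage cmdlets behave like their Server 2012 counterparts, you should be able to see that gap between what a thin space reports and what it actually consumes from the pool; a hedged sketch ("Pool1" is a placeholder name):

    # What each space claims to be vs. what it actually occupies in the pool
    Get-VirtualDisk | Select-Object FriendlyName, Size, FootprintOnPool, ProvisioningType

    # Total pool capacity vs. how much of it has really been allocated so far
    Get-StoragePool -FriendlyName "Pool1" | Select-Object FriendlyName, Size, AllocatedSize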
It certainly won't panic. Ever tried to add a swap file with holes in Linux? It simply won't allow it. You have to back every zero of it with real storage.
static666,
“The more important feature of thin provisioning is that pool space is allocated on the fly to the individual volume requesting it.”
To be honest I'm not sure what benefits "thin provisioning" has to offer over logical volumes.
Certain file systems can already be dynamically expanded as needed without overcommitting them in the first place. The existing data and structures can be used as-is within an ever-larger volume. This way, no fs re-formatting is needed, just appending more space to the end (LVM is perfect for this).
Unless there is an NTFS limitation at play, I'm uncertain why Microsoft would choose the overcommitment implementation described, since it offers no discernible advantage over a file system that can grow dynamically over its lifetime.
Another question: Can the pool’s size limit ever be changed or is it a hard limit until the whole pool is reformatted?
"It certainly won't panic. Ever tried to add a swap file with holes in Linux? It simply won't allow it. You have to back every zero of it with real storage."
I wasn't really talking about swap; Linux routinely overcommits memory, but the scenario here is a bit different with disk space.
What happens when overcommitted disk resources run out? Windows ought to return an error, but sometimes, as with memory-mapped files, that can be tricky to handle. I don't really know what Windows does then.
I'm also worried about performance. The way I see it, you can take a 1TB drive and pretend to have a 10TB one. But that would mean you can write anywhere in the 10TB range, and I don't see how you can restrict the filesystem to only use the first 1TB. Well, surely you can, but then it would only work for overcommit-aware filesystems.
So I imagine there is some remapping that goes on behind the scenes. This means it will (1) eat disk space to keep the mapping and (2) need to consult the mapping on every actual disk access. If the mapping doesn't fit in RAM, it will have to be read from disk, with the associated performance hit.
In any case, I just don't see anything interesting in this feature. I think ZFS offers everything Windows 8 has here, plus a lot of extra features. I'm not using ZFS myself, though, because I only run Linux; I cannot wait for Btrfs to be production-ready.
What a blatant repackaging of the LDM introduced in Windows 2000 under a posh new name, with thin provisioning on top. Are they trying to make Windows 8 look more feature-packed than it is actually going to be?
Took them another 12 years to finally get rid of the need to carefully split the disk into C: and D: drives only to run out of space on the former.
It’s called marketing and sadly enough it works.
The biggest advantage of ZFS is that it protects your data against data corruption.
In RAM sticks, bits are flipped randomly because of cosmic radiation, current spikes, etc.; that is why we have ECC RAM, which detects and corrects these random bit flips.
These random bit flips also occur on disks, but most filesystems are not built to detect and correct them. In particular, research shows that NTFS, ext, JFS, XFS, etc. and hardware RAID are not safe:
http://en.wikipedia.org/wiki/ZFS#Data_Integrity
But research shows that ZFS does detect and protect against data corruption. See the research in the link above.
I just hope this will have checksums.
Then bye bye FreeBSD + ZFS + Samba + uShare.
I just love vendor lock-in: Xbox, Windows PC and laptop, and Windows Phone, all connected to the media server and to each other.
<3
I think this is a very positive improvement in Windows 8. Especially for the server environment, or for users with lots of storage needs.
1. It's not just a reimplementation of the existing Windows logical and dynamic disks. It isn't plain RAID 1, 0, or 5 either. It's a system that lets you specify RAID 1-style redundancy (mirroring) or RAID 5/6-style parity per space.
2. It uses 256MB chunks distributed equally across all disks, unless you specify certain conditions. For example, you could allocate speed-critical sections to your fastest drive.
3. Over-provisioning is a nice feature because it means you can "set it and forget it." No need to do lengthy NTFS resizes and RAID rebuilds in the future when you need to add space; simply pop in a new disk. You really reduce the risk of doing resizes and rebuilds, too.
4. It can intelligently repopulate the 256MB chunks as needed, instead of rebuilding the whole array.
5. Windows 8 includes things like background NTFS scans (scrubbing), online chkdsk (for most problems), and a really fast offline chkdsk for serious problems.
6. Yes, it doesn't include checksumming built into the storage pool (AFAIK, so far), but it does provide an API to checksum the data you have on the storage pool. This lets a program choose the correct copy of the data from the pool and restore from that (assuming you have redundancy).
7. It lets you specify a backing device for the journal (similar to Linux), so you could put the storage pool's journal on an SSD and really speed up the slowest parts of calculating parity and keeping a consistent state.
All in all, it isn't ZFS, and it doesn't provide nearly as much flexibility as some of Linux's solutions, but it is a major improvement to Windows when dealing with lots of data.
Even for the average user, the background scrubbing and online NTFS checks will help catch errors before they get worse (a rough command sketch follows below).
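For what it's worth, a hedged sketch of points 1 and 5 as they appear in the preview's PowerShell tooling (the names "Pool1", "Archive" and drive letter E: are placeholders):

    # A parity space (RAID 5/6-like redundancy) instead of a mirror
    New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "Archive" `
        -Size 4TB -ProvisioningType Thin -ResiliencySettingName Parity

    # The new online chkdsk: scan while the volume stays mounted,
    # then fix the flagged problems in a short targeted pass
    Repair-Volume -DriveLetter E -Scan
    Repair-Volume -DriveLetter E -SpotFix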
I don't understand this part. When you have virtual volumes made of several physical drives, isn't it possible to use, say, NTFS on every physical drive, and a much simpler virtual filesystem for the virtual drive that can be resized and remapped at will and forwards every "advanced" command (metadata, etc…) to the underlying physical FS?
This leads me to ask something I've regularly wondered about on Linux: what happens if the journal is corrupted on a journaled FS? SSDs often brutally fail without warning and with full loss of data; could the filesystem survive such an event if it was only its journal that was on an SSD?
Neolander,
"isn't it possible to use, say, NTFS on every physical drive, and a much simpler virtual filesystem for the virtual drive that can be resized and remapped at will and forwards every 'advanced' command (metadata, etc…) to the underlying physical FS?"
You're thinking of something like a unionfs distributed across disks? That could work, but in my experience with AUFS there are some performance issues with that approach; for example, the virtual fs has to check every branch of the union to see whether a file exists. It's not an insurmountable problem (one could create a fast index to locate files immediately), but I don't know why one wouldn't just use a dynamically growable file system directly on top of a dynamically growable container like LVM.
"This leads me to ask something I've regularly wondered about on Linux: what happens if the journal is corrupted on a journaled FS? SSDs often brutally fail without warning and with full loss of data; could the filesystem survive such an event if it was only its journal that was on an SSD?"
I don’t believe the journal is ever consulted if the file system is clean, so in that regard it’s “optional”. It is only intended to protect against a single type of failure mode: incomplete writes to the file system structures caused by power failures/crashes. I would think that a complete loss of the journal could only cause data loss if there is a simultaneous failure in the rest of the file system… however I don’t have any experience with that kind of setup.
In any case I can't see any advantage to keeping the journal on an SSD as opposed to another hard disk, assuming the SSD has worse failure characteristics. The very property at which an SSD excels, random access/seek time, gives no benefit for purely sequential journal data. The journal itself is extremely unlikely to be a bottleneck on a second disk.
Journal writes will be synchronous and take priority over async writes. Since there's a limited number of IOPS that a hard drive can handle, removing the sync writes from that data path can speed things up, especially if you get an SLC-based SSD for the journal, as those are optimised for writes.
The difference you are talking about is only in implementation. You can either have underlying NTFS partitions with some type of filesystem on top to combine them, or an underlying storage pool (like LVM) with an NTFS partition on top.
Does it really matter which way it’s done? Not really.
What matters in this case is that when you do need to resize, it's a simple operation of adding another disk. There's no need to resize the filesystem, or rebuild the entire array to distribute parity. Each of those operations carries a risk of failure, while having over-provisioning from the start means that the storage pool has already considered those cases in its design.
With ext4, if the journal is corrupted, you can just discard it and fall back to ext4 without journaling (similar to ext2). The data itself will be fine, but anything in the journal could be lost. It's then possible to rebuild the journal when you reactivate it on a new device.
NTFS has a similar structure. The journal can be disabled when required. With Windows 8, it can also be allocated to a faster device (finally!).
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363877(v=…).aspx
hechacker1,
"There's no need to resize the filesystem, or rebuild the entire array to distribute parity. Each of those operations carries a risk of failure, while having over-provisioning from the start means that the storage pool has already considered those cases in its design."
Your comments seem to imply that over-provisioning from the start is somehow more robust, but I cannot see why. I see the “thin provisioning” semantics and disk failure handling as two different variables. In other words, one doesn’t require initial over-provisioning in order to achieve the fault tolerance you are describing, even during a resize.
We can agree it's better than what Windows had before, but I'd still like you to justify why dynamic expansion with over-provisioning is superior to the same thing without over-provisioning. I get the feeling it has to do with a quirk of their implementation rather than something that needed to be done that way.
“NTFS has a similar structure. The journal can be disabled when required. With Windows 8, it can also be allocated to a faster device (finally!).”
Just a nitpick: I don't think the journal needs to be on a "faster device"; just having it on a separate device of the same speed would eliminate all the excess seeking on the primary device.
Without over-provisioning at the beginning, you are forced to run utilities (even in Linux) to:
1. Expand the volume onto a new disk.
2. Allow the underlying storage pool to recalculate and distribute parity for the new disk (which affects the entire pool).
3. Resize the file system on top with a tool. This can be fast, but it can also be slow and risky depending on the utility. NTFS can generally resize up well. Still, it's an operation where you probably want a backup before doing it.
In contrast, with over-provisioning you don't have to do the above steps. It's handled automatically and from the start.
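To make the contrast concrete, in the over-provisioned case growing the pool should amount to something like the following (assuming the Windows 8 Storage cmdlets; "Pool1" is a placeholder):

    # Add any new, unpooled disks to the existing pool; the thin spaces
    # and the NTFS volumes on top of them are untouched
    Add-PhysicalDisk -StoragePoolFriendlyName "Pool1" `
        -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

That's it: no partition resize, no filesystem resize, no full parity rebuild.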
With regard to having an SSD as a backing device, it allows you to speed up parity in situations like RAID 5. The read-modify-write cycle is slow to perform on an HDD but fast on an SSD, especially in cases where small writes dominate the workload. An SSD allows for fast random r/w, and the writes to the HDD pool can then happen as a serialized operation.
Some RAIDs get around this problem with a volatile cache, but you are risking data to minimize the parity performance hit. Using an SSD means it's non-volatile, and the journal can be played back to finish the operation. I guess you could do it on a regular HDD, but you would still be measuring latency in tens of milliseconds instead of <1ms. That's an order-of-magnitude difference.
It’s all theoretical at this point, but Microsoft briefly mentioned having an SSD would be faster. We’ll have to wait for more details.
hechacker1,
“1. Expand the volume onto a new disk.”
No different than with over-provisioning.
“2. Allow the underlying storage pool to recalculate and distribute parity for the new disk (which affects the entire pool).”
First of all, empty clusters don’t technically need to be initialized at all since they’re going to be added to free chains anyway.
Second of all, both the new data and parity tracks can be trivially initialized to zeros without computing any parity.
Thirdly, whether or not the initial pool was over-provisioned doesn’t determine the need (or not) to compute parity for the new disk.
"3. Resize the file system on top with a tool. This can be fast, but it can also be slow and risky depending on the utility. NTFS can generally resize up well. Still, it's an operation where you probably want a backup before doing it."
Here we agree: the ability to resize safely and transparently depends totally on the tools and fs used. However, if those are the design goals around which we build our fs/tools, I don't believe initial over-provisioning is implicitly required to achieve them.
As far as safety is concerned, the only preparation absolutely necessary to complete a resize is the generation of a free cluster list for the new disk (which is trivial because it's initially empty) and maybe the creation of new inodes on the new disk. Then, in one instant, the newly prepared disk can be added to the disk mapping, and even this change can be managed by an existing journal for safety.
None of those tasks is inherently dependent upon initial over-provisioning. So I'm still wondering why Microsoft would opt for an over-provisioned implementation. It occurs to me that they might be designing around software patents, in which case it makes more sense.
Edit: If you still think I’m missing something, well that may be, let me know what it is.
"In contrast, with over-provisioning you don't have to do the above steps. It's handled automatically and from the start."
Well, that's the thing I'm worried about. If Microsoft's implementation is dependent upon its initial over-provisioning, then that means a Windows disk pool will need to be rebuilt from scratch once its initial over-provision limit is reached. This is worse than an implementation which can be dynamically expanded without a static limit.
"With regard to having an SSD as a backing device, it allows you to speed up parity in situations like RAID 5…"
I agree with that, but we were talking about an external journal, in which case the performance of the journal device is almost certainly going to be faster (or at least no slower) than the primary disk, because all of its writes are linear. However, I didn't mean to get sidetracked by this.
You are confusing two separate things: expanding a filesystem and expanding a storage pool. Probably because you are thinking in terms of a single system where all the storage is local.
Think bigger, like in terms of virtual machines running on a server, getting their storage from a storage pool.
The options are:
* create a storage pool of size X GB using all of the physical storage available; split that up into X/10 GB virtual disks to support 10 VMs.
* create a storage pool of size Y GB, 10x the size of the physical storage available; split that up into Y/10 GB virtual disks to support 10 VMs (meaning each VM in this setup has 10x the storage space of the VMs in the previous setup).
If you go with the first option, and stuff your VMs full of X/10 GB of data, then you run into a sticky situation. Now you have to add storage to the pool, expand the pool to use the new storage, expand the size of the virtual disks (usually done while the VM is off), then expand the disk partitions inside the VM, then expand the filesystem inside the VM. This leads to lots of downtime, and many places for this to go sideways.
If you go with the second option, your VMs already have disks 10x the size of the first option, even though they aren't using that much data and you don't expect them to for a while. Now you stuff your VMs with X/10 GB of data, meaning the pool has run out of physical storage space. Now, all you do is add physical storage, expand the pool to use the new storage, and carry on. That's it. The VMs never need to know how much actual storage space is in the pool, as they just see huge virtual disks. They still have free space in their virtual disks and filesystems. Saves you a lot of time, effort, and potential crashes.
Eventually, the VMs will get stuffed full of data to the point that they run out of disk space, and you have to resort to option 1 (expand pool, expand virtual disks, expand filesystems). But option 2 lets you push that way out into the future.
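A hedged sketch of option 2 with the Windows 8 Storage cmdlets, assuming a pool "Pool1" already exists and is physically much smaller than the space being promised (names and sizes are placeholders):

    # Carve ten thin 1TB spaces out of the pool, one per VM
    1..10 | ForEach-Object {
        New-VirtualDisk -StoragePoolFriendlyName "Pool1" `
            -FriendlyName ("VM{0}-disk" -f $_) `
            -Size 1TB -ProvisioningType Thin -ResiliencySettingName Mirror
    }

    # Later, when physical space actually runs low, just add disks to the pool;
    # the VMs and their virtual disks never notice
    Add-PhysicalDisk -StoragePoolFriendlyName "Pool1" `
        -PhysicalDisks (Get-PhysicalDisk -CanPool $true)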
phoenix,
Excellent point; you are right, I failed to consider that. Virtual machines emulate hardware media, which have a fixed size, and large virtual disks have to be thin-provisioned to be stored efficiently.
I should have thought of it, but in my defense, the example they used was about over-provisioning one’s local C/D/… drives, which still doesn’t make sense to me.
In any case, I won't deny you are right: it could be useful for thin-provisioned virtual machines. But I thought those problems were already solved. Can't logical volumes already provide the kind of sparse allocation semantics needed for virtual machine thin provisioning? Also, if virtual volumes are stored in VMDK or QCOW files, those formats already support dynamically growing disk images with no special over-provisioning configuration on the host.
You’re thinking of things backwards. You don’t want to layer filesystems like that.
You want to aggregate your physical storage into a “virtual volume” or “storage pool”. Then split that pool up into smaller chunks, whether that be “logical volumes” or “partitions” or whatever. And, then, finally, at the very top, the “filesystem”.
Linux does this via LVM (physical storage -> physical volumes -> volume groups -> logical volume -> filesystem).
Solaris/OSol/FreeBSD does this via ZFS (physical storage -> virtual devices -> pool -> filesystems).
And now Windows has this (physical storage -> pool -> spaces -> filesystems).
In most filesystems/storage systems, the journal is write-only. Data is written to the journal, then later, the data is also written to the filesystem.
The only time the journal is read is at boot, and only if the filesystem is “dirty”.
If the journal disk dies, and the filesystem is “clean”, then no big deal, all the data in the journal is also in the filesystem.
If the journal disk dies and the filesystem is “dirty”, then you will lose whatever data is in the journal and not yet written to the disk.