This article discusses how storage capacity keeps increasing while performance and hard error rates have not improved significantly in years, and what that means for protecting data in large storage systems. “The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable.”
What the article states might be true, but the cost of storage goes down far faster than the problems rise.
So if you just throw more disks and a permanently running replication daemon at the problem, you will have a working solution.
It is not the end of the world. The mainframe solved this one decades ago.
That only works if you have relatively small amounts of data that are not modified frequently. If you have a large operation with frequently changing data, the potential for failure increases significantly.
I am not a fan of cheap storage, because what most people think they are getting in savings they end up paying for in terms of management, performance and reliability. I don’t care for SATA storage because I am not convinced that it will work reliably in the long run compared with SAS or Fibre Channel.
While the possibility of a massive failure is slight in most cases, it depends entirely on how the solution is deployed and what mechanisms are in place to protect the data. Having redundant storage to prevent data loss can get really expensive.
“It is not the end of the world. The mainframe solved this one decades ago.”
You seem to have missed the point. If a drive fails in a RAID, it takes hours to rebuild the array, and the larger the drives, the longer the rebuild takes. A 1TB drive takes maybe 10 hours; 2TB may take 24 hours; 4TB maybe 2 days; 8TB drives may take a week. That is because drives don't get much faster, only larger.
At some point it will take a frighteningly long time to rebuild the array. Say it takes one week. Rebuilding a RAID stresses the other disks so much that it is not uncommon for another disk to break during the rebuild; this happens more often than you think. And if another disk fails, you are screwed.
Therefore you use RAID-6, which allows two disks to fail. But there is still a likelihood that both redundant drives fail during a rebuild. At some point in the future, disks will be so large that another disk fails about as fast as you can rebuild the broken one. This is a direct consequence of larger and larger drives.
A decade ago, mainframes didn't have drives this large. Rebuilding a RAID was no problem; it went very quickly. Today it takes a very long time. So no, mainframes have not solved this problem. This is why people say RAID-5 will soon be obsolete, and it is what the article is about.
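To put rough numbers on it (just a back-of-the-envelope in Python; the 30 MB/s effective rebuild rate is only an assumption, real arrays under production load vary a lot):

REBUILD_MBPS = 30  # assumed sustained rebuild rate on a loaded array

for capacity_tb in (1, 2, 4, 8):
    hours = capacity_tb * 1_000_000 / REBUILD_MBPS / 3600  # MB to rebuild / MB per second
    print(f"{capacity_tb} TB drive: roughly {hours:.0f} hours to rebuild")

The exact figures do not matter; the point is that rebuild time grows linearly with capacity while per-drive throughput barely moves.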
Also, enter silent corruption. Disks will read and write bits erroneously without even noticing it. You will not get a notification that there was an error, and that is a bad thing. Around 20% of a disk's surface is dedicated to error-correcting codes, and those codes cannot fix every error, nor even detect every one. There are lots and lots of errors in every read and write that get corrected on the fly, but sometimes there will be errors the disk can neither correct nor even detect. It is like an oven whose lamp says it is turned off while the oven is in fact on: the hardware does not detect the fault, so it lies to you. Look at the spec sheet of a new drive; it says something like "unrecoverable error: 1 in 10^14", and on top of that there are errors that don't even get detected. But ZFS detects them and also recovers the data, so the "1 error in 10^14" does not apply in the same way with ZFS, because ZFS detects and corrects them.
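The usual back-of-the-envelope for that spec number looks like this (a sketch in Python; it assumes the 1-in-10^14 figure applies uniformly and independently per bit, which is a simplification):

import math

URE_RATE = 1e-14      # unrecoverable errors per bit read, from the spec sheet
BITS_PER_TB = 8e12    # decimal terabytes, the way drive vendors count them

def p_at_least_one_error(tb_read):
    # Poisson approximation: P(at least one error) = 1 - exp(-expected errors)
    expected = tb_read * BITS_PER_TB * URE_RATE
    return 1 - math.exp(-expected)

# e.g. reading the six surviving 2 TB disks of a 7-drive RAID-5 during a rebuild
print(f"12 TB read: {p_at_least_one_error(12):.0%} chance of hitting an unrecoverable error")

In other words, on big arrays you should expect to hit one of these errors during a rebuild, not treat it as a freak event.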
Sun knows about these problems, and ZFS fixes them. ZFS also allows safer configurations than RAID-5 and RAID-6, which makes it less susceptible to long rebuild times. For instance, three disks are allowed to fail in a raidz3 config, or you can mirror lots of disks and combine them into flexible layouts. Note that ZFS does NOT like HW RAID; a RAID controller only gets in its way. Therefore, ditch HW RAID while you can still get a fair price for it, and use ZFS to get a cheaper and safer solution. A box with 48 SATA 7200 rpm disks reads 2-3GB/sec and writes 1GB/sec, and the data is safe too.
CERN did a study on silent corruption, and the conclusion was that the "one error in 10^14" figure is optimistic; according to CERN's measurements, errors occur more frequently in practice:
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
Your data is at risk. Silent corruption and bit rot eat your data, silently, without the hardware telling you. The hardware doesn't even notice.
Clearly, something has to be done to cope with the errors that large drives and future filesystems will face. The main architect of ZFS explains some of the problems that will become more and more common:
http://queue.acm.org/detail.cfm?id=1317400
To put things into perspective, the probability that you will die within the next year is far greater than 0.01. I wouldn't be worried about a few corrupted files in that context.
Maybe if you are the CEO of a company with a critical system, worth billions of dollars or thousands of lives, you are willing to take any measure to minimize the risk of losing data? And ZFS doesn't require extra, specialized hardware: just a SATA controller card with no RAID functionality plus 7200 rpm SATA disks.
RAID is supposed to mean Redundant Array of Inexpensive Disks; with ZFS that actually becomes true. A good HW RAID card costs a lot. What happens if the vendor goes bankrupt? Where do you find a replacement card? You are locked in.
The ZFS code is open and you can do whatever you want with it. ZFS is future-proof, and it doesn't cost anything. Move your disks to a Mac OS X computer, a FreeBSD computer, Solaris SPARC or Solaris x86, type "zpool import", and you are done with the migration. All data is stored endian-neutral.
To me it is a no-brainer: why would you not use ZFS? It is better, safer, easy to administer and free. I've heard that creating a RAID with Linux and LVM takes something like 30 commands. With ZFS you write "zpool create tank raidz1 disc0 disc1 disc2 disc3" and you are done. No formatting; copy your data immediately. There is no fsck, and all data is always online.
But the great advantages that ZFS gives are nothing new for Sun's technology. DTrace is as good as ZFS, and so are the Niagara SPARC chips, Zones, etc. They are all open tech, and good tech.
Here a Linux guy builds his first ZFS storage server. Well researched and a good read:
http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/
If you're going to talk like that, then why worry about anything? Except maybe your job, because you'll get fired after all your data is lost to corruption.
I am a little sleepy, so I may be looking at this wrong, but would RAIDZ, with the whole checksumming bit and all that, mitigate some of the issues here? I also wonder how the time to rebuild a RAIDZ volume compares to the figures the author presented here.
Adam Leventhal explains it much better than I can:
http://blogs.sun.com/ahl/entry/double_parity_raid_z
ZFS definitely helps, and so does a good backup strategy. You are only as safe as your last successful restore.
The time to rebuild a RAIDZ array depends not on the array size, but on the amount of data actually stored on the array. RAIDZ only rebuilds the blocks referenced by the higher levels of the filesystem stack, which is possible because it is so tightly integrated with the filesystem.
So in answer to your question, it depends.
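A rough illustration of the difference (Python; the 30 MB/s effective resilver rate is purely an assumed figure, so treat the absolute numbers loosely):

RESILVER_MBPS = 30  # assumed effective rate; real numbers vary widely

def resilver_hours(used_tb):
    # resilver only touches data that is actually referenced by the pool
    return used_tb * 1_000_000 / RESILVER_MBPS / 3600

print(f"2 TB drive, pool 100% full: ~{resilver_hours(2.0):.0f} h")
print(f"2 TB drive, pool  25% full: ~{resilver_hours(0.5):.0f} h")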
8x $300 300GB SAS drives = $2,400 for a 2TB RAID setup
vs.
3x $80 1TB SATA drives = $240 for a 2TB RAID setup
So you basically still have to work out your solution to meet a specific project's needs, balancing speed, security, etc. I could double the number of SATA drives for RAID 10 and still be way under the cost of the SAS solution, and I might not even need another box to hold the drives. I think array performance is a bigger factor in the decision, because the cost of being overly redundant with SATA is still very cheap.
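Back-of-the-envelope on those prices (a Python sketch; it assumes RAID-5 for the SAS set and RAID-10 for the doubled SATA set, which is just one way to slice it):

sas_cost = 8 * 300             # 8 x 300 GB SAS drives at $300 each
sas_usable_tb = (8 - 1) * 0.3  # assume RAID-5, one drive's worth lost to parity
sata_cost = 6 * 80             # double the SATA drives: 6 x 1 TB at $80 each
sata_usable_tb = 6 / 2         # assume RAID-10, half the raw capacity usable

print(f"SAS  RAID-5 : ${sas_cost}, about ${sas_cost / sas_usable_tb:.0f} per usable TB")
print(f"SATA RAID-10: ${sata_cost}, about ${sata_cost / sata_usable_tb:.0f} per usable TB")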
You're right on the price, but there is one point where the SAS drives win: 8 x 10k-rpm SAS drives are way faster than 3 of any 1TB SATA drive available on the market. And even without taking rotation speed into consideration, SAS is a more efficient protocol than SATA.
Whether you keep SAS interfaces or switch to SATA, the real issue here is the reliability of hard disks.
Solid state disks should and will be the solution.
Flash memory (if that is what you mean by solid state) fails over time as well.
Hint: take a guess what the SS in SSD stands for.
But yeah, it does fail over time. And the situation is not likely to get better as it goes through process shrinks.
Hint: There are more types of “solid state” storage than Flash EEPROM.
There is battery backed RAM, there is MRAM, there is stuff in the laboratories that might make everything we use as storage obsolete. Someday.
http://en.wikipedia.org/wiki/Racetrack_memory is completely awesome, for example.
There are even nanoscale punch cards: http://en.wikipedia.org/wiki/Millipede_memory
And what are you using for the disk controller, the enclosure, the filesystem? All of these play a role in the performance and reliability of a storage solution.
If this is something you are going to cook up for home, that is one thing; if you are putting this together for a customer, it's not a road I would go down.
I thought modern RAID systems used error-correcting codes, which means even random bad bits would not irreversibly damage the data if there was a drive failure.
What am I missing?
PS. My days of studying RAID systems were over a decade ago.
True if you are running a system that actually does checksumming, like NetApp or ZFS, but many lower-end systems do not. Even if you have checksumming and can therefore detect, say, a flipped bit in a sector that you need for reconstructing a stripe, you would need a second parity stream to rebuild the stripe (one for the dead disk's missing sectors, one for the “bad” sector from the otherwise “good” disk). So IMHO multiple-parity RAID plus checksumming is a great idea, if you have to use parity RAID at all.
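A toy Python sketch of why single parity alone cannot catch this (all values are made up for illustration, not any vendor's actual implementation): with plain XOR parity, a silently flipped bit on a surviving disk flows straight into the reconstructed data and nothing in the parity math flags it.

# Three "data disks" and one XOR parity "disk"
data = [0b10110100, 0b01011001, 0b11100010]
parity = data[0] ^ data[1] ^ data[2]

# Disk 2 dies; we must rebuild it from disk 0, disk 1 and parity.
# Meanwhile disk 1 has silently flipped one bit and nobody noticed.
corrupted_disk1 = data[1] ^ 0b00000100

rebuilt_disk2 = data[0] ^ corrupted_disk1 ^ parity
print(f"original : {data[2]:08b}")
print(f"rebuilt  : {rebuilt_disk2:08b}")  # differs by one bit, with no error reported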
My main computer at work has 3TB of disk space. The data is spread out over 21 35GB, 78 17.5GB, and 14 70GB drives (113 drives in total).
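In round numbers (a quick Python check of the figures above):

drives = {35: 21, 17.5: 78, 70: 14}  # size in GB -> drive count
total_gb = sum(size * count for size, count in drives.items())
print(f"{sum(drives.values())} drives, {total_gb:.0f} GB in total")  # 113 drives, ~3 TB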
We use RAID5 sets of up to 14 drives per set. With this configuration, the rebuild time for a failed drive is small.
We use i5/OS, which uses a scatter/pack storage method (data is written to the drives in such a way that all disks have the same percentage used). Each RAID controller has its own CPU to handle the parity generation and checking; this removes a lot of processing from the main CPUs (4 in my case). It also means that we actually do parity checks/validation on both reads and writes.
PS: Each ASP (Auxiliary Storage Pool) can be up to 16T and we can have 32 ASPs per box.
We can also add/remove drives from an ASP with the system active. For example, if I want to replace some of the 17.5GB drives with 35GB drives, I would do the following: 1) identify the drives to be replaced, 2) issue the command to remove the drives from the ASP (this stops any new data from being written to them and starts moving their data to other drives), 3) physically replace the drives, 4) initialize the new disks, 5) attach the new disks to the ASP, 6) issue the command to re-balance the drives (spread the data equally across them).
Do you really have 113 drives in total? That can't be true, can it?
And did you know that your HW RAID doesn't protect against silent corruption? Errors might be introduced by the drive or the controller without the hardware even noticing.
CERN did another study: they wrote a 1MB special bit pattern every second on 3000 servers with Linux and HW RAID. After 3 weeks they found 152 instances where the files showed errors. They knew what the files should look like, and they didn't match. They thought everything was fine, but it wasn't. This is called silent corruption. To counter it you need end-to-end checksums, and HW RAID doesn't do that.
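To get a rough sense of scale from those figures, taking them at face value (Python, purely illustrative):

servers = 3000
mb_per_second = 1
seconds = 3 * 7 * 24 * 3600   # three weeks
errors_found = 152

total_tb = servers * mb_per_second * seconds / 1_000_000
print(f"~{total_tb:.0f} TB written, {errors_found} silently corrupted files")
print(f"roughly one incident per {total_tb / errors_found:.0f} TB written")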
Read this short text and convince yourself that HW RAID can do nothing against this type of error:
http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
Yes we have 113 drives.
We do end-to-end checksums; that's one of the reasons for having a CPU on each RAID card. On write, the card will do a read-back to validate the data.
This is a massive multi-user box; we average around 2,000 concurrent users, it also back-ends 1,500 time clocks, and it's not unusual to see 4,000 programs running at the same time. Thus, disk read/write speed is very important: the more disk arms you have, the faster data can be moved to and from the drives. Having 3x1TB drives would be too slow, since you could only service one request at a time with one arm (assuming RAID, but even 3 arms would be slow).
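The disk-arm point in rough numbers (a Python sketch; the per-spindle IOPS figure is just a typical assumption, not a measurement from this box):

IOPS_PER_SPINDLE = 120  # assumed rough figure for a 10k-15k rpm drive

for spindles in (3, 14, 113):
    print(f"{spindles:>3} disk arms: roughly {spindles * IOPS_PER_SPINDLE:>6} random IOPS")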
Holy moley! Those are some impressive numbers! The geek in me likes it. :o)
Regarding end-to-end integrity: I have asked lots of knowledgeable people, and it turns out that HW RAID doesn't do any good end-to-end data checks. Sure, HW RAID does some basic checks, but not much. I suggest you look into exactly how good that end-to-end check really is. Didn't you see the CERN studies of HW RAID?
AS/400, nice. That's a pretty beefy setup; what are you guys running on it?
This would have been much more interesting 4-7 years ago, but with the prevalence of SATA, a faster SATA interface on the horizon, and 512 GB SSDs having been available since late last year, I have to say, “So what?”
There are ways to solve this problem using SSDs. Although hard disks typically provide more storage density per unit area, that's not true when you get into the enterprise space: have a look at what the largest SAS disk is and what it'll cost you. And you might even save money right off the bat when you factor in the lower cooling and power requirements of SSDs.
Is it? I can't think of too many companies that are betting the farm on SATA storage in terms of performance or reliability. How are 7,200 RPM SATA disks going to compare to a 4 Gb Fibre Channel array using 15,000 RPM disks? They're not. SATA only beats FC in terms of capacity and cost.
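Part of that gap is simple physics: ignoring seek time (which also favors the faster drives), average rotational latency is half a revolution, so it falls straight out of the spindle speed (quick Python illustration):

for rpm in (7200, 15000):
    avg_rotational_ms = 60_000 / rpm / 2  # half a revolution, in milliseconds
    print(f"{rpm:>5} rpm: {avg_rotational_ms:.1f} ms average rotational latency")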
SSDs are also an immature technology that is only beginning to be integrated into devices like Sun's Unified Storage System 7000 series arrays, and even then they are being used as cache, not primary storage. While this might not be the case in five to ten years, SSDs are not quite ready for prime-time data center use just yet.
I wasn't separating SATA from SSDs in this context. I was implying that despite the high cost of SSDs, you could balance that against using a cheaper interface and still win on almost all counts.
However, there is no reason why SSDs can’t be made with SAS interfaces.
And don't knock SATA hard disks too much. I've seen data from several sources showing that they have lower failure rates than some of the higher-end drives, even if they're not as fast.
SSDs have been around a long time (see articles at http://www.storagesearch.com). It's not a matter of being ready for the data center; it's enterprise resistance to change and the slower adoption of newer tech by the big players.
There are plenty of consumer-grade SSDs which might work well for those who do not expect high-throughput I/O and a long life cycle. When you start paying $20,000 and upward for a storage array, everything changes.
I am sure I am not the only one who thinks SSDs are not ready, and who is waiting for early adopters and risk-takers in the field to shake them out before recommending them as part of a replacement or upgrade for existing storage.
If your budget (in money, power, and space) is truly unlimited, you're right. But what if you can choose, say, mirrored SATA or parity RAID fibre given the constraints at hand? What if you can choose SATA that is half full, versus fibre at 80%? What if you can afford more SATA spindles? I don't think it is that clear-cut given the normal constraints.
Budget does come into play. Our history with SATA storage here has not been a good one. We bought storage from a vendor that after six months decided to get out of the SATA storage game and dropped support for the devices. We now have over a hundred 500 GB SATA drives that we pulled from the arrays and junked the rest.
We also have SATA solutions from other vendors (NetApp, HP), and the jury is still out. Power, AC, and space do come into consideration, but some of the people I work with also expect a level of performance that SATA might not meet, and the idea of using SATA instead of SCSI or FC makes some of them uncomfortable.
Most of the server rooms I have worked in are near or over capacity for power and AC, but new ideas are usually the hardest sell.
That’s pretty bad. I guess it’s kind of in line with what I mean though: SATA is just one piece of the puzzle. If the rest of the stack is junk, it almost doesn’t matter what the drive interface is. From your other posts I think you are saying the same thing. I just wouldn’t blame SATA, rather junky arrays.
True. It’s kind of funny that by that mentality anybody would accept anything but direct attach storage. I mean, just because the SAN controller has fibre ports on both sides doesn’t mean there isn’t a very complicated black box in the middle. Thinking of it as “fibre from host to spindle” is sort of meaningless when there is no direct path from host to physical disk.
Our problem is that the Government wants to build a SAN, but they want to use existing components that are in production (a bad idea), and I really don't think it has sunk in that mixing components is not a good idea (we have 1 Gb, 2 Gb and 4 Gb FC arrays and libraries). Our stuff is direct-attach at the moment, which works but is not flexible.
Unfortunately this is what happens when you build something piecemeal and buy the key pieces (the FC switches) last.
In Germany there is a saying: “Just because everybody has ridden a train, everybody believes they are an expert in the railway business.” Or for other countries: just having flown on an airplane doesn't make you an aviation business expert.
That said, many people tend to think that just because they have one or two disks in their server at home, they are storage experts. For example, the SATA vs. SAS thing: of course the SAS interface itself doesn't make storage more available, but better mechanics do, and you often find those mechanics in SAS drives. Or they don't think about the implications of the operating-hours specification.
I wrote a rather long article not long ago about these topics and all the misconceptions about storage. Maybe someone will find it useful: http://www.c0t0d0s0.org/archives/5906-Thoughts-about-this-DIY-Thump…