RAID: Part 5 – RAID5 and RAID6

Now that the parity post is out of the way, we can move into RAID5 and RAID6 configurations.  The good news for anyone who actually plodded through the parity post is that we’ve essentially already covered RAID5!  RAID5 is striping with single parity protection, generated on each row of data, exactly like my example.  Because of that I’ll be writing this post assuming you’ve read the parity post (or at least understand the concepts).

RAID5

Actually, from the parity post not only have we covered RAID5…we also covered most of our criteria for RAID type analysis.  Sneaky!

Before continuing, let me make a quick point about RAID5 group size (note: this also applies to RAID6).  In our example we did 4+1 RAID5.  X+1 is standard notation for RAID5, meaning X data disks and 1 parity disk (…kind of – I’ll clarify later regarding distributed parity), but there is no reason it has to be 4+1.  There is a lower limit on single parity schemes, and that is three disks – a 2+1 – since with only two disks you would just do mirroring.  There is no inherent upper bound on RAID5 group size, though I will discuss a nuance of this in the protection factor section; I could theoretically have a 200+1 RAID5 set.  On an EMC VNX system, the upper bound of a RAID5 group is a system limitation of 16 disks, meaning we can go as high as 15+1.  The more standard sizes for storage pools are 4+1 and the newer 8+1.

That said, let’s talk about usable capacity.  RAID5 differs from RAID1/0 in that the usable capacity penalty depends directly on how many disks are in the group.  I’ve explained that in RAID5, for every stripe, exactly one strip must be dedicated to parity.  Scale that out to the disk level and it translates into one whole disk’s worth of parity in the group.  In the 4+1 case our capacity penalty is 20% (1 out of 5 disks’ worth of capacity is used for parity).  Here are the capacity penalties for the schemes I just listed:

  • 2+1 – 33% (this is the worst case scenario, and still better than the 50% of RAID1/0)
  • 4+1 – 20%
  • 8+1 – 11%
  • 15+1 – 6.25%

So as we add more data disks to a RAID5 group, our usable capacity penalty goes down, and it is always better than RAID1/0’s 50%.
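
If you want to play with the arithmetic yourself, here’s a quick sketch in Python (the group sizes are just the examples above, nothing array-specific):

```python
# Capacity penalty for an X+P parity RAID group: P disks' worth of parity
# out of X+P total disks.
def parity_penalty(data_disks, parity_disks=1):
    return parity_disks / (data_disks + parity_disks)

for x in (2, 4, 8, 15):
    print(f"{x}+1 RAID5: {parity_penalty(x):.1%} of raw capacity lost to parity")
# 2+1: 33.3%, 4+1: 20.0%, 8+1: 11.1%, 15+1: 6.2%
```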

Protection factor?  After the parity post we know and understand why RAID5 can survive a single drive failure.  Let’s talk about degraded and rebuild.

  • Degraded mode – Degraded mode on RAID5 isn’t pretty.  We have lost a single disk but are still running thanks to our parity bits.  For a read request coming in to the failed disk, the system must rebuild that data in memory.  We know that process – every remaining disk must be read in order to regenerate the data (see the sketch after this list).  For a write request coming in to the failed disk, the system must rebuild the existing data in memory, recalculate parity against the new data, and write the new parity value to disk.  The one exception on the write side is if, in a given stripe, we have lost the parity strip instead of a data strip.  In that case we actually get a small performance bump, because the data is just written to whatever data strip it is destined for with no parity recalculation at all.  However, this teensy gain is HEAVILY outweighed by the I/O-crushing penalty going on all around it.
  • Rebuild mode – Rebuild is also ugly.  The replacement disk must be rebuilt, which means every bit of data on every remaining drive must be read in order to calculate what the replacement disk should look like.  And all the while, the group is still servicing incoming reads in degraded mode.  Depending on controller design, writes can typically be sent straight to the new disk – but we still have to update parity.
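
Here’s a minimal sketch of that degraded read, using toy two-byte strips in Python rather than real disk blocks – the point is just that every surviving strip has to be read and XORed together to stand in for the missing one:

```python
from functools import reduce

def xor_strips(strips):
    """XOR a list of equal-length byte strips together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips)

# A toy 4+1 stripe: four data strips plus their parity strip.
data = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55", b"\x01\x02"]
parity = xor_strips(data)

# Disk 2 fails.  A read destined for its strip has to touch everything left.
surviving = [data[0], data[1], data[3], parity]
rebuilt = xor_strips(surviving)
assert rebuilt == data[2]   # the lost strip is regenerated in memory
```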

Protection factor aside, the performance hit from degraded mode is why hot spares are tremendously important to RAID5. You want to spend as little time as possible in degraded mode.

Circling back to usable capacity, why would I want smaller RAID groups?  If I have 50 disks, why would I want to do ten 4+1’s instead of one 49+1?  Why waste ten times the space on parity?  The answer is two-fold.

First, related to the single drive failure issue, the 49+1 presents a much larger fault domain.  In plain English, a fault domain is a set of things that are tied to each other for functionality.  Think of it like links in a chain: if one link fails, the entire chain fails (well, a chain in an analogy like this one does).  With 49+1, I can lose at most one drive out of 50 at any time and keep running.  With ten 4+1’s, I can lose up to 10 drives as long as they come out of different RAID groups.  It is certainly possible that I lose two disks in one 4+1 group and that group is dead, but the likelihood of that happening within a given set of 5 disks is lower than within a set of 50 disks.  The trade-off here is that as we grow the RAID group, we gain usable capacity but increase our risk of a two drive failure causing data loss.

Second, related to the degraded and rebuild issues, the more drives I have, the more data I must read in order to reconstruct anything during a failure.  If I have a 4+1 and lose a disk, every read destined for that disk means reading the four survivors to regenerate the data.  But with a 49+1, if I lose a disk I now have to read forty-nine disks to generate that data!  As I add more disks to a RAID5 set, degraded and rebuild operations become more taxing on the storage array.

On to write penalty!  In the parity post I explained that any write to existing data causes the original data and parity to be read, some calculations to happen (which are so fast they aren’t relevant), and then the new data and new parity to be written to disk.  So the write penalty in this case is 4:1 – four back-end I/O operations for each write coming into the system.  Interestingly enough, this doesn’t scale with RAID group size.  Whether a 2+1 or a 200+1, the write penalty is always 4:1 for single parity schemes.
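
A sketch of where those four I/Os come from, using single-byte values instead of real strips – note that the group width never enters into it:

```python
# Read-modify-write for a small write to one strip of a single-parity group.
old_data   = 0b10110010   # I/O 1: read the old data strip
old_parity = 0b01011100   # I/O 2: read the old parity strip
new_data   = 0b11110000   # the incoming write

# XOR out the old data's contribution and XOR in the new data's.
new_parity = old_parity ^ old_data ^ new_data

# I/O 3: write new_data, I/O 4: write new_parity.  Four back-end I/Os per
# write, no matter how many disks are in the group.
```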

Full Stripe Writes

RAID1/0 has a 2:1 write penalty, and RAID5 has a 4:1 write penalty.  Does this mean that writes to RAID1/0 are always more efficient than RAID5?  Not necessarily.  There is a special case for writes to parity schemes called the Full Stripe Write (FSW).  FSWs typically happen with large block sequential writes (like backup operations).  In this case we are writing such a large amount of data that we overwrite an entire stripe.  E.g. in our 4+1 scenario, if the strip size was 64KB and we wrote 256KB of data starting at the first disk, we would end our write exactly at the end of the stripe.  Here we have no need to do a parity update, because every bit of data that the parity protects is being overwritten.  Because of this, we can simply calculate parity in memory (since we already have the entire stripe’s data there) and write the entire stripe at once.

The payback is enormous here, because we only have one extra write for every four writes coming into the system.  In the 4+1 that we described, this translates into a write penalty of 5:4.  This is actually a big improvement even over RAID1/0!
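
A rough comparison of the back-end write load, assuming (idealistically) that every write is either a plain read-modify-write or part of a full stripe write:

```python
def backend_write_ios(front_end_writes, data_disks, full_stripe=False):
    """Back-end I/Os generated by writes to a single-parity group (idealized)."""
    if full_stripe:
        # One parity write rides along with every data_disks data writes.
        return front_end_writes * (data_disks + 1) / data_disks
    return front_end_writes * 4   # the classic 4:1 read-modify-write

print(backend_write_ios(1000, 4))                    # 4000 (4:1)
print(backend_write_ios(1000, 4, full_stripe=True))  # 1250 (5:4)
print(1000 * 2)                                      # RAID1/0 for comparison: 2000
```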

FSWs are not something to hope for when choosing a RAID type.  They are very dependent on the application behavior, file system alignment, and I/O pattern.  Modern storage arrays enable this behavior more often because they hold data in protected cache before flushing to disk, but choosing RAID5 for something that is heavily write oriented and simply hoping that you will get the 5:4 write penalty would be very foolish.  However, if you do your homework you can usually figure out if it is happening or not.  As a simple example, if I was dumping large backups onto a storage array, I would almost always choose RAID5 or RAID6 because this generally will leverage FSWs.

RAID6

RAID6 is striping with dual parity protection.  Essentially most of what we know about RAID5 applies, except that in any given stripe there are two parity values instead of one.  This is what allows RAID6 to survive two drive failures.

There are a couple of catches with this second value.  First, the second parity value must actually be different from the first – if it were just a copy, it wouldn’t buy us anything for data recovery.  Second, the second parity value can’t use the first parity value in its calculation…otherwise the second parity depends on the first, and in a recovery scenario we run into a bit of a storage-array-and-the-egg problem.  Not what we want.

In the parity post I declared my undying love for XOR, and to prove to the rest of you doubters that it is just as amazing as I made it out to be – the 2nd parity value also uses XOR!  It is just too efficient to pass up.  But obviously we must XOR some different data values together.  RAID6’s second parity actually comes from diagonal stripes.

Offhand you might be imagining something like this:

[Image: a wrong way to do RAID6 – the second parity drawn as a diagonal across whole strips spanning multiple stripes]

As the helpful text indicates, not so much.  Why not, though?  We satisfied both of our criteria – the second parity value is different from the first, and it doesn’t include the first in its calculation.

From a protection standpoint this probably works, but we pay a couple of performance penalties.  First and foremost, we lose the ability to do FSWs.  In order to do a full stripe write with this scheme, I would essentially have to overwrite every single disk at one time.  Not gonna happen.  Second, in recovery scenarios my protection information is tied to more strips than in RAID5 – I have one set of horizontal strips for the first parity value and another set of diagonal strips for the second.

Instead, remember that we are working with an ordered set of 1’s and 0’s in every strip, so really the 2nd parity bit is calculated like:

[Image: the correct approach – the second parity calculated within a single stripe, XORing a different bit position from each strip]

It is a strange, strange thing, but essentially the second parity is calculated (or should be calculated) within the same stripe, using different bits from each strip.
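
Here’s a toy illustration of that idea – same stripe, different bit positions from each strip – in Python.  To be clear, this is not EMC’s actual construction (the real even-odd style schemes add extra structure so that any two lost strips are recoverable; see the whitepaper below).  It just shows that the second parity mixes different bits than the first and never touches the first parity value:

```python
from functools import reduce

# Four toy data strips, each eight bits wide.
strips = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1],
]
width = len(strips[0])

# First parity (the RAID5-style row parity): XOR the same bit position
# from every strip.
p = [reduce(lambda a, b: a ^ b, (s[j] for s in strips)) for j in range(width)]

# Second parity (toy diagonal): XOR a *different* bit position from each
# strip, so it is independent of the first and never uses p itself.
q = [reduce(lambda a, b: a ^ b, (s[(j + i) % width] for i, s in enumerate(strips)))
     for j in range(width)]
```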

For a more comprehensive and probably clearer look into the hows of RAID6 (including recovery methodology), EMC’s old whitepaper on it is still a great resource.  I really encourage you to check it out if you need more detail or explanation, or just want to read a different perspective: https://www.emc.com/collateral/hardware/white-papers/h2891-clariion-raid-6.pdf  Their diagrams are much more informative than mine, although they have very few kittens in them from what I’ve seen so far.

On to our other criteria – degraded and rebuild modes are pretty much the same as RAID5, except that we may have to read one additional parity disk during the operation.  In other words, degraded and rebuild modes are not pleasant with RAID6 either.  Make sure you have hot spares to get you out of both as fast as possible.

Usable capacity – the penalty is calculated just like RAID5, only with X+2 notation.  So e.g. a 6+2 RAID6 has a 2/8 penalty (two out of eight disks’ worth of capacity used for parity), or 25%.  Just like RAID5, this value depends on the size of the group itself, with a technical minimum of four drives.  I say technical because RAID6 schemes are usually implemented to protect a larger number of disks – with only two data and two parity disks, why not just do a 2+2 RAID1/0?  Ahh, variety.

Finally, write penalty.  Because every time I write data I have to update two parity values, RAID6 carries a 6:1 write penalty.  The update operation is the same as RAID5, except the second parity value must also be read, recalculated, and written.

RAID6 can utilize FSWs as discussed above, but when it can’t, write operations are taxed HEAVILY by the 6:1 write penalty.  RAID6 has its place, but if you are trying to support small block random writes, it is probably advisable to steer clear.  Again, there is no such thing as a read penalty, so from a read perspective it performs identically to all other RAID types given the same number of disks in the group.
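
To put the penalties side by side, here is a quick estimate with made-up front-end numbers (reads are assumed unaffected by RAID type, and cache effects and FSWs are ignored):

```python
reads, writes = 3000, 2000   # hypothetical small-block random front-end IOPS

for name, write_penalty in (("RAID1/0", 2), ("RAID5", 4), ("RAID6", 6)):
    backend = reads + writes * write_penalty
    print(f"{name}: {backend} back-end IOPS")
# RAID1/0: 7000, RAID5: 11000, RAID6: 15000
```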

Distributed vs Dedicated Parity

Briefly, I wanted to mention something about parity and RAID notation like 4+1.  We “think” of this as “4 data disks, one parity disk,” which makes sense from a capacity perspective.  Actually dedicating a whole disk to parity is called dedicated parity…and it’s not such a good idea.

Every write that comes into the system generates 4 back-end I/Os.  Two of those I/Os are slated for the strip the data lives on, and the other two hit the parity strip.  Were we to stack all of the parity strips up on one disk (as we would with a dedicated parity disk), what do you think that disk would look like under any serious write load?

You could roast marshmallows on the parity disk

The parity disk has a lot of potential to become a bottleneck.  Instead, RAID5 and RAID6 implementations use what is called distributed parity in order to provide better I/O balancing.

[Image: a RAID5 group with distributed parity – the parity strip rotates to a different disk in each stripe]

In this manner, the parity load for the RAID group is distributed evenly across the disks.  Now, does this guarantee even balance?  Nope.  If I hit the top stripe hard, the top parity strip on Disk1 is still going to cook.  But under normal write load with a small enough strip size, this provides much needed load balancing.
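
A sketch of one possible rotation for a 4+1 group (real arrays differ in the exact layout, but the idea is the same – the parity strip moves to a different disk with each stripe):

```python
disks = 5   # a 4+1 group

for stripe in range(6):
    parity_disk = stripe % disks   # rotate parity one disk per stripe
    layout = ["P" if d == parity_disk else "D" for d in range(disks)]
    print(f"stripe {stripe}: {'  '.join(layout)}")
# stripe 0: P  D  D  D  D
# stripe 1: D  P  D  D  D
# ...each disk ends up holding an equal share of the parity strips.
```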

Not all protection schemes use distributed parity – NetApp’s RAID-DP is a good example of this.  But in cases where parity is not distributed, there must be some other mechanism to alleviate the parity load…otherwise the parity disk is going to be a massive bottleneck.

Uncorrectable Bit Errors

Finally, I wanted to mention Uncorrectable Bit Errors (UBEs) and their impact on RAID5 vs RAID6.  If you check out the EMC whitepaper linked above, you’ll see a reference to uncorrectable errors.  You can also google this topic – here is a good paper on it.

An uncorrectable error is one that happens on a disk and renders the data in that particular sector unrecoverable.  The error rate is measured in errors per bit read.  Many consumer grade drives are rated at 1 error per 10^15 bits (~113TB) read, and enterprise grade drives at 1 per 10^16 bits (~1.1PB).  Generally the larger capacity drives (NL-SAS) are actually consumer grade from this standpoint.

During normal operations with RAID protection a UBE is OK because we have recovery information built into the RAID scheme.  But in a RAID5 rebuild scenario, a UBE is instant death for the RAID group.  Remember we have to be able to reconstruct that failed disk in its entirety, and in order to do that we have to read every bit of data off of every other disk in the group.

So consider that a 3TB capacity drive is going to exhibit a UBE roughly every ~113TB of data read, giving a full pass through the entire disk an approximately 2.5% chance of winning this particular lottery.  Then consider that your RAID5 group probably has at least four or five of these drives in it.
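
Back-of-the-envelope math, treating bit errors as independent (they aren’t exactly, but it gets the scale right):

```python
import math

ube_rate = 1e-15           # errors per bit read (the ~113TB class of drive)
drive_bits = 3e12 * 8      # one 3TB drive, in bits
surviving_disks = 4        # rebuilding one failure in a 4+1 group

bits_read = drive_bits * surviving_disks
p_hit = 1 - math.exp(-ube_rate * bits_read)   # Poisson approximation
print(f"~{p_hit:.0%} chance of at least one UBE during the rebuild")
# roughly 9% with these assumptions – and that is for a *small* RAID5 group
```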

I’ve seen RAID5 used for capacity drives before.  And there are mechanisms built into storage arrays that try to sweep for and detect errors before a drive fails.  And to date (knock on wood) I haven’t seen a RAID group die a horrible death during rebuild.  But it is always my emphatic recommendation to protect capacity drives with RAID6.  You will find this best practice repeated ad nauseam throughout the storage world.  It is nearly impossible to justify the additional risk of RAID5 against the cost of a few extra capacity disks, even if it pushes you into an extra disk shelf.  Fighting a battle today for a few more dollars on the purchase is going to be a lot less painful than explaining why a 50TB storage pool is invalid and everything in it must be restored from backup.  (And you’ve got backups, right?  And they work?)

The Summary Before the Summary

This was a tremendous amount of information and is probably not digestible in one sitting.  Maybe not even two.  My hope is that by reading this you will learn a bit about the operations behind the curtain, and that it will help you make an informed decision on when to use RAID5 and RAID6.  If this saves just one person from saying “we need to use RAID1/0 because it is the fast one,” I will be happy.

My next post will be a wrap-up with some comparisons between the RAID types, to bring a close to this sometimes bizarre topic.