RAID: Part 5 – RAID5 and RAID6

Now that the parity post is out of the way, we can move into RAID5 and RAID6 configurations.  The good news for anyone who actually plodded through the parity post is that we’ve essentially already covered RAID5!  RAID5 is striping with single parity protection, generated on each row of data, exactly like my example.  Because of that I’ll be writing this post assuming you’ve read the parity post (or at least understand the concepts).


Actually, from the parity post not only have we covered RAID5…we also covered most of our criteria for RAID type analysis.  Sneaky!

Before continuing on, let me make a quick point about RAID5 (note: this also applies to RAID6) group size.  In our example we did 4+1 RAID5. X+1  is standard notation for RAID5, meaning X data disks and 1 parity disk (…kind of – I’ll clarify later regarding distributed parity) but there is no reason it has to be 4+1.  There is a lower limit on single parity schemes, and that is three disks (since if you had two disks you would just do mirroring) which would be 2+1.  There is no upper bound on RAID5 group size, though I will discuss this nuance in the protection factor section.  I could theoretically have a 200+1 RAID5 set.  On an EMC VNX system, the upper bound of a RAID5 group is a system limitation of 16 disks, meaning we can go as high as 15+1.  The more standard sizes for storage pools are 4+1 and the newer 8+1.

That said, let’s talk about usable capacity.  RAID5 differs from RAID1/0 in that the usable capacity penalty is directly dependent on how many disks are in the group.  I’ve explained that in RAID5, for every stripe, exactly one strip must be dedicated to parity.  Scale out to the disk level, and this translates into one whole disk’s worth of parity in the group.  In the 4+1 case our capacity penalty is 20% (1 out of 5 disks is used for parity).  Here are the capacity penalties for the schemes I just listed:

  • 2+1 – 33% (this is the worst case scenario, and still better than the 50% of RAID 1/0)
  • 4+1 – 20%
  • 8+1 – 11%
  • 15+1 – 6.25%

So as we add more data disks into a RAID5 group our usable capacity penalty goes down, and even in the worst case (2+1) it is still better than RAID1/0.
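If you want to check a group size that isn’t in the list, the arithmetic is easy to script.  This is just a minimal sketch; the function name `capacity_penalty` is my own invention, and it assumes every disk in the group is the same size.

```python
def capacity_penalty(data_disks: int, parity_disks: int = 1) -> float:
    """Fraction of raw capacity consumed by parity in an X+P group."""
    total_disks = data_disks + parity_disks
    return parity_disks / total_disks


# Print the penalty for the RAID5 group sizes discussed above.
for data_disks in (2, 4, 8, 15):
    print(f"{data_disks}+1: {capacity_penalty(data_disks):.2%}")
```

The same function works for dual parity schemes by passing `parity_disks=2`.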

Protection factor?  After the parity post we know and understand why RAID5 can survive a single drive failure.  Let’s talk about degraded and rebuild.

  • Degraded mode – Degraded on RAID5 isn’t too pretty.  We have lost a single disk but are still running because of our parity bits.  For a read request coming in to the failed disk, the system must rebuild that data in memory.  We know that process – every remaining disk must be read in order to generate that data.  For a write request coming into the failed disk, the system must rebuild the existing data in memory, read and recalculate parity, and write the new parity value to disk.  The one exception to the write condition is if in a given stripe we have lost the parity strip instead of a data strip.  Then we get a performance increase because the data is just written to whatever data strip it is destined for with no regard to parity recalculation.  However this teensy performance increase is HEAVILY outweighed by the I/O crushing penalty going on all around it.
  • Rebuild mode – Rebuild is also ugly.  The replacement disk must be rebuilt, which means that every bit of data on every remaining drive must be read in order to calculate what the replacement disk looks like.  And all the while, for incoming reads it is still operating in degraded mode.  Depending on controller design, writes can typically be sent to the new disk – but we still have to update parity.

Protection factor aside, the performance hit from degraded mode is why hot spares are tremendously important to RAID5. You want to spend as little time as possible in degraded mode.

Circling back to usable capacity, why do I want smaller RAID groups?  If I have 50 disks, why would I want to do ten 4+1’s instead of one 49+1?  Why waste 10 times the space on parity?  The answer is two-fold.

First, related to the single drive failure issue, the 49+1 presents a much larger fault domain.  In English, fault domain means a set of things that are tied to each other for functionality.  Think of it like links in a chain: if one link fails, the entire chain fails (well, a chain in an analogy like this one fails).  With 49+1, I can lose at most one drive out of 50 at any time and keep running.  With ten 4+1’s, I can lose up to 10 drives as long as they come out of different RAID groups.  It is certainly possible that I lose two disks in one 4+1 group and that group is dead, but the likelihood of it happening within a given set of 5 disks is lower than within a set of 50 disks.  The trade-off here is that as we add more disks to our RAID group size, we gain usable capacity but increase our risk of a two drive failure causing data loss.
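To put a rough number on that, here is a small sketch.  It assumes one disk has already failed and that a second failure is equally likely to hit any of the remaining disks before the rebuild finishes (a simplification; real failures are not that tidy).  The function name is hypothetical.

```python
def fatal_second_failure_fraction(group_size: int, total_disks: int) -> float:
    """Given one failed disk, the fraction of remaining disks whose failure
    would land in the same RAID5 group and therefore cause data loss."""
    same_group_survivors = group_size - 1
    remaining_disks = total_disks - 1
    return same_group_survivors / remaining_disks


one_big_group = fatal_second_failure_fraction(group_size=50, total_disks=50)
ten_small_groups = fatal_second_failure_fraction(group_size=5, total_disks=50)

print(f"one 49+1:  {one_big_group:.1%} of second failures are fatal")
print(f"ten 4+1's: {ten_small_groups:.1%} of second failures are fatal")
```

With one 49+1, every possible second failure is fatal.  With ten 4+1’s, only the 4 survivors in the already-degraded group (out of 49 remaining disks) can hurt you.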

Second, related to the Degraded and Rebuild issues, the more drives I have, the more pieces of data I must read in order to construct data during a failure.  If I have 4+1 and lose a disk, for every read that comes into the system I have to read four disks to generate that data.  But with a 49+1 if I lose a disk, now I have to read forty-nine disks in order to generate that data!  As I add more disks to a RAID5 set, Degraded and Rebuild operations become more taxing on the storage array.

On to write penalty!  In the parity post I explained that any write to existing data causes the original data and parity to be read, some calculations (which happen so fast they aren’t relevant) and then the new data and new parity must be written to disk.  So the write penalty in this case is 4:1.  Four I/O operations for each write coming into the system.  Interestingly enough, this doesn’t scale with RAID group size.  Whether a 2+1 or  200+1, the write penalty is always 4:1 for single parity schemes.
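Here is a minimal sketch that simply lists the back end operations for one small write.  The function name and the `parity_strips` parameter are my own, with the default of 1 meaning single parity.  Notice that the number of data disks is not an input at all, which is the point.

```python
def small_write_operations(parity_strips: int = 1) -> list:
    """Back end I/Os needed to overwrite one strip of existing data."""
    reads = [("read", "old data")]
    writes = [("write", "new data")]
    for i in range(parity_strips):
        reads.append(("read", f"old parity {i + 1}"))
        writes.append(("write", f"new parity {i + 1}"))
    # The parity calculation happens in memory between the reads and the writes.
    return reads + writes


operations = small_write_operations()
for kind, target in operations:
    print(kind, target)
print(f"write penalty: {len(operations)}:1")
```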

Full Stripe Writes

RAID1/0 has a 2:1 write penalty, and RAID5 has a 4:1 write penalty.  Does this mean that writes to RAID1/0 are always more efficient than RAID5?  Not necessarily.  There is a special case for writes to parity called Full Stripe Writes (FSWs), which typically happens with large block sequential writes (like backup operations).  In this case we are writing such a large amount of data that we actually overwrite one entire stripe.  E.g. in our 4+1 scenario, if the strip size was 64KB and we wrote 256KB of data starting at the first disk, we would end our write at the end of the stripe.  Here we have no need to read the old data or the old parity, because every bit of data that we are protecting with the parity is getting overwritten.  Because of this, we can actually just calculate parity in memory (since we already have the entire stripe’s data in memory) and write the entire stripe at once.

The payback is enormous here, because we only have one extra write for every four writes coming into the system.  In the 4+1 that we described, this translates into a write penalty of 5:4.  This is actually a big improvement even over RAID1/0!
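A quick sketch of both halves of this: whether a write lines up as a full stripe, and what the resulting penalty is.  The function names are hypothetical, and the alignment check assumes the write offset is measured from the start of the RAID group’s data space, which glosses over file system and partition alignment.

```python
def is_full_stripe_write(offset: int, length: int, strip_size: int, data_disks: int) -> bool:
    """True if the write starts on a stripe boundary and covers whole stripes only."""
    stripe_bytes = strip_size * data_disks
    return length > 0 and offset % stripe_bytes == 0 and length % stripe_bytes == 0


def full_stripe_write_penalty(data_disks: int, parity_disks: int = 1) -> float:
    """Back end writes per front end write when an entire stripe is overwritten."""
    return (data_disks + parity_disks) / data_disks


KB = 1024
aligned = is_full_stripe_write(offset=0, length=256 * KB, strip_size=64 * KB, data_disks=4)
misaligned = is_full_stripe_write(offset=64 * KB, length=256 * KB, strip_size=64 * KB, data_disks=4)

print(aligned, misaligned)
print(full_stripe_write_penalty(data_disks=4))  # 5 writes for every 4 strips of data
```

The second write is the same size but starts one strip in, so it straddles two stripes and falls back to the regular read-modify-write path.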

FSWs are not something to hope for when choosing a RAID type.  They are very dependent on the application behavior, file system alignment, and I/O pattern.  Modern storage arrays enable this behavior more often because they hold data in protected cache before flushing to disk, but choosing RAID5 for something that is heavily write oriented and simply hoping that you will get the 5:4 write penalty would be very foolish.  However, if you do your homework you can usually figure out if it is happening or not.  As a simple example, if I was dumping large backups onto a storage array, I would almost always choose RAID5 or RAID6 because this generally will leverage FSWs.


RAID6 is striping with dual parity protection.  Essentially most of what we know about RAID5 applies, except that in any given stripe instead of one parity value there are two.  What this allows us to do is to recover in the event that we lose two drives.  RAID6 can survive two drive failures.

In order to do this, a catch with this second value is that the second parity bit must actually be different from the first.  If the second parity value was just a copy of the first, that doesn’t buy us anything for data recovery.  Another catch is that the 2nd parity value can’t use the first parity value for the calculation…otherwise the 2nd parity value is dependent on the first and in a recovery scenario we run into a bit of a storage array and the egg problem.  Not what we want.

In the parity post I declared my undying love for XOR, and to prove to the rest of you doubters that it is just as amazing as I made it out to be – the 2nd parity value also uses XOR!  It is just too efficient to pass up.  But obviously we must XOR some different data values together.  RAID6’s second parity actually comes from diagonal stripes.

Offhand you might be imagining something like this:


As the helpful text indicates, not so much.  Why not, though?  We satisfied both of our criteria – the 2nd parity bit is different than the first, and it doesn’t include it either.

From a protection standpoint, this probably works but we pay a couple of performance penalties.  First and foremost, we lose the ability to do FSWs.  In order to do a full stripe write with this scheme, I have to essentially overwrite every single disk at one time.  Not gonna happen.  Second, in recovery scenarios my protection information is tied to more strips than RAID5.  I have a set of horizontal strips for one parity value and then another set of diagonal strips for the 2nd parity strip.

Instead, remember that we are working with an ordered set of 1’s and 0’s in every strip, so really the 2nd parity bit is calculated like:


It is a strange, strange thing, but essentially the parity is calculated (or should be calculated) within the same stripe using different bits in each strip.

For a more comprehensive and probably more clear look into the hows of RAID6 (including recovery methodology), EMC’s old whitepaper on it is still a great resource.  I really encourage you to check it out if you need some more detail or explanation, or just want to read a different perspective on it.  Their diagrams are much more informative than mine, although they have very few kittens in them from what I’ve seen so far.

On to our other criteria – the degraded and rebuild modes are pretty much the same as RAID5 except that we may have to read one additional parity disk during the operation.  In other words, degraded and rebuild modes are not pleasant with RAID6.  Make sure you have hot spares to get you out of both as fast as possible.

Usable capacity – the penalty is calculated similarly to RAID5, just with X+2 notation. So e.g. a 6+2 RAID6 would have a 2/8 (two out of eight disks used for parity) penalty, or 25%.  Just like RAID5, this value depends on the size of the group itself, with a technical minimum of four drives.  I say technical because RAID6 schemes are usually implemented to protect a large number of disks – instead of two data and two parity disks, why not just do a 2+2 RAID1/0?  Ahh, variety.

Finally, write penalty.  Because every time I write data I have to update two parity values, there is a 6:1 write penalty with RAID6.  The update operation is once again the same as RAID5 except the second parity value must be read, new parity calculated, and new parity written.

RAID6 can utilize FSWs as discussed above, but if it doesn’t, write operations are taxed HEAVILY with the 6:1 write penalty.  RAID6 has its place, but if you are trying to support small block random writes, it is probably advisable to steer clear.  Again there is no such thing as read penalty, so from a read perspective it performs identically to all other RAID types given the same number of disks in the group.
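To see what those penalties do to the back end, here is a rough estimate.  The workload numbers (700 reads and 300 writes per second) are made up for illustration, and the sketch assumes every write is a small random write with no help from cache or FSWs.

```python
# Small-write penalties discussed in this series.
WRITE_PENALTY = {"RAID1/0": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(read_iops: int, write_iops: int, raid_type: str) -> int:
    """Back end disk I/Os per second for a given front end workload."""
    # Reads pass straight through; each write is multiplied by the penalty.
    return read_iops + write_iops * WRITE_PENALTY[raid_type]


for raid_type in WRITE_PENALTY:
    total = backend_iops(read_iops=700, write_iops=300, raid_type=raid_type)
    print(f"{raid_type}: {total} back end IOPS")
```

Same front end workload, and RAID6 asks the disks for nearly twice what RAID1/0 does.  The gap shrinks as the workload becomes more read heavy.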

Distributed vs Dedicated Parity

Briefly I wanted to mention something about parity and the RAID notation like 4+1.  We “think” of this as “4 data disks, one parity disk” which makes sense from a capacity perspective.  In practice, this is called dedicated parity…and it’s not such a good idea.

Every write that comes in the system generates 4 back end I/Os.  Two of those I/Os are slated for the strip that the data is on, and the other two I/Os hit the parity strip.  Were we to stack all the parity strips up on one disk (as we would with a dedicated parity disk), what do you think that would look like under any serious write load?

You could roast marshmallows on the parity disk

The parity disk has a lot of potential to become a bottleneck.  Instead, RAID5 and 6 implementations use what is called distributed parity in order to provide better I/O balancing.


In this manner, the parity load for the RAID group is distributed evenly across the disks.  Now, does this guarantee even balance?  Nope.  If I hit the top stripe hard, the top parity strip on Disk1 is still going to cook.  But under normal write load with small enough strip size, this provides a much needed load balance.
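One simple way to picture distributed parity is a rotation where the parity strip moves over one disk on each successive stripe.  This is only a sketch of the idea; real controllers use a variety of layouts, and the function names here are hypothetical.

```python
def parity_disk_for_stripe(stripe: int, total_disks: int) -> int:
    """Zero-based index of the disk holding parity for a given stripe."""
    return stripe % total_disks


def layout(total_disks: int, stripes: int) -> list:
    """Rows of 'D' (data strip) and 'P' (parity strip), one row per stripe."""
    rows = []
    for stripe in range(stripes):
        parity_disk = parity_disk_for_stripe(stripe, total_disks)
        rows.append(["P" if disk == parity_disk else "D" for disk in range(total_disks)])
    return rows


rows = layout(total_disks=5, stripes=5)
for row in rows:
    print(" ".join(row))
```

Over five stripes in a 4+1 group, each disk ends up holding exactly one parity strip, so parity writes are spread across all five disks.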

Not all protection schemes use distributed parity – NetApp’s RAID-DP is a good example of this.  But in cases where parity is not distributed, there must be some other mechanism to alleviate the parity load…otherwise the parity disk is going to be a massive bottleneck.

Uncorrectable Bit Errors

Finally, I wanted to mention Uncorrectable Bit Errors and their impact on RAID5 vs RAID6.  If you check out the whitepaper from EMC above, you’ll see a reference to uncorrectable errors.  You can also google this topic – here is a good paper on it.

An uncorrectable error is one that happens on a disk and renders the data for that particular sector unrecoverable.  The error rate is measured in errors per bit read.  Many consumer grade drives are 1 error per 10^15 bits (~113TB) read, and enterprise grade drives are 1/10^16 (~1.1PB). Generally the larger capacity drives (NL-SAS) are actually consumer grade from this standpoint.

During normal operations with RAID protection a UBE is OK because we have recovery information built into the RAID scheme.  But in a RAID5 rebuild scenario, a UBE is instant death for the RAID group.  Remember we have to be able to reconstruct that failed disk in its entirety, and in order to do that we have to read every bit of data off of every other disk in the group.

So consider that 3TB capacity drives are going to exhibit an UBE every ~113TB of data read, giving a run through the entire disk an approximately 2.5% chance of winning the lottery. Then consider that your RAID5 group is probably going to have at least four or five of these guys in it.
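Here is that math as a sketch.  It assumes errors are independent, that 1TB is 10^12 bytes, and that the quoted rate of 1 error per 10^15 bits holds exactly, none of which is guaranteed in real life.  The function name is my own.

```python
import math


def ube_probability(bytes_read: float, bits_per_error: float = 1e15) -> float:
    """Chance of at least one uncorrectable bit error while reading bytes_read."""
    bits_read = bytes_read * 8
    # 1 - (1 - 1/bits_per_error) ** bits_read, written to stay accurate
    # when the per-bit error rate is tiny.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


TB = 10 ** 12
one_disk = ube_probability(3 * TB)
rebuild_4_plus_1 = ube_probability(4 * 3 * TB)  # four surviving disks read in full

print(f"one 3TB disk:   {one_disk:.1%}")
print(f"4+1 rebuild:    {rebuild_4_plus_1:.1%}")
```

Reading all four survivors in a 4+1 group of these drives lands somewhere around a 9% chance of hitting a UBE during the rebuild, under these assumptions.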

I’ve seen RAID5 used for capacity drives before. And there are mechanisms built into storage arrays to try to sweep and detect errors before a drive fails.  And to date (knock on wood) I haven’t seen a RAID group die a horrible death during rebuild.  But it is always my emphatic recommendation to protect capacity drives with RAID6.  You will find this best practice repeated ad nauseam throughout the storage world.  It is nearly impossible to justify the additional risk of RAID5 against the cost of a few extra capacity disks, even if it pushes you into an extra disk shelf.  Fighting a battle today for a few more dollars on the purchase is going to be a lot less painful than explaining why a 50TB storage pool is invalid and everything in it must be rolled from backup. (and you’ve got backups right?  and they work?)

The Summary Before the Summary

This was a tremendous amount of information and is probably not digestible in one sitting.  Maybe not even two.  My hope is really that by reading this you will learn just a bit about the operations behind the curtain that will help you make an informed decision on when to use RAID5 and RAID6.  If this saves just one person from saying “we need to use RAID1/0 because it is the fast one,” I will be happy.

My next post will be a wrap up of RAID and some comparisons between the types to bring a close to this sometimes bizarre topic of RAID.

RAID: Part 4 – Parity, Schmarity

We’re in the home stretch now.  We’ve covered mirroring and striping, and three RAID types already.

Now we are moving into RAID5 and RAID6.  They leverage striping and a concept called parity.

I’m still new to this blogging thing and once again I bit off more than I could chew.  I wanted to include RAID5 and 6 within this post but it again got really lengthy.  Rather than put an abridged version of them in here, I will cover them in the next post.

For a long time I didn’t really understand what parity was, I just knew it was “out there” and let us recover from disk failures.  But the more I looked into it and how it worked, the more it really amazed me.  It might not grab you the way it grabs me…and that’s OK, maybe you just want to move on to RAID5/6 directly.  But if you are really interested in how it works, forge on.

What is Parity?

Parity (pronounced ‘pear-ih-tee’) is a fancy (pronounced ‘faincy’) kind of recovery information.  In the mirroring post we examined data recovery from a copy perspective.  This is a pretty straight forward concept – if I make two copies of data and I lose one copy, I can recover from the remaining copy.

The downside of this lies in the copy itself.  It requires double the amount of space (which we now refer to as a 50% usable capacity penalty) to protect data with an identical copy.

Parity is a more efficient form of recovery information.  Let’s walk through a simple example, but one that I hope will really illustrate the mechanism, benefits, and problems of parity.  Say that I’m writing numbers to disk, the numbers 18, 24, 9, and last but certainly not least, 42.


As previously discussed, a mirroring strategy would require 4 additional disks in order to mirror the existing data – very inefficient from a capacity perspective.

Instead, I’m going to perform a parity calculation – or a calculation that results in some single value I can use to recover from.  In this case I’m going to use simple addition to create it.

18 + 24 + 9 + 42 = 93

So my parity value is 93 and I can use this for recovery (I’ll explain how in just a moment).

Next question – where can I store this value?  Well we probably shouldn’t use any of the existing disks, because they contain the information we are protecting.  This is a pretty common strategy for recovery. If I’m protecting funny kitten videos on my hard drive, I don’t want to back them up to the same disk because a disk failure takes out the original hilarious videos and the adorable backups. Instead I want to store them on a different physical medium to protect from physical failure.


Similarly, in my parity scheme if a disk were to fail that contained data and the parity value, I would be out of luck.

To get around this, I’ll add a fifth disk and write this value:


Now the real question: how do I use it for recovery?  Pretty simply in this case.  Any disk that is lost can be rebuilt by utilizing the parity information, along with the remaining data values.  If Disk3 dies, I can recover the data on it by subtracting the remaining data values from the parity value:

93 - 18 - 24 - 42 = 9

Success – data recovery after a disk failure…and it was accomplished without adding a complete data copy!  In this case 1/5 of the disk space is lost to parity, translating to a 20% usable capacity penalty.  That is a serious improvement from mirroring.
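The same addition-based scheme in a few lines, as a sketch.  The dictionary and the `recover` function are just my way of modeling the four disks from the example.

```python
# The four data disks from the example.
disks = {"Disk1": 18, "Disk2": 24, "Disk3": 9, "Disk4": 42}

# Parity is the sum of every data value, stored on a fifth disk.
parity = sum(disks.values())


def recover(failed_disk: str, disks: dict, parity: int) -> int:
    """Rebuild one lost value from the parity and the surviving values."""
    survivors = [value for name, value in disks.items() if name != failed_disk]
    return parity - sum(survivors)


recovered = recover("Disk3", disks, parity)
print(parity, recovered)
```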

What happens if there are two disks that fail? As with most things, it depends.  Just like RAID1, if a second disk fails after the system has fully recovered from the first failure, everything is fine.

But if there is a simultaneous failure of two disks?  This presents a recovery problem because there are two unknowns.  If Disk1 and Disk2 are lost simultaneously, my equation looks like:

93-42-9 = 42 = ?+?

Or in English, what two values add up to 42?  While it is true that 18+24=42, so does 20+22, and neither 20 nor 22 is my data.  There are a lot of value pairs that meet this criterion…in this case more of them aren’t my data than are.  And guessing with data recovery is, in technical terms, a terribad idea.  So we know that this parity scheme can survive a single disk failure, but not two simultaneous failures.

Another important question – what happens if we overwrite data?  For instance, if Disk2’s value of 24 gets overwritten with a value of 15, how do we adjust?  It would be a real bummer if the system had to read all of the data in the stripe to calculate parity again for just one affected strip.

There is some re-reading of data, but it isn’t nearly that bad.  We remove the old data value (24) from the parity value (93), and then add the new one (15) in.  Then we can replace both the data and the parity on disk.

The process looks like:

  1. Get the old parity value
  2. Get the old data value
  3. Subtract the old data value from the old parity value, creating an intermediate parity value
  4. Add the new data value to the intermediate parity value, creating the new parity value
  5. Update the parity value with the new parity value
  6. Update the data value with the new data value

Because we are working with disks, we can replace the “Get” phrasing with read and “Update” with write.  Looking back at this list, we see that there are two gets (reads) and two updates (writes).
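Running the overwrite example through those steps looks like this.  It is a sketch of the addition-based scheme only, and the function name is hypothetical.

```python
def update_parity(old_parity: int, old_data: int, new_data: int) -> int:
    """Adjust additive parity for one overwritten value without re-reading the stripe."""
    intermediate_parity = old_parity - old_data  # step 3: remove the old value
    return intermediate_parity + new_data        # step 4: add the new value


# Disk2 changes from 24 to 15.
new_parity = update_parity(old_parity=93, old_data=24, new_data=15)

# Sanity check against recalculating from every data value.
full_recalculation = 18 + 15 + 9 + 42
print(new_parity, full_recalculation)
```

Both approaches land on the same parity value, but the update path only touched two values instead of the whole stripe.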

The Magical XOR Operation

I know, I know – my example was amazing.  “Parity calculations should use that mechanism!,” I can hear you saying.  “Give him the copyright and millions in royalties,” you are no doubt proclaiming to everyone around you.

Unfortunately my scheme has a serious problem, and that is that my parity calculation is accumulative.  The larger the numbers get that I am protecting, the larger my parity value gets. Remember at the end of the day we aren’t working with numbers. We are working with data bits (1’s and 0’s) on disk, and we are working with a fixed strip size.  Were we to write 4 x 64KB of data with accumulative parity, I would need 256KB of parity to protect it.  Not ideal – this is essentially the same as mirroring’s usable capacity penalty!

Instead, in the real world parity is supported entirely by the bitwise “exclusive or,” or XOR, operation.  Visually the operator itself looks kind of like a crosshair, or a plus inside a circle.  XOR is a very unique operator in that it essentially allows you to add and subtract from a value (similar to my scheme) without increasing the total amount of information (the bit count).

Another cool thing about the bitwise XOR is it functions both as addition and subtraction.  To “add” two values, you XOR them together.  Then to remove one value from it, you XOR that value again. So instead of A + B – B = A, we have:

A XOR B XOR B = A

XOR’s principle is very simple – if two values are different, the output is TRUE (1); otherwise the output is FALSE (0).  In other words:

  1. Take any pair of 1’s and/or 0’s and compare
  2. If they are the same, output 0
  3. Else output 1

That’s all, folks.  The 4 possible input combinations and their outputs are:

  • 0 XOR 0 = 0
  • 0 XOR 1 = 1
  • 1 XOR 0 = 1
  • 1 XOR 1 = 0
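Python’s `^` operator is bitwise XOR, so the table is easy to reproduce, along with the “add it in, then XOR again to take it back out” property.  The values for `a` and `b` below are arbitrary picks of mine.

```python
# Reproduce the four-row truth table.
truth_table = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
for a, b, result in truth_table:
    print(f"{a} XOR {b} = {result}")

# XOR-ing the same value in twice cancels it out.
a, b = 0b1100, 0b1010
combined = a ^ b
restored = combined ^ b
print(f"{combined:04b} {restored:04b}")
```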

Not too crazy looking is it?  Let’s put it to the test and prove that this works for recovery.  Similar to before, I have 4 data disks but this time with just ones and zeros on them.  The parity calculation goes like so:


Every data bit gets XOR’d together and the result is the parity bit – in this case it is 1.

Now say Disk2 with 0 on it experiences a failure and the system needs to recover. Recovery would simply be the result of XORing all the remaining data bits and the parity bit.

Recovery (from left to right): 1 XOR 1 = 0, XOR 1 = 1, XOR 1 = 0

Success!  We have recovered the 0 bit.  Unfortunately XOR’s magic only extends so far, and in the event of a simultaneous two disk failure we are still up the creek without a paddle.  One parity value can only protect you against the loss of one disk because it can only recover one unknown value.
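Here is the same recovery as a sketch.  The bit values 1, 0, 1, 1 are my assumption for the four disks; they are chosen to match the text (Disk2 holds 0 and the parity bit is 1).

```python
from functools import reduce
from operator import xor

# Assumed bits on Disk1..Disk4; Disk2 (index 1) holds the 0.
data_bits = [1, 0, 1, 1]

# Parity is every data bit XOR'd together.
parity_bit = reduce(xor, data_bits)

# Disk2 fails: XOR the surviving data bits with the parity bit.
failed_index = 1
survivors = [bit for i, bit in enumerate(data_bits) if i != failed_index]
recovered_bit = reduce(xor, survivors + [parity_bit])

print(parity_bit, recovered_bit)
```

Try changing `failed_index`; the recovered bit matches the lost one every time, as long as only one disk is missing.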

How about updating the parity bit in the event that we overwrite some data? Again, this works as outlined above:

  1. Read the old parity value
  2. Read the old data value
  3. XOR the old data value with the old parity value, creating an intermediate parity value
  4. XOR the new data value with the intermediate parity value, creating the new parity value
  5. Write the new parity value over the old parity value
  6. Write the new data value over the old data value

Same as before, there are two reads and two writes.
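As a sketch, the XOR version of the update collapses into a single expression.  The bits are assumed values of mine (1, 0, 1, 1 across four disks) and the function name is hypothetical.

```python
from functools import reduce
from operator import xor


def update_parity_xor(old_parity: int, old_data: int, new_data: int) -> int:
    """New parity after overwriting one strip, using only the old data and old parity."""
    intermediate_parity = old_parity ^ old_data  # XOR the old value back out
    return intermediate_parity ^ new_data        # XOR the new value in


data_bits = [1, 0, 1, 1]          # assumed contents of Disk1..Disk4
old_parity = reduce(xor, data_bits)

# Overwrite Disk2's 0 with a 1.
new_parity = update_parity_xor(old_parity, old_data=0, new_data=1)
data_bits[1] = 1

# Compare with recalculating parity from the whole stripe.
full_recalculation = reduce(xor, data_bits)
print(new_parity, full_recalculation)
```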

This is obviously a very simple example and was in no way meant to be a mathematical proof of parity recovery or XOR, but it works on any scale you choose.  Test it out!

Quick note – in the real world, strip size is not a single bit like in this example. With a 64KB strip size, that comes out to 524288 bits in a single strip.  524288 ones and zeros.  XOR functions quite simply as you add bits on, since it just compares each bit in place.  For example, 1100 XOR 1010 is 0110.  The first digit of the result is the first digit of each input XOR’d together.  The second digit is the second digits of each input XOR’d together.  And so on.  There are more detailed XOR manifestos out there as well as XOR calculators online…feel free to consult if you are interested and my explanation left you wanting.

Bitwise XOR is the mechanism for generating a parity bit without increasing the total bit count.  Using this, RAID controllers can generate parity that is identical in size to any number of data strips being protected.  No matter the strip size, and no matter the stripe width, this mechanism will always result in an identically sized parity.
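To see this at a realistic strip size, here is a sketch that XORs four random 64KB strips together and then rebuilds one of them.  The helper `xor_strips` is my own, and converting each strip to one big integer is just a convenient way to XOR every bit in place; a real controller would not do it this way.

```python
import os


def xor_strips(*strips: bytes) -> bytes:
    """XOR equally sized strips together, bit position by bit position."""
    length = len(strips[0])
    if any(len(strip) != length for strip in strips):
        raise ValueError("all strips must be the same size")
    result = 0
    for strip in strips:
        result ^= int.from_bytes(strip, "big")
    return result.to_bytes(length, "big")


STRIP_SIZE = 64 * 1024
data_strips = [os.urandom(STRIP_SIZE) for _ in range(4)]

parity_strip = xor_strips(*data_strips)

# Lose the third strip and rebuild it from the other three plus parity.
survivors = data_strips[:2] + data_strips[3:]
rebuilt_strip = xor_strips(*survivors, parity_strip)

print(len(parity_strip), rebuilt_strip == data_strips[2])
```

Four strips in, one strip of parity out, and the parity is exactly 64KB no matter how many data strips feed into it.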

So what?

So what, indeed.  That was a lot to take in, I know…parity is certainly more complicated than mirroring.  Is it absolutely necessary to understand how parity works at this level?  No, not really.  But thus far I’ve never had a problem arise because I understood how something worked too deeply. I have encountered plenty of issues because I haven’t understood how something worked, or made assumptions about what something was doing behind the scenes.

When we get into some aspects of RAID5 and RAID6, understanding what parity is supposed to do will help clarify what those RAID types are useful for.  And if you don’t agree, feel free to wipe this from your memory banks and replace it with something more useful.