RAID: Part 6 – Wrap-Up

Finally the end – what a long, wordy trip it has been.  If you waded through all 5 posts, awesome!

As a final post, I wanted to attempt to bring all of the high points together and draw some contrasts between the RAID types I’ve discussed.  My goal with this post is less about the technical minutiae and more about providing some strong direction to equip readers to make informed decisions.

Does Any of This Matter?

I always spend some time asking myself this question as I dive further and further down the rabbit hole on topics like this.  It is certainly possible to interact with storage and not understand details about RAID.  However, I am a firm believer that you should understand it.  RAID is the foundation on which everything is built.  It is used in almost every storage platform out there.  It dictates behavior.  Making a smart choice here can save you money or waste it.  It can improve storage performance or cripple it.

I also like the idea that understanding the building blocks can later empower you to understand even more concepts.  For instance, if you’ve read through this you understand about mirroring, striping, and parity.  Pop quiz: what would a RAID5/0 look like?

[Diagram: a RAID5/0 layout – two RAID5 sets striped together with RAID0 at the top level]

Pretty neat that even without me describing it in detail, you can understand a lot about how this RAID type would function.  You’d know the failure capabilities and the write penalties of the individual RAID5 members.  And you’d know that the configuration couldn’t survive the failure of either RAID5 set, because of the top-level striping configuration.  And let’s say that I told you the strip size of the RAID5 group was 64KB, and that the strip size of the RAID0 config was 256MB.  Believe it or not, this is a pretty accurate description of a 10 disk VNX2 storage pool from a single-tier RAID5 perspective.

Again to me this is part of the value – when fancy new things come out, the fundamental building blocks are often the same.  If you understand the functionality of the building block, then you can extrapolate functionality of many things.  And if I give you a new storage widget to look at, you’ll instantly understand certain things about it based on the underlying RAID configuration.  It puts you in a much better position than just memorizing that RAID5 is “parity.”

Okay, I’m off my soapbox!

Workload – Read

  • RAID1/0 – Great
  • RAID5 – Great
  • RAID6 – Great

I’ve probably hammered this home by now, but when we are looking at largely read workloads (or just the read portion of any workload) the RAID type is mostly irrelevant from a performance perspective in non-degraded mode.  But as with any blanket statement, there are caveats.  Here are some things to keep in mind.

  • Your read performance will depend almost entirely on the underlying disk (ignoring sequential reads and prefetching).  I’m not talking about the obvious flash vs NLSAS; I’m talking about RAID group sizing.  As a general statement I can say that, spindle for spindle, RAID1/0 performs identically to RAID5 for pure read workloads – but an 8 disk RAID1/0 is going to outperform a 4+1 RAID5 because it simply has more spindles.
  • Ask the question and do tests to confirm: does your storage platform round-robin reads between mirror pairs in RAID1/0?  If not (and not all controllers do), your RAID1/0 read performance is going to be constrained to half of the spindles.  Continuing the previous bullet point, our 8 disk RAID1/0 would then be outperformed by a 4+1 RAID5 in reads, because only 4 of the 8 spindles are actually servicing read requests versus the RAID5’s 5.

Workload – Write

  • RAID1/0 – Great (write penalty of 2)
  • RAID5 – Okay (write penalty of 4)
  • RAID6 – Bad (write penalty of 6)

Writes are where the RAID types start to diverge pretty dramatically due to the vastly different write penalties between them.  Yet once again, people sometimes draw the wrong conclusion from the general idea that RAID1/0 is more efficient at writes than RAID6.

  • The underlying disk structure is still dramatically important.  A lot of people seem to focus on “workload isolation” – e.g. with a database, putting the data files on RAID5 and the transaction logs on RAID1/0.  This is a great idea from a design perspective when starting with a blank slate.  However, what if the RAID5 pool I’m working with is 200 disks and I only have 4 disks for RAID1/0?  In that case I’m almost a lock to have better success dropping the logs into the RAID5 pool, because there are WAY more spindles to support the I/O.  There are a lot of variables here around the workload, but the point I’m trying to make is that you should look at all the parts as a whole when making these decisions.
  • If your write workload is large block sequential, take a look at RAID5 or RAID6 over RAID1/0 – you will typically see much more efficient I/O in these cases.  However, make sure you do proper analysis and don’t end up with heavy small block random writes on RAID6.

Going back and re-reading some of my previous posts, I feel like I may have given the impression that I don’t like RAID1/0, or that I don’t see value in it.  That is certainly not the case, so I want to give an example of when you need to use RAID1/0 without question: when we see a “lot” of small block random writes and don’t need excessive amounts of capacity.  What is a “lot”?  Good question.  Typically the breaking point is around a 30-40% write ratio.

Given that a SAS drive should only be allowed to support around 180 IOPS, let’s crunch some numbers for an imaginary 10,000 front-end IOPS workload.  How many spindles do we need to support the workload at specific read/write ratios?  (I will do another blog post on the specifics of these calculations, but there is a sketch of the math after the table.)

Read/Write Ratio    RAID1/0 disk count    RAID5 disk count    RAID6 disk count
90%/10%                     62                    73                  84
75%/25%                     70                    98                 125
60%/40%                     78                   123                 167

So, at lighter write percentages, the difference in RAID type doesn’t matter as much.  But as we already learned, RAID1/0 is the most efficient at back end writes, and this becomes incredibly apparent at the 60/40 split.  In fact, I need over twice the number of spindles if I choose RAID6 instead of RAID1/0 to support the workload.  Twice the amount of hardware up front, and then twice the power suckers and heat producers sitting in your data center for years.
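
I’ll save the details for that future post, but here is a rough sketch of how numbers like these fall out.  It assumes the simple model where reads pass through 1:1 and every front-end write costs the RAID type’s write penalty in back-end I/Os – the function and names are mine, purely for illustration:

```python
import math

WRITE_PENALTY = {"RAID1/0": 2, "RAID5": 4, "RAID6": 6}

def spindles_needed(front_end_iops, read_ratio, raid_type, iops_per_disk=180):
    """Back-of-napkin spindle count for a given front-end workload."""
    reads = front_end_iops * read_ratio
    writes = front_end_iops * (1 - read_ratio)
    back_end_iops = reads + WRITE_PENALTY[raid_type] * writes
    return math.ceil(back_end_iops / iops_per_disk)

for read_ratio in (0.90, 0.75, 0.60):
    row = {rt: spindles_needed(10000, read_ratio, rt) for rt in WRITE_PENALTY}
    print(f"{read_ratio:.0%} read: {row}")
```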

Capacity Factor

  • RAID1/0 – Bad (50% penalty)
  • RAID5 – Great (generally ~20% penalty or less)
  • RAID6 – Great (generally ~25% penalty or less)

Capacity is a pretty straightforward thing so I’m not going to belabor the point – you need some amount of capacity and you can very quickly calculate how many disks you need of the different RAID types.

  • You can get more or less capacity out of RAID5 or 6 by adjusting RAID group size, though remember the protection caveats.
  • Remember that in some cases (for instance, storage pools on an EMC VNX) a choice of RAID type today locks you in on that pool forever.  By this I mean to say, if someone else talks you into RAID1/0 today and it isn’t needed, not only is it needlessly expensive today, but as you add storage capacity to that pool it is needlessly expensive for years.

Protection Factor

  • RAID1/0 – Lottery! (meaning, there is a lot of random chance here)
  • RAID5 – Good
  • RAID6 – Great

As we’ve discussed, the types vary in protection factor as well.

  • Because of RAID1/0’s lottery factor on losing the 2nd disk, the only thing we can state for certain is that RAID1/0 and RAID6 are better than RAID5 from a protection standpoint.  By that I mean, a 2nd simultaneous disk failure will invalidate a RAID1/0 set only if it happens to be the mirror partner of the disk that already failed – there is a decent chance it won’t be.  For RAID5, a 2nd simultaneous failure will invalidate the set every time.
  • Remember that RAID1/0 is much better behaved in a degraded and rebuild scenario than RAID5 or 6.  If you are planning on squeezing every ounce of performance out of your storage while it is healthy and can’t stand any performance hit when a disk fails, RAID1/0 is probably a better choice.  Although I will say that I don’t recommend running a production environment like this!
  • You can squeeze extra capacity out of RAID5 and 6 by increasing the RAID group size, but keep it within sane limits.  Don’t forget the extra trouble you can have from a fault domain and degraded/rebuild standpoint as the RAID group size gets larger.
  • Finally, remember that RAID is not a substitute for backups.  RAID will do the best it can to protect you from physical failures, but it has limits and does nothing to protect you from logical corruption.

Summary

I think I’ve established that there are a lot of factors to consider when choosing a RAID type.  At the end of the day, you want to satisfy requirements while saving money.  In that vein, here are some summary thoughts.

If you have a very transactional database, or are looking into VDI, RAID1/0 is probably going to be very appealing from a cost perspective because these workloads tend to be IOPS constrained with a heavy write percentage.  On the other hand, less transactional databases, application storage, and file storage tend to be capacity constrained with a low write percentage.  In these cases RAID5 or 6 is going to look better.

In general the following RAID types are a good fit in the following disk tiers, for the following reasons:

  • EFD (a.k.a. Flash or SSD) – RAID5.  Response time here is not really an issue; instead you want to squeeze as much capacity as possible out of them for use, ’cause these puppies are pricey!  RAID5 does that for us.
  • SAS (a.k.a. FC) – RAID5 or RAID1/0.  The choice here hinges on write percentage.  RAID6 on these guys is typically a waste of space and added write penalty; they rebuild fast enough that RAID5 is acceptable.  Note – as these disks get larger and larger this may shift towards RAID1/0 or RAID6 due to rebuild times or even UBEs, but these drives are actually enterprise grade and have an order of magnitude lower UBE rate.
  • NLSAS (a.k.a. SATA) – RAID6.  Please use RAID6 for these disks.  As previously stated, they need the added protection of the extra parity, and you should be able to justify the cost.

Again, this is just in general, and I can’t overstate the need for solid analysis.

Hopefully this has been accurate and useful. I really enjoyed writing this up and hope to continue producing useful (and accurate!) material in the future.

RAID: Part 5 – RAID5 and RAID6

Now that the parity post is out of the way, we can move into RAID5 and RAID6 configurations.  The good news for anyone who actually plodded through the parity post is that we’ve essentially already covered RAID5!  RAID5 is striping with single parity protection, generated on each row of data, exactly like my example.  Because of that I’ll be writing this post assuming you’ve read the parity post (or at least understand the concepts).

RAID5

Actually, from the parity post not only have we covered RAID5…we also covered most of our criteria for RAID type analysis.  Sneaky!

Before continuing on, let me make a quick point about RAID5 group size (note: this also applies to RAID6).  In our example we did 4+1 RAID5.  X+1 is the standard notation for RAID5, meaning X data disks and 1 parity disk (…kind of – I’ll clarify later regarding distributed parity), but there is no reason it has to be 4+1.  There is a lower limit on single parity schemes, and that is three disks (with two disks you would just do mirroring), which would be 2+1.  There is no theoretical upper bound on RAID5 group size, though I will discuss this nuance in the protection factor section – I could theoretically have a 200+1 RAID5 set.  On an EMC VNX system, the upper bound of a RAID5 group is a system limitation of 16 disks, meaning we can go as high as 15+1.  The more standard sizes for storage pools are 4+1 and the newer 8+1.

That said, let’s talk about usable capacity.  RAID5 differs from RAID1/0 in that the usable capacity penalty is directly dependent on how many disks are in the group.  I’ve explained that in RAID5, for every stripe, exactly one strip must be dedicated to parity.  Scale that out to the disk level, and it translates into one whole disk’s worth of parity in the group.  In the 4+1 case our capacity penalty is 20% (1 out of 5 disks is used for parity).  Here are the capacity penalties for the schemes I just listed:

  • 2+1 – 33% (this is the worst-case scenario, and still better than RAID1/0’s 50%)
  • 4+1 – 20%
  • 8+1 – 11%
  • 15+1 – 6.25%

So as we add more data disks to a RAID5 group our usable capacity penalty goes down, and it is always lower than RAID1/0’s 50%.
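
If you want to check these yourself, the penalty is just parity disks divided by total disks.  A quick sketch (the extra parity_disks parameter also covers RAID6’s X+2 groups, discussed later):

```python
def parity_penalty(data_disks, parity_disks=1):
    """Fraction of raw capacity lost to parity in an X+parity_disks group."""
    return parity_disks / (data_disks + parity_disks)

for x in (2, 4, 8, 15):
    print(f"{x}+1 RAID5: {parity_penalty(x):.2%} capacity penalty")
# 2+1: 33.33%, 4+1: 20.00%, 8+1: 11.11%, 15+1: 6.25%
print(f"6+2 RAID6: {parity_penalty(6, 2):.2%} capacity penalty")  # 25.00%
```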

Protection factor?  After the parity post we know and understand why RAID5 can survive a single drive failure.  Let’s talk about degraded and rebuild.

  • Degraded mode – Degraded on RAID5 isn’t too pretty.  We have lost a single disk but are still running because of our parity bits.  For a read request coming in to the failed disk, the system must rebuild that data in memory.  We know that process – every remaining disk must be read in order to generate that data.  For a write request coming into the failed disk, the system must rebuild the existing data in memory, read and recalculate parity, and write the new parity value to disk.  The one exception to the write condition is if in a given stripe we have lost the parity strip instead of a data strip.  There we actually get a performance increase, because the data is just written to whatever data strip it is destined for with no regard to parity recalculation.  However, this teensy performance increase is HEAVILY outweighed by the I/O crushing penalty going on all around it.
  • Rebuild mode – Rebuild is also ugly.  The replacement disk must be rebuilt, which means that every bit of data on every remaining drive must be read in order to calculate what the replacement disk looks like.  And all the while, for incoming reads it is still operating in degraded mode.  Depending on controller design, writes can typically be sent to the new disk – but we still have to update parity.

Protection factor aside, the performance hit from degraded mode is why hot spares are tremendously important to RAID5. You want to spend as little time as possible in degraded mode.

Circling back to usable capacity, why do I want smaller RAID groups?  If I have 50 disks, why would I want to do ten 4+1’s instead of one 49+1?  Why waste 10 times the space on parity?  The answer is two-fold.

First, related to the single drive failure issue, the 49+1 presents a much larger fault domain.  In English, a fault domain is a set of things that are tied to each other for functionality.  Think of it like links in a chain: if one link fails, the entire chain fails (well, a chain in an analogy like this one does).  With 49+1, I can lose at most one drive out of 50 at any time and keep running.  With ten 4+1’s, I can lose up to 10 drives as long as they come out of different RAID groups.  It is certainly possible that I lose two disks in one 4+1 group and that group is dead, but the likelihood of that happening within a given set of 5 disks is lower than within a set of 50 disks.  The trade-off here is that as we add more disks to our RAID group, we gain usable capacity but increase our risk of a two drive failure causing data loss.

Second, related to the Degraded and Rebuild issues, the more drives I have, the more pieces of data I must read in order to construct data during a failure.  If I have 4+1 and lose a disk, for every read that comes into the system I have to read four disks to generate that data.  But with a 49+1 if I lose a disk, now I have to read forty-nine disks in order to generate that data!  As I add more disks to a RAID5 set, Degraded and Rebuild operations become more taxing on the storage array.

On to the write penalty!  In the parity post I explained that any write to existing data causes the original data and parity to be read, some calculations to happen (which are so fast they aren’t relevant), and then the new data and new parity to be written to disk.  So the write penalty in this case is 4:1 – four I/O operations for each write coming into the system.  Interestingly enough, this doesn’t scale with RAID group size.  Whether a 2+1 or a 200+1, the write penalty is always 4:1 for single parity schemes.

Full Stripe Writes

RAID1/0 has a 2:1 write penalty, and RAID5 has a 4:1 write penalty.  Does this mean that writes to RAID1/0 are always more efficient than RAID5?  Not necessarily.  There is a special case for writes to parity called Full Stripe Writes (FSWs).  A FSW is a special case that typically happens with large block sequential writes (like backup operations).  In this case we are writing such a large amount of data that we actually overwrite one entire stripe.  E.g. in our 4+1 scenario, if the strip size was 64KB and we wrote 256KB of data starting at the first disk, we would end our write at the end of the stripe.  In this case, we have no need to do a parity update because every bit of data that we are protecting with the parity is getting overwritten.  Because of this, we can actually just calculate parity in memory (since we already have the entire stripe’s data in memory) and write the entire stripe at once.

The payback is enormous here, because we only have one extra write for every four writes coming into the system.  In the 4+1 that we described, this translates into a write penalty of 5:4.  This is actually a big improvement even over RAID1/0!
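
In formula terms, a full stripe write to an X+1 group costs X+1 back-end writes for X front-end writes.  A tiny sketch of that ratio (my own helper, not any array’s API):

```python
from fractions import Fraction

def fsw_penalty(data_disks, parity_disks=1):
    """Back-end writes per front-end write when an entire stripe is overwritten."""
    return Fraction(data_disks + parity_disks, data_disks)

print(fsw_penalty(4))     # 5/4 – the 5:4 penalty for a 4+1 RAID5
print(fsw_penalty(8))     # 9/8 – even better for an 8+1
print(fsw_penalty(6, 2))  # 4/3 – a 6+2 RAID6 full stripe write
```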

FSWs are not something to hope for when choosing a RAID type.  They are very dependent on the application behavior, file system alignment, and I/O pattern.  Modern storage arrays enable this behavior more often because they hold data in protected cache before flushing to disk, but choosing RAID5 for something that is heavily write oriented and simply hoping that you will get the 5:4 write penalty would be very foolish.  However, if you do your homework you can usually figure out if it is happening or not.  As a simple example, if I was dumping large backups onto a storage array, I would almost always choose RAID5 or RAID6 because this generally will leverage FSWs.

RAID6

RAID6 is striping with dual parity protection.  Essentially most of what we know about RAID5 applies, except that in any given stripe instead of one parity value there are two.  What this allows us to do is to recover in the event that we lose two drives.  RAID6 can survive two drive failures.

In order to make this work, there is a catch: the second parity value must actually be different from the first.  If the second parity value were just a copy of the first, that wouldn’t buy us anything for data recovery.  Another catch is that the 2nd parity value can’t use the first parity value in its calculation…otherwise the 2nd parity value is dependent on the first, and in a recovery scenario we run into a bit of a storage-array-and-the-egg problem.  Not what we want.

In the parity post I declared my undying love for XOR, and to prove to the rest of you doubters that it is just as amazing as I made it out to be – the 2nd parity value also uses XOR!  It is just too efficient to pass up.  But obviously we must XOR some different data values together.  RAID6’s second parity actually comes from diagonal stripes.

Offhand you might be imagining something like this:

[Diagram: an incorrect guess at RAID6 – the second parity value calculated across diagonal stripes spanning multiple rows]

As the helpful text indicates, not so much.  Why not, though?  We satisfied both of our criteria – the 2nd parity bit is different from the first, and it doesn’t include the first in its calculation either.

From a protection standpoint, this probably works but we pay a couple of performance penalties.  First and foremost, we lose the ability to do FSWs.  In order to do a full stripe write with this scheme, I have to essentially overwrite every single disk at one time.  Not gonna happen.  Second, in recovery scenarios my protection information is tied to more strips than RAID5.  I have a set of horizontal strips for one parity value and then another set of diagonal strips for the 2nd parity strip.

Instead, remember that we are working with an ordered set of 1’s and 0’s in every strip, so really the 2nd parity bit is calculated like:

[Diagram: the real approach – the second parity value calculated within the same stripe, using different bits from each strip]

It is a strange, strange thing, but essentially the parity is calculated (or should be calculated) within the same stripe using different bits in each strip.

For a more comprehensive and probably more clear look into the hows of RAID6 (including recovery methodology), EMC’s old whitepaper on it is still a great resource.  I really encourage you to check it out if you need some more detail or explanation, or just want to read a different perspective on it.  https://www.emc.com/collateral/hardware/white-papers/h2891-clariion-raid-6.pdf  Their diagrams are much more informative than mine, although they have very few kittens in them from what I’ve seen so far.

On to our other criteria – the degraded and rebuild modes are pretty much the same as RAID5, except that we may have to read one additional parity disk during the operation.  In other words, degraded and rebuild modes are not pleasant with RAID6.  Make sure you have hot spares to get you out of both as fast as possible.

Usable capacity – the penalty is calculated similarly to RAID5, just with X+2 notation. So e.g. a 6+2 RAID6 would have a 2/8 (two out of eight disks used for parity) penalty, or 25%.  Just like RAID5, this value depends on the size of the group itself, with a technical minimum of four drives.  I say technical because RAID6 schemes are usually implemented to protect a large number of disks – instead of two data and two parity disks, why not just do a 2+2 RAID1/0?  Ahh, variety.

Finally, write penalty.  Because every time I write data I have to update two parity values, there is a 6:1 write penalty with RAID6.  The update operation is once again the same as RAID5 except the second parity value must be read, new parity calculated, and new parity written.

RAID6 can utilize FSWs as discussed above, but if it doesn’t, write operations are taxed HEAVILY with the 6:1 write penalty.  RAID6 has its place, but if you are trying to support small block random writes, it is probably advisable to steer clear.  Again there is no such thing as read penalty, so from a read perspective it performs identically to all other RAID types given the same number of disks in the group.

Distributed vs Dedicated Parity

Briefly I wanted to mention something about parity and the RAID notation like 4+1.  We “think” of this as “4 data disks, one parity disk” which makes sense from a capacity perspective.  In practice, this is called dedicated parity…and it’s not such a good idea.

Every write that comes into the system generates 4 back end I/Os.  Two of those I/Os are slated for the strip that the data is on, and the other two I/Os hit the parity strip.  Were we to stack all the parity strips up on one disk (as we would with a dedicated parity disk), what do you think that would look like under any serious write load?

You could roast marshmallows on the parity disk.

The parity disk has a lot of potential to become a bottleneck.  Instead, RAID5 and 6 implementations use what is called distributed parity in order to provide better I/O balancing.

[Diagram: distributed parity – the parity strips rotating across all disks in the group]

In this manner, the parity load for the RAID group is distributed evenly across the disks.  Now, does this guarantee even balance?  Nope.  If I hit the top stripe hard, the top parity strip on Disk1 is still going to cook.  But under normal write load with small enough strip size, this provides a much needed load balance.
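
If it helps to picture the rotation, here is a toy sketch of one possible layout for a 4+1 group.  Real arrays each have their own placement algorithms; the point is just the parity strip marching across the disks stripe by stripe:

```python
DISKS = 5  # a 4+1 RAID5 group

for stripe in range(6):
    # rotate the parity strip one disk to the left on each successive stripe
    parity_disk = (DISKS - 1 - stripe) % DISKS
    row = ["P" if disk == parity_disk else "D" for disk in range(DISKS)]
    print(f"stripe {stripe}:  " + "  ".join(row))
```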

Not all protection schemes use distributed parity – NetApp’s RAID-DP is a good example of this.  But in cases where parity is not distributed, there must be some other mechanism to alleviate the parity load…otherwise the parity disk is going to be a massive bottleneck.

Uncorrectable Bit Errors

Finally, I wanted to mention Uncorrectable Bit Errors (UBEs) and their impact on RAID5 vs RAID6.  If you check out the whitepaper from EMC above, you’ll see a reference to uncorrectable errors.  You can also google this topic – there are good papers on it.

An uncorrectable error is one that happens on a disk and renders the data for that particular sector unrecoverable.  The error rate is measured in errors per bit read.  Many consumer grade drives are 1 error per 10^15 bits (~113TB) read, and enterprise grade drives are 1/10^16 (~1.1PB). Generally the larger capacity drives (NL-SAS) are actually consumer grade from this standpoint.

During normal operations with RAID protection a UBE is OK because we have recovery information built into the RAID scheme.  But in a RAID5 rebuild scenario, a UBE is instant death for the RAID group.  Remember we have to be able to reconstruct that failed disk in its entirety, and in order to do that we have to read every bit of data off of every other disk in the group.

So consider that 3TB capacity drives are going to exhibit a UBE every ~113TB of data read, giving a run through the entire disk an approximately 2.5% chance of winning the lottery.  Then consider that your RAID5 group is probably going to have at least four or five of these guys in it.
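
For the curious, here is roughly where that ~2.5% comes from, and what it balloons into during a rebuild.  This assumes independent errors at the quoted 1-per-10^15-bit rate, which is itself a simplification:

```python
import math

def ube_probability(bytes_read, errors_per_bit=1e-15):
    """Chance of at least one UBE while reading bytes_read bytes, assuming
    independent errors; log1p/expm1 keep the tiny probabilities accurate."""
    bits_read = bytes_read * 8
    return -math.expm1(bits_read * math.log1p(-errors_per_bit))

disk = 3e12  # one 3TB drive
print(f"Full read of one disk:      {ube_probability(disk):.1%}")      # ~2.4%
print(f"4+1 rebuild (4 full reads): {ube_probability(4 * disk):.1%}")  # ~9.2%
```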

I’ve seen RAID5 used for capacity drives before.  And there are mechanisms built into storage arrays to try to sweep and detect errors before a drive fails.  And to date (knock on wood) I haven’t seen a RAID group die a horrible death during rebuild.  But it is always my emphatic recommendation to protect capacity drives with RAID6.  You will find this best practice repeated ad nauseam throughout the storage world.  It is nearly impossible to justify the additional risk of RAID5 against the cost of a few extra capacity disks, even if it pushes you into an extra disk shelf.  Fighting a battle today for a few more dollars on the purchase is going to be a lot less painful than explaining why a 50TB storage pool is invalid and everything in it must be rolled from backup.  (And you’ve got backups, right?  And they work?)

The Summary Before the Summary

This was a tremendous amount of information and is probably not digestible in one sitting.  Maybe not even two.  My hope is really that by reading this you will learn just a bit about the operations behind the curtain that will help you make an informed decision on when to use RAID5 and RAID6.  If this saves just one person from saying “we need to use RAID1/0 because it is the fast one,” I will be happy.

My next post will be a wrap up of RAID and some comparisons between the types to bring a close to this sometimes bizarre topic of RAID.

RAID: Part 4 – Parity, Schmarity

We’re in the home stretch now.  We’ve covered mirroring and striping, and three RAID types already.

Now we are moving into RAID5 and RAID6.  They leverage striping and a concept called parity.

I’m still new to this blogging thing and once again I bit off more than I could chew.  I wanted to include RAID5 and 6 within this post but it again got really lengthy.  Rather than put an abridged version of them in here, I will cover them in the next post.

For a long time I didn’t really understand what parity was; I just knew it was “out there” and let us recover from disk failures.  But the more I looked into it and how it worked, the more it really amazed me.  It might not grab you the way it grabs me…and that’s OK, maybe you just want to move on to RAID5/6 directly.  But if you are really interested in how it works, forge on.

What is Parity?

Parity (pronounced ‘pear-ih-tee’) is a fancy (pronounced ‘faincy’) kind of recovery information.  In the mirroring post we examined data recovery from a copy perspective.  This is a pretty straightforward concept – if I make two copies of data and I lose one copy, I can recover from the remaining copy.

The downside of this lies in the copy itself.  It requires double the amount of space (which we now refer to as a 50% usable capacity penalty) to protect data with an identical copy.

Parity is a more efficient form of recovery information.  Let’s walk through a simple example, but one that I hope will really illustrate the mechanism, benefits, and problems of parity.  Say that I’m writing numbers to disk, the numbers 18, 24, 9, and last but certainly not least, 42.

[Diagram: four data disks holding the values 18, 24, 9, and 42]

As previously discussed, a mirroring strategy would require 4 additional disks in order to mirror the existing data – very inefficient from a capacity perspective.

Instead, I’m going to perform a parity calculation – or a calculation that results in some single value I can use to recover from.  In this case I’m going to use simple addition to create it.

18 + 24 + 9 + 42 = 93

So my parity value is 93 and I can use this for recovery (I’ll explain how in just a moment).

Next question – where can I store this value?  Well we probably shouldn’t use any of the existing disks, because they contain the information we are protecting.  This is a pretty common strategy for recovery. If I’m protecting funny kitten videos on my hard drive, I don’t want to back them up to the same disk because a disk failure takes out the original hilarious videos and the adorable backups. Instead I want to store them on a different physical medium to protect from physical failure.

[Image: kitten videos backed up to a separate physical disk]

Similarly, in my parity scheme if a disk were to fail that contained data and the parity value, I would be out of luck.

To get around this, I’ll add a fifth disk and write this value:

[Diagram: a fifth disk added to hold the parity value 93]

Now the real question: how do I use it for recovery?  Pretty simply in this case.  Any disk that is lost can be rebuilt by utilizing the parity information, along with the remaining data values.  If Disk3 dies, I can recover the data on it by subtracting the remaining data values from the parity value:

[Diagram: recovering Disk3 – 93 - 18 - 24 - 42 = 9]

Success – data recovery after a disk failure…and it was accomplished without adding a complete data copy!  In this case 1/5 of the disk space is lost to parity, translating to a 20% usable capacity penalty.  That is a serious improvement over mirroring.

What happens if there are two disks that fail? As with most things, it depends.  Just like RAID1, if a second disk fails after the system has fully recovered from the first failure, everything is fine.

But if there is a simultaneous failure of two disks?  This presents a recovery problem because there are two unknowns.  If Disk1 and Disk2 are lost simultaneously, my equation looks like:

93 - 42 - 9 = 42 = ? + ?

Or in English, what two values add up to 42?  While it is true that 18+24=42, so does 20+22, neither of which is my data.  There are a lot of values that meet this criterion…in this case more of them aren’t my data than are.  And guessing with data recovery is, in technical terms, a terribad idea.  So we know that this parity scheme can survive only a single disk failure.

Another important question – what happens if we overwrite data?  For instance, if Disk2’s value of 24 gets overwritten with a value of 15, how do we adjust?  It would be a real bummer if the system had to read all of the data in the stripe to calculate parity again for just one affected strip.

There is some re-reading of data, but it isn’t nearly that bad.  We remove the old data value (24) from the parity value (93), and then add the new one (15) in.  Then we can replace both the data and the parity on disk.

The process looks like:

  1. Get the old parity value
  2. Get the old data value
  3. Subtract the old data value from the old parity value, creating an intermediate parity value
  4. Add the new data value to the intermediate parity value, creating the new parity value
  5. Update the parity value with the new parity value
  6. Update the data value with the new data value

Because we are working with disks, we can replace the “Get” phrasing with read and “Update” with write.  Looking back at this list, we see that there are two gets (reads) and two updates (writes).
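
Here is that read-modify-write sequence in code form, using the toy numbers from this example (this is just my addition scheme, not anything a real controller does):

```python
disks = [18, 24, 9, 42]
parity = sum(disks)                    # 93, stored on the fifth disk

# Overwrite Disk2's 24 with 15 using the six steps above
old_parity = parity                    # 1. read the old parity value
old_data = disks[1]                    # 2. read the old data value
intermediate = old_parity - old_data   # 3. subtract old data out: 69
new_parity = intermediate + 15         # 4. add new data in: 84
parity = new_parity                    # 5. write the new parity value
disks[1] = 15                          # 6. write the new data value

assert parity == sum(disks)            # still consistent: 18 + 15 + 9 + 42 = 84
```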

The Magical XOR Operation

I know, I know – my example was amazing.  “Parity calculations should use that mechanism!” I can hear you saying.  “Give him the copyright and millions in royalties,” you are no doubt proclaiming to everyone around you.

Unfortunately my scheme has a serious problem, and that is that my parity calculation is cumulative.  The larger the numbers I am protecting get, the larger my parity value gets.  Remember that at the end of the day we aren’t working with numbers; we are working with data bits (1’s and 0’s) on disk, and we are working with a fixed strip size.  Were we to protect 64KB strips of data with accumulating sums, the parity would eventually outgrow the fixed-size strip meant to hold it – the sum needs more bits than any single value going into it.  Not ideal.

Instead, in the real world parity is handled entirely by the bitwise “exclusive or,” or XOR, operation.  Visually the operator itself looks kind of like a crosshair, or a plus inside a circle (⊕).  XOR is a unique operator in that it essentially allows you to add and subtract from a value (similar to my scheme) without increasing the total amount of information (the bit count).

Another cool thing about the bitwise XOR is it functions both as addition and subtraction.  To “add” two values, you XOR them together.  Then to remove one value from it, you XOR that value again. So instead of A + B – B = A, we have:

[Diagram: (A XOR B) XOR B = A]

XOR’s principle is very simple – if two values are different, the output is TRUE (1); otherwise the output is FALSE (0).  In other words:

  1. Take any pair of 1’s and/or 0’s and compare
  2. If they are the same, output 0
  3. Else output 1

That’s all, folks.  The 4 possible input combinations and their outputs are:

  • 0 XOR 0 = 0
  • 0 XOR 1 = 1
  • 1 XOR 0 = 1
  • 1 XOR 1 = 0

Not too crazy looking is it?  Let’s put it to the test and prove that this works for recovery.  Similar to before, I have 4 data disks but this time with just ones and zeros on them.  The parity calculation goes like so:

[Diagram: XORing the data bits 1, 0, 1, 1 produces the parity bit 1]

Every data bit gets XOR’d together and the result is the parity bit – in this case it is 1.

Now say Disk2 with 0 on it experiences a failure and the system needs to recover. Recovery would simply be the result of XORing all the remaining data bits and the parity bit.

Recovery (working left to right): 1 XOR 1 = 0, then 0 XOR 1 = 1, then 1 XOR 1 = 0

Success!  We have recovered the 0 bit.  Unfortunately XOR’s magic only extends so far, and in the event of a simultaneous two disk failure we are still up the creek without a paddle.  One parity value can only protect you against the loss of one disk because it can only recover one unknown value.

How about updating the parity bit in the event that we overwrite some data? Again, this works as outlined above:

  1. Read the old parity value
  2. Read the old data value
  3. XOR the old data value with the old parity value, creating an intermediate parity value
  4. XOR the new data value with the intermediate parity value, creating the new parity value
  5. Write the parity value with the new parity value
  6. Write the data value with the new data value

Same as before, there are two reads and two writes.
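
Here is the whole lifecycle in a few lines of Python, using the bits from this example (^ is Python’s bitwise XOR operator; the layout is mine, but the math is exactly what we just walked through):

```python
from functools import reduce

def xor_all(values):
    return reduce(lambda a, b: a ^ b, values)

strips = [1, 0, 1, 1]       # the data bits from the example; Disk2 holds the 0
parity = xor_all(strips)    # 1 XOR 0 XOR 1 XOR 1 = 1

# Disk2 fails: rebuild its bit from the survivors plus the parity bit
survivors = [bit for i, bit in enumerate(strips) if i != 1]
assert xor_all(survivors + [parity]) == strips[1]   # recovers the 0

# Read-modify-write a new value onto Disk2: two reads, two writes
new_data = 1
new_parity = parity ^ strips[1] ^ new_data  # back the old bit out, add the new
strips[1], parity = new_data, new_parity
assert xor_all(strips) == parity

# XOR also works bitwise across whole strips, not just single bits:
assert 0b1100 ^ 0b1010 == 0b0110
```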

This is obviously a very simple example and was in no way meant to be a mathematical proof of parity recovery or XOR, but it works on any scale you choose.  Test it out!

Quick note – in the real world, strip size is not a single bit like in this example. With a 64KB strip size, that comes out to 524288 bits in a single strip.  524288 ones and zeros.  XOR functions quite simply as you add bits on, since it just compares each bit in place.  For example, 1100 XOR 1010 is 0110.  The first digit of the result is the first digit of each input XOR’d together.  The second digit is the second digits of each input XOR’d together.  And so on.  There are more detailed XOR manifestos out there as well as XOR calculators online…feel free to consult if you are interested and my explanation left you wanting.

Bitwise XOR is the mechanism for generating a parity bit without increasing the total bit count.  Using this, RAID controllers can generate parity that is identical in size to any amount of data strips being protected.  No matter the strip size, and no matter the stripe width, this mechanism will always result in an identically sized parity.

So what?

So what, indeed.  That was a lot to take in, I know…parity is certainly more complicated than mirroring.  Is it absolutely necessary to understand how parity works at this level?  No, not really.  But thus far I’ve never had a problem arise because I understood how something worked too deeply. I have encountered plenty of issues because I haven’t understood how something worked, or made assumptions about what something was doing behind the scenes.

When we get into some aspects of RAID5 and RAID6, understanding what parity is supposed to do will help clarify what those RAID types are useful for.  And if you don’t agree, feel free to wipe this from your memory banks and replace it with something more useful.