RAID: Part 6 – Wrap Up

Finally the end – what a long, wordy trip it has been.  If you waded through all 5 posts, awesome!

As a final post, I wanted to attempt to bring all of the high points together and draw some contrasts between the RAID types I’ve discussed.  My goal with this post is less about technical minutiae and more about providing some strong direction to equip readers to make informed decisions.

Does Any of This Matter?

I always spend some time asking myself this question as I dive further and further down the rabbit hole on topics like this.  It is certainly possible that you can interact with storage and not understand details about RAID.  However I am a firm believer that you should understand it.  RAID is the foundation on which everything is built.  It is used in almost every storage platform out there.  It dictates behavior.  Making a smart choice here can save you money or waste it.  It can improve storage performance or cripple it.

I also like the idea that understanding the building blocks can later empower you to understand even more concepts.  For instance, if you’ve read through this series you understand mirroring, striping, and parity.  Pop quiz: what would a RAID5/0 look like?

[Image: raid50 – two RAID5 groups striped together by a top-level RAID0]

Pretty neat that even without me describing it in detail, you can understand a lot about how this RAID type would function.  You’d know the failure capabilities and the write penalties of the individual RAID5 members.  And you’d know that the configuration couldn’t survive a failure of either RAID5 set because of the top level striping configuration.  And let’s say that I told you the strip size of the RAID5 group was 64KB, and that the strip size of the RAID0 config was 256MB.  Believe it or not, this is a pretty accurate description of a 10 disk VNX2 storage pool from a single tier RAID5 perspective.

Again to me this is part of the value – when fancy new things come out, the fundamental building blocks are often the same.  If you understand the functionality of the building block, then you can extrapolate functionality of many things.  And if I give you a new storage widget to look at, you’ll instantly understand certain things about it based on the underlying RAID configuration.  It puts you in a much better position than just memorizing that RAID5 is “parity.”

Okay, I’m off my soapbox!

Workload – Read

  • RAID1/0 – Great
  • RAID5 – Great
  • RAID6 – Great

I’ve probably hammered this home by now, but when we are looking at largely read workloads (or just the read portion of any workload) the RAID type is mostly irrelevant from a performance perspective in non-degraded mode.  But as with any blanket statement, there are caveats.  Here are some things to keep in mind.

  • Your read performance will depend almost entirely on the underlying disk (ignoring sequential reads and prefetching).  I’m not talking about the obvious flash vs NLSAS; I’m talking about RAID group sizing.  As a general statement I can say that RAID1/0 performs identically to RAID5 for pure read workloads, but an 8 disk RAID1/0 is going to outperform a 4+1 RAID5.
  • Ask the question and do tests to confirm: does your storage platform round robin reads between mirror pairs in RAID1/0?  If not (and not all controllers do), your RAID1/0 read performance is going to be constrained to half of the spindles.  From the previous bullet point, our 8 disk RAID1/0 would be outperformed by a 4+1 disk RAID5 in reads because only 4 of the 8 spindles are actually servicing read requests.

Workload – Write

  • RAID1/0 – Great (write penalty of 2)
  • RAID5 – Okay (write penalty of 4)
  • RAID6 – Bad (write penalty of 6)

Writes are where the RAID types start to diverge pretty dramatically due to the vastly different write penalties between them.  Yet once again sometimes people draw the wrong conclusion from the general idea that RAID1/0 is more efficient at writes than RAID6.

  • The underlying disk structure is still dramatically important.  A lot of people seem to focus on “workload isolation” – e.g. with a database, putting the data on RAID5 and the transaction logs on RAID1/0.  This is a great idea from a design perspective when starting with a blank slate.  However, what if the RAID5 pool I’m working with has 200 disks and I only have 4 disks for RAID1/0?  In that case I’m almost certain to have better success dropping the logs into the RAID5 pool, because there are WAY more spindles to support the I/O.  There are a lot of variables around the workload, but the point I’m trying to make is that you should look at all the parts as a whole when making these decisions.
  • If your write workload is large block sequential, take a look at RAID5 or RAID6 over RAID1/0 – you will typically see much more efficient I/O in these cases.  However, make sure you do proper analysis and don’t end up with heavy small block random writes on RAID6.

Going back and re-reading some of my previous posts, I feel like I may have given the impression that I don’t like RAID1/0, or that I don’t see value in it.  That is certainly not the case, so I wanted to give an example of when you need RAID1/0 without question: when we see a “lot” of small block random writes and don’t need excessive amounts of capacity.  What is a “lot”?  Good question.  Typically the breaking point is around a 30-40% write ratio.

Given that a SAS drive should only be counted on for around 180 IOPs, let’s crunch some numbers for an imaginary 10,000 front-end IOPs workload.  How many spindles do we need to support the workload at specific read/write ratios?  (I will do another blog post on the specifics of these calculations.)

Read/Write Ratio     RAID1/0 disk count     RAID5 disk count     RAID6 disk count
90%/10%              62                     73                   78
75%/25%              70                     98                   125
60%/40%              78                     123                  167

So, at lighter write percentages, the difference between the RAID types doesn’t matter as much.  But as we already learned, RAID1/0 is the most efficient at back-end writes, and that becomes glaringly apparent at the 60/40 split.  In fact, I need over twice the number of spindles if I choose RAID6 instead of RAID1/0 to support the workload.  Twice the hardware up front, and then twice the power suckers and heat producers sitting in your data center for years.
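
For the curious, here’s a minimal sketch of the arithmetic behind that table, assuming the standard write-penalty math and the 180 IOPs per spindle figure from above – a real sizing exercise has far more variables than this:

```python
import math

DISK_IOPS = 180           # rough per-spindle ceiling for SAS, per the assumption above
FRONT_END_IOPS = 10000    # the imaginary workload
WRITE_PENALTY = {"RAID1/0": 2, "RAID5": 4, "RAID6": 6}

for read_pct in (90, 75, 60):
    reads = FRONT_END_IOPS * read_pct // 100
    writes = FRONT_END_IOPS - reads
    counts = {}
    for raid, penalty in WRITE_PENALTY.items():
        # reads hit the back end 1:1; every front-end write costs 'penalty' back-end writes
        back_end_iops = reads + writes * penalty
        counts[raid] = math.ceil(back_end_iops / DISK_IOPS)
    print(f"{read_pct}%/{100 - read_pct}%: {counts}")
```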

Capacity Factor

  • RAID1/0 – Bad (50% penalty)
  • RAID5 – Great (generally ~20% penalty or less)
  • RAID6 – Great (generally ~25% penalty or less)

Capacity is a pretty straightforward thing so I’m not going to belabor the point – you need some amount of capacity and you can very quickly calculate how many disks you need of the different RAID types.

  • You can get more or less capacity out of RAID5 or 6 by adjusting RAID group size, though remember the protection caveats.
  • Remember that in some cases (for instance, storage pools on an EMC VNX) a choice of RAID type today locks you in on that pool forever.  By this I mean to say, if someone else talks you into RAID1/0 today and it isn’t needed, not only is it needlessly expensive today, but as you add storage capacity to that pool it is needlessly expensive for years.

Protection Factor

  • RAID1/0 – Lottery! (meaning, there is a lot of random chance here)
  • RAID5 – Good
  • RAID6 – Great

As we’ve discussed, the types vary in protection factor as well.

  • Because of RAID1/0’s lottery factor on losing the 2nd disk, the only thing we can state for certain is that RAID1/0 and RAID6 are better than RAID5 from a protection standpoint.  By that I mean, it is entirely possible that the 2nd simultaneous disk failure will invalidate a RAID1/0 set if it is the exact right disk, but there is a chance that it won’t.  For RAID5, a 2nd simultaneous failure will invalidate the set every time.
  • Remember that RAID1/0 is much better behaved in degraded and rebuild scenarios than RAID5 or 6.  If you plan on squeezing every ounce of performance out of your storage and can’t stand a performance hit when a disk fails, RAID1/0 is probably the better choice.  Although I will say that I don’t recommend running a production environment like this!
  • You can squeeze extra capacity out of RAID5 and 6 by increasing the RAID group size, but keep it within sane limits.  Don’t forget the extra trouble you can have from a fault domain and degraded/rebuild standpoint as the RAID group size gets larger.
  • Finally, remember that RAID is not a substitute for backups.  RAID will do the best it can to protect you from physical failures, but it has limits and does nothing to protect you from logical corruption.

Summary

I think I’ve established that there are a lot of factors to consider when choosing a RAID type.  At the end of the day, you want to satisfy requirements while saving money.  In that vein, here are some summary thoughts.

If you have a very transactional database, or are looking into VDI, RAID1/0 is probably going to be very appealing from a cost perspective because these workloads tend to be IOPs constrained with a heavy write percentage.  On the other hand, less transactional databases, application storage, and file storage tend to be capacity constrained with a low write percentage.  In those cases RAID5 or 6 are going to look better.

In general the following RAID types are a good fit in the following disk tiers, for the following reasons:

  • EFD (a.k.a. Flash or SSD) – RAID5.  Response time here is not really an issue; instead you want to squeeze as much usable capacity as possible out of them, ’cause these puppies are pricey!  RAID5 does that for us.
  • SAS (a.k.a. FC) – RAID5 or RAID1/0.  The choice here hinges on write percentage.  RAID6 on these guys is typically a waste of space and an unnecessary added write penalty.  They rebuild fast enough that RAID5 is acceptable.  Note – as these disks get larger and larger this may shift towards RAID1/0 or RAID6 due to rebuild times or even UBEs, but SAS drives are enterprise grade and have orders of magnitude lower UBE rates than NLSAS.
  • NLSAS (a.k.a. SATA) – RAID6.  Please use RAID6 for these disks.  As previously stated, they need the added protection of the extra parity, and you should be able to justify the cost.

Again, this is just in general, and I can’t overstate the need for solid analysis.

Hopefully this has been accurate and useful. I really enjoyed writing this up and hope to continue producing useful (and accurate!) material in the future.

RAID: Part 4 – Parity, Schmarity

We’re in the home stretch now.  We’ve covered mirroring and striping, and three RAID types already.

Now we are moving into RAID5 and RAID6.  They leverage striping and a concept called parity.

I’m still new to this blogging thing and once again I bit off more than I could chew.  I wanted to include RAID5 and 6 within this post but it again got really lengthy.  Rather than put an abridged version of them in here, I will cover them in the next post.

For a long time I didn’t really understand what parity was; I just knew it was “out there” and let us recover from disk failures.  But the more I looked into it and how it worked, the more it really amazed me.  It might not grab you the way it grabs me…and that’s OK, maybe you just want to move on to RAID5/6 directly.  But if you are really interested in how it works, forge on.

What is Parity?

Parity (pronounced ‘pear-ih-tee’) is a fancy (pronounced ‘faincy’) kind of recovery information.  In the mirroring post we examined data recovery from a copy perspective.  This is a pretty straight forward concept – if I make two copies of data and I lose one copy, I can recover from the remaining copy.

The downside of this lies in the copy itself.  It requires double the amount of space (which we now refer to as a 50% usable capacity penalty) to protect data with an identical copy.

Parity is a more efficient form of recovery information.  Let’s walk through a simple example, but one that I hope will really illustrate the mechanism, benefits, and problems of parity.  Say that I’m writing numbers to disk: 18, 24, 9, and last but certainly not least, 42.

[Image: noparity – four disks holding the values 18, 24, 9, and 42]

As previously discussed, a mirroring strategy would require 4 additional disks in order to mirror the existing data – very inefficient from a capacity perspective.

Instead, I’m going to perform a parity calculation – or a calculation that results in some single value I can use to recover from.  In this case I’m going to use simple addition to create it.

18 + 24 + 9 + 42 = 93

So my parity value is 93 and I can use this for recovery (I’ll explain how in just a moment).

Next question – where can I store this value?  Well we probably shouldn’t use any of the existing disks, because they contain the information we are protecting.  This is a pretty common strategy for recovery. If I’m protecting funny kitten videos on my hard drive, I don’t want to back them up to the same disk because a disk failure takes out the original hilarious videos and the adorable backups. Instead I want to store them on a different physical medium to protect from physical failure.

[Image: kittehbackup – backing up kitten videos to the same disk doesn’t protect them from a disk failure]

Similarly, in my parity scheme if a disk were to fail that contained data and the parity value, I would be out of luck.

To get around this, I’ll add a fifth disk and write this value:

[Image: parity – a fifth disk added to hold the parity value 93]

Now the real question: how do I use it for recovery?  Pretty simply in this case.  Any disk that is lost can be rebuilt by utilizing the parity information, along with the remaining data values.  If Disk3 dies, I can recover the data on it by subtracting the remaining data values from the parity value:

[Image: reconstruct – rebuilding Disk3 by subtracting the remaining values from parity: 93 − 18 − 24 − 42 = 9]

Success – data recovery after a disk failure…and it was accomplished without adding a complete data copy!  In this case 1/5 of the disk space is lost to parity, translating to a 20% usable capacity penalty.  That is a serious improvement from mirroring.

What happens if there are two disks that fail? As with most things, it depends.  Just like RAID1, if a second disk fails after the system has fully recovered from the first failure, everything is fine.

But if there is a simultaneous failure of two disks?  This presents a recovery problem because there are two unknowns.  If Disk1 and Disk2 are lost simultaneously, my equation looks like:

93 - 42 - 9 = 42 = ? + ?

Or in English, what two values add up to 42?  While it is true that 18+24=42, so does 20+22, neither of which are my data.  There are a lot of values that meet this criteria…in this case more of them aren’t my data than are.  And guessing with data recovery is, in technical terms, a terribad idea.  So we know that this parity scheme can survive a single disk failure.

Another important question – what happens if we overwrite data?  For instance, if Disk2’s value of 24 gets overwritten with a value of 15, how do we adjust?  It would be a real bummer if the system had to read all of the data in the stripe to calculate parity again for just one affected strip.

There is some re-reading of data, but it isn’t nearly that bad.  We remove the old data value (24) from the parity value (93), and then add the new one (15) in.  Then we can replace both the data and the parity on disk.

The process looks like:

  1. Get the old parity value
  2. Get the old data value
  3. Subtract the old data value from the old parity value, creating an intermediate parity value
  4. Add the new data value to the intermediate parity value, creating the new parity value
  5. Update the parity value with the new parity value
  6. Update the data value with the new data value

Because we are working with disks, we can replace the “Get” phrasing with read and “Update” with write.  Looking back at this list, we see that there are two gets (reads) and two updates (writes).
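
Here’s that process as a tiny sketch using the numbers from the example above.  It is purely illustrative – real controllers use XOR rather than addition, as the next section explains:

```python
data = [18, 24, 9, 42]       # contents of Disk1 through Disk4
parity = sum(data)           # 93, stored on the fifth disk

# Recovery: rebuild Disk3 after a failure by subtracting the survivors from the parity value
survivors = [data[0], data[1], data[3]]      # 18, 24, 42
rebuilt = parity - sum(survivors)            # 93 - 84 = 9
assert rebuilt == data[2]

# Overwrite: Disk2's 24 becomes 15 via read-modify-write (two reads, two writes)
new_value = 15
old_parity = parity                          # read #1: the old parity value
old_value = data[1]                          # read #2: the old data value
parity = old_parity - old_value + new_value  # write #1: new parity = 93 - 24 + 15 = 84
data[1] = new_value                          # write #2: the new data value
assert parity == sum(data)                   # recomputing from scratch agrees
```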

The Magical XOR Operation

I know, I know – my example was amazing.  “Parity calculations should use that mechanism!,” I can hear you saying.  “Give him the copyright and millions in royalties,” you are no doubt proclaiming to everyone around you.

Unfortunately my scheme has a serious problem: the parity calculation is cumulative.  The larger the values I’m protecting get, the larger my parity value gets.  Remember that at the end of the day we aren’t really working with tidy numbers – we are working with data bits (1’s and 0’s) on disk, and with a fixed strip size.  The sum of four 64KB strips isn’t guaranteed to fit back into a single 64KB parity strip; recovery information that can outgrow the space reserved for it is a non-starter.  Not ideal.

Instead, in the real world parity is built on the bitwise “exclusive or,” or XOR, operation.  Visually the operator itself looks kind of like a crosshair, or a plus inside a circle.  XOR is special in that it essentially lets you add a value to, and subtract it from, another value (similar to my scheme) without increasing the total amount of information (the bit count).

Another cool thing about the bitwise XOR is it functions both as addition and subtraction.  To “add” two values, you XOR them together.  Then to remove one value from it, you XOR that value again. So instead of A + B – B = A, we have:

[Image: xoraddsub – A XOR B XOR B = A]

XOR’s principle is very simple – if two values are different, the output is TRUE (1); otherwise the output is FALSE (0).  In other words:

  1. Take any pair of 1’s and/or 0’s and compare
  2. If they are the same, output 0
  3. Else output 1

That’s all, folks.  The 4 possible input combinations and their outputs are:

  • 0 XOR 0 = 0
  • 0 XOR 1 = 1
  • 1 XOR 0 = 1
  • 1 XOR 1 = 0

Not too crazy looking, is it?  Let’s put it to the test and prove that this works for recovery.  Similar to before, I have 4 data disks but this time with just ones and zeros on them.  The parity calculation goes like so:

[Image: xorcalc – XOR of the four data bits produces the parity bit]

Every data bit gets XOR’d together and the result is the parity bit – in this case it is 1.

Now say Disk2 with 0 on it experiences a failure and the system needs to recover. Recovery would simply be the result of XORing all the remaining data bits and the parity bit.

Recovery (working left to right): 1 XOR 1 = 0, then 0 XOR 1 = 1, then 1 XOR 1 = 0

Success!  We have recovered the 0 bit.  Unfortunately XOR’s magic only extends so far, and in the event of a simultaneous two disk failure we are still up the creek without a paddle.  One parity value can only protect you against the loss of one disk because it can only recover one unknown value.
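
Here’s a minimal sketch of that calculation and recovery.  I’m assuming Disk1, Disk3, and Disk4 hold 1s, which is consistent with the recovery walk-through above:

```python
from functools import reduce
from operator import xor

data_bits = [1, 0, 1, 1]              # Disk1..Disk4; Disk2 holds the 0
parity_bit = reduce(xor, data_bits)   # 1 XOR 0 XOR 1 XOR 1 = 1

# Disk2 fails: XOR the surviving data bits together with the parity bit to rebuild it
survivors = [data_bits[0], data_bits[2], data_bits[3]]
rebuilt = reduce(xor, survivors + [parity_bit])   # 1 XOR 1 XOR 1 XOR 1 = 0
assert rebuilt == data_bits[1]
```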

How about updating the parity bit in the event that we overwrite some data? Again, this works as outlined above:

  1. Read the old parity value
  2. Read the old data value
  3. XOR the old data value with the old parity value, creating an intermediate parity value
  4. XOR the new data value with the intermediate parity value, creating the new parity value
  5. Write the parity value with the new parity value
  6. Write the data value with the new data value

Same as before, there are two reads and two writes.
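
And here are the same read-modify-write steps in sketch form, at the bit level to match the example – the numbered steps above map directly onto the comments:

```python
data_bits = [1, 0, 1, 1]
parity_bit = 1                         # from the calculation above

new_bit = 1                            # overwrite Disk2's 0 with a 1
old_parity = parity_bit                # step 1: read the old parity value
old_bit = data_bits[1]                 # step 2: read the old data value
intermediate = old_parity ^ old_bit    # step 3: XOR out the old data ("subtract")
parity_bit = intermediate ^ new_bit    # step 4: XOR in the new data ("add")
                                       # step 5: write the new parity value
data_bits[1] = new_bit                 # step 6: write the new data value

# Sanity check: recomputing parity from scratch gives the same answer
assert parity_bit == data_bits[0] ^ data_bits[1] ^ data_bits[2] ^ data_bits[3]
```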

This is obviously a very simple example and was in no way meant to be a mathematical proof of parity recovery or XOR, but it works on any scale you choose.  Test it out!

Quick note – in the real world, strip size is not a single bit like in this example. With a 64KB strip size, that comes out to 524288 bits in a single strip.  524288 ones and zeros.  XOR functions quite simply as you add bits on, since it just compares each bit in place.  For example, 1100 XOR 1010 is 0110.  The first digit of the result is the first digit of each input XOR’d together.  The second digit is the second digits of each input XOR’d together.  And so on.  There are more detailed XOR manifestos out there as well as XOR calculators online…feel free to consult if you are interested and my explanation left you wanting.

Bitwise XOR is the mechanism for generating parity without increasing the total bit count.  Using it, RAID controllers can generate a parity strip that is identical in size to a single data strip, no matter how many data strips are being protected.  Whatever the strip size, and whatever the stripe width, this mechanism always produces parity of exactly one strip.
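
The same trick scales up to full strips.  Here’s a minimal sketch using made-up 8-byte “strips” – notice that the parity ends up exactly one strip in size no matter how many strips are XOR’d together:

```python
# Three pretend "strips" of 8 bytes each (a real strip would be 64KB, i.e. 524288 bits)
strips = [bytes([0b11000000] * 8),
          bytes([0b10100000] * 8),
          bytes([0b01101001] * 8)]

# XOR the strips together byte by byte - the parity strip never grows
parity = bytearray(len(strips[0]))
for strip in strips:
    for i, byte in enumerate(strip):
        parity[i] ^= byte
assert len(parity) == len(strips[0])   # parity is exactly one strip in size

# Rebuild the second strip by XOR'ing the survivors with the parity strip
rebuilt = bytearray(parity)
for strip in (strips[0], strips[2]):
    for i, byte in enumerate(strip):
        rebuilt[i] ^= byte
assert bytes(rebuilt) == strips[1]
```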

So what?

So what, indeed.  That was a lot to take in, I know…parity is certainly more complicated than mirroring.  Is it absolutely necessary to understand how parity works at this level?  No, not really.  But thus far I’ve never had a problem arise because I understood how something worked too deeply. I have encountered plenty of issues because I haven’t understood how something worked, or made assumptions about what something was doing behind the scenes.

When we get into some aspects of RAID5 and RAID6, understanding what parity is supposed to do will help clarify what those RAID types are useful for.  And if you don’t agree, feel free to wipe this from your memory banks and replace it with something more useful.

RAID: Part 3 – RAID 1/0

So, if you have been following along dear reader, we are now up to speed on several things.  We have discussed mirroring (and RAID1, which leverages it) and striping (and RAID0, which leverages that).  We have also discussed RAID types using some familiar and standard terminology which will allow us to compare and contrast the versions moving forward.

Now, on to the big dog of RAID – RAID 1/0.  This is called “RAID one zero” and “RAID ten,” and sometimes “RAID one plus zero” (written RAID 1+0).  I have never heard it called “RAID one slash zero,” but perhaps somebody somewhere does that too.  All of these names refer to the same thing, and “RAID ten” is the most common.

Why do we need RAID1/0?

In this section I wanted to ask a sometimes overlooked question – what are the problems with RAID0 and RAID1 that cause people to need something else?

If you know about RAID0 (or even better if you read Part 2) you should have an excellent idea of the failings of it.  Just to reiterate, the problem of RAID0 is that it only leverages striping, and striping only provides a performance enhancement.  It provides nothing in the way of protection, hence any disk that fails in a RAID0 set will invalidate the entire set. RAID0 is the ticking time bomb of the storage world.

RAID1’s problems aren’t quite as obvious as the “one disk failure = worst day ever” of RAID0, but once again let’s go back to Part 1 and look at the benefits I listed of RAID:

  1. Protection – RAID (except RAID0) provides protection against physical failures.  Does RAID1 provide that?  Absolutely – RAID1 can survive a single disk failure.  Check box checked.
  2. Capacity – RAID also provides a benefit of capacity aggregation.  Does RAID1 provide that?  Not at all.  RAID1 provides no aggregate capacity or aggregate free space benefit because there are always exactly two disks in a RAID1 pair, and the usable capacity penalty is 50%.  Whether I have a RAID1 set using a 600GB drive or a 3TB drive, I get no aggregate capacity benefit with RAID1, beyond the idea of just splitting a disk up into logical partitions…which can be done on a single disk without RAID in the first place.
  3. Performance – RAID provides a performance benefit since it is able to leverage additional physical spindles.  Does RAID1 provide that?  The answer is yes…sort of.  It does provide two spindles instead of one, which fits the established definition.  However, there are some caveats.  There isn’t a performance boost on writes because of the 2:1 write penalty (both spindles are used for every single write).  There is a performance boost on reads because the controller can effectively round-robin read requests between the two disks.  But – and this is a BIG but – there are only two spindles.  There are only ever going to be two spindles.  Unlike a RAID0 set, which can have as many disks as I’m willing to risk my data over, a RAID1 set is performance bound to exactly two spindles.

Essentially the problem with the mirrored pair is just that – there are only ever going to be two physical disks.

By now it may have become obvious, but RAID0 and RAID1 are almost polar opposites.  RAID1’s benefit lies mostly around protection, and RAID0’s benefit is performance and capacity.  RAID1 is the stoic peanut butter, and RAID0 is the delicious jelly.  If only there was a way to leverage them both….

What is RAID1/0?

RAID1/0 is everything you wanted out of RAID0 and RAID1. It is the peanut butter and jelly sandwich.  (Note: please do not attempt to combine your storage array with peanut butter or jelly.  Especially chunky peanut butter.  And even more especiallyer chunky jelly)

Essentially RAID1/0 looks like a combination of RAID1 and RAID0, hence the label.  More accurately, it is a combination of mirroring and striping in that order.  RAID1/0 replaces the individual disks of a RAID0 stripe set with RAID1 mirror pairs.  It is also important to understand what RAID1/0 is and what it is not.  It is true that it leverages the good things out of both RAID types, but it also still maintains the bad things of both RAID types. This will become apparent as we dive into it.

[Image: raid10 – an eight disk RAID1/0 configuration: four mirrored pairs striped together]

This is a busy image, but bear with me as I break it down.

  • This is an eight disk RAID1/0 configuration, and on this configuration (similar to the Part 2 examples) we are writing A,B,C,D to it. For simplicity’s sake we ignore write order and just go alphabetically
  • The orange and green help indicate what is happening at their particular parts of the diagram
  • The physical disks themselves (the black boxes) are in mirrored pairs that should hopefully be familiar by now (indicated by the green boxes and plus signs).  This is the same RAID1 config that I’ve covered previously.
  • The weirdness picks up at the orange part. The orange box indicates that we are striping across every mirrored pair.  This is also identical to the RAID0 configuration, except that the physical disks of the RAID0 config have been replaced with these RAID1 pairs.

This is what is meant by RAID1/0.  First comes RAID1 – we build mirrored pairs.  Then comes RAID0 – we stripe data across the members, which happen to be those mirrored pairs.  It may help to think about RAID1/0 as RAID0 with an added level of protection at the member level (since we know RAID0 provides no protection otherwise).

As the host writes A,B,C,D, the diagram indicates where the data will land, but let’s cover the order of operations.

  1. The host writes A to the RAID1/0 set
  2. A is intercepted by the RAID controller.  The particular strip it is targeted for is identified.
  3. The strip is recognized to be on a mirrored pair, and due to the mirror configuration the write is split.
  4. A lands on both disks that make up the first member of the RAID0 set.
  5. Once the write is confirmed on both disks, the write is acknowledged back to the host as completed
  6. The host writes B to the RAID1/0 set
  7. B is intercepted by the RAID controller.  The particular strip it is targeted for is identified.  Due to the mirror configuration the write is split.
  8. B lands on both disks that make up the second member of the RAID0 set.
  9. Once the write is confirmed on both disks, the write is acknowledged back to the host as completed
  10. The host writes C to the RAID1/0 set
  11. etc. (a quick sketch of this routing logic follows below)
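
Here’s a minimal sketch of that routing logic for the eight disk example.  It is purely illustrative – the disk and member labels and the 64KB strip size are my assumptions, not any particular controller’s implementation:

```python
STRIP_SIZE = 64 * 1024                    # assumed strip size
MIRROR_PAIRS = [("disk1", "disk2"),       # member 1 of the outer RAID0 stripe
                ("disk3", "disk4"),       # member 2
                ("disk5", "disk6"),       # member 3
                ("disk7", "disk8")]       # member 4

def route_write(offset: int) -> tuple:
    """Map a front-end write to a mirror pair: stripe to pick the member, mirror to split it."""
    strip_number = offset // STRIP_SIZE
    member = strip_number % len(MIRROR_PAIRS)   # RAID0 layer: round-robin across members
    return MIRROR_PAIRS[member]                 # RAID1 layer: the write lands on BOTH disks

# A, B, and C land on consecutive strips, so they land on consecutive mirror pairs
for name, offset in [("A", 0), ("B", 64 * 1024), ("C", 128 * 1024)]:
    print(name, "->", route_write(offset))
```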

Hopefully this gives an accurate, comprehensible version of the how’s of RAID1/0.  Now, let’s look at RAID1/0 using the same terminology we’ve been using.

From a usable capacity perspective, RAID1/0 maintains the same penalty as RAID1.  Because every member is a RAID1 pair, and every RAID1 pair has a 50% capacity penalty, it stands to reason that RAID1/0 also has a 50% capacity penalty as a whole.  No matter how many members are in a RAID1/0 group, the usable capacity penalty is always 50%.

The write penalty is a similar tune.  Because every member is a RAID1 pair, and every RAID1 pair has a 2:1 write penalty, RAID1/0 also has a write penalty of 2:1.  Again no matter how many members are in the set, the write penalty is always 2:1.

RAID1/0 reminds me of the Facts of Life. You know, you take the good, you take the bad?  RAID1/0 is a leap up from RAID0 and RAID1, but it doesn’t mean that we’ve gotten rid of their problems.  It is better to think that we’ve worked around their problems.  The same usable capacity penalty exists, but now I have the ability to aggregate capacity by putting more and more members into a RAID1/0 configuration.  The same write penalty exists, but again I can now add more spindles to the RAID1/0 configuration for a performance boost.

The protection factor is weird, but still a combination of the two.  How many disk failures can a RAID1/0 set survive?  The answer is, it depends.  There is still striping on the outer layer, and by now we have beaten the dead horse enough to know that RAID0 can’t lose any physical disks.  It is a little clearer, especially for this transition, to say that RAID0 can’t survive any member failures, and in traditional RAID0 the members are physical disks.  In this capacity, RAID1/0 is the same: RAID1/0 can’t survive any member failures.  The difference is that now a member is made up of two physical disks that are protecting each other.  So can a RAID1/0 set lose a disk and continue running?  Absolutely – RAID1/0 can always survive one physical disk failure.

…But, can it survive two?  This is where it gets questionable.  If the second disk failure is the other half of the mirrored pair, the data is toast.  Just as toast as if RAID0 had lost one physical disk since the effect is the same.  But what if it doesn’t lose that specific disk?  What if it loses a disk that is part of another RAID1 pair?  No problem, everything keeps running.  In fact, in our example, we can lose 4 disks like this and keep running:

[Image: raid10_4fails – four disks failed, one in each mirror pair, and the set keeps running]

You can lose as many as half of the disks in the RAID1/0 set and continue running, just as long as they are the right disks.  Again, if we lose two disks like this, ’tis a bad day:

[Image: raid10_2fails – both disks of a single mirror pair failed, and the set is lost]

So there are a few rules about the protection of RAID1/0:

  • RAID1/0 can always survive a single disk failure
  • RAID1/0 can survive multiple disk failures, so long as the disk failures aren’t within the same mirrored pair
  • With RAID1/0, data loss can occur with as few as two disk failures (if they are part of the same mirror pair) and is guaranteed at (n/2)+1 failures, where n is the total disk count in the RAID1/0 set.  A quick way to check any given failure scenario is sketched below.
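
Here’s that check as a minimal sketch, using the same assumed eight disk layout (four mirror pairs) from the example above:

```python
MIRROR_PAIRS = [("disk1", "disk2"), ("disk3", "disk4"),
                ("disk5", "disk6"), ("disk7", "disk8")]

def survives(failed: set) -> bool:
    """The RAID1/0 set is alive as long as no mirror pair has lost both of its disks."""
    return all(not (a in failed and b in failed) for a, b in MIRROR_PAIRS)

print(survives({"disk1"}))                             # True  - a single failure is always survivable
print(survives({"disk1", "disk3", "disk5", "disk7"}))  # True  - half the disks, but all different pairs
print(survives({"disk1", "disk2"}))                    # False - both halves of the same pair
```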

Degraded and rebuild concepts are identical to RAID1 because the striping portion provides no protection and no rebuild ability.

  • Any mirror pair in degraded mode will see a write performance increase (splitting writes no longer necessary), and potentially a read performance decrease.  Other mirror pairs continue to operate as normal
  • Any mirror pair in rebuild mode will see a heavy performance penalty.  Other mirror pairs continue to operate as normal with no performance penalty.

Why not RAID0/1?

This is one of my favorite interview questions, and if you are interviewing with me (or at places I’ve been) this might give you a free pass on at least one technical question.  I picked it up from a colleague of mine and have used it ever since.

Why not RAID0/1?  Or is there even a concept of RAID0/1?  Would it be the same as RAID1/0?

It does exist, and it is extremely similar on the surface.  The only difference is the order of operations: RAID1/0 is mirrored, then striped, and RAID0/1 is striped, then mirrored.  This seemingly minor difference in theory actually manifests as a very large difference in practice.

[Image: raid01 – two four disk RAID0 stripe sets mirrored against each other]

Most things about RAID0/1 are identical to RAID1/0 (like performance and usable capacity), with one notable exception – what happens during disk failure?

I covered the failure process of RAID1/0 above so I won’t rehash that.  For RAID0/1, remember that any failure of a RAID0 member invalidates the entire set.  So, what happens when the top left disk in RAID0/1 fails?  Yep, the entire top RAID0 set fails, and the array is now effectively running as RAID0 using only the bottom set.

This has two implications.  The most severe is that once a single disk has failed, losing any disk in the surviving stripe set takes out everything – a second failure is only survivable if it happens to land in the stripe set that has already failed, whereas RAID1/0 can ride out any second failure outside the affected mirror pair.  The other is that if a disk failed and a hot spare was available (or the bad disk was swapped out with a good disk), the rebuild involves the entire stripe set rather than just a single mirror pair.

It would be possible to design a RAID controller to get around this.  It could recognize that there is still a valid member available to continue running from in the second stripe set.  But then essentially what it is doing is trying to make RAID0/1 be like RAID1/0.  Why not just use RAID1/0 instead?  That is why RAID1/0 is a common implementation and RAID0/1 is not.

Wrap Up

In Part 4 I’m going to cover parity and hopefully RAID5 and 6, and then I’ll provide some notes to bring the entire discussion together.  However, I wanted to include some thoughts about RAID1/0 in case someone stumbled on this and had some specific questions or issues related to performance, simply because I’ve seen this a lot.

RAID1/0 performs more efficiently than other RAID types from a write perspective only.  A lot of people seem to think that RAID1/0 is “the fastest one,” and hence should always be used for performance applications.  This is demonstrably untrue.  As I’ve stated previously, there is no such thing as a read penalty for any RAID type.  If your application is entirely or mostly read oriented, using RAID1/0 instead of RAID5 or 6 does nothing but cost you money in the form of usable capacity.  And yes, there are workloads with enormous performance requirements that are 100% read.

RAID1/0 has a massive usable capacity penalty.  If you are protecting data with RAID1/0, you need to purchase twice as much raw storage as the data requires.  If you are replicating that data like-for-like, you need to purchase four times as much.  Additionally, sometimes your jumping off point locks you into a RAID type as well, so a decision to use RAID1/0 today may impact the future costs of storage.  I can’t emphasize this point enough – RAID1/0 is extremely expensive and not always needed.

I like to think of people who always demand RAID1/0 like the people who might bring a Ferrari when asked to “bring your best vehicle.”  But it turns out, I needed to tow a trailer full of concrete blocks up a mountain.  Different vehicles are the best at different things…just like RAID types.  We need to fully understand the requirements before we bring the sports car.

If you are having performance problems, or more likely someone is telling you they are having performance problems, jumping from RAID5 to RAID1/0 may not do a thing for you.  It is important to do a detailed analysis of the ENTIRE storage environment and figure out the best fit solution.  You don’t want to be the person who advocated for a couple hundred thousand dollars of storage purchases when it turns out the real problem was a host misconfiguration.

RAID: Part 2 – Mirroring and Striping

In this post I’m going to cover two of the three building blocks of RAID – mirroring and striping – along with RAID0 and RAID1, which use them.  Initially I was going to include RAID1/0, but this post got a bit too long, so I’ll break that out into its own post.

Mirroring

Mirroring is one of the more straightforward RAID patterns, and is employed in RAID1 and RAID1/0 schemes.  Mirroring is used for data protection.

Exactly as it sounds, mirroring involves pairs of disks (always pairs – exactly two) that are identical copies of each other.  In RAID1, a virtual disk is made up of two physical disks on the back end.  So as I put files on the virtual disk that my server sees, the array is essentially making two copies of them on the back end.

[Image: mirroring – a single 300GB disk on the left versus a RAID1 mirrored pair on the right]

In this scenario imagine a server is writing the alphabet to disk.  The computer host starts with A and continues through to I.  In the case of the one physical disk on the left hand side, the writes end up on the single physical disk as expected.

On the right hand side, using the RAID1 pair, the computer host writes the same information, but on the storage array the writes are mirrored to two physical disks.  This pair must have identical contents at all times in order to guarantee the consistency and protection of RAID1.

The two copies are accomplished by write splitting, generally done by the RAID controller (the brains behind RAID protection – sometimes dedicated hardware, sometimes software).  The RAID controller detects a write coming into the system, realizes that the RAID type is 1, and splits the write onto the two physical disks of the pair.

How much capacity is available in both scenarios for the hosts?  300GB.  In the RAID1 configuration, even though there is a total of 600GB of raw space, 50% of it is lost to RAID1.  Usable capacity is one of the key differentiators between the different RAID schemes.  Sometimes this is expressed in terms of usable and raw capacities – in this case there is 600GB of raw capacity, but only 300GB usable.  Again, the 50% factor comes into play.

Another important concept to start to understand is what is called the write penalty.  The write penalty of any RAID protection scheme (except RAID0, discussed below) exists because in order to protect data, the storage system must have the ability to recover or maintain it in the event of a physical failure. And in order to recover data, it must have some type of extra information.  There is no free lunch, and unfortunately no magic sauce (although some BBQ sauce comes close) (note: please do not try to combine your storage array with BBQ sauce).  If the array is going to protect data that hosts are writing, it must write additional information along with it that can be used for recovery.

Write penalties are expressed in a ratio of physical disk writes to host writes (or physical disk writes to virtual disk writes, or most accurately back-end writes to front-end writes).  RAID1 has a write penalty of 2:1.  This means that for every host write that comes in (a write coming in to the front-end of the storage array), the physical disks (on the back-end of the storage array) will see two writes.

You might be wondering, what about read penalties?  Read penalties don’t really exist in normal operations. This may be a good topic for another post but for now just take it on faith that the read penalty for every non-degraded RAID type is 1:1.

The protection factor here is pretty obvious, but let’s again discuss it using common terminology in the storage world.  RAID1 can survive a single disk failure.  By that I mean if one disk in the pair fails, the remaining good disk will continue to provide data service for both reads and writes.  If the second disk in the pair fails before the failed disk is replaced and rebuilt, the data is lost.  If the first disk is replaced and rebuilds without issue, then the pair returns to normal and can once again survive a single disk failure.  So when I say “can survive a single disk failure,” I don’t mean for the life of the RAID group – I mean at any given time assuming the RAID group is healthy.

Another important concept – what does degraded and rebuild mean to RAID1 from a performance perspective?

  • Degraded – In degraded mode, only one of the two disks exists.  So what happens for writes?  If you thought things got better, you are strangely but exactly right.  When only one disk of the pair exists, only one disk is available for writes.  The write penalty is now 1:1, so we see a performance improvement for writes (although an increased risk for data loss, since if the remaining disk dies all data is lost). There is a potential performance reduction for reads since we are only able to read from one physical disk (this can be obscured by cache, but then so can write performance).
  • Rebuild – During a rebuild, performance takes a big hit.  The existing drive must be fully mirrored onto the new drive from start to finish, meaning that the one good drive must be entirely read and the data written to the new partner.  And, because the second disk is in place, writes will typically start being split again so that pesky 2:1 penalty comes back into play.  And the disks must continue to service reads and writes from the host.  So for the duration of the rebuild, you can expect performance to suffer.  This is not unique to RAID1 – rebuild phases always negatively impact potential performance.

Striping

Striping is sometimes confused or combined with parity (which we’ll cover in RAID5 and RAID6) but it is not the same thing.  Striping is a process of writing data across several disks in sequence. RAID0 only uses striping, while the rest of the RAID types except RAID1 use striping in combination with mirroring or parity.  Striping is used to enhance performance.

[Image: striping – a single 300GB disk on the left versus three 100GB disks in a RAID0 stripe on the right]

In this example the computer system is once again writing some alphabetic characters to disk.  It is writing to Logical Block Addresses, or sectors, or blocks, or whatever you like to imagine makes up these imaginary disks.  And these must be enormous letters because apparently I can only fit nine of them on a 300GB disk!  

On the left hand side there is a single 300GB physical disk.  As the host writes these characters, they are hitting the same disk over and over.  Obvious – there is only one disk!

What is the important thing to keep in mind here?  As mentioned in Part 1, generally the physical disk is going to be the slowest thing in the data path because it is a mechanical device.  There is a physical arm inside the disk that must be positioned in order to read or write data from a sector, and there is a physical platter (metal disk) that must rotate to a specific position.  And with just one disk here, that one arm and one platter must position itself for every write. The first write is C, which must fully complete before the next write A can be accomplished.

Brief aside – why the write order column?  The write order is to clarify something about this workload.  Workload is something that is often talked about in the storage world, especially around design and performance analysis.  It describes how things are utilizing the storage, and there are a lot of aspects of it – sequential vs random, read vs write, locality of reference, data skew, and many others. In this case I’m clarifying that the workload is random, because the host is never writing to consecutive slots.  If instead I wrote the data as A, B, C, D, E, F, G, H, I, this would be sequential.  I’ll provide some more information about random vs sequential in the RAID5/6 discussion.

On the right hand side there are three 100GB disks in a RAID0 configuration.  And once again it is writing the same character set.

This time, though, the writes are being striped across three physical disks.  So the first write C hits disk one, the second write A hits disk two, and the third write E hits disk 3.  What is the advantage?   The writes can now execute in parallel as long as they aren’t hitting the same physical disk.  I don’t need for the C write to complete before I start on A.  I just need C to complete before I start on B.

How about reads?  Yep, there is increased performance here as well.  Reads can also execute in parallel, assuming the locations being read are on different physical disks.

Effectively RAID0 has increased our efficiency of processing I/O operations by a factor of three.  I/O operations per second, or IOPs as they are commonly called, is a way to measure disk performance (e.g. faster disks like SAS can process roughly double the IOPs of NLSAS or SATA disks).  And striping is a good way to bump up the IOPs a system is capable of delivering for a given virtual disk set.

This is a good time to define some terminology around striping.  I wouldn’t necessarily say this is incredibly useful, but it can be a good thing to comprehend when comparing systems because these are some of the areas where storage arrays diverge from each other.

[Image: the black boxes outlined in green represent strips; the red dotted line indicates a stripe]

  • Strip – A strip is a piece of one disk.  It is the largest “chunk” that can be written to any disk before the system moves on to the next disk in the group.  In our three disk example, a strip would be the area of one disk holding one letter.
  • Strip Size (also called stripe depth) – This is the size of a strip from a data perspective.  The size of all strips in any RAID group will always be equivalent.  On EMC VNX, this value is 64KB (some folks might balk at this having seen values of 128 – this is actually 128 blocks and a block is 512 bytes).  On VMAX this varies but (I believe) for most configurations the strip size is 256KB, and for some newer ones it is 128KB (will try to update this if/when I verify this).  A strip size of 64KB means that if I were to write 128KB starting at sector 0 of the first disk, the system would write 64KB to the disk before moving on to the next disk in the group.  And if the strip size were 128KB, the system would write the entire 128KB to disk before moving on to the next disk for the next bit of data.
  • Stripe – A stripe is a collection of strips across all disks that are “connected”, or more accurately seen as contiguous.  In our 3 disk example, if our strip size was 64KB, then the first strip on each disk would collectively form the first stripe.  The second strip on each disk would form the second stripe, and would be considered, from a logical disk perspective, to exist after the first stripe.  So the order of consecutive writes would go Stripe1-Strip1, Stripe1-Strip2, Stripe1-Strip3, Stripe2-Strip1, Stripe2-Strip2, etc.
  • Stripe Width – this is how many data disks are in a stripe.  In RAID0 this is all of them because disks only hold data, but for other RAID types this is a bit different.  In our example we have a stripe width of 3.
  • Stripe Size – This is stripe width x strip size.  So in our example if the strip size is 64KB, the stripe size is 64KB x 3 or 192KB

Note: these are what I feel are generally accepted terms.  However, these terms get mixed up A LOT.  If you are involved in a discussion around them or are reading a topic, keep in mind that what someone else is calling stripe size might not be what you are thinking it is.  For example, a 4+1 RAID5 group has five disks, but technically has a stripe width of 4.  Some people would say it has a stripe width of five.  In my writing I will always try to maintain these definitions for these terms.
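
To tie the definitions together, here’s a small sketch that maps a logical offset on the striped virtual disk to a stripe, a disk, and an offset within the strip, using the three disk, 64KB-strip example.  The naming is mine, not any particular array’s:

```python
STRIP_SIZE = 64 * 1024                   # strip size (stripe depth)
STRIPE_WIDTH = 3                         # three data disks in the RAID0 example
STRIPE_SIZE = STRIP_SIZE * STRIPE_WIDTH  # 192KB

def locate(logical_offset: int) -> dict:
    """Map an offset on the striped virtual disk to a stripe, a disk, and an offset in the strip."""
    strip_number = logical_offset // STRIP_SIZE
    return {"stripe": strip_number // STRIPE_WIDTH,
            "disk": strip_number % STRIPE_WIDTH,
            "offset_in_strip": logical_offset % STRIP_SIZE}

# A 128KB write starting at offset 0 fills the first strip on disk 0, then moves to disk 1;
# offset 192KB wraps around to the second stripe, back on disk 0.
print(locate(0))             # {'stripe': 0, 'disk': 0, 'offset_in_strip': 0}
print(locate(64 * 1024))     # {'stripe': 0, 'disk': 1, 'offset_in_strip': 0}
print(locate(192 * 1024))    # {'stripe': 1, 'disk': 0, 'offset_in_strip': 0}
```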

Since I defined some common terms above in the RAID1 section, let’s look at them again from the perspective of RAID0.

First, usable capacity.  RAID0 is unique because no protection information is written.  Because of this, there is no usable capacity penalty.  If I combine five 1TB disks in a RAID0 group, I have 5TB usable.  In RAID0, raw always equals usable.

How about write penalty?  Once again we have a unique situation on our hands.  Every front end write only hits one physical disk, so there is no write penalty – or the write penalty can be expressed as 1:1.

“Amazing!” you might be thinking.  Actually, probably not, because you have probably realized what the big issue with RAID0 is.  Just to be sure, let’s discuss protection factor.  RAID0 does not write any protection information in any way, hence it provides no protection.  This means that the failure of any member in a RAID0 group immediately and instantly invalidates the entire group (this is an important concept for later, so make sure you understand it).  If you have two disks in a RAID0 configuration and one disk fails, all data on both disks is unusable.  If you have 30 disks in a RAID0 configuration and one disk fails, all data on all 30 disks is unusable.  Any RAID0 configuration can survive zero failed disks.  If you have a physical failure in RAID0, you better have a backup somewhere else to restore from.

How about the degraded and rebuild concepts?  Good news everyone!  No need to worry ourselves with these concepts because neither of these things will ever happen.  A degraded RAID0 group is a dead RAID0 group.  And a rebuild is not possible because RAID0 does not write information that allows for recovery.

So, why do we care about RAID0?  For the most part, we don’t.  If you run RAID0 through the googler, you’ll find it is discussed a lot for home computer performance and benchmarking.  It is used quite infrequently in enterprise contexts because the performance benefit is outweighed by the enormous protection penalty.  The only places I’ve seen it used are for things like local tempdb for SQL server (note: I’m not a DBA and haven’t even played one on TV, but this is still generally a bad idea.  TempDB failure doesn’t affect your data, but I believe it does cause SQL to stop running…). 

We do care about RAID0, and more specifically the striping concept, because it is used in every other RAID type we will discuss.  To say that RAID0 doesn’t protect data isn’t really fair to it.  It is more accurate to say that striping is used to enhance performance, and not to protect data.  And it happens that RAID0 only uses striping.  It’s doing what it is designed to do.  Poor RAID0.

Neither of these RAID types are used very often in the world of enterprise storage, and in the next post I’ll explain why as I cover RAID1/0.

RAID: Part 1 – General Overview

For my first foray into the tech blogging world, I wanted to have a discussion on the simple yet incredibly complex subject of RAID.  Part 1 will not be technical, and instead hopefully provide some good footing on which to build.

For the purposes of this discussion I’m only going to focus on RAID 0, 1, 1/0 (called “RAID one zero” or more commonly “RAID ten”), 5, and 6.  These are generally the most common RAID types in use today, and the ones available for use on an EMC VNX platform.  Newcomers may feel daunted by the many types of RAID…I know I was. I spent some time memorizing a one line definition of what they mean. While this may be handy for a job interview, a far more valuable use of time would be to memorize how they work!  You can always print out a summary and hang it next to your desk.

I’ve found RAID to be one of the more interesting topics in the storage world because it seems to be one of the more misunderstood, or at least not fully understood, concepts – yet it is probably one of the most widely used.  Almost every storage array uses RAID in some form or another.  Often I deal with questions like:

  • Why don’t we just use RAID 1/0 since it is the fastest?
  • Why don’t I just want to throw all my disks into one big storage pool?
  • RAID6 for NLSAS is a good suggestion, but RAID5 isn’t too much different right?
  • RAID6 gives two disk failure protection, why would anyone use RAID5 instead?
  • Isn’t RAID6 too slow for anything other than backups?

Most of these questions really just stem from not understanding the purpose of RAID and how the types work.

In this post we’ll tackle the most basic of questions – what does RAID do, and why would I want to use RAID?

What does RAID do?

RAID is an acronym for Redundant Array of Independent (used to be Inexpensive, but not so much anymore) Disks.  The easiest way to think of RAID is a group of disks that are combined together into one virtual disk.

[Image: raid_example – five 200GB disks grouped into one 1000GB virtual disk]

If I had five 200GB disks, and “RAIDed” them together, it would be like I had one 1000GB disk.  I could then allocate capacity from that 1000GB disk.

Why would I want to use RAID?

RAID serves at least three purposes – protection, capacity, and performance.

Protection from Physical Failures

With the exception of RAID 0 (I’ll discuss the types later), the other RAID versions listed will protect you against at least one disk failure in the group.  In other words, if a hard drive suffers a physical failure, not only can you continue running (possibly with a performance impact), but you won’t lose any data.  A RAID group that has suffered a failure but is continuing to run is generally known as degraded.  What this means is a little different for each type so we’ll cover those details later.  When the failed disk is replaced with a functional disk, some type of rebuild operation will commence, and when complete the RAID group will return to normal status without issue.

Most enterprise storage arrays, and many enterprise servers, allow you to implement what is commonly known as a hot spare.  A hot spare is a disk that is running in a system, but not currently in use.  The idea behind a hot spare is to reduce restore time.  If a disk fails and you have to:

  1. Wait for a human to recognize the failure
  2. Open a service request for a replacement
  3. Wait for the replacement to be shipped
  4. Have someone physically replace the disk

That is potentially a long period of time that I am running in degraded mode. Hence the hot spare concept. With a hot spare in the system, when the disk fails, a spare is instantly available and rebuild starts.  Once the rebuild is finished, the RAID group returns to normal.  The failed disk is no longer a part of any active RAID group, and itself can be seen as a spare, unused disk in the system (though obviously not a hot spare because it is failed!).  Eventually it will be replaced, but because it isn’t involved in data service there is less of a critical business need to replace it.

An important and sometimes hazy concept, especially with desktops, is that RAID only protects you against physical failures.  It does not protect you against logical corruption.  As a simple example, if I protect your computer’s hard drives with RAID1 and one of those drives dies, you are protected. If instead you accidentally delete a critical file, RAID will do nothing for you.  In this situation, you need to be able to recover the file through the file system if possible, or restore from a backup.  There are a lot of types of logical corruption, and rest assured that RAID will not protect you from any of them.

Capacity

There are two capacity related benefits to RAID.  Note that there is generally also a capacity penalty that comes along with RAID, but we will discuss that when we get into the types.

Aggregated Usable Capacity

Continuing the example above with the five 200GB disks, if you were to come ask me for storage, without RAID the most I could give you would be a 200GB disk.  I might be able to give you multiple 200GB disks, and you might be able to combine those through a volume manager, but as a storage admin I could only present you one 200GB disk.

What if you need a terabyte of space?  I’d have to give you all five separate disks, and then you’d have to do some volume management on your end to put them together.

With RAID, I can assemble those together on the back end as a virtual device, and present it as one contiguous address space to a host.  As an example, 2TB datastores are fairly common in ESX, and I would venture to say a lot of those datastores run on disk drives much smaller than 2TB.  Maybe it is a 10 or 20 disk 600GB SAS pool, and we have allocated two TB out of that for the ESX datastore.

Aggregated Free Space

Think about the hard drive in your computer.  It is likely that you’ve got some amount of free capacity on it.  Let’s say you have a 500GB hard drive with 200GB of free space.

Now let’s think about five computers with the same configuration.  500GB hard drives, 200GB free on each.  This means that we are not using 1000GB of space overall, but because it is dedicated to each individual computer, we can’t do anything with it.

[Image: freespace – five computers, each with a 500GB drive and 200GB of stranded free space]

If instead we took those 500GB hard drives and grouped them, we could then have a sum total of 2500GB to work with and hand out.  Now perhaps it doesn’t make sense to give all users 300GB of capacity, since that is what they are using and they would be out of space…but perhaps we could give them 400GB instead.

Now we’ve allocated (also commonly known as “carving”) five 400GB virtual disks (also commonly known as LUNs) out of our 2500GB pool, leaving us 500GB of free space to work with.  Essentially by pooling the resources, we’ve gained enough room to carve out another virtual disk without adding another physical drive.

Performance

Performance of disk based storage is largely based on how many physical spindles are backing it (this changes with EFD and large cache models, but that is for another discussion).  A hard drive is a mechanical device, and is generally the slowest thing in the data path.  Ergo, the more I can spread your data request (and all data requests) out over a bunch of hard drives, the more performance I’m going to be able to leverage.

If you need 200GB of storage and I give you one 200GB physical disk, that is one physical spindle  backing your storage.  You are going to be severely limited on how much performance you can squeeze out of that hard drive.

If instead I allocate your 200GB of space out of a RAID group or pool, now I can give you a little bit of space on multiple disks.  Now your virtual disk storage is backed by many physical spindles, and in turn you will get a lot more performance out of it.

It should be said that this blog is focused on enterprise storage arrays, but some of the benefits listed above apply to any RAID controller, even one in a server or workstation.  The aggregated free space, and in most scenarios the performance benefit, only apply to shared storage arrays.

Hopefully this was a good high level introduction to the why’s of RAID.  In the next post I will cover the how’s of RAID 1 and 0.