Shedding Light on Storage Encryption

I’ve been noticing some fundamental misunderstandings around storage encryption – I see this most when dealing with XtremIO although plenty of platforms support it (VNX2 and VMAX).  I hope this blog post will help someone who is missing the bigger picture and maybe make a better decision based on tradeoffs.  This is not going to be a heavily technical post, but is intended to shed some light on the topic from a strategic angle.

Hopefully you already know, but encryption at a high level is a way to make data unreadable gibberish except by an entity that is authorized to read it.  The types of storage encryption I’m going to talk about are Data At Rest Encryption (often abbreviated DARE or D@RE), in-flight encryption, and host-based encryption.  I’m talking in this post mainly about SAN (block) storage, but these concepts also apply to NAS (file) storage.  In fact, in-flight encryption is probably way more useful on a NAS array given the inherent security of FC fabrics.  But then iSCSI enters the picture, and things get cloudier.

Before I start, remember that security is a tool, and like any tool it can be used wisely or poorly.  Encryption is security, and not all security (or all encryption) is automatically a good thing.  Consider the idea of cryptographic erasure, by which data is effectively “deleted” merely because it is encrypted and nobody has the key.  Ransomware thrives on this: you are looking at a server with all your files on it, but without the key they may as well be deleted.  Choosing a security feature for no better business reason than “security is great” is probably a mistake that is going to cause you headaches.

[Image: encryptionblogpic]

Here is a diagram with 3 zones of encryption.  Notice that host-based encryption overlaps the other two – that is not a mistake as we will see shortly.

Data At Rest Encryption

D@RE these days typically refers to a storage array’s ability to encrypt data at the point of entry (write) and decrypt it on exit (read).  Sometimes this is done with ASICs on an array or I/O module, but it is often done with Self Encrypting Drives (SEDs).  However, the abstract concept of D@RE is simply that data is encrypted “at rest” – that is, while it is sitting on disk on the storage array.

This might seem like a dumb question, but it is a CRUCIAL one that I’ve seen either not asked or answered incorrectly time and time again: what is the purpose of D@RE?  The point of D@RE is to prevent physical hardware theft from compromising data security.  So, if I nefariously steal a drive out of your array, or a shelf of drives out of your array, and come up with some way to attach them to another system and read them, I will get nothing but gibberish.

Now, keep in mind that this problem is typically far more of an issue on a small server system than it is on a storage array.  A small server might have just a handful of drives associated with it, while a storage array might have hundreds, or thousands.  And those drives are going to be in some form of RAID protection which leverages striping.  So even without D@RE, the odds of a single disk holding meaningful data are small, though admittedly the risk is still there.

More to the point, D@RE does not prevent anyone from accessing data on the array itself.  I’ve heard allusions to this idea that “don’t worry about hackers, we’ve got D@RE” which couldn’t be more wrong, unless you think hackers are walking out of your data center with physical hardware.  If the hackers are intercepting wire transmissions, or they have broken into servers with SAN access, they have access to your data.  And if your array is doing the encryption and someone manages to steal the entire array (controllers and all) they will also have access to your data.

D@RE at the array level is also one of the easiest to deal with from a management perspective, because usually you just let the array handle everything, including the encryption keys.  This is mostly just a “turn it on and let it run” solution.  You don’t notice it and generally don’t see any fallout from it, like performance degradation.

In-Flight Encryption

In-flight encryption is referring to data being encrypted over the wire.  So your host issues a write to a SAN LUN, and that traverses your SAN network and lands on your storage array.  If data is encrypted “in-flight,” then it is encrypted throughout (at least) the switching.

Usually this is accomplished with FC fabric switches that are capable of encryption.  So the switch that sees a transmission on an F port will encrypt it, and then transmit it encrypted along all E ports (ISLs) and then decrypt it when it leaves another F port.  So the data is encrypted in-flight, but not at rest on the array.  Generally we are still talking about ASICs here so performance is not impacted.

Again let’s ask, what is the purpose of in-flight encryption?  In-flight encryption is intended to prevent someone who is sniffing network traffic (meaning they are somehow intercepting the data transmissions, or a copy of the data transmissions, over the network) from being able to decipher data.

For local FC networks this is (in my opinion) not often needed.  FC networks tend to be very secure overall and not really vulnerable to sniffing.  However, for IP based or WAN based communication, or even stretched fabrics, it might be sensible to look into something like this.

Also keep in mind that because data is decrypted before being written to the array, in-flight encryption does not provide the physical security that D@RE does, nor does it prevent anyone from accessing data in general.  That said, you sometimes have the option of not decrypting when writing to the array.  In that case the data is encrypted when leaving the host and written encrypted on the array itself; it is only decrypted when the host issues a read for it and it exits the F port that host is attached to.  This effectively gives you D@RE as well, with those same benefits.  The kicker here is key management.  Plain in-flight encryption can be disabled at any time without issue, because the data at either end is unencrypted.  But if the data is written encrypted on the array, then you MUST have those keys to read it.  If you had some kind of disaster that compromised your switches and keys, you would be left with a big array full of cryptographically erased data.

Host Based Encryption

Finally, host-based encryption is any software or feature that encrypts LUNs or files on the server itself.  So data that is going to be written to files (whether SAN based or local files) is encrypted in memory before the write actually takes place.
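
As a concrete (and heavily hedged) illustration, LUKS/dm-crypt on Linux is one common form of host-based encryption.  The device name, mapping name, and mount point below are assumptions for the sketch, not a recommendation of any particular product:

# assumes /dev/sdb is a SAN LUN presented to this host (hypothetical device name)
cryptsetup luksFormat /dev/sdb            # initialize the LUN with a passphrase/key
cryptsetup open /dev/sdb secure_lun       # unlock it as /dev/mapper/secure_lun
mkfs.ext4 /dev/mapper/secure_lun          # the file system sits on top of the encrypted mapping
mount /dev/mapper/secure_lun /mnt/secure
# from here on, every write is encrypted in host memory before it touches the fabric or the array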

Host-based encryption ends up giving you both in-flight encryption and D@RE as well.  So when we ask the question, what is the purpose of host-based encryption?, we get the benefits we saw from in-flight and D@RE, as well as another one.  That is the idea that even with the same hardware setup, no other host can read your data.  So if I were to forklift your array, fabric switches, and get an identical server (hardware, OS, software) and hook it up, I wouldn’t be able to read your data.  Depending on the setup, if a hacker compromises the server itself in your data center, they may not be able to read the data either.

So why even bother with the other kinds of encryption?  Well for one, generally host-based encryption does incur a performance hit because it isn’t using ASICs.  Some systems might be able to handle this but many won’t be able to.  Unlike D@RE or in-flight, there will be a measurable degradation when using this method.  Another reason is that key management again becomes huge here.  Poor key management and a server having a hardware failure can lead to that data being unreadable by anyone.  And generally your backups will be useless in this situation as well because you have backups of encrypted data that you can’t read without the original keys.

And frankly, usually D@RE is good enough.  If you have a security issue where host-based encryption is going to be a benefit, usually someone already has the keys to the kingdom in your environment.

Closing Thoughts

Hopefully that cleared up the types of encryption and where they operate.

Another question I see is “can I use one or more at the same time?”  The answer is yes, with caveats.  There is nothing that prevents you from using all three at the same time, even though it wouldn’t really make much sense.  Generally you want to avoid overlapping, because you are encrypting data that is already encrypted, which is a waste of resources.  So a sensible pairing might be D@RE on the array and in-flight on your switching.

A final HUGELY important note – and what really prompted me to write this post – is to make sure you fully understand the effect of encryption on all of your systems.  I have seen this come up in a discussion about XtremIO using D@RE paired with host-based encryption.  The question was “will it work?” but the question should have been “should we do this?”

Will it work?  Sure – there is nothing problematic about host-based encryption and XtremIO D@RE interacting, other than the XtremIO system encrypting already encrypted data.  What is problematic is the fact that encrypted data does not compress, and most encrypted data won’t dedupe either…or at least not anywhere close to the level of unencrypted data.  XtremIO generally relies on its fantastic inline compression and dedupe features to fit a lot of data on a small footprint.  XtremIO’s D@RE happens behind the compression and deduplication, so there is no issue there.  Host-based encryption, however, happens ahead of the dedupe/compression and will absolutely destroy your savings.

So if you wanted to use the system like this, I would ask: how was it sized?  Was it sized with assumptions about good compression and dedupe ratios?  Or was it sized assuming no space savings?  And does the extra money you will spend on the host-based encryption product, plus the extra money you will spend on the additional required storage, justify the business problem you were trying to solve?  Or was there even a business problem at all?  A better fit would probably be something like a tiered VNX2 with FAST Cache, which could easily handle a lot of raw capacity and use the flash where it helps the most.

Again, security is a tool, so choose the tools you need, use them judiciously, and make sure you fully understand their impact (end-to-end) in your environment.

SAN vs NAS Part 5: Summary

We’ve covered a lot of information over this series, some of it more easily consumable than others.  Hopefully it has been a good walkthrough of the main differences between SAN and NAS storage, and presented in a little different way than you may have seen in the past.

I wanted to summarize the high points before focusing on a few key issues:

  • SAN storage is fundamentally block I/O, which is SCSI.  With SAN storage, your local machine “sees” something that it thinks is a locally attached disk.  In this case your local machine manages the file system, and transmissions to the array are simple SCSI requests.
  • NAS storage is file I/O, which is either NFS or CIFS.  With NAS storage, your local machine “sees” a service to connect to on the network that provides file storage.  The array manages the file system, and transmissions to the array are protocol specific file based operations.
  • SAN and NAS have different strengths, weaknesses, and use cases
  • SAN and NAS are very different from a hardware and protocol perspective
  • SAN and NAS are sometimes only offered on specific array platforms

Our Question

So back to our question that started this mess: with thin provisioned block storage, if I delete a ton of data out of a LUN, why do I not see any space returned on the storage array?  We know now that this is because there is no such thing as a delete in the SAN/block/SCSI world.  Thin provisioning works by allocating storage as you need it, generally because you tried to write to it.  However, once that storage has been allocated (once that chunk of the “disk” has been created), the array only sees reads and writes, not creates and deletes.  It has no way of knowing that you sent over a bunch of writes that were intended to be a delete.  The deletes are related to the file system, which is being managed by your server, not the array.  The LUN itself is below the file system layer, and is that same disk address space filled with data that we’ve been discussing.  Deletes don’t exist on SAN storage, apart from administratively deleting an entire object – LUN, RAID set, Pool, etc.

With NAS storage on the other hand, the array does manage the file system.  You tell it when to delete something by sending it a delete command via NFS or CIFS, so it certainly knows that you want to delete it.  In this manner file systems allocations on NAS devices usually fluctuate in capacity.  They may be using 50GB out of 100GB today, but only 35GB out of 100GB tomorrow.

Note: there are ways to reclaim space either on the array side with thin reclamation (if it is supported), or on the host side with the SCSI UNMAP commands (if it is supported).  Both of these methods will allow you to reclaim some/all of the deleted space on a block array, but they have to be run as a separate operation from the delete itself.  It is not a true “delete” operation but may result in less storage allocated.
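
As a hedged example of what that separate operation can look like from a Linux host – assuming the whole stack (file system, multipathing, HBA, and array) actually supports UNMAP, and assuming a made-up mount point:

sudo fstrim -v /mnt/mylun        # tells the array which blocks the file system no longer uses
systemctl status fstrim.timer    # many distros ship a periodic version of this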

Which Is Better?

Yep, get out your battle gear and let’s duke it out!  Which is better?  SAN vs NAS!  Block vs File!  Pistols at high noon!

Unfortunately as engineers a lot of times we focus on this “something must be the best” idea.

Hopefully if you’ve read this whole thing you realize how silly this question is, for the most part.  SAN and NAS storage occupy different areas and cover different functions.  Most things that need NAS functionality (many access points and permissions control) don’t care about SAN functionality (block level operations and utilities), and vice versa.  This question is kind of like asking which is better, a toaster or a door stop?  Well, do you need to toast some delicious bread or do you need to stop a delicious door?

In some cases there is overlap.  For example, vSphere datastores can be accessed over block protocols or NAS (NFS).  In this case what is best is most often going to be – what is the best fit in the environment?

  • What kind of hardware do you have (or what kind of budget do you have)?
  • What kind of admins do you have and what are their skillsets?
  • What kind of functionality do you need?
  • What else in the environment needs storage (i.e. does something else need SAN storage or NFS storage)?
  • Do you have a need for RDMs (LUNs mapped directly from the array in order to expose some of the SCSI functionality)?

From a performance perspective 10Gb NFS and 10Gb iSCSI are going to do about the same for you, and honestly you probably won’t hit the limits of those anyway.  These other questions are far more critical.

Which leads me to…

What Do I Need?

A pretty frequently asked question in the consulting world – what do I need, NAS or SAN?  This is a great question to ask and to think about but again it goes back to what do you need to do?

Do you have a lot of user files that you need remote access to?  Windows profiles or home directories?  Then you probably need NAS.

Do you have a lot of database servers, especially ones that utilize clustering?  Then you probably need SAN.

Truthfully, most organizations need some of both – the real question is in what amounts.  This will vary for every organization but hopefully armed with some of the information in this blog series you are closer to making that choice for your situation.

SAN vs NAS Part 4: The Layer Cake

Last post we covered the differences between NFS and iSCSI (NAS and SAN) and determined that we saw a different set of commands when interacting with a file.  The NFS write generated an OPEN command, while the iSCSI write did not.  In this post we’ll cover the layering of NAS (file or file systems) on top of SAN (SCSI or block systems) and how that interaction works.

Please note!  In modern computing systems there are MANY other layers than I’m going to talk about here.  This isn’t to say that they don’t exist or aren’t important, but just that we are focusing on a subset of them for clarity.  Hopefully.

First, take a look at the NFS commands listed here: https://tools.ietf.org/html/rfc1813

[Image: nfscommands]

Notice that a lot of these commands reference files, and things that you would do with files like read and write, but also create, remove, rename, etc.

Compare this with the SCSI reference: http://www.t10.org/lists/op-alph.htm

Notice that in the SCSI case, we still have read and write, but there is no mention of files (other than “filemarks”).  There is no way to delete a file with SCSI – because again we are working with a block device which is a layer below the file system.  There is no way to delete a file because there is no file.  Only addresses where data is stored.
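
You can see this “addresses only” view from any Linux host: reading a raw block device hands you bytes at an offset and nothing more.  The device name and offset below are arbitrary assumptions for illustration:

# read one 512-byte block at LBA 2048 from a hypothetical block device
# no file names, no directories, no file-level permissions -- just an address and whatever bytes live there
sudo dd if=/dev/sdb bs=512 skip=2048 count=1 2>/dev/null | xxd | head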

As a potentially clumsy analogy (like I often wield!) think about your office desk.  If it’s anything like mine, there is a lot of junk in the drawers.  File storage is like the stuff in a drawer.  The space in a drawer can have a lot of stuff in it, or it can have a little bit of stuff in it.  If I add more stuff to the drawer, it gets more full.  If I take stuff out of the drawer, it gets less full.  There is meaning to how much stuff is in an individual drawer as a relation to how much more stuff I can put in the drawer.

Block storage, on the other hand, is like the desk itself.  There are locations to store things – the drawers.  However, whether I have stuff in a drawer or I don’t have stuff in a drawer, the drawer still exists.  Emptying out my desk entirely doesn’t cause my desk to vanish.  Or at least, I suspect it wouldn’t…I have never had an empty desk in my life.  There is no relationship to the contents of the drawers and the space the desk occupies.  The desk is a fixed entity.  An empty drawer is still a drawer.

To further solidify this file vs block comparison, take a look at this handsome piece of artwork depicting the layers:

[Image: fsvisio_1]

Here is a representation of two files on my computer, a word doc and a kitty vid, and their relationship to the block data on disk.  Note that some disk areas have nothing pointing to them – these are empty but still zero filled (well…maybe, depending on how you formatted the disk).  In other words, these areas still exist!  They still have contents, even if that content is nothing.

When I query a file, like an open or read, it traverses the file system down to the disk level.  Now I’m going to delete the word doc.  In most cases, this is what is going to happen:

[Image: fsvisio_2]

My document is gone as far as I can “see.”  If I try to query the file system (like looking in the directory it was stored in) it is gone.  However, on the disk it still exists.  (Fun fact: this is how “undelete” utilities work – by restoring data that is still on disk but no longer has pointers from the file system.)  It isn’t really relevant that it is still on the disk, because from the system’s perspective (and the file system’s perspective) it doesn’t exist any more.  If I want to re-use that space, the system will see it as free and store something else there, like another hilarious kitten video.
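
If you want to see this for yourself, here is a rough sketch on Linux – assuming an ext4 file system on a hypothetical /dev/sdb1, and bearing in mind that journaling, delayed allocation, and discard settings can all change the outcome:

echo "hello world" > /mnt/test/doc.txt && sync
filefrag -v /mnt/test/doc.txt        # note the physical block the file maps to
rm /mnt/test/doc.txt && sync         # the file system forgets the mapping...
PHYS_BLOCK=123456                    # hypothetical: substitute the block filefrag reported
# ...but the data may still be sitting at that address on the block device
sudo dd if=/dev/sdb1 bs=4096 skip=$PHYS_BLOCK count=1 2>/dev/null | strings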

Sometimes this will happen instead, either as you delete something (rarely) or later as a garbage collection process:

[Image: fsvisio_3]

The document data has been erased and replaced with zeros.  (Fun fact: this is how “file shredder” programs work – by writing zeros (or a pattern) once (or multiple times) to the space that isn’t being actively used by files.)  Now the data is truly gone, but from the disk perspective it still isn’t really relevant because something still occupies that space.  From the disk’s perspective, something always occupies that space, whether it is kitty video data, document data, or zeros.  The file system (the map) is what makes that data relevant to the system.

This is a really high level example, but notice the difference in the file system level and the disk level.  When I delete that file, whether the actual disk blocks are scrubbed or left intact, the block device remains the same except for the configuration of the 1’s and 0’s.  All available addresses are still in place.  Are we getting closer to understanding our initial question?

Let’s move this example out a bit and take a look at an EMC VNX system from a NAS perspective.  This is a great example because there are both SAN/block (fibre channel) and NAS/file (cifs/nfs) at the same time.  The connections look like this:

[Image: dm1]

From my desktop, I connect via NFS to an interface on the NAS (the datamover) in order to access my files.  And the datamover has a fibre channel connection to the block storage controllers which is where the data is actually stored.  The datamover consumes block storage LUNs, formats them with appropriate file systems, and then uses that space to serve out NAS.  This ends up being quite similar to the layered file/disk example above when we were looking at a locally hosted file system and disk.

What does it look like when I read and write?  Simply like this:

[Image: DM2]

My desktop issues a read or write via NFS, which hits the NAS, and the NAS then issues a read or write via SCSI over Fibre Channel to the storage processor.

Reads and writes are supported by SCSI, but what happens when I try to do something to a file like open or delete?

[Image: DM3]

The same command conversion happens, but it is just straight reads and writes at the SCSI level.  It doesn’t matter whether the NAS is SAN attached like this one, or it just has standard locally attached disks.  This is always what’s going to happen because the block protocol and subsystems don’t work with files – only with data in addresses.

By understanding this layering – what file systems (NAS) do vs what disks (SAN) do – you can better understand important things about their utility.  For instance, file systems have various methods to guarantee consistency, in spite of leveraging buffers in volatile memory.  If you own the file system, you know who is accessing data and how.  You have visibility into the control structure.  If the array has no visibility there, then it can’t truly guarantee consistency.  This is why e.g. block array snapshots and file array snapshots are often handled differently.  With NAS snapshots, the array controls the buffers and can easily guarantee consistent snapshots.  But for a block snapshot, the array can only take a picture of the disk right now regardless of what is happening in the file system.  It may end up with an inconsistent image on disk, unless you initiate the snapshot from the attached server and properly quiesce/clean the file system.

Back to the idea of control: because NAS systems manage the file side of things, they also have a direct understanding of who is trying to access what.  Not only does this give them the ability to provide some access control (unlike SAN, which just responds happily to any address requests it gets), it also explains why NAS is often ideal for multi-access situations.  If I have users trying to access the same share (or better yet, the same file), NAS storage is typically the answer because it knows who has what open.  It can manage things on that level.  For the SAN, not so much.  In fact if you want two hosts to access the same storage, you need to have some type of clustering (whether direct software or file system) that provides locks and checks.  Otherwise you are pretty much guaranteed some kind of data corruption as things are reading and writing over top of one another.  Remember, SAN and SCSI just let you read and write to addresses; they don’t provide the ability to open and own a file.

In part 5 I’ll provide a summary review and then some final thoughts as well.

SAN vs NAS Part 3: File Systems

In the last blog post, we asked a question: “who has the file system?”  This will be important in our understanding of the distinction between SAN and NAS storage.

First, what is a file system?  Simply (see edit below!), a file system is a way of logically sorting and addressing raw data.  If you were to look at the raw contents of a disk, it would look like a jumbled mess.  This is because there is no real structure to it.  The file system is what provides the map.  It lets you know that block 005A and block 98FF are both the first parts of your text file that reads “hello world.”  But on disk it is just a bunch of 1’s and 0’s in seemingly random order.

Edit: Maybe I should have chosen a better phrase like “At an extremely basic level” instead of “Simply.” 🙂 As @Obdurodon pointed out in the comments below, file systems are a lot more than a map, especially these days.  They help manage consistency and help enable cool features like snapshots and deduplication.  But for the purposes of this post this map functionality is what we are focusing on as this is the relationship between the file system and the disk itself.

File systems allow you to do things beyond just reads and writes.  Because they form files out of data, they let you do things like open, close, create, and delete.  They allow you the ability to keep track of where your data is located automatically.

(note: there are a variety of file systems depending on the platform you are working with, including FAT, NTFS, HFS, UXFS, EXT3, EXT4, and many more.  They have a lot of factors that distinguish them from one another, and sometimes have different real world applications.  For the purposes of this blog series we don’t really care about these details.)

Because SAN storage can be thought of as a locally attached disk, the same applies here.  The SAN storage itself is a jumbled mess, and the file system (data map) is managed by the host operating system.  Similar to your local C: drive in your windows laptop, your OS puts down a file system and manages the location of the block data.  Your system knows and manages the file system so it interacts with the storage array at a block level with SCSI commands, below the file system itself.

With NAS storage on the other hand, even though it may appear the same as a local disk, the file system is actually not managed by your computer – or more accurately the machine the export/share is mounted on.  The file system is managed by the storage array that is serving out the data.  There is a network service running that allows you to connect to and interact with it.  But because that remote array manages the file system, your local system doesn’t.  You send commands to it, but not SCSI commands.

With SAN storage, your server itself manages the file system and with NAS storage the remote array manages the file system.  Big deal, right?  This actually has a MAJOR impact on functionality.

I set up a small virtual lab using VirtualBox with a CentOS server running an NFS export and an iSCSI target (my remote server), and an Ubuntu desktop to use as the local system.  After jumping through a few hoops, I got everything connected up.  All commands below are run and all screenshots are taken from the Ubuntu desktop.

I’ll also take a moment to mention how awesome Linux is for these types of things.  It took some effort to get things configured, but it was absolutely free to set up an NFS/iSCSI server and a desktop to connect to it.  I’ve said it before but will say it again – learn your way around Linux and use it for testing!

So remember, who has the file system?  Note that with the iSCSI LUN, I got a raw block device (a.k.a. a disk) presented from the server to my desktop.  I had to create a partition and then format it with EXT4 before I could mount it.  With the NFS export, I just mounted it immediately – no muss no fuss.  That’s because the file system is actually on the server, not on my desktop.

Now, if I were to unmount the iSCSI LUN and then mount it up again (or on a different linux desktop) I wouldn’t need to lay down a file system but that is only because it has already been done once.  With SAN storage I have to put down a file system on the computer it is attached to the first time it is used, always.  With NAS storage, there is no such need because the file system is already in place on the remote server or array.
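
To make that concrete, here is roughly what the first-time setup looked like on the Ubuntu desktop for each – device names, IPs, and paths are lab-specific assumptions:

# iSCSI LUN: arrives as a raw block device, so I have to lay down the file system myself
sudo fdisk /dev/sdb                      # create a partition (interactive)
sudo mkfs.ext4 /dev/sdb1                 # format it -- this is where the file system gets created
sudo mount /dev/sdb1 /home/me/myiscsi

# NFS export: the file system already lives on the server, so I just mount it
sudo mount -t nfs 192.168.56.101:/exports/share /home/me/mynfs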

Let’s dive in and look at the similarities and differences depending on where the file system is.

Strace

First let’s take a look at strace.  strace is a utility that exposes some of the ‘behind the scenes’ activity when you execute commands on the box.  Let’s run this command against a data write via a simple redirect:

strace -vv -Tt -f -o traceout.txt echo "hello world" > testfile

Essentially we are running strace with a slew of flags against the command [ echo "hello world" > testfile ].  Here is a screenshot of the relevant portion of both outputs when I ran the command with testfile located on the NFS export vs the local disk.

[Screenshot: strace]

Okay there is a lot of cryptic info on those pics, but notice that in both cases the write looks identical.  The “things” that are happening in each screenshot look the same.  This is a good example of how local and remote I/O “appears” the same, even at a pretty deep level.  You don’t need to specify that you are reading or writing to a NAS export, the system knows what the final destination is and makes the necessary arrangements.

Dstat

Let’s try another method – dstat.  Dstat is a good utility for seeing the types of I/O running through your system.  And since this is a lab system, I know it is more or less dead unless I’m actively doing something on it.

I’m going to run a large stream of writes (again, simple redirection) in various locations (one location at a time!) while I have dstat running in order to see the differences.  The command I’m using is:

for i in {1..100000}; do echo $i > myout; done

With myout located in different spots depending on what I’m testing.
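
(The screenshots don’t show the exact dstat invocation; something along the lines of the command below, which prints CPU, disk, and network columns on an interval, is what I mean – treat the flags as an approximation.)

dstat -cdn 5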

For starters, I ran it against the local disk:

[Screenshot: localdisk_dstat]

Note the two columns in the center indicating “dsk” traffic (I/O to a block device) and “net” traffic (I/O across the network interfaces).  You can think of the “dsk” traffic as SCSI traffic.  Not surprisingly, we have no meaningful network traffic, but sustained block traffic.  This makes sense since we are writing to the local disk.

Next, I targeted it at the NFS export.

[Screenshot: nfs_dstat]

A little different this time, as even though I’m writing to a file that appears in the filesystem of my local machine (~/mynfs/myout) there is no block I/O.  Instead we’ve got a slew of network traffic.  Again this makes sense because as I explained even though the file “appears” to be mine, it is actually the remote server’s.

Finally, here are writes targeted at the iSCSI LUN.

[Screenshot: iscsi_dstat]

Quite interesting, yes?  We have BOTH block and network traffic.  Again this makes sense.  The LUN itself is attached as a block device, which generates block I/O.  However, iSCSI traffic travels over IP, which hits my network interfaces.  The numbers are a little skewed since the block I/O on the left is actually included in the network I/O on the right.

So we are able to see that something is different depending on where my I/O is targeted, but let’s dig even deeper.  It’s time to…

WIRESHARK!

For this example, I’m going to run a redirect with cat:

cat > testfile

hello world

ctrl+c

This is simply going to write “hello world” into testfile.

After firing up wireshark and making all the necessary arrangements to capture traffic on the interface that I’m using as an iSCSI initiator, I’m ready to roll.  This will allow me to capture network traffic between my desktop and server.

Here are the results:

[Screenshot: iscsi_write]

There is a lot of stuff on this pic as expected, but notice the write command itself.  It is targeted at a specific LBA, just as if it were a local disk that I’m writing to.  And we get a response from the server that the write was successful.

Here is another iSCSI screenshot.

[Screenshot: iscsi_write2]

I’ve highlighted the write and you can see my “hello world” in the payload.  Notice all the commands I highlighted with “SCSI” in them.  It is clear that this is a block level interaction with SCSI commands, sent over IP.  Note also that in both screenshots, there is no file interaction.
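
If you would rather skip the Wireshark GUI, tshark display filters give you the same view from the command line – the interface names here are assumptions from my lab:

sudo tshark -i enp0s8 -Y "iscsi"    # expect SCSI reads/writes addressed to LBAs, no file operations
sudo tshark -i enp0s9 -Y "nfs"      # expect protocol-level file operations, as we're about to see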

Now let’s take a look at the NFS export on my test server.  Again I’m firing up wireshark and we’ll do the same capture operation on the interface I’m using for NFS.  I’m using the same command as before.

[Screenshot: nfscap_write]

Here is the NFS write command with my data.  There are standard networking headers and my hello world is buried in the payload.  Not much difference from iSCSI, right?

The difference is a few packets before:

[Screenshot: nfscap_open]

We’ve got an OPEN command!  I attempt to open the file “testfile” and the server responds to my request like a good little server.  This is VERY different from iSCSI!  With iSCSI we never had to open anything, we simply sent a write request for a specific Logical Block Address.  With iSCSI, the file itself is opened by the OS because the OS manages the file system.  With NFS, I have to send an OPEN to the NAS in order to discover the file handle, because my desktop has no idea what is going on with the file system.

This is, I would argue, THE most important distinction between SAN and NAS and hopefully I’ve demonstrated it well enough to be understandable.  SAN traffic is SCSI block commands, while NAS traffic is protocol-specific file operations.  There is also some overlap here (like read and write), but these are still different entities with different targets.  We’ll take a look at the protocols and continue discussing the layering effect of file systems in Part 4.

SAN vs NAS Part 2: Hardware, Protocols, and Platforms, Oh My!

In this post we are going to explore some of the various options for SAN and NAS.

SAN

There are a couple of methods and protocols for accessing SAN storage.  One is Fibre Channel (note: this is not misspelled, the protocol is Fibre, the cables are fiber) where SCSI commands are encapsulated within Fibre Channel frames.  This may be direct Fibre Channel (“FC”) over a Fibre Channel fabric, or Fibre Channel over Ethernet (“FCoE”) which further encapsulates Fibre Channel frames inside ethernet.

With direct Fibre Channel you’ll need some FC Host Bus Adapters (HBAs), and probably some FC switches like Cisco MDS or Brocade (unless you plan on direct attaching a host to an array which most of the time is a Bad Idea).

With FCoE you’ll be operating on an ethernet network typically using Converged Network Adapters (CNAs).  Depending on the type of fabric you are building, the array side may still be direct FC, or it may be FCoE as well.  Cisco UCS is a good example of the split out, as generally it goes from host to Fabric Interconnect as FCoE, and then from Fabric Interconnect to array or FC switch as direct Fibre Channel.

It could also be accessed via iSCSI, which encapsulates SCSI commands within IP over a standard network.  And then there are some other odd mediums like infiniband, or direct attach via SAS (here we are kind of straying away from the SAN and are really just directly attaching disks, but I digress).

What kind of SAN you use depends largely on the scale and type of your infrastructure.  Generally if you already have FC infrastructure, you’ll stay FC.  If you don’t have anything yet, you may go iSCSI.  Larger and performance-sensitive environments typically trend toward FC, while small shops trend toward iSCSI.  That isn’t to say that one is necessarily better than the other – they have their own positives and negatives.  For example, FC has its own learning curve with fabric management like zoning, while iSCSI connections are just point to point over existing networks that someone probably already knows.  The one thing I will caution against here is if you are going for iSCSI, watch out for 1Gb configurations – there is not a lot of bandwidth and the network can get choked VERY quickly.  I personally prefer FC because I know it well and trust its stability, but again there are positives and negatives.

Back to the subject at hand – in all cases with SAN the recurring theme here is SCSI commands.  In other words, even though the “disk” might be a virtual LUN on an array 10 feet (or 10 miles) away, the computer is treating it like a local disk and sending SCSI disk commands to it.

Some array platforms are SAN only, like the EMC VMAX 10K, 20K, 40K series.  EMC XtremIO is another example of a SAN only platform.  And then there are non-EMC platforms like 3PAR, Hitachi, and IBM XIV.  Other platforms are unified, meaning they do both SAN and NAS.  EMC VNX is a good example of a unified array.  NetApp is another competitor in this space.  Just be aware that if you have a SAN only array, you can’t do NAS…and if you have a NAS only array (yes they exist, see below), you can’t do SAN.  Although some “NAS” arrays also support iSCSI…I’d say most of the time this should be avoided unless absolutely necessary.

NAS

NAS on the other hand is virtually always over an IP network.  This is going to use standard ethernet adapters (1Gb or 10Gb) and standard ethernet switches and IP routers.

As far as protocols there is CIFS, which is generally used for Windows, and NFS which is generally used on the Linux/Unix/vSphere side.  CIFS has a lot of tie-ins with Active Directory, so if you are a windows shop with an AD infrastructure, it is pretty easy to leverage your existing groups for permissions.  NFS doesn’t have these same ties with AD, but does support NIS for some authentication services.

The common theme on this side of the house is “file” which can be interpreted as “file system.”  With CIFS, generally you are going to connect to a “share” on the array, like \\MYARRAY1\MYAWESOMESHARE.  This may be just through a file browser for a one time connection, or this may be mounted as a drive letter via the Map Network Drive feature.  Note that even though it is mounted as a drive letter, it is still not the same as an actual local disk or SAN attached LUN!
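
The same share can also be mounted from a Linux box with cifs-utils; a hedged sketch, with the server, share, and credentials as placeholders:

# roughly the Linux equivalent of Map Network Drive (on Windows: net use Z: \\MYARRAY1\MYAWESOMESHARE)
sudo mount -t cifs //MYARRAY1/MYAWESOMESHARE /mnt/myshare -o username=myuser,domain=MYDOMAIN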

For NFS, an “export” will be configured on the array and then mounted on your computer.  This actually gets mounted within your file system.  So you may have your home directory in /users/myself, and you create a directory “backups” and mount an export to it doing something like mount -t nfs 172.0.0.10:/exports/backups /users/myself/backups.  Then you access any files just as you would any other ones on your computer.  Again note that even though the NFS export is mounted within your file system, it is still not the same as an actual local disk or SAN attached LUN!

Which type of NAS protocol you use is generally determined by the majority of your infrastructure – whether it is Windows or *nix.  Or you may run both at once!  Running and managing both NFS and CIFS is really more of a hurdle with understanding the protocols (and sometimes licensing both of them on your storage array), whereas the choice to run both FC and iSCSI has hardware caveats.

For NAS platforms, we again look to the unified storage like EMC VNX.  There are also NAS gateways that can be attached to a VMAX for NAS services.  EMC also has a NAS only platform called Isilon.

One thing to note: if your array doesn’t support NAS (say you have a VMAX or XtremIO), the gateway solution is definitely viable and enables some awesome features, but it is also pretty easy to spin up a Windows/Linux VM – or use a Windows/Linux physical server (but seriously, please virtualize!) – that consumes array block storage and then serves up NAS itself.  So you could create a Windows file server on the VMAX and then all your NAS clients would connect to the Windows machine.
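
A hedged sketch of that roll-your-own approach on a Linux VM backed by array block storage – the device, path, network, and service name are placeholders and vary by distro:

# the VM sees a SAN LUN as a local disk, puts a file system on it, and serves it out as NAS
sudo mkfs.xfs /dev/sdc && sudo mount /dev/sdc /export/data
echo "/export/data 10.0.0.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra
sudo systemctl enable --now nfs-server    # on some distros the unit is nfs-kernel-server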

The reverse is not really true…if your array doesn’t support SAN, it is difficult to wedge SAN into the environment.  You can always do NFS with vSphere, but if you need block storage you should really purchase some infrastructure for it.  iSCSI is a relatively simple thing to insert into an existing environment, just again beware 1Gb bandwidth.

Protection

One final note I wanted to mention is about protection.  There are methods for replicating file and block data, but many times these are different mechanisms, or at least they function in different ways.  For instance, EMC RecoverPoint is a block replication solution.  EMC VNX Replicator is a file replication solution.  RP won’t protect your file data (unless you franken-config it to replicate your file LUNs), and Replicator won’t protect your block data.  NAS supports NDMP while SAN generally does not.  Some solutions, like NetApp snapshots, do function on both file and block volumes, but they are still very different in how they are taken and restored…block snapshots should be initiated from the host the LUN is mounted to (in order to avoid disastrous implications regarding host buffers and file system consistency) while file snapshots can be taken from any old place you please.

I say all this just to say, be certain you understand how your SAN and NAS data is going to be protected before you lay down the $$$ for a new frame!  It would be a real bummer to find out you can’t protect your file data with RecoverPoint after the fact.  Hopefully your pre-sales folks have you covered here but again be SURE!

And……..

We’ve drawn a lot of clear distinctions between SAN and NAS, which kind of fall back into the “bullet point” message that I talked about in my first post.  All that is well and good, but here is where the confusion starts to set in: in both NAS cases (CIFS and NFS), on your computer the remote array may appear to be a disk.  It may look like a local hard drive, or even appear very similar to a SAN LUN. This leads some people to think that they are the same, or at least are doing the same things.  I mean, after all, they even have the same letters in the acronym!

However, your computer never issues SCSI commands to a NAS.  Instead it issues commands to the remote file server for things like create, delete, read, write, etc.  Then the remote file server issues SCSI (block) commands to its disks in order to make those requests happen.

In fact, a major point of understanding here is, “who has the file system?”  This will help you understand who can do what with the data.  In the next post we are going to dive into this question head first in a linux lab environment.

SAN vs NAS Part 1: Intro

Welcome to the New Year!

I wanted to write a blog post on a very confusing storage topic (at least for myself) but I have also been searching for another large scale topic similar to the set I wrote on RAID last year.  After thinking about it I feel like my confusing question is really just a subset of a misunderstanding about block storage.  So without further ado, I’m going to write up a pretty detailed break down of SAN (Storage Area Networks), or block storage, vs NAS (Network Attached Storage), or file storage.  This is another topic, like RAID, that is fundamental and basic but not always fully understood.

Certainly there are other write ups on this topic out there, and in ways this can be summed up in just a few bullet points.  But I think a larger discussion will really help solidify understanding.

The specific confusing question I’ll ask and hopefully answer is, with thin provisioned block storage, if I delete a ton of data out of a LUN, why do I not see any space returned on the storage array?  Say I’ve got a thin 1TB LUN on my VMAX, and it is currently using (allocated) 500GB of space.  I go to the server where this LUN is attached and delete 300GB of data.  Querying the VMAX, I still see 500GB of space used.

This concept is hard to understand and I’ve not only asked this question myself, I’ve fielded it from several people in a variety of roles.  Central to understanding this concept is understanding the difference between file and block storage.

To start out, let’s briefly define the nature of things about file and block storage.

SAN – Block Storage

The easiest way to think of SAN is a disk drive directly attached to a computer.  Block storage access is no different from plugging in a USB drive, or installing another hard drive into the server, as far as how the server accesses it.  The medium for accessing it over your SAN varies with protocols and hardware, but at the end of the day you’ve got a disk drive (block device) to perform I/O with.

NAS – File Storage

The idea with NAS is that you are accessing files stored on a file server somewhere.  So I have a computer system in the corner that has a network service running on it, and my computer on my desk connects to that system.  Generally this connection is going to be CIFS (for Windows) or NFS (for *nix/vSphere).  The file protocol here varies but we are (most of the time) going to be running over IP.  And yes, sometimes Linux folks access CIFS shares and sometimes Windows folks do NFS, but these are exceptions to the rule.

In part 2, I’ll be covering more of the differences and similarities between these guys.

RAID: Part 6 – WrapUp

Finally the end – what a long, wordy trip it has been.  If you waded through all 5 posts, awesome!

As a final post, I wanted to attempt to bring all of the high points together and draw some contrasts between the RAID types I’ve discussed.  My goal with this post is less about the technical minutia and more about providing some strong direction to equip readers to make informed decisions.

Does Any of This Matter?

I always spend some time asking myself this question as I dive further and further down the rabbit hole on topics like this.  It is certainly possible that you can interact with storage and not understand details about RAID.  However I am a firm believer that you should understand it.  RAID is the foundation on which everything is built.  It is used in almost every storage platform out there.  It dictates behavior.  Making a smart choice here can save you money or waste it.  It can improve storage performance or cripple it.

I also like the idea that understanding the building blocks can later empower you to understand even more concepts.  For instance, if you’ve read through this you understand about mirroring, striping, and parity.  Pop quiz: what would a RAID5/0 look like?

[Image: raid50]

Pretty neat that even without me describing it in detail, you can understand a lot about how this RAID type would function.  You’d know the failure capabilities and the write penalties of the individual RAID5 members.  And you’d know that the configuration couldn’t survive a failure of either RAID5 set because of the top level striping configuration.  And let’s say that I told you the strip size of the RAID5 group was 64KB, and that the strip size of the RAID0 config was 256MB.  Believe it or not, this is a pretty accurate description of a 10 disk VNX2 storage pool from a single tier RAID5 perspective.
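
If you want to poke at a layout like this yourself, Linux mdadm will happily build one – a hedged sketch with made-up device names (and far smaller RAID5 members than the VNX2 pool example):

# two RAID5 sets...
sudo mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
sudo mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sde /dev/sdf /dev/sdg
# ...striped together (RAID0) into a RAID5/0
sudo mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md1 /dev/md2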

Again to me this is part of the value – when fancy new things come out, the fundamental building blocks are often the same.  If you understand the functionality of the building block, then you can extrapolate functionality of many things.  And if I give you a new storage widget to look at, you’ll instantly understand certain things about it based on the underlying RAID configuration.  It puts you in a much better position than just memorizing that RAID5 is “parity.”

Okay, I’m off my soapbox!

Workload – Read

  • RAID1/0 – Great
  • RAID5 – Great
  • RAID6 – Great

I’ve probably hammered this home by now, but when we are looking at largely read workloads (or just the read portion of any workload) the RAID type is mostly irrelevant from a performance perspective in non-degraded mode.  But as with any blanket statement, there are caveats.  Here are some things to keep in mind.

  • Your read performance will depend almost entirely on the underlying disk configuration (ignoring sequential reads and prefetching).  I’m not talking about the obvious flash vs NLSAS; I’m talking about RAID group sizing.  As a general statement I can say that RAID1/0 performs identically to RAID5 for pure read workloads, but an 8 disk RAID1/0 is going to outperform a 4+1 RAID5.
  • Ask the question and do tests to confirm: does your storage platform round robin reads between mirror pairs in RAID1/0?  If not (and not all controllers do), your RAID1/0 read performance is going to be constrained to half of the spindles.  From the previous bullet point, our 8 disk RAID1/0 would be outperformed by a 4+1 disk RAID5 in reads because only 4 of the 8 spindles are actually servicing read requests.

Workload – Write

  • RAID1/0 – Great (write penalty of 2)
  • RAID5 – Okay (write penalty of 4)
  • RAID6 – Bad (write penalty of 6)

Writes are where the RAID types start to diverge pretty dramatically due to the vastly different write penalties between them.  Yet once again sometimes people draw the wrong conclusion from the general idea that RAID1/0 is more efficient at writes than RAID6.

  • The underlying disk structure is still dramatically important.  A lot of people seem to focus on “workload isolation,” meaning e.g. with a database that I would put the data on RAID5 and the transaction logs on RAID1/0.  This is a great idea from a design perspective starting with a blank slate.  However, what if my RAID5 disk pool I’m working with is 200 disks and I only have 4 disks for RAID1/0?  In this case I’m pretty much a lock to have better success dropping logs into the RAID5 pool because there are WAY more spindles to support the I/O.  There are a lot of variables here about the workload, but the point I’m trying to make is you should take a look at all the parts as a whole when making these decisions.
  • If your write workload is large block sequential, take a look at RAID5 or RAID6 over RAID1/0 – you will typically see much more efficient I/O in these cases.  However, make sure you do proper analysis and don’t end up with heavy small block random writes on RAID6.

Going back and re-reading some of my previous posts, I feel like I may have given the impression that I don’t like RAID1/0.  Or that I don’t see value in RAID1/0.  That is certainly not the case and I wanted to draw an example to show when you need to use RAID1/0 without question.  That example is when we see a “lot” of small block random writes and don’t need excessive amounts of capacity.  What is a “lot”?  Good question.  Typically the breaking point is around 30-40% write ratio.

Given that a SAS drive should only be allowed to support around 180 IOPs, let’s crunch some numbers for an imaginary 10,000 front end IOPs workload. How many spindles do we need to support the workload at specific read/write ratios?  (I will do another blog post on the specifics of these calculations)

Read/Write Ratio    RAID1/0 disk count    RAID5 disk count    RAID6 disk count
90% / 10%                   62                    73                  78
75% / 25%                   70                    98                 123
60% / 40%                   78                   125                 167

So, at lighter write percentages, the difference in the RAID type doesn’t matter as much.  But as we already learned, RAID1/0 is the most efficient at back end writes, and this gets incredibly apparent at the 60/40 split.  In fact, I need over twice the amount of spindles if I choose RAID6 instead of RAID1/0 to support the workload.  Twice the amount of hardware up front, and then twice the amount of power suckers and heat producers sitting in your data center for years.
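
I’ll save the full methodology for that future post, but the core arithmetic behind the table is just front end reads plus write-penalty-multiplied writes, divided by per-disk IOPs.  A rough sketch (the 180 IOPs per spindle figure is the same assumption as above, and simple rounding may land a disk or two away from the table):

FRONT=10000; PER_DISK=180
for SPLIT in 90 75 60; do
  READS=$(( FRONT * SPLIT / 100 )); WRITES=$(( FRONT - READS ))
  for PENALTY in 2 4 6; do                             # RAID1/0, RAID5, RAID6
    BACKEND=$(( READS + WRITES * PENALTY ))            # back end IOPs the spindles must absorb
    DISKS=$(( (BACKEND + PER_DISK - 1) / PER_DISK ))   # round up to whole disks
    echo "${SPLIT}/$(( 100 - SPLIT )) read/write, penalty ${PENALTY}: ${BACKEND} back end IOPs -> ${DISKS} disks"
  done
done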

Capacity Factor

  • RAID1/0 – Bad (50% penalty)
  • RAID5 – Great (generally ~20% penalty or less)
  • RAID6 – Great (generally ~25% penalty or less)

Capacity is a pretty straightforward thing so I’m not going to belabor the point – you need some amount of capacity and you can very quickly calculate how many disks you need of the different RAID types.

  • You can get more or less capacity out of RAID5 or 6 by adjusting RAID group size, though remember the protection caveats.
  • Remember that in some cases (for instance, storage pools on an EMC VNX) a choice of RAID type today locks you in on that pool forever.  By this I mean to say, if someone else talks you into RAID1/0 today and it isn’t needed, not only is it needlessly expensive today, but as you add storage capacity to that pool it is needlessly expensive for years.

Protection Factor

  • RAID1/0 – Lottery! (meaning, there is a lot of random chance here)
  • RAID5 – Good
  • RAID6 – Great

As we’ve discussed, the types vary in protection factor as well.

  • Because of RAID1/0’s lottery factor on losing the 2nd disk, the only thing we can state for certain is that RAID1/0 and RAID6 are better than RAID5 from a protection standpoint.  By that I mean, it is entirely possible that the 2nd simultaneous disk failure will invalidate a RAID1/0 set if it is the exact right disk, but there is a chance that it won’t.  For RAID5, a 2nd simultaneous failure will invalidate the set every time.
  • Remember that RAID1/0 is much better behaved in a degraded and rebuild scenario than RAID5 or 6.  If you are planning on squeezing every ounce of performance out of your storage while it is healthy and can’t stand any performance hit, RAID1/0 is probably a better choice.  Although I will say that I don’t recommend running a production environment like this!
  • You can squeeze extra capacity out of RAID5 and 6 by increasing the RAID group size, but keep it within sane limits.  Don’t forget the extra trouble you can have from a fault domain and degraded/rebuild standpoint as the RAID group size gets larger.
  • Finally, remember that RAID is not a substitute for backups.  RAID will do the best it can to protect you from physical failures, but it has limits and does nothing to protect you from logical corruption.

Summary

I think I’ve established that there are a lot of factors to consider when choosing a RAID type.  At the end of the day, you want to satisfy requirements while saving money.  In that vein, here are some summary thoughts.

If you have a very transactional database, or are looking into VDI, RAID1/0 is probably going to be very appealing from a cost perspective because these workloads tend to be IOPs constrained with a heavy write percentage.  On the other hand, less transactional databases, application, and file storage tend to be capacity constrained with a low write percentage.  In these cases RAID5 or 6 are going to look better.

In general the following RAID types are a good fit in the following disk tiers, for the following reasons:

  • EFD (a.k.a. Flash or SSD) – RAID5.  Response time here is not really an issue, instead you want to squeeze as much capacity as possible out of them for use, ’cause these puppies are pricey!  RAID5 does that for us.
  • SAS (a.k.a. FC) – RAID5 or RAID1/0.  The choice here hinges on write percentage.  RAID6 on these guys is typically a waste of space and added write penalty.  They rebuild fast enough that RAID5 is acceptable.  Note – as these disks get larger and larger this may shift towards RAID1/0 or RAID6 due to rebuild times or even UBEs, but these disks are actually enterprise grade and have an exponentially lower UBE rate.
  • NLSAS (a.k.a. SATA) – RAID6.  Please use RAID6 for these disks.  As previously stated, they need the added protection of the extra parity, and you should be able to justify the cost.

Again, this is just in general, and I can’t overstate the need for solid analysis.

Hopefully this has been accurate and useful. I really enjoyed writing this up and hope to continue producing useful (and accurate!) material in the future.

RAID: Part 5 – RAID5 and RAID6

Now that the parity post is out of the way, we can move into RAID5 and RAID6 configurations.  The good news for anyone who actually plodded through the parity post is that we’ve essentially already covered RAID5!  RAID5 is striping with single parity protection, generated on each row of data, exactly like my example.  Because of that I’ll be writing this post assuming you’ve read the parity post (or at least understand the concepts).

RAID5

Actually, from the parity post not only have we covered RAID5…we also covered most of our criteria for RAID type analysis.  Sneaky!

Before continuing on, let me make a quick point about RAID5 (note: this also applies to RAID6) group size.  In our example we did 4+1 RAID5. X+1  is standard notation for RAID5, meaning X data disks and 1 parity disk (…kind of – I’ll clarify later regarding distributed parity) but there is no reason it has to be 4+1.  There is a lower limit on single parity schemes, and that is three disks (since if you had two disks you would just do mirroring) which would be 2+1.  There is no upper bound on RAID5 group size, though I will discuss this nuance in the protection factor section.  I could theoretically have a 200+1 RAID5 set.  On an EMC VNX system, the upper bound of a RAID5 group is a system limitation of 16 disks, meaning we can go as high as 15+1.  The more standard sizes for storage pools are 4+1 and the newer 8+1.

That said, let’s talk about usable capacity.  RAID5 differs from RAID1/0 in that the usable capacity penalty is directly dependent on how many disks are in the group.  I’ve explained that in RAID5, for every stripe, exactly one strip must be dedicated to parity.  Scaled out to the disk level, this translates into one whole disk’s worth of parity in the group.  In the 4+1 case our capacity penalty is 20% (1 out of 5 disks is used for parity).  Here are the capacity penalties for the schemes I just listed:

  • 2+1 – 33% (this is the worst case scenario, and still better than the 50% of RAID 1/0)
  • 4+1 – 20%
  • 8+1 – 11%
  • 15+1 – 6.25%

So as we add more data disks into a RAID5 group our usable capacity penalty goes down, but is always better than RAID1/0.
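If you want to run the numbers for other group sizes, the capacity penalty math fits in a couple of lines of Python (a quick sketch using the X+P notation from this post – P is 1 for RAID5, and 2 for the RAID6 groups covered later on):

```python
def parity_capacity_penalty(data_disks, parity_disks=1):
    """Fraction of raw capacity lost to parity in an X+P RAID group."""
    return parity_disks / (data_disks + parity_disks)

for x, p in ((2, 1), (4, 1), (8, 1), (15, 1), (6, 2)):
    print(f"{x}+{p}: {parity_capacity_penalty(x, p):.2%} capacity penalty")
```

Compare that with the flat 50% of RAID1/0 and the appeal of wider parity groups is obvious – at least until you factor in the protection and rebuild concerns below.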

Protection factor?  After the parity post we know and understand why RAID5 can survive a single drive failure.  Let’s talk about degraded and rebuild.

  • Degraded mode – Degraded on RAID5 isn’t too pretty.  We have lost a single disk but are still running because of our parity bits.  For a read request coming in to the failed disk, the system must rebuild that data in memory.  We know that process – every remaining disk must be read in order to generate that data.  For a write request coming into the failed disk, the system must first rebuild the old data in memory (again reading every remaining disk), then recalculate parity and write the new parity value to disk.  The one exception to the write condition is if in a given stripe we have lost the parity strip instead of a data strip.  In this case we get a performance increase because the data is just written to whatever data strip it is destined for with no regard to parity recalculation.  However this teensy performance increase is HEAVILY outweighed by the I/O crushing penalty going on all around it.
  • Rebuild mode – Rebuild is also ugly.  The replacement disk must be rebuilt, which means that every bit of data on every remaining drive must be read in order to calculate what the replacement disk looks like.  And all the while, for incoming reads it is still operating in degraded mode.  Depending on controller design, writes can typically be sent to the new disk – but we still have to update parity.

Protection factor aside, the performance hit from degraded mode is why hot spares are tremendously important to RAID5. You want to spend as little time as possible in degraded mode.

Circling back to usable capacity, why do I want smaller RAID groups?  If I have 50 disks, why would I want to do ten 4+1’s instead of one 49+1?  Why waste 10 times the space on parity?  The answer is two-fold.

First, related to the single drive failure issue, the 49+1 presents a much larger fault domain.  In English, a fault domain is a set of things that are tied to each other for functionality.  Think of it like links in a chain: if one link fails, the entire chain fails (well, a chain used in an analogy like this one does, anyway).  With 49+1, I can lose at most one drive out of 50 at any time and keep running.  With ten 4+1’s, I can lose up to 10 drives as long as they come out of different RAID groups.  It is certainly possible that I lose two disks in one 4+1 group and that group is dead, but the likelihood of it happening within a given set of 5 disks is lower than within a set of 50 disks.  The trade-off here is that as we add more disks to our RAID group, we gain usable capacity but increase our risk of a two drive failure causing data loss.

Second, related to the Degraded and Rebuild issues, the more drives I have, the more pieces of data I must read in order to construct data during a failure.  If I have 4+1 and lose a disk, for every read that comes into the system I have to read four disks to generate that data.  But with a 49+1 if I lose a disk, now I have to read forty-nine disks in order to generate that data!  As I add more disks to a RAID5 set, Degraded and Rebuild operations become more taxing on the storage array.

On to write penalty!  In the parity post I explained that any write to existing data causes the original data and parity to be read, some calculations (which happen so fast they aren’t relevant) and then the new data and new parity must be written to disk.  So the write penalty in this case is 4:1.  Four I/O operations for each write coming into the system.  Interestingly enough, this doesn’t scale with RAID group size.  Whether a 2+1 or  200+1, the write penalty is always 4:1 for single parity schemes.

Full Stripe Writes

RAID1/0 has a 2:1 write penalty, and RAID5 has a 4:1 write penalty.  Does this mean that writes to RAID1/0 are always more efficient than RAID5?  Not necessarily.  There is a special case for writes to parity called Full Stripe Writes (FSWs).  A FSW is a special case that typically happens with large block sequential writes (like backup operations).  In this case we are writing such a large amount of data that we actually overwrite one entire stripe.  E.g. in our 4+1 scenario, if the strip size was 64KB and we wrote 256KB of data starting at the first disk, we would end our write at the end of the stripe.  In this case, we have no need to do a parity update because every bit of data that we are protecting with the parity is getting overwritten.  Because of this, we can actually just calculate parity in memory (since we already have the entire stripe’s data in memory) and write the entire stripe at once.

The payback is enormous here, because we only have one extra write for every four writes coming into the system.  In the 4+1 that we described, this translates into a write penalty of 5:4.  This is actually a big improvement even over RAID1/0!

FSWs are not something to hope for when choosing a RAID type.  They are very dependent on the application behavior, file system alignment, and I/O pattern.  Modern storage arrays enable this behavior more often because they hold data in protected cache before flushing to disk, but choosing RAID5 for something that is heavily write oriented and simply hoping that you will get the 5:4 write penalty would be very foolish.  However, if you do your homework you can usually figure out if it is happening or not.  As a simple example, if I was dumping large backups onto a storage array, I would almost always choose RAID5 or RAID6 because this generally will leverage FSWs.
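To generalize the arithmetic: a full stripe write costs X+P back-end writes for X strips of incoming data, so the effective penalty is (X+P):X.  A tiny sketch (the group sizes are just examples):

```python
def fsw_write_penalty(data_disks, parity_disks=1):
    """Back-end writes per front-end strip when an entire stripe is overwritten at once."""
    return (data_disks + parity_disks) / data_disks

print(fsw_write_penalty(4))       # 4+1 RAID5 -> 1.25, the 5:4 described above
print(fsw_write_penalty(8))       # 8+1 RAID5 -> 1.125, i.e. 9:8
print(fsw_write_penalty(6, 2))    # 6+2 RAID6 -> ~1.33, i.e. 8:6
```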

RAID6

RAID6 is striping with dual parity protection.  Essentially most of what we know about RAID5 applies, except that in any given stripe instead of one parity value there are two.  What this allows us to do is to recover in the event that we lose two drives.  RAID6 can survive two drive failures.

In order for this to work, there is a catch: the second parity value must actually be different from the first.  If the second parity value were just a copy of the first, it wouldn’t buy us anything for data recovery.  Another catch is that the 2nd parity value can’t use the first parity value in its calculation…otherwise the 2nd parity value is dependent on the first, and in a recovery scenario we run into a bit of a storage-array-and-the-egg problem.  Not what we want.

In the parity post I declared my undying love for XOR, and to prove to the rest of you doubters that it is just as amazing as I made it out to be – the 2nd parity value also uses XOR!  It is just too efficient to pass up.  But obviously we must XOR some different data values together.  RAID6’s second parity actually comes from diagonal stripes.

Offhand you might be imagining something like this:

wrongr6

As the helpful text indicates, not so much.  Why not, though?  We satisfied both of our criteria – the 2nd parity bit is different than the first, and it doesn’t include it either.

From a protection standpoint, this probably works but we pay a couple of performance penalties.  First and foremost, we lose the ability to do FSWs.  In order to do a full stripe write with this scheme, I have to essentially overwrite every single disk at one time.  Not gonna happen.  Second, in recovery scenarios my protection information is tied to more strips than RAID5.  I have a set of horizontal strips for one parity value and then another set of diagonal strips for the 2nd parity strip.

Instead, remember that we are working with an ordered set of 1’s and 0’s in every strip, so really the 2nd parity bit is calculated like:

rightr6

It is a strange, strange thing, but essentially the parity is calculated (or should be calculated) within the same stripe using different bits in each strip.

For a more comprehensive and probably more clear look into the hows of RAID6 (including recovery methodology), EMC’s old whitepaper on it is still a great resource.  I really encourage you to check it out if you need some more detail or explanation, or just want to read a different perspective on it.  https://www.emc.com/collateral/hardware/white-papers/h2891-clariion-raid-6.pdf  Their diagrams are much more informative than mine, although they have very few kittens in them from what I’ve seen so far.

On to our other criteria – the degraded and rebuild modes are pretty much the same as RAID5 except that we may have to read one additional parity disk during the operation.  In other words, degraded and rebuild modes are not pleasant with RAID6.  Make sure you have hotspares to get you out of both as fast as possible.

Usable capacity – the penalty is calculated similarly to RAID5, just with X+2 notation. So e.g. a 6+2 RAID6 would have a 2/8 (two out of eight disks used for parity) penalty, or 25%.  Just like RAID5, this value depends on the size of the group itself, with a technical minimum of four drives.  I say technical because RAID6 schemes are usually implemented to protect a large number of disks – instead of two data and two parity disks, why not just do a 2+2 RAID1/0?  Ahh, variety.

Finally, write penalty.  Because every time I write data I have to update two parity values, there is a 6:1 write penalty with RAID6.  The update operation is once again the same as RAID5 except the second parity value must be read, new parity calculated, and new parity written.

RAID6 can utilize FSWs as discussed above, but if it doesn’t, write operations are taxed HEAVILY with the 6:1 write penalty.  RAID6 has its place, but if you are trying to support small block random writes, it is probably advisable to steer clear.  Again there is no such thing as read penalty, so from a read perspective it performs identically to all other RAID types given the same number of disks in the group.

Distributed vs Dedicated Parity

Briefly I wanted to mention something about parity and the RAID notation like 4+1.  We “think” of this as “4 data disks, one parity disk” which makes sense from a capacity perspective.  In practice, this is called dedicated parity…and it’s not such a good idea.

Every write that comes into the system generates 4 back-end I/Os.  Two of those I/Os are slated for the strip that the data is on, and the other two hit the parity strip.  Were we to stack all the parity strips up on one disk (as we would with a dedicated parity disk), what do you think that would look like under any serious write load?

You could roast marshmallows on the parity disk

The parity disk has a lot of potential to become a bottleneck.  Instead, RAID5 and 6 implementations use what is called distributed parity in order to provide better I/O balancing.

distributedparity

In this manner, the parity load for the RAID group is distributed evenly across the disks.  Now, does this guarantee even balance?  Nope.  If I hit the top stripe hard, the top parity strip on Disk1 is still going to cook.  But under normal write load with small enough strip size, this provides a much needed load balance.
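To make the rotation a bit more concrete, here is a toy sketch of a distributed parity layout (the simple round-robin rule is just illustrative – real implementations such as left-symmetric RAID5 also rotate where the data strips land):

```python
def parity_disk_per_stripe(disks_in_group, stripes):
    """For each stripe, return the index of the disk holding that stripe's parity strip."""
    return [stripe % disks_in_group for stripe in range(stripes)]

# A 4+1 group (5 disks), first 10 stripes: parity rotates across all five members,
# so under an even write load the parity update I/O is spread across the whole group.
print(parity_disk_per_stripe(5, 10))   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```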

Not all protection schemes use distributed parity – NetApp’s RAID-DP is a good example of this.  But in cases where parity is not distributed, there must be some other mechanism to alleviate the parity load…otherwise the parity disk is going to be a massive bottleneck.

Uncorrectable Bit Errors

Finally, I wanted to mention Uncorrectable Bit Errors and their impact on RAID5 vs RAID6.  If you check out the whitepaper from EMC above, you’ll see a reference to uncorrectable errors.  You can also google this topic – here is a good paper on it.

An uncorrectable error is one that happens on a disk and renders the data for that particular sector unrecoverable.  The error rate is measured in errors per bit read.  Many consumer grade drives are 1 error per 10^15 bits (~113TB) read, and enterprise grade drives are 1/10^16 (~1.1PB). Generally the larger capacity drives (NL-SAS) are actually consumer grade from this standpoint.

During normal operations with RAID protection a UBE is OK because we have recovery information built into the RAID scheme.  But in a RAID5 rebuild scenario, a UBE is instant death for the RAID group.  Remember we have to be able to reconstruct that failed disk in its entirety, and in order to do that we have to read every bit of data off of every other disk in the group.

So consider that 3TB capacity drives are going to exhibit a UBE every ~113TB of data read, giving a run through the entire disk an approximately 2.5% chance of winning the lottery.  Then consider that your RAID5 group is probably going to have at least four or five of these guys in it.
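You can put rough numbers on that risk yourself.  Here is a quick sketch, assuming the consumer-class rate of one error per 10^15 bits and a simple Poisson approximation (errors treated as independent – a simplification, but it shows the shape of the problem):

```python
import math

def rebuild_ube_probability(disk_tb, disks_read, ube_rate_bits=1e15):
    """Chance of hitting at least one UBE while reading every bit of 'disks_read' drives."""
    bits_read = disks_read * disk_tb * 1e12 * 8          # decimal TB -> bits
    return 1 - math.exp(-bits_read / ube_rate_bits)      # Poisson approximation

print(f"one 3TB disk:      {rebuild_ube_probability(3, 1):.1%}")   # ~2.4% - the lottery odds above
print(f"4+1 RAID5 rebuild: {rebuild_ube_probability(3, 4):.1%}")   # ~9%
print(f"8+1 RAID5 rebuild: {rebuild_ube_probability(3, 8):.1%}")   # ~17%
```

Plug in the enterprise rate of 1 in 10^16 and the same rebuilds land around 1-2%, which is why the SAS tier can get away with RAID5 while NL-SAS really shouldn’t.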

I’ve seen RAID5 used for capacity drives before.  And there are mechanisms built into storage arrays to try to sweep and detect errors before a drive fails.  And to date (knock on wood) I haven’t seen a RAID group die a horrible death during rebuild.  But it is always my emphatic recommendation to protect capacity drives with RAID6.  You will find this best practice repeated ad nauseam throughout the storage world.  It is nearly impossible to justify the additional risk of RAID5 against the cost of a few extra capacity disks, even if it pushes you into an extra disk shelf.  Fighting a battle today for a few more dollars on the purchase is going to be a lot less painful than explaining why a 50TB storage pool is invalid and everything in it must be rolled from backup.  (And you’ve got backups, right?  And they work?)

The Summary Before the Summary

This was a tremendous amount of information and is probably not digestible in one sitting.  Maybe not even two.  My hope is really that by reading this you will learn just a bit about the operations behind the curtain that will help you make an informed decision on when to use RAID5 and RAID6.  If this saves just one person from saying “we need to use RAID1/0 because it is the fast one,” I will be happy.

My next post will be a wrap up of RAID and some comparisons between the types to bring a close to this sometimes bizarre topic of RAID.

RAID: Part 4 – Parity, Schmarity

We’re in the home stretch now.  We’ve covered mirroring and striping, and three RAID types already.

Now we are moving into RAID5 and RAID6.  They leverage striping and a concept called parity.

I’m still new to this blogging thing and once again I bit off more than I could chew.  I wanted to include RAID5 and 6 within this post but it again got really lengthy.  Rather than put an abridged version of them in here, I will cover them in the next post.

For a long time I didn’t really understand what parity was, I just knew it was “out there” and let us recover from disk failures.  But the more I looked into it and how it worked, the more it really amazed me.  It might not grab you the way it grabs me…and that’s OK, maybe you just want to move on to RAID5/6 directly.  But if you are really interested in how it works, forge on.

What is Parity?

Parity (pronounced ‘pear-ih-tee’) is a fancy (pronounced ‘faincy’) kind of recovery information.  In the mirroring post we examined data recovery from a copy perspective.  This is a pretty straightforward concept – if I make two copies of data and I lose one copy, I can recover from the remaining copy.

The downside of this lies in the copy itself.  It requires double the amount of space (which we now refer to as a 50% usable capacity penalty) to protect data with an identical copy.

Parity is a more efficient form of recovery information.  Let’s walk through a simple example, but one that I hope will really illustrate the mechanism, benefits, and problems of parity.  Say that I’m writing numbers to disk, the numbers 18, 24, 9, and last but certainly not least, 42.

noparity

As previously discussed, a mirroring strategy would require 4 additional disks in order to mirror the existing data – very inefficient from a capacity perspective.

Instead, I’m going to perform a parity calculation – or a calculation that results in some single value I can use to recover from.  In this case I’m going to use simple addition to create it.

18 + 24 + 9 + 42 = 93

So my parity value is 93 and I can use this for recovery (I’ll explain how in just a moment).

Next question – where can I store this value?  Well we probably shouldn’t use any of the existing disks, because they contain the information we are protecting.  This is a pretty common strategy for recovery. If I’m protecting funny kitten videos on my hard drive, I don’t want to back them up to the same disk because a disk failure takes out the original hilarious videos and the adorable backups. Instead I want to store them on a different physical medium to protect from physical failure.

kittehbackup

Similarly, in my parity scheme if a disk were to fail that contained data and the parity value, I would be out of luck.

To get around this, I’ll add a fifth disk and write this value:

parity

Now the real question: how do I use it for recovery?  Pretty simply in this case.  Any disk that is lost can be rebuilt by utilizing the parity information, along with the remaining data values.  If Disk3 dies, I can recover the data on it by subtracting the remaining data values from the parity value:

reconstruct

Success – data recovery after a disk failure…and it was accomplished without adding a complete data copy!  In this case 1/5 of the disk space is lost to parity, translating to a 20% usable capacity penalty.  That is a serious improvement from mirroring.

What happens if there are two disks that fail? As with most things, it depends.  Just like RAID1, if a second disk fails after the system has fully recovered from the first failure, everything is fine.

But if there is a simultaneous failure of two disks?  This presents a recovery problem because there are two unknowns.  If Disk1 and Disk2 are lost simultaneously, my equation looks like:

93-42-9 = 42 = ?+?

Or in English, what two values add up to 42?  While it is true that 18+24=42, so does 20+22, neither of which are my data.  There are a lot of values that meet this criteria…in this case more of them aren’t my data than are.  And guessing with data recovery is, in technical terms, a terribad idea.  So we know that this parity scheme can survive a single disk failure.

Another important question – what happens if we overwrite data?  For instance, if Disk2’s value of 24 gets overwritten with a value of 15, how do we adjust?  It would be a real bummer if the system had to read all of the data in the stripe to calculate parity again for just one affected strip.

There is some re-reading of data, but it isn’t nearly that bad.  We remove the old data value (24) from the parity value (93), and then add the new one (15) in, which gives a new parity value of 93 - 24 + 15 = 84.  Then we can replace both the data and the parity on disk.

The process looks like:

  1. Get the old parity value
  2. Get the old data value
  3. Subtract the old data value from the old parity value, creating an intermediate parity value
  4. Add the new data value to the intermediate parity value, creating the new parity value
  5. Update the parity value with the new parity value
  6. Update the data value with the new data value

Because we are working with disks, we can replace the “Get” phrasing with read and “Update” with write.  Looking back at this list, we see that there are two gets (reads) and two updates (writes).

The Magical XOR Operation

I know, I know – my example was amazing.  “Parity calculations should use that mechanism!,” I can hear you saying.  “Give him the copyright and millions in royalties,” you are no doubt proclaiming to everyone around you.

Unfortunately my scheme has a serious problem, and that is that my parity calculation is cumulative.  The larger the numbers I am protecting get, the larger my parity value gets.  Remember, at the end of the day we aren’t really working with numbers.  We are working with data bits (1’s and 0’s) on disk, and we are working with a fixed strip size.  Add enough 64KB strips together and the running total no longer fits inside a single 64KB parity strip – and it only keeps growing as more data is protected.  Not ideal – the whole point of parity was to avoid burning extra capacity the way mirroring does!

Instead, in the real world parity is supported entirely by the bitwise “exclusive or,” or XOR, operation.  Visually the operator itself looks kind of like a crosshair, or a plus inside a circle.  XOR is a very unique operator in that it essentially allows you to add and subtract from a value (similar to my scheme) without increasing the total amount of information (the bit count).

Another cool thing about the bitwise XOR is it functions both as addition and subtraction.  To “add” two values, you XOR them together.  Then to remove one value from it, you XOR that value again. So instead of A + B – B = A, we have:

xoraddsub

XOR’s principle is very simple – if two values are different, the output is TRUE (1); otherwise the output is FALSE (0).  In other words:

  1. Take any pair of 1’s and/or 0’s and compare
  2. If they are the same, output 0
  3. Else output 1

That’s all, folks.  The 4 possible input combinations and their outputs are:

  • 0 XOR 0 = 0
  • 0 XOR 1 = 1
  • 1 XOR 0 = 1
  • 1 XOR 1 = 0

Not too crazy looking is it?  Let’s put it to the test and prove that this works for recovery.  Similar to before, I have 4 data disks but this time with just ones and zeros on them.  The parity calculation goes like so:

xorcalc

Every data bit gets XOR’d together and the result is the parity bit – in this case it is 1.

Now say Disk2 with 0 on it experiences a failure and the system needs to recover. Recovery would simply be the result of XORing all the remaining data bits and the parity bit.

Recovery (from left to right): 1 XOR 1 = 0, XOR 1 = 1, XOR 1 = 0

Success!  We have recovered the 0 bit.  Unfortunately XOR’s magic only extends so far, and in the event of a simultaneous two disk failure we are still up the creek without a paddle.  One parity value can only protect you against the loss of one disk because it can only recover one unknown value.

How about updating the parity bit in the event that we overwrite some data? Again, this works as outlined above:

  1. Read the old parity value
  2. Read the old data value
  3. XOR the old data value with the old parity value, creating an intermediate parity value
  4. XOR the new data value with the intermediate parity value, creating the new parity value
  5. Write the new parity value to the parity strip
  6. Write the new data value to the data strip

Same as before, there are two reads and two writes.

This is obviously a very simple example and was in no way meant to be a mathematical proof of parity recovery or XOR, but it works on any scale you choose.  Test it out!
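For anyone who does want to test it, here is a small Python sketch (tiny 4-byte strips standing in for 64KB ones; Python’s ^ operator is the bitwise XOR) that builds a parity strip, “loses” a disk and recovers it, and then performs the read-modify-write parity update from the list above:

```python
from functools import reduce

def xor_strips(*strips):
    """XOR any number of equal-length strips together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

# Four tiny data "strips" and the parity strip protecting them
d1, d2, d3, d4 = b"\x12\x34\x56\x78", b"\xde\xad\xbe\xef", b"\x00\xff\x00\xff", b"\x55\xaa\x55\xaa"
parity = xor_strips(d1, d2, d3, d4)

# Disk 2 dies: rebuild its strip from the survivors plus the parity strip
recovered = xor_strips(d1, d3, d4, parity)
assert recovered == d2

# Overwrite d3: two reads (old data, old parity) and two writes (new data, new parity)
new_d3 = b"\xca\xfe\xba\xbe"
new_parity = xor_strips(parity, d3, new_d3)          # old parity XOR old data XOR new data
assert new_parity == xor_strips(d1, d2, new_d3, d4)  # same result as recalculating from scratch
print("recovery and parity update both check out")
```

The same two-reads-two-writes update works no matter how many data strips are in the stripe, which is exactly where RAID5’s 4:1 write penalty comes from.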

Quick note – in the real world, strip size is not a single bit like in this example. With a 64KB strip size, that comes out to 524288 bits in a single strip.  524288 ones and zeros.  XOR functions quite simply as you add bits on, since it just compares each bit in place.  For example, 1100 XOR 1010 is 0110.  The first digit of the result is the first digit of each input XOR’d together.  The second digit is the second digits of each input XOR’d together.  And so on.  There are more detailed XOR manifestos out there as well as XOR calculators online…feel free to consult if you are interested and my explanation left you wanting.

Bitwise XOR is the mechanism for generating a parity bit without increasing the total bit count.  Using this, RAID controllers can generate parity that is identical in size to any amount of data strips being protected.  No matter the strip size, and no matter the stripe width, this mechanism will always result in an identically sized parity.

So what?

So what, indeed.  That was a lot to take in, I know…parity is certainly more complicated than mirroring.  Is it absolutely necessary to understand how parity works at this level?  No, not really.  But thus far I’ve never had a problem arise because I understood how something worked too deeply. I have encountered plenty of issues because I haven’t understood how something worked, or made assumptions about what something was doing behind the scenes.

When we get into some aspects of RAID5 and RAID6, understanding what parity is supposed to do will help clarify what those RAID types are useful for.  And if you don’t agree, feel free to wipe this from your memory banks and replace it with something more useful.

RAID: Part 3 – RAID 1/0

So, if you have been following along dear reader, we are now up to speed on several things.  We have discussed mirroring (and RAID1, which leverages it) and striping (and RAID0, which leverages that).  We have also discussed RAID types using some familiar and standard terminology which will allow us to compare and contrast the versions moving forward.

Now, on to the big dog of RAID – RAID 1/0.  This is called “RAID one zero” and “RAID ten,” and sometimes “RAID one plus zero” (and indicated as RAID 1+0).  I have never heard it called “RAID one slash zero” but perhaps somebody somewhere does that also.  All of these things are referring to the same thing, and RAID ten is the most common term for it.

Why do we need RAID1/0?

In this section I wanted to ask a sometimes overlooked question – what are the problems with RAID0 and RAID1 that cause people to need something else?

If you know about RAID0 (or even better if you read Part 2) you should have an excellent idea of the failings of it.  Just to reiterate, the problem of RAID0 is that it only leverages striping, and striping only provides a performance enhancement.  It provides nothing in the way of protection, hence any disk that fails in a RAID0 set will invalidate the entire set. RAID0 is the ticking time bomb of the storage world.

RAID1’s problems aren’t quite as obvious as the “one disk failure = worst day ever” of RAID0, but once again let’s go back to Part 1 and look at the benefits I listed of RAID:

  1. Protection – RAID (except RAID0) provides protection against physical failures.  Does RAID1 provide that?  Absolutely – RAID1 can survive a single disk failure.  Check box checked.
  2. Capacity – RAID also provides a benefit of capacity aggregation.  Does RAID1 provide that?  Not at all.  RAID1 provides no aggregate capacity or aggregate free space benefit because there are always exactly two disks in a RAID1 pair, and the usable capacity penalty is 50%.  Whether I have a RAID1 set using a 600GB drive or a 3TB drive, I get no aggregate capacity benefit with RAID1, beyond the idea of just splitting a disk up into logical partitions…which can be done on a single disk without RAID in the first place.
  3. Performance – RAID provides a performance benefit since it is able to leverage additional physical spindles.  Does RAID1 provide that?  The answer is yes…sort of.  It does provide two spindles instead of one, which fits the established definition.  However there are some caveats.  There isn’t a performance boost on writes because of the write penalty of 2:1 (both of the spindles are being used for every single write).  There is a performance boost on reads because it can effectively round-robin read requests back and forth on the disks.  But, and a BIG BUT, there are only two spindles.  There are only ever going to be two spindles.  Unlike a RAID0 set which can have as many disks as I want to risk my data over, a RAID1 set is performance bound to exactly two spindles.

Essentially the problem with the mirrored pair is just that – there are only ever going to be two physical disks.

By now it may have become obvious, but RAID0 and RAID1 are almost polar opposites.  RAID1’s benefit lies mostly around protection, and RAID0’s benefit is performance and capacity.  RAID1 is the stoic peanut butter, and RAID0 is the delicious jelly.  If only there was a way to leverage them both….

What is RAID1/0?

RAID1/0 is everything you wanted out of RAID0 and RAID1. It is the peanut butter and jelly sandwich.  (Note: please do not attempt to combine your storage array with peanut butter or jelly.  Especially chunky peanut butter.  And even more especiallyer chunky jelly)

Essentially RAID1/0 looks like a combination of RAID1 and RAID0, hence the label.  More accurately, it is a combination of mirroring and striping in that order.  RAID1/0 replaces the individual disks of a RAID0 stripe set with RAID1 mirror pairs.  It is also important to understand what RAID1/0 is and what it is not.  It is true that it leverages the good things out of both RAID types, but it also still maintains the bad things of both RAID types. This will become apparent as we dive into it.

raid10

This is a busy image, but bear with me as I break it down.

  • This is an eight disk RAID1/0 configuration, and on this configuration (similar to the Part 2 examples) we are writing A,B,C,D to it. For simplicity’s sake we ignore write order and just go alphabetically
  • The orange and green help indicate what is happening at their particular parts of the diagram
  • The physical disks themselves (the black boxes) are in mirrored pairs that should hopefully be familiar by now (indicated by the green boxes and plus signs).  This is the same RAID1 config that I’ve covered previously.
  • The weirdness picks up at the orange part. The orange box indicates that we are striping across every mirrored pair.  This is also identical to the RAID0 configuration, except that the physical disks of the RAID0 config have been replaced with these RAID1 pairs.

This is what is meant by RAID1/0.  First comes RAID1 – we build mirrored pairs.  Then comes RAID0 – we stripe data across the members, which happen to be those mirrored pairs.  It may help to think about RAID1/0 as RAID0 with an added level of protection at the member level (since we know RAID0 provides no protection otherwise).

As the host writes A,B,C,D, the diagram indicates where the data will land, but let’s cover the order of operations.

  1. The host writes A to the RAID1/0 set
  2. A is intercepted by the RAID controller.  The particular strip it is targeted for is identified.
  3. The strip is recognized to be on a mirrored pair, and due to the mirror configuration the write is split.
  4. A lands on both disks that make up the first member of the RAID0 set.
  5. Once the write is confirmed on both disks, the write is acknowledged back to the host as completed
  6. The host writes B to the RAID1/0 set
  7. B is intercepted by the RAID controller.  The particular strip it is targeted for is identified.  Due to the mirror configuration the write is split.
  8. B lands on both disks that make up the second member of the RAID0 set.
  9. Once the write is confirmed on both disks, the write is acknowledged back to the host as completed
  10. The host writes C to the RAID1/0 set
  11. etc.

Hopefully this gives an accurate, comprehensible version of the hows of RAID1/0.  Now, let’s look at RAID1/0 using the same terminology we’ve been using.
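If it helps to see the two layers as a mapping, here is a toy sketch (the disk numbering and round-robin strip placement are just illustrative) of which physical disks receive each strip in the eight disk example above:

```python
def raid10_targets(strip_index, mirror_pairs):
    """RAID0 layer picks the mirror pair; RAID1 layer splits the write to both of its disks."""
    member = strip_index % mirror_pairs
    return (2 * member, 2 * member + 1)

# Strips A, B, C, D (indexes 0-3) landing on a 4-pair / 8-disk RAID1/0 set
for strip in range(4):
    print(f"strip {strip} -> physical disks {raid10_targets(strip, 4)}")
```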

From a usable capacity perspective, RAID1/0 maintains the same penalty as RAID1.  Because every member is a RAID1 pair, and every RAID1 pair has a 50% capacity penalty, it stands to reason that RAID1/0 also has a 50% capacity penalty as a whole.  No matter how many members are in a RAID1/0 group, the usable capacity penalty is always 50%.

The write penalty is a similar tune.  Because every member is a RAID1 pair, and every RAID1 pair has a 2:1 write penalty, RAID1/0 also has a write penalty of 2:1.  Again no matter how many members are in the set, the write penalty is always 2:1.

RAID1/0 reminds me of the Facts of Life. You know, you take the good, you take the bad?  RAID1/0 is a leap up from RAID0 and RAID1, but it doesn’t mean that we’ve gotten rid of their problems.  It is better to think that we’ve worked around their problems.  The same usable capacity penalty exists, but now I have the ability to aggregate capacity by putting more and more members into a RAID1/0 configuration.  The same write penalty exists, but again I can now add more spindles to the RAID1/0 configuration for a performance boost.

The protection factor is weird, but still a combination of the two.  How many disk failures can a RAID1/0 set survive?  The answer is, it depends.  There is still striping on the outer layer, and by now we have beaten the dead horse enough to know that RAID0 can’t lose any physical disks.  It is a little clearer, especially for this transition, to think of this concept as RAID0 can’t survive any member failures, and in traditional RAID0 members are physical disks.  In this capacity, RAID1/0 is the same: RAID1/0 can’t survive any member failures.  The difference is that now a member is made up of two physical disks that are protecting each other.  So can a RAID1/0 set lose a disk and continue running?  Absolutely – RAID1/0 can always survive one physical disk failure.

…But, can it survive two?  This is where it gets questionable.  If the second disk failure is the other half of the mirrored pair, the data is toast.  Just as toast as if RAID0 had lost one physical disk since the effect is the same.  But what if it doesn’t lose that specific disk?  What if it loses a disk that is part of another RAID1 pair?  No problem, everything keeps running.  In fact, in our example, we can lose 4 disks like this and keep running:

raid10_4fails

You can lose as many as half of the disks in the RAID1/0 set and continue running, just as long as they are the right disks.  Again, if we lose two disks like this, ’tis a bad day:

raid10_2fails

So there are a few rules about the protection of RAID1/0 (illustrated with a quick simulation sketch after the list):

  • RAID1/0 can always survive a single disk failure
  • RAID1/0 can survive multiple disk failures, so long as the disk failures aren’t within the same mirrored pair
  • With RAID1/0 data loss can occur over as little as two disk failures (if they are part of the same mirror pair) and is guaranteed to occur at (n/2)+1 failures where n is the total disk count in the RAID1/0 set. 
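Here is the simulation sketch I promised – it fails k random disks in an eight disk RAID1/0 set over and over (disks 2i and 2i+1 form a mirror pair, matching the mapping sketch above) and counts how often the set survives:

```python
import random

def raid10_survives(failed_disks):
    """The set survives as long as no mirror pair (disks 2i and 2i+1) loses both members."""
    pairs = [disk // 2 for disk in failed_disks]
    return len(set(pairs)) == len(pairs)

def survival_rate(total_disks=8, failures=2, trials=100_000):
    survived = sum(
        raid10_survives(random.sample(range(total_disks), failures))
        for _ in range(trials)
    )
    return survived / trials

for k in range(1, 6):
    print(f"{k} random failures: ~{survival_rate(failures=k):.0%} survival")
```

One failure is always survivable, two random failures survive roughly 86% of the time on eight disks, and five failures (n/2 + 1) is always fatal – exactly the three rules above.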

Degraded and rebuild concepts are identical to RAID1 because the striping portion provides no protection and no rebuild ability.

  • Any mirror pair in degraded mode will see a write performance increase (splitting writes no longer necessary), and potentially a read performance decrease.  Other mirror pairs continue to operate as normal
  • Any mirror pair in rebuild mode will see a heavy performance penalty.  Other mirror pairs continue to operate as normal with no performance penalty.

Why not RAID0/1?

This is one of my favorite interview questions, and if you are interviewing with me (or at places I’ve been) this might give you a free pass on at least one technical question.  I picked it up from a colleague of mine and have used it ever since.

Why not RAID0/1?  Or is there even a concept of RAID0/1?  Would it be the same as RAID1/0?

It does exist, and it is extremely similar on the surface.  The only difference is the order of operations: RAID1/0 is mirrored, then striped, and RAID0/1 is striped, then mirrored.  This seemingly minor difference in theory actually manifests as a very large difference in practice.

raid01

Most things about RAID0/1 are identical to RAID1/0 (like performance and usable capacity), with one notable exception – what happens during disk failure?

I covered the failure process of RAID1/0 above so I won’t rehash that. For RAID0/1, remember that any failure of a RAID0 member invalidates the entire set.  So, what happens whenever the top left disk in RAID0/1 fails?  Yep, the entire top RAID0 set fails, and now it is effectively running as RAID0 using only the bottom set.

This has two implications.  The most severe is that RAID0/1 can survive a single disk failure, but any second failure that lands in the surviving stripe set is fatal – it never gets the “as long as they are in different mirror pairs” grace that RAID1/0 enjoys.  The other is that if a disk failed and a hot spare was available (or the bad disk was swapped out with a good disk), the rebuild affects the entire RAID set rather than just a portion of it.

It would be possible to design a RAID controller to get around this.  It could recognize that there is still a valid member available to continue running from in the second stripe set.  But then essentially what it is doing is trying to make RAID0/1 be like RAID1/0.  Why not just use RAID1/0 instead?  That is why RAID1/0 is a common implementation and RAID0/1 is not.

Wrap Up

In Part 4 I’m going to cover parity and hopefully RAID5 and 6, and then I’ll provide some notes to bring the entire discussion together.  However, I wanted to include some thoughts about RAID1/0 in case someone stumbled on this and had some specific questions or issues related to performance, simply because I’ve seen this a lot.

RAID1/0 performs more efficiently than other RAID types from a write perspective only.  A lot of people seem to think that RAID1/0 is “the fastest one,” and hence should always be used for performance applications.  This is demonstrably untrue.  As I’ve stated previously, there is no such thing as a read penalty for any RAID type.  If your application is entirely or mostly read oriented, using RAID1/0 instead of RAID5 or 6 does nothing but cost you money in the form of usable capacity.  And yes, there are workloads with enormous performance requirements that are 100% read.

RAID1/0 has a massive usable capacity penalty.  If you are protecting data with RAID1/0, you need to purchase twice as much storage as the data requires.  If you are replicating that data like-for-like, you need to purchase four times as much.  Additionally, sometimes your jumping off point locks you into a RAID type, so a decision to use RAID1/0 today may impact the future cost of storage as well.  I can’t emphasize this point enough – RAID1/0 is extremely expensive and not always needed.

I like to think of people who always demand RAID1/0 like the people who might bring a Ferrari when asked to “bring your best vehicle.”  But it turns out, I needed to tow a trailer full of concrete blocks up a mountain.  Different vehicles are the best at different things…just like RAID types.  We need to fully understand the requirements before we bring the sports car.

If you are having performance problems, or more likely someone is telling you they are having performance problems, jumping from RAID5 to RAID1/0 may not do a thing for you.  It is important to do a detailed analysis of the ENTIRE storage environment and figure out what the best fit solution is.  You don’t want to be that guy who advocated a couple hundred thousand dollars of a storage purchase when it turned out there was a host misconfiguration.