For my first foray into the tech blogging world, I wanted to have a discussion on the simple yet incredibly complex subject of RAID. Part 1 will not be technical; instead, it will hopefully provide some good footing on which to build. The rest of the series is laid out as follows:
- Part 2 – Mirroring and Striping
- Part 3 – RAID 1/0
- Part 4 – Parity, Schmarity
- Part 5 – RAID5 and RAID6
- Part 6 – WrapUp
For the purposes of this discussion I’m only going to focus on RAID 0, 1, 1/0 (called “RAID one zero” or more commonly “RAID ten”), 5, and 6. These are generally the most common RAID types in use today, and the ones available for use on an EMC VNX platform. Newcomers may feel daunted by the many types of RAID…I know I was. I spent some time memorizing a one-line definition of what each type means. While this may be handy for a job interview, a far more valuable use of time would be to learn how they work! You can always print out a summary and hang it next to your desk.
I’ve found RAID to be one of the more interesting topics in the storage world because it seems to be one of the more misunderstood, or at least not fully understood, concepts – yet it is probably one of the most widely used. Almost every storage array uses RAID in some form or another. Often I deal with questions like:
- Why don’t we just use RAID 1/0 since it is the fastest?
- Why wouldn’t I just throw all my disks into one big storage pool?
- RAID6 for NLSAS is a good suggestion, but RAID5 isn’t too much different, right?
- RAID6 gives two disk failure protection, why would anyone use RAID5 instead?
- Isn’t RAID6 too slow for anything other than backups?
Most of these questions really just stem from not understanding the purpose of RAID and how the types work.
In this post we’ll tackle the most basic of questions – what does RAID do, and why would I want to use RAID?
What does RAID do?
RAID is an acronym for Redundant Array of Independent (used to be Inexpensive, but not so much anymore) Disks. The easiest way to think of RAID is a group of disks that are combined together into one virtual disk.
If I had five 200GB disks, and “RAIDed” them together, it would be like I had one 1000GB disk. I could then allocate capacity from that 1000GB disk.
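To make the “one virtual disk” idea concrete, here is a minimal sketch in Python. It assumes the five disks are simply laid end to end, which is a simplification of mine for illustration: real RAID types stripe or mirror data and spend some capacity on protection. The function names are hypothetical.

```python
DISK_SIZE_GB = 200
NUM_DISKS = 5


def virtual_capacity_gb(num_disks=NUM_DISKS, disk_size_gb=DISK_SIZE_GB):
    """Raw size of the combined virtual disk, ignoring any RAID overhead."""
    return num_disks * disk_size_gb


def locate(virtual_offset_gb, num_disks=NUM_DISKS, disk_size_gb=DISK_SIZE_GB):
    """Map an offset on the virtual disk to (disk index, offset on that disk).

    Assumes plain end-to-end concatenation, not striping or mirroring.
    """
    if not 0 <= virtual_offset_gb < num_disks * disk_size_gb:
        raise ValueError("offset is outside the virtual disk")
    return divmod(virtual_offset_gb, disk_size_gb)


print(virtual_capacity_gb())  # size of the virtual disk in GB
print(locate(450))            # which physical disk backs the 450GB mark
```

The host only ever sees a single address space; the mapping back to physical disks is handled behind the scenes.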
Why would I want to use RAID?
RAID serves at least three purposes – protection, capacity, and performance.
Protection from Physical Failures
With the exception of RAID 0 (I’ll discuss the types later), the RAID types listed will protect you against at least one disk failure in the group. In other words, if a hard drive suffers a physical failure, not only can you continue running (possibly with a performance impact), but you won’t lose any data. A RAID group that has suffered a failure but is continuing to run is generally known as degraded. What this means is a little different for each type, so we’ll cover those details later. When the failed disk is replaced with a functional disk, some type of rebuild operation will commence, and when it completes the RAID group will return to normal status.
Most enterprise storage arrays, and many enterprise servers, allow you to implement what is commonly known as a hot spare. A hot spare is a disk that is running in a system, but not currently in use. The idea behind a hot spare is to reduce restore time. Without one, when a disk fails you have to:
- Wait for a human to recognize the failure
- Open a service request for a replacement
- Wait for the replacement to be shipped
- Have someone physically replace the disk
That is potentially a long period of time to be running in degraded mode. Hence the hot spare concept. With a hot spare in the system, when the disk fails, a spare is instantly available and the rebuild starts. Once the rebuild is finished, the RAID group returns to normal. The failed disk is no longer a part of any active RAID group, and can itself be seen as a spare, unused disk in the system (though obviously not a hot spare because it is failed!). Eventually it will be replaced, but because it isn’t involved in data service there is less of a critical business need to replace it.
An important and sometimes hazy concept, especially with desktops, is that RAID only protects you against physical failures. It does not protect you against logical corruption. As a simple example, if I protect your computer’s hard drives with RAID1 and one of those drives dies, you are protected. If instead you accidentally delete a critical file, RAID will do nothing for you. In this situation, you need to be able to recover the file through the file system if possible, or restore from a backup. There are a lot of types of logical corruption, and rest assured that RAID will not protect you from any of them.
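Here is a toy model of that distinction, assuming a two-disk mirror where each disk is just a Python dictionary of file names to contents. The `Mirror` class and its methods are hypothetical names for illustration, not any real controller’s interface.

```python
class Mirror:
    """Toy RAID1-style pair: every write and delete goes to both disks."""

    def __init__(self):
        self.disks = [{}, {}]
        self.failed = [False, False]

    def write(self, name, data):
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[name] = data

    def delete(self, name):
        # A delete is a perfectly valid request, so it is mirrored too.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk.pop(name, None)

    def fail_disk(self, index):
        """Simulate a physical failure: the disk and its contents are gone."""
        self.failed[index] = True
        self.disks[index] = {}

    def read(self, name):
        for i, disk in enumerate(self.disks):
            if not self.failed[i] and name in disk:
                return disk[name]
        return None


physical = Mirror()
physical.write("report.doc", "important data")
physical.fail_disk(0)
print(physical.read("report.doc"))  # still readable from the surviving disk

logical = Mirror()
logical.write("report.doc", "important data")
logical.delete("report.doc")
print(logical.read("report.doc"))   # gone from both copies
```

The mirror faithfully copies whatever it is told to do, including the mistake, which is why a backup is still needed.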
Capacity
There are two capacity related benefits to RAID. Note that there is generally also a capacity penalty that comes along with RAID, but we will discuss that when we get into the types.
Aggregated Usable Capacity
Continuing the example above with the five 200GB disks, if you were to come ask me for storage, without RAID the most I could give you in a single device would be a 200GB disk. I might be able to give you multiple 200GB disks, and you might be able to combine those through a volume manager, but as a storage admin the largest single disk I could present to you is 200GB.
What if you need a terabyte of space? I’d have to give you all five separate disks, and then you’d have to do some volume management on your end to put them together.
With RAID, I can assemble those together on the back end as a virtual device, and present it as one contiguous address space to a host. As an example, 2TB datastores are fairly common in ESX, and I would venture to say a lot of those datastores run on disk drives much smaller than 2TB. Maybe it is a 10 or 20 disk 600GB SAS pool, and we have allocated 2TB out of that for the ESX datastore.
Aggregated Free Space
Think about the hard drive in your computer. It is likely that you’ve got some amount of free capacity on it. Let’s say you have a 500GB hard drive with 200GB of free space.
Now let’s think about five computers with the same configuration. 500GB hard drives, 200GB free on each. This means that we are not using 1000GB of space overall, but because it is dedicated to each individual computer, we can’t do anything with it.
If instead we took those 500GB hard drives and grouped them, we would have a sum total of 2500GB to work with and hand out. Now perhaps it doesn’t make sense to give every user exactly 300GB of capacity, since that is what they are already using and they would immediately be out of space…but perhaps we could give them 400GB instead.
Now we’ve allocated (also commonly known as “carving”) five 400GB virtual disks (also commonly known as LUNs) out of our 2500GB pool, leaving us 500GB of free space to work with. Essentially, by pooling the resources, we’ve gained enough room for one additional 500GB virtual disk without adding another physical disk.
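The arithmetic in this example can be sketched in a few lines of Python. It uses the numbers from the text and, like the text, ignores any RAID capacity penalty; the function name is hypothetical.

```python
def pool_free_gb(disk_count, disk_size_gb, lun_sizes_gb):
    """Free space left in a pool after carving out the given LUNs."""
    total = disk_count * disk_size_gb
    allocated = sum(lun_sizes_gb)
    if allocated > total:
        raise ValueError("LUNs exceed pool capacity")
    return total - allocated


# Five standalone computers, each with 200GB free that nobody else can use.
stranded_free_gb = 5 * 200

# The same five 500GB disks pooled, with a 400GB LUN carved out per user.
pooled_free_gb = pool_free_gb(5, 500, [400] * 5)

print(stranded_free_gb)  # free space scattered across five machines
print(pooled_free_gb)    # free space available in one place
```

The pooled figure is smaller than the stranded one only because each user was given an extra 100GB of headroom; the difference is that what remains can actually be handed out.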
Performance
Performance of disk-based storage is largely based on how many physical spindles are backing it (this changes with EFD and large cache models, but that is for another discussion). A hard drive is a mechanical device, and is generally the slowest thing in the data path. Ergo, the more I can spread your data request (and all data requests) out over a bunch of hard drives, the more performance I’m going to be able to leverage.
If you need 200GB of storage and I give you one 200GB physical disk, that is one physical spindle backing your storage. You are going to be severely limited on how much performance you can squeeze out of that hard drive.
If instead I allocate your 200GB of space out of a RAID group or pool, now I can give you a little bit of space on multiple disks. Now your virtual disk storage is backed by many physical spindles, and in turn you will get a lot more performance out of it.
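As a rough back-of-the-envelope sketch, assume requests spread perfectly evenly and that performance scales linearly with the number of spindles. Both are idealizations, and the per-disk IOPS figure below is a made-up placeholder rather than a measured number.

```python
ASSUMED_IOPS_PER_DISK = 150  # hypothetical placeholder, not a measured figure


def spread(lun_size_gb, num_disks):
    """Return (GB placed on each disk, idealized aggregate IOPS).

    Assumes an even spread across disks and perfectly linear scaling.
    """
    slice_gb = lun_size_gb / num_disks
    ideal_iops = num_disks * ASSUMED_IOPS_PER_DISK
    return slice_gb, ideal_iops


print(spread(200, 1))   # one dedicated 200GB disk
print(spread(200, 10))  # the same 200GB spread over ten disks
```

Real numbers depend on the RAID type, the workload, and what else is sharing those disks, but the direction holds: more spindles behind the same capacity means more performance available.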
It should be said that this blog is focused on enterprise storage arrays, but some of the benefits listed above apply to any RAID controller, even one in a server or workstation. The aggregated free space benefit, and in most scenarios the performance benefit, apply only to shared storage arrays.
Hopefully this was a good high level introduction to the why’s of RAID. In the next post I will cover the how’s of RAID 1 and 0.