VNX Storage Pool Construction

Welcome back to me!  I’ve gone dark for a while as I welcomed my first child into the world.  It has been a very busy and confusing few months. 🙂

In this blog post I’d like to cover some rules around good storage pool construction and hopefully shed some light on what the storage pool “recommended numbers” really mean.

Let’s walk through a scenario.  You’ve got a VNX array, and you’ve got a hankerin’ for some RAID1/0.  You’ve got 18 unbound disks burning a hole in your pocket and nothing can stop you now.  You fire up Unisphere and away you go!

You start out with this:

[Image: 18 unbound disks]

And end up with this:

[Image: the resulting RAID 1/0 storage pool]

Looks awesome!  Unfortunately you’ve actually made a pretty bad mistake.  Under the covers, things are really out of balance now.  The private RAID groups behind the storage pool now look like this:

[Image: the pool’s private RAID groups, two 4+4 and one 1+1]

One RG (the 1+1) has 1/4th of the IOPS capability of the other two (the 4+4s).

There is kind of an interesting density mechanic here at play in that it is likely that the 4+4 will generate more I/O overall because there is more usable capacity…this metric would be something like I/O per GB.  But the fact remains that the 1+1 behind the scenes has a good chance of being a bottleneck at some point.
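To make the imbalance concrete, here is a minimal Python sketch of the carving.  It assumes the pool builds as many preferred-size (4+4) private RGs as it can and drops the leftover disks into one smaller group, which matches the 18-disk result above.  The function names are mine, and the idea that IOPS scale with spindle count is a simplification.

```python
def carve_private_raid_groups(disk_count, preferred_size=8):
    """Split disks into private RAID 1/0 groups, preferred size first.

    Assumption: full groups are built greedily and any leftover disks
    form one smaller group. Sizes are total disks, so 8 means 4+4.
    """
    if disk_count < 2 or disk_count % 2:
        raise ValueError("RAID 1/0 needs an even number of disks (2 or more)")
    full, leftover = divmod(disk_count, preferred_size)
    groups = [preferred_size] * full
    if leftover:
        groups.append(leftover)
    return groups


def describe(groups):
    """Render group sizes as mirrored pairs, e.g. 8 -> '4+4'."""
    return [f"{size // 2}+{size // 2}" for size in groups]


def relative_iops(groups):
    """IOPS of each group relative to the largest one.

    Assumption: IOPS capability scales linearly with spindle count.
    """
    largest = max(groups)
    return [size / largest for size in groups]


groups = carve_private_raid_groups(18)
print(describe(groups))       # ['4+4', '4+4', '1+1']
print(relative_iops(groups))  # [1.0, 1.0, 0.25]
```

The 0.25 at the end is the 1+1 that ends up as the potential bottleneck.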

Even worse, the VNX doesn’t understand that the RGs have a performance imbalance. So while there is less space for it to move slices into via FAST, it is still going to try to balance out the I/O within the tier.  Edit: Today I learned (always learning!) that Flare 32 and later VNXes actually will recognize the performance imbalance of the private RGs and allocate less I/O to them.  However there are still fewer total IOPS within that PRG to serve the slices on it.  And because the private RAID groups aren’t made terribly obvious unless you dig into the array, you might not ever notice the misconfiguration either until you start asking questions like, “why is my array performing like a Yugo instead of a Corvette?”

What could I have done?

I see this mistake made pretty frequently on arrays where perhaps an admin who wasn’t intimately familiar with VNX mechanics created or expanded some pools.  It isn’t always this severe…sometimes it is a 3 or 4 disk RAID5 RG in a 4+1 pool.  And really it is kind of counter-intuitive, since you probably (and rightly) feel like you really should be able to use any number of disks you want…after all you paid for them!  And sadly it is not reversible without destroying and recreating the entire pool.  But the important takeaway is that it never needs to happen.  You have two ways to avoid this.

One way is to always add disks in “recommended” chunks (more on this below), which also translates into purchasing them in recommended chunks.  If you had 18 disks available on your array, you should have planned on using them for a 4+4 and two 4+1’s.  Or two 8+1’s.  You should be aware of the number of disks you are adding and whether they correspond with the recommended amounts.

Another way is to build the entire pool in smaller chunks.  This is a very manual, very annoying process.  It is also likely to bite your company in the rear after you win the lottery and leave, and another admin comes in after you.  But for some reason you just REALLY want to use all 18 disks in RAID 1/0.  In order to do this properly, you want to create the pool with 6 disks in RAID 1/0.  Then add another 6 disks in RAID 1/0.  Then add the final 6 disks in RAID 1/0.  This will create 3 x 3+3 balanced RGs in the pool.  Now, unless the next guy coming in to fill your lottery-winning shoes is a sharpshooter, he’s probably not going to know this and may just start adding 8-disk groups to the pool, which again unbalances the back end.
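The chunked build can be sketched the same way.  This is a rough model, assuming each create or expand operation is carved into private RGs on its own (preferred 4+4 first, leftovers in a smaller group); the helper names are hypothetical.

```python
def carve(disk_count, preferred_size=8):
    """Carve one create/expand operation into private RG sizes (total disks).

    Assumption: preferred-size groups first, leftovers in one smaller group.
    """
    full, leftover = divmod(disk_count, preferred_size)
    return [preferred_size] * full + ([leftover] if leftover else [])


def build_pool(additions, preferred_size=8):
    """Apply a series of disk additions, each carved independently."""
    pool = []
    for count in additions:
        pool.extend(carve(count, preferred_size))
    return pool


def is_balanced(pool):
    """A pool is balanced when every private RG has the same disk count."""
    return len(set(pool)) <= 1


print(build_pool([18]), is_balanced(build_pool([18])))
# [8, 8, 2] False
print(build_pool([6, 6, 6]), is_balanced(build_pool([6, 6, 6])))
# [6, 6, 6] True
print(build_pool([6, 6, 6, 8]), is_balanced(build_pool([6, 6, 6, 8])))
# [6, 6, 6, 8] False
```

The last line is the next admin adding 8 disks to your carefully built 3+3 pool.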

What about those “recommended” numbers?

This is another interesting misconception behind the storage pool numbers.  The recommended numbers per RAID type are:

  • R1/0: 4+4
  • R5: 4+1 or 8+1
  • R6: 6+2 or 14+2

Some people feel like there is some magic behind these choices.  There isn’t.  These are simply some good balances between performance, capacity, and the fault domain of the RAID sets.  If you’ve read my epic on RAID you know that a 4+1 has less performance (because of fewer spindles) than a 15+1, but also has less of a chance of a dual disk failure which would cause data loss.
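The capacity side of that balance is just arithmetic.  A small Python sketch over the recommended layouts listed above (the table and function names are mine):

```python
# Recommended layouts from the list above, as (data disks, protection disks).
# For R1/0 the "protection" disks are the mirror copies.
RECOMMENDED = {
    "R1/0": [(4, 4)],
    "R5": [(4, 1), (8, 1)],
    "R6": [(6, 2), (14, 2)],
}


def usable_fraction(data, protection):
    """Fraction of raw capacity left for data in one RAID group."""
    return data / (data + protection)


def layout_summary():
    """One line per layout: RG size (the fault domain) and usable capacity."""
    rows = []
    for raid, layouts in RECOMMENDED.items():
        for data, protection in layouts:
            total = data + protection
            pct = usable_fraction(data, protection)
            rows.append(f"{raid} {data}+{protection}: {total} disks, {pct:.1%} usable")
    return rows


for row in layout_summary():
    print(row)
# R1/0 4+4: 8 disks, 50.0% usable
# R5 4+1: 5 disks, 80.0% usable
# R5 8+1: 9 disks, 88.9% usable
# R6 6+2: 8 disks, 75.0% usable
# R6 14+2: 16 disks, 87.5% usable
```

Wider groups buy capacity efficiency and spindles, at the cost of more disks sharing one fault domain.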

So they are good choices but they aren’t perfect choices.  You may experience a dual disk failure with 4+1 and corrupt the entire storage pool.  Having a R5 5+1 isn’t going to guarantee data destruction.  Having a R5 3+1 isn’t going to guarantee abysmal performance.

Specific disk counts may also enable more Full Stripe Writes, but truthfully if something is writing a LOT of data to the array, it is probably doing it sequentially anyway so the disk count becomes less of a factor.
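For the curious, here is roughly why disk count matters for Full Stripe Writes.  This sketch assumes a 64 KB stripe element per disk, which is my assumption and not something stated above, so check it against your array.  It also ignores alignment and cache coalescing.

```python
def full_stripe_kb(data_disks, element_kb=64):
    """Size of one full stripe: one element per data disk (parity excluded).

    Assumption: element_kb=64 is an illustrative default, not a confirmed value.
    """
    return data_disks * element_kb


def is_full_stripe_write(io_kb, data_disks, element_kb=64):
    """True if an aligned write of io_kb covers whole stripes only."""
    stripe = full_stripe_kb(data_disks, element_kb)
    return io_kb > 0 and io_kb % stripe == 0


print(full_stripe_kb(4))             # 256 (a 4+1)
print(full_stripe_kb(8))             # 512 (an 8+1)
print(is_full_stripe_write(256, 4))  # True
print(is_full_stripe_write(256, 3))  # False: a 3+1 stripe would be 192 KB
```

Power-of-two data disk counts line up nicely with power-of-two write sizes, which is part of why 4+1 and 8+1 show up on the list.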

So it isn’t really accurate to say you should go with the recommended disk counts because they are amazing.  But sticking to them does guarantee back-end balance, and it gives you a good trade-off between performance and capacity.

VNX, Dedupe, and You

Block deduplication was introduced in Flare 33 (VNX2).  Yes, you can save a lot of space.  Yes, dedupe is cool.  But before you go checkin’ that check box, you should make sure you understand a few things about it.

As always, nothing can replace reading the instructions before diving in:

http://www.emc.com/collateral/white-papers/h12209-vnx-deduplication-compression-wp.pdf

Lots of great information in that paper, but I wanted to hit the high points briefly before I go over the catches.  Some of these are relatively standard for dedupe schemes, some aren’t:

  • 8KB granularity
  • Pointer based
  • Hash comparison, followed by a bit-level check to avoid hash collisions
  • Post-process operation on a storage pool level
  • Each pass starts 12 hours after the last one completed for a particular pool
  • Only 3 processes allowed to run at the same time; any new ones are queued
  • If a process runs for 4 hours straight, it is paused and put at the end of the queue.  If nothing else is in the queue, it resumes.
  • Before a pass starts, if the amount of new/changed data in a pool is less than 64GB the process is skipped and the 12 hour timer is reset
  • Enabling and disabling dedupe are online operations
  • FAST Cache and FAST VP are dedupe aware << Very cool!
  • Deduped and non-deduped LUNs can coexist in the same pool
  • Space will be returned to the pool when one entire 256MB slice has been freed up
  • Dedupe can be paused, though this does not disable it
  • When dedupe is running if you see “0GB remaining” for a while, this is the actual removal of duplicate blocks
  • Deduped LUNs within a pool are considered a single unit from FAST VP’s perspective.  You can only set a FAST tiering policy for ALL deduped LUNs in a pool, not for individual deduped LUNs in a pool.
  • There is an option to set dedupe rate – this adjusts the amount of resources dedicated to the process (i.e. how fast it will run), not the amount of data it will dedupe
  • There are two Dedupe statistics – Deduplicated LUN Shared Capacity is the total amount of space used by dedupe, and Deduplication and Snapshot Savings is the total amount of space saved by dedupe
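The scheduling rules in that list can be modeled as a small decision function.  This is only a sketch of the rules as I’ve described them (12 hour timer, 64GB threshold, 3 concurrent processes, 4 hour pause), not how the array actually implements them.  The function names and return strings are made up for illustration.

```python
PASS_INTERVAL_HOURS = 12   # a pass starts 12 hours after the last one completed
MIN_CHANGED_GB = 64        # below this, the pass is skipped and the timer resets
MAX_CONCURRENT = 3         # only 3 processes run at once; the rest queue
MAX_RUN_HOURS = 4          # after 4 hours straight a process may be paused

# 256MB slice / 8KB dedupe granularity = blocks that must be freed per slice
BLOCKS_PER_SLICE = (256 * 1024) // 8


def next_action(hours_since_last_pass, changed_gb, running_processes):
    """Decide what happens to a pool's dedupe pass right now."""
    if hours_since_last_pass < PASS_INTERVAL_HOURS:
        return "wait"
    if changed_gb < MIN_CHANGED_GB:
        return "skip_and_reset_timer"
    if running_processes >= MAX_CONCURRENT:
        return "queue"
    return "run"


def after_four_hours(queue_length):
    """What happens to a process that has run for 4 hours straight."""
    return "pause_and_requeue" if queue_length > 0 else "resume"


print(next_action(6, 500, 0))    # wait
print(next_action(12, 10, 0))    # skip_and_reset_timer
print(next_action(12, 200, 3))   # queue
print(next_action(13, 200, 2))   # run
print(after_four_hours(0))       # resume
print(BLOCKS_PER_SLICE)          # 32768
```

That last number is worth a second look: 32768 8KB blocks have to be freed from a single slice before any space goes back to the pool.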

Performance Implications

Nothing is free, and this check box is no different.  Browse through the aforementioned PDF and you’ll see things like:

Block Deduplication is a data service that requires additional overhead to the normal code path.

Leaving Block Deduplication disabled on response time sensitive applications may also be desirable

Best suited for workloads of < 30% writes….with a large write workload, the overhead could be substantial

Sequential and large block random (IOs 32 KB and larger) workloads should also be avoided

But the best line of all is this:

it is suggested to test Block Deduplication before enabling it in production

Seriously, please test it before enabling it on your mission critical application. There are space saving benefits, but they come with a performance hit.  Nobody can tell you without analysis whether that performance hit will be noticeable or detrimental.  Some workloads may even get a performance boost out of dedupe if they are very read oriented and highly duplicated – it is possible to fit “more” data into cache…but don’t enable it and hope it will happen. Testing and validation are important!

Along with testing for performance, test for stability.  If you are using deduplication with ESX or Windows 2012, specific features (the XCOPY directive for VAAI, ODX for 2012) can cause deduped LUNs to go offline with certain Flare revisions.  Upgrade to .052 if you plan on using it with these specific OSes.  And again, validate, do your homework, and test test test!

The Dedupe Diet – Thin LUNs

Another thing to remember about deduplication is that all LUNs become thin.

When you enable dedupe, in the background a LUN migration happens to a thin LUN in the invisible dedupe container.  If your LUN is already thin, you won’t notice a difference here.  However if the LUN is thick, it will become thin when the migration completes.  This totally makes sense – how could you dedupe a fully allocated LUN?

When you enable dedupe the status for the LUN will be “enabling.”  This means it is doing the LUN migration – you can’t see it in the normal migration status area.

Thin LUNs have slightly lower performance characteristics than thick LUNs. Verify that your workload is happy on a thin LUN before enabling dedupe.

Also keep in mind that this LUN migration requires 110% of the consumed space in order to migrate…so if you are hoping to dedupe your way out of a nearly full pool, you may be out of luck.
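A quick sanity check for that 110% rule, as a Python sketch.  It assumes the 110% is measured against free capacity in the pool, and the function names are mine.

```python
def migration_space_needed_gb(consumed_gb):
    """Space the background migration needs: 110% of the LUN's consumed space."""
    return consumed_gb * 110 / 100


def can_migrate(consumed_gb, pool_free_gb):
    """True if the pool has enough free space to migrate this LUN."""
    return pool_free_gb >= migration_space_needed_gb(consumed_gb)


print(migration_space_needed_gb(500))  # 550.0
print(can_migrate(500, 400))           # False: nearly full pool, out of luck
print(can_migrate(500, 600))           # True
```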

One SP to Rule Them All

Lastly but perhaps most importantly – the dedupe container is owned by one SP.  This means that whenever you enable dedupe on the first LUN in a pool, that LUN’s owner becomes the Lord of Deduplication for that pool.  Henceforth, any LUNs that have dedupe enabled will be migrated into the dedupe container and will become owned by that SP.

This has potentially enormous performance implications with respect to array balance.  You need to be very aware of who the dedupe owner is for a particular pool.  In no particular order:

  • If you are enabling dedupe in multiple pools, the first LUN in each pool should be owned by differing SPs.  E.g. if you are deduping 4 different pools, choose an SPA LUN for the first one in two pools, and an SPB LUN for the first one in the remaining two pools.  If you choose an SPA LUN for the first LUN in all four pools, every deduped LUN in all four pools will be on SPA
  • If you are purchasing an array and planning on using dedupe in a very large single pool, depending on the amount of data you’ll be deduping you may want to divide it into two pools and alternate the dedupe container owner.  Remember that you can keep non-deduplicated LUNs in the pools and they can be owned by any SP you feel like
  • Similar to a normal LUN migration across SPs, after you enable dedupe on a LUN that is not owned by the dedupe container owner, you need to fix the default owner and trespass after the migration completes.  For example – the dedupe container in Pool_X is owned by SPA.  I enable dedupe on a LUN in Pool_X owned by SPB.  When the dedupe finishes enabling, I need to go to LUN properties and change the default owner to SPA.  Then I need to trespass that LUN to SPA.
  • After you disable dedupe on a LUN, it returns to the state it was in pre-dedupe.  If you needed to “fix” the default owner on enabling it, you will need to “fix” the default owner on disabling.
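The ownership bookkeeping from the points above can be sketched in a few lines of Python.  This is planning logic only and does not talk to an array; the pool names, LUN names, and function names are all hypothetical.

```python
def assign_container_owners(pool_names):
    """Alternate the dedupe container owner across pools: SPA, SPB, SPA, ..."""
    sps = ("SPA", "SPB")
    return {name: sps[i % 2] for i, name in enumerate(pool_names)}


def luns_needing_owner_fix(lun_owners, container_owner):
    """LUNs whose default owner differs from the pool's dedupe container owner.

    After dedupe finishes enabling on these, change the default owner and
    trespass the LUN to the container owner.
    """
    return [lun for lun, owner in lun_owners.items() if owner != container_owner]


owners = assign_container_owners(["Pool_1", "Pool_2", "Pool_3", "Pool_4"])
print(owners)
# {'Pool_1': 'SPA', 'Pool_2': 'SPB', 'Pool_3': 'SPA', 'Pool_4': 'SPB'}

pool_x_luns = {"LUN_10": "SPA", "LUN_11": "SPB", "LUN_12": "SPB"}
print(luns_needing_owner_fix(pool_x_luns, "SPA"))
# ['LUN_11', 'LUN_12']
```

In practice “assigning” the owner just means picking a LUN owned by that SP as the first one you enable dedupe on in each pool.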

What If You Whoopsed?

What if you checked that box without doing your homework?  What if you are seeing a performance degradation from dedupe?  Or maybe you accidentally have everything on your array now owned by one SP?

The good news is that dedupe is entirely reversible (big kudos to EMC for this one).  You can uncheck the box for any given LUN and it will migrate back to its undeduplicated state.  If it was thick before, it becomes thick again.  If it was owned by a different SP before, it is owned by that SP again.

If you disable dedupe on all LUNs in a given pool, the dedupe container is destroyed and can be recreated by re-enabling dedupe on something.  So if you unbalanced an array on SPA, you can remove all deduplication in a given pool, and then enable it again starting with an SPB LUN.

Major catch here – you must have the capacity for this operation.  A LUN requires 110% of the consumed capacity to migrate, so you need free space in order to undo this.

Deduplication is a great feature and can save you a lot of money on capacity, but make sure you understand it before implementing!