Thoughts on Thin Provisioning

I finally decided to put all my thoughts down on the topic of thin provisioning.  I wrestled with this post for a while because some of what I say is going to go kinda-sorta against a large push in the industry towards thin provisioning.  This is not a new push; it has been happening for years now.  This post may even be a year or two too late…

I am not anti-thin – I am just not 100% pro-thin.  I think there are serious questions that need to be addressed and answered before jumping on board with thin provisioning.  And most of these are relatively non-technical; the real issue is operational.

Give me a chance before you throw the rocks.

What is Thin Provisioning?

First let’s talk about what thin provisioning is, for those readers who may not know.  I feel like this is a pretty well known and straightforward concept so I’m not going to spend a ton of time on it.  Thin provisioning at its core is the idea of provisioning storage space “on demand.”

Before thin provisioning a storage administrator would have some pool of storage resources which gave some amount of capacity.  This could be simply a RAID set or even an actual pooling mechanism like Storage Pools on VNX.  A request for capacity would come in and they would “thick provision” capacity out of the pool.  The result would mean that the requested capacity would be reserved from the pooled capacity and be unavailable for use…except obviously for whatever purpose it was provisioned for.  So for example if I had 1000GB and you requested a 100GB LUN, my remaining pool space would be 900GB.  I could use the 900GB for whatever I wanted but couldn’t encroach into your 100GB space – that was yours and yours alone.  This is a thick provisioned LUN.

Of course back then it wasn’t “thick provisioning,” it was just “provisioning” until thin came along! With thin provisioning, after the request is completed and you’ve got your LUN, the pool is still at 1000GB (or somewhere very close to it due to metadata allocations which are beyond the scope of this post).  I have given you a 100GB LUN out of my 1000GB pool and still I have 1000GB available.  Remember that as soon as you get this 100GB LUN, you will usually put a file system on it and then it will appear empty.  This emptiness is the reason that the 100GB LUN doesn’t take up any space…there isn’t really any data on it until you put it there.

Essentially the thin LUN is going to take up no space until you start putting stuff into it.  If you put 10GB of data into the LUN, then it will take up 10GB on the back side.  My pool will now show 990GB free.  You should have a couple of indicators on the array like allocated or subscribed or committed and consumed or used.  Allocated/subscribed/committed is typically how much you as the storage administrator have created in the pool.  Consumed or used is how much the servers themselves have eaten up.

What follows are, in no particular order, some things to keep in mind when thin provisioning.

Communication between sysadmin and storage admin

This seems like a no-brainer but a discussion needs to happen between the storage admins providing the storage and the sysadmins who are consuming it.  If a sysadmin is given some space, they typically see this as space they can use for whatever they want.  If they need a dumping ground for a big ISO, they can use the SAN attached LUN with 1TB of free space on it.  Essentially they will likely feel that space you’ve allocated is theirs to do whatever they want with.  This especially makes sense if they’ve been using local storage for years.  If they can see disk space on their server, they can use it as they please.  It is “their” storage!

You need to have this conversation so that sysadmins understand activities and actions that are “thin hostile.”  A thin hostile action is one that effectively nullifies the benefit of thin provisioning by eating up space from day 1.  An example of a thin hostile action would be hard formatting, up front, the space a 500GB database will eventually use, before it is actually in use.  Another example would be a block level zero format of the space, like Eager Zeroed Thick on ESX.  And obviously using excess free space on a LUN as a file dumping ground is extremely thin hostile!

Another area of concern here is deduplication.  If you are using post-process deduplication, and you have thin provisioned storage, your sysadmins need to be aware of this when it comes to actions that would overwrite a significant amount of data.  You may dedupe their data space by 90%, but if they come in and overwrite everything it can balloon quickly.

The more your colleagues know about how their actions can affect the underlying storage, the less time you will spend fire fighting.  Good for them, good for you.  You are partners, not opponents!

Oversubscription & Monitoring

With thin provisioning, because no actual reservation happens on disk, you can provision as much storage as you want out of as small a pool as you want.  When you exceed the physical media, you are “oversubscribing” (or overcommitting, or overprovisioning, or…).  For instance, with your 1000GB you could provision 2000GB of storage.  In this case you would be 100% oversubscribed, or put another way, subscribed at 200% of your physical capacity.  You don’t have issues as long as the total used or consumed portion is less than 1000GB.
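The arithmetic here is simple enough to sketch in a few lines of shell.  This is only an illustration: the function names are made up for this post, and the inputs are assumed to be whole GB so that integer math is good enough.

```shell
# Hypothetical helpers (names are assumptions, not array commands).
# Arguments: PROVISIONED_GB PHYSICAL_GB

# Provisioned capacity as a percentage of physical capacity.
subscription_pct() {
    echo $(( $1 * 100 / $2 ))
}

# How far past physical capacity you have provisioned, as a percentage.
# Prints 0 when the pool is not oversubscribed at all.
oversubscription_pct() {
    if [ "$1" -le "$2" ]; then
        echo 0
    else
        echo $(( ($1 - $2) * 100 / $2 ))
    fi
}

subscription_pct 2000 1000       # prints 200
oversubscription_pct 2000 1000   # prints 100
```

Both numbers describe the same pool, so it pays to agree with your team on which one you mean when you say “200%.”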

There are a lot of really appealing reasons for doing this.  Most of the time people ask for more storage than they really need…and if it goes through several “layers” of decision makers, that might amplify greatly.  Most of the time people don’t need all of the storage they asked for right off the bat.  Sometimes people ask for storage and either never use it or wait a long time to use it.  The important thing to never forget is that from the sysadmin’s perspective, that is space you guaranteed them!  Every last byte.

Oversubscription is a powerful tool, but you must be careful about it.  Essentially this is a risk-reward proposition: the more people you promise storage to, the more you can leverage your storage array, but the more you risk that they will actually use it.  If you’ve given out 200% of your available storage, that may be a scary situation when a couple of your users decide to make good on the promise of space you made to them.  I’ve seen environments with as much as 400% oversubscription.  That’s a very dangerous gamble.

Thin provisioning itself doesn’t provide much benefit unless you choose to oversubscribe.  You should make a decision on how much you feel comfortable oversubscribing.  Maybe you don’t feel comfortable at all (if so, are you better off thick?).  Maybe 125% is good for you.  Maybe 150%.  Nobody can make this decision for you because it hinges on too many internal factors.  The important thing here is to establish boundaries up front.  What is that magic number?  What happens if you approach it?

Monitoring goes hand in hand with this.  If you monitor your environment by waiting for users to email that systems are down, oversubscribing is probably not for you.  You need to have a firm understanding of how much you’ve handed out and how much is being used.  Again, establish thresholds, establish an action plan for exceeding them, and monitor them.
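As a rough sketch of what “establish thresholds and monitor them” can look like, here is a small shell check.  Everything in it is an assumption for illustration: the function name, the two threshold values, and the idea that you feed it numbers pulled from whatever capacity report your array gives you.

```shell
# Placeholder thresholds; pick your own magic numbers.
MAX_SUBSCRIBED_PCT=150   # assumed ceiling on subscribed capacity
MAX_CONSUMED_PCT=70      # assumed ceiling on consumed capacity

# check_pool NAME PHYSICAL_GB SUBSCRIBED_GB CONSUMED_GB
# Prints one status line per pool and flags it when either threshold is exceeded.
check_pool() {
    sub_pct=$(( $3 * 100 / $2 ))
    used_pct=$(( $4 * 100 / $2 ))
    status=OK
    if [ "$sub_pct" -gt "$MAX_SUBSCRIBED_PCT" ] || [ "$used_pct" -gt "$MAX_CONSUMED_PCT" ]; then
        status=ALERT
    fi
    echo "$1 subscribed=${sub_pct}% consumed=${used_pct}% $status"
}

check_pool Pool_0 1000 1250 400   # Pool_0 subscribed=125% consumed=40% OK
check_pool Pool_1 1000 2000 750   # Pool_1 subscribed=200% consumed=75% ALERT
```

The point is less the script and more that both numbers (what you promised and what is actually used) get checked against a limit you decided on ahead of time.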

Establishing and sticking with thresholds like this really helps speed up and simplify decision making, and makes it very easy to measure success.  You can always re-evaluate the thresholds if you feel like they are too low or too high.

Also make sure your sysadmins are aware of whether you are oversubscribed or not, and what that means to them.  If they are planning on a massive expansion of data, maybe they can check with you first.  Maybe they requested storage for a project and waited 6 months for it to get off the ground – again they can check with you to make sure all is well before they start in on it.  These situations are not about dictating terms, but more about education.  Many other resources in your environment are likely oversubscribed.  Your network is probably oversubscribed.  If a sysadmin in the data center decided to suddenly multicast an image to a ton of servers on a main network line, you’d probably have some serious problems.  You probably didn’t design your network to handle that kind of network traffic (and if you did you probably wasted a lot of money).  Your sysadmins likely understand the potential DDoS effect this would generate, and will avoid it.  Nobody likes pain.

“Runway” to Purchase New Storage

Remember with thin provisioning you are generally overallocating and then monitoring (you are monitoring, aren’t you?) usage.  At some point you may need to buy more storage.

If you wait till you are out of storage, that’s no good right?  You have a 100% consumed pool, with a bunch of attached hosts that are thinking they have a lot more storage to run through.  If you have oversubscribed a pool of storage and it hits 100%, it is going to be a terrible, horrible, no good, very bad day for you and everyone around you.  At a minimum new writes to anything in that pool will be denied, effectively turning your storage read-only.  At a maximum, the entire pool (and everything in it) may go offline, or you may experience a variety of fun data corruptions.

So, you don’t want that.  Instead you need to figure out when you will order new storage.  This will depend on things like:

  • How fast is your storage use growing?
  • How many new projects are you implementing?
  • How long does it take you to purchase new storage?

The last point is sometimes not considered before it is too late.  When you need more storage you have to first figure out exactly what you need, then you need to spec it, then you need a quote, the quote needs approval, then purchasing, then shipping, then it needs to be racked/stacked, then implemented.  How long does this process last for your organization?  Again nobody can answer this but you.  If your organization has a fast turn around time, maybe you can afford to wait till 80% full or more.  But if you are very sluggish, you might need to start that process at 60% or less.
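One way to turn that into a number is to work backwards from your lead time.  The sketch below assumes steady, linear growth (a big assumption) and uses a made-up function name; it simply asks how full the pool can get before the remaining free space is less than what you will burn through while waiting for new storage to arrive.

```shell
# order_threshold_pct PHYSICAL_GB GROWTH_GB_PER_MONTH LEAD_TIME_MONTHS
# Prints the consumed percentage at which the purchase process should start.
# Assumes linear growth; real growth is rarely that polite.
order_threshold_pct() {
    runway_gb=$(( $2 * $3 ))   # space you will consume while waiting
    if [ "$runway_gb" -ge "$1" ]; then
        echo 0                 # you needed to order before you started
        return
    fi
    echo $(( ($1 - runway_gb) * 100 / $1 ))
}

order_threshold_pct 10000 500 4   # prints 80
order_threshold_pct 10000 500 8   # prints 60
```

With the same 10TB pool and the same growth, doubling the purchasing lead time from 4 months to 8 drops the trigger point from 80% to 60%.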

Another thing to consider is if you are a sluggish organization, you may save money by thick provisioning.  Consider that you may need 15TB of storage in 2 years.  Instead you buy 10TB of storage right off the bat with a 50% threshold.  As soon as you hit 5TB of storage used you buy another 10TB to put you at 20.  Then when you hit 10 you buy another 10TB to put you at 30.  Finally at 15TB you purchase again and hit 40TB.  If you had bought 20 to begin with and gone thick, you would have never needed to buy anything else.  This situation is probably uncommon but I wanted to mention it as a thought exercise.  Think about how the purchasing process will impact the benefit you are trying to leverage from thin provisioning.
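If you want to play with that thought exercise, here is a small simulation of it.  The function name and the 1TB step size are my own choices for illustration; it just walks usage up 1TB at a time and buys another increment whenever usage reaches the threshold.

```shell
# simulate_purchases START_TB INCREMENT_TB THRESHOLD_PCT FINAL_USED_TB
simulate_purchases() {
    capacity=$1
    used=0
    while [ "$used" -lt "$4" ]; do
        used=$(( used + 1 ))
        # Buy another increment as soon as usage reaches the threshold.
        if [ $(( used * 100 )) -ge $(( capacity * $3 )) ]; then
            capacity=$(( capacity + $2 ))
            echo "used=${used}TB -> buy ${2}TB, capacity now ${capacity}TB"
        fi
    done
    echo "final capacity: ${capacity}TB for ${used}TB used"
}

simulate_purchases 10 10 50 15
# used=5TB -> buy 10TB, capacity now 20TB
# used=10TB -> buy 10TB, capacity now 30TB
# used=15TB -> buy 10TB, capacity now 40TB
# final capacity: 40TB for 15TB used
```

Try different thresholds and increments to see how sensitive the total purchase is to how early you have to pull the trigger.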

Performance Implications

Simply – ask your vendor whether thin storage has any performance difference over thick.  The answer with most storage arrays (where you have an actual choice between thick and thin) is yes.  Most of the time this is a negligible difference, and sometimes the difference is only in the initial allocation – that is to say, the first write to a particular LBA/block/extent/whatever.  But again, ask.  And test to make sure your apps are happy on thin LUNs.

Feature Implications

Thin provisioning may have feature implications on your storage system.

Sometimes thin provisioning enables features.  On a VMAX, thin provisioning enables pooling of a large number of disks.  On a VNX thin provisioning is required for deduplication and VNX Snapshots.

And sometimes thin provisioning either disables or is not recommended with certain features.  On a VNX thin LUNs are not recommended for use as File OE LUNs, though you can still do thin file systems on top of thick LUNs.

Ask what impact thin vs thick will have on array features – even ones you may not be planning to use at this very second.

Thin on Thin

Finally, in virtualized environments, in general you will want to avoid “thin on thin.”  This is a thin datastore created on a thin LUN.  The reason is that you tend to lose a static point of reference for how much capacity you are overprovisioning.  And if your virtualization team doesn’t communicate too well with the storage team, they could be unknowingly crafting a time bomb in your environment.

Your storage team might have decided they are comfortable with a 200% oversubscription level, and your virt team may have made this same decision.  This will potentially overallocate your storage by 400%!  Each team is sticking to their game plan, but without knowing and monitoring the other folks they will never see the train coming.
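The compounding is just multiplication, which is exactly why it sneaks up on people.  A tiny sketch, with a made-up function name and both inputs expressed as subscription percentages (200 meaning twice the underlying capacity):

```shell
# effective_subscription_pct STORAGE_LAYER_PCT VIRT_LAYER_PCT
# Each layer multiplies the promise made by the layer underneath it.
effective_subscription_pct() {
    echo $(( $1 * $2 / 100 ))
}

effective_subscription_pct 200 200   # prints 400
effective_subscription_pct 150 200   # prints 300
effective_subscription_pct 200 100   # prints 200 (one side kept thick)
```

The last line is the case for keeping one side thick: the number you see at the thin layer is the number you actually have.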

You can get away with thin on thin if you have excellent monitoring, or if your storage and virt admins are one and the same (which is common these days).  But my recommendation still continues to be thick VMs on thin datastores.  You can create as many thin datastores as you want, up to system limits, and then create lazy zeroed thick VMs on top of them.

Edit: this recommendation assumes that you are either required or compelled to use thin storage.  Thin VMs on thick storage are just as effective, but sometimes you won’t have a choice in this matter.  The real point is keeping one side or the other thick gives you a better point of reference for the amount of overprovisioning.


Hopefully this provided some value in the form of thought processes around thin provisioning.  Again, I am not anti-thin; I think it has great potential in some environments.  However, I do think it needs to be carefully considered and thought through when it sometimes seems to be sold as a “just thin provision, it will save you money” concept.  It really needs to be fleshed out differently for every organization, and if you take the time to do this you will not only better leverage your investment, but you can avoid some potentially serious pain in the future.

VNX File + Linux CLI

If you can learn Linux/UNIX command line and leverage it in your job, I firmly believe it will make you a better, faster, more efficient storage/network/sysadmin/engineer.  egrep, sed, awk, and bash are extremely powerful tools.  The real trick is knowing how to “stack” up the tools to make them do what you want…and not bring down the house in the process.  Note: I bear no responsibility for you bringing your house down!

Today I was able to leverage this via the VNX Control Station CLI.  I had a bunch of standard file system replications to set up and Unisphere was dreadfully slow.  If you find yourself in this situation, give the following a whirl.  I’m going to document my thought process as well, because I think this is equally as important as knowing how to specifically do these things.

First what is the “create file replication” command?  A quick browse through the man pages, online, or the Replicator manual gives us something like this:

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=DESTINATIONPOOLID -vdm DESTINATIONVDMNAME -interconnect id=INTERCONNECTID

Looking at the variable data in CAPITAL LETTERS, the only thing I really care about changing is the replication name and file system name.  In fact I usually use the file system name for the replication name…I feel like this does what I need it to unless you are looking at a complex Replicator set up.  So if I identify the destination pool ID (nas_pool -list), the destination vdm name (nas_server -list -vdm), and the interconnect ID (nas_cel -interconnect -list) then all I’m left with is needing the file system name.

So the command would look like (in my case, with some made up values):

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Pretty cool – at this point I can just replace the name itself if I wanted and still get through it much faster than through Unisphere.  But let’s go a little further.

I want to automate the process for a bunch of different things in a list.  And in order to do that, I’ll need a for loop.  A for loop in bash goes something like this:

for i in {0..5}; do echo i is $i; done

This reads in English, “for every number in 0 through 5, assign the value to the variable $i, and run the command ‘echo i is $i’.”  If you run that line on a Linux box, you’ll see:

i is 0
i is 1
i is 2
i is 3
i is 4
i is 5

Now we’ve got our loop so we can process through a list.  What does that list need to be?  In our case that list needs to be a list of file system names.  How do we get those?

We can definitely use the nas_fs command but how is a bit tricky.  nas_fs -l will give us all the file system names, but it will truncate them if they get too long.  If you are lucky enough to have short file system names, you might be able to get them out of here.  If not, the full name would come from nas_fs -info -all.  Unfortunately that command also gives us a bunch of info we don’t care about like worm status and tiering policy.

Tools to the rescue!  What we want to do is find all lines that have “name” in them and the tool for that is grep.  nas_fs -info -all | grep name will get all of those lines we want.  Success!  We’ve got all the file system names.

name      = root_fs_1
name      = root_fs_common
name      = root_fs_ufslog
name      = root_panic_reserve
name      = root_fs_d3
name      = root_fs_d4
name      = root_fs_d5
name      = root_fs_d6
name      = root_fs_2
name      = root_fs_3
name      = root_fs_vdm_cifs-vdm
name      = root_rep_ckpt_68_445427_1
name      = root_rep_ckpt_68_445427_2
name      = cifs
name      = root_rep_ckpt_77_445449_1
name      = root_rep_ckpt_77_445449_2
name      = TEST
name      = TestNFS

Alas they are not as we want them, though.  First of all we have a lot of “root” file systems we don’t like at all.  Those are easy to get rid of.  We want all lines that don’t have root in them, and once again grep to the rescue with the -v or inverse flag.

nas_fs -info -all | grep name | grep -v root

name      = cifs
name      = TEST
name      = TestNFS

Closer and closer.  Now the problem is the “name   =” part.  Now what we want is only the 3rd column of text.  In order to obtain this, we use a different tool – awk.  Awk has its own language and is super powerful, but we want a simple “show me the 3rd column” and that is going to just be tacked right on the end of the previous command.

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}'


Cool, now we’ve got our file system names.  We can actually run our loop on this output, but I find it easier to send it to a file and work with it.  Just run the command and point the output to a file like so:

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}' > /home/nasadmin/fsout.txt

This way you can directly edit the fsout.txt file if you want to make changes.  Learning how these tools work is very important because your environment is going to be different and the output that gets produced may not be exactly what you want it to be.  If you know how grep, awk, and sed work, you can almost always coerce output however you want.
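As one example of coercing output a different way, the two greps and the awk can be folded into a single awk call.  This is a sketch, not a claim about the “right” way: the sample lines below are a made-up stand-in for nas_fs -info -all (real output has many more fields), and sample_output is a hypothetical helper so the example can run anywhere.

```shell
# Stand-in for `nas_fs -info -all`; these sample lines are made up for the demo.
sample_output() {
    cat <<'EOF'
name      = root_fs_1
worm      = off
name      = cifs
worm      = off
name      = TEST
name      = TestNFS
EOF
}

# Keep lines whose first field is exactly "name" and whose value does not
# start with "root", then print the third field (the file system name).
sample_output | awk '$1 == "name" && $3 !~ /^root/ { print $3 }'
# cifs
# TEST
# TestNFS
```

The behavior is slightly different from the grep version: matching on the first field avoids picking up other lines that merely contain the word “name,” and only names that begin with “root” are dropped instead of any line with “root” anywhere in it.  Whether that is what you want depends on your naming conventions.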

Now let’s combine this output with ye olde for loop to finish out strong.  Note the ` below are backticks, not single quotes:

for fsname in `cat /home/nasadmin/fsout.txt`; do echo nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001; done

My output in this case is a series of commands printed to the screen because I left in the “echo” command:

nas_replicate -create cifs -source -fs cifs -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TEST -source -fs TEST -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TestNFS -source -fs TestNFS -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Exactly what I wanted.  Now if I want to actually run it rather than just printing them to the screen, I can simply remove the “echo” from the previous for loop.  This is a good way to validate your statement before you unleash it on the world.
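For what it’s worth, a variation on the same loop reads the file line by line instead of using backticks.  This is a sketch under a few assumptions: build_commands is a made-up function name, the temp file stands in for /home/nasadmin/fsout.txt, and the pool, VDM, and interconnect values are the same made-up ones used earlier.  The “echo” is left in on purpose so it only prints the commands.

```shell
# Temp file standing in for /home/nasadmin/fsout.txt.
list_file=$(mktemp)
printf '%s\n' cifs TEST TestNFS > "$list_file"

# build_commands LIST_FILE
# Prints one nas_replicate command per non-blank line in the list file.
build_commands() {
    while IFS= read -r fsname; do
        [ -n "$fsname" ] || continue   # skip blank lines
        echo nas_replicate -create "$fsname" -source -fs "$fsname" \
            -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
    done < "$1"
}

build_commands "$list_file"
rm -f "$list_file"
```

Reading with “while read” and quoting “$fsname” keeps each line intact even if a name contains characters the shell would otherwise split or expand, and skipping blank lines protects you from a stray empty line left behind after hand-editing the file.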

If you are going to attempt this, look into the background flag as well which can shunt these all to the NAS task scheduler.  I actually like running them without the flag in this case so I can glance at putty and see progress.

If you haven’t played in the Linux CLI space before, some of this might be Greek.  Understandable!  Google it and learn.  There are a million tutorials on all of these concepts out there.  And if you are a serious Linux sysadmin you probably have identified a million flaws in the way I did things. 🙂  Such is life.

Sometimes there is a fine line with doing things like this, where you may spend more time on the slick solution than you would have just hammering it out.  In this made up case I just had 3…earlier I had over 30.  But solutions like this are nice because they are reusable, and they scale.  It doesn’t really matter whether I’m doing 1 replication or 10 or 40.  I can use this (or some variation of it) every time.

The real point behind this post wasn’t to show you how to use these tools to do replications via CLI, though if it helps you do that then great.  It was really to demonstrate how you can use these tools in the real world to get real work done.  Fast and consistent.

VNX2 Hot Spare Policy bug in Flare 33 .051

The best practice for VNX2 hot spares is one spare for every 30 drives in your array.  However, if you have a VNX2 on Flare 33 .051 release, you’ll notice that the “Recommended” default policy is 1 per 60.

This is a bug.  There has been no change in the recommendations from EMC.  If you want the policy to return to the recommended 1 per 30, you have to manually set it.

I noticed today when trying to do this via Unisphere that you can only set a 1 per 30 policy if you actually have 30 or more disks of a given type.  If you have 6 EFD disks, your options through Unisphere are 1 hotspare per 2, 3, 4, 5, 6, or 60 disks.  In order to set a 1 per 30 policy in this situation you must use navicli or naviseccli.

Get a list of the hotspare policy IDs:

navicli -h SPA_IP_ADDRESS hotsparepolicy -list

Set a policy ID to 1 per 30:

navicli -h SPA_IP_ADDRESS hotsparepolicy -set POLICY_ID_NUMBER -keep1unusedper 30 -o

Note that you only need to do this on SPA or SPB for each policy, not both.

I also wanted to quickly mention there isn’t a great danger in leaving this at 1 per 60 because the hot spare policy is really only a reporting mechanism.  E.g. if you leave the policy at 1 per 60, and you have 60 drives, and you have two hot spares with 58 used data disks, AND you have two drives fail…both spares will kick in.  The hot spare policy does not control hot sparing behavior; it just reports compliance.  (Actually it will also prevent you from creating a storage pool that would violate the hot spare policy, but only if you don’t manually select disks…)

But I still like having the hot spare policy reflect the recommended best practice, and that is still one hotspare for every 30 disks.

Information taken from:

VNX, Dedupe, and You

Block deduplication was introduced in Flare 33 (VNX2).  Yes, you can save a lot of space.  Yes, dedupe is cool.  But before you go checkin’ that check box, you should make sure you understand a few things about it.

As always, nothing can replace reading the instructions before diving in:

Lots of great information in that paper, but I wanted to hit the high points briefly before I go over the catches.  Some of these are relatively standard for dedupe schemes, some aren’t:

  • 8KB granularity
  • Pointer based
  • Hash comparison, followed by a bit-level check to avoid hash collisions
  • Post-process operation on a storage pool level
  • Each pass starts 12 hours after the last one completed for a particular pool
  • Only 3 processes allowed to run at the same time; any new ones are queued
  • If a process runs for 4 hours straight, it is paused and put at the end of the queue.  If nothing else is in the queue, it resumes.
  • Before a pass starts, if the amount of new/changed data in a pool is less than 64GB the process is skipped and the 12 hour timer is reset
  • Enabling and disabling dedupe are online operations
  • FAST Cache and FAST VP are dedupe aware << Very cool!
  • Deduped and non-deduped LUNs can coexist in the same pool
  • Space will be returned to the pool when one entire 256MB slice has been freed up
  • Dedupe can be paused, though this does not disable it
  • When dedupe is running if you see “0GB remaining” for a while, this is the actual removal of duplicate blocks
  • Deduped LUNs within a pool are considered a single unit from FAST VP’s perspective.  You can only set a FAST tiering policy for ALL deduped LUNs in a pool, not for individual deduped LUNs in a pool.
  • There is an option to set dedupe rate – this adjusts the amount of resources dedicated to the process (i.e. how fast it will run), not the amount of data it will dedupe
  • There are two Dedupe statistics – Deduplicated LUN Shared Capacity is the total amount of space used by dedupe, and Deduplication and Snapshot Savings is the total amount of space saved by dedupe

Performance Implications

Nothing is free, and this check box is no different.  Browse through the aforementioned PDF and you’ll see things like:

Block Deduplication is a data service that requires additional overhead to the normal code path.

Leaving Block Deduplication disabled on response time sensitive applications may also be desirable

Best suited for workloads of < 30% writes….with a large write workload, the overhead could be substantial

Sequential and large block random (IOs 32 KB and larger) workloads should also be avoided

But the best line of all is this:

it is suggested to test Block Deduplication before enabling it in production

Seriously, please test it before enabling it on your mission critical application. There are space saving benefits, but that comes with a performance hit.  Nobody can tell you without analysis whether that performance hit will be noticeable or detrimental.  Some workloads may even get a performance boost out of dedupe if they are very read oriented and highly duplicated – it is possible to fit “more” data into cache…but don’t enable it and hope it will happen. Testing and validation is important!

Along with testing for performance, test for stability.  If you are using deduplication with ESX or Windows 2012, specific features (the XCOPY directive for VAAI, ODX for 2012) can cause deduped LUNs to go offline with certain Flare revisions.  Upgrade to .052 if you plan on using it with these specific OSes.  And again, validate, do your homework, and test test test!

The Dedupe Diet – Thin LUNs

Another thing to remember about deduplication is that all LUNs become thin.

When you enable dedupe, in the background a LUN migration happens to a thin LUN in the invisible dedupe container.  If your LUN is already thin, you won’t notice a difference here.  However if the LUN is thick, it will become thin whenever the migration completes.   This totally makes sense – how could you dedupe a fully allocated LUN?

When you enable dedupe the status for the LUN will be “enabling.”  This means it is doing the LUN migration – you can’t see it in the normal migration status area.

Thin LUNs have slightly lower performance characteristics than thick LUNs. Verify that your workload is happy on a thin LUN before enabling dedupe.

Also keep in mind that this LUN migration requires 110% of the consumed space in order to migrate…so if you are hoping to dedupe your way out of a nearly full pool, you may be out of luck.
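A quick sanity check before you tick the box might look like the sketch below.  It reflects my reading of the 110% rule (free pool space must be at least 110% of the LUN’s consumed capacity); the function name is made up, and the numbers would come from whatever your pool and LUN properties report.

```shell
# can_migrate CONSUMED_GB POOL_FREE_GB
# Returns success only if the pool has at least 110% of the LUN's consumed space free.
can_migrate() {
    needed_gb=$(( ($1 * 110 + 99) / 100 ))   # 110% of consumed, rounded up
    if [ "$2" -ge "$needed_gb" ]; then
        echo "ok: need ${needed_gb}GB, have ${2}GB free"
    else
        echo "blocked: need ${needed_gb}GB, have ${2}GB free"
        return 1
    fi
}

can_migrate 500 600           # ok: need 550GB, have 600GB free
can_migrate 500 400 || true   # blocked: need 550GB, have 400GB free
```

The same check applies in reverse when you disable dedupe, since that is also a migration.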

One SP to Rule Them All

Lastly but perhaps most importantly – the dedupe container is owned by one SP.  This means that whenever you enable dedupe on the first LUN in a pool, that LUN’s owner becomes the Lord of Deduplication for that pool.  Henceforth, any LUNs that have dedupe enabled will be migrated into the dedupe container and will become owned by that SP.

This has potentially enormous performance implications with respect to array balance.  You need to be very aware of who the dedupe owner is for a particular pool.  In no particular order:

  • If you are enabling dedupe in multiple pools, the first LUN in each pool should be owned by differing SPs.  E.g. if you are deduping 4 different pools, choose an SPA LUN for the first one in two pools, and an SPB LUN for the first one in the remaining two pools.  If you choose an SPA LUN for the first LUN in all four pools, every deduped LUN in all four pools will be on SPA
  • If you are purchasing an array and planning on using dedupe in a very large single pool, depending on the amount of data you’ll be deduping you may want to divide it into two pools and alternate the dedupe container owner.  Remember that you can keep non-deduplicated LUNs in the pools and they can be owned by any SP you feel like
  • Similar to a normal LUN migration across SPs, after you enable dedupe on a LUN that is not owned by the dedupe container owner, you need to fix the default owner and trespass after the migration completes.  For example – the dedupe container in Pool_X is owned by SPA.  I enable dedupe on a LUN in Pool_X owned by SPB.  When the dedupe finishes enabling, I need to go to LUN properties and change the default owner to SPA.  Then I need to trespass that LUN to SPA.
  • After you disable dedupe on a LUN, it returns to the state it was pre-dedupe.  If you needed to “fix” the default owner on enabling it, you will need to “fix” the default owner on disabling.

What If You Whoopsed?

What if you checked that box without doing your homework?  What if you are seeing a performance degradation from dedupe?  Or maybe you accidentally have everything on your array now owned by one SP?

The good news is that dedupe is entirely reversible (big kudos to EMC for this one).  You can uncheck the box for any given LUN and it will migrate back to its undeduplicated state.  If it was thick before, it becomes thick again.  If it was owned by a different SP before, it is owned by that SP again.

If you disable dedupe on all LUNs in a given pool, the dedupe container is destroyed and can be recreated by re-enabling dedupe on something.  So if you unbalanced an array on SPA, you can remove all deduplication in a given pool, and then enable it again starting with an SPB LUN.

Major catch here – you must have the capacity for this operation.  A LUN requires 110% of the consumed capacity to migrate, so you need free space in order to undo this.

Deduplication is a great feature and can save you a lot of money on capacity, but make sure you understand it before implementing!

Just what the heck is Geometry Limited on VMAX FTS?

VMAX is a truly amazing piece of hardware with truly amazing features and unfortunately some truly mind-boggling concepts behind it.  I think really this comes with the territory – most often the tools with the biggest capabilities and flexibilities require a lot of knowledge to configure and understand.

One concept that I struggled with on the VMAX was “geometry limited” via Federated Tiered Storage so I thought I would provide some info for anyone else who is having trouble with it. This also gives me the opportunity to talk briefly about some other VMAX topics.

Federated Tiered Storage (FTS) is the ability to have the VMAX leverage third-party storage arrays behind it.  So as a simple example, I could connect a VNX to the back of a VMAX, and then present usable storage to the VMAX, and then present that storage to a host.  The VMAX in this case acts more like a virtual front end than a storage array itself.


FTS is a complex subject that could itself be the subject of multiple posts.  Instead I want to provide a high level overview of concepts, and then go over the ‘geometry limited’ part.  There are two ways that FTS manages the other arrays: external provisioning and encapsulation.

External Provisioning

External provisioning is the easiest to understand.  In this method you present some amount of storage from the external array as a LUN (or multiple LUNs) to the VMAX, and the VMAX interprets this as being special disks.  It then uses these disks in much the same manner as it would use a direct attached disk.  I say the disks are special because it relies on the assumption that the external array is going to RAID protect the data.  Therefore, there is no need to once again RAID protect them on a VMAX like it would normally do – this would waste space and likely degrade performance.  Because of this, the VMAX manipulates them in unprotected form.  And because they are just seen as attached disks, it also formats them which destroys any data that is on them.


From here you can do most of the stuff you would do with a normal set of disks.  You can create a thin pool out of them and even use them as a tier in FAST VP.  This is a good way to really maximize the benefits of FTS and leverage your older storage arrays.  Or simply just consolidate your arrays into one point of management.  Cool feature, but again any data on the LUNs will evaporate.


Encapsulation

Encapsulation is the focus of this post and is a little more complicated.  Encapsulation allows you to preserve data on a LUN presented for the purposes of FTS.  For example, if you had an existing LUN on a VNX that a host was using (say, Oracle database data) and you wanted to present that LUN through the VMAX, you probably wouldn’t want to use External Provisioning because the VMAX would wipe out the data.  Instead you do what is known as encapsulation.


When you encapsulate the external LUN, the VMAX preserves all the data on it, as either a thick or a thin device.  So you could then connect your Oracle database server to the VMAX, attach the encapsulated LUN to it, and voilà! All your data is available.  It preserves the data by creating a “dummy LUN” of sorts on the VMAX and passing I/O through from the external array to a host.

Encapsulation is neat but there are some restrictions around it.  For instance, an encapsulated LUN can’t be the target of a VLUN migration (though you can VLUN migrate it somewhere else) and an encapsulated LUN can’t participate in FAST (whether geometry limited or not).

Some encapsulated LUNs are geometry limited and some aren’t.

Device Sizing

In order to understand what geometry limited means, you must first understand how the VMAX sees device sizes.  VMAX sizing is always done in cylinders, each of which is 15 tracks, or 960KB.  This means that what I would consider common LUN sizes (100GB, 500GB, 1TB, 2TB) don’t actually “fit” on a VMAX.  Instead, if you ask it to create a 100GB LUN, it rounds up to the nearest cylinder(ish).  You can kind of think of this as a “Do No Harm” rule.  If you request a device size that falls exactly onto a cylinder boundary, you get that exact device size.  If you request one that falls between cylinder boundaries, the VMAX rounds up in order to make sure that you get all the space you originally requested.

We won’t get a 100GB LUN because 100GB doesn’t fall onto a cylinder boundary:

100GB * 1024 * 1024 = 104857600KB
104857600KB / 960KB = 109226.67 cylinders

So we might end up with a device that is actually 109227 cylinders, which works out to 109227 * 960KB / 1024 / 1024 = 100.000305GB.
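The round-up can be sketched in a few lines of Python.  This is only an illustration of the arithmetic above, not anything the VMAX exposes; the function and constant names are made up for the example, and it assumes sizes are given in whole GB (binary, 1024-based).

```python
CYLINDER_KB = 960          # 15 tracks per cylinder, per the description above
KB_PER_GB = 1024 * 1024


def round_up_to_cylinders(requested_gb):
    """Return (cylinders, actual_gb) for a requested size in whole GB."""
    requested_kb = requested_gb * KB_PER_GB
    # Integer ceiling division: never round down and lose requested space.
    cylinders = (requested_kb + CYLINDER_KB - 1) // CYLINDER_KB
    actual_gb = cylinders * CYLINDER_KB / KB_PER_GB
    return cylinders, actual_gb


cylinders, actual_gb = round_up_to_cylinders(100)
print(cylinders, round(actual_gb, 6))  # 109227 100.000305
```

Integer ceiling division is used here rather than floating point so that a size landing exactly on a cylinder boundary is never bumped up by a rounding error.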

During “normal” operations this difference is not particularly meaningful (unless you are trying to match device sizes for things like replication, in which case it becomes tremendously important), but it is important to understand.

It is also important to understand that there is a maximum device size on a VMAX, and that is 262668 cylinders, or roughly 240.48GB.  In order for a device to be larger than this, you must create what is known as a meta device, which is several devices bundled together into a larger virtual device.  For instance, if I needed an 800GB LUN, I could “meta” four 200GB regular devices together and have an 800GB device.
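As a rough sketch of that limit, the snippet below estimates the minimum number of meta members a given size would need.  The names are hypothetical, and it only models the 262668 cylinder cap quoted above; real meta configuration has more knobs than this.

```python
CYLINDER_KB = 960
KB_PER_GB = 1024 * 1024
MAX_DEVICE_CYLINDERS = 262668  # maximum single device size quoted above


def ceil_div(a, b):
    """Integer ceiling division for non-negative integers."""
    return -(-a // b)


def min_meta_members(requested_gb):
    """Minimum number of devices needed to hold requested_gb (1 = no meta needed)."""
    cylinders = ceil_div(requested_gb * KB_PER_GB, CYLINDER_KB)
    return ceil_div(cylinders, MAX_DEVICE_CYLINDERS)


print(min_meta_members(100))  # 1
print(min_meta_members(800))  # 4
```

The 800GB case comes out to four members, which lines up with the four 200GB devices in the example.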

Geometry Limited

So finally, what does geometry limited mean?  Geometry limited is what happens when you encapsulate an external LUN whose size does not match up exactly with the size of the VMAX device created for it.  In other words, the “dummy LUN” on the VMAX is larger than the actual LUN on the external array.

Again remember the “Do No Harm” philosophy here.  You are asking the VMAX to preserve data from an external array, and there is a good chance that external device will not align with cylinder boundaries.  The VMAX in this case can’t round down, because it would effectively be chopping off the last part of what you asked it to preserve.  Not good!  Instead, because it needs to preserve the entire external LUN and device sizes are required to align to cylinder boundaries, the VMAX device ends up larger than the actual LUN on the external array.  This is exactly what causes a device to be geometry limited.  If it happens that the external LUN matches up precisely with a cylinder boundary, it is not geometry limited.

With geometry limited devices, the VMAX is going to “fake” the sizing to the host.  So no matter how large the VMAX device is, the host is only going to see exactly what is on the original array.

To demonstrate, there are two specific instances where this will happen.

Scenario the First

The first occurrence of geometry limited would be when the external device size does not align with a cylinder boundary.  This works exactly like my previous example with the 100GB LUN: any device size where the VMAX would have to round up to match a cylinder boundary would be geometry limited.  For instance, if I were to encapsulate a 15GB LUN, this device would not be geometry limited, because 15GB fits exactly into 960KB cylinders (16384 cylinders: 16384 * 960KB / 1024 / 1024 = 15GB).

But if I were to encapsulate a 50GB LUN, the VMAX needs to preserve the entire 50GB even though it doesn’t align with cylinder values.  Similar to my 100.000305GB LUN above, the VMAX “dummy LUN” must be slightly larger than the 50GB LUN on the external array.

50GB * 1024 * 1024 = 52428800KB
52428800KB / 960KB = 54613.3 cylinders

So the “dummy LUN” in this case needs to be 54614 cylinders in order to preserve all of the 50GB of data, and 54614 cylinders is larger than the original 50GB device.  Hence this encapsulated LUN would be geometry limited.
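Here is a small sketch of this first scenario.  It simply checks whether an external LUN size lands on a 960KB cylinder boundary and reports how much padding the “dummy LUN” would carry; the function name and return shape are my own invention for illustration.

```python
CYLINDER_KB = 960
KB_PER_GB = 1024 * 1024


def encapsulation_geometry(external_kb):
    """Return (cylinders, padding_kb, geometry_limited) for an external LUN size in KB."""
    # Round up so the whole external LUN is preserved.
    cylinders = -(-external_kb // CYLINDER_KB)
    padding_kb = cylinders * CYLINDER_KB - external_kb
    # Any padding means the VMAX device is larger than the external LUN.
    return cylinders, padding_kb, padding_kb > 0


print(encapsulation_geometry(15 * KB_PER_GB))  # (16384, 0, False)
print(encapsulation_geometry(50 * KB_PER_GB))  # (54614, 640, True)
```

For the 50GB LUN the mismatch is only 640KB, but any nonzero padding is enough to make the device geometry limited.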

Scenario the Second

The second occurrence of geometry limited happens due to meta configuration.  Let’s encapsulate a 300GB LUN.

300GB * 1024 * 1024 = 314572800KB
314572800KB / 960KB = 327680 cylinders

OK, good news!  The device falls exactly onto a cylinder boundary, so no geometry limited condition here, right?  Well, maybe.

Remember that the max device size on a VMAX is around 240GB, so in this case we need a meta configuration to create the VMAX device.  Whether this device is geometry limited or not revolves around exactly how that meta is created.

  • Sometimes the meta settings force all the members to be a specific size
  • Sometimes the meta settings force all the members to be the same size

Either of these settings can result in a geometry limited device.  Imagine that our meta member size was 125GB and we forced all members to be the same size.  We would end up with 3 members and a 375GB meta, which is 75GB larger than the original device.  Again, this is a geometry limited device.

Another weird situation can arise when the original device falls on a cylinder boundary but the meta member count causes it to deviate, for instance if we tried to do a 7 member meta for the 300GB device.  Even with the proper meta settings, this is going to be a geometry limited device because 300GB / 7 does not divide evenly into 960KB cylinders.
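Both meta cases can be sketched with a simplified model.  The assumptions are loud ones: all members are the same size, each member is a whole number of cylinders, and the function names are hypothetical.  Actual meta behavior depends on the settings in play, so treat this as arithmetic only.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative integers."""
    return -(-a // b)


def meta_with_fixed_member_size(lun_gb, member_gb):
    """Forced member size: return (members, meta_gb, excess_gb)."""
    members = ceil_div(lun_gb, member_gb)
    meta_gb = members * member_gb
    return members, meta_gb, meta_gb - lun_gb


def meta_with_member_count(lun_cylinders, member_count):
    """Forced member count, equal members: return (cylinders_per_member, excess_cylinders)."""
    per_member = ceil_div(lun_cylinders, member_count)
    return per_member, per_member * member_count - lun_cylinders


# 300GB LUN with 125GB members: 3 members, 375GB meta, 75GB excess.
print(meta_with_fixed_member_size(300, 125))  # (3, 375, 75)

# 300GB is 327680 cylinders; split 7 ways it cannot divide evenly.
print(meta_with_member_count(327680, 7))      # (46812, 4)
```

In this model, any nonzero excess means the meta is larger than the external LUN, which is the geometry limited condition.  A 2 member meta of the same 300GB LUN would come out with zero excess.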

What does it mean and what can I do about it?

Geometry limited devices have several restrictions (do your own research to validate).

  • Can only be source devices for local replication
  • Can only be R1 devices for SRDF
  • Can’t be expanded

Getting rid of geometry limited is possible but still strange.  For instance, you can VLUN migrate to an internal pool.  This will fix the geometry limited part!  What’s sort of bizarre is that the size mismatch still exists.  Further, the VMAX will continue to “fake” the sizing to the host even if the VMAX device is expanded.  In order to fix this you need to reset the LUN geometry, which requires that you unmount/unmap the device from all FAs…so this is disruptive.

Wrap Up

There are a lot of potential use cases for FTS and it is some really sweet technology.  However, if you are going to use encapsulation, you should understand these limitations and make sure that you aren’t painting yourself into a corner.