Shedding Light on Storage Encryption

I’ve been noticing some fundamental misunderstandings around storage encryption – I see this most when dealing with XtremIO, although plenty of other platforms support it (VNX2 and VMAX, for example).  I hope this blog post will help someone who is missing the bigger picture, and maybe help them make a better decision based on tradeoffs.  This is not going to be a heavily technical post, but is intended to shed some light on the topic from a strategic angle.

Hopefully you already know, but encryption at a high level is a way to make data unreadable gibberish except by an entity that is authorized to read it.  The types of storage encryption I’m going to talk about are Data At Rest Encryption (often abbreviated DARE or D@RE), in-flight encryption, and host-based encryption.  I’m talking in this post mainly about SAN (block) storage, but these concepts also apply to NAS (file) storage.  In fact, in-flight encryption is probably way more useful on a NAS array given the inherent security of FC fabrics.  But then, iSCSI, and it gets cloudier.

Before I start, security is a tool and can be used wisely or poorly with equivalent results.  Encryption is security.  Not all security, and not all encryption, is great.  Consider the idea of cryptographic erasure, by which data is “deleted” merely because it is encrypted and nobody has the key.  Ransomware thrives on this.  You are looking at a server with all your files on it, but without the key they may as well be deleted.  Choosing a security feature for no good business reason other than “security is great” is probably a mistake that is going to cause you headaches.


Here is a diagram with 3 zones of encryption.  Notice that host-based encryption overlaps the other two – that is not a mistake as we will see shortly.

Data At Rest Encryption

D@RE of late typically refers to a storage array’s ability to encrypt data at the point of entry (write) and decrypt on exit (read).  Sometimes this is done with ASICs on an array or I/O module, but it is often done with Self Encrypting Drives (SEDs).  However the abstract concept of D@RE is simply that data is encrypted “at rest,” or while it is sitting on disk, on the storage array.

This might seem like a dumb question, but it is a CRUCIAL one that I’ve seen either not asked or answered incorrectly time and time again: what is the purpose of D@RE?  The point of D@RE is to prevent physical hardware theft from compromising data security.  So, if I nefariously steal a drive out of your array, or a shelf of drives out of your array, and come up with some way to attach them to another system and read them, I will get nothing but gibberish.

Now, keep in mind that this problem is typically far more of an issue on a small server system than it is on a storage array.  A small server might just have a handful of drives associated with it, while a storage array might have hundreds, or thousands.  And those drives are going to be in some form of RAID protection which leverages striping.  So even without D@RE the odds of a single disk holding meaningful data are small, though admittedly they are still there.

More to the point, D@RE does not prevent anyone from accessing data on the array itself.  I’ve heard allusions to this idea that “don’t worry about hackers, we’ve got D@RE” which couldn’t be more wrong, unless you think hackers are walking out of your data center with physical hardware.  If the hackers are intercepting wire transmissions, or they have broken into servers with SAN access, they have access to your data.  And if your array is doing the encryption and someone manages to steal the entire array (controllers and all) they will also have access to your data.

D@RE at the array level is also one of the easiest to deal with from a management perspective because usually you just let the array handle everything including the encryption keys.  This is mostly just a turn it on and let it run solution.  You don’t notice it and generally don’t see any fall out like performance degradation from it.

In-Flight Encryption

In-flight encryption is referring to data being encrypted over the wire.  So your host issues a write to a SAN LUN, and that traverses your SAN network and lands on your storage array.  If data is encrypted “in-flight,” then it is encrypted throughout (at least) the switching.

Usually this is accomplished with FC fabric switches that are capable of encryption.  So the switch that sees a transmission on an F port will encrypt it, and then transmit it encrypted along all E ports (ISLs) and then decrypt it when it leaves another F port.  So the data is encrypted in-flight, but not at rest on the array.  Generally we are still talking about ASICs here so performance is not impacted.

Again let’s ask, what is the purpose of in-flight encryption?  In-flight encryption is intended to prevent someone who is sniffing network traffic (meaning they are somehow intercepting the data transmissions, or a copy of the data transmissions, over the network) from being able to decipher data.

For local FC networks this is (in my opinion) not often needed.  FC networks tend to be very secure overall and not really vulnerable to sniffing.  However, for IP based or WAN based communication, or even stretched fabrics, it might be sensible to look into something like this.

Also keep in mind that because data is decrypted before being written to the array, it does not provide the physical security that D@RE does, nor does it prevent anyone from accessing data in general.  You also sometimes have the option of not decrypting when writing to the array.  So essentially the data is encrypted when leaving the host, and written encrypted on the array itself.  It is only decrypted when the host issues a read for it and it exits the F port that host is attached to. This results in you having D@RE as well with those same benefits.  A real kicker here becomes key management, because in-flight encryption can be removed at any time without issue.  You can remove or disable in-flight encryption and not see any change in data because at the ends it is unencrypted.  However, if the data is written encrypted on the array, then you MUST have those keys to read that data.  If you had some kind of disaster that compromised your switches and keys, you would have a big array full of cryptographically erased data.

Host Based Encryption

Finally, host-based encryption is any software or feature that encrypts LUNs or files on the server itself.  So data that is going to be written to files (whether SAN based or local files) is encrypted in memory before the write actually takes place.

Host-based encryption ends up giving you both in-flight encryption and D@RE as well.  So when we ask the question, what is the purpose of host-based encryption?, we get the benefits we saw from in-flight and D@RE, as well as another one.  That is the idea that even with the same hardware setup, no other host can read your data.  So if I were to forklift your array, fabric switches, and get an identical server (hardware, OS, software) and hook it up, I wouldn’t be able to read your data.  Depending on the setup, if a hacker compromises the server itself in your data center, they may not be able to read the data either.

So why even bother with the other kinds of encryption?  Well for one, generally host-based encryption does incur a performance hit because it isn’t using ASICs.  Some systems might be able to handle this but many won’t be able to.  Unlike D@RE or in-flight, there will be a measurable degradation when using this method.  Another reason is that key management again becomes huge here.  Poor key management and a server having a hardware failure can lead to that data being unreadable by anyone.  And generally your backups will be useless in this situation as well because you have backups of encrypted data that you can’t read without the original keys.

And frankly, usually D@RE is good enough.  If you have a security issue where host-based encryption is going to be a benefit, usually someone already has the keys to the kingdom in your environment.

Closing Thoughts

Hopefully that cleared up the types of encryption and where they operate.

Another question I see is “can I use more than one at the same time?”  The answer is yes, with caveats.  There is nothing that prevents you from using even all 3 at the same time, even though it wouldn’t really make any sense.  Generally you want to avoid overlapping because you are encrypting data that is already encrypted, which is a waste of resources.  So a sensible pairing might be D@RE on the array and in-flight on your switching.

A final HUGELY important note – and what really prompted me to write this post – is to make sure you fully understand the effect of encryption on all of your systems.  I have seen this come up in a discussion about XtremIO using D@RE paired with host-based encryption.  The question was “will it work?” but the question should have been “should we do this?”  Will it work?  Sure, there is nothing problematic about host-based encryption and XtremIO D@RE interacting, other than the XtremIO system encrypting already encrypted data.  What is problematic, though, is the fact that encrypted data does not compress, and most encrypted data won’t dedupe either…or at least not anywhere close to the level of unencrypted data.  And XtremIO generally relies on its fantastic inline compression and dedupe features to fit a lot of data on a small footprint.  XtremIO’s D@RE happens behind the compression and deduplication, so there is no issue.  However host-based encryption will happen ahead of the dedupe/compression and will absolutely destroy your savings.  So if you wanted to use the system like this, I would ask, how was it sized?  Was it sized with assumptions about good compression and dedupe ratios?  Or was it sized assuming no space savings?  And, does the business problem you were trying to solve justify the extra money you will be spending on the host-based encryption product and on the additional required storage?  Or was there even a business problem at all?  If host-based encryption really is required, a better fit would probably be something like a tiered VNX2 with FAST Cache, which could easily handle a lot of raw capacity and use the flash where it helps the most.
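To make the compression and dedupe point concrete, here is a minimal Python sketch.  Everything in it is an assumption for illustration: toy_encrypt is a hypothetical stand-in for a real host-based cipher (it is NOT secure), zlib stands in for inline compression, and counting unique 4 KB blocks stands in for dedupe.  It is not how XtremIO or any particular encryption product works internally; it only shows the general effect of encrypting before the data reduction step.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed dedupe/compression block size, for illustration only


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher (NOT secure): XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


def unique_blocks(data: bytes) -> int:
    """Count distinct fixed-size blocks, a rough proxy for the post-dedupe footprint."""
    return len({hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
                for i in range(0, len(data), BLOCK_SIZE)})


# 64 identical 4 KB blocks of very repetitive "application data"
one_block = (b"customer record 0001; status=active; " * 200)[:BLOCK_SIZE]
plaintext = one_block * 64
ciphertext = toy_encrypt(plaintext, key=b"hypothetical host key")

for label, data in (("plaintext", plaintext), ("ciphertext", ciphertext)):
    print(f"{label}: {unique_blocks(data)} unique blocks, "
          f"compression ratio {compression_ratio(data):.3f}")
```

The plaintext collapses to a single unique block and compresses to almost nothing, while the encrypted copy of the very same data looks like 64 unrelated blocks of noise.  That is the savings an array sized on good reduction ratios would lose.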

Again, security is a tool, so choose the tools you need, use them judiciously, and make sure you fully understand their impact (end-to-end) in your environment.

EMC RecoverPoint Journal Sizing

A commenter on my post about RecoverPoint Journal Usage asks:

How can I tell if my journal is large enough for a consistency group? That is to say, where in the GUI will it tell me I need to expand my journal or add another journal lun?

This is an easy question to answer but for me this is another opportunity to re-iterate journal behavior.  Scroll to the end if you are in a hurry.

Back to Snapshots…

Back to our example in the previous article about standard snapshots – on platforms where snapshots are used you often have to allocate space for this purpose…like with SnapView on EMC VNX and Clariion, you have to allocate space via a Reserve LUN pool.  On NetApp systems this is called the snapshot reserve.

Because of snapshot behavior (whether Copy On First Write or Redirect On Write), at any given time I’m using some variable amount of space in this area that is related to my change rate on the primary copy.  If most of my data space on the primary copy is the same as when I began snapping, I may be using very little space.  If instead I have overwritten most of the primary copy, then I may be using a lot of space.  And again, as I delete snapshots over time this space will free up.  So a potential set of actions might be:

  1. Create snapshot reserve of 10GB and create snapshot1 of primary – 0% reserve used
  2. Overwrite 2.5GB of data on primary – 25% reserve used
  3. Create snapshot2 of primary and overwrite a different 2.5GB of data on primary – 50% reserve used
  4. Delete snapshot1 – 25% reserve used
  5. Overwrite 50GB of data – snapshot space full (probably bad things happen here…)

There is meaning to how much space I have allocated to snapshot reserve.  I can have way too much (meaning my snapshots only use a very small portion of the reserve) and waste a lot of storage.  Or I can have too little (meaning my snapshots keep overrunning the maximum) and probably cause a lot of problems with the integrity of my snaps.  Or it can be just right, Goldilocks.
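The numbered sequence above can be modeled with a short Python sketch.  This is a simplified Copy On First Write model of my own, not SnapView or NetApp code: the class and method names are hypothetical, and it assumes every overwrite touches blocks that have not been overwritten since the existing snapshots were taken.

```python
class ReserveFullError(Exception):
    """Raised when copied-out data would exceed the snapshot reserve."""


class SnapshotReserve:
    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.snapshots = set()
        # Each entry: (size in GB, names of the snapshots that still need that data)
        self.saved = []

    @property
    def used_gb(self) -> float:
        return sum(size for size, _ in self.saved)

    @property
    def percent_used(self) -> float:
        return 100 * self.used_gb / self.capacity_gb

    def create_snapshot(self, name: str) -> None:
        self.snapshots.add(name)

    def overwrite_primary(self, size_gb: float) -> None:
        """Overwrite previously untouched blocks; originals are copied to the reserve."""
        if not self.snapshots:
            return  # no snapshot exists, so nothing has to be preserved
        if self.used_gb + size_gb > self.capacity_gb:
            raise ReserveFullError("snapshot reserve is full")
        self.saved.append((size_gb, set(self.snapshots)))

    def delete_snapshot(self, name: str) -> None:
        self.snapshots.discard(name)
        for _, owners in self.saved:
            owners.discard(name)
        # Data that no remaining snapshot needs is released
        self.saved = [(size, owners) for size, owners in self.saved if owners]


reserve = SnapshotReserve(capacity_gb=10)
reserve.create_snapshot("snapshot1")
history = [reserve.percent_used]          # step 1
reserve.overwrite_primary(2.5)
history.append(reserve.percent_used)      # step 2
reserve.create_snapshot("snapshot2")
reserve.overwrite_primary(2.5)
history.append(reserve.percent_used)      # step 3
reserve.delete_snapshot("snapshot1")
history.append(reserve.percent_used)      # step 4
print("percent used after steps 1-4:", history)

try:
    reserve.overwrite_primary(50)         # step 5
except ReserveFullError as err:
    print("step 5:", err)
```

The point of the model is that reserve utilization goes up and down with change rate and snapshot deletion, and that there is a hard ceiling where things break.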

RP Journal

Once again the RP journal does not function like this.  Over time we expect RP journal utilization to be at 100%, every time.  If you don’t know why, please read my previous post on it!

The size of the journal only defines your protection window in RP.  The more space you allocate, the longer back you are able to recover from.  However, there is no such thing as “too little” or “too much” journal space as a rule of thumb – these are business defined goals that are unique to every organization.

I may have allocated 5GB of journal space to an app, and that lets me recover 2 weeks back because it has a really low write rate.  If my SLA requires me to recover 3 weeks back, that is a problem.

I may have allocated 1TB of journal space to an app, and that lets me recover back 30 minutes because it has an INSANE write rate.  If my SLA only requires me to recover back 15 minutes, then I’m within spec.
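As a rough back-of-the-envelope check on those two examples, here is a small Python sketch.  It assumes the protection window scales linearly with journal size at a constant write rate, which is my simplification: real write rates fluctuate, and RecoverPoint uses part of the journal for other purposes.  The function names are hypothetical.

```python
HOURS_PER_WEEK = 7 * 24


def protection_window_hours(journal_gb: float, write_rate_gb_per_hour: float) -> float:
    """How far back a journal of this size reaches at a steady write rate."""
    return journal_gb / write_rate_gb_per_hour


def journal_needed_gb(write_rate_gb_per_hour: float, required_window_hours: float) -> float:
    """Journal size needed to cover the required window at a steady write rate."""
    return write_rate_gb_per_hour * required_window_hours


# App 1: 5 GB of journal currently reaches back 2 weeks; the SLA says 3 weeks.
low_rate = 5 / (2 * HOURS_PER_WEEK)
app1_needed = journal_needed_gb(low_rate, 3 * HOURS_PER_WEEK)

# App 2: 1 TB (1024 GB) of journal reaches back 30 minutes; the SLA says 15 minutes.
high_rate = 1024 / 0.5
app2_needed = journal_needed_gb(high_rate, 0.25)

print(f"App 1 needs roughly {app1_needed:.1f} GB of journal (has 5 GB)")
print(f"App 2 needs roughly {app2_needed:.0f} GB of journal (has 1024 GB)")
```

Under that linear assumption the tiny 5 GB journal is the one that is out of spec (it would need about 7.5 GB), while the 1 TB journal has about twice what its SLA requires.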

RP has no idea about what is good journal sizing or bad journal sizing, because this is simply a recovery time line.  You must decide whether it is good or bad, and then allocate additional journals as necessary.  Unlike other technology like snapshots, there is no concept of “not enough journal space” beyond your own personal SLAs. In this manner, by default RecoverPoint won’t let you know that you need more journal space for a given CG because it simply can’t know that.

Note: if you are regularly using the Test A Copy functionality for long periods of time (even though you really shouldn’t…), then you may run into sizing issues beyond just protection windows, as portions of the journal space are also used for that.  This is beyond the scope of this post, but just be aware that even if you are in spec from a protection window standpoint, you may need more journal space to support the test copy.

Required Protection Window

So RecoverPoint has no way of knowing whether you’ve allocated enough journal space to a given CG.  Folks on the pre-sales side have some nifty tools that can help with journal sizing by looking at data change rate, but this is really for the entire environment and hopefully before you bought it.

Luckily, RecoverPoint has a nice internal feature to alert you whether a given Consistency Group is within spec or not, and that is “Required Protection Window.”  This is a journal option within each copy and can be configured when a CG is created, or modified later.  Here is a pic of a CG without it.  Note that you can still see your current protection window here and make adjustments if you need to.

Here is where the setting is located.


And here is what it looks like with the setting enabled.


So if I need to recover back 1 hour on this particular app, I set it to 1 hour and I’m good.  If I need to recover back 24 hours, I set it that way and it looks like I need to allocate some additional journal space to support that.

Now this does not control behavior of RecoverPoint (unlike, say, the Maximum Journal Lag setting) – whether you are within or under your required protection window, RP still functions the same.  It simply alerts you that you are under your personally defined window for that CG.  And if you are under for too long, or maybe under it at all if it is a mission critical application, you may want to add additional journal space to extend your protection window so that you are within spec.  Again I repeat, this is only an alerting function and will not, by itself, do anything to “fix” protection window problems!


So bottom line: RP doesn’t – or more accurately can’t – know whether you have enough journal space allocated to a given CG because that only affects how long you can roll back for.  However, using the Required Protection Window feature, you can tell RP to alert you if you go out of spec and then you can act accordingly.

SAN vs NAS Part 4: The Layer Cake

Last post we covered the differences between NFS and iSCSI (NAS and SAN) and determined that we saw a different set of commands when interacting with a file.  The NFS write generated an OPEN command, while the iSCSI write did not.  In this post we’ll cover the layering of NAS (file or file systems) on top of SAN (SCSI or block systems) and how that interaction works.

Please note!  In modern computing systems there are MANY other layers than I’m going to talk about here.  This isn’t to say that they don’t exist or aren’t important, but just that we are focusing on a subset of them for clarity.  Hopefully.

First, take a look at the NFS commands listed here:

Notice that a lot of these commands reference files, and things that you would do with files like read and write, but also create, remove, rename, etc.

Compare this with the SCSI reference:

Notice that in the SCSI case, we still have read and write, but there is no mention of files (other than “filemarks”).  There is no way to delete a file with SCSI – because again we are working with a block device which is a layer below the file system.  There is no way to delete a file because there is no file.  Only addresses where data is stored.

As a potentially clumsy analogy (like I often wield!) think about your office desk.  If it’s anything like mine, there is a lot of junk in the drawers.  File storage is like the stuff in a drawer.  The space in a drawer can have a lot of stuff in it, or it can have a little bit of stuff in it.  If I add more stuff to the drawer, it gets more full.  If I take stuff out of the drawer, it gets less full.  There is meaning to how much stuff is in an individual drawer as a relation to how much more stuff I can put in the drawer.

Block storage, on the other hand, is like the desk itself.  There are locations to store things – the drawers.  However, whether I have stuff in a drawer or I don’t have stuff in a drawer, the drawer still exists.  Emptying out my desk entirely doesn’t cause my desk to vanish.  Or at least, I suspect it wouldn’t…I have never had an empty desk in my life.  There is no relationship to the contents of the drawers and the space the desk occupies.  The desk is a fixed entity.  An empty drawer is still a drawer.

To further solidify this file vs block comparison, take a look at this handsome piece of artwork depicting the layers:

Here is a representation of two files on my computer, a word doc and a kitty vid, and their relationship to the block data on disk.  Note that some disk areas have nothing pointing to them – these are empty but still zero filled (well…maybe, depending on how you formatted the disk).  In other words, these areas still exist!  They still have contents, even if that content is nothing.

When I query a file, like an open or read, it traverses the file system down to the disk level.  Now I’m going to delete the word doc.  In most cases, this is what is going to happen:

My document is gone as far as I can “see.”  If I try to query the file system (like look in the directory it was stored in) it is gone.  However on the disk, it still exists.  (Fun fact: this is how “undelete” utilities work – by restoring data that is still on disk but no longer has pointers from the file system.)  It isn’t really relevant that it is still on the disk, because from the system’s perspective (and the file system’s perspective) it doesn’t exist any more.  If I want to re-use that space, the system will see it as free and store something else there, like another hilarious kitten video.

Sometimes this will happen instead, either as you delete something (rarely) or later as a garbage collection process:

The document data has been erased and replaced with zeros.  (Fun fact: this is how “file shredder” programs work – by writing zeros (or a pattern) once (or multiple times) to the space that isn’t being actively used by files.)  Now the data is truly gone, but from the disk perspective it still isn’t really relevant because something still occupies that space.  From the disk’s perspective, something always occupies that space, whether it is kitty video data, document data, or zeros.  The file system (the map) is what makes that data relevant to the system.

This is a really high level example, but notice the difference in the file system level and the disk level.  When I delete that file, whether the actual disk blocks are scrubbed or left intact, the block device remains the same except for the configuration of the 1’s and 0’s.  All available addresses are still in place.  Are we getting closer to understanding our initial question?
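Here is a toy Python model of the file system level versus the disk level.  It is deliberately minimal and entirely hypothetical (tiny 8-byte blocks, a dictionary as the “map”, no real file system works this simply), but it shows the delete, undelete, and shred behavior described above.

```python
BLOCK_SIZE = 8  # tiny blocks to keep the example readable


class BlockDevice:
    """A fixed set of addresses; every address always holds something."""

    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]  # zero filled

    def read(self, address: int) -> bytes:
        return self.blocks[address]

    def write(self, address: int, data: bytes) -> None:
        # Pad short writes with zeros so every block stays exactly BLOCK_SIZE bytes
        self.blocks[address] = data.ljust(BLOCK_SIZE, b"\x00")[:BLOCK_SIZE]


class ToyFileSystem:
    """The map: file names pointing at block addresses."""

    def __init__(self, device: BlockDevice):
        self.device = device
        self.files = {}                               # name -> list of addresses
        self.free = list(range(len(device.blocks)))   # addresses with no file pointing at them

    def create(self, name: str, data: bytes) -> None:
        addresses = []
        for i in range(0, len(data), BLOCK_SIZE):
            address = self.free.pop(0)
            self.device.write(address, data[i:i + BLOCK_SIZE])
            addresses.append(address)
        self.files[name] = addresses

    def delete(self, name: str, scrub: bool = False) -> list:
        """Remove the pointers; optionally overwrite the blocks with zeros."""
        addresses = self.files.pop(name)
        if scrub:
            for address in addresses:
                self.device.write(address, b"")
        self.free.extend(addresses)
        return addresses


disk = BlockDevice(num_blocks=8)
fs = ToyFileSystem(disk)
fs.create("report.doc", b"quarterly numbers")
fs.create("kitty.vid", b"meow meow")

# Normal delete: the file is gone from the map, but the data is still on disk
doc_blocks = fs.delete("report.doc")
leftover = b"".join(disk.read(a) for a in doc_blocks)
print("report.doc in file system:", "report.doc" in fs.files)
print("still on disk:", leftover)

# Scrubbing delete: the blocks are overwritten with zeros
vid_blocks = fs.delete("kitty.vid", scrub=True)
print("kitty.vid blocks after scrub:", [disk.read(a) for a in vid_blocks])
print("addresses on the device:", len(disk.blocks))
```

In both cases the device still has the same eight addresses.  Only the map, and possibly the 1’s and 0’s, changed.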

Let’s move this example out a bit and take a look at an EMC VNX system from a NAS perspective.  This is a great example because there are both SAN/block (Fibre Channel) and NAS/file (CIFS/NFS) at the same time.  The connections look like this:


From my desktop, I connect via NFS to an interface on the NAS (the datamover) in order to access my files.  And the datamover has a fibre channel connection to the block storage controllers which is where the data is actually stored.  The datamover consumes block storage LUNs, formats them with appropriate file systems, and then uses that space to serve out NAS.  This ends up being quite similar to the layered file/disk example above when we were looking at a locally hosted file system and disk.

What does it look like when I read and write?  Simply like this:

My desktop issues a read or write via NFS, which hits the NAS, and the NAS then issues a read or write via SCSI over Fibre Channel to the storage processor.

Reads and writes are supported by SCSI, but what happens when I try to do something to a file like open or delete?

The same command conversion happens, but it is just straight reads and writes at the SCSI level.  It doesn’t matter whether the NAS is SAN attached like this one, or it just has standard locally attached disks.  This is always what’s going to happen because the block protocol and subsystems don’t work with files – only with data in addresses.
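A small Python sketch can illustrate that conversion.  The ToyDataMover below is a hypothetical NAS head of my own invention (it keeps a text directory in block 0, which no real datamover does), sitting on a block target that only understands reads and writes to addresses and logs each one.

```python
class LoggingBlockTarget:
    """Block storage: reads and writes to addresses, nothing else."""

    def __init__(self, num_blocks: int):
        self.blocks = [b""] * num_blocks
        self.log = []  # every block-level command that was issued

    def read(self, lba: int) -> bytes:
        self.log.append(("READ", lba))
        return self.blocks[lba]

    def write(self, lba: int, data: bytes) -> None:
        self.log.append(("WRITE", lba))
        self.blocks[lba] = data


class ToyDataMover:
    """Hypothetical NAS head: directory lives in block 0, one block per file."""

    DIRECTORY_LBA = 0

    def __init__(self, target: LoggingBlockTarget):
        self.target = target
        self.target.write(self.DIRECTORY_LBA, b"")  # "format" with an empty directory

    def _load_directory(self) -> dict:
        raw = self.target.read(self.DIRECTORY_LBA).decode()
        return {name: int(lba) for name, lba in
                (entry.split("=") for entry in raw.split(";") if entry)}

    def _store_directory(self, directory: dict) -> None:
        raw = ";".join(f"{name}={lba}" for name, lba in directory.items())
        self.target.write(self.DIRECTORY_LBA, raw.encode())

    def nfs_create(self, name: str, data: bytes) -> None:
        directory = self._load_directory()
        used = set(directory.values())
        lba = next(n for n in range(1, len(self.target.blocks)) if n not in used)
        self.target.write(lba, data)
        directory[name] = lba
        self._store_directory(directory)

    def nfs_read(self, name: str) -> bytes:
        return self.target.read(self._load_directory()[name])

    def nfs_remove(self, name: str) -> None:
        directory = self._load_directory()
        del directory[name]
        self._store_directory(directory)


target = LoggingBlockTarget(num_blocks=16)
nas = ToyDataMover(target)
nas.nfs_create("notes.txt", b"hello")
contents = nas.nfs_read("notes.txt")

before_remove = len(target.log)
nas.nfs_remove("notes.txt")
remove_commands = target.log[before_remove:]
print("block commands caused by the remove:", remove_commands)
print("distinct block commands seen overall:", {op for op, _ in target.log})
```

Removing the file shows up at the block level as nothing more than a read and a write of the directory block.  The block target never sees a “remove”, because it has no idea that files exist.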

By understanding this layering – what file systems (NAS) do vs what disks (SAN) do – you can better understand important things about their utility.  For instance, file systems have various methods to guarantee consistency, in spite of leveraging buffers in volatile memory.  If you own the file system, you know who is accessing data and how.  You have visibility into the control structure.  If the array has no visibility there, then it can’t truly guarantee consistency.  This is why e.g. block array snapshots and file array snapshots are often handled differently.  With NAS snapshots, the array controls the buffers and can easily guarantee consistent snapshots.  But for a block snapshot, the array can only take a picture of the disk right now regardless of what is happening in the file system.  It may end up with an inconsistent image on disk, unless you initiate the snapshot from the attached server and properly quiesce/clean the file system.

Back to the idea of control, because NAS systems manage the file side of things, they also have a direct understanding of who is trying to access what.  Not only does this give them the ability to provide some access control (unlike SAN, which just responds happily to any address requests it gets), it also explains why NAS is often ideal for multi-access situations.  If I have users trying to access the same share (or better yet, the same file), NAS storage is typically the answer because it knows who has what open.  It can manage things on that level.  For the SAN, not so much.  In fact if you want two hosts to access the same storage, you need to have some type of clustering (whether direct software or file system) that provides locks and checks.  Otherwise you are pretty much guaranteed some kind of data corruption as things are reading and writing over top of one another.  Remember, SAN and SCSI just let you read and write to addresses; they don’t provide the ability to open and own a file.
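To illustrate why two hosts on one LUN need clustering, here is a deliberately naive Python sketch.  The classes are hypothetical: each host behaves like a server with an ordinary, non-clustered file system that reads the free-space information once at mount time and then trusts its own cached copy.

```python
class SharedLun:
    """A LUN presented to more than one host; it just stores data at addresses."""

    def __init__(self, num_blocks: int):
        self.blocks = [None] * num_blocks


class NaiveHost:
    """Host with a non-clustered file system: no locks, no coordination."""

    def __init__(self, name: str, lun: SharedLun):
        self.name = name
        self.lun = lun
        # Free-block list is read once at "mount" and cached from then on
        self.free = [i for i, block in enumerate(lun.blocks) if block is None]
        self.files = {}  # this host's private idea of where its files live

    def write_file(self, filename: str, data: bytes) -> None:
        address = self.free.pop(0)
        self.lun.blocks[address] = data  # the LUN happily accepts the write
        self.files[filename] = address

    def read_file(self, filename: str) -> bytes:
        return self.lun.blocks[self.files[filename]]


lun = SharedLun(num_blocks=4)
host_a = NaiveHost("A", lun)
host_b = NaiveHost("B", lun)  # mounts the same LUN, sees the same free blocks

host_a.write_file("payroll.db", b"payroll data")
host_b.write_file("kittens.vid", b"kitten video")  # lands on the same address

print("host A reads payroll.db and gets:", host_a.read_file("payroll.db"))
```

Host A gets host B’s kitten video back when it asks for its payroll file.  Neither host did anything the LUN considered wrong, which is exactly the problem a cluster-aware file system or lock manager exists to solve.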

In part 5 I’ll provide a summary review and then some final thoughts as well.

SAN vs NAS Part 2: Hardware, Protocols, and Platforms, Oh My!

In this post we are going to explore some of the various options for SAN and NAS.


There are a couple of methods and protocols for accessing SAN storage.  One is Fibre Channel (note: this is not misspelled, the protocol is Fibre, the cables are fiber) where SCSI commands are encapsulated within Fibre Channel frames.  This may be direct Fibre Channel (“FC”) over a Fibre Channel fabric, or Fibre Channel over Ethernet (“FCoE”) which further encapsulates Fibre Channel frames inside ethernet.

With direct Fibre Channel you’ll need some FC Host Bus Adapters (HBAs), and probably some FC switches like Cisco MDS or Brocade (unless you plan on direct attaching a host to an array which most of the time is a Bad Idea).

With FCoE you’ll be operating on an ethernet network typically using Converged Network Adapters (CNAs).  Depending on the type of fabric you are building, the array side may still be direct FC, or it may be FCoE as well.  Cisco UCS is a good example of the split out, as generally it goes from host to Fabric Interconnect as FCoE, and then from Fabric Interconnect to array or FC switch as direct Fibre Channel.

It could also be accessed via iSCSI, which encapsulates SCSI commands within IP over a standard network.  And then there are some other odd mediums like InfiniBand, or direct attach via SAS (here we are kind of straying away from the SAN and are really just directly attaching disks, but I digress).

What kind of SAN you use depends largely on the scale and type of your infrastructure.  Generally if you already have FC infrastructure, you’ll stay FC.  If you don’t have anything yet, you may go iSCSI.  Larger and performance environments typically trend toward FC, while small shops trend towards iSCSI.  That isn’t to say that one is necessarily better than the other – they have their own positives and negatives.  For example, FC has its own learning curve with fabric management like zoning, while iSCSI connections are just point to point over existing networks that someone probably already knows.  The one thing I will caution against here is if you are going for iSCSI, watch out for 1Gb configurations – there is not a lot of bandwidth and the network can get choked VERY quickly.  I personally prefer FC because I know it well and trust its stability, but again there are positives and negatives.
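To put the 1Gb warning in numbers, here is a quick Python calculation.  It uses theoretical line rate only and decimal units (1 TB = 1,000,000 MB), and it ignores protocol overhead and contention, so real-world results will be worse.  The function names are just for this example.

```python
def line_rate_mb_per_s(link_gbps: float) -> float:
    """Theoretical maximum throughput of a link in megabytes per second."""
    return link_gbps * 1000 / 8  # 8 bits per byte


def transfer_hours(data_tb: float, link_gbps: float) -> float:
    """Best-case hours to move data_tb terabytes over a single link."""
    seconds = data_tb * 1_000_000 / line_rate_mb_per_s(link_gbps)
    return seconds / 3600


for gbps in (1, 10):
    print(f"{gbps} Gb link: {line_rate_mb_per_s(gbps):.0f} MB/s at best, "
          f"{transfer_hours(1, gbps):.2f} hours to move 1 TB")
```

A single 1Gb link tops out at 125 MB/s before any overhead, so a handful of busy hosts sharing it can fill the pipe very quickly.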

Back to the subject at hand – in all cases with SAN the recurring theme here is SCSI commands.  In other words, even though the “disk” might be a virtual LUN on an array 10 feet (or 10 miles) away, the computer is treating it like a local disk and sending SCSI disk commands to it.
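For a feel of what one of those SCSI disk commands looks like, here is a Python sketch that builds a READ(10) command descriptor block.  The 10-byte layout (opcode 0x28, a 4-byte logical block address, a 2-byte transfer length) is as I understand it from the SCSI block command set; the helper names are my own, and this only packs bytes, it does not talk to a real device.

```python
import struct

READ_10_OPCODE = 0x28

# opcode, flags, 4-byte LBA, group number, 2-byte transfer length, control
CDB_FORMAT = ">BBIBHB"


def build_read10(lba: int, num_blocks: int) -> bytes:
    """Pack a READ(10) CDB asking for num_blocks blocks starting at lba."""
    return struct.pack(CDB_FORMAT, READ_10_OPCODE, 0, lba, 0, num_blocks, 0)


def parse_read10(cdb: bytes) -> dict:
    """Unpack a READ(10) CDB back into its meaningful fields."""
    opcode, _flags, lba, _group, num_blocks, _control = struct.unpack(CDB_FORMAT, cdb)
    return {"opcode": opcode, "lba": lba, "num_blocks": num_blocks}


cdb = build_read10(lba=2048, num_blocks=8)
print("CDB bytes:", cdb.hex())
print("decoded:", parse_read10(cdb))
```

Notice what is in there: a starting address and a block count.  There is no file name, no path, and no user, whether those bytes travel over FC, FCoE, or iSCSI.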

Some array platforms are SAN only, like the EMC VMAX 10K, 20K, 40K series.  EMC XtremIO is another example of a SAN only platform.  And then there are non-EMC platforms like 3PAR, Hitachi, and IBM XIV.  Other platforms are unified, meaning they do both SAN and NAS.  EMC VNX is a good example of a unified array.  NetApp is another competitor in this space.  Just be aware that if you have a SAN only array, you can’t do NAS…and if you have a NAS only array (yes they exist, see below), you can’t do SAN.  Although some “NAS” arrays also support iSCSI…I’d say most of the time this should be avoided unless absolutely necessary.


NAS on the other hand is virtually always over an IP network.  This is going to use standard ethernet adapters (1Gb or 10Gb) and standard ethernet switches and IP routers.

As far as protocols there is CIFS, which is generally used for Windows, and NFS which is generally used on the Linux/Unix/vSphere side.  CIFS has a lot of tie-ins with Active Directory, so if you are a Windows shop with an AD infrastructure, it is pretty easy to leverage your existing groups for permissions.  NFS doesn’t have these same ties with AD, but does support NIS for some authentication services.

The common theme on this side of the house is “file” which can be interpreted as “file system.”  With CIFS, generally you are going to connect to a “share” on the array, like \\MYARRAY1\MYAWESOMESHARE.  This may be just through a file browser for a one time connection, or this may be mounted as a drive letter via the Map Network Drive feature.  Note that even though it is mounted as a drive letter, it is still not the same as an actual local disk or SAN attached LUN!

For NFS, an “export” will be configured on the array and then mounted on your computer.  This actually gets mounted within your file system.  So you may have your home directory in /users/myself, and you create a directory “backups” and mount an export to it doing something like mount -t nfs <array>:<export> /users/myself/backups (where <array>:<export> is a placeholder for the NFS server and the exported path).  Then you access any files just as you would any other ones on your computer.  Again note that even though the NFS export is mounted within your file system, it is still not the same as an actual local disk or SAN attached LUN!

Which type of NAS protocol you use is generally determined by the majority of your infrastructure – whether it is Windows or *nix.  Or you may run both at once!  Running and managing both NFS and CIFS is really more of a hurdle with understanding the protocols (and sometimes licensing both of them on your storage array), whereas the choice to run both FC and iSCSI has hardware caveats.

For NAS platforms, we again look to the unified storage like EMC VNX.  There are also NAS gateways that can be attached to a VMAX for NAS services.  EMC also has a NAS only platform called Isilon.

One thing to note is that if your array doesn’t support NAS (say you have a VMAX or XtremIO) the gateway solution is definitely viable and enables some awesome features, but it is also pretty easy to spin up a Windows/Linux VM, or use a Windows/Linux physical server (but seriously, please virtualize!) that uses array block storage, but then serves up NAS itself.  So you could create a Windows file server on the VMAX and then all your NAS clients would connect to the Windows machine.

The reverse is not really true…if your array doesn’t support SAN, it is difficult to wedge SAN into the environment.  You can always do NFS with vSphere, but if you need block storage you should really purchase some infrastructure for it.  iSCSI is a relatively simple thing to insert into an existing environment, just again beware 1Gb bandwidth.


One final note I wanted to mention is about protection.  There are methods for replicating file and block data, but many times these are different mechanisms, or at least they function in different ways.  For instance, EMC RecoverPoint is a block replication solution.  EMC VNX Replicator is a file replication solution.  RP won’t protect your file data (unless you franken-config it to replicate your file LUNs), and Replicator won’t protect your block data.  NAS supports NDMP while SAN generally does not.  Some solutions, like NetApp snapshots, do function on both file and block volumes, but they are still very different in how they are taken and restored…block snapshots should be initiated from the host the LUN is mounted to (in order to avoid disastrous implications regarding host buffers and file system consistency) while file snapshots can be taken from any old place you please.

I say all this just to say, be certain you understand how your SAN and NAS data is going to be protected before you lay down the $$$ for a new frame!  It would be a real bummer to find out you can’t protect your file data with RecoverPoint after the fact.  Hopefully your pre-sales folks have you covered here but again be SURE!


We’ve drawn a lot of clear distinctions between SAN and NAS, which kind of fall back into the “bullet point” message that I talked about in my first post.  All that is well and good, but here is where the confusion starts to set in: in both NAS cases (CIFS and NFS), on your computer the remote array may appear to be a disk.  It may look like a local hard drive, or even appear very similar to a SAN LUN. This leads some people to think that they are the same, or at least are doing the same things.  I mean, after all, they even have the same letters in the acronym!

However, your computer never issues SCSI commands to a NAS.  Instead it issues commands to the remote file server for things like create, delete, read, write, etc.  Then the remote file server issues SCSI (block) commands to its disks in order to make those requests happen.

In fact, a major point of understanding here is, “who has the file system?”  This will help you understand who can do what with the data.  In the next post we are going to dive into this question head first in a Linux lab environment.


In a previous post in the not so distant past I showed how to leverage some Linux CLI tools to make the VNX control station work a little harder and let you work a little easier.  Mainly this was around scripting multiple file replications at the same time.  I noted in the post that I like to leave the SSH window up and just let them run so I can see them complete.  And I also noted that if you didn’t want to do this, the -background flag should queue the tasks in the NAS scheduler and let you go about your merry business.

I have now used the -background flag and wanted to mention two important things about it.

  1. Less important than the next point, but still worth mentioning: using the -background flag still takes a while to run the commands.  I was expecting them to complete one after another in short order…not so much.  Not nearly as bad as actually waiting for the normal replication tasks to finish, but still not optimal.
  2. Most importantly, after queueing up 33 file replication tasks in the NAS scheduler, I came back to find out that only three of them had succeeded.  All the rest had failed.

The commands were valid and I’m not sure exactly what caused them to fail.  Maybe there is a good reason for it.  I have a gut feeling that the NAS scheduler has a timeout for tasks and these nas_replicate commands exceed it (because some of them take a really, really long time to finish).  Unfortunately I didn’t have the time to investigate so I went back to the CLI without the -background flag.  This worked just fine, but it takes a very long time to schedule these puppies because the tasks take so long to run.  In a time crunch, or “it’s 5:15 and I want to launch this, logout, and hit the road” situation, it might not be ideal.

So, what if you want a reliable method of kicking off a bunch of replication tasks (or a bunch of tasks on any *nix box…again I’m mainly trying to demonstrate the value of Linux CLI tools) via SSH and then letting them run through?  Once again let’s do some thinking and testing.  Note: I am on a lab box and am not responsible for anything done to your system.  All of these commands can be run on any Linux box to test/play with.

I need something that will take “a while” to run and produce some output.  In order to do this I create a simple test script with the following line:

for i in {1..10}; do echo $i; sleep 5; done

In bash, this will loop 10 times, and each time it loops it will print the value of $i (which will go 1, 2, 3…all the way up to 10) and then sleep, or wait, for 5 seconds.  The whole script will take about 50 seconds to run, which is perfect for me.  Long enough for testing but not so long that I’ll be falling asleep at the keyboard waiting for the test script to complete.

Now, because my plan is to run this guy and then log out, I want to track the output to something other than my screen.  I’ll redirect the output to a file using the > operator:

for i in {1..10}; do echo $i > outfile; sleep 5; done

Now if I run bash the screen basically just sits there for 50 seconds, and then it returns the prompt to me.  If I check the contents of outfile then I see my output, which should be 1 2 3 4….etc. right?

$ bash
$ cat outfile

Or not!  I only have 10 in the outfile, because my script is doing exactly what I told it to do!  You’ll find this is a common “problem” with your scripts. 🙂  My echo $i > outfile overwrites outfile every time that line runs.  Instead I want to create a new outfile every time the script runs, and then append (>> operator) the numeric output while my script is running.  No problem:

$ cat
echo " " > outfile
for i in {1..10}; do echo $i >> outfile; sleep 5; done
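The difference between the two redirect operators is easy to see in isolation.  Here is a minimal sketch, shortened to three iterations with no sleep; the scratch directory and the file names overwrite.out and append.out are made up for illustration.

```shell
# Scratch directory so the demo does not touch real files (names are illustrative).
demo_dir=$(mktemp -d)

# ">" truncates the file on every iteration, so only the last value survives.
for i in 1 2 3; do echo "$i" > "$demo_dir/overwrite.out"; done

# Truncate once up front, then ">>" appends one line per iteration.
: > "$demo_dir/append.out"
for i in 1 2 3; do echo "$i" >> "$demo_dir/append.out"; done

echo "overwrite.out holds: $(cat "$demo_dir/overwrite.out")"
echo "append.out holds $(wc -l < "$demo_dir/append.out" | tr -d ' ') lines"
```

Truncating with : > once before the loop gives you a fresh file on every run without leaving a stray line at the top.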

Testing is really, really important because while you can always rely on the computer to do exactly what you tell it to do, you cannot always rely on yourself to tell it what you are expecting it to do.  Now when I run my script I get this in outfile:

$ cat outfile


OK this is what I had envisioned.  Another problem – when I run this script it locks my terminal so I am actually unable to exit the SSH session (which is what I’m hoping to accomplish in the end!).  In order to make this work I’m going to need to run the script as a background task using the & operator.

$ bash &
[1] 14007
$ ps 14007
14007 pts/21   S      0:00 bash

When I do this, it immediately returns my prompt and gives me the process ID (or PID) that I can use with ps to answer the question “is it still running?”  Of course I can also just cat outfile and see where that is too.
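If you would rather have a script answer the “is it still running?” question than eyeball ps, something like the following sketch could work.  The sleep 2 is just a stand-in for the real job, and the variable names are my own.

```shell
# Start a short stand-in job in the background (sleep 2 replaces the real script here).
sleep 2 &
pid=$!    # $! is the PID of the most recently started background job

# kill -0 sends no signal; it only reports whether the process exists.
if kill -0 "$pid" 2>/dev/null; then state_before="running"; else state_before="finished"; fi

wait "$pid"    # block until the background job exits

if kill -0 "$pid" 2>/dev/null; then state_after="running"; else state_after="finished"; fi
echo "before wait: $state_before, after wait: $state_after"
```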

Now I’ve got everything I need.  I bash & and then exit and wait 50 seconds.  Then reconnect via SSH.  What do I see in outfile?  Why, I see my entire expected output, 1 through 10!  Awesome…except honestly I was expecting to not see it here.  If you read my About Me page, I state I really enjoy the learning experience (and readily admit I don’t know everything!).  This is a good example of learning something while trying to teach others.

You see, if you have done something like this before you may have seen where running the background task with & and then exiting doesn’t work.  As soon as you exit, the process is killed and your background task stops.  There is a workaround for this (called nohup which I’m about to go into) but this left me scratching my head as to why this was actually working without the workaround?  I thought this was default behavior.  To the Googlenator!

In this very helpful post, user nemo articulates why I’m not seeing what I normally see:

In case you’re using bash, you can use the command shopt | grep hupon to find out whether your shell sends SIGHUP to its child processes or not. If it is off, processes won’t be terminated, as it seems to be the case for you.

Heading back to the CS, I run this command to find out it is indeed off:

$ shopt | grep hupon
huponexit       off

Sweet.  This means that on a CS you may not even need the workaround.  However, you may not be working on a VNX Control Station at this same version, and perhaps this value is different among them.  Heck you may not be working on a VNX Control Station at all.  So if it were on, what would change?  Well let’s turn it on and see.  Again this is a lab environment!

$ shopt -s huponexit
$ shopt | grep huponexit
huponexit       on

Now once again I run the script as a background task, then exit, wait 50 seconds and then reconnect.

$ cat outfile


OK this is what I was expecting to see.  Even though I have run the command with an ampersand, as soon as I dropped my SSH connection it was killed.  In order to work around this, we need to use the nohup command along with it.

$ nohup bash &

Once again I exit, wait, and reconnect.  Now when I cat outfile I see all the numbers because my script continued running in the background despite the huponexit setting.
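For reference, here is a self-contained sketch of the nohup pattern with the output captured in a log.  The inline loop stands in for the test script and the log path comes from mktemp, so both are placeholders rather than anything specific to the Control Station.

```shell
# Illustrative log path; the inline loop stands in for the real test script.
log=$(mktemp)

# nohup makes the job ignore SIGHUP; redirecting stdin, stdout, and stderr
# keeps it from holding on to the terminal or creating nohup.out.
nohup sh -c 'for i in 1 2 3; do echo "$i"; done; sleep 1' < /dev/null > "$log" 2>&1 &
job_pid=$!

# In real use you would log out here. For the demo, wait and inspect the log.
wait "$job_pid"
cat "$log"
```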

Finally, briefly, I wanted to mention that it is possible to run background tasks via ssh in one line when connecting from another host, but you will notice that they don’t actually return your shell to you.  E.g.:

$ ssh user@host "nohup bash &"

You won’t get a prompt returned here until the remote task finishes because SSH won’t drop the connection when I/O streams are open.  Instead try redirecting the I/O streams per the suggestion here:

$ ssh -n user@host "nohup bash > /dev/null 2>&1 &"

This will immediately return your prompt.


So, what have we learned?

  1. It is possible to run scripts on a Linux box that will continue to run after SSH is dropped using the & operator to background them
  2. If the huponexit flag is set, you will need the nohup command to keep the script running after exit
  3. If you are running a one-liner via SSH you will need to redirect your I/O streams in order to effectively return your prompt after the command kicks off
  4. On a VNX Control Station the -background flag apparently is not so great at actually completing your requested commands

Now all I need to do is use this knowledge with the script generation from the first post (obviously you would need to update with your pool IDs, VDM name, etc.) with one minor modification: the >> at the end of the loop, which appends each generated command to a script file instead of printing it to the screen:

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}' > /home/nasadmin/fsout.txt
for fsname in `cat /home/nasadmin/fsout.txt`; do echo nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001 >>; done

Then if you cat you should see all of your file systems and the replication commands you generated for them.  And then finally you should be able to bash & then log out and it should process through.
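The generate-then-run idea can also be wrapped in a small function.  This is only a sketch: gen_repl_script and its arguments are hypothetical names of mine, and the pool ID, VDM name, and interconnect ID are the same made-up values used above.

```shell
# gen_repl_script NAMES_FILE OUT_SCRIPT
# Writes one nas_replicate command per file system name into OUT_SCRIPT.
# Pool ID, VDM name, and interconnect ID are the made-up values from this post.
gen_repl_script() {
  names_file=$1
  out_script=$2
  : > "$out_script"                      # start with an empty script file
  while IFS= read -r fsname; do
    [ -n "$fsname" ] || continue         # skip blank lines
    echo "nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001" >> "$out_script"
  done < "$names_file"
}
```

You would call it with the name list and a target script of your choosing, review the generated file, and only then run it in the background.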


This post was again not really about “how to run file system replication tasks on a VNX Control Station via a script that you can leave running while you get your coffee,” though if this nails that for you I’m super happy.  This post was about demonstrating the crazy abilities that are available to you when you leverage the Linux CLI.  It is, no exaggeration, very hard to overstate how powerful this is for system, network, and storage administrators, especially considering a lot of hardware and appliance CLIs are at least Linux-like.

I would also suggest tinkering with keyed SSH.  I may cover this in a future post, but briefly this will allow you to establish a trust of sorts between some users on systems that allows you to remotely connect with encryption, but without requiring you to enter a password.  Several things support SSH (and keyed SSH) but don’t have the full bash CLI suite behind them – off the top of my head I know this is true for NetApp filers and Cisco switches.  Keyed SSH will allow you to run commands or scripts from a trusted Linux host, and use one-liner SSH calls to execute commands on remote hardware without having to enter passwords (or keep passwords in plaintext scripts, as can happen with expect scripts).  If you can learn and leverage scripting, this is the gateway into Poor Man’s Automation.

Thoughts on Thin Provisioning

I finally decided to put all my thoughts down on the topic of thin provisioning.  I wrestled with this post for a while because some of what I say is going to go kinda-sorta against a large push in the industry towards thin provisioning.  This is not a new push; it has been happening for years now.  This post may even be a year or two too late…

I am not anti-thin – I am just not 100% pro-thin.  I think there are serious questions that need to be addressed and answered before jumping on board with thin provisioning.  And most of these are relatively non-technical; the real issue is operational.

Give me a chance before you throw the rocks.

What is Thin Provisioning?

First let’s talk about what thin provisioning is, for those readers who may not know.  I feel like this is a pretty well known and straightforward concept so I’m not going to spend a ton of time on it.  Thin provisioning at its core is the idea of provisioning storage space “on demand.”

Before thin provisioning a storage administrator would have some pool of storage resources which gave some amount of capacity.  This could be simply a RAID set or even an actual pooling mechanism like Storage Pools on VNX.  A request for capacity would come in and they would “thick provision” capacity out of the pool.  The result would mean that the requested capacity would be reserved from the pooled capacity and be unavailable for use…except obviously for whatever purpose it was provisioned for.  So for example if I had 1000GB and you requested a 100GB LUN, my remaining pool space would be 900GB.  I could use the 900GB for whatever I wanted but couldn’t encroach into your 100GB space – that was yours and yours alone.  This is a thick provisioned LUN.

Of course back then it wasn’t “thick provisioning,” it was just “provisioning” until thin came along! With thin provisioning, after the request is completed and you’ve got your LUN, the pool is still at 1000GB (or somewhere very close to it due to metadata allocations which are beyond the scope of this post).  I have given you a 100GB LUN out of my 1000GB pool and still I have 1000GB available.  Remember that as soon as you get this 100GB LUN, you will usually put a file system on it and then it will appear empty.  This emptiness is the reason that the 100GB LUN doesn’t take up any space…there isn’t really any data on it until you put it there.

Essentially the thin LUN is going to take up no space until you start putting stuff into it.  If you put 10GB of data into the LUN, then it will take up 10GB on the back side.  My pool will now show 990GB free.  You should have a couple of indicators on the array like allocated or subscribed or committed and consumed or used.  Allocated/subscribed/committed is typically how much you as the storage administrator have created in the pool.  Consumed or used is how much the servers themselves have eaten up.

What follows are, in no particular order, some things to keep in mind when thin provisioning.

Communication between sysadmin and storage admin

This seems like a no-brainer but a discussion needs to happen between the storage admins providing the storage and the sysadmins who are consuming it.  If a sysadmin is given some space, they typically see this as space they can use for whatever they want.  If they need a dumping ground for a big ISO, they can use the SAN attached LUN with 1TB of free space on it.  Essentially they will likely feel that space you’ve allocated is theirs to do whatever they want with.  This especially makes sense if they’ve been using local storage for years.  If they can see disk space on their server, they can use it as they please.  It is “their” storage!

You need to have this conversation so that sysadmins understand activities and actions that are “thin hostile.”  A thin hostile action is one that effectively nullifies the benefit of thin provisioning by eating up space from day 1.  An example of a thin hostile action would be hard formatting the space a 500GB database will use up front, before it is actually in use.  Another example of a thin hostile action would be to do a block level zero formatting of space, like Eager Zero Thick on ESX.  And obviously using excess free space on a LUN for a file dumping ground is extremely thin hostile!

Another area of concern here is deduplication.  If you are using post-process deduplication, and you have thin provisioned storage, your sysadmins need to be aware of this when it comes to actions that would overwrite a significant amount of data.  You may dedupe their data space by 90%, but if they come in and overwrite everything it can balloon quickly.

The more your colleagues know about how their actions can affect the underlying storage, the less time you will spend fire fighting.  Good for them, good for you.  You are partners, not opponents!

Oversubscription & Monitoring

With thin provisioning, because no actual reservation happens on disk, you can provision as much storage as you want out of as small a pool as you want.  When you exceed the physical media, you are “oversubscribing” (or overcommitting, or overprovisioning, or…).  For instance, with your 1000GB you could provision 2000GB of storage.  In this case you would be 100% oversubscribed.  You don’t have issues as long as the total used or consumed portion is less than 1000GB.
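The arithmetic is simple enough to script.  A rough sketch, using integer math and two helper names I made up (subscribed_pct and oversubscribed_pct):

```shell
# subscribed_pct PHYSICAL_GB PROVISIONED_GB
# Provisioned capacity as a whole-number percentage of physical capacity.
subscribed_pct() {
  echo $(( $2 * 100 / $1 ))
}

# oversubscribed_pct PHYSICAL_GB PROVISIONED_GB
# How far provisioning exceeds physical capacity (0 if it does not).
oversubscribed_pct() {
  pct=$(subscribed_pct "$1" "$2")
  if [ "$pct" -gt 100 ]; then echo $(( pct - 100 )); else echo 0; fi
}

subscribed_pct 1000 2000       # prints 200
oversubscribed_pct 1000 2000   # prints 100
```

Both numbers describe the same pool: 2000GB handed out on 1000GB of disk is 200% subscribed, or 100% oversubscribed.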

There are a lot of really appealing reasons for doing this.  Most of the time people ask for more storage than they really need…and if it goes through several “layers” of decision makers, that might amplify greatly.  Most of the time people don’t need all of the storage they asked for right off the bat.  Sometimes people ask for storage and either never use it or wait a long time to use it.  The important thing to never forget is that from the sysadmin’s perspective, that is space you guaranteed them!  Every last byte.

Oversubscription is a powerful tool, but you must be careful about it.  Essentially this is a risk-reward proposition: the more people you promise storage to, the more you can leverage your storage array, but the more you risk that they will actually use it.  If you’ve given out 200% of your available storage, that may be a scary situation when a couple of your users decide to make good on the promise of space you made to them.  I’ve seen environments with as much as 400% oversubscription.  That’s a very dangerous gamble.

Thin provisioning itself doesn’t provide much benefit unless you choose to oversubscribe.  You should make a decision on how much you feel comfortable oversubscribing.  Maybe you don’t feel comfortable at all (if so, are you better off thick?).  Maybe 125% is good for you.  Maybe 150%.  Nobody can make this decision for you because it hinges on too many internal factors.  The important thing here is to establish boundaries up front.  What is that magic number?  What happens if you approach it?

Monitoring goes hand in hand with this.  If you monitor your environment by waiting for users to email that systems are down, oversubscribing is probably not for you.  You need to have a firm understanding of how much you’ve handed out and how much is being used.  Again, establish thresholds, establish an action plan for exceeding them, and monitor them.
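A threshold check does not need to be fancy.  The sketch below assumes you can already pull consumed and physical capacity from your array by some means; pool_status is a hypothetical helper and the 70% threshold is only an example value.

```shell
# pool_status CONSUMED_GB PHYSICAL_GB WARN_PCT
# Prints OK while consumed space is below the threshold, otherwise ACT.
pool_status() {
  consumed_pct=$(( $1 * 100 / $2 ))
  if [ "$consumed_pct" -ge "$3" ]; then
    echo "ACT: pool is ${consumed_pct}% consumed (threshold ${3}%)"
  else
    echo "OK: pool is ${consumed_pct}% consumed (threshold ${3}%)"
  fi
}

pool_status 550 1000 70
pool_status 720 1000 70
```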

Establishing and sticking with thresholds like this really helps speed up and simplify decision making, and makes it very easy to measure success.  You can always re-evaluate the thresholds if you feel like they are too low or too high.

Also make sure your sysadmins are aware of whether you are oversubscribed or not, and what that means to them.  If they are planning on a massive expansion of data, maybe they can check with you first.  Maybe they requested storage for a project and waited 6 months for it to get off the ground – again they can check with you to make sure all is well before they start in on it.  These situations are not about dictating terms, but more about education.  Many other resources in your environment are likely oversubscribed.  Your network is probably oversubscribed.  If a sysadmin in the data center decided to suddenly multicast an image to a ton of servers on a main network line, you’d probably have some serious problems.  You probably didn’t design your network to handle that kind of network traffic (and if you did you probably wasted a lot of money).  Your sysadmins likely understand the potential DDoS effect this would generate, and will avoid it.  Nobody likes pain.

“Runway” to Purchase New Storage

Remember with thin provisioning you are generally overallocating and then monitoring (you are monitoring, aren’t you?) usage.  At some point you may need to buy more storage.

If you wait till you are out of storage, that’s no good right?  You have a 100% consumed pool, with a bunch of attached hosts that are thinking they have a lot more storage to run through.  If you have oversubscribed a pool of storage and it hits 100%, it is going to be a terrible, horrible, no good, very bad day for you and everyone around you.  At a minimum new writes to anything in that pool will be denied, effectively turning your storage read-only.  At a maximum, the entire pool (and everything in it) may go offline, or you may experience a variety of fun data corruptions.

So, you don’t want that.  Instead you need to figure out when you will order new storage.  This will depend on things like:

  • How fast is your storage use growing?
  • How many new projects are you implementing?
  • How long does it take you to purchase new storage?

The last point is sometimes not considered before it is too late.  When you need more storage you have to first figure out exactly what you need, then you need to spec it, then you need a quote, the quote needs approval, then purchasing, then shipping, then it needs to be racked/stacked, then implemented.  How long does this process last for your organization?  Again nobody can answer this but you.  If your organization has a fast turn around time, maybe you can afford to wait till 80% full or more.  But if you are very sluggish, you might need to start that process at 60% or less.

Another thing to consider is if you are a sluggish organization, you may save money by thick provisioning.  Consider that you may need 15TB of storage in 2 years.  Instead you buy 10TB of storage right off the bat with a 50% threshold.  As soon as you hit 5TB of storage used you buy another 10TB to put you at 20.  Then when you hit 10 you buy another 10TB to put you at 30.  Finally at 15TB you purchase again and hit 40TB.  If you had bought 20 to begin with and gone thick, you would have never needed to buy anything else.  This situation is probably uncommon but I wanted to mention it as a thought exercise.  Think about how the purchasing process will impact the benefit you are trying to leverage from thin provisioning.
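To play with this thought exercise using your own numbers, here is a small simulation sketch.  simulate_purchases is a made-up helper; it grows usage 1TB at a time and buys another block of capacity whenever usage reaches the threshold.

```shell
# simulate_purchases NEED_TB START_TB STEP_TB THRESHOLD_PCT
# Prints the total capacity owned by the time usage reaches NEED_TB.
simulate_purchases() {
  need=$1; cap=$2; step=$3; thr=$4; used=0
  while [ "$used" -lt "$need" ]; do
    used=$(( used + 1 ))
    # Buy another block as soon as usage reaches the threshold.
    if [ $(( used * 100 )) -ge $(( cap * thr )) ]; then
      cap=$(( cap + step ))
    fi
  done
  echo "$cap"
}

simulate_purchases 15 10 10 50   # prints 40
```

With the numbers from the example (15TB needed, 10TB blocks, 50% threshold) it lands on the same 40TB.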

Performance Implications

Simply – ask your vendor whether thin storage has any performance difference over thick.  The answer with most storage arrays (where you have an actual choice between thick and thin) is yes.  Most of the time this is a negligible difference, and sometimes the difference is only in the initial allocation – that is to say, the first write to a particular LBA/block/extent/whatever.  But again, ask.  And test to make sure your apps are happy on thin LUNs.

Feature Implications

Thin provisioning may have feature implications on your storage system.

Sometimes thin provisioning enables features.  On a VMAX, thin provisioning enables pooling of a large number of disks.  On a VNX thin provisioning is required for deduplication and VNX Snapshots.

And sometimes thin provisioning either disables or is not recommended with certain features.  On a VNX thin LUNs are not recommended for use as File OE LUNs, though you can still do thin file systems on top of thick LUNs.

Ask what impact thin vs thick will have on array features – even ones you may not be planning to use at this very second.

Thin on Thin

Finally, in virtualized environments, in general you will want to avoid “thin on thin.”  This means thin virtual disks on a datastore that itself sits on a thin LUN.  The reason is that you tend to lose a static point of reference for how much capacity you are overprovisioning.  And if your virtualization team doesn’t communicate well with the storage team, they could be unknowingly crafting a time bomb in your environment.

Your storage team might have decided they are comfortable with a 200% oversubscription level, and your virt team may have made this same decision.  This will potentially overallocate your storage by 400%!  Each team is sticking to their game plan, but without knowing and monitoring the other folks they will never see the train coming.
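The two ratios multiply rather than add, which is what makes this sneaky.  A quick sketch (effective_pct is a hypothetical helper; both inputs are subscription levels expressed as percentages):

```shell
# effective_pct STORAGE_PCT VIRT_PCT
# Combined subscription level when one thin layer sits on top of another.
effective_pct() {
  echo $(( $1 * $2 / 100 ))
}

effective_pct 200 200   # prints 400
effective_pct 200 100   # prints 200 (one side kept thick)
```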

You can get away with thin on thin if you have excellent monitoring, or if your storage and virt admins are one and the same (which is common these days).  But my recommendation still continues to be thick VMs on thin datastores.  You can create as many thin datastores as you want, up to system limits, and then create lazy zeroed thick VMs on top of them.

Edit: this recommendation assumes that you are either required or compelled to use thin storage.  Thin VMs on thick storage are just as effective, but sometimes you won’t have a choice in this matter.  The real point is keeping one side or the other thick gives you a better point of reference for the amount of overprovisioning.


Hopefully this provided some value in the form of thought processes around thin provisioning.  Again, I am not anti-thin; I think it has great potential in some environments.  However, I do think it needs to be carefully considered and thought through when it sometimes seems to be sold as a “just thin provision, it will save you money” concept.  It really needs to be fleshed out differently for every organization, and if you take the time to do this you will not only better leverage your investment, but you can avoid some potentially serious pain in the future.

VNX File + Linux CLI

If you can learn Linux/UNIX command line and leverage it in your job, I firmly believe it will make you a better, faster, more efficient storage/network/sysadmin/engineer.  egrep, sed, awk, and bash are extremely powerful tools.  The real trick is knowing how to “stack” up the tools to make them do what you want…and not bring down the house in the process.  Note: I bear no responsibility for you bringing your house down!

Today I was able to leverage this via the VNX Control Station CLI.  I had a bunch of standard file system replications to set up and Unisphere was dreadfully slow.  If you find yourself in this situation, give the following a whirl.  I’m going to document my thought process as well, because I think this is equally as important as knowing how to specifically do these things.

First what is the “create file replication” command?  A quick browse through the man pages, online, or the Replicator manual gives us something like this:

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=DESTINATIONPOOLID -vdm DESTINATIONVDMNAME -interconnect id=INTERCONNECTID

Looking at the variable data in CAPITAL LETTERS, the only thing I really care about changing is the replication name and file system name.  In fact I usually use the file system name for the replication name…I feel like this does what I need it to unless you are looking at a complex Replicator set up.  So if I identify the destination pool ID (nas_pool -list), the destination vdm name (nas_server -list -vdm), and the interconnect ID (nas_cel -interconnect -list) then all I’m left with is needing the file system name.

So the command would look like (in my case, with some made up values):

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Pretty cool – at this point I can just replace the name itself if I wanted and still get through it much faster than through Unisphere.  But let’s go a little further.

I want to automate the process for a bunch of different things in a list.  And in order to do that, I’ll need a for loop.  A for loop in bash goes something like this:

for i in {0..5}; do echo i is $i; done

This reads in English, “for every number in 0 through 5, assign the value to the variable $i, and run the command ‘echo i is $i’.”  If you run that line on a Linux box, you’ll see:

i is 0
i is 1
i is 2
i is 3
i is 4
i is 5

Now we’ve got our loop so we can process through a list.  What does that list need to be?  In our case that list needs to be a list of file system names.  How do we get those?

We can definitely use the nas_fs command but how is a bit tricky.  nas_fs -l will give us all the file system names, but it will truncate them if they get too long.  If you are lucky enough to have short file system names, you might be able to get them out of here.  If not, the full name would come from nas_fs -info -all.  Unfortunately that command also gives us a bunch of info we don’t care about like worm status and tiering policy.

Tools to the rescue!  What we want to do is find all lines that have “name” in them and the tool for that is grep.  nas_fs -info -all | grep name will get all of those lines we want.  Success!  We’ve got all the file system names.

name      = root_fs_1
name      = root_fs_common
name      = root_fs_ufslog
name      = root_panic_reserve
name      = root_fs_d3
name      = root_fs_d4
name      = root_fs_d5
name      = root_fs_d6
name      = root_fs_2
name      = root_fs_3
name      = root_fs_vdm_cifs-vdm
name      = root_rep_ckpt_68_445427_1
name      = root_rep_ckpt_68_445427_2
name      = cifs
name      = root_rep_ckpt_77_445449_1
name      = root_rep_ckpt_77_445449_2
name      = TEST
name      = TestNFS

Alas they are not as we want them, though.  First of all we have a lot of “root” file systems we don’t like at all.  Those are easy to get rid of.  We want all lines that don’t have root in them, and once again grep to the rescue with the -v or inverse flag.

nas_fs -info -all | grep name | grep -v root

name      = cifs
name      = TEST
name      = TestNFS

Closer and closer.  Now the problem is the “name   =” part.  Now what we want is only the 3rd column of text.  In order to obtain this, we use a different tool – awk.  Awk has its own language and is super powerful, but we want a simple “show me the 3rd column” and that is going to just be tacked right on the end of the previous command.

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}'


Cool, now we’ve got our file system names.  We can actually run our loop on this output, but I find it easier to send it to a file and work with it.  Just run the command and point the output to a file like so:

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}' > /home/nasadmin/fsout.txt

This way you can directly edit the fsout.txt file if you want to make changes.  Learning how these tools work is very important because your environment is going to be different and the output that gets produced may not be exactly what you want it to be.  If you know how grep, awk, and sed work, you can almost always coerce output however you want.
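As one example of coercing the output differently, the grep, grep -v, and awk stages can be folded into a single awk call.  This is an alternative sketch, not what I ran above; extract_fs_names is a made-up name, and the id and worm lines in the sample input are invented to stand in for the extra fields you don’t care about.

```shell
# extract_fs_names: read "key = value" lines on stdin and print the value of
# every "name" line whose value does not contain "root".
extract_fs_names() {
  awk '$1 == "name" && $3 !~ /root/ { print $3 }'
}

# Made-up sample input shaped like the listing shown earlier in this post.
extract_fs_names <<'EOF'
id        = 25
name      = root_fs_1
worm      = off
name      = cifs
name      = TEST
name      = TestNFS
EOF
```

Matching on the first field is a little stricter than grep name, which would also catch any other line that happens to contain the word “name.”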

Now let’s combine this output with ye olde for loop to finish out strong.  Note the ` below are backticks, not single quotes:

for fsname in `cat /home/nasadmin/fsout.txt`; do echo nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001; done

My output in this case is a series of commands printed to the screen because I left in the “echo” command:

nas_replicate -create cifs -source -fs cifs -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TEST -source -fs TEST -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TestNFS -source -fs TestNFS -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Exactly what I wanted.  Now if I want to actually run it rather than just printing them to the screen, I can simply remove the “echo” from the previous for loop.  This is a good way to validate your statement before you unleash it on the world.

If you are going to attempt this, look into the background flag as well which can shunt these all to the NAS task scheduler.  I actually like running them without the flag in this case so I can glance at putty and see progress.

If you haven’t played in the Linux CLI space before, some of this might be Greek.  Understandable!  Google it and learn.  There are a million tutorials on all of these concepts out there.  And if you are a serious Linux sysadmin you probably have identified a million flaws in the way I did things. 🙂  Such is life.

Sometimes there is a fine line with doing things like this, where you may spend more time on the slick solution than you would have just hammering it out.  In this made up case I just had 3…earlier I had over 30.  But solutions like this are nice because they are reusable, and they scale.  It doesn’t really matter whether I’m doing 1 replication or 10 or 40.  I can use this (or some variation of it) every time.

The real point behind this post wasn’t to show you how to use these tools to do replications via CLI, though if it helps you do that then great.  It was really to demonstrate how you can use these tools in the real world to get real work done.  Fast and consistent.

VNX2 Hot Spare Policy bug in Flare 33 .051

The best practice for VNX2 hot spares is one spare for every 30 drives in your array.  However, if you have a VNX2 on Flare 33 .051 release, you’ll notice that the “Recommended” default policy is 1 per 60.

This is a bug.  There has been no change in the recommendations from EMC.  If you want the policy to return to the recommended 1 per 30, you have to manually set it.

I noticed today when trying to do this via Unisphere that you can only set a 1 per 30 policy if you actually have 30 or more disks of a given type.  If you have 6 EFD disks, your options through Unisphere are 1 hotspare per 2, 3, 4, 5, 6, or 60 disks.  In order to set a 1 per 30 policy in this situation you must use navicli or naviseccli.

Get a list of the hotspare policy IDs:

navicli -h SPA_IP_ADDRESS hotsparepolicy -list

Set a policy ID to 1 per 30:

navicli -h SPA_IP_ADDRESS hotsparepolicy -set POLICY_ID_NUMBER -keep1unusedper 30 -o

Note that you only need to do this on SPA or SPB for each policy, not both.

I also wanted to quickly mention there isn’t a great danger in leaving this at 1 per 60, because the hot spare policy is really only a reporting mechanism.  E.g. if you leave the policy at 1 per 60, and you have 60 drives with two hot spares and 58 used data disks, AND you have two drives fail, both spares will kick in.  The hot spare policy does not control hot sparing behavior; it just reports compliance.  (Actually it will also prevent you from creating a storage pool that would violate the hot spare policy, but only if you don’t manually select disks…)

But I still like having the hot spare policy reflect the recommended best practice, and that is still one hotspare for every 30 disks.
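
For a quick sanity check of how many spares a given ratio implies, a little shell arithmetic does the job.  This is only a sketch: spares_needed is a made-up helper, and it assumes a partial group of drives still gets its own spare (it rounds up).  That is my reading of “one per 30,” not necessarily how the array works out compliance.

```shell
# Hypothetical helper: spares implied by a "1 per N" policy.
# $1 = number of drives of a given type, $2 = N (defaults to 30).
# Assumption: a partial group still needs a spare, so the result rounds up.
spares_needed() {
    drives="$1"
    per="${2:-30}"
    # Integer ceiling division: (drives + per - 1) / per
    echo $(( (drives + per - 1) / per ))
}
```

With this, spares_needed 6 works out to 1 spare for the 6 EFD disks mentioned earlier, and spares_needed 60 60 shows what the 1 per 60 default would ask for on 60 drives.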

Information taken from:

VNX, Dedupe, and You

Block deduplication was introduced in Flare 33 (VNX2).  Yes, you can save a lot of space.  Yes, dedupe is cool.  But before you go checkin’ that check box, you should make sure you understand a few things about it.

As always, nothing can replace reading the instructions before diving in:

h12209-vnx-deduplication-compression-wp.pdf

Lots of great information in that paper, but I wanted to hit the high points briefly before I go over the catches.  Some of these are relatively standard for dedupe schemes, some aren’t:

  • 8KB granularity
  • Pointer based
  • Hash comparison, followed by a bit-level check to avoid hash collisions
  • Post-process operation on a storage pool level
  • Each pass starts 12 hours after the last one completed for a particular pool
  • Only 3 processes allowed to run at the same time; any new ones are queued
  • If a process runs for 4 hours straight, it is paused and put at the end of the queue.  If nothing else is in the queue, it resumes.
  • Before a pass starts, if the amount of new/changed data in a pool is less than 64GB the process is skipped and the 12 hour timer is reset
  • Enabling and disabling dedupe are online operations
  • FAST Cache and FAST VP are dedupe aware << Very cool!
  • Deduped and non-deduped LUNs can coexist in the same pool
  • Space will be returned to the pool when one entire 256MB slice has been freed up
  • Dedupe can be paused, though this does not disable it
  • When dedupe is running if you see “0GB remaining” for a while, this is the actual removal of duplicate blocks
  • Deduped LUNs within a pool are considered a single unit from FAST VP’s perspective.  You can only set a FAST tiering policy for ALL deduped LUNs in a pool, not for individual deduped LUNs in a pool.
  • There is an option to set dedupe rate – this adjusts the amount of resources dedicated to the process (i.e. how fast it will run), not the amount of data it will dedupe
  • There are two Dedupe statistics – Deduplicated LUN Shared Capacity is the total amount of space used by dedupe, and Deduplication and Snapshot Savings is the total amount of space saved by dedupe

Performance Implications

Nothing is free, and this check box is no different.  Browse through the aforementioned PDF and you’ll see things like:

Block Deduplication is a data service that requires additional overhead to the normal code path.

Leaving Block Deduplication disabled on response time sensitive applications may also be desirable

Best suited for workloads of < 30% writes….with a large write workload, the overhead could be substantial

Sequential and large block random (IOs 32 KB and larger) workloads should also be avoided

But the best line of all is this:

it is suggested to test Block Deduplication before enabling it in production

Seriously, please test it before enabling it on your mission critical application.  There are space saving benefits, but they come with a performance hit.  Nobody can tell you without analysis whether that performance hit will be noticeable or detrimental.  Some workloads may even get a performance boost out of dedupe if they are very read oriented and highly duplicated – it is possible to fit “more” data into cache…but don’t enable it and hope it will happen.  Testing and validation are important!

Along with testing for performance, test for stability.  If you are using deduplication with ESX or Windows 2012, specific features (the XCOPY directive for VAAI, ODX for 2012) can cause deduped LUNs to go offline with certain Flare revisions.  Upgrade to .052 if you plan on using it with these specific OSes.  And again, validate, do your homework, and test test test!

The Dedupe Diet – Thin LUNs

Another thing to remember about deduplication is that all LUNs become thin.

When you enable dedupe, a LUN migration happens in the background to a thin LUN in the invisible dedupe container.  If your LUN is already thin, you won’t notice a difference here.  However, if the LUN is thick, it will become thin when the migration completes.  This totally makes sense – how could you dedupe a fully allocated LUN?

When you enable dedupe the status for the LUN will be “enabling.”  This means it is doing the LUN migration – you can’t see it in the normal migration status area.

Thin LUNs have slightly lower performance characteristics than thick LUNs. Verify that your workload is happy on a thin LUN before enabling dedupe.

Also keep in mind that this LUN migration requires 110% of the consumed space in order to migrate…so if you are hoping to dedupe your way out of a nearly full pool, you may be out of luck.
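
To put numbers on that, here is a minimal sketch of the headroom check in shell.  Both function names are invented for illustration, the figures are whole GB, and it assumes the 110% has to be available as free capacity in the pool.  You would plug in the consumed and free values you read off the array yourself.

```shell
# Hypothetical helpers for the 110% migration rule, in whole GB.

# $1 = consumed capacity of the LUN in GB.
# Prints the space the migration needs, rounded up to the next GB.
migration_space_needed() {
    consumed_gb="$1"
    echo $(( (consumed_gb * 110 + 99) / 100 ))
}

# $1 = consumed capacity of the LUN in GB, $2 = free capacity in the pool in GB.
# Succeeds (exit status 0) only if the pool has enough free space.
can_migrate() {
    needed_gb=$(migration_space_needed "$1")
    [ "$2" -ge "$needed_gb" ]
}
```

For example, a LUN with 1000 GB consumed needs 1100 GB by this math, so can_migrate 1000 900 fails while can_migrate 1000 1200 succeeds.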

One SP to Rule Them All

Lastly but perhaps most importantly – the dedupe container is owned by one SP.  This means that whenever you enable dedupe on the first LUN in a pool, that LUN’s owner becomes the Lord of Deduplication for that pool.  Henceforth, any LUNs that have dedupe enabled will be migrated into the dedupe container and will become owned by that SP.

This has potentially enormous performance implications with respect to array balance.  You need to be very aware of who the dedupe owner is for a particular pool.  In no particular order:

  • If you are enabling dedupe in multiple pools, the first LUN in each pool should be owned by differing SPs.  E.g. if you are deduping 4 different pools, choose an SPA LUN for the first one in two pools, and an SPB LUN for the first one in the remaining two pools.  If you choose an SPA LUN for the first LUN in all four pools, every deduped LUN in all four pools will be on SPA
  • If you are purchasing an array and planning on using dedupe in a very large single pool, depending on the amount of data you’ll be deduping you may want to divide it into two pools and alternate the dedupe container owner.  Remember that you can keep non-deduplicated LUNs in the pools and they can be owned by any SP you feel like
  • Similar to a normal LUN migration across SPs, after you enable dedupe on a LUN that is not owned by the dedupe container owner, you need to fix the default owner and trespass after the migration completes.  For example – the dedupe container in Pool_X is owned by SPA.  I enable dedupe on a LUN in Pool_X owned by SPB.  When the dedupe finishes enabling, I need to go to LUN properties and change the default owner to SPA.  Then I need to trespass that LUN to SPA.
  • After you disable dedupe on a LUN, it returns to the state it was in pre-dedupe.  If you needed to “fix” the default owner on enabling it, you will need to “fix” the default owner on disabling.
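
The first bullet above boils down to alternating the owner of the first deduped LUN from pool to pool.  As a planning aid only, that can be sketched in a few lines of shell.  plan_dedupe_owners is a hypothetical helper and the pool names are placeholders; it does not talk to the array, and plain alternation ignores how much data each pool will actually dedupe.

```shell
# Hypothetical planner: alternate which SP should own the first deduped LUN
# in each pool, so the dedupe containers are split between SPA and SPB.
# Arguments are pool names, in the order you plan to enable dedupe.
plan_dedupe_owners() {
    i=0
    for pool in "$@"; do
        # Even positions go to SPA, odd positions to SPB.
        if [ $(( i % 2 )) -eq 0 ]; then
            owner="SPA"
        else
            owner="SPB"
        fi
        echo "$pool: enable dedupe first on a LUN owned by $owner"
        i=$(( i + 1 ))
    done
}
```

Calling plan_dedupe_owners Pool_1 Pool_2 Pool_3 Pool_4 gives SPA for Pool_1 and Pool_3 and SPB for Pool_2 and Pool_4, which matches the two-and-two split described in the bullet.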

What If You Whoopsed?

What if you checked that box without doing your homework?  What if you are seeing a performance degradation from dedupe?  Or maybe you accidentally have everything on your array now owned by one SP?

The good news is that dedupe is entirely reversible (big kudos to EMC for this one).  You can uncheck the box for any given LUN and it will migrate back to its undeduplicated state.  If it was thick before, it becomes thick again.  If it was owned by a different SP before, it is owned by that SP again.

If you disable dedupe on all LUNs in a given pool, the dedupe container is destroyed and can be recreated by re-enabling dedupe on something.  So if you unbalanced an array on SPA, you can remove all deduplication in a given pool, and then enable it again starting with an SPB LUN.

Major catch here – you must have the capacity for this operation.  A LUN requires 110% of the consumed capacity to migrate, so you need free space in order to undo this.

Deduplication is a great feature and can save you a lot of money on capacity, but make sure you understand it before implementing!