Shedding Light on Storage Encryption

I’ve been noticing some fundamental misunderstandings around storage encryption – I see this most when dealing with XtremIO, although plenty of platforms support encryption (VNX2 and VMAX among them).  I hope this blog post will help someone who is missing the bigger picture make a better decision based on the tradeoffs.  This is not going to be a heavily technical post; it is intended to shed some light on the topic from a strategic angle.

Hopefully you already know, but encryption at a high level is a way to make data unreadable gibberish to everyone except an entity that is authorized to read it.  The types of storage encryption I’m going to talk about are Data At Rest Encryption (often abbreviated DARE or D@RE), in-flight encryption, and host-based encryption.  I’m talking in this post mainly about SAN (block) storage, but these concepts also apply to NAS (file) storage.  In fact, in-flight encryption is probably far more useful on a NAS array given the inherent security of FC fabrics.  But then there’s iSCSI, and things get cloudier.

Before I start: security is a tool, and like any tool it can be used wisely or poorly, with correspondingly different results.  Encryption is security, and not all security (or all encryption) is automatically a good thing.  Consider the idea of cryptographic erasure, by which data is “deleted” merely because it is encrypted and nobody has the key.  Ransomware thrives on this.  You are looking at a server with all your files on it, but without the key they may as well be deleted.  Choosing a security feature for no good business reason other than “security is great” is probably a mistake that is going to cause you headaches.

encryptionblogpic

Here is a diagram with 3 zones of encryption.  Notice that host-based encryption overlaps the other two – that is not a mistake as we will see shortly.

Data At Rest Encryption

D@RE of late typically refers to a storage array’s ability to encrypt data at the point of entry (write) and decrypt it on exit (read).  Sometimes this is done with ASICs on an array or I/O module, but it is often done with Self Encrypting Drives (SEDs).  However, the abstract concept of D@RE is simply that data is encrypted “at rest,” or while it is sitting on disk, on the storage array.

This might seem like a dumb question, but it is a CRUCIAL one that I’ve seen either not asked or answered incorrectly time and time again: what is the purpose of D@RE?  The point of D@RE is to prevent physical hardware theft from compromising data security.  So, if I nefariously steal a drive out of your array, or a shelf of drives out of your array, and come up with some way to attach them to another system and read them, I will get nothing but gibberish.

Now, keep in mind that this problem is typically far more of an issue on a small server system than it is on a storage array.  A small server might have just a handful of drives associated with it, while a storage array might have hundreds, or thousands.  And those drives are going to be in some form of RAID protection which leverages striping.  So even without D@RE, the odds of a single disk holding meaningful data are small, though admittedly the risk is still there.

More to the point, D@RE does not prevent anyone from accessing data on the array itself.  I’ve heard allusions to this idea that “don’t worry about hackers, we’ve got D@RE” which couldn’t be more wrong, unless you think hackers are walking out of your data center with physical hardware.  If the hackers are intercepting wire transmissions, or they have broken into servers with SAN access, they have access to your data.  And if your array is doing the encryption and someone manages to steal the entire array (controllers and all) they will also have access to your data.

D@RE at the array level is also one of the easiest to deal with from a management perspective, because usually you just let the array handle everything, including the encryption keys.  This is mostly a turn-it-on-and-let-it-run solution.  You don’t notice it and generally don’t see any fallout like performance degradation from it.

In-Flight Encryption

In-flight encryption is referring to data being encrypted over the wire.  So your host issues a write to a SAN LUN, and that traverses your SAN network and lands on your storage array.  If data is encrypted “in-flight,” then it is encrypted throughout (at least) the switching.

Usually this is accomplished with FC fabric switches that are capable of encryption.  So the switch that sees a transmission on an F port will encrypt it, and then transmit it encrypted along all E ports (ISLs) and then decrypt it when it leaves another F port.  So the data is encrypted in-flight, but not at rest on the array.  Generally we are still talking about ASICs here so performance is not impacted.

Again let’s ask, what is the purpose of in-flight encryption?  In-flight encryption is intended to prevent someone who is sniffing network traffic (meaning they are somehow intercepting the data transmissions, or a copy of the data transmissions, over the network) from being able to decipher data.

For local FC networks this is (in my opinion) not often needed.  FC networks tend to be very secure overall and not really vulnerable to sniffing.  However, for IP based or WAN based communication, or even stretched fabrics, it might be sensible to look into something like this.

Also keep in mind that because data is decrypted before being written to the array, in-flight encryption does not provide the physical security that D@RE does, nor does it prevent anyone from accessing data in general.  You also sometimes have the option of not decrypting when writing to the array.  In that case the data is encrypted when leaving the host and written encrypted on the array itself; it is only decrypted when the host issues a read for it and it exits the F port that host is attached to.  This gives you D@RE as well, with those same benefits.  The real kicker here becomes key management.  Plain in-flight encryption can be removed or disabled at any time without issue, because at the ends the data is unencrypted.  However, if the data is written encrypted on the array, then you MUST have those keys to read that data.  If you had some kind of disaster that compromised your switches and keys, you would have a big array full of cryptographically erased data.

Host Based Encryption

Finally, host-based encryption is any software or feature that encrypts LUNs or files on the server itself.  So data that is going to be written to files (whether SAN based or local files) is encrypted in memory before the write actually takes place.
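To make that concrete, here is a minimal sketch of the idea in Python, assuming the third-party cryptography package is installed.  Real host-based products (BitLocker, dm-crypt, and friends) do this transparently at the volume layer rather than per-file; this toy just shows the encrypt-in-memory-before-write flow, and why the keys now live with (and are the problem of) the host.

```python
# Minimal sketch of host-based encryption: data is encrypted in memory before
# the write ever leaves the server.  Assumes the third-party "cryptography"
# package (pip install cryptography).  Not any vendor's implementation.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # key management is now YOUR problem
cipher = Fernet(key)

plaintext = b"contents of a very important document"
ciphertext = cipher.encrypt(plaintext)      # encrypted in memory...
with open("important.enc", "wb") as f:
    f.write(ciphertext)                     # ...so only gibberish ever hits the LUN

# Reads reverse the process -- and are impossible without the key.
with open("important.enc", "rb") as f:
    restored = cipher.decrypt(f.read())
assert restored == plaintext
```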

Host-based encryption ends up giving you both in-flight encryption and D@RE as well.  So when we ask the question, what is the purpose of host-based encryption?, we get the benefits we saw from in-flight and D@RE, as well as another one.  That is the idea that even with the same hardware setup, no other host can read your data.  So if I were to forklift your array, fabric switches, and get an identical server (hardware, OS, software) and hook it up, I wouldn’t be able to read your data.  Depending on the setup, if a hacker compromises the server itself in your data center, they may not be able to read the data either.

So why even bother with the other kinds of encryption?  Well for one, generally host-based encryption does incur a performance hit because it isn’t using ASICs.  Some systems might be able to handle this but many won’t be able to.  Unlike D@RE or in-flight, there will be a measurable degradation when using this method.  Another reason is that key management again becomes huge here.  Poor key management and a server having a hardware failure can lead to that data being unreadable by anyone.  And generally your backups will be useless in this situation as well because you have backups of encrypted data that you can’t read without the original keys.

And frankly, usually D@RE is good enough.  If you have a security issue where host-based encryption is going to be a benefit, usually someone already has the keys to the kingdom in your environment.

Closing Thoughts

Hopefully that cleared up the types of encryption and where they operate.

Another question I see is “can I use one or more at the same time?”  The answer is yes, with caveats.  There is nothing that prevents you from using even all 3 at the same time, even though it wouldn’t really make any sense.  Generally you want to avoid overlapping because you are encrypting data that is already encrypted which is a waste of resources.  So a sensible pairing might be D@RE on the array and in-flight on your switching.

A final HUGELY important note – and what really prompted me to write this post – is to make sure you fully understand the effect of encryption on all of your systems.  I have seen this come up in a discussion about XtremIO D@RE paired with host-based encryption.  The question was “will it work?” but the question should have been “should we do this?”  Will it work?  Sure, there is nothing problematic about host-based encryption and XtremIO D@RE interacting, other than the XtremIO system encrypting already encrypted data.

What is problematic, though, is the fact that encrypted data does not compress, and most encrypted data won’t dedupe either…or at least not anywhere close to the level of unencrypted data.  And XtremIO generally relies on its fantastic inline compression and dedupe features to fit a lot of data on a small footprint.  XtremIO’s D@RE happens behind the compression and deduplication, so there is no issue.  However, host-based encryption happens ahead of the dedupe/compression and will absolutely destroy your savings.

So if you wanted to use the system like this, I would ask: how was it sized?  Was it sized with assumptions about good compression and dedupe ratios?  Or was it sized assuming no space savings?  And does the extra money you will be spending on the host-based encryption product, plus the extra money you will be spending on the additional required storage, justify the business problem you were trying to solve?  Or was there even a business problem at all?  A better fit would probably be something like a tiered VNX2 with FAST Cache, which could easily handle a lot of raw capacity and use the flash where it helps the most.
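If you want to see the compression/dedupe point for yourself, here is a quick toy demonstration (pure Python, nothing XtremIO-specific, and again assuming the cryptography package is installed): a highly compressible block stops compressing once encrypted, and two identical blocks encrypt to two different ciphertexts, so block-hash dedupe finds nothing in common.

```python
# Quick illustration (not an array test) of why host-based encryption kills
# inline compression and dedupe.
import zlib
from cryptography.fernet import Fernet

block = b"A" * 8192                      # very compressible, very dedupe-able
cipher = Fernet(Fernet.generate_key())

print(len(zlib.compress(block)))                  # plaintext shrinks to a few dozen bytes
print(len(zlib.compress(cipher.encrypt(block))))  # ciphertext doesn't shrink to anything useful

# Two identical blocks encrypt to two *different* ciphertexts (random IV),
# so an array hashing blocks for dedupe sees no duplicates at all.
print(cipher.encrypt(block) == cipher.encrypt(block))   # False
```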

Again, security is a tool, so choose the tools you need, use them judiciously, and make sure you fully understand their impact (end-to-end) in your environment.

VMware vSphere – Guest clock, replication, and squirrels

****PPFFFFFFFFFFPFPPPFFFPFFFFPPFFFP****

That’s me blowing the cobwebs off of my blog. 🙂

Time marches on and things are always changing.  I too have recently made several changes both in my personal and professional life.  I am now a delivery engineer for CDI Southeast, though we are going through another transition period ourselves.  I am also doing a bit of retooling, attempting to become a little less storage focused and branching out into other areas.  Right now I’m trying to focus more on vSphere but also working with some DevOps stuff as well.  I also released a course on Pluralsight last October which I am very proud of.  If you are interested in hearing me talk to you about VNX for hours, I encourage you to check it out.

So, great things hopefully coming down the pipe for me in the future and I hope to continue to share with the community at large that has given so much to me.

In the meantime, here is a quick nugget that I turned up a couple of weeks ago.  I was doing a pretty straightforward implementation of vSphere replication (and I hope to do a series on that soon), but ran into an oddity which I initially wrote off as a squirrel.

A squirrel is a situation where I walk into your building to do an implementation, and shortly thereafter you (or your co-worker, or your boss) comes up to me and says, “you know ever since you got here, the squirrels have been going crazy outside.  What did you do?”

The squirrel can be anything really, but most of the time is comically unrelated to anything I’m actually working on. And sometimes it is just people messing around with me.  “Hey you know Exchange is down?  JUST KIDDING!”

Side note – I’ve been on site with a jokey client where something really did go down and it took us a minute or two to figure out that there was really a problem. 

I’m used to squirrels now.  Honestly they rarely have merit but I always take them seriously “just in case.”

In this case, in the midst of replicating VMs in the environment, an admin asked me if anything I was doing would mess with the clock on a guest server.  I racked my brain for a moment and replied that I wouldn’t think so.  The admin didn’t think so either so he went off to investigate more.  I did some more thinking, and then went back and got some more information.  What exactly was happening to the clock?

In this case the server had a purposefully mis-set clock.  As in, it wasn’t supposed to read current time, but it kept getting set to the current time.  Then VMware Tools dawned on me, because there is a clock sync built into Tools that has to be disabled.  We double checked, but it was already disabled.  That made sense, because we hadn’t done anything with Tools (no new or updated Tools install).

So later that night I was playing around in my lab.  I recreated the setup as best I could. Installed a guest with tools, disabled time sync, and set the clock back a year and some months.  Then I started replication.  And instantly, the clock was set forward.

So it turns out that even if you tell the guest “don’t sync the clock to the host,” it will STILL sync the clock to the host in certain situations.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1189

While I understand the rationale (certain operations have the potential to skew the clock, ergo syncing the clock up after those will help prevent ongoing skew) I really feel like if time sync is disabled, it shouldn’t sync the clock.  Ever.  Or there should be another check box that says “No really, never sync the clock.”  Nevertheless, I don’t work for VMware so I can’t tell them how to run their product.

In this case the fix is pretty simple though it does require downtime.  Shutdown the guest and add some lines to the .vmx:

time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"
time.synchronize.tools.enable = "0"
time.synchronize.resume.host = "0"

Now it will really, really never mess with the clock on the guest.
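If you have a pile of guests to fix, a quick script beats hand-editing.  Here is a rough Python sketch that patches a .vmx file with those same settings – the path is hypothetical, the guest must be powered off, and this is just a throwaway helper of mine, not a VMware tool.

```python
# Rough sketch: rewrite the time.synchronize.* keys in a powered-off guest's .vmx.
TIME_SYNC_KEYS = {
    "time.synchronize.continue": "0",
    "time.synchronize.restore": "0",
    "time.synchronize.resume.disk": "0",
    "time.synchronize.shrink": "0",
    "time.synchronize.tools.startup": "0",
    "time.synchronize.tools.enable": "0",
    "time.synchronize.resume.host": "0",
}

def disable_time_sync(vmx_path):
    # Drop any existing time.synchronize.* lines, then append our values.
    with open(vmx_path) as f:
        lines = [l for l in f.readlines()
                 if l.split("=")[0].strip() not in TIME_SYNC_KEYS]
    lines += ['{} = "{}"\n'.format(k, v) for k, v in TIME_SYNC_KEYS.items()]
    with open(vmx_path, "w") as f:
        f.writelines(lines)

disable_time_sync("/vmfs/volumes/datastore1/myvm/myvm.vmx")  # hypothetical path
```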

This might be common knowledge to VMware admins but I had no idea and I suppose I’ve never dealt with a purposefully skewed clock before.

EMC Recoverpoint and XtremIO Part 4 – Recovery and Summary

In this final post we are going to cover a simple recovery, as well as do a quick summary.  I’ll throw in a few bonus details for free.

Recovery

Our CG has been running now for over 48 hours with our configuration – 48 hours Required Protection Window, 48 max snaps, one snap per hour.  Notice below that I have exactly (or just under, depending on how you measure) a 48 hour protection window.  I have one snap per hour for 48 hours and that is what is retained.  This is because of how I constructed my settings!

xsumm1

If I reduce my Required Protection Window to 24 hours, notice that IMMEDIATELY the snaps past 24 hours are nuked:

xsumm2

The distribution of snaps in this case wouldn’t be different because of how the CG is constructed (one snap per hour, 48 max snaps, 24 hour protection window = 1 snap per hour for 24 hours), but again notice that the Required Protection Window is much more than just an alerting setting in RP+XtremIO.

Alright, back to our recovery example.  Someone dumb like myself ignored all the “Important” naming and decided to delete that VM.

xsumm3

Even worse, they decided to just delete the entire datastore afterwards.

xsumm4

But lucky for us we have RP protection enabled.  I’m going to head to RP and use the Test a Copy and Recover Production button.

xsumm5

I’ll choose my replica volume:

xsumm6

Then I decide I don’t want to use the latest image because I’m worried that the deletion actually exists in that snapshot.  I choose one hour prior to the latest snap.  Quick note: see that virtual access is not even available now?  That’s because with snap based promotion there is no need for it.  Snaps are instantly promoted to the actual replica LUN, so physical access is always available and always immediate no matter how old the image.

xsumm7

After I hit next, it spins up the Test a Copy screen.  Now normally I might want to map this LUN to a host and actually check it to make sure that this is a valid copy.  In this case because, say, I’ve tracked the bad user’s steps through vCenter logging, I know exactly when I need to recover.  An important note though, as you’ll see in a second all snapshots taken AFTER your recovery image will be deleted!  But again, because I’m a real maverick I just tell it go to ahead and do the production recovery.

xsumm8

It gives me a warning that prod is going to be overwritten, and that data transfer will be paused.  It doesn’t warn you about the snapshot deletion but this has historically been RP behavior.

xsumm9

On the host side I do a rescan, and there’s my datastore.  It is unmounted at the moment so I’ll choose to mount it.

xsumm10

Next, because I deleted that VM I need to browse the datastore and import the VMX file back into vCenter.

xsumm11 xsumm12

And just like that I’ve recovered my VM.  Easy as pie!

xsumm13

Now, notice that I recovered using the 2:25 snap, and below this is now my snapshot list.  The 3:25 and the 2:25 snap that I used are both deleted.  This is actually kind of interesting because an awesome feature of XtremIO is that all snaps (even snaps of snaps) are independent entities; intermediate snaps can be deleted with no consequence.  So in this case I don’t necessarily think this deletion of all subsequent snaps is a requirement, however it certainly makes logical sense that they should be deleted to avoid confusion.  I don’t want a snapshot of bad data hanging around in my environment.

xsumm14

Summary

In summary, it looks like this snap recovery is fantastic as long as you take the time to understand the behavior.  Like most things, planning is essential to ensure you get a good balance of your required protection and capacity savings.  I hope for some more detailed breakdowns from EMC on the behavior of the snapshot pruning policies, and the full impact that settings like Required Protection Window have in the environment.

Also, don’t underestimate the 8,192 max snaps+vols for a single XMS system, especially if you are managing multiple clusters per XMS!  If I had to guess I would guess that this value will be bumped up in a future release considering these new factors, but in the meantime make sure you don’t overrun your environment.  Remember, you can still use a single XMS per cluster in order to sort of artificially inflate your snap ceiling.

Bonus Deets!

A couple of things of note.

First, in my last post I stated that I had noticed a bug with settings not “sticking.”  After talking with a customer, he indicated this doesn’t have to do with the settings (the values) but with the process itself.  Something about the order is important here.  And now I believe this to be true, because if I recreate a CG with those same busted settings, it works every time!  I can’t get it to break. 🙂  I still believe this to be a bug, so just double check your CG settings after creating.

Second, keep in mind that today the XtremIO dashboard displays your provisioned capacity based on the volumes and snapshots on the system, with no regard for who created those snaps.  So you can imagine that with a snap based recovery tool, things get out of hand quickly.  I’m talking about 1.4PB (no typo – PETAbytes) “provisioned” on a 20TB brick!

DC2_20T

While this is definitely a testament to the power (or insanity?) of thin provisioning, I’m trying to put in a feature request to get this fixed in the future because it really messes with the dashboard relevance.  But for the moment just note that for anything you protect with RP:

  • On the Production side, you will see a 2x factor of provisioning.  So if you protected 30TB of LUNs, your provisioned space (from those LUNs) will be 60TB.
  • On the Replica side, you will see a hilarious factor of provisioning, depending on how many snaps you are keeping.
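For a rough sense of the math, here is a back-of-the-napkin sketch using the ratios described in this series (a 2x factor on the production side, and a replica plus max_snaps + 1 snapshots on the replica side).  This is just arithmetic, not an official EMC formula.

```python
# Back-of-the-napkin: why the dashboard "provisioned" number balloons under RP.
def provisioned_tb(protected_tb, max_snaps):
    prod_side = protected_tb * 2                       # prod LUNs show a 2x factor
    replica_side = protected_tb * (1 + (max_snaps + 1))  # replica + retained snaps
    return prod_side + replica_side

print(provisioned_tb(30, 48))   # ~1,560 TB "provisioned" for 30 TB of protected LUNs
```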

I hope this series has been useful – I’m really excited about this new technology pairing!

EMC Recoverpoint and XtremIO Part 3 – Come CG With Me

In this post we are going to configure a local consistency group within XtremIO, armed with our knowledge of the CG settings.  I want to configure one snap per hour for 48 hours, 48 max snaps.

Because I’m working with local protection, I have to have the full featured licensing (/EX) instead of the basic (/SE) that only covers remote protection.  Note: these licenses are different than normal /SE and /EX RP licenses!  If you have an existing VNX with standard /SE, then XtremIO with /SE won’t do anything for you!

I have also already configured the system itself, so I’ve presented the 3GB repository volume, configured RP, and added this XtremIO cluster into RP.

All that’s left now is to present storage and protect!  I’ve got a 100GB production LUN I want to protect.  I have actually already presented this LUN to ESX, created a datastore, and created a very important 80GB Eager Zero Thick VM on it.

cgcreate0

First things first, I need to create a replica for my production LUN – this must be the exact same size as the production LUN, although that is always my recommendation with RP anyway.  I also need to create some journal volumes as well.  Because this isn’t a distributed CG, I’ll be using the minimum 10GB sizing.  Lucky for us, creating volumes on XtremIO is easy peasy.  Just a reminder – you must use 512 byte blocks instead of 4K, but you are likely using that already anyway due to the lack of 4K support.

cgcreate1

Next I need to map the volume.  If you haven’t seen the new volume screen in XtremIO 4.0, it is a little different.  Honestly I kind of like the old one, which was a bit more visual, but I’m sure I’ll come to love this one too.  I select all 4 volumes and hit the Create/Modify Mapping button.  Side note: notice that even though this is an Eager Zero’d VM, there is only 7.1MB used on the volume highlighted below.  How?  At first I thought this was the inline deduplication, but XtremIO does a lot of cool things, and one neat thing it does is discard all full-zero block writes coming into the box!  So EZTs don’t actually inflate your LUNs.

cgcreate2

Next I choose the Recoverpoint initiator group (the one that has ALL my RP initiators in it) and map the volume.  LUN IDs have never really been that important when dealing with RP, although in remote protection it can be nice to try to keep the local and remote LUN IDs matching up.  Trying to make both the host LUN IDs and the RP LUN IDs match up is a really painful process, especially in larger environments, for (IMO) no real benefit.  But if you want to take up that boulder, I won’t stop you, Sisyphus!

Notice I also get a warning because it recognizes that the Production LUN is already mapped to an existing ESX host.  That’s OK though, because I know with RP this is just fine.

cgcreate3

Alright now into Recoverpoint.  Just like always I go into Protection and choose Protect Volumes.

cgcreate4

These screens are going to look pretty familiar to you if you’ve used RP before.  On this one, for me typically CG Name = LUN name or something like it, Production name is ProdCopy or something similar, and then choose your RPA cluster.  Just like always, it is EXTREMELY important to choose the right source and destinations, especially with remote replication.  RP will happily replicate a bunch of nothing into your production LUN if you get it backwards!  I choose my prod LUN and then I hit modify policies.

cgcreate5

In modify policy, like normal I choose the Host OS (BTW I’ll happily buy a beer for anyone who can really tell me what this setting does…I always set it but have no idea what bearing it really has!) and now I set the maximum number of snaps.  This setting controls how many total snapshots the CG will maintain for the given copy.  If you haven’t worked with RP before this can be a little confusing because this setting is for the “production copy” and then we’ll set the same setting for the “replica copy.”  This allows you to have different settings in a failover situation, but most of the time I keep these identical to avoid confusion.  Anywho, we want 48 max snaps so that’s what I enter.

cgcreate6

I hit Next and now deal with the production journal.  As usual I select that journal I created and then I hit modify policy.

cgcreate7

More familiar settings here, and because I want a 48 hour protection window, that’s what I set.  Again based on my experience this is an important setting if you only want to protect over a specific period of time…otherwise it will spread your snaps out over 30 days.  Notice that snapshot consolidation is greyed out – you can’t even set it anymore.  That’s because the new snapshot pruning policy has effectively taken its place!

cgcreate8

After hitting next, now I choose the replica copy.  Pretty standard fare here, but a couple of interesting items in the center – this is where you configure the snap settings.  Notice again that there is no synchronous replication; instead you choose periodic or continuous snaps.  In our case I choose periodic and a rate of one per 60 minutes.  Again I’ll stress, especially in a remote situation it is really important to choose the right RPA cluster!  Naming your LUNs with “replica” in the name helps here, since you can see all volume names in Recoverpoint.

cgcreate9

In modify policies again we set that host OS and a max snap count of 48 (same thing we set on the production side).  Note: don’t skip over the last part of this post where I show you that sometimes this setting doesn’t apply!

cgcreate11

In case you haven’t seen the interface to choose a matching replica, it looks like this.  You just choose the partner in the list at the bottom for every production LUN in the top pane.  No different from normal RP.

cgcreate10

Next, we choose the replica journal and modify policies.

cgcreate12

Once again setting the required protection window of 48 hours like we did on the production side.

cgcreate13

Next we get a summary screen.  Because this is local it is kind of boring, but with remote replication I use this opportunity to again verify that I chose the production site and the remote site correctly.

cgcreate14

After we finish up, the CG is displayed like normal, except it goes into “Snap Idle” when it isn’t doing anything active.

cgcreate15

One thing I noticed the other day (and why I specifically chose these settings for this example) is that for some reason the replica copy policy settings aren’t getting set correctly sometimes.  See here, right after I finished up this example the replica copy policy OS and max snaps aren’t what I specified.  The production is fine.  I’ll assume this is a bug until told otherwise, but just a reminder to go back through and verify these settings when you finish up.  If they are wrong you can just fix them and apply.

cgcreate16

Back in XtremIO, notice that the replica is now (more or less) the same size as the production volume as far as used space.  Based on my testing this is because the data existed on the prod copy before I configured the CG.  If I configure the CG on a blank LUN and then go in and do stuff, nothing happens on the replica LUN by default, because it isn’t rolling like it used to.  Go snaps!

cgcreate17

I’ll let this run for a couple of days and then finish up with a production recovery and a summary.

EMC Recoverpoint and XtremIO Part 2 – Make It Snappy

In this post we are going to explore the new snap based protection with Recoverpoint and XtremIO.  It is worth noting that some of this is based on my observations and testing, and I encourage you to do the same in your environment.  Take snaps, try to recover, etc.  I used a relatively small sample size and testing criteria.

Also worth noting – you aren’t supposed to manually interact with the snapshots that RP is taking, and in fact you can’t even see them when logged in as admin.  However if you log in as tech or rp_user, you can see them and optionally interact with them.  But again remember, if you manually interact with these pieces you may cause issues in your environment!  Leave this type of stuff to the testers or in test environments.

Snap Based Replication Behavior

So, snap based replication – what is it and how is it different?

Well standard Recoverpoint is pretty well documented but the idea is each write is:

  1. Split at the source array
  2. Sent to the remote array
  3. Finally applied to the journal volume there.

At a later time, this write will be applied to the replica LUN.  So the journals contain a timeline of writes, and the replica LUN is somewhere along that timeline at any given moment.  No real clue where, but when you go to access an image (with direct access), the system will “roll” the replica using the write timeline to wherever you wanted.

Snap based replication is literally nothing like this.  Instead it operates like this.  Again I’m writing this based on my reading of the tech notes as well as what I “see” between RP and XtremIO.  I write this from the perspective of a single source/replica combo but obviously you can have multiples just like always.

  1. Source LUN and Replica LUN (along with a single source and replica journal – remember no need to have large journals or even multiple journals) form a consistency group.
  2. On the source LUN, a snap is created that is labeled “SMP” – likely a reference to snapshot mount point, even though these don’t really exist on XtremIO.  All snaps are just disks.
  3. On the DR side, the DR LUN also has a snap created that is SMP.
  4. On the DR side, two sub-snaps of the SMP are created called Volume##### (some incremental volume number).  Presumably the first is the state of the LUN as it started with and the next is where the changes are headed.  At this point if you look inside RP at the DR journal, you will see two snaps.  Regardless…
  5. All changes (current contents of SMP) are sent across to the DR side.  So at this point we’ve got Source LUN and Source SMP snap.  We’ve also got DR LUN, DR SMP snap, and 2 x sub snaps.  snap1
  6. At some point (depending on how you’ve configured things) the system will:
    1. Take another prod side snap and DR snap, both Volume##### snaps. On the prod side, this snap is temporary because the differences between it and the prod SMP LUN represent the changes that need to be sent across.  snap2
    2. These changes are sent across and injected into the DR snapshot, which is your newest snapshot for recovery.  snap3
    3. Once this is complete, the temporary snap on the source is merged into the SMP snap, which now represents the state of the source LUN from the last replication.  snap4

Now the source SMP and the latest snap are identical.

snap5

This process repeats indefinitely and represents your ongoing protection.

So clearly a departure from what we are used to.  Because all changes are stored in snapshots, no journal space is necessary for storing writes.  And there is also no need to keep rolling the replica either, because the recovery points on RP are in-memory snapshots on XtremIO (pointer based) which can be promoted or merged at any time near instantaneously. I self-confirmed no replica rolling by:

  1. Configured a CG on a blank LUN and let replication start rolling through snaps.
  2. Mounted the prod LUN in vSphere and created a VMFS datastore, noting some activity in the snaps.
  3. Waited a few more replication cycles
  4. Paused CG
  5. Unmounted/unmapped prod LUN
  6. Manually mapped replica LUN
  7. Mounted/attached replica LUN in vSphere, but it does not contain a VMFS file system.  This is just a raw LUN, indicating that there is no more replica rolling in the background.
  8. Unmounted/unmapped replica LUN
  9. Enabled image access on newest snapshot
  10. Mapped/mounted/attached replica LUN in vSphere.  Now the VMFS file system is there.
  11. Detached replica and disabled image access.
  12. Reattached replica LUN, VMFS file system is still there.  So it didn’t try to restore the “nothing” that was in the LUN to begin with since there is no good reason to do that.

One thing I didn’t test is whether the snaps get merged into the replica LUN as they roll off the image list.  I don’t think this is the case – I think they are actually merged into the DR side SMP LUN, though I haven’t confirmed.

But either way, again, very cool how this new functionality leverages XtremIO snaps for efficient replication.

Image Access

Another nice change is that image access no longer uses the journal, because essentially all changes are snap based and stored in the XtremIO pool.  So no worries about long term image access and filling up the log.

I did image access on a raw LUN and presented to vSphere. Created a new datastore and deployed an EZT VMDK.  In the RP GUI, there was no extra activity on the journal side.

Interestingly, the “undo writes” button still works.  In this case I unmounted that LUN from vSphere and clicked undo writes.  When I attempted to remount/readd, there was no datastore on it.

Consistency Group Snapshot Behavior

When you configure a consistency group, you will configure a few parameters related to your protection.  The first is Maximum Number of Snapshots.  This is the total amount of snapshots that consistency group will retain, and goes up to 500.  Don’t forget that there is a per-XMS limitation of 8,192 total volumes + snapshots!  If you configure 500 snaps per group then you’ll probably run out quickly and won’t even be able to create new LUNs on XtremIO.

The other parameter you’ll configure is the type of protection you want.  There is no synchronous mode with RP+XtremIO.  Instead you choose Continuous which essentially creates a new snap as soon as the previous one is done transferring, or Periodic which will take snaps every X minutes.

With Continuous there isn’t really anything else to configure.  You can configure an RPO in minutes, but this is allegedly just an alerting mechanism.

With Periodic, you do tell it how often to take the snaps.  You can configure down to a per minute snapshot if you want.

Alright, so now the weirdness – the snapshot pruning policy.  The snapshot pruning policy is designed to give you a nice “spread” of snapshots.  This is listed in the whitepaper as follows (these percentages are not currently adjustable):

Age of snapshots // Percentage of total

  • 0–2 hours // 40%
  • 2–24 hours // 30%
  • 1–2 days // 20%
  • 2–4 days // 5%
  • 5–30 days // 5%
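To make the table concrete, here is a toy sketch of how a 48-snap budget might be spread across those age buckets if the percentages were applied literally.  This is just my reading of the whitepaper table, not EMC’s actual pruning algorithm.

```python
# Toy illustration of the stated pruning buckets applied to a 48-snap budget.
BUCKETS = [            # (age range, share of total snaps)
    ("0-2 hours",  0.40),
    ("2-24 hours", 0.30),
    ("1-2 days",   0.20),
    ("2-4 days",   0.05),
    ("5-30 days",  0.05),
]

max_snaps = 48
for age_range, share in BUCKETS:
    print(f"{age_range:>10}: ~{round(max_snaps * share)} snaps")
```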

This is kind of helpful, except they don’t really tell you how or when this policy is applied.  In my testing, here is what I believe to be true.

  1. Unlike previous versions, the “Required Protection Window” setting actively alters what snapshots are removed.  In classic RP, required protection window was simply an alerting mechanism.  Now it appears that if you configure a required protection window of Z hours with X snapshots, most of the time the system will work to stagger those out so you will have X snapshots distributed throughout your Z hours.
    1. For instance, if you told the system I want periodic snaps every minute, 10 maximum snapshot count, and a required protection window of 5 hours, it will start out by taking one snap a minute for 10 minutes.  5_hr_window_1 After that, it will begin deleting snaps in the middle but preserving the first ones it took. Here I still have 2 of the first snaps it took, but a lot of intermediary ones have been purged.  5_hr_window_2 It will continue this process until you get to the 5 hour mark, when it starts purging the oldest snap.  So you will end up with a 5 hour rolling protection window at the end of the day.  Same thing if you said 12 hours, or 1 day, or 1 week, etc.
    2. If you reduce your Required Protection Window, the system will immediately purge snapshots.  So for instance if I have my 5 hour window as in my previous example, with 5 hours worth of snaps, and I reduce my Required Protection Window to 3 hours, any snaps past 3 hours are immediately deleted.
  2. By default (again, I believe this to be true), a consistency group will have an unwritten Required Protection Window of 1 month.  I noticed while tinkering around that if a CG doesn’t have a Protection Window set, it looks like it will try to go for 30 days worth of snaps.  And sometimes (in the midst of testing copies and other things) it actually set a 30 day window on the CG without my interaction.
  3. If the protection window is 1 or 2 hours, no snapshot pruning is done.  This kind of matches up with the stated pruning policy, which starts to delineate after 2 hours.  But e.g. if I configure a CG with a 10 snap max, 1 per minute, and a 1 or 2 hour required protection window, then my actual recovery window will only ever be 10 minutes long and I will never meet my specified requirements.  After 10 snaps exist, the newest snap always replaces the oldest one.  BUT!  If I set my Required Protection Window to 3+ hrs, then it starts doing the odd pruning/purging so that my total protection window is met.
  4. The pruning behavior seems to be the same whether you have Periodic snaps or Continuous snaps in place.

Again I found this to be a little odd and hope there is some clearer documentation in the future, but in the meantime this is my experience.

EMC RecoverPoint and XtremIO Part 1 – Initial Findings and Requirements

Back in the saddle again after a long post drought!  I’ve been busy lately working on some training activities with pluralsight, as well as dealing with a company merger.  I’m no longer with Varrow, as Varrow was acquired by Sirius Computer Solutions.  And enjoying time with my son, who is about to turn 1 year old – hard to believe!

Over the past couple of weeks, I’ve been involved in some XtremIO and Recoverpoint deployments.  RP+XtremIO just released not too long ago and it has been a bit of a learning curve – not with the product itself, but with the new methodology.  I wanted to lay out some details in case anyone is looking at this solution.

There is a good whitepaper on support.emc.com called Recoverpoint Deploying with XtremIO Tech Notes.  It does a good job of laying out the functionality, but for me at least still missed some important details – or maybe just didn’t phrase them so I could understand.

First, great news, from a functional standpoint this solution is roughly the same as all other RP implementations.  The same familiar interface is there, you create CGs and can do things like test a copy, recover production, and failover.  So if you are familiar with Recoverpoint protection operationally there is not a lot of difference.

Under the covers, things are hugely different.  I’m going to talk about the snap based replication a little later, and probably in part 2 as well.

First, the actual deployment is roughly the same.  Don’t forget your code requirements:

  • RecoverPoint 4.1.2 or later
  • XtremIO 4.0 or later

RP is deployed with Deployment Manager as usual, and XtremIO is configured as usual. 3GB repository volume (as usual!).

RP to XtremIO zoning is simple – everything to everything.  A single zone with all RP ports and all XtremIO ports from a single cluster in each fabric.

With the new 4.0 code, a single XtremIO Management Server (XMS) can manage multiple clusters.  Even though it would probably work, I would use a single zone per fabric for each cluster regardless of whether it is in the same XMS or not. More on the multi-cluster XMS with Recoverpoint later…

When you go to add XtremIO arrays into RP, you’ll use the XMS IP, and then a new rp_user account.  I’m not sure what the default password here is, so I just reset the password using the CLI.  If you have pre-zoned, you just select the XtremIO array from the list, give the XMS IP and rp_user creds.  If you haven’t pre-zoned, you also have to enter the XtremIO serial as well.

add_array

Here is the “I didn’t zone already” screen.  If you did pre-zone, you’ll see your serial in the list at the top and don’t need to enter it below.  Port 443 is required to be open between RP, XMS, and SCs.  Port 11111 is required between RP and SCs.  Usually this is in the same data center so not a huge deal.

Once the arrays have been added in and your RP cluster is configured like you want it, the rest is again same as usual.

  1. Create initiator group on XtremIO for Recoverpoint with all RP initiators.
  2. Create journal volumes, production volumes, and replica volumes.
  3. Present them to RecoverPoint
  4. Configure consistency groups.  Here there are important things to understand about the snap-based protection schemes, that I’ll go over later.

One important change due to the snap based recovery – no recovery data is stored in journals, only metadata related to snapshots!  Because of this, journals need to be as small as possible – 10GB for normal CGs, 40GB for distributed.  They won’t use all this space but we don’t care (assuming your jvols are on XtremIO) because XtremIO is thin anyway.  Similarly, each CG only needs one journal, as your protection window is not defined by your total journal capacity.

RP with XtremIO licensing is pretty simple.  You can either buy a basic (“/SE”) or full (“/EX”) license for your brick size.  Either way you can protect as much capacity as you can create, which is nice considering XtremIO is thin and does inline dedupe/compression.  Essentially basic just gives you remote protection, XtremIO to XtremIO only, only 1 remote copy.  Full adds in the ability to do local, as well as go from anything to anything (e.g. XtremIO to VNX, or VMAX to XtremIO), and a 3rd copy (so production, local, and remote, or production and two remote).  Obviously you need EX or CL licensing for the other arrays if you are doing multiple types.  Just a point of clarification here, the “SE” and “EX” for XtremIO are different than normal.  So if you have a VNX with /SE licensing, you can’t use it with /SE (or even /EX) XtremIO licensing. 

If you are using iSCSI with XtremIO, you can still do RP in direct attach mode, similar to what we do on VNX iSCSI.  Essentially you direct attach up to two bricks to your RP cluster, which must be exactly 2 nodes.  I would imagine (though I have not confirmed) that you could have more than two bricks, but only attach two of them to RPAs.  vRPA is not currently supported – this remains a Clariion/VNX/VNXe only product.

I’m going to cover some details about the snap based protection in the next post, but in the meantime know that because it IS all snap based and there is no data in the journal to “roll” to, that image access is always direct and it is always near instantaneous.  It doesn’t matter if you are trying to access an image from 1 minute ago that has 4KB worth of changes, or an image from a week ago with 400GB worth of changes.  This part is very cool, as there is no need to worry about rolling.  There is also no need to worry about the undo log for image access – with traditional recoverpoint you were “gently encouraged” 🙂 to not image access for a long time, because as the writes piled up, eventually replication would halt.  And there was a specific capacity for the undo log.

Instead now the only capacity based limit you are concerned about is the physical capacity on the XtremIO brick itself.

Allegedly Site Recovery Manager is supported but I didn’t do any testing with that.

RP only supports volumes from XtremIO that use 512 bytes as the logical block size, not the 4K block size.  There is so little support for the 4K block size right now that I still strongly discourage anyone from using it unless they have tons of sign off and have done tons of testing.  But if you are using the 4K block size, then you won’t be able to use RP protection.  Just to clarify, this is the setting I’m talking about – it is unrelated to FS block sizing à la NTFS or anything of that nature.

4kblock

A few other random caveats:

  • If one volume at a copy is on an XtremIO array, then all volumes at that copy must be on that XtremIO array.  So for a given single copy (all the volumes in a copy), you can’t split them between array types or even clusters due to snapshotting.
  • There must be a match between production and the replica size, although I always recommend this anyway.
  • Resize for volumes is unfortunately back to the old way.  Remove both prod/replica volumes from the CG, resize, then re-add.  Hopefully a dynamic resize will be available at some point.

In the next post I’m going to talk about some things I know and some things I’ve observed during testing with the snapshotting behavior, but I wanted to call out a specific limitation right now and will probably hammer on it later – there is an 8,192 limit of total volumes + snapshots per XMS irrespective of Recoverpoint.  This sounds like a ton, but each production volume you protect will have (at times) two snapshots associated with it.  Each replica volume will have max_snaps + 1 snapshots associated with it.  Because this is a per XMS limitation and not a per cluster limitation, depending on exactly how many volumes you have and how many snapshots you want to keep, you may still want a single XMS per cluster in a multi-cluster configuration.
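To put rough numbers on that limit, here is a quick sketch using the counts above (production volume plus up to two snapshots, replica volume plus max_snaps + 1 snapshots).  The numbers are illustrative only – plug in your own.

```python
# Quick sanity check against the 8,192 volumes + snapshots ceiling per XMS.
XMS_LIMIT = 8192

def objects_per_protected_volume(max_snaps):
    prod = 1 + 2                   # production volume + its (at times) two snaps
    replica = 1 + (max_snaps + 1)  # replica volume + retained snapshots
    return prod + replica

per_vol = objects_per_protected_volume(48)   # 53 objects per protected volume
print(per_vol, XMS_LIMIT // per_vol)          # ~154 protected volumes max per XMS,
                                              # before counting anything unprotected
```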

More to come!

EMC RecoverPoint Journal Sizing

A commenter on my post about RecoverPoint Journal Usage asks:

How can I tell if my journal is large enough for a consistency group? That is to say, where in the GUI will it tell me I need to expand my journal or add another journal lun?

This is an easy question to answer but for me this is another opportunity to re-iterate journal behavior.  Scroll to the end if you are in a hurry.

Back to Snapshots…

Back to our example in the previous article about standard snapshots – on platforms where snapshots are used you often have to allocate space for this purpose…like with SnapView on EMC VNX and Clariion, you have to allocate space via a Reserve LUN pool.  On NetApp systems this is called the snapshot reserve.

Because of snapshot behavior (whether Copy On First Write or Redirect On Write), at any given time I’m using some variable amount of space in this area that is related to my change rate on the primary copy.  If most of my data space on the primary copy is the same as when I began snapping, I may be using very little space.  If instead I have overwritten most of the primary copy, then I may be using a lot of space.  And again, as I delete snapshots over time this space will free up.  So a potential set of actions might be:

  1. Create snapshot reserve of 10GB and create snapshot1 of primary – 0% reserve used
  2. Overwrite 2.5GB of data on primary – 25% reserve used
  3. Create snapshot2 of primary and overwrite a different 2.5GB of data on primary – 50% reserve used
  4. Delete snapshot1 – 25% reserve used
  5. Overwrite 50GB of data – snapshot space full (probably bad things happen here…)

There is meaning to how much space I have allocated to snapshot reserve.  I can have way too much (meaning my snapshots only use a very small portion of the reserve) and waste a lot of storage.  Or I can have too little (meaning my snapshots keep overrunning the maximum) and probably cause a lot of problems with the integrity of my snaps.  Or it can be just right, Goldilocks.

RP Journal

Once again the RP journal does not function like this.  Over time we expect RP journal utilization to be at 100%, every time.  If you don’t know why, please read my previous post on it!

The size of the journal only defines your protection window in RP.  The more space you allocate, the longer back you are able to recover from.  However, there is no such thing as “too little” or “too much” journal space as a rule of thumb – these are business defined goals that are unique to every organization.

I may have allocated 5GB of journal space to an app, and that lets me recover 2 weeks back because it has a really low write rate.  If my SLA requires me to recover 3 weeks back, that is a problem.

I may have allocated 1TB of journal space to an app, and that lets me recover back 30 minutes because it has an INSANE write rate.  If my SLA only requires me to recover back 15 minutes, then I’m within spec.
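If you want to rough out your own numbers, the relationship is basically journal capacity divided by write rate.  Here is a crude sketch – it ignores journal overhead, the image access reserve, and so on, and is only meant to show why the “right” size is a business decision rather than something RP can judge for you.

```python
# Crude estimate: protection window ~= how long the write workload takes to fill the journal.
def rough_window_hours(journal_gb, avg_write_mb_per_sec):
    journal_mb = journal_gb * 1024
    return journal_mb / (avg_write_mb_per_sec * 3600)

print(rough_window_hours(5, 0.001))   # tiny write rate: weeks and weeks of rollback
print(rough_window_hours(1024, 500))  # insane write rate: well under an hour
```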

RP has no idea about what is good journal sizing or bad journal sizing, because this is simply a recovery time line.  You must decide whether it is good or bad, and then allocate additional journals as necessary.  Unlike other technology like snapshots, there is no concept of “not enough journal space” beyond your own personal SLAs. In this manner, by default RecoverPoint won’t let you know that you need more journal space for a given CG because it simply can’t know that.

Note: if you are regularly using the Test A Copy functionality for long periods of time (even though you really shouldn’t…), then you may run into sizing issues beyond just protection windows, as portions of the journal space are also used for that.  This is beyond the scope of this post, but just be aware that even if you are in spec from a protection window standpoint, you may need more journal space to support the test copy.

Required Protection Window

So RecoverPoint has no way of knowing whether you’ve allocated enough journal space to a given CG.  Folks on the pre-sales side have some nifty tools that can help with journal sizing by looking at data change rate, but this is really for the entire environment and hopefully before you bought it.

Luckily, RecoverPoint has a nice internal feature to alert you whether a given Consistency Group is within spec or not, and that is “Required Protection Window.”  This is a journal option within each copy and can be configured when a CG is created, or modified later.  Here is a pic of a CG without it.  Note that you can still see your current protection window here and make adjustments if you need.

rpj1

Here is where the setting is located.

rpj2

And here is what it looks like with the setting enabled.

rpj3

So if I need to recover back 1 hour on this particular app, I set it to 1 hour and I’m good.  If I need to recover back 24 hours, I set it that way and it looks like I need to allocate some additional journal space to support that.

Now this does not control behavior of RecoverPoint (unlike, say, the Maximum Journal Lag setting) – whether you are within or under your required protection window, RP still functions the same.  It simply alerts you that you are under your personally defined window for that CG.  And if you are under for too long, or maybe under it at all if it is a mission critical application, you may want to add additional journal space to extend your protection window so that you are within spec.  Again I repeat, this is only an alerting function and will not, by itself, do anything to “fix” protection window problems!

Summary

So bottom line: RP doesn’t – or more accurately can’t – know whether you have enough journal space allocated to a given CG because that only affects how long you can roll back for.  However, using the Required Protection Window feature, you can tell RP to alert you if you go out of spec and then you can act accordingly.

SAN vs NAS Part 5: Summary

We’ve covered a lot of information over this series, some of it more easily consumable than others.  Hopefully it has been a good walkthrough of the main differences between SAN and NAS storage, and presented in a little different way than you may have seen in the past.

I wanted to summarize the high points before focusing on a few key issues:

  • SAN storage is fundamentally block I/O, which is SCSI.  With SAN storage, your local machine “sees” something that it thinks is a locally attached disk.  In this case your local machine manages the file system, and transmissions to the array are simple SCSI requests.
  • NAS storage is file I/O, which is either NFS or CIFS.  With NAS storage, your local machine “sees” a service to connect to on the network that provides file storage.  The array manages the file system, and transmissions to the array are protocol specific file based operations.
  • SAN and NAS have different strengths, weaknesses, and use cases
  • SAN and NAS are very different from a hardware and protocol perspective
  • SAN and NAS are sometimes only offered on specific array platforms

Our Question

So back to our question that started this mess: with thin provisioned block storage, if I delete a ton of data out of a LUN, why do I not see any space returned on the storage array?  We know now that this is because there is no such thing as a delete in the SAN/block/SCSI world.  Thin provisioning works by allocating storage you need on demand, generally because you tried to write to it.  However once that storage has been allocated (once the disk has been created), the array only sees reads and writes, not creates and deletes.  It has no way of knowing that you sent over a bunch of writes that were intended to be a delete.  The deletes are related to the file system, which is being managed by your server, not the array.  The LUN itself is below the file system layer, and is that same disk address space filled with data we’ve been discussing.  Deletes don’t exist on SAN storage, apart from administratively deleting an entire object – LUN, RAID set, Pool, etc.

With NAS storage on the other hand, the array does manage the file system.  You tell it when to delete something by sending it a delete command via NFS or CIFS, so it certainly knows that you want to delete it.  In this manner file systems allocations on NAS devices usually fluctuate in capacity.  They may be using 50GB out of 100GB today, but only 35GB out of 100GB tomorrow.

Note: there are ways to reclaim space either on the array side with thin reclamation (if it is supported), or on the host side with the SCSI UNMAP commands (if it is supported).  Both of these methods will allow you to reclaim some/all of the deleted space on a block array, but they have to be run as a separate operation from the delete itself.  It is not a true “delete” operation but may result in less storage allocated.

Which Is Better?

Yep, get out your battle gear and let’s duke it out!  Which is better?  SAN vs NAS!  Block vs File!  Pistols at high noon!

Unfortunately as engineers a lot of times we focus on this “something must be the best” idea.

Hopefully if you’ve read this whole thing you realize how silly this question is, for the most part.  SAN and NAS storage occupy different areas and cover different functions.  Most things that need NAS functionality (many access points and permissions control) don’t care about SAN functionality (block level operations and utilities), and vice versa.  This question is kind of like asking which is better, a toaster or a door stop?  Well, do you need to toast some delicious bread or do you need to stop a delicious door?

In some cases there is overlap.  For example, vSphere datastores can be accessed over block protocols or NAS (NFS).  In this case what is best is most often going to be – what is the best fit in the environment?

  • What kind of hardware do you have (or what kind of budget do you have)?
  • What kind of admins do you have and what are their skillsets?
  • What kind of functionality do you need?
  • What else in the environment needs storage (i.e. does something else need SAN storage or NFS storage)?
  • Do you have a need for RDMs (LUNs mapped directly from the array in order to expose some of the SCSI functionality)?

From a performance perspective 10Gb NFS and 10Gb iSCSI are going to do about the same for you, and honestly you probably won’t hit the limits of those anyway.  These other questions are far more critical.

Which leads me to…

What Do I Need?

A pretty frequently asked question in the consulting world – what do I need, NAS or SAN?  This is a great question to ask and to think about but again it goes back to what do you need to do?

Do you have a lot of user files that you need remote access to?  Windows profiles or home directories?  Then you probably need NAS.

Do you have a lot of database servers, especially ones that utilize clustering?  Then you probably need SAN.

Truthfully, most organizations need some of both – the real question is in what amounts.  This will vary for every organization but hopefully armed with some of the information in this blog series you are closer to making that choice for your situation.

SAN vs NAS Part 4: The Layer Cake

Last post we covered the differences between NFS and iSCSI (NAS and SAN) and determined that we saw a different set of commands when interacting with a file.  The NFS write generated an OPEN command, while the iSCSI write did not.  In this post we’ll cover the layering of NAS (file or file systems) on top of SAN (SCSI or block systems) and how that interaction works.

Please note!  In modern computing systems there are MANY other layers than I’m going to talk about here.  This isn’t to say that they don’t exist or aren’t important, but just that we are focusing on a subset of them for clarity.  Hopefully.

First, take a look at the NFS commands listed here: https://tools.ietf.org/html/rfc1813

[Image: list of NFS commands from RFC 1813]

Notice that a lot of these commands reference files, and things that you would do with files like read and write, but also create, remove, rename, etc.

Compare this with the SCSI reference: http://www.t10.org/lists/op-alph.htm

Notice that in the SCSI case, we still have read and write, but there is no mention of files (other than “filemarks”).  There is no way to delete a file with SCSI – because again we are working with a block device which is a layer below the file system.  There is no way to delete a file because there is no file.  Only addresses where data is stored.
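If you want to see this “addresses only” view for yourself, you can read a block device directly on any Linux box – the device name here is a placeholder, and stick to reading, never writing, raw devices:

sudo dd if=/dev/sdb bs=512 count=1 | xxd        # dump the first 512-byte block as hex: no files, just bytes at an address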

As a potentially clumsy analogy (like I often wield!) think about your office desk.  If it’s anything like mine, there is a lot of junk in the drawers.  File storage is like the stuff in a drawer.  The space in a drawer can have a lot of stuff in it, or it can have a little bit of stuff in it.  If I add more stuff to the drawer, it gets more full.  If I take stuff out of the drawer, it gets less full.  There is meaning to how much stuff is in an individual drawer as a relation to how much more stuff I can put in the drawer.

Block storage, on the other hand, is like the desk itself.  There are locations to store things – the drawers.  However, whether I have stuff in a drawer or I don’t have stuff in a drawer, the drawer still exists.  Emptying out my desk entirely doesn’t cause my desk to vanish.  Or at least, I suspect it wouldn’t…I have never had an empty desk in my life.  There is no relationship between the contents of the drawers and the space the desk occupies.  The desk is a fixed entity.  An empty drawer is still a drawer.

To further solidify this file vs block comparison, take a look at this handsome piece of artwork depicting the layers:

[Image: two files mapped through the file system to blocks on disk]

Here is a representation of two files on my computer, a word doc and a kitty vid, and their relationship to the block data on disk.  Note that some disk areas have nothing pointing to them – these are empty but still zero filled (well…maybe, depending on how you formatted the disk).  In other words, these areas still exist!  They still have contents, even if that content is nothing.
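If you want to see this mapping on a real file, most Linux file systems will show you which physical blocks back it – a quick illustration with a hypothetical file name:

sudo filefrag -v kittyvideo.mp4        # list the on-disk extents (block ranges) this file occupies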

When I query a file, like an open or read, the request traverses the file system down to the disk level.  Now I’m going to delete the word doc.  In most cases, this is what is going to happen:

[Image: the word doc removed from the file system, its blocks left intact on disk]

My document is gone as far as I can “see.”  If I try to query the file system (like looking in the directory it was stored in) it is gone.  However on the disk, it still exists.  (Fun fact: this is how “undelete” utilities work – by restoring data that is still on disk but no longer has pointers from the file system.)  It isn’t really relevant that it is still on the disk, because from the system’s perspective (and the file system’s perspective) it doesn’t exist any more.  If I want to re-use that space, the system will see it as free and store something else there, like another hilarious kitten video.

Sometimes this will happen instead, either as you delete something (rarely) or later as a garbage collection process:

[Image: the document’s blocks overwritten with zeros]

The document data has been erased and replaced with zeros.  (Fun fact: this is how “file shredder” programs work – by writing zeros (or a pattern) once (or multiple times) to the space that isn’t being actively used by files.)  Now the data is truly gone, but from the disk perspective it still isn’t really relevant because something still occupies that space.  From the disk’s perspective, something always occupies that space, whether it is kitty video data, document data, or zeros.  The file system (the map) is what makes that data relevant to the system.
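The GNU shred utility is a hands-on example of that overwrite behavior, forced deliberately rather than waiting for any garbage collection – the file name is a placeholder, and note that SSDs and copy-on-write file systems may not actually overwrite in place:

shred -n 1 -z -u mydocument.doc        # overwrite once with random data, then with zeros, then delete the file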

This is a really high level example, but notice the difference in the file system level and the disk level.  When I delete that file, whether the actual disk blocks are scrubbed or left intact, the block device remains the same except for the configuration of the 1’s and 0’s.  All available addresses are still in place.  Are we getting closer to understanding our initial question?

Let’s move this example out a bit and take a look at an EMC VNX system from a NAS perspective.  This is a great example because there are both SAN/block (Fibre Channel) and NAS/file (CIFS/NFS) at the same time.  The connections look like this:

[Image: desktop connected via NFS to the VNX datamover, which connects via Fibre Channel to the block storage controllers]

From my desktop, I connect via NFS to an interface on the NAS (the datamover) in order to access my files.  And the datamover has a Fibre Channel connection to the block storage controllers, which is where the data is actually stored.  The datamover consumes block storage LUNs, formats them with appropriate file systems, and then uses that space to serve out NAS.  This ends up being quite similar to the layered file/disk example above, when we were looking at a locally hosted file system and disk.
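You can approximate this layering yourself on a plain Linux box: take a block device, put a file system on it, and serve that file system out over NFS.  A rough sketch rather than a production config – the device, paths, and subnet are placeholders:

sudo mkfs.ext4 /dev/sdc                                  # lay a file system down on the block device
sudo mkdir -p /export/share && sudo mount /dev/sdc /export/share
echo "/export/share 192.168.56.0/24(rw,sync)" | sudo tee -a /etc/exports
sudo exportfs -ra                                        # publish the NFS export to clients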

What does it look like when I read and write?  Simply like this:

[Image: a read/write flowing from the desktop via NFS to the datamover, then via SCSI over Fibre Channel to the storage processor]

My desktop issues a read or write via NFS, which hits the NAS, and the NAS then issues a read or write via SCSI over Fibre Channel to the storage processor.

Reads and writes are supported by SCSI, but what happens when I try to do something to a file like open or delete?

[Image: a file open/delete arriving at the datamover via NFS, translated into plain reads and writes over SCSI]

The same command conversion happens, but it is just straight reads and writes at the SCSI level.  It doesn’t matter whether the NAS is SAN attached like this one, or it just has standard locally attached disks.  This is always what’s going to happen because the block protocol and subsystems don’t work with files – only with data in addresses.

By understanding this layering – what file systems (NAS) do vs what disks (SAN) do – you can better understand important things about their utility.  For instance, file systems have various methods to guarantee consistency, in spite of leveraging buffers in volatile memory.  If you own the file system, you know who is accessing data and how.  You have visibility into the control structure.  If the array has no visibility there, then it can’t truly guarantee consistency.  This is why, for example, block array snapshots and file array snapshots are often handled differently.  With NAS snapshots, the array controls the buffers and can easily guarantee consistent snapshots.  But for a block snapshot, the array can only take a picture of the disk right now, regardless of what is happening in the file system.  It may end up with an inconsistent image on disk, unless you initiate the snapshot from the attached server and properly quiesce/clean the file system.
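On Linux, that host-side quiesce step can be as simple as freezing the file system around the array snapshot – a sketch with a placeholder mount point, and applications with their own buffers (databases especially) usually need their own quiesce on top of this:

sudo fsfreeze -f /mnt/data        # flush dirty buffers and block new writes
# ...trigger the array-side snapshot of the underlying LUN here...
sudo fsfreeze -u /mnt/data        # thaw the file system and resume I/O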

Back to the idea of control: because NAS systems manage the file side of things, they also have a direct understanding of who is trying to access what.  Not only does this give them the ability to provide some access control (unlike SAN, which just responds happily to any address requests it gets), it also explains why NAS is often ideal for multi-access situations.  If I have users trying to access the same share (or better yet, the same file), NAS storage is typically the answer because it knows who has what open.  It can manage things on that level.  For the SAN, not so much.  In fact, if you want two hosts to access the same storage, you need to have some type of clustering (whether clustering software or a clustered file system) that provides locks and checks.  Otherwise you are pretty much guaranteed some kind of data corruption as things read and write over top of one another.  Remember, SAN and SCSI just let you read and write to addresses; they don’t provide the ability to open and own a file.

In part 5 I’ll provide a summary review and then some final thoughts as well.

SAN vs NAS Part 3: File Systems

In the last blog post, we asked a question: “who has the file system?”  This will be important in our understanding of the distinction between SAN and NAS storage.

First, what is a file system?  Simply (see edit below!), a file system is a way of logically sorting and addressing raw data.  If you were to look at the raw contents of a disk, it would look like a jumbled mess.  This is because there is no real structure to it.  The file system is what provides the map.  It lets you know that block 005A and block 98FF are both the first parts of your text file that reads “hello world.”  But on disk it is just a bunch of 1’s and 0’s in seemingly random order.

Edit: Maybe I should have chosen a better phrase like “At an extremely basic level” instead of “Simply.” 🙂 As @Obdurodon pointed out in the comments below, file systems are a lot more than a map, especially these days.  They help manage consistency and help enable cool features like snapshots and deduplication.  But for the purposes of this post this map functionality is what we are focusing on as this is the relationship between the file system and the disk itself.

File systems allow you to do things beyond just reads and writes.  Because they form files out of data, they let you do things like open, close, create, and delete.  They allow you the ability to keep track of where your data is located automatically.

(note: there are a variety of file systems depending on the platform you are working with, including FAT, NTFS, HFS, UXFS, EXT3, EXT4, and many more.  They have a lot of factors that distinguish them from one another, and sometimes have different real world applications.  For the purposes of this blog series we don’t really care about these details.)

Because SAN storage can be thought of as a locally attached disk, the same applies here.  The SAN storage itself is a jumbled mess, and the file system (data map) is managed by the host operating system.  Similar to your local C: drive in your windows laptop, your OS puts down a file system and manages the location of the block data.  Your system knows and manages the file system so it interacts with the storage array at a block level with SCSI commands, below the file system itself.

With NAS storage on the other hand, even though it may appear the same as a local disk, the file system is actually not managed by your computer – or more accurately the machine the export/share is mounted on.  The file system is managed by the storage array that is serving out the data.  There is a network service running that allows you to connect to and interact with it.  But because that remote array manages the file system, your local system doesn’t.  You send commands to it, but not SCSI commands.

With SAN storage, your server itself manages the file system and with NAS storage the remote array manages the file system.  Big deal, right?  This actually has a MAJOR impact on functionality.

I set up a small virtual lab using VirtualBox with a CentOS server running an NFS export and an iSCSI target (my remote server), and an Ubuntu desktop to use as the local system.  After jumping through a few hoops, I got everything connected up.  All commands below are run and all screenshots are taken from the Ubuntu desktop.

I’ll also take a moment to mention how awesome Linux is for these types of things.  It took some effort to get things configured, but it was absolutely free to set up an NFS/iSCSI server and a desktop to connect to it.  I’ve said it before but will say it again – learn your way around Linux and use it for testing!

So remember, who has the file system?  Note that with the iSCSI LUN, I got a raw block device (a.k.a. a disk) presented from the server to my desktop.  I had to create a partition and then format it with EXT4 before I could mount it.  With the NFS export, I just mounted it immediately – no muss no fuss.  That’s because the file system is actually on the server, not on my desktop.

Now, if I were to unmount the iSCSI LUN and then mount it up again (or on a different Linux desktop), I wouldn’t need to lay down a file system – but that is only because it has already been done once.  With SAN storage, I have to put down a file system on the computer it is attached to the first time it is used, always.  With NAS storage, there is no such need because the file system is already in place on the remote server or array.
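To make that concrete, here is roughly what the two workflows looked like from the desktop side – IP addresses, device names, and mount points are stand-ins for my lab values:

sudo iscsiadm -m discovery -t sendtargets -p 192.168.56.10   # find the iSCSI target
sudo iscsiadm -m node --login                                # attach the LUN; it shows up as a raw disk
sudo mkfs.ext4 /dev/sdb1        # after partitioning with fdisk, lay down a file system – needed on first use only
sudo mount /dev/sdb1 /mnt/myiscsi

sudo mount -t nfs 192.168.56.10:/export/share /mnt/mynfs     # NFS: the file system already lives on the server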

Let’s dive in and look at the similarities and differences depending on where the file system is.

Strace

First let’s take a look at strace.  strace is a utility that traces the system calls a process makes, exposing some of the ‘behind the scenes’ activity when you execute commands on the box.  Let’s run it against a data write via a simple redirect:

strace -vv -Tt -f -o traceout.txt echo "hello world" > testfile

Essentially we are running strace with a slew of flags against the command [ echo "hello world" > testfile ].  Here is a screenshot of the relevant portion of both outputs when I ran the command with testfile located on the NFS export vs the local disk.

[Image: strace output for the write to the NFS export vs the write to the local disk]

Okay, there is a lot of cryptic info on those pics, but notice that in both cases the write looks identical.  The “things” that are happening in each screenshot look the same.  This is a good example of how local and remote I/O “appears” the same, even at a pretty deep level.  You don’t need to specify that you are reading or writing to a NAS export; the system knows what the final destination is and makes the necessary arrangements.
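For reference, the line of interest in both traces looks something like this – an ordinary write to file descriptor 1, which the shell has already pointed at testfile:

write(1, "hello world\n", 12) = 12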

Dstat

Let’s try another method – dstat.  Dstat is a good utility for seeing the types of I/O running through your system.  And since this is a lab system, I know it is more or less idle unless I’m actively doing something on it.
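I haven’t shown the exact dstat invocation, but something along these lines produces the disk and network columns you’ll see in the screenshots below, refreshed every second:

dstat -dn 1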

I’m going to run a large stream of writes (again, simple redirection) in various locations (one location at a time!) while I have dstat running in order to see the differences.  The command I’m using is:

for i in {1..100000}; do echo $i > myout; done

With myout located in different spots depending on what I’m testing.

For starters, I ran it against the local disk:

[Image: dstat output while writing to the local disk]

Note the two columns in the center indicating “dsk” traffic (I/O to a block device) and “net” traffic (I/O across the network interfaces).  You can think of the “dsk” traffic as SCSI traffic.  Not surprisingly, we have no meaningful network traffic, but sustained block traffic.  This makes sense since we are writing to the local disk.

Next, I targeted it at the NFS export.

[Image: dstat output while writing to the NFS export]

A little different this time: even though I’m writing to a file that appears in the file system of my local machine (~/mynfs/myout), there is no block I/O.  Instead we’ve got a slew of network traffic.  Again this makes sense because, as I explained, even though the file “appears” to be mine, it is actually the remote server’s.

Finally, here are writes targeted at the iSCSI LUN.

[Image: dstat output while writing to the iSCSI LUN]

Quite interesting, yes?  We have BOTH block and network traffic.  Again this makes sense.  The LUN itself is attached as a block device, which generates block I/O.  However, iSCSI traffic travels over IP, which hits my network interfaces.  The numbers are a little skewed since the block I/O on the left is actually included in the network I/O on the right.

So we are able to see that something is different depending on where my I/O is targeted, but let’s dig even deeper.  It’s time to…

WIRESHARK!

For this example, I’m going to run a redirect with cat:

cat > testfile

hello world

ctrl+c

This is simply going to write “hello world” into testfile.

After firing up wireshark and making all the necessary arrangements to capture traffic on the interface that I’m using as an iSCSI initiator, I’m ready to roll.  This will allow me to capture network traffic between my desktop and server.
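If you want to reproduce this with less noise, a capture filter on the relevant ports helps – shown here with tshark, Wireshark’s command-line counterpart, using a placeholder interface name and the default ports (3260 for iSCSI, 2049 for NFS; yours may differ):

sudo tshark -i enp0s8 -f "tcp port 3260"        # capture only iSCSI traffic
sudo tshark -i enp0s8 -f "port 2049"            # capture only NFS traffic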

Here are the results:

[Image: Wireshark capture of the iSCSI write command, targeted at a specific LBA]

There is a lot of stuff on this pic as expected, but notice the write command itself.  It is targeted at a specific LBA, just as if it were a local disk that I’m writing to.  And we get a response from the server that the write was successful.

Here is another iSCSI screenshot.

[Image: Wireshark capture highlighting the SCSI commands and the “hello world” payload]

I’ve highlighted the write and you can see my “hello world” in the payload.  Notice all the commands I highlighted with “SCSI” in them.  It is clear that this is a block level interaction with SCSI commands, sent over IP.  Note also that in both screenshots, there is no file interaction.

Now let’s take a look at the NFS export on my test server.  Again I’m firing up wireshark and we’ll do the same capture operation on the interface I’m using for NFS.  I’m using the same command as before.

[Image: Wireshark capture of the NFS WRITE call carrying the “hello world” payload]

Here is the NFS write command with my data.  There are standard networking headers and my hello world is buried in the payload.  Not much difference from iSCSI, right?

The difference shows up a few packets earlier:

[Image: Wireshark capture of the NFS OPEN call and the server’s reply]

We’ve got an OPEN command!  I attempt to open the file “testfile” and the server responds to my request like a good little server.  This is VERY different from iSCSI!  With iSCSI we never had to open anything, we simply sent a write request for a specific Logical Block Address.  With iSCSI, the file itself is opened by the OS because the OS manages the file system.  With NFS, I have to send an OPEN to the NAS in order to discover the file handle, because my desktop has no idea what is going on with the file system.

This is, I would argue, THE most important distinction between SAN and NAS and hopefully I’ve demonstrated it well enough to be understandable.  SAN traffic is SCSI block commands, while NAS traffic is protocol-specific file operations.  There is also some overlap here (like read and write), but these are still different entities with different targets.  We’ll take a look at the protocols and continue discussing the layering effect of file systems in Part 4.