VMware vSphere – Guest clock, replication, and squirrels

****PPFFFFFFFFFFPFPPPFFFPFFFFPPFFFP****

That’s me blowing the cobwebs off of my blog. 🙂

Time marches on and things are always changing. I too have recently made several changes, both in my personal and professional life. I am now a delivery engineer for CDI Southeast, though we are going through another transition period ourselves. I am also doing a bit of retooling, attempting to become a little less storage focused and branching out into other areas. Right now I’m trying to focus more on vSphere, but I’m working with some DevOps stuff as well. I also released a course on Pluralsight last October which I am very proud of. If you are interested in hearing me talk to you about VNX for hours, I encourage you to check it out.

So, great things hopefully coming down the pipe for me in the future and I hope to continue to share with the community at large that has given so much to me.

In the meantime, here is a quick nugget that I turned up a couple of weeks ago.  I was doing a pretty straightforward implementation of vSphere replication (and I hope to do a series on that soon), but ran into an oddity which I initially wrote off as a squirrel.

A squirrel is a situation where I walk into your building to do an implementation, and shortly thereafter you (or your co-worker, or your boss) come up to me and say, “you know, ever since you got here, the squirrels have been going crazy outside. What did you do?”

The squirrel can be anything really, but most of the time it is comically unrelated to anything I’m actually working on. And sometimes it is just people messing around with me. “Hey, you know Exchange is down? JUST KIDDING!”

Side note – I’ve been on site with a jokey client where something really did go down, and it took us a minute or two to figure out that there was really a problem.

I’m used to squirrels now. Honestly they rarely have merit, but I always take them seriously “just in case.”

In this case, in the midst of replicating VMs in the environment, an admin asked me if anything I was doing would mess with the clock on a guest server. I racked my brain for a moment and replied that I wouldn’t think so. The admin didn’t think so either, so he went off to investigate more. I did some more thinking, and then went back and got some more information. What exactly was happening to the clock?

In this case the server had a purposefully mis-set clock. As in, it wasn’t supposed to read current time, but it kept getting set to the current time. Then VMware Tools dawned on me, because there is a clock sync built into Tools that has to be disabled. We double checked, but it was disabled. That made sense, because we hadn’t done anything with Tools (no new or updated Tools install).
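(Side note: if you’d rather check this from inside the guest than through the vSphere client, VMware Tools ships a command line utility for it. On a Linux guest it looks like the below; I believe the Windows equivalent is VMwareToolboxCmd.exe.)

vmware-toolbox-cmd timesync status
vmware-toolbox-cmd timesync disable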

So later that night I was playing around in my lab.  I recreated the setup as best I could. Installed a guest with tools, disabled time sync, and set the clock back a year and some months.  Then I started replication.  And instantly, the clock was set forward.

So it turns out that even if you tell the guest “don’t sync the clock to the host,” it will STILL sync the clock to the host in certain situations.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1189

While I understand the rationale (certain operations have the potential to skew the clock, ergo syncing the clock up after those will help prevent ongoing skew), I really feel like if time sync is disabled, it shouldn’t sync the clock. Ever. Or there should be another check box that says “No really, never sync the clock.” Nevertheless, I don’t work for VMware so I can’t tell them how to run their product.

In this case the fix is pretty simple, though it does require downtime. Shut down the guest and add some lines to the .vmx:

time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"
time.synchronize.tools.enable = "0"
time.synchronize.resume.host = "0"

Now it will really, really never mess with the clock on the guest.
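If you want to sanity check that the settings took, you can grep the .vmx right from the ESXi shell (the datastore path here is a made-up example; use your own):

grep time.synchronize /vmfs/volumes/DATASTORE/MYVM/MYVM.vmx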

This might be common knowledge to VMware admins but I had no idea and I suppose I’ve never dealt with a purposefully skewed clock before.

EMC Recoverpoint and XtremIO Part 3 – Come CG With Me

In this post we are going to configure a local consistency group within XtremIO, armed with our knowledge of the CG settings.  I want to configure one snap per hour for 48 hours, 48 max snaps.

Because I’m working with local protection, I have to have the full featured licensing (/EX) instead of the basic (/SE) that only covers remote protection. Note: these licenses are different than normal /SE and /EX RP licenses! If you have an existing VNX with standard /SE, then XtremIO with /SE won’t do anything for you!

I have also already configured the system itself, so I’ve presented the 3GB repository volume, configured RP, and added this XtremIO cluster into RP.

All that’s left now is to present storage and protect! I’ve got a 100GB production LUN I want to protect. I have actually already presented this LUN to ESX, created a datastore, and created a very important 80GB Eager Zero Thick VM on it.

[Screenshot: cgcreate0]

First things first, I need to create a replica for my production LUN. This must be exactly the same size as the production LUN, though that is always my recommendation with RP anyway. I also need to create some journal volumes. Because this isn’t a distributed CG, I’ll be using the minimum 10GB sizing. Lucky for us, creating volumes on XtremIO is easy peasy. Just a reminder – you must use 512 byte blocks instead of 4K, but you are likely using that already anyway due to lack of 4K support.

[Screenshot: cgcreate1]

Next I need to map the volume. If you haven’t seen the new volume screen in XtremIO 4.0, it is a little different. Honestly I kind of like the old one, which was a bit more visual, but I’m sure I’ll come to love this one too. I select all 4 volumes and hit the Create/Modify Mapping button. Side note: notice that even though this is an Eager Zero’d VM, there is only 7.1MB used on the volume highlighted below. How? At first I thought this was the inline deduplication, but XtremIO does a lot of cool things, and one neat thing it does is discard all full-zero block writes coming into the box! So EZTs don’t actually inflate your LUNs.

[Screenshot: cgcreate2]

Next I choose the Recoverpoint initiator group (the one that has ALL my RP initiators in it) and map the volume. LUN IDs have never really been that important when dealing with RP, although in remote protection it can be nice to try to keep the local and remote LUN IDs matching up. Trying to make both host LUN IDs and RP LUN IDs match up is a really painful process, especially in larger environments, for (IMO) no real benefit. But if you want to take that up, I won’t stop you, Sisyphus!

Notice I also get a warning because it recognizes that the production LUN is already mapped to an existing ESX host. That’s OK though, because I know with RP this is just fine.

[Screenshot: cgcreate3]

Alright now into Recoverpoint.  Just like always I go into Protection and choose Protect Volumes.

[Screenshot: cgcreate4]

These screens are going to look pretty familiar to you if you’ve used RP before. On this one, for me typically CG Name = LUN name or something like it, production name is ProdCopy or something similar, and then choose your RPA cluster. Just like always, it is EXTREMELY important to choose the right source and destinations, especially with remote replication. RP will happily replicate a bunch of nothing into your production LUN if you get it backwards! I choose my prod LUN and then I hit modify policies.

[Screenshot: cgcreate5]

In modify policy, like normal I choose the host OS (BTW I’ll happily buy a beer for anyone who can really tell me what this setting does…I always set it but have no idea what bearing it really has!) and now I set the maximum number of snaps. This setting controls how many total snapshots the CG will maintain for the given copy. If you haven’t worked with RP before this can be a little confusing, because this setting is for the “production copy” and then we’ll set the same setting for the “replica copy.” This allows you to have different settings in a failover situation, but most of the time I keep these identical to avoid confusion. Anywho, we want 48 max snaps so that’s what I enter.

[Screenshot: cgcreate6]

I hit Next and now deal with the production journal.  As usual I select that journal I created and then I hit modify policy.

[Screenshot: cgcreate7]

More familiar settings here, and because I want a 48 hour protection window, that’s what I set. Again, based on my experience this is an important setting if you only want to protect over a specific period of time…otherwise it will spread your snaps out over 30 days. Notice that snapshot consolidation is greyed out – you can’t even set it anymore. That’s because the new snapshot pruning policy has effectively taken its place!

[Screenshot: cgcreate8]

After hitting Next, I choose the replica copy. Pretty standard fare here, but there are a couple of interesting items in the center – this is where you configure the snap settings. Notice again that there is no synchronous replication; instead you choose periodic or continuous snaps. In our case I choose periodic and a rate of one per 60 minutes. Again I’ll stress, especially in a remote situation it is really important to choose the right RPA cluster! Naming your LUNs with “replica” in the name helps here, since you can see all volume names in Recoverpoint.

[Screenshot: cgcreate9]

In modify policies again we set that host OS and a max snap count of 48 (same thing we set on the production side). Note: don’t skip over the last part of this post, where I show you that sometimes this setting doesn’t apply!

[Screenshot: cgcreate11]

In case you haven’t seen the interface to choose a matching replica, it looks like this. You just choose the partner in the list at the bottom for every production LUN in the top pane. No different from normal RP.

[Screenshot: cgcreate10]

Next, we choose the replica journal and modify policies.

[Screenshot: cgcreate12]

Once again setting the required protection window of 48 hours like we did on the production side.

[Screenshot: cgcreate13]

Next we get a summary screen.  Because this is local it is kind of boring, but with remote replication I use this opportunity to again verify that I chose the production site and the remote site correctly.

[Screenshot: cgcreate14]

After we finish up, the CG is displayed like normal, except it goes into “Snap Idle” when it isn’t doing anything active.

[Screenshot: cgcreate15]

One thing I noticed the other day (and why I specifically chose these settings for this example) is that for some reason the replica copy policy settings aren’t always getting set correctly. See here: right after I finished up this example, the replica copy policy OS and max snaps aren’t what I specified. The production side is fine. I’ll assume this is a bug until told otherwise, but it’s a reminder to go back through and verify these settings when you finish up. If they are wrong you can just fix them and apply.

[Screenshot: cgcreate16]

Back in XtremIO, notice that the replica is now (more or less) the same size as the production volume as far as used space. Based on my testing, this is because the data existed on the prod copy before I configured the CG. If I configure the CG on a blank LUN and then go in and do stuff, nothing happens on the replica LUN by default, because it isn’t rolling like it used to. Go snaps!

[Screenshot: cgcreate17]

I’ll let this run for a couple of days and then finish up with a production recovery and a summary.

VNX File + Linux CLI

If you can learn the Linux/UNIX command line and leverage it in your job, I firmly believe it will make you a better, faster, more efficient storage/network/sysadmin/engineer. egrep, sed, awk, and bash are extremely powerful tools. The real trick is knowing how to “stack” the tools up to make them do what you want…and not bring down the house in the process. Note: I bear no responsibility for you bringing your house down!

Today I was able to leverage this via the VNX Control Station CLI. I had a bunch of standard file system replications to set up, and Unisphere was dreadfully slow. If you find yourself in this situation, give the following a whirl. I’m going to document my thought process as well, because I think that is just as important as knowing how to specifically do these things.

First, what is the “create file replication” command? A quick browse through the man pages, the web, or the Replicator manual gives us something like this:

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=DESTINATIONPOOLID -vdm DESTINATIONVDMNAME -interconnect id=INTERCONNECTID

Looking at the variable data in CAPITAL LETTERS, the only thing I really care about changing is the replication name and the file system name. In fact I usually use the file system name for the replication name…I feel like this does what I need unless you are looking at a complex Replicator setup. So if I identify the destination pool ID (nas_pool -list), the destination VDM name (nas_server -list -vdm), and the interconnect ID (nas_cel -interconnect -list), then all I’m left with is needing the file system name.
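For reference, those three lookups all run right from the Control Station prompt (the same commands as in the parentheses above):

nas_pool -list              # destination pool ID
nas_server -list -vdm       # destination VDM name
nas_cel -interconnect -list # interconnect ID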

So the command would look like (in my case, with some made up values):

nas_replicate -create REPLICATIONNAME -source -fs FILESYSTEMNAME -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Pretty cool – at this point I could just swap in each name by hand and still get through it much faster than through Unisphere. But let’s go a little further.

I want to automate the process for a bunch of different things in a list.¬† And in order to do that, I’ll need a for loop.¬† A for loop in bash goes something like this:

for i in {0..5}; do echo i is $i; done

This reads in English, “for every number 0 through 5, assign the value to the variable $i, and run the command ‘echo i is $i’.” If you run that line on a Linux box, you’ll see:

i is 0
i is 1
i is 2
i is 3
i is 4
i is 5

Now we’ve got our loop so we can process through a list. What does that list need to be? In our case that list needs to be a list of file system names. How do we get those?

We can definitely use the nas_fs command, but exactly how is a bit tricky. nas_fs -l will give us all the file system names, but it will truncate them if they get too long. If you are lucky enough to have short file system names, you might be able to get them out of here. If not, the full name would come from nas_fs -info -all. Unfortunately that command also gives us a bunch of info we don’t care about, like WORM status and tiering policy.

Tools to the rescue! What we want to do is find all lines that have “name” in them, and the tool for that is grep. nas_fs -info -all | grep name will get all of those lines we want. Success! We’ve got all the file system names.

name      = root_fs_1
name      = root_fs_common
name      = root_fs_ufslog
name      = root_panic_reserve
name      = root_fs_d3
name      = root_fs_d4
name      = root_fs_d5
name      = root_fs_d6
name      = root_fs_2
name      = root_fs_3
name      = root_fs_vdm_cifs-vdm
name      = root_rep_ckpt_68_445427_1
name      = root_rep_ckpt_68_445427_2
name      = cifs
name      = root_rep_ckpt_77_445449_1
name      = root_rep_ckpt_77_445449_2
name      = TEST
name      = TestNFS

Alas, they are not as we want them, though. First of all we have a lot of “root” file systems we don’t like at all. Those are easy to get rid of. We want all lines that don’t have root in them, and once again grep to the rescue with the -v, or inverse, flag.

nas_fs -info -all | grep name | grep -v root

name      = cifs
name      = TEST
name      = TestNFS

Closer and closer. Now the problem is the “name   =” part. What we want is only the 3rd column of text. In order to obtain this, we use a different tool: awk. Awk has its own language and is super powerful, but we want a simple “show me the 3rd column,” and that is going to just be tacked right on the end of the previous command.

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}'

cifs
TEST
TestNFS

Cool, now we’ve got our file system names.¬† We can actually run our loop on this output, but I find it easier to send it to a file and work with it.¬† Just run the command and point the output to a file like so:

nas_fs -info -all | grep name | grep -v root | awk '{print $3;}' > /home/nasadmin/fsout.txt

This way you can directly edit the fsout.txt file if you want to make changes.  Learning how these tools work is very important because your environment is going to be different and the output that gets produced may not be exactly what you want it to be.  If you know how grep, awk, and sed work, you can almost always coerce output however you want.
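For instance, if I decided the TEST file system shouldn’t be replicated after all, sed can delete that line from the file in place (just an illustration using the names from above):

sed -i '/^TEST$/d' /home/nasadmin/fsout.txt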

Now let’s combine this output with ye olde for loop to finish out strong. Note that the ` below are backticks, not single quotes:

for fsname in `cat /home/nasadmin/fsout.txt`; do echo nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001; done

My output in this case is a series of commands printed to the screen because I left in the “echo” command:

nas_replicate -create cifs -source -fs cifs -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TEST -source -fs TEST -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001
nas_replicate -create TestNFS -source -fs TestNFS -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001

Exactly what I wanted.¬† Now if I want to actually run it rather than just printing them to the screen, I can simply remove the “echo” from the previous for loop.¬† This is a good way to validate your statement before you unleash it on the world.
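For clarity, the armed version is the same loop minus the echo:

for fsname in `cat /home/nasadmin/fsout.txt`; do nas_replicate -create $fsname -source -fs $fsname -destination -pool id=40 -vdm MYDESTVDM01 -interconnect id=20001; done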

If you are going to attempt this, look into the background flag as well which can shunt these all to the NAS task scheduler.  I actually like running them without the flag in this case so I can glance at putty and see progress.

If you haven’t played in the Linux CLI space before, some of this might be Greek. Understandable! Google it and learn. There are a million tutorials on all of these concepts out there. And if you are a serious Linux sysadmin, you have probably identified a million flaws in the way I did things. 🙂 Such is life.

Sometimes there is a fine line with doing things like this, where you may spend more time on the slick solution than you would have just hammering it out. In this made up case I just had 3…earlier I had over 30. But solutions like this are nice because they are reusable, and they scale. It doesn’t really matter whether I’m doing 1 replication or 10 or 40. I can use this (or some variation of it) every time.

The real point behind this post wasn’t to show you how to use these tools to do replications via CLI, though if it helps you do that then great.¬† It was really to demonstrate how you can use these tools in the real world to get real work done.¬† Fast and consistent.