In this post we are going to explore the new snap-based protection with RecoverPoint and XtremIO. It is worth noting that some of this is based on my own observations and testing, and I encourage you to do the same in your environment. Take snaps, try to recover, etc. I used a relatively small sample size and testing criteria.
Also worth noting – you aren’t supposed to manually interact with the snapshots that RP is taking, and in fact you can’t even see them when logged in as admin. However, if you log in as tech or rp_user, you can see them and optionally interact with them. But again remember, if you manually interact with these snapshots you may cause issues in your environment! Leave this type of stuff to the testers or to test environments.
Snap Based Replication Behavior
So, snap based replication – what it is and how is it different?
Well, standard RecoverPoint is pretty well documented, but the idea is that each write is:
- Split at the source array
- Sent to the remote array
- Finally applied to the journal volume there.
At a later time, this write will be applied to the replica LUN. So the journals contain a timeline of writes, and the replica LUN is somewhere along that timeline at any given moment. There’s no real clue where, but when you go to access an image (with direct access), the system will “roll” the replica along the write timeline to wherever you want.
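The journal-and-roll model above can be sketched in a few lines. This is purely an illustrative toy of the concept (the class and method names are mine, not RecoverPoint internals): the remote journal is an ordered timeline of writes, and "rolling" the replica means applying journal entries forward until the replica reflects the requested point in time.

```python
# Toy model of classic journal-based replication (illustrative only;
# not how RecoverPoint is actually implemented).

class JournalReplica:
    def __init__(self):
        self.journal = []     # ordered timeline of (timestamp, offset, data)
        self.replica = {}     # offset -> data on the replica LUN
        self.applied = 0      # how far into the journal the replica has rolled

    def split_write(self, ts, offset, data):
        """A write split at the source lands in the remote journal."""
        self.journal.append((ts, offset, data))

    def roll_to(self, ts):
        """Roll the replica forward by applying journal entries up to time ts."""
        while self.applied < len(self.journal) and self.journal[self.applied][0] <= ts:
            _, offset, data = self.journal[self.applied]
            self.replica[offset] = data
            self.applied += 1

j = JournalReplica()
j.split_write(1, 0, "A")
j.split_write(2, 0, "B")
j.split_write(3, 8, "C")
j.roll_to(2)          # image access at t=2: the replica sees "B" but not yet "C"
print(j.replica)      # {0: 'B'}
```

The key point the post makes next is that snap-based replication drops this roll-forward machinery entirely.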
Snap-based replication is nothing like this; instead it operates as follows. Again, I’m writing this based on my reading of the tech notes as well as what I “see” between RP and XtremIO. I write this from the perspective of a single source/replica combo, but obviously you can have multiples just like always.
- Source LUN and Replica LUN (along with a single source and replica journal – remember no need to have large journals or even multiple journals) form a consistency group.
- On the source LUN, a snap is created that is labeled “SMP” – likely a reference to snapshot mount point, even though these don’t really exist on XtremIO. All snaps are just disks.
- On the DR side, the DR LUN also has an SMP snap created.
- On the DR side, two sub-snaps of the SMP are created, called Volume##### (some incremental volume number). Presumably the first is the state the LUN started with and the next is where the changes are headed. At this point if you look inside RP at the DR journal, you will see two snaps. Regardless…
- All changes (current contents of SMP) are sent across to the DR side. So at this point we’ve got Source LUN and Source SMP snap. We’ve also got DR LUN, DR SMP snap, and 2 x sub snaps.
- At some point (depending on how you’ve configured things) the system will:
  - Take another prod-side snap and DR snap, both Volume##### snaps. On the prod side, this snap is temporary because the differences between it and the prod SMP snap represent the changes that need to be sent across.
  - Send these changes across and inject them into the DR snapshot, which is your newest snapshot for recovery.
  - Once this is complete, merge the temporary snap on the source into the SMP snap, which now represents the state of the source LUN as of the last replication.
Now the source SMP and the latest snap are identical.
This process repeats indefinitely and represents your ongoing protection.
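The cycle above can be sketched as a small simulation. To be clear, this is my own model of the behavior described (all function names here are hypothetical, and snaps are modeled as plain dict copies rather than pointer-based XtremIO snapshots): the source SMP snap records the state as of the last replication, each cycle diffs a fresh temporary snap against it, ships only the deltas, and then merges the temp snap into the SMP.

```python
# Hedged sketch of the snap-diff replication cycle (my reading of the
# tech notes, not EMC's implementation). Snaps are modeled as dict copies.

def take_snap(volume):
    return dict(volume)   # stands in for a pointer-based snapshot

def diff(old_snap, new_snap):
    """Blocks that changed since the last replication cycle."""
    return {k: v for k, v in new_snap.items() if old_snap.get(k) != v}

def replicate_cycle(source, source_smp, dr_lun):
    temp = take_snap(source)             # temporary prod-side snap
    changes = diff(source_smp, temp)     # only the deltas cross the wire
    dr_lun.update(changes)               # injected into the newest DR snapshot
    source_smp.clear()
    source_smp.update(temp)              # merge temp into SMP: "state as of last replication"
    return changes

source = {0: "A", 8: "B"}
smp = take_snap(source)    # baseline after the initial full sync
dr = take_snap(source)
source[0] = "A2"           # a new host write arrives
sent = replicate_cycle(source, smp, dr)
print(sent)                # {0: 'A2'} - only the changed block was sent
```

After the cycle, the source SMP and the latest DR snapshot are identical, matching the description above.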
So clearly a departure from what we are used to. Because all changes are stored in snapshots, no journal space is necessary for storing writes. There is also no need to keep rolling the replica, because the recovery points on RP are in-memory (pointer-based) snapshots on XtremIO, which can be promoted or merged near-instantaneously at any time. I confirmed for myself that there is no replica rolling by:
- Configured a CG on a blank LUN and let replication start rolling through snaps.
- Mounted the prod LUN in vSphere and created a VMFS datastore, noting some activity in the snaps.
- Waited a few more replication cycles.
- Paused CG
- Unmounted/unmapped prod LUN
- Manually mapped replica LUN
- Mounted/attached the replica LUN in vSphere, but it did not contain a VMFS file system. It was just a raw LUN, indicating that there is no more replica rolling in the background.
- Unmounted/unmapped replica LUN
- Enabled image access on newest snapshot
- Mapped/mounted/attached replica LUN in vSphere. Now the VMFS file system is there.
- Detached replica and disabled image access.
- Reattached the replica LUN; the VMFS file system was still there. So it didn’t try to restore the “nothing” that was in the LUN to begin with, since there is no good reason to do that.
One thing I didn’t test is whether the snaps get merged into the replica LUN as they roll off the image list. I don’t think this is the case – I think they are actually merged into the DR-side SMP snap, though I haven’t confirmed.
But either way, again, very cool how this new functionality leverages XtremIO snaps for efficient replication.
Another nice change is that image access no longer uses the journal, because essentially all changes are snap based and stored in the XtremIO pool. So no worries about long-term image access filling up the journal.
I did image access on a raw LUN and presented it to vSphere. I created a new datastore and deployed an EZT VMDK. In the RP GUI, there was no extra activity on the journal side.
Interestingly, the “undo writes” button still works. In this case I unmounted the LUN from vSphere and clicked undo writes. When I attempted to remount/re-add it, there was no datastore on it.
Consistency Group Snapshot Behavior
When you configure a consistency group, you will configure a few parameters related to your protection. The first is Maximum Number of Snapshots. This is the total number of snapshots that consistency group will retain, and it goes up to 500. Don’t forget that there is a per-XMS limit of 8,192 total volumes + snapshots! If you configure 500 snaps per group, you’ll probably run out quickly and won’t even be able to create new LUNs on XtremIO.
The other parameter you’ll configure is the type of protection you want. There is no synchronous mode with RP+XtremIO. Instead you choose Continuous which essentially creates a new snap as soon as the previous one is done transferring, or Periodic which will take snaps every X minutes.
With Continuous there isn’t really anything else to configure. You can configure an RPO in minutes, but this is allegedly just an alerting mechanism.
With Periodic, you do tell it how often to take the snaps. You can configure down to a per minute snapshot if you want.
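The practical difference between the two modes comes down to what drives the snap cadence. Here is a toy scheduler contrasting them (illustrative only; the function names are mine): Continuous starts the next snap as soon as the previous transfer completes, so cadence tracks transfer duration, while Periodic fires on a fixed interval regardless.

```python
# Toy comparison of the two protection modes (not RecoverPoint code).

def continuous(transfer_durations):
    """Snap start times when each snap begins as soon as the last transfer finishes."""
    t, starts = 0, []
    for d in transfer_durations:
        starts.append(t)
        t += d
    return starts

def periodic(interval, count):
    """Snap start times on a fixed schedule, independent of transfer time."""
    return [i * interval for i in range(count)]

print(continuous([3, 5, 2]))   # [0, 3, 8] - cadence follows transfer duration
print(periodic(60, 3))         # [0, 60, 120] - fixed every 60 seconds
```

So under Continuous, how frequently you get recovery points depends on change rate and link speed; under Periodic, it is whatever interval you configured.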
Alright, so now the weirdness – the snapshot pruning policy. The snapshot pruning policy is designed to give you a nice “spread” of snapshots. This is listed in the whitepaper as follows (these percentages are not currently adjustable):
Age of snapshots // Percentage of total
- 0–2 hours // 40%
- 2–24 hours // 30%
- 1–2 days // 20%
- 2–4 days // 5%
- 5–30 days // 5%
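To make the table concrete, here is the arithmetic of how a snapshot budget would break down across those age buckets. The percentages and bucket boundaries are the whitepaper's as quoted above; the function itself is just my illustration.

```python
# Rough arithmetic from the whitepaper's pruning table (percentages as
# published above; bucket boundaries are the whitepaper's, not mine).

PRUNING_POLICY = [
    ("0-2 hours",  0.40),
    ("2-24 hours", 0.30),
    ("1-2 days",   0.20),
    ("2-4 days",   0.05),
    ("5-30 days",  0.05),
]

def snap_budget(max_snaps):
    """How many snaps each age bucket gets out of the CG's maximum."""
    return {bucket: int(max_snaps * pct) for bucket, pct in PRUNING_POLICY}

print(snap_budget(500))
# {'0-2 hours': 200, '2-24 hours': 150, '1-2 days': 100, '2-4 days': 25, '5-30 days': 25}
```

So with the 500-snap maximum, 200 snaps would sit in the most recent two hours, thinning out rapidly as they age.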
This is kind of helpful, except they don’t really tell you how or when this policy is applied. In my testing, here is what I believe to be true.
- Unlike previous versions, the “Required Protection Window” setting actively alters what snapshots are removed. In classic RP, required protection window was simply an alerting mechanism. Now it appears that if you configure a required protection window of Z hours with X snapshots, most of the time the system will work to stagger those out so you will have X snapshots distributed throughout your Z hours.
- For instance, if you told the system you want periodic snaps every minute, a 10 maximum snapshot count, and a required protection window of 5 hours, it will start out by taking one snap a minute for 10 minutes. After that, it will begin deleting snaps in the middle while preserving the first ones it took. In my testing I still had 2 of the first snaps it took, but a lot of intermediary ones had been purged. It will continue this process until you hit the 5-hour mark, when it starts purging the oldest snap. So you will end up with a 5-hour rolling protection window at the end of the day. Same thing if you said 12 hours, or 1 day, or 1 week, etc.
- If you reduce your Required Protection Window, the system will immediately purge snapshots. So for instance if I have my 5 hour window as in my previous example, with 5 hours worth of snaps, and I reduce my Required Protection Window to 3 hours, any snaps past 3 hours are immediately deleted.
- By default (again, I believe this to be true), a consistency group will have an unwritten Required Protection Window of 1 month. I noticed while tinkering around that if a CG doesn’t have a Protection Window set, it looks like it will try to go for 30 days worth of snaps. And sometimes (in the midst of testing copies and other things) it actually set a 30 day window on the CG without my interaction.
- If the protection window is 1 or 2 hours, no snapshot pruning is done. This roughly matches the stated pruning policy, which only starts to delineate after 2 hours. But, for example, if I configure a CG with a 10-snap max, 1 snap per minute, and a 1 or 2 hour required protection window, then my actual recovery window will only ever be 10 minutes long and I will never meet my specified requirements. After 10 snaps exist, the newest snap always replaces the oldest. BUT! If I set my Required Protection Window to 3+ hours, then it starts doing the odd pruning/purging so that my total protection window is met.
- The pruning behavior seems to be the same whether you have Periodic snaps or Continuous snaps in place.
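The behavior described in the bullets above can be simulated as a simple rule. This is strictly my inference from testing, not a documented algorithm, and every name here is hypothetical: drop anything older than the Required Protection Window, then thin from the middle of the timeline until under the snap cap, preserving the oldest and newest snaps.

```python
# Simulation of the pruning behavior I observed (my inference only;
# the real pruning algorithm is not documented).

def prune(snaps, max_snaps, window_minutes, now):
    """snaps: sorted list of snap timestamps in minutes, oldest first."""
    # Anything older than the required protection window is purged immediately.
    snaps = [s for s in snaps if now - s <= window_minutes]
    # Thin from the middle until under the cap, keeping oldest and newest.
    while len(snaps) > max_snaps:
        snaps.pop(len(snaps) // 2)
    return snaps

# 1 snap/minute, 10-snap max, 5-hour (300 min) window, observed 15 minutes in:
snaps = list(range(0, 15))             # a snap taken at each minute 0..14
snaps = prune(snaps, 10, 300, now=15)
print(snaps)  # [0, 1, 2, 3, 4, 10, 11, 12, 13, 14] - middle snaps purged
```

This reproduces the observed shape (earliest snaps retained, intermediary ones purged), and reducing `window_minutes` immediately drops the older snaps, matching the Required Protection Window behavior noted above.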
Again I found this to be a little odd and hope there is some clearer documentation in the future, but in the meantime this is my experience.