How Much Journal Space Will EMC RecoverPoint use?

I see this question asked relatively frequently, and it is super easy to answer.  However, I wanted to provide some context so that folks can better understand how RecoverPoint works and why the journal behaves the way it does.

The Answer

First – how much journal space will RecoverPoint use?  All of it.  Every time.  If you allocate 10GB of journal space to a Consistency Group, RP will use all of it.  And if you allocate 100GB to that same CG (or 500GB), it will again use all of it.  Depending on the write rate, it may take a very long time to fill up, but eventually it will use it all.

(Now, the journal itself is divided into different areas, and only part of the total capacity is available for actually storing snapshots; the rest is reserved.  Here we are talking only about the snapshot area.)
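To put rough numbers on "it may take a very long time to fill up", here is a back-of-the-envelope sketch.  The function name, the 1 MB/s write rate, and especially the `usable_fraction` default are assumptions made for illustration, not RecoverPoint figures; it also ignores per-write metadata and any compression.

```python
def hours_to_fill(journal_gb: float, write_rate_mb_s: float,
                  usable_fraction: float = 0.8) -> float:
    """Hours of sustained writes needed to fill the snapshot area.

    usable_fraction is a hypothetical placeholder for the share of the
    journal that is not reserved; check your own system for the real value.
    """
    if write_rate_mb_s <= 0:
        raise ValueError("write rate must be positive")
    usable_mb = journal_gb * 1024 * usable_fraction
    return usable_mb / write_rate_mb_s / 3600


# Same consistency group, same write rate, three journal sizes.
for size_gb in (10, 100, 500):
    hours = hours_to_fill(size_gb, write_rate_mb_s=1.0)
    print(f"{size_gb} GB journal fills in roughly {hours:.1f} hours")
```

Whatever the real numbers are for your environment, the shape of the result is the same: a bigger journal takes proportionally longer to fill, but it still fills.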

The Reason

The reason this happens is due to how RecoverPoint functions as compared to other technologies where you might allocate capacity for recovery, like snapshot storage space.

Let’s take a moment to discuss snapshot technology, as with VNX snapshots.  In this case you don’t allocate capacity for anything – it just uses free pool space – but the space utilization mechanism is very similar to all snapshot methods.  A snapshot is taken at some point in time, and all blocks are “frozen” at that time in the snapshot.  As changes are made to the real data, one way or another the original data makes its way over to the snapshot space area.  So right after the snapshot is taken, virtually no space is utilized.  And as things change over time, the snapshot space utilization increases.
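As a rough illustration of why snapshot space starts near zero and grows with change, here is a toy copy-on-first-write model.  This is only one of the "one way or another" mechanisms and is not meant to describe how VNX snapshots are actually implemented; the `Snapshot` class and its method names are invented for this sketch.

```python
class Snapshot:
    """Toy copy-on-first-write snapshot over a dict of block -> data."""

    def __init__(self, volume: dict):
        self.volume = volume   # live data, shared with the caller
        self.preserved = {}    # original contents of blocks changed since the snap

    def write(self, block: int, data: str) -> None:
        # Preserve the frozen contents only the first time a block changes.
        if block not in self.preserved:
            self.preserved[block] = self.volume.get(block)
        self.volume[block] = data

    def read_frozen(self, block: int):
        # Changed blocks come from snapshot space, unchanged ones from the live volume.
        if block in self.preserved:
            return self.preserved[block]
        return self.volume.get(block)

    def space_used(self) -> int:
        # Counted in blocks for simplicity.
        return len(self.preserved)


volume = {0: "a", 1: "b", 2: "c"}
snap = Snapshot(volume)
used_at_creation = snap.space_used()   # nothing has changed yet

snap.write(1, "B")
snap.write(1, "BB")   # second change to the same block costs no extra space
snap.write(2, "C")
used_after_writes = snap.space_used()

print(used_at_creation, used_after_writes, snap.read_frozen(1), volume[1])
```

Space used tracks how many distinct blocks have changed since the snapshot, not how many writes occurred, and deleting the snapshot would hand all of it back.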

Then at some point (hopefully) you’d delete the snapshot, the space would be returned, and you’d be using less snapshot space.  Let’s say with daily snapshot scheduling (one snap per day for a week), eventually you’d settle into a kind of steady state where total utilization for the week is stable, with some minor peaks and valleys as snapshots get deleted and retaken.  So your utilization might be a little higher on Tuesday than it is on Saturday, but overall most of your Tuesdays will look the same.

RecoverPoint is really nothing like this.  Instead, abstractly, I like to think of the journal space as a bucket.  You put the bucket under your production LUN and any writes get split into the bucket.  Over time the bucket gets full of writes.  This happens for EVERY consistency group, EVERY time, and is why RP will ALWAYS use all journal space.  Of course the journal is oriented by time and this is where the bucket analogy begins to break down.  So let’s dig a little deeper.

Think of the RP journal as a line – like waiting to purchase tickets.  Or more accurately, a time line.  Whether you have one journal volume or multiple journal volumes, they still form this same line as a whole.  It starts out empty and the first write comes in and heads immediately to the front of the line, because there is nothing in it.  Like this:

[Figure: firstwrite]

That first, only write is now our oldest write in the queue (because again it is the only write!).

Subsequent writes queue up behind it.  Like this:

[Figure: morewrites]

Eventually the line capacity (journal capacity) is full and we can’t let anyone else in line, like this:

[Figure: fullwrites]

Now we are at a kind of steady state from the journal perspective.  The writes at the front of the line (the oldest point in time) start falling off to make room for newer writes as they come into the queue.  You can imagine these blocks are just continually shifting to the right as new writes come in, and old writes fall off and are lost.

This timeline defines your protection window.  You can recover from any point in time all the way back to the oldest write, and how many total writes are in the queue depends on how large the journal space is.  In this manner it is (hopefully) easy to see that RecoverPoint will always use as much journal space as you give it, and the more journal space you give it, the further back in time you can roll.
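The line analogy maps neatly onto a bounded first-in, first-out queue.  The sketch below is a toy model of that idea only, not RecoverPoint's actual on-disk format; the `Journal` class, its methods, and the unit-sized writes are all made up for illustration.

```python
from collections import deque


class Journal:
    """Toy model of the journal as a fixed-capacity FIFO time line."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()   # (timestamp, size) pairs, oldest on the left
        self.used = 0

    def append(self, timestamp: float, size: int) -> None:
        if size > self.capacity:
            raise ValueError("write is larger than the whole journal")
        # Once the line is full, the oldest writes fall off to make room.
        while self.used + size > self.capacity:
            _, old_size = self.entries.popleft()
            self.used -= old_size
        self.entries.append((timestamp, size))
        self.used += size

    def protection_window(self) -> float:
        """Time span between the oldest and newest write still in the journal."""
        if not self.entries:
            return 0.0
        return self.entries[-1][0] - self.entries[0][0]


# One write of size 1 per time unit, against a small and a large journal.
small, large = Journal(capacity=10), Journal(capacity=20)
for t in range(25):
    small.append(float(t), 1)
    large.append(float(t), 1)

print(small.used, small.protection_window())
print(large.used, large.protection_window())
```

Both journals end up completely full; the only difference is how far back the oldest surviving write is, which is exactly the protection window.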

Since I’ve already got the graphics going, and as a bonus, let’s talk about the replica LUN.  The other thing that RP is doing is constantly updating the replica LUN with journal entries.  It figures out where the next write is going, reads the data currently at that location and inserts it into the journal, and then writes the new data into that location.  As writes pile up, the “journal lag” increases.  Essentially the replica LUN is going to be, at any given point in time, somewhere along this line, like this:

[Figure: REPLICA]

You can see several things depicted in this graphic.  We have our entire timeline of the journal, which is our protection window, with the oldest write at one end and the newest write at the other.  We also have our replica LUN which at this very moment is at the state indicated by the black arrow.

The writes in front of this black arrow are writes that have yet to be distributed to the Replica LUN.  These are the journal lag.  If a ton of new writes happen, more blue stacks up, more green falls off the end, and the Replica LUN state shifts to the right.  Journal lag increases, because we have more data that has not yet been distributed into the Replica LUN, like this.

[Figure: replica_lag]

The green blocks behind the black arrow represent the Undo Stream.  This is data that is read FROM the replica LUN and written INTO the journal for an undo operation.  So if RP was going to process that next blue block, it would first find the location in the Replica LUN the block was destined for.  Then it would read the current data at that location and insert it into the journal, which would be a new green block at the front of the green blocks.  Finally it would write the blue block into the replica LUN and the Replica LUN state would advance one block.  And if write I/O ceases for long enough (or there is just enough performance for the Replica operations to catch up), then the Replica LUN state moves up, the undo stream gets larger, and the journal lag gets smaller.
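The read-then-write sequence described above can be sketched in a few lines.  This is a simplified model under my own assumptions: the `ReplicaCopy` class and its method names are hypothetical, lag is counted in writes rather than bytes, journal capacity is ignored, and the `roll_back_one` step is my guess at how an undo entry would be replayed rather than something taken from RecoverPoint documentation.

```python
from collections import deque


class ReplicaCopy:
    """Toy model of a replica LUN plus the two halves of its journal."""

    def __init__(self, replica: dict):
        self.replica = replica   # block -> data currently on the replica LUN
        self.pending = deque()   # blue blocks: received but not yet distributed
        self.undo = []           # green blocks: undo stream, newest last

    def receive(self, block: int, data: str) -> None:
        self.pending.append((block, data))

    def journal_lag(self) -> int:
        return len(self.pending)

    def distribute_one(self) -> bool:
        """Apply the oldest pending write to the replica LUN."""
        if not self.pending:
            return False
        block, data = self.pending.popleft()
        # Read the current data first and keep it for undo...
        self.undo.append((block, self.replica.get(block)))
        # ...then write the new data into that location.
        self.replica[block] = data
        return True

    def roll_back_one(self) -> bool:
        """Assumed undo step: restore the most recently overwritten data."""
        if not self.undo:
            return False
        block, old = self.undo.pop()
        # Put the newer data back at the head of the pending writes.
        self.pending.appendleft((block, self.replica.get(block)))
        if old is None:
            self.replica.pop(block, None)   # block did not exist before
        else:
            self.replica[block] = old
        return True


copy = ReplicaCopy({0: "a", 1: "b"})
for block, data in [(0, "A"), (1, "B"), (0, "AA")]:
    copy.receive(block, data)
lag_before = copy.journal_lag()

copy.distribute_one()
copy.distribute_one()
lag_after = copy.journal_lag()
state_after = dict(copy.replica)

copy.roll_back_one()
print(lag_before, lag_after, state_after, copy.replica, copy.journal_lag())
```

Each distributed write shrinks the lag by one and grows the undo stream by one, which is the same trade the graphics show as the black arrow moving along the line.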

The Summary

In summary:

  • RecoverPoint will always use ALL of the journal space you give it, regardless of the activity of what it is protecting.
  • RecoverPoint journal space can be seen as a time line, with the oldest writes on one end and the newest writes on the other end.  This time line is the protection window.
  • The Replica LUN, at any given point in time, is somewhere along the time line.  Any space between the Replica LUN and the newest write represents the journal lag.

Comments on “How Much Journal Space Will EMC RecoverPoint use?”

  1. Great article! Question though… how can I tell if my journal is large enough for a consistency group? That is to say, where in the GUI will it tell me I need to expand my journal or add another journal LUN? Thanks!
