Thursday, October 10, 2013

Getting Your Backup and Recovery Process On Foils - Part 1

Photo taken by Alex Almeida during Oracle Team USA Practice Session. San Francisco Bay, CA Aug 2013
A couple of weeks ago we witnessed Oracle Team USA make a heroic comeback in the 34th America's Cup in San Francisco Bay. Clearly the biggest comeback in the Regatta's history, and arguably one of the biggest comebacks in all of sports! I was certainly pulling for the team that would keep the cup here in America, but I always look forward to these competitions to see the world's elite sailors work together as a team to flawlessly and efficiently navigate the fastest path around the course. It amazes me how each sailor has put in years of practice with their respective groups so that each tack and jibe is executed perfectly and each team member moves purposefully in concert with the rest of the crew. It also goes without saying that the new AC72s are a big reason why the racing action is much more exciting than in years past. The boats themselves are a marvel of technology alone.

I found The America's Cup to be an appropriate backdrop and analogy for discussing backup and recovery of an SAP with Oracle environment. Not because of the clear ties to the database vendor sponsoring Team USA, but rather because of how a sailing team in any regatta operates. There are lots of moving parts, and the winning solution has the most efficient design (fastest boat) and a crew that integrates together like clockwork toward a main goal. In fact, you can apply a lot of what I am going to talk about here to any mission critical application use case.

When backing up any mission critical transactional data the ideal recipe calls for the following:
  • Create a full retention copy on cost-effective storage as quickly as possible 
  • Minimize the impact of the backup process on the application and, more importantly, its end users (NO IMPACT is really what we are going for) 
  • Provide an "as easy as Apple Time Machine" recovery mechanism, where the Recovery Point Objective (RPO) is as granular and flexible as possible.
In talking to more and more DBAs and IT Administrators, I hear that one of the biggest pain points is tied to the second bullet point above. As most of them are experiencing (and maybe you are as well), transactional systems now essentially operate at peak load around the clock. So any task that consumes processing cycles in your application environment is scrutinized.
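To make the first bullet point concrete, the worst-case data loss window is simply the time elapsed since the last completed backup, and you can compare it against your RPO target. This is a minimal sketch with hypothetical timestamps, not tied to any particular backup tool:

```python
from datetime import datetime, timedelta

def rpo_exposure(last_backup_end: datetime, now: datetime) -> timedelta:
    """Worst-case data loss window if the primary fails right now:
    everything committed since the last completed backup is at risk."""
    return now - last_backup_end

# Hypothetical numbers: the last backup finished 6 hours ago,
# but the RPO target is 1 hour.
now = datetime(2013, 10, 10, 12, 0)
last_backup = datetime(2013, 10, 10, 6, 0)
target = timedelta(hours=1)

exposure = rpo_exposure(last_backup, now)
print(exposure, exposure <= target)  # 6:00:00 False
```

If that check fails around the clock, tightening it means backing up more often, which is exactly where the "no impact on the application" requirement starts to bite.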

To continue the sailing analogy, the boat and the crew are the two main components that together combine for a winning solution. Each one operates to benefit the other but never gets in the other's way. The crew clearly puts a lot of effort into making sure sails are trimmed correctly and tacks and jibes are executed flawlessly, but all of that is done without slowing down the boat. The cost is most obvious when crew members don't get into position fast enough or don't make properly timed adjustments, leaving precious speed on the water. Clearly you also want to protect your business data in a fashion that doesn't cause the business to slow down. So how do we go about performing essential backups without upsetting the datacenter waters?

One of the approaches we are hearing a lot about in the data protection industry is "snap and replicate." This provides data versioning, which offers a flexible RPO and a quick Recovery Time Objective (RTO), plus replication of those copies to another array, most likely at another physical location, providing some degree of offsite protection.

As you start to dig deeper into this approach, one has to ask, "What happens if I lose my primary volume?" The answer is inevitably that the metadata recipe for re-creating the version of data I need (the snap) becomes useless. That metadata still references data blocks residing on the original storage logical unit number (LUN), and those blocks are essential for rebuilding any part of the dataset at any given RPO.
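The dependency on the primary LUN can be sketched with a toy copy-on-write model. The names here (Lun, take_snapshot, restore) are illustrative, not any vendor's actual API; the point is simply that the snap holds a recipe, not the data:

```python
# Toy model of why an array snapshot depends on the primary LUN.

class Lun:
    def __init__(self, blocks):
        self.blocks = dict(blocks)   # block number -> data
        self.alive = True            # does the primary volume still exist?

def take_snapshot(lun):
    # A copy-on-write snap stores only a metadata recipe:
    # "block n lives at address n on the primary LUN."
    return {"source": lun, "block_map": list(lun.blocks)}

def restore(snap):
    # Rebuilding the snap dereferences the primary LUN's blocks.
    if not snap["source"].alive:
        raise RuntimeError("primary LUN lost: snap metadata is useless")
    return {n: snap["source"].blocks[n] for n in snap["block_map"]}

primary = Lun({0: "SOH", 1: "FI"})
snap = take_snapshot(primary)
print(restore(snap))      # works while the primary is healthy

primary.alive = False     # simulate losing the primary volume
# restore(snap) now raises: the recipe points at blocks that no longer exist
```

A real array also copies changed blocks aside as the primary is overwritten, but every unchanged block is still read from the primary, so losing the primary volume takes the snaps down with it.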

Let's set aside actual primary LUN failure for a second and concentrate on application/software data corruption. By the time you realize corruption has taken place, it is often too late to recover from snaps because the corruption has already propagated through too many of them.
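The arithmetic behind that failure mode is simple: with a fixed snap interval and retention count, any corruption detected later than interval times retention after it occurred has already aged out every clean snap. A minimal sketch, with a hypothetical hourly schedule:

```python
from datetime import datetime, timedelta

def oldest_clean_snap(snap_times, corrupted_at):
    """Return the newest snap taken before corruption crept in, or None
    if every retained snap already contains the corruption."""
    clean = [t for t in snap_times if t < corrupted_at]
    return max(clean) if clean else None

# Hypothetical schedule: hourly snaps, keep the last 24.
now = datetime(2013, 10, 10, 12, 0)
snaps = [now - timedelta(hours=h) for h in range(24)]

# Corruption crept in 30 hours ago but was only noticed today:
corrupted_at = now - timedelta(hours=30)
print(oldest_clean_snap(snaps, corrupted_at))  # None -> no clean snap left
```

A full-retention backup copy on separate storage, held well beyond the snap window, is what closes that gap.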

But as much as folks may think I am "snap bashing," snaps aren't all bad! They can actually be very helpful! The key is how they are used. When leveraged as a complement to backup rather than as the lone backup process, you really start to see your Application Backup and Recovery processes step out of the limelight. Just like the crew on an AC72 in the America's Cup. To further validate my point, can you name for me the "grinders" on the Oracle Team USA boat?

The right Backup and Recovery architecture for mission critical transactional data separates the winning businesses from the losing ones and can make your IT team look like champions, essential to business success. On the other hand, built on the wrong technology, that same team quickly starts looking like serious ballast that not even the strongest winds can move!

Stay tuned for a future post where I expose the bits and bytes on a way to go about implementing this principle.