“Everybody has a plan until they get punched in the mouth” – Mike Tyson
No truer words have been spoken than Mike Tyson’s famous quote. The quote holds true in personal life and in information technology. When your world gets tossed upside-down, things tend to snowball. I fondly remember upgrading a storage array and seeing the databases on my SQL Server go offline right before a long weekend. I had an upgrade plan and a backup plan, but both were manual tasks. One dumb thing led to another, and I was phoning my wife to wish her a great trip as I stayed at the office fixing my mistakes. So essentially, I punched myself in the face, which made it all the better.
Now I’m sure other people can relate, as various technology surveys state that 25% or more of downtime is caused by human error. The rest is a mixed bag of environmental conditions, unforeseen glitches and hardware failures. Luckily, Nutanix provides a lot of automation around upgrades, security, patching and self-healing. That does not put anyone in an ivory tower, but we do try to shield users from mundane tasks that can cause downtime. So when the inevitable comes knocking at your door, are you ready to respond? If your boss is behind you, say yes, and you can read the rest when they leave. 😊
As any good athlete knows, training for hard competition is largely about repetition, until your reactions become reflexes. In the land of disaster recovery (DR), Nutanix is furthering ease of use by adding DR orchestration known as Leap. Leap works both on-prem and with the Xi cloud service, where it is known as Xi Leap. The main components that I want to touch on are protection policies and recovery plans. Together, these two newly added services in Prism Central form the Nutanix runbooks for DR. Moving DR functionality to Prism Central allows one policy to protect all the Nutanix clusters under its management. No longer do you have to set up remote sites and snapshot schedules on individual clusters. For large environments, this will not only save tons of time but also reduce errors in recovery.
When thinking about protection policies, a commonly associated term is recovery point objective (RPO). RPO is essentially how much data you can afford to lose: everything written since the last recovery point you have available. At general availability of Leap, for both on-prem and Xi Leap, the minimum RPO is one hour. Note: Older protection domains offer near-sync if needed. I would also strongly argue that the RPO is a useless number without a recovery time objective (RTO). RTO is the time it takes to recover from a failure using your last available recovery point. If it takes you 24 hours or more to recover, you have lost not only the data since the last recovery point but also the time of the extended outage.
According to Gartner in 2017, the average cost of IT downtime is $5,600 per minute. Because there are so many differences in how businesses operate, the Gartner analyst, Andrew Lerner, states that downtime can cost $140,000 per hour at the low end, $300,000 per hour on average, and as much as $540,000 per hour at the high end.
So without knowing both the RPO and RTO of an environment, you will not have the full financial picture when facing a failure.
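To make the RPO and RTO point concrete, here is a minimal back-of-the-envelope sketch. It uses the hourly cost figures quoted above; the function name and the two scenarios are my own illustrative assumptions, not something Leap produces for you.

```python
# Hourly downtime cost figures quoted above (USD per hour).
COST_PER_HOUR = {"low": 140_000, "average": 300_000, "high": 540_000}


def outage_exposure(rpo_hours, rto_hours, cost_per_hour):
    """Return the worst-case data-loss window and outage cost for one failure.

    Hypothetical helper for illustration only.
    """
    return {
        # Worst case: the failure hits just before the next recovery point.
        "data_loss_window_hours": rpo_hours,
        # Services stay down for the whole recovery time.
        "outage_cost_usd": rto_hours * cost_per_hour,
    }


# Same one-hour RPO, very different RTOs.
fast = outage_exposure(rpo_hours=1, rto_hours=1, cost_per_hour=COST_PER_HOUR["average"])
slow = outage_exposure(rpo_hours=1, rto_hours=24, cost_per_hour=COST_PER_HOUR["average"])

print(fast["outage_cost_usd"])  # 300000
print(slow["outage_cost_usd"])  # 7200000
```

Both scenarios lose at most one hour of data, yet the 24-hour recovery costs 24 times as much in downtime. That is why the RPO alone tells you so little.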
In the screenshot above, you can see Nutanix protection policies that are applied by category. This allows policies to be applied easily across the environment. It should also be noted that protection policies automatically protect in the opposite direction after a failover. This allows the business to return to the preferred location as quickly as possible. As long as there is a network connection back to the original source, replication will start flowing again.
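The idea of applying a policy by category rather than VM by VM can be sketched in a few lines. This is a toy model and not the Prism Central API: the class names, fields and category values are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectionPolicy:
    # Hypothetical model of a policy: a name, an RPO and the categories it covers.
    name: str
    rpo_minutes: int
    categories: set = field(default_factory=set)  # e.g. {("AppTier", "Database")}


@dataclass
class VM:
    name: str
    categories: set = field(default_factory=set)


def policies_for(vm, policies):
    """Return the names of policies that share at least one category with the VM."""
    return [p.name for p in policies if p.categories & vm.categories]


policies = [
    ProtectionPolicy("gold", 60, {("AppTier", "Database")}),
    ProtectionPolicy("silver", 240, {("AppTier", "Web")}),
]
sql01 = VM("sql01", {("AppTier", "Database"), ("Site", "HQ")})

print(policies_for(sql01, policies))  # ['gold']
```

The point of the design is that a new VM only needs the right category to pick up protection; nobody has to remember to edit the policy.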
Nutanix Recovery Plans are going to make sure you’re OK after getting hit in the chops when Joe and Sally from maintenance test the sprinkler system in the datacenter. Recovery plans control the boot order, let you set delays and control the network mappings. With the proper network mapping, Xi Leap will automatically recreate your on-prem networking in Xi so you don’t have to re-IP anything. That alone is a big headache out of the way.
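Boot order, delays and network mappings are easier to picture with a small model. The sketch below is only an illustration of the concept; the `Stage` class, the VM names and the network names are hypothetical and do not reflect how Leap stores a recovery plan.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    # Hypothetical recovery plan stage: VMs that power on together.
    vms: list
    delay_seconds: int = 0  # wait after this stage before starting the next


def boot_schedule(stages):
    """Return (start_offset_seconds, vm_name) pairs in boot order."""
    schedule = []
    offset = 0
    for stage in stages:
        for vm in stage.vms:
            schedule.append((offset, vm))
        offset += stage.delay_seconds
    return schedule


# Databases first, then app servers, then the web tier.
plan = [
    Stage(["db01"], delay_seconds=120),
    Stage(["app01", "app02"], delay_seconds=60),
    Stage(["web01"]),
]

# Source network -> recovery network (names are made up).
network_mapping = {"prod-vlan-10": "recovery-prod-subnet"}

for offset, vm in boot_schedule(plan):
    print(offset, vm)
print(network_mapping["prod-vlan-10"])
```

With this plan, `db01` starts at 0 seconds, both app servers at 120 seconds and `web01` at 180 seconds, so each tier finds its dependencies already running.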
Recovery plans offer four different options to ensure your RTO is consistent and predictable. Failover operations in Leap are of the following types:
Validate: If you perform the validation in Xi Cloud Services, Leap validates failover from the on-premises availability zone to Xi Cloud Services. Recovery plan validation only reports warnings and errors; failover is not performed. Think of this as your quick gut check.
Test Failover: You perform a test failover when you want to test a recovery plan. When you perform a test failover, the VMs are started in the virtual network designated for testing purposes at the recovery location. However, the VMs at the primary location are not affected. Test failovers rely on the presence of VM snapshots at the recovery location. Test early and test often, so you can make sure all dependencies are in order.
Planned Failover/Migration: You perform a planned failover when a disaster that disrupts services is predicted at the primary location. When you perform a planned failover, the recovery plan first creates a snapshot of each VM, replicates the snapshots to the recovery location, and then starts the VMs at the recovery location. Therefore, for a planned failover to succeed, the VMs must be available at the primary location. If the failover process encounters errors, you can resolve the error condition. After a planned failover, the VMs no longer run in the source availability zone. A planned failover also keeps the MAC address, which helps if you have older software licensing schemes in play.
After failover, replication begins in the reverse direction.
Unplanned Failover: You perform an unplanned failover when a disaster has occurred at the primary location. In an unplanned failover, you can expect some data loss to occur. The maximum data loss possible is equal to the RPO configured in the protection policy or the data that was generated after the last manual backup for a given VM. In an unplanned failover, by default, VMs are recovered from the most recent snapshot. However, you can recover from an earlier snapshot by selecting a date and time. Any errors are logged, but the execution of the failover continues.
After failover, replication begins in the reverse direction.
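For the unplanned case, the relationship between snapshots and data loss can be sketched as follows. This is a simplified model, not Leap's implementation: the `pick_snapshot` helper and the rule of choosing the latest snapshot at or before a requested time are my assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def pick_snapshot(snapshot_times, restore_to=None):
    """Pick a recovery point from snapshot times sorted oldest to newest.

    With no restore_to, return the most recent snapshot (the default behaviour
    described above). Otherwise return the latest snapshot taken at or before
    restore_to. Hypothetical helper for illustration only.
    """
    if not snapshot_times:
        raise ValueError("no snapshots available")
    if restore_to is None:
        return snapshot_times[-1]
    idx = bisect_right(snapshot_times, restore_to)
    if idx == 0:
        raise ValueError("no snapshot at or before the requested time")
    return snapshot_times[idx - 1]


# Hourly snapshots (one-hour RPO), disaster at 11:45.
snapshots = [datetime(2019, 1, 1, h) for h in (9, 10, 11)]
failure = datetime(2019, 1, 1, 11, 45)

latest = pick_snapshot(snapshots)
print(failure - latest)  # 0:45:00 of data lost, inside the one-hour RPO

earlier = pick_snapshot(snapshots, restore_to=datetime(2019, 1, 1, 10, 30))
print(earlier)  # 2019-01-01 10:00:00
```

Recovering from an earlier snapshot trades more data loss for a known-good state, which can be the right call if the latest recovery point already contains the damage.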
The quality assurance team at Nutanix runs daily tests to make sure a large number of VMs are recovered in a timely fashion (RTO). These tests include:
- Deploy 200 VMs on Source.
- Perform unplanned failover in Xi. Calculate time taken. The timer starts when the tasks start and stops when they are 100% complete.
- Delete the 200 VMs on Source.
- Protect the recovered 200 VMs on Remote.
- Repeat 2-5 in reverse direction.
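If you want to time your own failover tests in the same spirit, a tiny harness is enough. This sketch is not the Nutanix QA tooling; `measure_rto` and `fake_failover` are hypothetical stand-ins, and you would replace the fake with whatever actually triggers and waits for your failover.

```python
import time


def measure_rto(run_failover, vm_count):
    """Time one failover run: start when it is triggered, stop when it returns."""
    start = time.perf_counter()
    recovered = run_failover(vm_count)
    elapsed = time.perf_counter() - start
    if recovered != vm_count:
        raise RuntimeError(f"only {recovered} of {vm_count} VMs recovered")
    return elapsed


def fake_failover(vm_count):
    # Stand-in for a real failover call; pretends the work takes a moment.
    time.sleep(0.01)
    return vm_count


elapsed = measure_rto(fake_failover, 200)
print(f"Recovered 200 VMs in {elapsed:.2f} seconds")
```

Recording this number on every test run gives you a trend line, so an RTO that quietly creeps up gets noticed before the real disaster.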
Knowing that Xi Leap gives you a solid RPO and RTO and the ability to test at any time, you are well equipped to be hit in the mouth!