I started a new project in which we need to be prepared for Disaster Recovery between 2 Nutanix clusters in different regions. I set up Data Protection from Prism Central just fine in the beginning, and already have 8 VMs with NearSync enabled with an RPO of 15 min. However, this new VM has been giving me trouble lately, with an RPO of up to 3 hours because of the following issues:
Warning : Snapshot Replication to Remote Site is Lagging.
Warning : Snapshot queued for replications to remote site
Warning : Entity is being transitioned to a lower frequency snapshot schedule.
To be fair, this new VM is kinda big, with 10 cores, 50GB RAM, and 1.22 TB storage used. It’s for a highly transactional MSSQL Server database.
I am basically stuck deciding whether this is a networking problem or a workload capacity problem on the Nutanix side.
We are supposedly using a 200Mbps LAN-to-LAN link with 20ms latency. Should this be enough?
On the other hand, I can see in Nutanix’s Analysis section that the cluster I/O bandwidth sometimes has big peaks, and the Read IOPS sometimes goes up to 97%. Could this be the reason?
Any idea of the change rate of your server? Maybe it is not suitable for NearSync.
The first warning in your message is suggesting bandwidth issues…
Are you able to monitor the 200Mbps connection from network perspective? No saturation on the link?
How can I determine the change rate of this VM? Can this be done in Nutanix’s Analysis section?
According to the network team, the 200Mbps connection is being used at 100%, but Nutanix “can’t receive it completely”, that is, it cannot get the full 200Mbps of bandwidth. I don’t really think that’s the case, but I am not an expert in Nutanix infrastructure.
If the line is fully utilized, then you have your problem… The delta snapshots are not delivered in time… Nutanix will revert back to async.
I think you should increase the bandwidth for the Nutanix replication.
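As a rough sanity check on whether a link can keep up, you can compare how much data the link is able to move inside one RPO window with how much the VM changes in that window. The sketch below is plain arithmetic, not anything Nutanix-specific: the function names and the `efficiency` parameter (a hypothetical fudge factor for protocol overhead and other traffic sharing the link) are assumptions for illustration, and it ignores compression, latency effects, and how the replication is actually scheduled.

```python
def max_transfer_gb(link_mbps: float, window_minutes: float, efficiency: float = 1.0) -> float:
    """Upper bound on the data (in GB, 10^9 bytes) a link can move in a time window.

    link_mbps      -- link speed in megabits per second
    window_minutes -- length of the window, e.g. the RPO in minutes
    efficiency     -- hypothetical fraction of the link actually usable (0..1)
    """
    bytes_per_second = link_mbps * 1_000_000 / 8 * efficiency
    return bytes_per_second * window_minutes * 60 / 1_000_000_000


def fits_in_window(change_gb: float, link_mbps: float, window_minutes: float,
                   efficiency: float = 1.0) -> bool:
    """True if change_gb of new data could be shipped within the window."""
    return change_gb <= max_transfer_gb(link_mbps, window_minutes, efficiency)


# A 200 Mbps link over a 15 minute window, assuming the whole link is available:
print(max_transfer_gb(200, 15))        # 22.5
# Same link if only half of it were usable in practice (assumed value):
print(max_transfer_gb(200, 15, 0.5))   # 11.25
```

If the amount of data changed per window is well below that ceiling and replication still lags, the bottleneck is more likely somewhere else on the path than in the raw link speed.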
Yeah, we are planning to increase it to 500Mbps. I will post results once they confirm this change. Thanks for the suggestion.
You can determine the change rate through your backup solution by calculating the average size of the last 4 incremental backups.
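A minimal sketch of that calculation, assuming you have already read the incremental backup sizes (in GB) out of your backup tool, oldest first. The function name and the sample sizes are made up for illustration. Note that the result is the change per backup interval, so it only approximates the change per NearSync window if you scale it to the same time span.

```python
from statistics import mean


def average_change_gb(incremental_sizes_gb, last_n=4):
    """Average size of the most recent `last_n` incremental backups, in GB."""
    recent = incremental_sizes_gb[-last_n:]
    if not recent:
        raise ValueError("need at least one incremental backup size")
    return mean(recent)


# Hypothetical sizes in GB, oldest first; replace with your own numbers.
sizes_gb = [3.0, 1.0, 2.0, 1.5, 1.5]
print(average_change_gb(sizes_gb))  # 1.5
```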
Unfortunately the Reclaiming Space column for each snapshot in Prism Central is always blank, which is strange because in Prism Element I could view the values in the Data Protection section. IIRC the average size of the incremental backups was between 1GB and 2GB.
Thank you for the information.
Just wanted to mention that my solution to this issue (well, at least partially) was fixing/adjusting the policies in our switches. Apparently it was not the bandwidth; 200Mbps was enough in our case.
However, I am still getting the Warning : Snapshot queued for replications to remote site, but at least NearSync with 15min RPO is working now.
Thanks everyone for your advice and information provided.