I started a new project in which we need to be prepared for Disaster Recovery between 2 Nutanix clusters in different regions. Setting up Data Protection from Prism Central went fine at first, and I already have 8 VMs with NearSync enabled at an RPO of 15 minutes. However, one new VM has been giving me trouble lately, with its RPO slipping to as much as 3 hours, and it raises the following alerts:
Warning : Snapshot Replication to Remote Site is Lagging.
Warning : Snapshot queued for replications to remote site
Warning : Entity is being transitioned to a lower frequency snapshot schedule.
To be fair, this new VM is kinda big, with 10 cores, 50GB RAM, and 1.22 TB of storage used. It hosts a highly transactional MSSQL Server database.
I am basically stuck deciding whether this is a networking problem or a workload capacity problem on the Nutanix side.
We supposedly have a 200MBps LAN-to-LAN link between the sites, with 20ms latency. Should this be enough?
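As a rough sanity check on the link, here is a minimal Python sketch of how much changed data could be shipped within one 15-minute RPO window, and how long the 1.22 TB already on the VM would take to copy. Everything beyond the numbers quoted above is an assumption: I am not sure whether "200MBps" means megabytes or megabits per second, so both are computed; the 80% usable-throughput factor is a guess; and the function names are just illustrative. It ignores latency, compression, and whatever the other 8 VMs are already sending over the same link.

```python
def usable_bytes_per_sec(link_rate, unit="MBps", efficiency=0.8):
    """Convert a nominal link rate into usable bytes per second.

    unit: "MBps" (megabytes/s) or "Mbps" (megabits/s).
    efficiency: assumed fraction of the nominal rate that replication
    traffic can actually use (0.8 is a guess, not a measured value).
    """
    if unit == "MBps":
        raw = link_rate * 1_000_000
    elif unit == "Mbps":
        raw = link_rate * 1_000_000 / 8
    else:
        raise ValueError(f"unknown unit: {unit}")
    return raw * efficiency


def max_replicable_gb(link_rate, window_minutes, unit="MBps", efficiency=0.8):
    """GB of changed data that fit through the link in one RPO window."""
    rate = usable_bytes_per_sec(link_rate, unit, efficiency)
    return rate * window_minutes * 60 / 1e9


def full_copy_hours(size_tb, link_rate, unit="MBps", efficiency=0.8):
    """Hours needed to copy size_tb terabytes over the link."""
    rate = usable_bytes_per_sec(link_rate, unit, efficiency)
    return size_tb * 1e12 / rate / 3600


if __name__ == "__main__":
    for unit in ("MBps", "Mbps"):
        per_window = max_replicable_gb(200, 15, unit)
        seed = full_copy_hours(1.22, 200, unit)
        print(f"200 {unit}: ~{per_window:.0f} GB per 15 min window, "
              f"~{seed:.1f} h to copy 1.22 TB")
```

Under these assumptions the two readings differ by a factor of 8, so if the database changes more data per 15 minutes than the lower figure allows, the link alone could explain the lag.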
On the other hand, I can see in Nutanix’s Analysis section that the cluster I/O bandwidth sometimes has big peaks, and Read IOPS sometimes goes up to 97%. Could this be the reason?