I’m pretty new to this Nutanix world; I’ve been dealing with standard server+storage for more than a decade. We have two clusters, one per site, with 3 nodes each, configured for metro availability. Each site has 3 active protection domains that are replicated to the other site as passive, and vice-versa.
Site1 (Node1 - Node3 - Node5):
PDs active: PROD_001, DEV_001, INFRA_001
PDs passive: PROD_002, DEV_002, INFRA_002
Site2 (Node2 - Node4 - Node6):
PDs active: PROD_002, DEV_002, INFRA_002
PDs passive: PROD_001, DEV_001, INFRA_001
In the vCenter cluster configuration we obviously have VM/Host affinity rules for site1 and site2, keeping site1 VMs running on the “odd” nodes and site2 VMs on the “even” nodes.
Sometimes we have to migrate VMs from one site to the other, so we do a complete vMotion (compute and storage). After the migration, we constantly receive alerts with this message:
Snapshot status for vstore INFRA_001: Failed. Vstore INFRA_001 has VMs being protected by other vstore(s): VM = SXXXX96 vstores = (PROD_002). Please unprotect VMs from vstore(s) before snapshotting this vstore.
It also happens when we Storage vMotion a VM’s data files from one datastore to another within the same site. I searched the internet and the Nutanix documentation and found nothing about how to deal with these errors. The alert says “unprotect VMs from vstore(s) before snapshotting this vstore”, but how do I do that? Is it done in ncli? Prism? vCenter? Is there something we are not doing right here? What is the best practice?
Any help will be appreciated.
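For the “how do I unprotect?” part of the question: protection-domain membership is normally managed from a CVM with ncli. The sketch below is a hedged example only, using the PD and VM names from this thread; the exact syntax and whether per-VM unprotect applies to your metro/vstore PDs depend on your AOS version, so verify against the ncli command reference first.

```shell
# Run from any CVM in the cluster. PD and VM names below are the
# ones mentioned in this thread and are illustrative only.

# List protection domains and see which VMs each one protects
ncli pd list

# Remove the VM from the PD that should no longer protect it
# (here, the old site's PD left over after the migration)
ncli pd unprotect name=PROD_002 vm-names=SXXXX96

# Re-protect it in the PD that matches its new site/datastore
ncli pd protect name=INFRA_001 vm-names=SXXXX96
```

Note that for metro/vstore protection domains a VM is generally protected by virtue of the datastore its files live on, so also confirm after the vMotion that all of the VM’s files (vmx, vmdk, swap) ended up on a single datastore.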
Best answer by Sergei Ivanov
I have checked the history of your support cases and found a performance-related case concerning a VMware bug: when more than 5 NFS datastores are connected via the same IP, storage performance degrades over time. This issue is addressed in ESXi 6.5U3, 6.7U3 and newer. We also applied a workaround on the AOS side, and simply upgrading AOS to 5.10.4 or newer applies the fix, but the hosts need a reboot afterwards. That is what I can see happened in your situation: the fix was already applied, but the reboot was pending. As far as I can see from the case, the issue was resolved after the host reboots were completed.
Here is the information about that VMware bug: https://kb.vmware.com/s/article/67129
We also have a KB about this issue with more details: https://portal.nutanix.com/kb/6961