Hello!

I’m pretty new to the Nutanix world, having worked with standard server+storage setups for more than a decade. We have 2 clusters, one per site, with 3 nodes each, running Metro Availability. Each site has 3 active protection domains that are replicated to the other site, where they are passive.
 

Site1:

Node1 - Node3 - Node5

PDs site1 (active): PROD_001, DEV_001, INFRA_001

PDs site1 (passive): PROD_002, DEV_002, INFRA_002

 

Site2:

Node2 - Node4 - Node6

PDs site2 (active): PROD_002, DEV_002, INFRA_002

PDs site2 (passive): PROD_001, DEV_001, INFRA_001

 

In the vCenter cluster configuration we have VM/Host affinity rules for site1 and for site2, so that VMs running on the “odd” nodes are not stored on the “even” nodes.
Sometimes we have to migrate VMs from one site to the other, so we do a complete vMotion (compute and storage). After the migration, we start receiving constant alerts with this message:

Snapshot status for vstore INFRA_001: Failed. Vstore INFRA_001 has VMs being protected by other vstore(s): VM = SXXXX96 vstores = (PROD_002). Please unprotect VMs from vstore(s) before snapshotting this vstore.

It also happens when we storage vmotion a VM datafiles from one datastore to another in the  same site. I searched the internet and the Nutanix documentation and found nothing about how to deal with these errors. The alert says “Please unprotect VMs from vstore(s) before snapshotting this vstore”, but how do I do that? Is it done in ncli? Prism? vCenter? Is there something that we are not doing right here? What is the best practice?
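The alert text quoted above names the vstore whose snapshot failed, the affected VM, and the other vstore(s) that still list that VM. When many of these alerts pile up, a small script can pull those three fields out so they can be grouped per VM. The following is only a sketch: the function name `parse_vstore_alert` is hypothetical, and the pattern is derived solely from the single alert message shown in this thread, so other alert variants may not match.

```python
import re

# Pattern built from the one alert message quoted in this thread.
# Assumption: other alerts of this kind follow the same wording.
ALERT_RE = re.compile(
    r"Vstore (?P<vstore>\S+) has VMs being protected by other vstore\(s\): "
    r"VM = (?P<vm>\S+) vstores = \((?P<others>[^)]*)\)"
)


def parse_vstore_alert(message):
    """Return the failing vstore, the VM, and the other vstores, or None."""
    match = ALERT_RE.search(message)
    if match is None:
        return None
    # Assumption: multiple vstores, if present, are comma-separated.
    others = [name.strip() for name in match.group("others").split(",") if name.strip()]
    return {
        "vstore": match.group("vstore"),
        "vm": match.group("vm"),
        "other_vstores": others,
    }


if __name__ == "__main__":
    sample = (
        "Snapshot status for vstore INFRA_001: Failed. Vstore INFRA_001 has VMs "
        "being protected by other vstore(s): VM = SXXXX96 vstores = (PROD_002). "
        "Please unprotect VMs from vstore(s) before snapshotting this vstore."
    )
    print(parse_vstore_alert(sample))
```

This only reads the alert text; it does not change any protection domain or vstore configuration.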

 

Any help will be appreciated.

Thanks

Henrique

Hi Henrique,

 

If I understand correctly, you perform a failover of VMs between sites and that is when you see the error?

Also is there anything missing in this sentence? “It also happens when we storage vmotion a VM datafiles from one datastore to another in the  same site.” What happens to the datafiles?

 

For a planned failover with Metro Availability, the procedure is outlined in the guide “Failing Over a Protection Domain Manually (Planned Failover)”. Are these the steps that you follow?


Hi Alona,

It is not a failover between sites, just a rebalancing. We often create too many virtual machines in site1 and the cluster becomes unbalanced from the storage/compute resources point of view. As DRS only balances compute resources (and we don’t like the way Storage DRS works), we then need to manually migrate the whole virtual machine (compute and storage) from site1 to site2. Both sites are active, replicating to each other.

Every time we migrate a virtual machine between sites we get these errors. As I said, we also have a DEV datastore where we first create VMs for dev and test purposes. Sometimes these DEV VMs become Production VMs and need to be moved to Production datastores, so we do the same migration process and the errors start to show up too.

Thanks

Henrique


Henrique, are you using any third-party (i.e. non-Nutanix) backup solutions or tools by any chance?


Yes, I’m using Veeam Backup & Replication, but only for backups. Veeam uses VMware snapshots to back up the VMs. It works correctly, with no problems at all: it creates the snapshot, saves the data, deletes the snapshot and moves on (I can see it in the logs). I believe that the snapshot errors I see in Prism are related to some type of snapshot used by Nutanix and its engine services to replicate data between nodes/sites. I don’t believe that Nutanix uses VMware snapshots to replicate. Am I right?

 

Thanks.


This looks suspiciously like one of the logged improvements with our Engineering team. To be sure, are you able to confirm whether the alert points towards the proxy VM used in backups or not?

When you say VMware snapshots it is still important to keep in mind that this is a hyperconverged environment and the storage is handled and presented by Nutanix exclusively.

You are right, Metro Availability (MA) does not rely on third-party snapshots.


We don’t use proxy VMs for backups.

The alerts are for ordinary VMs in our environment. 


I would suggest raising this with Nutanix Support in this case.


I have done that many times. No one has ever been able to give us a command or a procedure to “unprotect” a VM. It’s always the same routine: they connect remotely, run a lot of NCC checks in the CLI, collect logs, delete the warnings and life goes on.

To be honest, I’m really disappointed with the Nutanix solution. It’s a black box: lots of theory, lots of “technology” with complicated terms, but no one has really deep knowledge of it. We have another open ticket for a performance problem that has gone 2 months without a response. All our SQL database servers had to be migrated to a server+storage solution (HPE+3PAR) due to extremely low performance on Nutanix. Really bad.

 

Thank you


Hi Henrique,

Forgive me, I can’t seem to locate any support cases. If you send me a direct message with the latest support case number, we’ll be able to review the case and hopefully provide you with a solution.


Hi

I believe this is due to an ISO file being connected to the VM (even if the CD/DVD drive is disconnected).
Edit the VM settings and change the CD/DVD drive to client device.
I don’t know if it is needed, but I also disconnect the drive from the VM.


Hi Henrique,

I have checked the history of your support cases and found a performance-related case concerning a bug in VMware: when more than 5 NFS datastores are connected via the same IP, storage performance degrades over time. This issue is addressed in ESXi versions 6.5U3, 6.7U3 and newer. We have also implemented a workaround on the AOS side; simply upgrading AOS to 5.10.4 or newer applies the fix, but the hosts need a reboot afterwards. From what I can see, that is what happened in your situation: the fix had already been applied, but the reboot was pending. According to the case, the issue was resolved after the host reboots were completed.

Here is the information about that VMware bug: https://kb.vmware.com/s/article/67129

We also have a KB about this issue with more details: https://portal.nutanix.com/kb/6961
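The version thresholds mentioned above (ESXi 6.5U3, 6.7U3 and newer; AOS 5.10.4 and newer) can be compared against an inventory with a few lines of Python. This is a rough sketch under stated assumptions, not an official check: the function names are hypothetical, version strings are assumed to look like "5.10.4" for AOS and "6.7U3" for ESXi, and "newer" for ESXi is interpreted as update 3 or later within the 6.5 and 6.7 lines, or any release after 6.7. It does not query a cluster, and it cannot tell whether the required host reboot has been done.

```python
import re


def parse_dotted(version):
    """Turn a dotted numeric string such as '5.10.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def aos_has_workaround(aos_version):
    """True if the AOS version is 5.10.4 or newer (threshold from the post above)."""
    return parse_dotted(aos_version) >= (5, 10, 4)


# Assumed ESXi notation: '<major>.<minor>' optionally followed by 'U<update>'.
_ESXI_RE = re.compile(r"^(\d+)\.(\d+)(?:U(\d+))?$")


def esxi_has_fix(esxi_version):
    """True if the ESXi version meets the thresholds quoted in the post above."""
    match = _ESXI_RE.match(esxi_version.strip())
    if match is None:
        raise ValueError(f"unrecognised ESXi version format: {esxi_version!r}")
    release = (int(match.group(1)), int(match.group(2)))
    update = int(match.group(3) or 0)
    if release in ((6, 5), (6, 7)):
        return update >= 3
    # Assumption: any release line after 6.7 already contains the fix.
    return release > (6, 7)


if __name__ == "__main__":
    for aos in ("5.10.3", "5.10.4", "5.11"):
        print("AOS", aos, aos_has_workaround(aos))
    for esxi in ("6.5U2", "6.5U3", "6.7U3", "7.0"):
        print("ESXi", esxi, esxi_has_fix(esxi))
```

The KB articles linked above remain the authoritative source for which builds contain the fix.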