Solved

VSS snapshot failed for the VM(s)

  • 22 January 2020
  • 4 replies
  • 13137 views

Good Day, 

I am very new to Nutanix and recently purchased a cluster. It has only been running for about 30 days now managing my network, and I am in the process of making some configuration changes to it. I need some assistance with an error message that I am receiving. I am using AHV as the hypervisor on my 3-node cluster. On this cluster I am running 5 Windows Server VMs (not using Hyper-V or VMware). I followed the instructions from the Administration Guide by enabling VSS Shadow Copies on the servers, then installing guest tools on all the servers and creating an Async DR protection domain. My configuration is working, and snapshots are being created for my Domain Controllers and Application Servers. However, whenever a snapshot of my File Server is attempted, I keep getting the following error.

"Warning : VSS snapshot failed for the VM(s) FS-01 protected by the FileServer in the snapshot (169035, 1563300389081879, 960) because Quiescing guest VM(s) failed or timed out.

Impact          : Crash consistent snapshot is taken instead of application consistent snapshot.
Cause           : Guest is not able to quiesce VM due to internal error.
Resolution      : Look at the logs in the guest VM. If the VM failed to quiesce, reduce the load on the VM and try again.

Node Id         : HM186S003223
Block Id        : 19SM6J220007
Block Type      : NX-1065-G6
Cluster Id      : 169035
Cluster Uuid    : 00058dd0-3c5e-5317-0000-00000002944b
Cluster Name    : NTNX
Cluster Version : el7.3-release-euphrates-5.10.8.1-stable-9ac2cb13b645b9df04eb85b0e091f1060ee27439
Cluster Ips     : 192.168.1.10 192.168.1.11 192.168.1.12
Timestamp       : Wed Jan 22 10:03:48 AST 2020"

I would greatly appreciate your assistance to have this issue resolved.

Best Regards,
Kevon Heraman


Best answer by JeremyJ 20 February 2020, 17:08



Hi,

Did you install NGT (Nutanix Guest Tools)? See https://next.nutanix.com/backup-and-recovery-29/vss-snapshot-failed-12608

No.


Hello @kheraman 

We now raise an alert if an application-consistent snapshot is attempted when Nutanix Guest Tools (NGT) is not installed on the VM or if the NGT communication link is down. The following document might help:

https://portal.nutanix.com/#/page/kbs/details?targetId=kA00e000000CqILCA0

Can you install NGT on the VMs and see whether the alert appears again?


Hello @kheraman

Are you still having an issue?

For an application-consistent VM snapshot to succeed, the following needs to happen:

  1. Backup snapshot is triggered with app-consistent option enabled.
  2. The CVM reaches out to the NGT service running on the VM via TCP/IP to signal that VSS snapshot is needed.
    1. This requires TCP/IP communication be possible both ways, but should not need DNS since the NGT service will inform Prism of the IP, and the NGT installation has cluster IP info. (NAT could be a problem)
    2. The NGT service must, of course, be installed and running, and able to reach the CVM on port 2074.
      (more detail on what ports are needed for full NGT function here)
    3. Communication uses a pre-shared key which is part of NGT installation, and an identifier which is unique to the VM. To have this work, NGT installation must be unique per VM using the “mount iso” option from Prism.
      If cloning VMs, you can pre-install NGT and then mount the ISO again on the clone before powering on. The NGT service will fetch updated identifier info during service start if the NGT ISO is found in the VM’s CDROM drive.
  3. The NGT service requests VSS Quiesce operation from the Windows OS.
    1. In a quiesce operation, all new changes to disk are held in hot-backup in memory on the VM until the snapshot is finished.
    2. All pending changes to disk must finish before the snapshot can happen.
    3. This requires sufficient memory on the VM to hold all new changes long enough to complete the snapshot, otherwise application consistent snapshot will fail. If you’re seeing intermittent failures, this is where to focus.
    4. This process can be impacted by high workload, slower disk performance, hypervisor memory or CPU contention, VM memory or CPU contention, or any Windows VSS specific issue which prevents quiesce completion.
      1. Options for resolution include re-balancing workloads, adding resources at the host or VM, or adjusting scheduled jobs so that snapshots can run at lower-IO times. 
  4. Once VSS signals back to NGT that the quiesce succeeded, the NGT service signals Prism and the snapshot is taken. Prism then signals back to NGT, which relays to VSS, at which point the held disk operations are allowed to flow to disk.
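
The connectivity requirement in step 2 can be checked from inside the guest VM with a short script. This is only a sketch: it tests whether a plain TCP connection to port 2074 on a CVM can be opened, which says nothing about whether the NGT service itself is healthy. The function name `can_reach` and the timeout value are my own choices, and the address in the usage comment is a placeholder to replace with one of your CVM IPs.

```python
import socket


def can_reach(host: str, port: int = 2074, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 2074 is the CVM port mentioned in step 2 above. This only proves
    that the TCP handshake works; it does not authenticate to NGT.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (placeholder address, replace with a CVM IP from your cluster):
#   print(can_reach("192.168.1.10"))
```

If this returns False from the file server but True from a VM whose snapshots succeed, a firewall rule or NAT between that guest and the CVMs is worth a look before digging into VSS.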

Since this error message indicates “an internal error”, I would be looking at VSS and the Nutanix Guest Tools service on the user VM itself. A different error should be seen if Prism cannot reach the NGT service on the VM, and there is yet another error when NGT has not been enabled.

The KB article “Taking app-consistent (VSS) snapshots using NGT fails on Windows VMs” covers one scenario where the culprit is anti-virus software on the VM. The KB also gives some good general steps for exploring the issue with Event Viewer and the vssadmin command. These are often essential in identifying and resolving the issue. The important thing to look for in ‘vssadmin list writers’ and ‘vssadmin list providers’ is the last error state. If the last attempt was successful, we’ll see an indication of no error. If you just tried the backup and still see no error here, the problem is happening before VSS gets triggered.
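
To make the last-error check quicker to repeat after each test snapshot, something like the following could scan the text of ‘vssadmin list writers’ and report only the writers whose last error is anything other than “No error”. Treat it as a rough sketch: it assumes the English-locale field labels (“Writer name:” and “Last error:”), the helper name `writers_with_errors` is made up for this example, and the sample text is illustrative rather than output captured from a real system.

```python
import re
from typing import Dict


def writers_with_errors(vssadmin_output: str) -> Dict[str, str]:
    """Map writer name -> last error, for writers not reporting 'No error'."""
    failed: Dict[str, str] = {}
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        name = re.match(r"Writer name:\s*'(.+)'", line)
        if name:
            current = name.group(1)
            continue
        err = re.match(r"Last error:\s*(.+)", line)
        if err and current is not None:
            if err.group(1).strip().lower() != "no error":
                failed[current] = err.group(1).strip()
            current = None
    return failed


# Illustrative sample only, not captured from a real VM.
SAMPLE = """
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'Example Writer'
   State: [7] Failed
   Last error: Timed out
"""

print(writers_with_errors(SAMPLE))
```

On the VM itself, the input text could come from running the command in an elevated prompt and saving the output to a file, or from `subprocess.run(["vssadmin", "list", "writers"], capture_output=True, text=True).stdout`. An empty result right after a failed snapshot points to the problem happening before VSS is triggered, as described above.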

The article “Nutanix Guest Tools Troubleshooting Guide” provides further guidance on validating the Nutanix Guest Tools installation.