Cluster utilization reaching 95% space usage makes Windows Server crash | Nutanix Community

Storage usage reached 95% and some Windows VMs (2016, 2019) went down.

AHV hypervisor 20170830.395

After freeing up disk space, many VMs had to be repaired.

Questions:

  1. Will AHV shut down VMs when capacity reaches 95%?
  2. If a user reboots a VM while usage is at 95%, will that corrupt the VM's data disks?

Hi @dzeng 

AHV does not turn VMs off in this scenario. What happens is that new IO is no longer written and only reads are served, so the guest OS crashes. Will the disk become corrupt? That is unlikely, as Stargate (the storage service) goes into read-only mode, so the data already on disk is preserved. What is in memory cannot be written to disk, however, and that in turn can lead to application crashes.

Let me know if that helps with understanding.


Hi @Alona 

Thank you for your reply. We hit this issue this week; many VMs crashed and we spent 48 hours repairing them. As far as I know, when an ESXi cluster's storage usage reaches 95% it shuts down the VMs to protect the OS. Is there any chance AHV can shut down the VMs the way ESXi does? Or is there a better approach for this kind of situation?

Thank you in advance


Hi @dzeng 

AHV does not suspend VMs in this situation. Reaching 95% storage utilisation is not something that should happen if the environment is well planned, monitored and looked after.

I would assess the environment with a few questions in mind:

  • What is the workload? VDI, mission-critical DBs and applications, file servers? Each of those has a growth pattern. DBs can grow in size rapidly, for example, while VDIs can spike during certain times of the day and days of the week.
  • What is the “normal” level of storage utilisation? This is not a statistical average as it will not give you a meaningful number. When it’s not at 95%, where does the number sit most of the time?
    Is it below 50% and there is a sudden rapid growth of data (then find that source, find the reason for the growth, eliminate or manage the root cause, potentially isolate the source from the rest of the workloads and ensure there is sufficient room for it to expand if the growth is unavoidable)?
    Is it normally at 80%? Then it is not surprising that during the growth of data the storage utilisation reaches critical numbers and you are presented with a laborious and challenging task of emergency cleanup and recovery.
  • Is the storage sized correctly for the workload? The normal storage utilisation may be at 50%, but because the total available storage is not big enough, the normal and expected growth of the workload brings the system to its critical state. Would you benefit from storage expansion?
  • Look at the tasks running in the background. Are there too many snapshots? Would it be worth changing the retention and/or schedule of the snapshots to something more conservative?
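To make the "normal level" and growth questions above a bit more concrete, here is a rough Python sketch. It assumes you already export one cluster-wide utilisation percentage per day from whatever monitoring you use; the function names and the sample numbers are made up for illustration and are not part of any Nutanix tool. It takes the median as the "typical" level and does a simple straight-line projection to 95%, which will be too optimistic if growth comes in bursts.

```python
from statistics import median

CRITICAL_PCT = 95.0  # the threshold discussed in this thread


def typical_utilisation(samples_pct):
    """Median of the daily samples: where the number sits most of the time."""
    return median(samples_pct)


def days_until_critical(samples_pct, critical_pct=CRITICAL_PCT):
    """Straight-line estimate of days left before critical_pct is reached.

    Uses the average daily change between the first and last sample.
    Returns None if there are fewer than two samples or usage is not growing.
    """
    if len(samples_pct) < 2:
        return None
    growth_per_day = (samples_pct[-1] - samples_pct[0]) / (len(samples_pct) - 1)
    if growth_per_day <= 0:
        return None
    remaining = critical_pct - samples_pct[-1]
    # Already at or above the threshold counts as zero days left.
    return max(remaining, 0.0) / growth_per_day


# Hypothetical week of daily readings, in percent of usable capacity.
daily = [70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0]
print(typical_utilisation(daily))   # 73.0
print(days_until_critical(daily))   # 19.0
```

If the projected number of days is shorter than the time you need to clean up or add capacity, that is the signal to act before the cluster gets anywhere near 95%.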

Above are only a few points. You know your systems better. Prevention is the best policy, really.

