Solved

CVM Alert - threshold of 98 Percent Disk IO exceeded

  • 24 April 2020
  • 6 replies
  • 855 views

Badge +2

One of my CVMs is throwing this alert:

CVM threshold of 98 Percent Disk IO exceeded

I can’t find anything published on this alert anywhere. I would appreciate some assistance understanding it and how to mitigate it.

I ran a complete NCC check and it came back clean, aside from one warning.

 

 


Best answer by theGman 4 May 2020, 16:17

After an evaluation of performance counters, I’m chalking this up to a false alert created by our monitoring system. It happens, and it’s always a pain in the a** when it does because it creates wasted work cycles.

Thanks to those who provided some assistance in looking into this.

 


6 replies

Userlevel 4
Badge +4

Hey, hey! Are you using a third-party monitoring system which reports that alert? Is that where you see it? Or is it reported in Prism?

Badge +2

Hi Alona - yes, we are using a third-party monitoring system with a specific Nutanix integration. The company is Zenoss, and the integration is done with the Nutanix ZenPack. There isn’t an alert in Prism showing this. I have no idea to what extent Zenoss worked with Nutanix to produce this ZenPack, or how they arrive at the conclusion that the CVM exceeded 98% disk IO. I could understand it better if it were based on throughput, since disk IO itself is highly variable.

Userlevel 4
Badge +4

That’s what I thought, but I needed you to confirm first. :)

This is not a Nutanix alert, hence you do not see it coming up as part of the NCC health check report.

The thresholds are configurable from the Zenoss console (I am in no way an expert in Zenoss, but that might give you a head start in terms of where to look first).
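For context on why a threshold like this can trip without anything being wrong: a "percent disk IO" metric is typically derived from two samples of a cumulative busy-time counter (such as the "time spent doing I/Os (ms)" field in Linux `/proc/diskstats`). This is an illustrative sketch, not the ZenPack's actual code; the function name and sampling shape are assumptions:

```python
# Illustrative sketch (NOT the ZenPack's actual code): deriving a
# %util-style metric from two samples of a cumulative busy-time
# counter, e.g. the "time spent doing I/Os (ms)" /proc/diskstats field.

def disk_util_percent(busy_ms_prev, busy_ms_curr, interval_ms):
    """Percent of the sampling interval the disk spent busy."""
    delta = busy_ms_curr - busy_ms_prev
    # Clamp: counters can wrap, and sample clocks can drift slightly.
    return max(0.0, min(100.0, 100.0 * delta / interval_ms))

# A 10-second sample where the disk was busy for 9.8 of those seconds
# reads as 98% -- even if the burst lasted only that one interval.
print(disk_util_percent(0, 9_800, 10_000))       # 98.0
print(disk_util_percent(9_800, 9_900, 10_000))   # 1.0 on the next interval
```

The point being: a single short burst inside one polling interval can legitimately push this number to 98%+ without indicating a sustained problem, which is one reason it pairs poorly with a static threshold.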

Let me know if that helps, please.

Badge +2

Hi Alona - what would I do here: remove the alert, or change the threshold from 98% to 100%? I think the point is that the alert was triggered indicating 98% disk IO for the CVM, in a cluster that is for the most part not highly consolidated - although that doesn’t mean much, because it only takes a couple of heavy hitters to be disruptive in shared infrastructure.

I think what is more pertinent here is guidance from Nutanix that either A) high IO from the CVM is normal and not something to worry about, or B) this is an anomaly that requires investigation into operations within the cluster.

This Nutanix ZenPack, per Zenoss documentation, was supposedly developed in conjunction with Nutanix Engineering. So on one hand we have an alert being triggered by this integration, and on the other it’s not triggered in Nutanix Prism. Perhaps this disconnect is nothing more than a false alert? Very possible. If I had a dollar for every false alert from a monitoring system over the years, I’d retire. :)

 

Thanks

 

 

Userlevel 2
Badge +4

@theGman do you see a correlating spike or increase in latency at the same time? If not, I would either increase the threshold or disable the alert. If there is a latency increase, the alert seems worthwhile.
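That correlation check can be sketched in a few lines. This is a hypothetical illustration of the idea, not anything Zenoss or Prism actually runs; the thresholds and the sample shape are assumptions:

```python
# Hypothetical sketch of the advice above: only treat high disk
# utilization as actionable when it coincides with a latency increase.
# Thresholds and the (util, latency) sample shape are assumptions.

def should_alert(samples, util_threshold=98.0, latency_threshold_ms=10.0):
    """samples: list of (util_percent, avg_latency_ms) tuples.
    Alert only if some sample breaches both thresholds at once."""
    return any(util >= util_threshold and lat >= latency_threshold_ms
               for util, lat in samples)

# High utilization with low latency -> likely a benign burst, no alert.
print(should_alert([(99.0, 1.2), (98.5, 0.9)]))   # False
# High utilization with a latency spike -> worth investigating.
print(should_alert([(99.0, 12.5)]))               # True
```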

I would be looking at these stats in Prism, on the Hardware dashboard, host performance tab in the lower panel after selecting the host with the indicated CVM. 

As far as mitigating the situation (if warranted), I’d think it’s really a question of workload distribution and balancing between the physical nodes. CVM IO is generated primarily by local VMs. There are other IO-generating processes such as DR replication, storage clean-up, post-process compression and the like, but these would either be balanced or would be localized to the node where the VM resides.

If you wanted to spread the disk workload of a single large VM across multiple CVMs, this can be done using volume groups rather than standard VM disks.
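On AHV that is done with `acli`. The sketch below is a rough illustration only - the names (`big-vm-vg`, `big-vm`, `default-container`) are placeholders and the exact subcommand syntax may vary by AOS version, so check the acli reference for your release before running anything:

```shell
# Hypothetical example; names are placeholders, verify syntax for your AOS version.
acli vg.create big-vm-vg
acli vg.disk_create big-vm-vg container=default-container create_size=500G
acli vg.attach_to_vm big-vm-vg big-vm
```

With the disks presented through a volume group rather than as standard VM disks, their IO is no longer pinned to a single CVM.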

I hope these answers help with the situation in your environment. Feel free to prod for clarification if needed.

Badge +2


 
