Interpreting performance metrics

Hi everyone,

I’m using the API to pull VM performance stats, but I’m having trouble interpreting what I’m seeing.

For instance, I’m pulling “hypervisor.cpu_ready_time_ppm” for one of my VMs and getting the following output:

{
 "statsSpecificResponses": [
 {
 "successful": true,
 "message": null,
 "startTimeInUsecs": 1576458000000000,
 "intervalInSecs": 30,
 "metric": "hypervisor.cpu_ready_time_ppm",
 "values": [
 108,
 97,
 144,
 107,
 89,
 92,
 78,
 74,
 47,
 49,
 90,
.... Output truncated for brevity
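For anyone lining these samples up against a clock, here is a minimal Python sketch that pairs each value with a timestamp. It assumes the values are consecutive samples starting at startTimeInUsecs and spaced intervalInSecs apart, which the response suggests but which is not confirmed here. The dictionary only mirrors the fields visible in the truncated output, and the helper name timestamped_samples is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Sample shaped like the truncated response above; only the fields shown there are used.
response = {
    "statsSpecificResponses": [
        {
            "successful": True,
            "message": None,
            "startTimeInUsecs": 1576458000000000,
            "intervalInSecs": 30,
            "metric": "hypervisor.cpu_ready_time_ppm",
            "values": [108, 97, 144, 107, 89, 92, 78, 74, 47, 49, 90],
        }
    ]
}


def timestamped_samples(stats):
    """Pair each value with the UTC start time of its sampling interval.

    Assumption: values[i] covers the interval starting at
    startTimeInUsecs + i * intervalInSecs.
    """
    start = datetime.fromtimestamp(stats["startTimeInUsecs"] / 1_000_000, tz=timezone.utc)
    step = timedelta(seconds=stats["intervalInSecs"])
    return [(start + i * step, value) for i, value in enumerate(stats["values"])]


samples = timestamped_samples(response["statsSpecificResponses"][0])
for when, value in samples[:3]:
    print(when.isoformat(), value)
```

Under that assumption the first sample starts at 2019-12-16 01:00:00 UTC and each later one is 30 seconds after the previous.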

I get that the metric is a percentage but obviously you can’t have 144% of time, so how should I interpret these values?

What I really need is a guide and/or reference that explains all of these metrics and how to interpret them.

I found the following link but it doesn’t really tell me what I want to know:

https://portal.nutanix.com/#/page/docs/details?targetId=Prism-Central-Guide-Prism-v51:mul-alerts-user-created-metrics-r.html


@Detho Could you let me know the following:

1- Is this VM running on the AHV or ESXi hypervisor?

2- Does the VM you are reporting on show CPU usage close to 100% inside the OS?

I am still reviewing a few things before I come to a conclusion. I appreciate your response.

The platform is AHV.
No, but CPU usage does not necessarily correlate with CPU ready. That’s to say that a VM could have relatively low CPU usage but still have high CPU ready times.

I think it’s important to note that I already know how to troubleshoot performance, what I don’t know is the decimal place of the percentages given in the output. For example, in the output above what percentage does 144 come out to? 1.44% ? 14.4%? .144%?

@Detho Based on the Nutanix link you mentioned, the description of “cpu ready time” is:

Percentage of the time a virtual machine waits to use the physical CPU out of the total CPU time allotted to the VM.

and it is reported in percentage format.

Based on the above, it should be possible to have percentages higher than 100.

If your VM is scheduled to run for x units of time and it had to wait longer than that to get them (say 1.4x), then the percentage will be more than 100% (140% here).

I think the confusion may come from assuming that the time the VM is “running” (executing instructions on the CPU, not sitting idle) is the same as your collection interval (in this case 30 seconds).

Hope this helps explain it. Let me know your thoughts and questions about it (if any).

One more thing: you may already know that this measure alone is not the best way to dig into the performance of a VM, and you should consider other factors as well (just thought it was worth mentioning).

@Detho Sorry, I saw your response after I sent my last message. Let me know your thoughts about it.

This is very good information, thank you.

Is there a metric to return the “total CPU time allotted to the VM” for a period of time?

What I’m trying to accomplish is to create a report showing the amount of time (in seconds) the VM(s) had to wait for CPU time at the interval.

In other words, how can I translate 144% to seconds of time that the VM waited for CPU time?

@Detho I am not sure this is easily translatable to seconds of physical CPU time, because the amount of time the VM is scheduled to run on the CPU varies from interval to interval, depending on the number of instructions to run on the physical CPU on behalf of the VM.

Using this measure we have a ratio (for example x/y = 1.4) for that interval, but we cannot find the value of “y” (our interval) without other information about CPU operations, as y itself varies each time the VM is scheduled to run.

You basically want a counter on each physical CPU that counts the number of “time unit slots” the CPU spends running instructions for a specific VM each time it is scheduled. I am not saying it is impossible, but as far as I know these metrics won’t provide that. I am not sure whether this already exists or whether somebody who really needs it has to write it.

Let me know your thoughts.

Alright, fair enough.

So given the following data, how would you interpret the CPU Ready for this VM? In general terms would you say that it’s high? Normal?

The period is 24 hours and the average of all of the returned values (2759) is 111.

Minimum: 29

Maximum: 850

The number of returned values that are > 100: 1312 (47%)

I can provide the entire output to you, if you’d like.
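For reference, a summary like the one above can be reproduced with a short Python sketch. This is only an illustration: it runs on the 11 values visible in the truncated output from the first post, not on the full 24-hour data set, and the function name summarize and the default cutoff of 100 are choices made for this example.

```python
from statistics import mean


def summarize(values, threshold=100):
    """Return count, mean, min, max, and how many raw values exceed the threshold."""
    above = sum(1 for v in values if v > threshold)
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "above": above,
        "above_share": above / len(values),
    }


# The values shown in the first post (the real 24-hour series has 2759 entries).
sample = [108, 97, 144, 107, 89, 92, 78, 74, 47, 49, 90]
stats = summarize(sample)
print(stats)
```

The numbers are the raw API values; the sketch makes no claim about what unit they are in.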

@Detho Here are my thoughts:

First, I would want to know what this VM runs (OS and main applications).

Second, what does the CPU usage inside the OS look like for this 24-hour period? Is it close to or above 80% most of the time? Is there any application that is not behaving well inside this VM?

Third, how many other VMs are running on the same AHV node, and do the rest of them show the same behavior?

CPU ready by itself, as you know, may not indicate any performance issue for that VM.

The combination of high CPU usage inside the OS and high CPU ready time usually points to the VM having performance problems, because it needs CPU cycles but won’t get them all the time.

If the CPU usage inside the VM is low but we get these results, then it could be that the AHV node is overloaded, either because many VMs are running there or because at least one of the VMs is consuming a lot of CPU time. This can be the situation where you have Exchange or SQL Server-like VMs alongside other VMs on one node.

Let me know if any of the above matches your experience and, if not, what you see with regard to the VM performance. Are users complaining about accessing this VM and its data? Is it reachable over the network with low ping latency?

In a nutshell, let me know what you hope to find out about this VM or other VMs in the cluster, or whether there is a specific issue that you are trying to pinpoint using these metrics.

The VM in question runs Windows 2016 and is a webserver servicing about 10 concurrent connections at any time. The CPU usage on the VM averages about 40% utilization and peaks to 72%.

The problem we’re having is that the web application on this server will be intermittently and unacceptably slow. However I don’t want to focus on the web application in this conversation; there’s another group diagnosing from that angle and I’m focusing on the virtualization hardware that the webserver and its associated SQL server run on (ie the Nutanix platform.)

Most of my experience is with VMware, and on that platform you can run ESXTOP and with one tool get real-time data that gives you a sense of the overall health of the host, including CPU Ready %. Getting a similar experience with Nutanix is proving to be more difficult.

In VMware, using ESXTOP, if I saw that the VMs were routinely getting over 5% CPU Ready then I’d know that the host wasn’t healthy and required more diagnostics. That’s what I’m looking for in Nutanix.

Like I said I’m trying to build a report using the API that will pull the CPU ready but I will also be pulling a variety of other metrics like RAM consumption, CPU usage, etc of each VM to get a good cross section of overall performance of the VMs.

@Detho Using the combination of CPU ready and CPU usage per VM should give you a relatively good idea of whether the VM is potentially hitting a performance bottleneck. If so, you can always use the Prism “Analysis” page to get more information about the VM, and if the issue is storage you should be able to use CVM port 2009 to check for extreme disk activity at a specific time. The latter is something you probably need to open a support ticket for, as reading these pages may not be very straightforward.

CPU ready alone won’t tell you much about a specific VM: the VM may not really be busy, and because the hypervisor is heavily loaded it just has a hard time getting a scheduled slot to run its instructions, even though internally it may not need that much scheduling time because not much is happening on it.

Adding RAM consumption to both of the above could make any bottleneck clearer. In AHV we do not have memory overallocation; if we don’t have memory to allocate, we don’t power on the VM.

As far as hardware health goes, you can always use the IPMI page to check the BIOS and firmware versions alongside the “event log” and component-level health. More importantly, the hypervisor activities and information all appear in the Prism “Hardware” page. There you can see the hypervisor-level activities.

Unfortunately I don’t know of any AHV command matching esxtop, but as I mentioned, the Prism VM page should show good information about VM memory, CPU, and storage activity.

Many of the reports you may need for cluster, node, and VM status may already be available in the Prism pages, in case you have not checked them yet.

Let me know if I missed anything or if you have further concerns about what I mentioned, and sorry if I have repeated things that you already know.

Regards,

-Said

OK, we’ve gotten off track from what I’m trying to understand, so let me rephrase my question:

Suppose you managed a Nutanix AHV deployment with 100 VMs and you ran a script that pulled “hypervisor.cpu_ready_time_ppm” for a 1-hour period during peak usage (say 3 PM), with a 30-second interval, for all 100 VMs, and saved the results to a CSV file.

Now you’re looking at the CSV file in Excel; using the data that you have in front of you, how would you determine that the CPU Ready % is high for your VMs and warrants more investigation? What kind of values would you need to see to make that determination?

@Detho

Given:

Suppose you managed a Nutanix AHV deployment with 100 VMs and you ran a script that pulled “hypervisor.cpu_ready_time_ppm” for a 1 hour period during peak usage (say 3PM), with a 30 second interval, for all 100 VMs, and saved the results to a CSV file.

the value of “hypervisor.cpu_ready_time_ppm” alone for each VM won’t tell me much.

If “hypervisor.cpu_ready_time_ppm” is high (more than 5% all the time) for ALL VMs on the hypervisor, I would conclude that either

the hypervisor is overloaded (too many VMs running on it), or

one or more VMs are using more CPU time than the others, making it hard for the rest to get their share of scheduled time. You will need to identify these VMs and see if

Add the CPU usage value per VM to the above, and now I can tell you which VM could be experiencing performance issues. But I won’t stop here; I would make sure the VM is really experiencing issues by trying to access it and seeing how things work internally.
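As a rough illustration of the host-wide check described above, here is a minimal Python sketch. It rests on assumptions that are not confirmed by the documentation: that the _ppm suffix means parts per million (so 50,000 ppm equals 5%), and that "all the time" means every sample in the window. The VM names, the sample numbers, and the function names are all hypothetical.

```python
READY_THRESHOLD_PCT = 5.0  # the "more than 5%" rule of thumb mentioned above


def ppm_to_percent(ppm):
    """Assumption: the API value is parts per million, so ppm / 10,000 is a percentage."""
    return ppm / 10_000


def vms_over_threshold(samples_by_vm, threshold_pct=READY_THRESHOLD_PCT, min_share=1.0):
    """Return VMs whose ready time exceeds the threshold in at least min_share of samples."""
    flagged = []
    for vm, samples in samples_by_vm.items():
        if not samples:
            continue  # no data for this VM, so nothing to judge
        over = sum(1 for ppm in samples if ppm_to_percent(ppm) > threshold_pct)
        if over / len(samples) >= min_share:
            flagged.append(vm)
    return flagged


def host_looks_overloaded(samples_by_vm, threshold_pct=READY_THRESHOLD_PCT):
    """True only when every VM that has data is over the threshold in every sample."""
    with_data = [vm for vm, samples in samples_by_vm.items() if samples]
    flagged = vms_over_threshold(samples_by_vm, threshold_pct)
    return bool(with_data) and len(flagged) == len(with_data)


# Hypothetical per-VM samples (raw ppm values) for one host.
busy_host = {"web01": [60000, 72000, 55000], "sql01": [51000, 80000, 64000]}
mixed_host = {**busy_host, "idle01": [108, 97, 144]}

print(host_looks_overloaded(busy_host))
print(host_looks_overloaded(mixed_host))
print(vms_over_threshold(mixed_host))
```

Lowering min_share (for example to 0.5) would flag VMs that are over the threshold in only part of the window, which may be closer to what you want in a report.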

Last but not least, I would not rely only on the report you mentioned to determine the health of the hypervisor and the VMs running on it.

I would basically start from the other side: which VM is experiencing performance issues, and does the hypervisor being overloaded have anything to do with it? From there, I would check the Prism graphs and information for the VM and hypervisor in question for CPU usage, RAM assigned, and storage used, and only then move on to see whether CPU ready time and CPU usage can provide further insight into the issue. I would also check the number of vCPUs assigned to the problem VM; occasionally reducing this number can actually help it get more scheduled run time and improve the situation.

Hope this clarifies how I look into this when troubleshooting performance issues on a Nutanix cluster.

Hi,

I’ve just gone through the conversation for the second time.

Did the author actually get an answer to his question?

Maybe I misunderstood, but it seems the original question boils down to:
What does “ppm” stand for?

For example: hypervisor_cpu_usage_ppm = 5778

Based on the Nutanix documentation (link provided by the author), PPM represents a percentage. That would give us 5778%.

However, PPM usually stands for Parts Per Million, so 5778 PPM would be 0.5778%.

Based on observation (comparing API values to PE), I believe the latter is correct.

PPM = Parts Per Million; PPM/10000 = %
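The conversion is simple enough to sketch in a few lines of Python. This assumes the parts-per-million reading is the right one, which is based on observation rather than on the documentation; the function names are made up for this example.

```python
def ppm_to_percent(ppm):
    """Convert parts per million to a percentage: 1% is 10,000 ppm."""
    return ppm / 10_000


def percent_to_ppm(percent):
    """Inverse conversion, handy for expressing a threshold in raw API units."""
    return percent * 10_000


for raw in (5778, 144, 850):
    print(f"{raw} ppm = {ppm_to_percent(raw):.4f}%")

# A 5% CPU ready threshold expressed in the API's raw units.
print(percent_to_ppm(5.0))
```

With this reading, 850 (the maximum reported earlier in the thread) is 0.085%, and a 5% threshold corresponds to a raw value of 50,000.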

Back to the original question...

hypervisor.cpu_ready_time_ppm = 144 = 0.0144% = a very low number for CPU ready time %.

Am I right?

@PeWu Good observation. If the numbers are parts per million (which makes sense, but probably needs to be verified to be 100% sure), then a number like 144 is 0.0144% (as you mentioned), which means the instructions barely waited in the queue to be executed. The focus of my response was the fact that the percentage can be above 100%, which it can because this is a ratio expressed in ppm (as you pointed out).
