Using Prism Pro to Understand the Effects of Spectre and Meltdown Patches

  • 23 February 2018
  • 1 reply
This post was authored by Brian Suhr, Sr. Technical Marketing Engineer

Unless you just got back from a vacation to Mars, you've probably not been able to avoid the noise and confusion around the security vulnerabilities recently discovered in most modern CPU architectures. These vulnerabilities, Spectre and Meltdown, have three variants that will require software patches and microcode updates to close. To learn more, is a good starting point for your research.

Why are these patches different?

Because the Spectre and Meltdown vulnerabilities are both related to CPU architectures, closing them involves adjusting the way operating systems and applications have used the speculative execution function of CPUs to date. Guest operating system vendors are widely advising their customers that applying these patches is likely to have a negative impact on performance, and several vendors have pulled back their initial patches due to instability.

It's too early to understand completely what the final performance implications of Spectre and Meltdown may be, and I definitely want to caution everyone not to rush into applying patches without thoroughly vetting them. Because patching involves the hypervisor, guest OS, and microcode layers, it is ideal to apply updates in phases, at least during the vetting process, so that you can understand the effect of each. This phased approach also builds in the time needed to get a sense of the stability of each patch, both individually and in combination, before rolling it out to your greater environment.

How Prism Pro can help

For the remainder of this blog post, we'll look at how Prism Pro simplifies the process of understanding any deltas in CPU consumption between the state of your environment today and its state after patching.

If you know when to look and what kind of change to expect, you can see an event's effect in a normal performance chart. In the example below, the top chart shows the CPU usage for a selected VM over a three-hour period. The arrow marks a point where the CPU usage for this VM increases by a significant amount. The fact that usage increased is not hard to understand, but the chart doesn't indicate whether this variance is normal or a result of a change within the environment.

With the release of AOS 5.5, Nutanix added X-Fit technology to Prism Pro, so the management pane can now learn how your VMs and hosts normally behave. When the behavioral learning system detects that a metric it is monitoring has strayed outside of normal boundaries, it notes an anomaly and creates a warning.

The following example looks at the same VM scenario shown above, but using the new, intelligent charts. In the chart below, the solid blue line represents the actual CPU usage, while the shaded blue band represents the learned baseline range for this VM's normal CPU usage. The zoomed-in portion of the chart (in the red circle) again illustrates the spike in CPU consumption. However, in this new context, we can see that the value has increased above our learned baseline. The anomaly is thus clearly not normal for this VM, so it has triggered an alert.
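To make the idea of a learned baseline band concrete, here is a minimal Python sketch. It is not the X-Fit algorithm, whose internals this post does not describe. It assumes a simple band of the mean plus or minus three standard deviations over historical samples, and the function names and CPU values are hypothetical.

```python
from statistics import mean, stdev

def learn_band(history, k=3.0):
    """Derive a (low, high) band from historical CPU usage samples (percent).

    Assumption: the band is mean +/- k standard deviations, clamped to 0-100.
    """
    mu = mean(history)
    sigma = stdev(history)
    return max(0.0, mu - k * sigma), min(100.0, mu + k * sigma)

def find_anomalies(samples, band):
    """Return (index, value) pairs for samples that fall outside the band."""
    low, high = band
    return [(i, v) for i, v in enumerate(samples) if v < low or v > high]

# Hypothetical CPU usage (percent) observed while the VM behaved normally.
history = [20, 22, 19, 21, 23, 20, 18, 22, 21, 20]
band = learn_band(history)

# Hypothetical recent samples: usage jumps partway through.
recent = [21, 22, 35, 36, 37]
anomalies = find_anomalies(recent, band)

for index, value in anomalies:
    print(f"sample {index}: {value}% is outside the learned band "
          f"({band[0]:.1f}% to {band[1]:.1f}%)")
```

The first two recent samples sit inside the band and are ignored; the last three exceed it and would be the ones flagged.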

The learning behavior algorithms built into Nutanix X-Fit continuously watch our example VM and note that the increase in CPU continues. Over time, Prism learns the changed CPU usage as the new normal behavior for this VM and begins to adjust the baseline accordingly, as shown in the example below. Because of the amount of the increase, the revised baseline initially spans a much larger range; this range automatically adjusts as Prism continues to monitor the VM.
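One simple way to picture why the revised baseline first widens and then tightens is a band computed over a rolling window. The sketch below is an illustration under that assumption only; the post does not say how Prism relearns a baseline, and the window size, multiplier, and sample values are hypothetical.

```python
from statistics import mean, stdev

def rolling_band(samples, window, k=3.0):
    """Band of mean +/- k standard deviations over the last `window` samples."""
    recent = samples[-window:]
    mu, sigma = mean(recent), stdev(recent)
    return mu - k * sigma, mu + k * sigma

def width(band):
    """Size of the band, in percentage points of CPU usage."""
    return band[1] - band[0]

# Hypothetical CPU usage (percent) before and after a sustained increase.
before = [20, 21, 19, 20, 22, 20, 21, 19]
after = [35, 36, 34, 35, 37, 35, 36, 34]

stable_old = rolling_band(before, window=8)              # only old behavior
transition = rolling_band(before + after[:4], window=8)  # mix of old and new
stable_new = rolling_band(before + after, window=8)      # only new behavior

for label, band in [("old", stable_old), ("transition", transition), ("new", stable_new)]:
    print(f"{label:>10}: {band[0]:6.1f} to {band[1]:6.1f} (width {width(band):.1f})")
```

While the window holds both old and new samples, the spread is large and the band is wide. Once only the new behavior remains in the window, the band narrows again around the higher level.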

So far, our examples have focused on the behaviors of a single VM. While looking at the charts for a single VM can be very helpful, to address the Spectre and Meltdown vulnerabilities, most organizations are going to need to patch hundreds to thousands of VMs. Given the scope of deploying these fixes and monitoring their impact, the focus will eventually need to be at a higher level. In this scenario, it is less time-consuming to look at the environment from the host level, while still allowing you to trace the results of any changes.

To test our ability to see the effect of a patch at the host level, I used a group of VMs to simulate the CPU increase that you would expect to observe in your environment. The example below shows a host-level CPU usage chart with a learned baseline and a usage spike toward the end. This spike in activity created an anomaly at the host level just as it did in our VM example above.

Each of these anomaly events creates an alert within Prism. If you have configured email notifications, Prism also sends an email alert. This warning system is helpful by itself, but Prism also gives you an easy way to understand the anomalies that exist and with which entities they're associated.

The Impacted Cluster widget on the Prism dashboard provides high-level details on the health of the cluster. As you can see in the screenshot below, one of the data points is the number of anomalies in the last 24 hours.

Click the anomaly count in the widget to see a list of the entities with anomalies within the given time period and which metric had the anomaly, as shown in the example below. Clicking each anomaly takes you to the charts we saw earlier in the VM and host examples.

Planning for capacity changes

Organizations need to understand and plan for how the potentially nontrivial increase in CPU usage expected from guest OS patching could affect different clusters. The capacity planning features in Prism Pro allow you to create scenarios to model the effects of adding this kind of demand to a cluster.

For this example, we will create a new planning scenario for our test cluster. The following chart shows the current runway details for the environment's current state, with the existing host resources listed below.

Next, we add a workload to the scenario to account for any expected change in resource usage. The capacity planning function allows us to model many different types of business applications for the scenario; for this example I am using the Change in Demand option to simulate a 30% increase in demand.

After applying the additional workload demand to our scenario, the capacity runway chart updates with the new calculations. It now shows that our available CPU runway has dropped below our goal of six months. We have a couple of different options for getting enough resources to boost our CPU runway back into our goal range.
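The arithmetic behind a runway drop like this can be sketched with a simple linear model. This is not how Prism computes runway, which the post does not detail. It assumes demand grows by a fixed amount each month and that the patch adds a one-time 30% to current usage. All figures and names are hypothetical.

```python
def runway_months(capacity_ghz, usage_ghz, growth_ghz_per_month):
    """Months until usage reaches capacity, assuming linear growth."""
    headroom = capacity_ghz - usage_ghz
    if headroom <= 0:
        return 0.0
    if growth_ghz_per_month <= 0:
        return float("inf")
    return headroom / growth_ghz_per_month

# Hypothetical cluster figures.
capacity = 200.0   # total CPU capacity in GHz
usage = 140.0      # current CPU demand in GHz
growth = 5.0       # organic growth in GHz per month
goal_months = 6    # runway goal used in this post

baseline = runway_months(capacity, usage, growth)

# Model the patch as a one-time 30% increase in current demand.
patched_usage = usage * 1.30
patched = runway_months(capacity, patched_usage, growth)

print(f"runway before: {baseline:.1f} months")
print(f"runway after:  {patched:.1f} months (goal: {goal_months})")
```

With these made-up numbers the cluster starts comfortably above the six-month goal and falls below it once the extra demand is applied, which mirrors the situation in the chart.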

The first option is to expand the cluster by adding resources. Because we're basing this example on CPU usage, we would need to add another node to the cluster. Adding nodes to the cluster is easy; simply click Recommend below the chart to have Prism automatically and precisely calculate how many nodes you need, which model they should be, and their ideal configuration to meet your runway goals based on existing and newly modeled workload demands. You also always have the option to input the count and configuration of the additional nodes manually.
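As a rough illustration of the sizing question, the sketch below estimates how many identical nodes would restore a runway goal. It is far simpler than the Recommend feature, which also selects a model and configuration; it considers CPU only, assumes linear growth, and uses hypothetical numbers and names throughout.

```python
import math

def nodes_to_add(capacity, usage, growth_per_month, goal_months, node_capacity):
    """Number of identical nodes needed so capacity covers usage at the goal date.

    Assumptions: linear demand growth, CPU is the only constraint, and every
    added node contributes the same `node_capacity`.
    """
    required = usage + growth_per_month * goal_months
    shortfall = required - capacity
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / node_capacity)

# Hypothetical cluster after modeling the extra demand (all values in GHz).
extra_nodes = nodes_to_add(
    capacity=200.0,
    usage=182.0,
    growth_per_month=5.0,
    goal_months=6,
    node_capacity=50.0,
)
print(f"nodes to add: {extra_nodes}")
```

Here the cluster needs 212 GHz to last six months and has 200 GHz, so a single 50 GHz node covers the shortfall.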

In the runway chart screenshot below, the Resources table shows the existing nodes along with the node newly recommended to meet the new demands.

Reclaiming wasted resources

If adding nodes is more than you need to meet the demand that you modeled or is otherwise just not an ideal solution for your particular circumstances, you can look for wasted resources to reclaim. Prism Pro provides easy-to-understand data to help you identify resources that you could optimize or reclaim. The next screenshot shows the Optimize Resources widget that you can view at either a global or per-cluster level.

For this example, I've focused on the overprovisioned and inactive classifications for VMs because both of these categories tend to contain resources that organizations can reclaim fairly easily to meet increased demands.
  • The overprovisioned classification uses the learned VM behavior functions described earlier to identify VMs that are regularly using limited amounts of their assigned resources.
  • The inactive classification identifies VMs that have been powered off for at least 30 days. This category also uses learned VM behavior to identify VMs that are powered on but regularly generate minimal amounts of CPU and disk activity.
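The two classifications above can be expressed as simple rules. The sketch below is only an approximation: the 30-day powered-off rule comes from this post, while the usage thresholds, field names, and sample VMs are hypothetical, and fixed thresholds stand in for the learned behavior that Prism actually uses.

```python
from dataclasses import dataclass

@dataclass
class VmStats:
    name: str
    powered_on: bool
    days_powered_off: int
    avg_cpu_pct: float  # average CPU usage as a percent of assigned vCPUs
    avg_iops: float     # average disk I/O operations per second

def classify(vm, low_use_pct=20.0, idle_cpu_pct=1.0, idle_iops=5.0, off_days=30):
    """Classify a VM as 'inactive', 'overprovisioned', or 'normal'.

    Only off_days=30 is taken from the post; the other thresholds are
    placeholder assumptions.
    """
    if not vm.powered_on:
        return "inactive" if vm.days_powered_off >= off_days else "normal"
    if vm.avg_cpu_pct <= idle_cpu_pct and vm.avg_iops <= idle_iops:
        return "inactive"  # powered on but generating minimal CPU and disk activity
    if vm.avg_cpu_pct <= low_use_pct:
        return "overprovisioned"  # regularly uses little of what it was assigned
    return "normal"

# Hypothetical VMs.
vms = [
    VmStats("db-01", True, 0, 65.0, 900.0),
    VmStats("web-02", True, 0, 8.0, 40.0),
    VmStats("old-test", False, 45, 0.0, 0.0),
    VmStats("idle-03", True, 0, 0.5, 2.0),
    VmStats("maint-04", False, 3, 0.0, 0.0),
]
results = {vm.name: classify(vm) for vm in vms}

for name, label in results.items():
    print(f"{name}: {label}")
```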

Clicking any of the classifications opens Prism's Explore view with a list of the VMs identified in that category, as shown in the screenshot below for overprovisioned VMs. From this view, it's easy to identify exactly which VMs are overprovisioned and by how much, so you can plan out the discussion you may need to have with the application team to reclaim these resources. When you're ready to apply the resource assignment changes to your environment, simply select the VM and adjust the resources to the desired levels.

Taking inventory of your environment

Now that we've looked at monitoring and capacity planning, the last thing we need to discuss is how to take inventory of your environment. Understanding what pieces exist in your environment and what code they run is important for preparing and tracking your update phases.

Using Prism's Life Cycle Manager (LCM), it's easy to report which BIOS version each node within a cluster is running. This detail can be beneficial both for understanding which nodes need to be updated and for confirming which updates are complete.

In the following screenshot, I'm using the Explore function in Prism Central to list all of the hosts in the environment with the hypervisor and version each is running. Because Prism was built for the cloud era, it supports multi-hypervisor management natively and allows for a single view of all of these details.
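If you export a host list like this one, tracking update phases comes down to grouping hosts by hypervisor and version and listing the ones not yet at a target. The sketch below works on a hard-coded list rather than any Prism API; the host names, version labels, and target versions are placeholders.

```python
from collections import defaultdict

# Hypothetical inventory rows, e.g. copied from an exported host list.
hosts = [
    {"name": "host-1", "hypervisor": "AHV", "version": "build-A"},
    {"name": "host-2", "hypervisor": "AHV", "version": "build-B"},
    {"name": "host-3", "hypervisor": "ESXi", "version": "build-X"},
    {"name": "host-4", "hypervisor": "AHV", "version": "build-B"},
]

def group_by_version(hosts):
    """Map (hypervisor, version) to the list of host names running it."""
    groups = defaultdict(list)
    for host in hosts:
        groups[(host["hypervisor"], host["version"])].append(host["name"])
    return dict(groups)

def needs_update(hosts, targets):
    """Return sorted names of hosts whose version differs from the target."""
    return sorted(
        host["name"]
        for host in hosts
        if targets.get(host["hypervisor"]) != host["version"]
    )

# Placeholder target versions for this patch phase.
targets = {"AHV": "build-B", "ESXi": "build-Y"}

groups = group_by_version(hosts)
pending = needs_update(hosts, targets)

for (hypervisor, version), names in groups.items():
    print(f"{hypervisor} {version}: {', '.join(names)}")
print("still to update:", ", ".join(pending))
```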

What about Prism reporting?

With the release of AOS 5.5, Prism Pro offers the ability to generate reports that can be run either on an ad-hoc basis or on a schedule. Prism Pro comes with a few default reports already included, and administrators can create new reports easily, either by cloning an existing report or by using the visual report designer to select report components from the available list of widgets.

The first report example below is the security patch watch list. This report presents the capacity runway chart for cluster CPU to show how any changes may affect the amount of capacity remaining. Next is a line chart that shows the CPU consumption for each host over the specified period of time, highlighting any increases within the reported timeframe. The report also includes an individual CPU usage chart for every VM in the environment to show any impact on a per-VM basis.

Here is a sample of the security patch watch list report.

I've created a second sample report to provide a list of all of the nodes within the environment with the hypervisor and version each node is running. We saw these details in the Prism inventory example earlier; having them in a report gives you the option to have them automatically delivered via email on a schedule and to share them with a larger team in PDF format. You can easily combine multiple report types into a single larger report if desired.

Here is a sample of the environment summary report.

I really appreciate this opportunity to discuss how Prism Pro could be useful in identifying the effects of the Spectre and Meltdown vulnerabilities. If you would like to discuss further or get additional details, please leave a comment or reach out to your account teams.

Disclaimer: This blog may contain links to external websites that are not part of Nutanix. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such site.

©️ 2018 Nutanix, Inc. All rights reserved. Nutanix, Prism, and the Nutanix logo are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).

1 reply

In my Cluster Efficiency Summary, I'm assuming if a VM is overprovisioned and it indicates the "CPU Gain" is 1 vCPU and the "Memory Gain" is 2 GiB, that is how much I should reduce the CPU and Memory by? It appears obvious, but I just wanted to double-check. Thanks!