Nutanix Connect Blog



The Cloud OS Awakens : A Deeper Dive


This post was authored by Maryam Sanglaji, Product Marketing Principal Nutanix

In our previous blog, “The Cloud OS Awakens : A New Hope,” we covered why machine learning (ML) and artificial intelligence (AI) are powering the Nutanix Enterprise Cloud operating system. Let’s dive deeper. As Yoda would say, “Stay for some soup you must.”

Anomaly Detection to maintain cluster health

Every scalable architecture should be designed for failure, with the goal of no application downtime and zero risk to the business and its revenue. This is necessary but not sufficient to provide a highly available system with five or more nines of availability. In addition to handling failed nodes, a truly scale-out application must also handle partial faults, as outlined in the Limplock and Fail-Stutter Fault Tolerance papers.

Partial faults can result in a node that is not dead but is so unhealthy that it slows down the entire cluster. Machine learning algorithms can significantly improve how these behaviors are learnt so that appropriate actions are taken in time to maintain cluster health. Nutanix clusters already handle this and are able to maintain high availability. For this particular problem, Nutanix leverages a clustering algorithm called DBSCAN along with Nutanix's distributed set of degraded node monitors.

Figure 1: Performance drops due to a partial fault and recovers once partial-failure detection is enabled.

In this process, every node calculates a score for each of its peers based on their performance. The DBSCAN algorithm then runs on this data and detects the outliers, that is, the nodes whose scores indicate degradation. Once degraded nodes are flagged, an alert is generated, and leadership roles and critical services are no longer hosted on those nodes. When a node is operationally ready again, it can be added back to the cluster. As a result, this feature protects cluster health and high availability.
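To make the outlier step concrete, here is a minimal sketch of density-based outlier detection over peer scores. It is not Nutanix's implementation: the score scale (0 to 1, higher is healthier), the averaging of peer scores, the parameters `eps` and `min_pts`, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

NOISE = -1  # label DBSCAN gives to points that belong to no dense cluster


def dbscan_1d(values: List[float], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN on 1-D values. Returns one cluster label per value."""
    n = len(values)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        # Neighborhood includes the point itself, as in standard DBSCAN.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


def degraded_nodes(peer_scores: Dict[str, List[float]],
                   eps: float = 0.1, min_pts: int = 3) -> List[str]:
    """Average the scores each node received from its peers (an assumed
    aggregation) and flag the nodes DBSCAN labels as noise."""
    nodes = sorted(peer_scores)
    averages = [sum(peer_scores[n]) / len(peer_scores[n]) for n in nodes]
    labels = dbscan_1d(averages, eps, min_pts)
    return [n for n, label in zip(nodes, labels) if label == NOISE]


if __name__ == "__main__":
    scores = {
        "node-a": [0.95, 0.97, 0.96],
        "node-b": [0.94, 0.96, 0.95],
        "node-c": [0.96, 0.95, 0.97],
        "node-d": [0.40, 0.35, 0.42],  # slow but not dead
        "node-e": [0.93, 0.95, 0.94],
    }
    print(degraded_nodes(scores))
```

One property of this approach worth noting: if several nodes degrade in the same way, they can form a dense cluster of their own rather than appear as noise, so a real system would also need to reason about which cluster is the healthy one.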

Optimization & Proactive Placement

To guarantee performance to all VMs, the cluster has to use its resources intelligently at all times. However, with constant changes in the environment, the number of VMs, and the types of workloads, it is very difficult to maintain consistent performance within a cluster, and VM placement problems are common in datacenters. When placement is poor, the applications running on the cluster experience unpredictable performance.

Resource contention is another problem. To achieve better VM density, resources are often overcommitted, which can cause contention on the nodes during peak traffic times. Lower and unpredictable performance directly affects the business. Here is where the Nutanix hypervisor, AHV, uses the Acropolis Dynamic Scheduler (ADS). ADS leverages Constraint Satisfaction Problem (CSP) solvers to improve VM placement and scheduling. CSP solvers are widely used in artificial intelligence (AI).

Figure 2: ADS Engine

These solvers optimally place VMs while honoring a list of constraints such as HA reservations, affinity and anti-affinity rules, storage performance, etc. This feature is always on, responds to hotspots proactively, and enables full resource utilization with no compromise. Most importantly, it is fault-tolerant, robust to failures, and maintains high availability during VM placement.
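As a rough illustration of placement as a constraint satisfaction problem, the sketch below uses simple backtracking search with two constraints: host capacity and anti-affinity. It does not reflect how ADS is built. The single-number resource model, the function name `place_vms`, and the VM and host names are hypothetical, and the search returns the first feasible placement rather than an optimal one.

```python
from typing import Dict, List, Optional, Set


def place_vms(vm_demand: Dict[str, int],
              host_capacity: Dict[str, int],
              anti_affinity: List[Set[str]]) -> Optional[Dict[str, str]]:
    """Assign every VM to a host so that no host is over capacity and no two
    VMs from the same anti-affinity group share a host.
    Returns {vm: host}, or None when no feasible placement exists."""
    # Place the largest VMs first; ties are broken by name for determinism.
    vms = sorted(vm_demand, key=lambda v: (-vm_demand[v], v))
    hosts = sorted(host_capacity)
    free = dict(host_capacity)
    placement: Dict[str, str] = {}

    def violates_anti_affinity(vm: str, host: str) -> bool:
        for group in anti_affinity:
            if vm in group:
                for other in group:
                    if other != vm and placement.get(other) == host:
                        return True
        return False

    def solve(i: int) -> bool:
        if i == len(vms):
            return True  # every VM has been placed
        vm = vms[i]
        for host in hosts:
            if free[host] < vm_demand[vm] or violates_anti_affinity(vm, host):
                continue
            placement[vm] = host
            free[host] -= vm_demand[vm]
            if solve(i + 1):
                return True
            # Undo the tentative assignment and try the next host.
            del placement[vm]
            free[host] += vm_demand[vm]
        return False

    return dict(placement) if solve(0) else None


if __name__ == "__main__":
    demand = {"db1": 8, "db2": 8, "web": 4}   # e.g. GiB of memory
    capacity = {"h1": 12, "h2": 12}
    rules = [{"db1", "db2"}]                  # keep the two databases apart
    print(place_vms(demand, capacity, rules))
```

Additional constraints such as HA reservations could be modeled the same way, for example by subtracting the reserved amount from each host's capacity before the search starts.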

VM Behavior Learning

In a cluster with multiple VMs, many different applications are running. They all consume resources and can display different resource consumption characteristics. Some VMs are highly active during the day but idle at other times. To use the available resources efficiently, you need to understand these behavioral patterns. Manually tracking and learning these behaviors in a large, constantly changing environment is a very cumbersome job. Who really wants to spend their time doing that?

There are bigger business problems that need to be addressed. In a big deployment, VMs get created and may at times be forgotten. I am sure you can think of a scenario in which a VM was created for a user and never utilized. Here is where Nutanix’s X-FIT engine comes to the rescue. Within Prism Central, the X-FIT engine uses time series analysis algorithms to identify patterns.

Figure 3: Prism Central

These algorithms can model the trend as well as the seasonal component of a time series. For example, the X-FIT engine can model weekly or daily seasonal patterns. This learnt behavior helps the system detect anomalies and generate smart alerts. Once the behavior is learnt, the behavior band, margin, and alert zone are created. These categories and zones are displayed in Prism Central. This way, all misbehaving VMs (e.g. constrained VMs, bully VMs, zombie VMs, and over-provisioned VMs) can be categorized and made visible so that the appropriate actions can be taken.
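The idea of a learnt behavior band can be sketched with a very simple seasonal model. This is only an illustration, not the X-FIT algorithm: it assumes a fixed seasonal period, builds one band per position in the cycle as the mean plus or minus `k` standard deviations, and flags new readings that fall outside the band. The function names and the choice of `k` are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def learn_bands(history: List[float], period: int,
                k: float = 3.0) -> List[Tuple[float, float]]:
    """Learn one (low, high) band for each position in the seasonal cycle,
    e.g. period=24 for hourly samples with a daily pattern."""
    bands = []
    for phase in range(period):
        samples = history[phase::period]  # all readings at this phase
        mu = mean(samples)
        sigma = pstdev(samples)
        bands.append((mu - k * sigma, mu + k * sigma))
    return bands


def find_anomalies(new_values: List[float],
                   bands: List[Tuple[float, float]],
                   start_phase: int = 0) -> List[int]:
    """Return the indices of new readings that fall outside their band."""
    period = len(bands)
    anomalies = []
    for i, value in enumerate(new_values):
        low, high = bands[(start_phase + i) % period]
        if not low <= value <= high:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Three "days" of CPU usage (%) sampled four times a day.
    history = [10, 50, 80, 20,
               12, 52, 78, 22,
               11, 48, 82, 18]
    bands = learn_bands(history, period=4)
    today = [11, 51, 30, 21]  # the third reading is unusually low
    print(find_anomalies(today, bands))
```

A VM whose readings stay near zero across every phase would be a candidate for the "zombie" category, while one that sits persistently above its band might be constrained. Those classification rules are not described in this post, so they are left out of the sketch.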

Smart Planning & What if Analysis

Guesswork & spreadsheets! How many different management consoles need to be monitored before making a critical, expensive decision for your datacenter? How can you avoid over-provisioning and its costs? The X-FIT engine not only helps with VM behavioral analysis but also provides accurate forecasting. Many of our customers love the one-click upgrades and one-click operational insights within Prism Central, and because of X-FIT, they also enjoy the one-click planning option. The X-FIT engine comprises a set of algorithms such as ARIMA, Theta, Neural, etc. It runs a tournament to choose the algorithms that best describe the data. Once the tournament winners are chosen, their forecasts are combined.
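The tournament-and-combine idea can be sketched as follows. This is an assumption-laden illustration, not X-FIT itself: three deliberately simple forecasters (naive, mean, and drift) stand in for ARIMA, Theta, and the neural model; the ranking metric (mean absolute error on a holdout window) and the equal-weight averaging of the winners are choices made for the example.

```python
from typing import Callable, Dict, List, Tuple

Forecaster = Callable[[List[float], int], List[float]]


def naive(series: List[float], horizon: int) -> List[float]:
    """Repeat the last observed value."""
    return [series[-1]] * horizon


def mean_forecast(series: List[float], horizon: int) -> List[float]:
    """Repeat the historical average."""
    avg = sum(series) / len(series)
    return [avg] * horizon


def drift(series: List[float], horizon: int) -> List[float]:
    """Extend the straight line through the first and last observations."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * (h + 1) for h in range(horizon)]


MODELS: Dict[str, Forecaster] = {
    "naive": naive, "mean": mean_forecast, "drift": drift,
}


def mae(actual: List[float], predicted: List[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def tournament_forecast(series: List[float], horizon: int, holdout: int,
                        keep: int = 2) -> Tuple[List[str], List[float]]:
    """Score each model on the last `holdout` points, keep the best `keep`
    models, refit them on the full series, and average their forecasts."""
    train, test = series[:-holdout], series[-holdout:]
    errors = {name: mae(test, model(train, holdout))
              for name, model in MODELS.items()}
    winners = sorted(errors, key=lambda name: (errors[name], name))[:keep]
    forecasts = [MODELS[name](series, horizon) for name in winners]
    combined = [sum(step) / len(step) for step in zip(*forecasts)]
    return winners, combined


if __name__ == "__main__":
    usage = [10, 12, 14, 16, 18, 20, 22, 24]  # e.g. TiB of storage used
    print(tournament_forecast(usage, horizon=2, holdout=2))
```

Averaging several winners rather than trusting a single best model tends to make the forecast less sensitive to one model fitting the holdout window by luck.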

These forecasts help customers estimate their true resource needs, empowering them to optimally size hardware resources for specific workloads. Customers using one-click planning can easily see when they will run out of capacity. This is where the power of planning flourishes. Using what-if analysis in Prism Central, you can specify the workloads that need to be added, and the system will generate a resource recommendation. It is important to note that leveraging technologies such as X-FIT helps us eliminate inefficiencies in the datacenter and save substantially on costs. Say goodbye to stressful, costly, and hectic IT refresh cycles and say hello to Nutanix.

Figure 4: X-FIT Flowchart

Now that we have covered these four categories in more depth, it is evident that leveraging technologies such as ML and AI will help eliminate inefficiencies and enable a true one-click operational experience in the datacenter. These technologies provide a truly cloud-like experience for enterprise datacenters with minimal IT resources.

“Size matters not. Look at me. Judge me by my size, do you? Hmm? Hmm. And well you should not.” Yoda.


Disclaimer: This blog may contain links to external websites that are not part of Nutanix. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such site.

© 2017 Nutanix, Inc.  All rights reserved. Nutanix, the Enterprise Cloud Platform, and the Nutanix logo are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).
