ADS CPU migration decisions in AHV: Factors considered for assessment

  • 9 January 2023
  • 1 reply

What metrics does Acropolis Dynamic Scheduling (ADS) consider when it makes a CPU-related migration decision on AHV? Does ADS consider only host CPU usage, or does it also take into account other VM or host metrics such as CPU Ready, Co-Stop, and/or Steal Time?


Best answer by Moustafa Hindawi 9 January 2023, 19:47


This topic has been closed for comments



Hello @Shaun.Sparks 

ADS monitors the following resources:

  • VM CPU Utilization: Total CPU usage of each guest VM.
  • Storage CPU Utilization: Storage controller (Stargate) CPU usage per VM or iSCSI target.

ADS does not monitor memory and networking usage.

How Acropolis Dynamic Scheduling Works

Lazan is the ADS service in an AHV cluster. AOS selects a Lazan manager and a Lazan solver from among the hosts in the cluster to manage ADS operations.

ADS performs the following tasks to resolve compute and storage I/O contentions or hotspots:

  • The Lazan manager gathers statistics from the components it monitors.
  • The Lazan solver (runner) checks the statistics for potential anomalies and determines how to resolve them, if possible.
  • The Lazan manager invokes the tasks (for example, VM migrations) to resolve the situation.
  • During migration, a VM consumes resources on both the source and destination hosts, because the High Availability (HA) reservation algorithm must protect the VM on both hosts. If a migration fails due to a lack of free resources, turn off some VMs so that the migration can proceed.
  • If a problem is detected and ADS cannot resolve it (for example, because of limited CPU or storage resources), the migration plan might fail. In these cases, an alert is generated. Monitor these alerts from the Alerts dashboard of the Prism Element web console and take any necessary remedial action.
  • If a host, firmware, or AOS upgrade is in progress, ADS does not rebalance any resource contention that occurs during the upgrade period.
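The manager/solver split in the list above can be sketched as a single scheduling cycle. This is a hypothetical model only: `Task`, `run_cycle`, `toy_solver`, and the data shapes are illustrative names, not Lazan's internals, and the toy solver's strategy (move load from each hot host to the coolest host) is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model of one ADS cycle. Task, run_cycle, and toy_solver are
# illustrative names, not Lazan's internal API.

@dataclass
class Task:
    vm: str
    source: str
    dest: str

Stats = dict[str, float]  # host name -> CPU utilization (0.0-1.0)
Solver = Callable[[Stats], Optional[list[Task]]]

def run_cycle(stats: Stats, solve: Solver, upgrade_in_progress: bool,
              alerts: list[str]) -> list[Task]:
    """Manager gathers stats, solver plans, manager returns tasks to invoke."""
    # No contention rebalancing while a host, firmware, or AOS upgrade runs.
    if upgrade_in_progress:
        return []
    # Solver: check for anomalies and try to build a plan (None = unsolvable).
    plan = solve(stats)
    if plan is None:
        # Unsolvable plans surface as alerts (Alerts dashboard in Prism Element).
        alerts.append("ADS could not resolve a detected hotspot")
        return []
    # Manager: these tasks (e.g. VM migrations) would now be invoked.
    return plan

def toy_solver(stats: Stats) -> Optional[list[Task]]:
    """Assumed strategy: move load from each hot host to the coolest host."""
    hot = [h for h, u in stats.items() if u > 0.85]
    if not hot:
        return []
    coolest = min(stats, key=stats.get)
    if stats[coolest] > 0.85:
        return None  # every host is hot: nowhere to move anything
    # "vm-on-<host>" is a placeholder VM name for illustration.
    return [Task(vm=f"vm-on-{h}", source=h, dest=coolest) for h in hot]
```

For example, `run_cycle({"A": 0.95, "B": 0.40}, toy_solver, False, alerts)` returns one task moving a VM from host A to host B, while the same call with `upgrade_in_progress=True` returns an empty plan.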

When Is a Hotspot Detected?

Lazan runs every 15 minutes and analyzes resource usage over at least that period. If the resource utilization of an AHV host remains above 85% for 15 minutes, Lazan triggers migration tasks to remove the hotspot.
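Read literally, the rule above flags a host only if it stays above 85% for the full 15-minute span. A minimal sketch of that check, assuming one utilization sample per minute (the sampling rate and the name `is_sustained_hotspot` are hypothetical):

```python
THRESHOLD = 0.85     # 85% host utilization, from the rule above
WINDOW_MINUTES = 15  # how long utilization must stay above the threshold

def is_sustained_hotspot(per_minute_util: list[float]) -> bool:
    """per_minute_util: one host's utilization (0.0-1.0), one sample per minute, oldest first."""
    if len(per_minute_util) < WINDOW_MINUTES:
        return False  # not enough history to cover the full window yet
    # Every sample in the most recent 15 minutes must exceed 85%.
    return all(u > THRESHOLD for u in per_minute_util[-WINDOW_MINUTES:])
```

Under this reading, a single dip below 85% inside the window (for example `[0.90] * 14 + [0.80]`) means no hotspot is reported. The note that follows describes averaging for CPU instead, so treat this strict "every sample" check as one possible interpretation.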

Note: For a storage hotspot, ADS looks at the last 40 minutes of data and uses a smoothing algorithm that favors the most recent data. For a CPU hotspot, ADS looks only at the last 10 minutes of data, that is, the average CPU usage over the last 10 minutes.
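The difference between the two windows can be sketched as follows, assuming one sample per minute. The post does not name the storage smoothing algorithm, so the exponential moving average (EMA) and its weight `ALPHA` below are assumed stand-ins; the function names are hypothetical.

```python
CPU_WINDOW_MIN = 10      # CPU: plain average of the last 10 minutes
STORAGE_WINDOW_MIN = 40  # storage: last 40 minutes, smoothed
ALPHA = 0.2              # hypothetical EMA weight; the real smoothing is unspecified

def cpu_signal(per_minute_util: list[float]) -> float:
    """Average CPU utilization over the last 10 one-minute samples."""
    if not per_minute_util:
        raise ValueError("no samples")
    recent = per_minute_util[-CPU_WINDOW_MIN:]
    return sum(recent) / len(recent)

def storage_signal(per_minute_util: list[float]) -> float:
    """EMA over the last 40 one-minute samples, oldest first.

    Each new sample gets weight ALPHA, so recent minutes dominate the result.
    This EMA is an assumed stand-in for ADS's unnamed smoothing algorithm.
    """
    if not per_minute_util:
        raise ValueError("no samples")
    recent = per_minute_util[-STORAGE_WINDOW_MIN:]
    smoothed = recent[0]
    for u in recent[1:]:
        smoothed = ALPHA * u + (1 - ALPHA) * smoothed
    return smoothed
```

With 30 minutes at 90% followed by 10 minutes at 30%, the CPU signal sees only the recent 30%, while the smoothed storage signal still carries a fraction of the earlier load.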

The following are possible reasons why VMs might not migrate even though there is an obvious hotspot:

  • Lazan cannot resolve the hotspot. For example:
    • A large VM (16 vCPUs) is at 100% usage and accounts for 75% of the usage of its AHV host, which is also at 100% usage.
    • The other hosts are loaded at ~40% usage.

    In this situation, no other host can accommodate the large VM without becoming contended as well. Lazan does not prioritize one host or VM over others for contention, so it leaves the VM on its current host.

  • The number of all-flash nodes in the cluster is less than the replication factor (RF).

    If the cluster uses an RF2 configuration, it must have at least two all-flash nodes for VMs on the all-flash nodes to migrate successfully.
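Both reasons can be sketched as a single check, run against the 16-vCPU example. This assumes all hosts have equal CPU capacity (so the VM's load can be expressed as a fraction of one host) and treats "contention" on a destination as crossing the same 85% hotspot threshold; both are assumptions, and `Node` and `migration_blockers` are hypothetical names.

```python
from dataclasses import dataclass

HOTSPOT_THRESHOLD = 0.85  # from the hotspot rule; reused here as an assumption

@dataclass
class Node:
    name: str
    cpu_util: float   # fraction of this node's CPU in use (0.0-1.0)
    all_flash: bool

def migration_blockers(vm_share: float, source: Node, others: list[Node],
                       replication_factor: int) -> list[str]:
    """Return reasons why ADS would leave a hot VM in place.

    vm_share: the VM's load as a fraction of one node's capacity
    (assumes all nodes have the same CPU capacity).
    """
    reasons = []
    # Reason 1: no destination can take the VM without becoming a hotspot itself.
    if not any(n.cpu_util + vm_share <= HOTSPOT_THRESHOLD for n in others):
        reasons.append("no other host can absorb the VM without new contention")
    # Reason 2: VMs on all-flash nodes need at least RF all-flash nodes.
    flash_nodes = sum(n.all_flash for n in [source, *others])
    if source.all_flash and flash_nodes < replication_factor:
        reasons.append(f"only {flash_nodes} all-flash node(s) for RF{replication_factor}")
    return reasons

# The example above: the VM is 75% of a full host; other hosts sit at ~40%.
hot = Node("A", 1.00, all_flash=False)
others = [Node("B", 0.40, False), Node("C", 0.40, False)]
print(migration_blockers(0.75, hot, others, replication_factor=2))
# ['no other host can absorb the VM without new contention']
```

Moving the VM would put any destination at 0.40 + 0.75 = 1.15 of its capacity, which is why leaving the VM in place is the only option that does not create a new hotspot.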