LCM | What is this Sorcery?

  • 4 September 2020

Can you imagine performing firmware upgrades manually?

You would check the compatibility matrix for each component, procure and stage the ISOs, and spend days on the firmware upgrade itself, all while hoping that the host or cluster still works once everything is said and done.

LCM simplifies Nutanix IT infrastructure life cycle operations by consolidating software and firmware component upgrades into a unified control plane.

LCM does the hard work of managing the upgrade dependencies for the software and firmware components, so you don't have to. It normalizes firmware upgrades across hardware platforms and provides a single unified process regardless of vendor. Based on the bundle of software and firmware components selected for upgrade, LCM creates a plan that deploys the updates with the minimum number of service or host restarts.

Get Started:

  1. Use Prism Central or Prism Element to open LCM

  2. Perform the LCM inventory (default is set to automatic) 

  3. Select one or more software or firmware packages to deploy

  4. Click update to start the upgrade process

Typical workflow:

Most firmware updates are run from a CentOS-based staging area called Phoenix. When a node is selected by the LCM Leader for upgrade, the Foundation service is started on a remote CVM and the reboot_to_phoenix API is called to send that node into the staging area. A similar Foundation API, reboot_to_host, is later used to boot the upgraded node back into its hypervisor.

The actual workflows for each firmware entity vary somewhat. Here is what users can expect during several upgrade types.

Pro-tip: 

What's the difference between a warm reset and a cold one? A warm reset of a host, also known as a graceful reboot, restarts the operating system while the appliance stays powered on. A cold reset, by contrast, performs an AC power cycle (off, then on) of the hardware. The cold reset is necessary for applying most firmware updates.

 

BIOS

What happens during the update:

  • Put the node in maintenance mode, automatically migrating all guest VMs to another node.

  • Restart the node into the Phoenix Live CD.

  • Perform stage one of the update.

  • Warm-reset Phoenix.

  • Perform stage two of the update.

  • Cold-reset Phoenix.

  • Perform stage three of the update.

  • Restart out of Phoenix and bring the node out of maintenance mode.

  • Note: Because of Intel microcode requirements, updating the BIOS takes several restarts, so BIOS updates take longer than updates for other components.

BMC

What happens during the update:

  • Put the node in maintenance mode, automatically migrating all guest VMs to another node.

  • Restart the node into the Phoenix Live CD.

  • Perform the update.

  • Restart out of Phoenix and bring the node out of maintenance mode.

 

Data Drives and HBA Controllers

What happens during the update:

  • Check to make sure that all data drives are healthy, so that taking down one drive does not cause any data loss.

  • Place the target Host into Hypervisor Maintenance Mode, migrating all guest VMs to another node.

  • Place the CVM into Maintenance Mode so that storage traffic is served by a remote CVM.

  • Stop remaining services on the CVM.

  • Restart the node into the Phoenix Live CD with the disk check option.

  • Perform the firmware update.

  • Restart the node.

  • Restart the CVM.

  • Bring the node out of maintenance mode.

  • Return storage traffic to the CVM.

  • Recheck all data drives.

 

Satadom/M.2

If Inventory detects any 3IE3 model SATADOMs in the cluster running S560301N firmware, LCM makes updates available only for this entity until the device is upgraded to version S670330N. See KB-7194 for details.
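As a rough illustration of that gating rule, here is a minimal Python sketch. It is not LCM code: the function name, the (model, firmware) tuple layout, and the entity label "SATADOM" are all made up for the example. Only the model and firmware strings come from the paragraph above.

```python
def available_updates(satadoms, candidates):
    """Return the update entities to offer after an inventory run.

    satadoms:   list of (model, firmware) tuples (hypothetical layout).
    candidates: list of entity names that have pending updates.
    """
    # A 3IE3 device still on S560301N blocks every other entity.
    blocked = any(model == "3IE3" and fw == "S560301N" for model, fw in satadoms)
    if blocked:
        # "SATADOM" is an illustrative label, not an LCM identifier.
        return [c for c in candidates if c == "SATADOM"]
    return list(candidates)


pending = ["BIOS", "BMC", "SATADOM"]
before = available_updates([("3IE3", "S560301N")], pending)
after = available_updates([("3IE3", "S670330N")], pending)
```

In this sketch, `before` holds only the SATADOM entry, and `after` holds the full list once the device reports S670330N.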

 

 

What happens during the update:

  • Put the node in maintenance mode, automatically migrating all guest VMs to another node.

  • Restart the node into the Phoenix Live CD.

  • Perform the update.

  • Restart out of Phoenix and bring the node out of maintenance mode.

 

Performing upgrades using LCM falls in line with our 1-Click update workflows, which perform their actions without causing any downtime for end users.

How do you do it? Is there a magic wand?

Well, it is actually a token that we call a shutdown token.

A shutdown token is essentially an entry in ZooKeeper (ZK) that states which node may shut itself down, why the node was granted the token, and when it was granted (a timestamp).

It has three fields: request_reason, request_time, and requester_ip. Example: {"request_reason": "host_upgrade", "request_time": 1489673150.9863801, "requester_ip": "10.125.65.14"}
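To make the three fields concrete, here is a small Python sketch that decodes the example token. It assumes that request_time is a Unix epoch timestamp in seconds, which the value suggests but the post does not state. The helper name describe_token is invented for the example.

```python
import json
from datetime import datetime, timezone

# Example token copied from the post.
raw = ('{"request_reason": "host_upgrade", '
       '"request_time": 1489673150.9863801, '
       '"requester_ip": "10.125.65.14"}')


def describe_token(raw_token: str) -> dict:
    """Decode a shutdown token; treats request_time as epoch seconds (assumption)."""
    token = json.loads(raw_token)
    granted_at = datetime.fromtimestamp(token["request_time"], tz=timezone.utc)
    return {
        "reason": token["request_reason"],
        "holder": token["requester_ip"],
        "granted_at": granted_at,
    }


info = describe_token(raw)
print(info["reason"], info["holder"], info["granted_at"].isoformat())
```

Read this way, the token says that 10.125.65.14 was allowed to shut down for a host upgrade, and the timestamp falls in March 2017.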

  • When a node wants to shut itself down, it sends a request to the genesis master for a shutdown token. The master then finds the current owner of the shutdown token and tries to get it to release the token.

  • The node that holds the token will release it only when it's done with the operation it was granted the token for, and once all its services are UP and the HA routes on the hypervisor are removed.

  • The genesis master is responsible for managing the requests and for ensuring that only one node gets the token. Each node sends a request every 30 seconds until it gets the shutdown token.

This way, if there is an issue with any 1-Click upgrade, only one node or CVM is affected. Since Nutanix clusters can survive at least one node failure, this process is safe to run at any time.
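The token handoff described above can be sketched as a toy model in Python. This is a simplification, not the genesis implementation: the class names, method names, and IP addresses are hypothetical, the 30-second retry loop is left out, and the master simply refuses a request while another node holds the token instead of asking the holder to release it.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShutdownToken:
    request_reason: str
    request_time: float
    requester_ip: str


@dataclass
class TokenMaster:
    """Toy stand-in for the genesis master: at most one token holder at a time."""
    holder: Optional[ShutdownToken] = None

    def request(self, ip: str, reason: str) -> bool:
        # Grant the token only if nobody holds it (or the requester already does).
        if self.holder is None:
            self.holder = ShutdownToken(reason, time.time(), ip)
            return True
        return self.holder.requester_ip == ip

    def release(self, ip: str, services_up: bool, ha_routes_removed: bool) -> bool:
        # Only the holder may release, and only once it is healthy again.
        if self.holder is None or self.holder.requester_ip != ip:
            return False
        if not (services_up and ha_routes_removed):
            return False
        self.holder = None
        return True


master = TokenMaster()
first = master.request("10.0.0.1", "host_upgrade")    # granted
second = master.request("10.0.0.2", "host_upgrade")   # refused; a real node would retry
early = master.release("10.0.0.1", services_up=False, ha_routes_removed=True)
done = master.release("10.0.0.1", services_up=True, ha_routes_removed=True)
retry = master.request("10.0.0.2", "host_upgrade")    # granted after the release
```

The point of the sketch is the invariant: the second node gets the token only after the first node has finished and reports its services up and its HA routes removed.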

Note: Keep data resiliency and cluster health in mind. Any issues detected before the upgrade need to be resolved before you start.

 

Check out the Guide below and our product page

Life Cycle Manager Guide v2.3

Product page

