All things considered: upgrade sequence and preparation guidelines

  • 22 October 2019

  • Nutanix Employee

Let’s face it: upgrades are daunting, confusing, frequent, unavoidable, and, well, painful. To help with preparation and ease at least some of the worry, Nutanix put together the Acropolis upgrade guide (Upgrading AOS, Prism Central, Hypervisors, and Related Software Through The Web Console).

Recommended Upgrade Order:

  1. Prism Central (PC): Upgrade NCC, then run NCC on Prism Central.
  2. PC: Upgrade Prism Central.
  3. PC: Run NCC.
  4. Prism Element clusters (PE): Upgrade NCC, then run NCC.
  5. PE: Upgrade Foundation.
  6. PE: Run and upgrade Life Cycle Manager (LCM):
    • Perform an LCM inventory (also updates LCM framework). Do not upgrade any other software component except LCM in this step.
  7. PE: Upgrade AOS.
  8. PE: Run and upgrade Life Cycle Manager (LCM):
    • Perform an LCM inventory (also updates LCM framework).
    • Upgrade SATA DOM firmware (for hardware using SATA DOMs) as recommended by LCM.
    • Upgrade all other firmware as recommended by LCM (BIOS / BMC / other).
  9. PE: Upgrade AHV for AHV clusters.
  10. PE: Upgrade cluster hypervisor hosts other than AHV.
  11. PE: Run NCC.
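
As a rough illustration, the order above can be written down as a checklist and used to sanity-check a planned maintenance sequence. This is only a sketch: the step identifiers and the `out_of_order` helper are made up for this example and do not correspond to any Nutanix tool or API. A plan may skip steps that do not apply (for example, no AHV hosts), but the steps it does include should keep the relative order of the list.

```python
# Hypothetical step identifiers mirroring the recommended order above.
# Nothing here talks to a cluster; it only checks the ordering of a plan.
UPGRADE_ORDER = [
    "pc_ncc",                # 1. PC: NCC
    "pc_upgrade",            # 2. PC: upgrade Prism Central
    "pc_run_ncc",            # 3. PC: run NCC
    "pe_ncc",                # 4. PE: NCC
    "pe_foundation",         # 5. PE: upgrade Foundation
    "pe_lcm_pre_aos",        # 6. PE: LCM inventory / LCM framework only
    "pe_aos",                # 7. PE: upgrade AOS
    "pe_lcm_firmware",       # 8. PE: LCM inventory, SATA DOM and other firmware
    "pe_ahv",                # 9. PE: upgrade AHV
    "pe_other_hypervisors",  # 10. PE: upgrade non-AHV hypervisor hosts
    "pe_run_ncc",            # 11. PE: run NCC
]


def out_of_order(plan):
    """Return (earlier, later) pairs of adjacent plan steps that break the order."""
    rank = {step: i for i, step in enumerate(UPGRADE_ORDER)}
    unknown = [step for step in plan if step not in rank]
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")
    return [
        (earlier, later)
        for earlier, later in zip(plan, plan[1:])
        if rank[earlier] >= rank[later]
    ]


if __name__ == "__main__":
    # Foundation is planned after AOS here, so the pair is reported.
    print(out_of_order(["pc_upgrade", "pe_aos", "pe_foundation"]))
```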

 

Figure. Nutanix Upgrade Order 1 of 2: Prism Central and Prism Element Clusters

 

Figure. Nutanix Upgrade Order 2 of 2: Hypervisor Hosts

 

Nutanix recommends that you perform installations and upgrades during your scheduled maintenance window or outside your normal business hours. You can upgrade during business hours, but your users might experience some latency during the upgrade process. This latency might be more noticeable on clusters that use only 1GbE network uplinks, because of the limited bandwidth available in that configuration.

Depending on the upgrade being performed, guest VMs might live migrate between hosts, with little to no impact on their services. Users can still access their VMs and work as normal during the upgrade.

Before upgrading, you might need to power down VMs that cannot live migrate, such as those with vGPUs or Affinity Rules, or adjust their settings. Typically this recommendation applies to any upgrade that requires a host reboot (like hypervisor upgrades). If you do not do this, an upgrade operation might stall when evacuating guest VMs.
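
One way to prepare for this is to list the affected VMs ahead of the maintenance window. The sketch below assumes you already have a VM inventory exported from somewhere; the `VM` record, its fields, and the `needs_attention` helper are hypothetical and not part of any Nutanix API. It also assumes that powered-off VMs do not need to be evacuated, so they are skipped.

```python
from dataclasses import dataclass


@dataclass
class VM:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    has_vgpu: bool = False
    has_affinity_rule: bool = False
    powered_on: bool = True


def needs_attention(vms):
    """Names of running VMs that may block host evacuation during an upgrade."""
    return [
        vm.name
        for vm in vms
        if vm.powered_on and (vm.has_vgpu or vm.has_affinity_rule)
    ]


if __name__ == "__main__":
    inventory = [
        VM("web-01"),
        VM("render-01", has_vgpu=True),
        VM("db-01", has_affinity_rule=True, powered_on=False),
    ]
    print(needs_attention(inventory))
```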

 

:exclamation: Be sure to read the Before You Begin, Upgrade Checklist, and Reference and Resources (which lists the URLs of all compatibility matrices) sections of the guide as a starting point.

May the Force be with you! :fingers_crossed:

 



2 replies


Hi @Rob Thomas 

That is an excellent question, thank you. To the best of my knowledge, the answer in the format you seek does not exist. I will try to explain why as best I can.

There is more than one way to set up an environment. Some environments have no interest in high availability (not only within the cluster but at the environment/site scale). Some have considered failure scenarios to a minor or moderate degree and accounted for them. Some have a very low tolerance for failure, where a 6-second outage is already unacceptable. There are sites that run mirrored setups from two vendors in parallel, so that if and when there is a bug in either vendor's software, they can isolate the issue and switch over to the other vendor while searching or waiting for a solution.

Hence there are LTS and STS releases. The former are more stable and more conservative, with a longer life cycle and less frequent updates. The latter are more frequent and more exploratory.

So overall there are, I think, two main factors to consider: one is the tolerance for maintenance windows, and the other is, let’s call it, hunger for new features, as I can’t find a better word for it right now. Those two things can vary drastically from site to site within the same company, so this is not necessarily even a matter of company policy.

As for using 1-click availability, that is not a very reliable marker. Whenever an issue is found with a component or a feature, the 1-click option is disabled for that item.

I would say, to stay on the safer side (assuming that’s the goal), give a new release at least a month before rolling it out. In my experience that applies not only to Nutanix but to any vendor. To be extra safe, roll it out at a test/dev site first if you have one, and let it run for a few weeks before you make the call.

When does Nutanix recommend upgrading software? When it is available for download from the support downloads page, or when it is in the 1-click? Like officially recommend. Like if it was a test question. :wink: