Most traditional IT houses are built on a foundation of some long-favored conventional wisdom. To wit, “the fans and early adopters take Windows 10 right out of the box.” “We’re bleeding edge; we’ll take SP1.” “We are an enterprise; we will deploy SP2 in production.” Or, “We’ll wait for RHEL to support VMware 7.0.” If you’ve worked in IT for more than a few years, you know the drill--there are some things your leadership just does not allow in your shop.
The list is long and includes, but is not limited to, “hardware replacement after business hours only.” “Patch only when security requires it!” Or, “upgrades on the weekends only when there is less potential to impact the business.”
In the end, the life of the crew doing the heavy IT lifting is largely one where, even with the above rules, you never know what to expect, other than that your leadership values one thing: stability of the production systems, 24x7.
In IT, we get stability by making sure we don’t change anything. The bulk of production-impacting events (known as “P1s”) are due to something changing in the environment. Hence, don’t change anything and all will be well with the world, right?
Not so much, because old versions of code have a set of known issues, as well as a set of issues as yet undefined. “As yet undefined” because the longer code is out there, the more opportunity there is to find new issues, or finally hit the conditions that trigger known bugs that are already fixed in newer versions of code. Each of those issues has a chance to destabilize your production environment and impact your bottom line.
Yes, new code has new bugs. Pssst! Come closer… your life will become easier, and you will breathe easier, the sooner you realize that so far, no one makes “Bug Free” code. Not a secret.
In the days of the traditional siloed IT functions of compute, storage, and networking, each function had a team. Each team had their processes. Each team had their schedules. Each team had their bias for how they saw the world (fangirls / early adopters, bleeding edge, or, “we’ll upgrade when you pry these CD-ROMs from my cold dead hands”). All of these perspectives are valid, by the way; it’s just that each has its own set of consequences.
Then virtualization showed up. Virtualization was truly a game changer. Virtualization allowed IT shops to realize more of the potential of their available system resources.
With 3-tier convergence came the idea of smaller datacenter footprints and less power consumption. On a visit I made to one CIO, he pointed out a “pizza box” server in his office, noting that it was the 1,000th box that his team had decommissioned in their converged infrastructure effort (circa 2011).
This CIO embraced change in his datacenter due to the tangible effects he could perceive (less floor space and lower power consumption), which, all in all, meant potentially huge savings. Yet, he kept his organizational silos. His expected savings were lost to his network team, who did not treat the compute team’s upgrade requests as important because their own stuff was working fine. The storage team had a definite opinion on what gear was best, so that bias drove decision making for new infrastructure. An old cycle gave way to yet another cycle that pushed the savings convergence offered into a latent cloud of mañana.
Enter hyperconverged infrastructure (HCI). By now, IT teams have finally converged, though networking has held their own, because, well, networking teams have many constituents. HCI on the other hand has enabled a single team to manage the day to day of datacenter operations. Things are *much* better now, right?
Hold onto your Ethernet there, Kemosabe… A team that manages compute, storage, virtualization and their underlying component issues (memory, CPU, I/O, yada^3), still has one traditional challenge the business has not yet found a way to let go of. Have you spotted it yet?
That’s right. Upgrades. Whether it be a “patch” (often an upgrade in disguise), the operating system, or firmware, upgrades are still something many businesses do not identify as pivotal to their success. Putting off upgrades incurs Upgrade Debt. Upgrade Debt is the time it takes to get upgrades done. The longer you wait, the larger the debt.
Upgrade Debt is also the risk most businesses are carrying due to a desire for “stability” in the production environment. It sounds something like, “If we don’t spend the time to upgrade that system, we’ll have more time to deal with the day to day support issues!”
The risk of such perspectives lands as a P1 event in reality. Or worse, in your business critical data being shared with the world. Upgrade Debt or a hack due to outdated firmware will be the cause of your next business critical event. Your job? Do something about it--now.
Enter the much maligned upgrade. Most IT shops have their teams do upgrades or hardware replacement after “regular business hours,” on the weekend, or (eeeek!) on holidays. As IT teams, many of us long ago accepted that we will lose sleep, weekends, and holidays.
Or will we? There are Nutanix customers who are now doing their upgrades during the week at night (US time) without impact to the business. Some Nutanix customers are also doing full stack upgrades (AOS, AHV / hypervisor, NCC [Nutanix Cluster Check], Life Cycle Manager or “LCM,” and firmware) during the business day, Monday through Friday.
A team of 3 Nutanix residents and a professional services consultant serve one of the top Fortune companies and manage their day to day support issues. This team has upgraded hundreds of clusters that are actively running production, during the daytime, without impact to the business, every week of every month. In over 2 years of following this process, there has yet to be a production-impacting event to the business while upgrades were in progress.
Can you imagine doing that on your business systems? It turns out, in the age of HCI, this is how it is done. Technology has improved to the point where once one cluster is upgraded, we just move onto the next one, get it done, and then move to the next. Nutanix HCI provides a robust foundation from which the team with vision can get this work done and still have a weekend, or better, holidays.
To be clear, upgrades carry risk. As noted earlier, P1 events tend to happen when there is a change in the environment. By introducing new code via upgrades, patches, or firmware, you open the door labeled P1 (otherwise known as a call from your CIO asking who dropped the ball). HCI has brought upgrade risk to a nominal level.
The Nutanix upgrade process allows users to stage upgrades and then execute them in an automated, rolling fashion, with VMs migrating to a different node so that each node is free to upgrade. Firmware works the same way: you stage it all, or just a single component on a single node, via LCM (http://nutanix.com/lcm). Once launched, you manage other tasks, check in on the process on occasion, and address issues with the upgrade should they occur, all while protecting your business critical functions. That means no downtime, just a “maintenance window,” so that folks know that something is happening.
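For readers who think better in code, here is a minimal sketch of the rolling pattern described above: drain a node's VMs onto its peers, upgrade that node, then move on to the next. This is a toy simulation, not Nutanix code. The `Node` class, the `rolling_upgrade` function, the "least-loaded peer" placement rule, and the version strings are all hypothetical, and the sketch only models multi-node clusters where a peer is available to host the VMs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not a Nutanix object.
    name: str
    version: str
    vms: list = field(default_factory=list)


def rolling_upgrade(nodes, target_version):
    """Upgrade one node at a time, moving its VMs to peer nodes first.

    Returns a log of (node_name, vms_moved) tuples in upgrade order.
    """
    log = []
    for node in nodes:
        if node.version == target_version:
            continue  # already on the target version, nothing to do
        peers = [n for n in nodes if n is not node]
        moved = list(node.vms)
        if moved and not peers:
            raise RuntimeError("no peer node available to host the VMs")
        # Drain: place each VM on whichever peer currently holds the fewest VMs
        # (an assumed placement rule, chosen only to keep the example short).
        for vm in moved:
            dest = min(peers, key=lambda n: len(n.vms))
            dest.vms.append(vm)
        node.vms.clear()
        # Stand-in for the real upgrade work on the now-empty node.
        node.version = target_version
        log.append((node.name, moved))
    return log


cluster = [
    Node("node-a", "1.0", ["vm1", "vm2"]),
    Node("node-b", "1.0", ["vm3"]),
    Node("node-c", "1.0", ["vm4", "vm5"]),
]
vms_before = sorted(vm for n in cluster for vm in n.vms)
upgrade_log = rolling_upgrade(cluster, "2.0")
vms_after = sorted(vm for n in cluster for vm in n.vms)

for name, moved in upgrade_log:
    print(f"{name}: moved {moved} before upgrading")
print("all VMs still placed:", vms_before == vms_after)
```

The point of the sketch is the invariant: at every step the VMs live somewhere, so the workload keeps running while each node takes its turn.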
For the Fortune 500 company work noted above, the Nutanix team completed upgrades on 71 clusters in the above noted model (M-F, 8 - 5) in a 3.5 month period. This effort represents 973 total nodes. The average size of the clusters upgraded was 13.7 nodes, and upgrades took roughly 18 to 20 hours per cluster on average (including staging, firmware upgrades, AOS, NCC, Foundation, and LCM). The largest cluster upgraded was 32 nodes. Some cluster stats: 14 clusters of 32 nodes, 10 of 12 nodes, 8 of 6 nodes, 8 of 4 nodes, and 5 single-node clusters. There were more clusters of other sizes, so this is a semi-summary, not a complete list.
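As a quick sanity check on those figures, the arithmetic can be worked through in a few lines. This sketch uses only the numbers quoted above and reads each entry in the cluster stats as "count of clusters, then cluster size" (an assumption about the shorthand). The total of cluster-hours is a simple multiplication of the quoted average, not a measured wall-clock figure.

```python
# Figures quoted in the post.
total_clusters = 71
total_nodes = 973
hours_per_cluster = (18, 20)  # rough average range per cluster

# Partial breakdown from the post, read as {cluster_size: number_of_clusters}.
listed = {32: 14, 12: 10, 6: 8, 4: 8, 1: 5}

avg_nodes = total_nodes / total_clusters
listed_clusters = sum(listed.values())
listed_nodes = sum(size * count for size, count in listed.items())

# Whatever the partial list does not cover falls into "other sizes".
unlisted_clusters = total_clusters - listed_clusters
unlisted_nodes = total_nodes - listed_nodes

# Cumulative cluster-hours implied by the quoted average.
total_hours = tuple(h * total_clusters for h in hours_per_cluster)

print(f"average cluster size: {avg_nodes:.1f} nodes")
print(f"listed: {listed_clusters} clusters, {listed_nodes} nodes")
print(f"other sizes: {unlisted_clusters} clusters, {unlisted_nodes} nodes")
print(f"cumulative upgrade time: {total_hours[0]} to {total_hours[1]} cluster-hours")
```

The average works out to the 13.7 nodes quoted, and the partial list accounts for 45 of the 71 clusters, which is consistent with the note that the list is not complete.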
The fact that business operations are 24x7 globally highlights that IT function and maintenance are no different. It’s just that no one says this out loud, nor have they adapted IT processes to take advantage of the HCI reality. The CIO I referenced earlier who made the decision to converge his datacenter will have realized the full breadth of change he initiated with that decision via HCI. Nutanix enables actual 24x7 maintenance windows.
Many companies aspire to be recognized as a “great place to work.” While some companies gain this recognition, that vote may not include support from the IT teams, unless they are running Nutanix. The ability to upgrade during business hours is the much ballyhooed “game changer” that we, as the executors of your IT strategy, have long dreamed of experiencing.
Interested in learning more about the processes and facilities to upgrade? Talk to your Nutanix Sales Engineer, TAM, Resident Consultant, and/or CSM.
This post was authored by Daniel Hinojosa, Sr. Technical Account Manager, Nutanix
© 2020 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and the other Nutanix products and features mentioned on this post are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned on this post are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site.