Consistently high performance and bulletproof reliability are two fundamental attributes that enterprises expect of the infrastructure that hosts their business-critical applications. Performance and resiliency (along with scalability and simplicity) are inherent to the design of the Nutanix AOS™ software at the heart of Nutanix® Hyperconverged Infrastructure (HCI). This is why enterprises across the world run their business-critical applications on Nutanix HCI. We are continuously building on these strengths and further enhancing AOS performance and resiliency capabilities so that customers can deploy more applications on Nutanix infrastructure with confidence. In this post I will cover the key resiliency- and performance-focused capabilities that we are introducing with AOS 6.
We have been focused on pushing the performance envelope by delivering major architectural advancements in the past few major releases (you can read about these here). AOS 6 is no exception and includes some significant performance enhancements that we will cover in this section.
Replication Factor 1 (RF1)
The first capability helps improve performance and storage efficiency for applications that manage their own resiliency. Typically, Nutanix AOS protects application data by creating 2 or 3 copies of the data on different nodes within the cluster. This is known as the Replication Factor (RF), and until now it could be configured to create either 2 or 3 copies (RF2 or RF3, respectively). Big data applications, such as Hadoop®, Splunk® and NoSQL® applications, handle data resiliency at the application level. For these types of applications, adding another level of resilience causes operational overhead and increases the solution cost.
Analytics applications, such as SAS Analytics, are another class of applications that do not require data resiliency. Such applications copy the data for analytics and discard it after the job is done. Traditional SQL databases also often store temporary data that is ephemeral in nature and doesn’t need to be protected. For such cases, Nutanix AOS now supports creating storage containers with Replication Factor 1 (RF1), which means data is not replicated for redundancy. As mentioned earlier, this provides two key benefits:
- Storage efficiency: 50% storage savings compared with RF2 for applications that natively protect data, since data is not replicated at the storage level
- Performance gains: Write I/O performance is improved because data doesn’t need to be replicated across the network, providing true data locality even for writes. RF1 workloads like Cloudera see completion times shortened by 3X, and SAS Analytics benefits from 2.5X throughput increases.
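The storage-efficiency figure follows from simple arithmetic: every logical byte is stored once under RF1, twice under RF2 and three times under RF3. A minimal sketch of that calculation follows; the function names are illustrative and not part of any Nutanix API, and the model ignores compression, erasure coding and metadata overhead.

```python
def physical_footprint_gib(logical_gib: float, replication_factor: int) -> float:
    """Raw capacity consumed when each piece of data is stored
    `replication_factor` times (simplified model, no data reduction)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_gib * replication_factor


def savings_fraction(logical_gib: float, rf_new: int, rf_old: int) -> float:
    """Fraction of raw capacity saved by moving from rf_old to rf_new."""
    old = physical_footprint_gib(logical_gib, rf_old)
    new = physical_footprint_gib(logical_gib, rf_new)
    return (old - new) / old


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(f"RF{rf}: {physical_footprint_gib(1000, rf):.0f} GiB raw for 1000 GiB logical")
    print(f"RF1 vs RF2 savings: {savings_fraction(1000, 1, 2):.0%}")
```

Under this model, moving a 1000 GiB data set from an RF2 container to an RF1 container halves the raw capacity it consumes, which matches the 50% figure above.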
Data Sharding for Scale-up Database performance
Most workloads on Nutanix AOS are deployed over multiple virtual disks (vDisks) to scale performance by taking advantage of the distributed nature of the architecture. Certain workloads, such as scale-up databases migrating from legacy SAN systems, may instead be deployed on a single large vDisk. Previously, each vDisk used a single thread for its I/O, which could limit the performance of applications based on a single vDisk. With the data sharding enhancement, each vDisk uses multiple threads for its I/O, effectively allowing the performance of each vDisk to scale with the load on it. We have seen over 2x improvement in the performance of a SQL database deployed on a single vDisk, making it easier to run scale-up databases on HCI without requiring changes in how they consume storage. This capability is under development as part of the upcoming AOS 6.1 release.
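To illustrate the general idea of sharding a single disk's I/O across threads, here is a generic sketch. It is not how AOS implements the feature, which this post does not describe. The region size, the modulo mapping and all names are assumptions chosen for the example: the address space is cut into fixed-size regions, each region maps to one shard, and each shard's requests are handled by its own worker thread.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

REGION_SIZE = 1 << 20  # hypothetical 1 MiB region; not an AOS constant


def shard_for_offset(offset: int, num_shards: int, region_size: int = REGION_SIZE) -> int:
    """Map a byte offset on the disk to a shard by region number."""
    return (offset // region_size) % num_shards


def partition_requests(offsets, num_shards):
    """Group request offsets into one queue per shard."""
    queues = defaultdict(list)
    for offset in offsets:
        queues[shard_for_offset(offset, num_shards)].append(offset)
    return dict(queues)


def process_shard(shard_id, offsets):
    """Stand-in for real I/O: reports how many requests the shard handled."""
    return shard_id, len(offsets)


def run(offsets, num_shards):
    """Process each shard's queue on its own worker thread."""
    queues = partition_requests(offsets, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        futures = [pool.submit(process_shard, s, q) for s, q in queues.items()]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    requests = [i * REGION_SIZE for i in range(8)]
    print(run(requests, num_shards=4))
```

With one shard, all eight requests are handled by a single worker, which mirrors the single-thread-per-vDisk behavior described above. With four shards, the same requests are spread evenly across four workers.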
AOS Fast Tier
One of the major drivers of the generational AOS architectural advancements that I referenced earlier in the post was to enable our customers to unlock the power of next-generation storage technologies to optimize performance for their most demanding workloads. The AOS Fast Tier does just that by using faster NVMe media, such as Intel® Optane® SSDs, as a fast tier between the extent store and the unified cache. AOS Fast Tier is automatically enabled when such storage is detected. This allows AOS to take advantage of the superior read performance characteristics of Intel Optane and accelerate reads. We have seen over 30% higher read performance with AOS Fast Tier when compared to standard NVMe devices. (Test configuration: working set of 240GB with 50/50 in-out rate from the Optane tier, Block store + SPDK + RDMA enabled, MS SQL DB workload on Intel DCB nodes.)
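The placement of the fast tier between the unified cache and the extent store can be pictured as a read path that checks each tier in order. The sketch below is a conceptual illustration only: the class, the dictionary-backed tiers and the lookup policy are assumptions, and it deliberately leaves out promotion and eviction, which this post does not describe.

```python
from typing import Dict, Tuple


class TieredReader:
    """Looks up data in the fastest tier first and falls through to slower tiers."""

    def __init__(self, unified_cache: Dict[str, bytes],
                 fast_tier: Dict[str, bytes],
                 extent_store: Dict[str, bytes]) -> None:
        # Ordered from fastest to slowest; plain dicts stand in for real media.
        self.tiers = [
            ("unified_cache", unified_cache),
            ("fast_tier", fast_tier),
            ("extent_store", extent_store),
        ]

    def read(self, key: str) -> Tuple[bytes, str]:
        """Return the data and the name of the tier that served it."""
        for name, tier in self.tiers:
            data = tier.get(key)
            if data is not None:
                return data, name
        raise KeyError(key)


if __name__ == "__main__":
    reader = TieredReader(
        unified_cache={"a": b"hot"},
        fast_tier={"b": b"warm"},
        extent_store={"a": b"hot", "b": b"warm", "c": b"cold"},
    )
    for key in ("a", "b", "c"):
        print(key, reader.read(key)[1])
```

In this toy model, a read that misses the cache but hits the fast tier never touches the slower extent store, which is where the read acceleration would come from.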
As mentioned at the beginning, resilience and data integrity are first principles of the AOS architecture. We are always looking to build on this strength by incorporating feedback from our customers and bringing in capabilities that help them manage resilience better, especially at scale. In fact, every major AOS release in the past 18 months has introduced important resiliency-focused capabilities, such as storage reporting and resiliency visualization enhancements, and AOS 6 is no exception. You can find details about all these enhancements in this excellent blog by my colleagues Bibhash Seth and Steve Carter. I will just touch upon two of the key resiliency enhancements from AOS 6.x here:
Reserve Rebuild Capacity
With AOS 5.18, we introduced the enhanced storage summary widget to display the rebuild capacity required for the cluster to self-heal from failures. The Reserve Rebuild Capacity feature builds on this enhancement and allows you to reserve rebuild capacity and guarantee that there is sufficient capacity to rebuild in the event of failures. When configured, the cluster reserves the capacity required to self-heal from the loss of the largest node/block/rack within the cluster (this could be up to 2 nodes/blocks/racks, depending on the configuration) and dynamically manages this capacity through configuration changes. Applications running on such clusters will only see the available “Resilient Capacity” at any time.
This is an optional setting that is particularly useful for environments with highly mission-critical data, but it can be left disabled for environments where manual intervention would be preferred to strictly enforcing the rebuild capacity.
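As a rough illustration of the relationship between total capacity and "Resilient Capacity", the sketch below sets aside the capacity of the largest failure domains and reports what remains. This is a simplified model under stated assumptions, not the calculation AOS performs: it treats the reserve as the raw capacity of the largest one or two domains, and the function name and parameters are hypothetical.

```python
from typing import Sequence


def resilient_capacity_tib(domain_capacities_tib: Sequence[int],
                           failures_to_tolerate: int = 1) -> int:
    """Capacity left after reserving room to rebuild the largest failure domains.

    domain_capacities_tib: capacity of each node, block or rack, in TiB.
    failures_to_tolerate: how many of the largest domains to reserve for (1 or 2).
    """
    if not 1 <= failures_to_tolerate < len(domain_capacities_tib):
        raise ValueError("failures_to_tolerate must be >= 1 and below the domain count")
    largest_first = sorted(domain_capacities_tib, reverse=True)
    reserved = sum(largest_first[:failures_to_tolerate])
    return sum(domain_capacities_tib) - reserved


if __name__ == "__main__":
    nodes = [20, 20, 40, 20]  # TiB per node, made-up values
    print(resilient_capacity_tib(nodes, failures_to_tolerate=1))
    print(resilient_capacity_tib(nodes, failures_to_tolerate=2))
```

For the made-up four-node cluster above, reserving for the single largest node leaves 60 TiB of the 100 TiB total, and reserving for the two largest leaves 40 TiB.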
Rebuild Progress Indication/ETA
We are delighted to deliver one of the resiliency-related capabilities most requested by our customers: a mechanism for them to track the self-healing process. AOS 6.0 enhances the data resiliency widget to display a new rebuild progress indicator that enables administrators to track the time remaining until full resiliency has been restored to the cluster. The rebuild calculation uses distributed algorithms that take into account a large number of factors, including the capacity to be rebuilt, the number and speed of individual drives, the number of nodes participating in the rebuild operation, and even the current I/O load on the cluster.
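To give a feel for how the factors listed above interact, here is a deliberately naive estimate. It is not the distributed algorithm AOS uses. It assumes every participating drive contributes its full speed, that foreground I/O load reduces rebuild throughput proportionally, and that the work is spread evenly. All names and parameters are hypothetical.

```python
def estimate_rebuild_eta_seconds(data_to_rebuild_mb: float,
                                 nodes: int,
                                 drives_per_node: int,
                                 drive_mb_per_s: float,
                                 io_load_fraction: float) -> float:
    """Naive ETA: data to rebuild divided by throughput left over for rebuild.

    io_load_fraction is the share of drive throughput (0.0 to just under 1.0)
    assumed to be consumed by foreground application I/O.
    """
    if not 0.0 <= io_load_fraction < 1.0:
        raise ValueError("io_load_fraction must be in [0.0, 1.0)")
    if nodes < 1 or drives_per_node < 1 or drive_mb_per_s <= 0:
        raise ValueError("nodes, drives and drive speed must be positive")
    aggregate_mb_per_s = nodes * drives_per_node * drive_mb_per_s
    effective_mb_per_s = aggregate_mb_per_s * (1.0 - io_load_fraction)
    return data_to_rebuild_mb / effective_mb_per_s


if __name__ == "__main__":
    eta = estimate_rebuild_eta_seconds(
        data_to_rebuild_mb=1_000_000,
        nodes=4,
        drives_per_node=5,
        drive_mb_per_s=100.0,
        io_load_fraction=0.5,
    )
    print(f"Estimated time to full resiliency: {eta:.0f} s")
```

Even this toy model shows why the estimate has to be recomputed continuously: when the assumed foreground load drops from 50% to zero, the estimate for the same rebuild halves.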
The journey continues…
Hopefully, I have been able to give you a good idea of the exciting new performance- and resilience-focused capabilities that we introduced in AOS 6. Performance and resiliency are two areas of unwavering focus for us, so watch this space for even more exciting innovations in future releases.
This post was authored by Aravindan Gopalakrishnan, Nutanix
© 2021 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.