This post was authored by Yashesh Mankad, Staff Engineer, Nutanix
Nutanix Era is a suite of software that automates and simplifies database management, bringing one-click simplicity to database provisioning and life cycle management (LCM), copy data management (CDM), and database backups. Era also provides simplicity in tracking and limiting access to databases and their copies through a sophisticated role-based access control (RBAC) mechanism.
While Era does not access any data in a customer's databases, it does maintain metadata about those databases and virtual machines, including the database name, size, access credentials, version, and configuration details. Additionally, it maintains metadata about entities like databases, clones, time machines, backups, and so on to help manage the database estate. Era stores all of this metadata in its repository, which is powered by Postgres.
By default, Era's repository runs on a single-instance Postgres database. Customers benefit significantly from this architecture, as it keeps Era's resource footprint small. However, if the Postgres instance goes down, it is possible to lose data if you can’t recover the underlying storage. Because even the slightest risk of data loss can be a show-stopper for Tier 0/1 workloads, Era version 1.2 gives you the option to cluster the repository, scaling the single-instance Postgres database to bolster Era availability.
When you enable the repository clustering option, Era transforms the repository into a three-node Postgres cluster. This distributed architecture is resilient against single-node failures, including data disk failures. Because every committed transaction is replicated synchronously to a second node, the failure of any one node causes no data loss for Era.
As you might expect, this added reliability comes at the cost of some additional cluster resources like CPU and memory. We understand that not all customers may be able to spare these additional resources or have the need for high availability, so Era continues to offer the single-instance solution. The high availability option is there if and when you need it.
Creating the Cluster
Era currently allows customers to provision a Postgres cluster as part of its database-as-a-service offering. (For more information, refer to Provisioning PostgreSQL to be Highly Available and Resilient on Nutanix.) Era uses this same capability to provision a Postgres cluster with synchronous replication (on one replica), only this time the cluster is for its own use: to make the Era repository resilient and highly available.
The challenge arises when we try to use Era's own capabilities while Era itself is going through a major overhaul. In other words, we’re using the Era service to change its own underlying data storage technology while keeping the service up and running for as long as possible. This feat is equivalent to changing the wings on an airplane while in flight.
Era solves this problem by identifying the block of time when the service absolutely needs to go down and minimizing that time interval. Era keeps running all the way up to the point of data migration to the new Postgres cluster and cutover. When the Era service resumes, it is now feeding off the new highly durable Postgres cluster. After you enable High Availability, Era's architecture looks like the image below.
The new architecture has various advantages—in addition to providing better durability and availability, it also scales better. Era uses the Postgres replicas for read-only queries, taking a significant load off the Master. As a result, even under normal operation, the replicas are not sitting idle and their resources are not wasted.
Era Service Availability
With Era's repository now clustered, let’s walk through various failure scenarios and look at how the Era service behaves when nodes in the underlying cluster fail over or come back online.
When all three Postgres nodes in the cluster are up, the cluster is in its Active state: it’s highly durable and can handle node failures. Because the quorum is intact, if a node fails, the cluster can safely promote another node to take over as Master or Sync Replica. If a failover has already occurred, we can perform failbacks and tolerate additional failovers as long as all the nodes in the cluster are up. Era writes through the Master, and reads come from the Sync Replica for improved performance.
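The read and write paths described above can be illustrated with a minimal Python sketch. This is not Era's actual routing code: the node names (`pg1`, `pg2`, `pg3`), the role strings, the table name, and the `route` helper are all hypothetical, and the read-only check is deliberately crude. It simply sends writes to the Master and reads to the Sync Replica, falling back to the Master when no Sync Replica is up.

```python
from typing import Dict


def is_read_only(sql: str) -> bool:
    """Crude illustration: treat statements that start with SELECT as reads."""
    return sql.lstrip().upper().startswith("SELECT")


def route(sql: str, live_roles: Dict[str, str]) -> str:
    """Return the name of the node that should serve this statement.

    live_roles maps node name -> role for the nodes that are currently up.
    Role strings are hypothetical labels, not Era or Postgres identifiers.
    """
    by_role = {role: name for name, role in live_roles.items()}
    if "master" not in by_role:
        raise RuntimeError("no Master available")
    # Reads prefer the Sync Replica; everything else goes to the Master.
    if is_read_only(sql) and "sync_replica" in by_role:
        return by_role["sync_replica"]
    return by_role["master"]


cluster = {"pg1": "master", "pg2": "sync_replica", "pg3": "async_replica"}
print(route("SELECT * FROM clones", cluster))           # served by pg2
print(route("INSERT INTO clones VALUES (1)", cluster))  # served by pg1
```

Falling back to the Master for reads mirrors the behavior described later for the case where no replica is available.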
Degraded Mode (one-node failure)
In this scenario, one node in the cluster fails or goes offline. If the lost node was the Master, the cluster promotes the Sync Replica to Master because its replication state matches the last committed transaction on the previous Master. Because no transactions are lost during this failover, there is no data loss for the customer.
If the failed node was the Sync Replica, the Async Replica gets promoted to Sync Replica. In either case, from Era’s perspective the cluster operates normally: writes go through the current Master and read requests go to the current Sync Replica. However, because a node went down, there is no longer a quorum, so the cluster is running in degraded mode. In other words, the cluster is not prepared to be highly available or durable if another node in the system fails. We cover this double failure scenario next.
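To make the promotion rules concrete, here is a small Python sketch of the role changes after a single node failure. It is an illustration under stated assumptions, not Era's failover implementation: the `fail_node` function, the role constants, and the node names are hypothetical. It also assumes that when the Master fails, the Async Replica moves up to Sync Replica after the old Sync Replica becomes Master, which is how we read the text's mention of a "new Sync Replica".

```python
from typing import Dict

# Hypothetical role labels used only for this illustration.
MASTER, SYNC, ASYNC = "master", "sync_replica", "async_replica"


def fail_node(roles: Dict[str, str], failed: str) -> Dict[str, str]:
    """Return the roles of the surviving nodes after `failed` goes down."""
    lost_role = roles[failed]
    survivors = {name: role for name, role in roles.items() if name != failed}
    promoted = dict(survivors)
    for name, role in survivors.items():
        if lost_role == MASTER and role == SYNC:
            promoted[name] = MASTER   # Sync Replica takes over as Master
        elif lost_role in (MASTER, SYNC) and role == ASYNC:
            promoted[name] = SYNC     # Async Replica fills the Sync slot
    return promoted


cluster = {"pg1": MASTER, "pg2": SYNC, "pg3": ASYNC}
print(fail_node(cluster, "pg1"))  # Master lost
print(fail_node(cluster, "pg2"))  # Sync Replica lost
```

Losing the Async Replica changes no roles in this sketch; the Master and Sync Replica simply carry on without it.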
Degraded Mode (two-node failure)
In a two-node failure, we have already lost one node, and now a second node in the system fails while the original failed node is still offline. If the second node to fail is the Master, the cluster promotes the Sync Replica to Master, as with the first failover. Either way, with both other nodes down, we only have one active node, and its role is Master.
Without any available replicas in the cluster, replication halts and Era can no longer perform durable writes. In this scenario, reads come from the Master, and the system cannot accept writes until one of the replicas returns to service. To safeguard data integrity, Era runs in read-only mode as long as this state persists.
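The three states walked through above can be summarized in a short Python sketch that maps the number of live Postgres nodes to the cluster state and whether writes are accepted. The `service_state` function and its labels are hypothetical and only restate the behavior described in this post; the "unavailable" result for zero live nodes is our assumption, since the post does not cover that case.

```python
from typing import Tuple


def service_state(live_nodes: int) -> Tuple[str, bool]:
    """Map live node count in a three-node cluster to (state, accepts_writes)."""
    if not 0 <= live_nodes <= 3:
        raise ValueError("a three-node cluster has between 0 and 3 live nodes")
    if live_nodes == 3:
        return ("active", True)       # fully redundant, reads and writes
    if live_nodes == 2:
        return ("degraded", True)     # Master plus Sync Replica, still writable
    if live_nodes == 1:
        return ("degraded", False)    # lone Master, read-only
    return ("unavailable", False)     # assumption: not described in the post


for n in (3, 2, 1):
    print(n, service_state(n))
```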
New in the release of Nutanix Era version 1.2, Postgres clustering helps secure your system against unexpected failures. It also helps during planned maintenance and upgrades. The additional resource requirement for clustering is a small price to pay in return for data durability and the performance gain that comes from load balancing queries across replicas. Here’s to a highly durable and available Era!
To find out more about Era’s high availability option and other new capabilities in 1.2, head over to the Era Playlist on Nutanix University.
© 2020 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).