For mission-critical applications in industries like banking, capital markets, insurance and healthcare, there is a need for highly available synchronous data replication to ensure a high level of platform protection and availability. Providing this functionality is key for a business to function unhindered. The cost of data unavailability can be significant, with some estimates measuring the impact at several thousand dollars per minute, reaching six figures or more if the unavailability lasts for an hour or longer. And this doesn’t take into account whether the system can be recovered to the latest dataset right before the failure. With these types of dollar amounts, the inability to bring the system back to its latest data point can have a tremendous impact on the viability of a business. Customers need RPOs of zero to help ensure that, after any incident that takes down a primary site, a secondary disaster recovery site can be brought online quickly with the latest copy of the data.
So without further ado, drum roll please… Nutanix™ Files Storage now supports establishing a metro cluster with our 4.4 release. This is not a drill and you are not seeing things. This isn’t a mirage; it’s here. This capability is built on top of our already mature and customer-validated Metro offering, which has been used for years to protect applications running on our hyperconverged platform. I know what you are thinking: how does it work? Let's dive in!
When a file server is deployed on the Nutanix Enterprise Cloud platform, a set of highly available file server virtual machines (FSVMs) is created and aggregated together to form a single namespace, aka a file server. Behind the scenes, a series of storage entities is created and plumbed to each of the FSVMs. Each of these entities is stored in a storage construct that we call a Container on the Nutanix platform. As clients mount exports or map file shares, the data written to them is stored within this Files-only container. The need to protect each of the entities that make up a file server is where a metro cluster configuration comes in. This protection is accomplished by creating a Metro Availability protection domain.
When Metro is configured, we synchronously replicate the contents of a container on the source to a recovery target cluster running in another data center. With the assistance of a witness, when a failure that makes the primary cluster unavailable is detected, the file server is automatically brought up on the cluster running in the secondary site. A high-level architecture with Metro would look like this:
To set up a Metro cluster, the high-level steps are as follows:
- Log in to the Prism instance managing the Nutanix clusters
- Create the source File Server and shares on Cluster A
- Create a remote site on Cluster A
- Create a remote site on Cluster B
- Configure a Witness to help arbitrate failover (optional)
- Create a container on Cluster B (note: must match source container name)
- From the CLI of one of the CVMs on Cluster A, run the command below. This will create a metro-enabled protection domain, ensure the file server entities are protected and link the clusters in site A and B for replication.
ncli fs protect uuid=<fs_uuid> pd-name=<pd_name> metro-avail=true remote-site-name=<remote_site_name> enable-witness=(true|false)
- That’s it. Fire away and start using the file server. No plexes to understand, no special networking or port connections required!
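If you script this step, it can help to assemble the command string in one place so the placeholders are filled in consistently. The following is a minimal Python sketch that only builds the text of the command shown above; it does not run anything on a CVM. The function name and the example values are hypothetical, and the flags are exactly those in the command above, nothing more.

```python
import shlex


def build_protect_command(fs_uuid: str, pd_name: str,
                          remote_site_name: str, enable_witness: bool) -> str:
    """Return the ncli command from this post as one shell-safe string.

    Hypothetical helper: it only formats text and never contacts a cluster.
    """
    args = [
        "ncli", "fs", "protect",
        f"uuid={fs_uuid}",
        f"pd-name={pd_name}",
        "metro-avail=true",
        f"remote-site-name={remote_site_name}",
        # The witness is optional, so the flag is either true or false.
        f"enable-witness={'true' if enable_witness else 'false'}",
    ]
    # Quote each argument so unusual characters cannot break the shell line.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(build_protect_command("abc-123", "files-metro-pd", "cluster-b", True))
```

The printed line can then be pasted into a CVM session on Cluster A once the real file server UUID, protection domain name and remote site name are known.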
I know, using the command line isn’t ideal. We completely understand, and we are working on getting this workflow moved into our Prism UI, so stay tuned and check back with us. The more detailed steps and any additional requirements are included in our product documentation on the Nutanix customer portal.
With Metro configured, the next question might be “how does an IO work, and how is data written to the file server in the primary site replicated to the secondary cluster?” For those familiar with how the Nutanix data path works for typical application VMs, the data path will be similar for the virtual machines that comprise the Nutanix file server, with the added data write that goes to the secondary cluster. A write IO, for example, would look like this:
- A client mounts a file share running on a file server and issues a write to a file in that share
- The IO lands on the FSVM that owns the file share, which processes the write within Files (leaving out the Files data path here for brevity). The CVM on that node will receive the write request from the FSVM, processing the local write just as it would from any other VM on the cluster. The CVM will issue the number of replica writes defined by the local cluster's replication factor.
- In parallel, a remote write will be issued to a CVM in the paired secondary cluster that owns the standby container as part of the configured Metro relationship. The remote cluster processes the number of replica writes based on its own defined replication factor.
- Once both clusters have acknowledged the write to the AOS layer, an acknowledgement is sent to the File Server. This will then result in a response to the client that their write request has been processed.
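The write path above can be sketched as a small simulation. This is a toy model written for illustration, not the AOS implementation: the `Cluster` class, the `metro_write` function and the in-memory dictionaries standing in for replicas are all assumptions. What it tries to show is the ordering described in the steps: the local and remote writes are issued in parallel, each cluster writes as many copies as its own replication factor, and the acknowledgement is returned only after both clusters have finished.

```python
from concurrent.futures import ThreadPoolExecutor


class Cluster:
    """Toy cluster: one in-memory dict per replica copy (illustrative only)."""

    def __init__(self, name: str, replication_factor: int):
        self.name = name
        self.replicas = [dict() for _ in range(replication_factor)]

    def write(self, key: str, data: bytes) -> bool:
        # Write one copy per replica, as set by this cluster's replication factor.
        for replica in self.replicas:
            replica[key] = data
        return True  # acknowledgement from this cluster


def metro_write(primary: Cluster, secondary: Cluster, key: str, data: bytes) -> bool:
    """Issue the local and remote writes in parallel; ack only when both succeed."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        local_ack = pool.submit(primary.write, key, data)
        remote_ack = pool.submit(secondary.write, key, data)
        # result() blocks, so the caller is answered only after both acks arrive.
        return local_ack.result() and remote_ack.result()


if __name__ == "__main__":
    site_a = Cluster("Cluster A", replication_factor=2)
    site_b = Cluster("Cluster B", replication_factor=3)
    if metro_write(site_a, site_b, "share/report.txt", b"hello"):
        print("write acknowledged to the file server")
```

Because the client is answered only after both sites hold the data, a failure of the primary at any point after the acknowledgement leaves the secondary with the same write, which is what makes an RPO of zero possible in this model.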
In the above you may have noticed a reference to a “standby container”. In the case of the secondary cluster, the container that is the recipient of replicated data is considered a “standby”. It will appear as available in the Prism UI, but any read or write request that may be sent to it (outside of incoming replica writes from the primary) will be forwarded to the “active” container running on the primary cluster. Upon an event requiring a failover to the secondary site, the standby container will become “active”, and the file server is brought online and begins accepting reads and writes. Once the issues affecting the original primary site are resolved, replication would be reversed and a failback processed to return the configuration to its desired normal operating state.
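To make the active and standby roles concrete, here is a small Python model of the behavior just described. It is a sketch under stated assumptions rather than how Metro is actually built: the `MetroContainer` class, the `"offline"` role and the `pair`, `failover` and `resync` helpers are all hypothetical names invented for this example.

```python
class MetroContainer:
    """Toy container with a role of "active", "standby" or "offline"."""

    def __init__(self, name: str, role: str):
        self.name = name
        self.role = role
        self.peer = None
        self.data = {}

    def write(self, key: str, value: str) -> str:
        if self.role == "standby":
            # A standby forwards client IO to the active container.
            return self.peer.write(key, value)
        self.data[key] = value
        if self.peer is not None and self.peer.role == "standby":
            # Synchronous replica write to the standby copy.
            self.peer.data[key] = value
        return self.name  # name of the container that served the write

    def read(self, key: str) -> str:
        if self.role == "standby":
            return self.peer.read(key)
        return self.data[key]


def pair(a: MetroContainer, b: MetroContainer) -> None:
    a.peer, b.peer = b, a


def failover(standby: MetroContainer) -> None:
    """Promote the standby after the other site becomes unavailable."""
    standby.peer.role = "offline"
    standby.role = "active"


def resync(recovered: MetroContainer) -> None:
    """Reverse replication: the recovered site catches up and becomes standby."""
    recovered.data = dict(recovered.peer.data)
    recovered.role = "standby"


if __name__ == "__main__":
    site_a = MetroContainer("A", "active")
    site_b = MetroContainer("B", "standby")
    pair(site_a, site_b)
    print(site_b.write("doc", "v1"))  # forwarded, served by A
    failover(site_b)
    print(site_b.write("doc", "v2"))  # served by B after failover
    resync(site_a)
    print(site_a.read("doc"))         # A is now standby and forwards to B
```

A failback to the original layout would be a second role swap in the other direction once the recovered site is back in sync; it is left out here to keep the sketch short.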
There is more to all of this, but we hope this short walkthrough gives you a good idea of how it all works. For more information, check out our documentation on Metro Availability and Files on the Nutanix Customer portal. In addition, the chart below highlights where this fits into our overall data management for replication or synchronization on Nutanix Files Storage:
As you can see from the above, Nutanix Files Storage is well positioned to provide valuable data services that ensure high availability and recoverability for your unstructured data storage needs. Don’t hesitate to reach out to your local Nutanix or Partner account team! I am sure they would love to walk you through what is possible in more detail. For more information on this or anything else with our Unified Storage platform, check out the following:
Nutanix Unified Storage - https://www.nutanix.com/solutions/unified-storage
Nutanix Files Storage Test Drive - https://www.nutanix.com/one-platform?type=tddata