Solved

Data rebuild difference between node removal and node failure

  • 1 December 2019
  • 5 replies
  • 2825 views

Badge +6

Hi

 

I have a question about data resilience in a Nutanix cluster, specifically about how data is rebuilt in two scenarios.

  1. When a node breaks or fails, the data is rebuilt first: the node is detached from the ring and I can see tasks for removing the node/disks from the cluster. The whole process takes only several minutes to half an hour, so the cluster's data resilience is restored quickly.
  2. When I want to remove a node from the cluster, the data is also rebuilt onto the other nodes, but in this case it takes several hours or even a day to restore data resilience.

It seems that removing a node also rebuilds other data, such as Curator and Cassandra metadata. But does that explain why it takes so long? How much additional data is moved, and what is the difference in user data resilience for the cluster?


Best answer by Mutahir 2 December 2019, 15:04


This topic has been closed for comments

5 replies

Badge +6

Hi @Mutahir 

 

My other concern is an unplanned node removal.

I have seen this in a real environment: the node had already broken and data resilience had already been restored. The user reinstalled the node with Phoenix after replacing the SATADOM, but mistakenly chose the "Install and configure Hypervisor and CVM (wipe data)" option, so the node was re-initialized and had to be removed from the cluster and added back.

Accordingly, the data had already been rebuilt in the cluster and I expected the removal to be very quick, but in the end the removal process took about 20 hours to finish.

Do you know why it took so long? Does it need to go through the same process as a planned node removal?

 

Regards

Userlevel 3
Badge +4


 

Hi @chenzh4 

Adding to what @Alona had mentioned above:

I understand that you want to know why a planned node removal takes time and why it is faster when a node fails unexpectedly.

 

Unplanned Node / Host Failure:

When a node (physical host) fails (for example, a power cut for that host, or a hardware failure taking the host offline), which is considered a critical failure in the cluster, a “Curator scan” will kick in at the highest priority to re-balance the cluster and ensure all data has two or three copies (to honour whichever replication factor was configured).

When does a rebuild begin?
When there is an unplanned failure (in some cases we will proactively take things offline if they aren't working correctly), we begin the rebuild process immediately.

We can do this because of:

a) the granularity of our metadata

b) the ability to choose peers for write RF dynamically (while there is a failure, all new data (e.g. new writes / overwrites) maintains its configured redundancy), and

c) the ability to handle things coming back online during a rebuild and re-admit the data once it has been validated.

A Curator scan will find the data previously hosted on the node and its respective replicas. Once the replicas are found, all nodes will participate in the re-protection.

In the event where the node remains down for a prolonged period of time (30 minutes as of 4.6), the down CVM will be removed from the metadata ring.  It will be joined back into the ring after it has been up and stable for a duration of time.
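
To make the "rebuild starts immediately and new writes keep their RF" idea concrete, here is a minimal Python sketch. This is purely my own illustration, not Nutanix code: the node names, extent IDs and data structures are hypothetical. New writes simply pick healthy peers, and a scan re-protects any extent whose live replica count dropped.

```python
import random

RF = 2  # configured replication factor (hypothetical cluster, RF=2)

class Cluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)   # currently healthy nodes
        self.replicas = {}        # extent_id -> set of nodes holding a copy

    def write(self, extent_id):
        # new writes pick RF healthy peers dynamically; a failed node is simply
        # never a candidate, so new data keeps its configured redundancy
        self.replicas[extent_id] = set(random.sample(sorted(self.nodes), RF))

    def fail_node(self, node):
        self.nodes.discard(node)  # unplanned failure: copies on it are unreachable

    def scan_and_reprotect(self):
        # the "scan" walks the metadata, finds extents whose live replica count
        # dropped below RF, and adds copies on other healthy nodes
        for extent_id, holders in self.replicas.items():
            live = holders & self.nodes
            while len(live) < RF and (self.nodes - live):
                live.add(random.choice(sorted(self.nodes - live)))
            self.replicas[extent_id] = live

cluster = Cluster(["A", "B", "C", "D"])
for i in range(8):
    cluster.write(f"extent-{i}")

cluster.fail_node("C")          # unplanned node failure
cluster.scan_and_reprotect()    # re-protection starts immediately
assert all(len(nodes) >= RF for nodes in cluster.replicas.values())
print("all extents are back to RF =", RF)
```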

 

Planned Activity - Node Removal:

When we do a planned node removal from a running Nutanix cluster, it takes time because cluster operations, resiliency, incoming IO and performance are given priority. A Curator scan will run and perform the removal on a per-disk basis, ensuring the data on each disk is available elsewhere in the cluster before marking that disk ready to be removed. The duration also depends on the size of the disks.

Removing a host automatically removes all the disks in that host. Only one host can be removed at a time. If you want to remove multiple hosts, you must wait until the first host is removed completely before attempting to remove the next host.
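
As a rough, purely illustrative sketch of that per-disk drain (again not Nutanix code; the disk names, extent IDs and placement map are made up): a disk is only marked ready once every extent it holds has enough replicas elsewhere.

```python
from dataclasses import dataclass

RF = 2  # hypothetical replication factor

@dataclass
class Disk:
    disk_id: str
    extents: list  # extent ids resident on this disk

# toy placement map: extent id -> list of disks currently holding a replica
placement = {
    "e1": ["nodeA-ssd0", "nodeB-ssd0"],
    "e2": ["nodeA-hdd1", "nodeC-hdd0"],
}

def replicas_elsewhere(extent_id, disk_id):
    """Count copies of an extent that live on disks other than the departing one."""
    return sum(1 for d in placement.get(extent_id, []) if d != disk_id)

def migrate(extent_id):
    """Pretend to copy one extent to some other disk in the cluster."""
    placement[extent_id].append("nodeD-hdd0")

def remove_node(disks):
    """Drain the departing node disk by disk; a disk is only marked ready once
    every extent on it already has RF copies elsewhere in the cluster."""
    for disk in disks:
        for extent_id in disk.extents:
            while replicas_elsewhere(extent_id, disk.disk_id) < RF:
                migrate(extent_id)
        print(f"{disk.disk_id}: data available elsewhere, marked ready to remove")

remove_node([Disk("nodeA-ssd0", ["e1"]), Disk("nodeA-hdd1", ["e2"])])
```

A back-of-the-envelope calculation (with invented numbers) shows why this is measured in hours: if the departing node holds, say, 8 TB of resident data and background replication is throttled so that user IO keeps priority (assume roughly 500 MB/s effective), the drain alone is about 8,000,000 MB / 500 MB/s ≈ 16,000 s, or around 4.5 hours.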

Please also go through the node removal link provided by @Alona.

You can also read more about Data Path Resiliency

Hope this helps

BR

Userlevel 6
Badge +5

I see, thank you for the clarification. Generally, node removal takes some time. The amount of time it takes for the node to complete the eviction process varies greatly depending on the number of IOPS and how hot the data is in the OpLog. The OpLog data is replicated at the time of the initial write; however, a node cannot be evicted until the OpLog data is flushed to the extent store.

You mentioned that it takes several hours, which sounds quite plausible.

More on the OpLog from Nutanix Bible:

OpLog

  • Key Role: Persistent write buffer
  • Description: The OpLog is similar to a filesystem journal and is built as a staging area to handle bursts of random writes, coalesce them, and then sequentially drain the data to the extent store.  Upon a write, the OpLog is synchronously replicated to another n number of CVM’s OpLog before the write is acknowledged for data availability purposes.  All CVM OpLogs partake in the replication and are dynamically chosen based upon load.  The OpLog is stored on the SSD tier on the CVM to provide extremely fast write I/O performance, especially for random I/O workloads. All SSD devices participate and handle a portion of OpLog storage. For sequential workloads, the OpLog is bypassed and the writes go directly to the extent store.  If data is currently sitting in the OpLog and has not been drained, all read requests will be directly fulfilled from the OpLog until they have been drained, where they would then be served by the extent store/unified cache.  For containers where fingerprinting (aka Dedupe) has been enabled, all write I/Os will be fingerprinted using a hashing scheme allowing them to be deduplicated based upon fingerprint in the unified cache.
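
Putting that description into a tiny Python model (just an illustration of the behaviour described above, not Nutanix code; all names, offsets and sizes are hypothetical): writes are staged and synchronously replicated before the ack, reads for undrained data come from the buffer, and the "ready for eviction" check is simply "is the OpLog empty?".

```python
class OpLog:
    def __init__(self):
        self.buffer = {}  # offset -> latest data (overwrites are coalesced here)

    def write(self, offset, data, replicate_to):
        self.buffer[offset] = data        # stage the random write in the buffer
        for peer in replicate_to:         # synchronous replication to peer OpLogs
            peer.buffer[offset] = data    # ...before the write is acknowledged
        return "ack"

    def read(self, offset, extent_store):
        # undrained data is served from the OpLog, otherwise from the extent store
        return self.buffer.get(offset, extent_store.get(offset))

    def drain(self, extent_store):
        # flush the coalesced data sequentially (sorted by offset) to the extent store
        for offset in sorted(self.buffer):
            extent_store[offset] = self.buffer[offset]
        self.buffer.clear()

    def ready_for_eviction(self):
        return not self.buffer            # eviction waits until the OpLog is empty

local, peer = OpLog(), OpLog()
extent_store = {}

local.write(4096, b"random write", replicate_to=[peer])
print(local.read(4096, extent_store))     # served from the OpLog (not drained yet)
print(local.ready_for_eviction())         # False: data still sitting in the OpLog
local.drain(extent_store)
print(local.ready_for_eviction())         # True: everything flushed to the extent store
```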

 

Let me know if that answers your question.

Badge +6

@Alona 

 

Thanks for your information.

What I mean is: when a node or CVM fails, the data is migrated to other nodes and kept at RF=2/RF=3 status. This process does not take long (several minutes, or less than an hour), after which data resilience is restored to OK.

But after that, the data should remain at RF=2/RF=3. If at this point I want to remove the node from the cluster (Prism > Hardware > Diagram > Remove Node), then according to the data resilience status the cluster is already restored and the removal should be very short. Yet in real environment operation, the removal process takes several hours.

My concern is that the data has already been migrated by Stargate during the node/CVM failure process. Why does removing the node still take so long? Is any other data additionally removed?

Userlevel 6
Badge +5

@chenzh4 

If my understanding is correct you are trying to determine the difference in cluster behavior between a node failure and a node eviction from the cluster in terms of the time it takes to restore data resiliency as well as impact to users. Please let me know if I misunderstood.

Both scenarios are explained in the Prism Web Console Guide - CVM and host failure and the Prism Web Console Guide - Remove a node from a cluster.

In terms of impact to users, when a CVM fails on a node a slight spike in latency may be observed as the storage data service role is transferred to another CVM.

Similarly, when a host fails and HA is configured, VMs will be restarted on a healthy host, which users may notice as well.

When preparing a node for eviction from a cluster, VMs are migrated off the host first, so no user impact is expected. In addition, data migration takes place as part of preparing the node for eviction.

Please let me know if that helps with understanding the two processes.