chenzh4 wrote:
What I mean is: when a node or CVM fails, the data is migrated to other nodes so that it keeps its RF=2/RF=3 status. This process does not take long (several minutes to less than an hour), after which the data resilience status is restored to OK.
But after that, the data should still be at RF=2/RF=3. If I then remove the node from the cluster (Prism > Hardware > Diagram > Remove Node), the cluster has already been restored according to the data resilience status, so the removal should be very quick. In a real environment, however, the removal takes several hours.
My concern is that the data was already migrated by Stargate during the node/CVM failure. Why does removing the node still take so long? Is any additional data being moved?
Hi @chenzh4
Adding to what @Alona had mentioned above:
I understand that you want to know why a planned node removal takes time, and why it is faster when a node fails unexpectedly (unplanned)?
Unplanned Node / Host Failure:
When a node (physical host) fails (e.g. a power cut to that host, or a hardware failure taking the host offline), it is considered a critical failure in the cluster, and a Curator scan will kick in at the highest priority to re-balance the cluster and ensure all data has two or three copies (to honour whichever replication factor was configured).
When does a rebuild begin?
When there is an unplanned failure (in some cases we will proactively take things offline if they aren't working correctly), we begin the rebuild process immediately.
We can do this because of:
a) the granularity of our metadata,
b) our ability to choose peers for write RF dynamically (while there is a failure, all new data, e.g. new writes / overwrites, maintains its configured redundancy; see the sketch after this list), and
c) our ability to handle things coming back online during a rebuild and re-admit the data once it has been validated.
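As a rough illustration of (b), here is a minimal sketch of dynamic peer selection, assuming all we know is which nodes are currently healthy (the function and node names are hypothetical, not Stargate internals):

```python
# Illustrative sketch only: while a node is down, new writes simply pick
# their replica peers from the remaining healthy nodes, so incoming IO never
# drops below its configured replication factor.

import random

def pick_write_peers(nodes, failed, rf):
    """Choose rf replica targets from the nodes that are currently healthy."""
    healthy = [n for n in nodes if n not in failed]
    if len(healthy) < rf:
        raise RuntimeError("not enough healthy nodes to honour RF")
    return random.sample(healthy, rf)

# With node C down, a new RF=2 write still gets two healthy replica targets:
print(pick_write_peers(["A", "B", "C", "D"], failed={"C"}, rf=2))
```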
A Curator scan will find the data previously hosted on the failed node and its respective replicas. Once the replicas are found, all nodes will participate in the re-protection.
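Here is a minimal sketch of that idea, assuming a simple extent-to-replica map (the names and the round-robin placement are illustrative, not actual Curator internals):

```python
# Illustrative sketch only: find extents whose replica count dropped below
# RF after a node failure, then fan the copy work out across every surviving
# node rather than funnelling it through a single "rebuild server".

def plan_reprotection(extent_replicas, failed_node, rf, healthy_nodes):
    """Return (extent, source, target) copy tasks for under-replicated extents."""
    tasks = []
    for extent, replicas in extent_replicas.items():
        alive = [n for n in replicas if n != failed_node]
        if len(alive) < rf:
            source = alive[0]                                  # any surviving replica
            candidates = [n for n in healthy_nodes if n not in alive]
            target = candidates[len(tasks) % len(candidates)]  # round-robin fan-out
            tasks.append((extent, source, target))
    return tasks

extents = {"e1": ["A", "C"], "e2": ["B", "C"], "e3": ["A", "B"]}
for task in plan_reprotection(extents, failed_node="C", rf=2,
                              healthy_nodes=["A", "B", "D"]):
    print(task)  # e1 and e2 lost their replica on C and get re-protected
```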
In the event that the node remains down for a prolonged period of time (30 minutes as of 4.6), the down CVM will be removed from the metadata ring. It will be joined back into the ring after it has been up and stable for a period of time.
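To make the timing concrete, here is a toy sketch of that detach/re-admit behaviour; the 30-minute detach threshold comes from the text above, while the re-admission window is an assumption for illustration:

```python
# Illustrative sketch only: a CVM that stays down past a threshold is
# detached from the metadata ring, and must be back up and stable for a
# while before it is admitted again.

DETACH_AFTER_S = 30 * 60   # down-time before removal from the ring (as of 4.6)
STABLE_FOR_S   = 30 * 60   # assumed stability window; the real value may differ

def ring_action(cvm_state, seconds_in_state):
    if cvm_state == "down" and seconds_in_state >= DETACH_AFTER_S:
        return "detach from metadata ring"
    if cvm_state == "up" and seconds_in_state >= STABLE_FOR_S:
        return "re-admit to metadata ring"
    return "wait"

print(ring_action("down", 45 * 60))  # -> detach from metadata ring
print(ring_action("up", 10 * 60))    # -> wait (not stable long enough yet)
```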
Planned Activity - Node Removal:
When we do a planned node removal from a running Nutanix cluster, it takes time because:
a) cluster operations / resiliency / incoming IO / performance are given priority,
b) the Curator scan runs the removal on a per-disk basis, ensuring the data on each disk is available elsewhere in the cluster before marking the disk ready for removal, and
c) the time also depends on the size of the disks.
Removing a host automatically removes all the disks in that host. Only one host can be removed at a time. If you want to remove multiple hosts, you must wait until the first host is removed completely before attempting to remove the next host.
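To see why the per-disk drain dominates the removal time, here is a minimal sketch assuming the same kind of extent-to-replica map as above (the names and RF value are illustrative):

```python
# Illustrative sketch only: disks on the leaving node are drained one at a
# time, and a disk is only marked removable once every extent on it is
# confirmed to have its full RF on the remaining nodes.

RF = 2

def drain_disk(disk_extents, replica_map, remaining_nodes):
    """Plan copy tasks so each extent keeps RF copies outside the leaving node."""
    tasks = []
    for extent in disk_extents:
        survivors = [n for n in replica_map[extent] if n in remaining_nodes]
        if len(survivors) < RF:
            target = next(n for n in remaining_nodes if n not in survivors)
            tasks.append((extent, survivors[0], target))  # copy before removal
    return tasks

# Node C is leaving; its two disks are drained sequentially, not in parallel.
replica_map = {"e1": ["A", "C"], "e2": ["C", "D"], "e3": ["A", "B"]}
for disk in (["e1", "e2"], ["e3"]):
    print("drain tasks:", drain_disk(disk, replica_map, ["A", "B", "D"]))
```

Because this work runs at a lower priority than user IO and resiliency checks, and walks the node disk by disk, a planned removal can take hours even though the cluster already reports its resilience as OK.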
Please also go through the node removal link provided by @Alona.
You can also read more about Data Path Resiliency
Hope this helps
BR