Upgrading Node/Cluster Capacity One Drive at a Time

  • 11 May 2022
  • 1 reply

Hello fellow Nutanix fans,

We have a three-node cluster in a DR location. This DR cluster has less capacity than the primary site, which has resulted in us exceeding the resilient capacity of the DR cluster.

To remedy this situation, we decided to replace four 8TB HDDs with four 12TB HDDs in each node. 
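For anyone following along, the raw capacity this adds can be sketched as below. This is only a rough calculation: it assumes decimal terabytes and ignores replication factor and any filesystem or metadata overhead, so the usable gain will be lower. The variable names are just for illustration.

```python
# Raw capacity gained by swapping four 8TB HDDs for four 12TB HDDs per node.
# Assumptions: decimal TB, no allowance for replication factor or overhead.
NODES = 3
DRIVES_REPLACED_PER_NODE = 4
OLD_DRIVE_TB = 8
NEW_DRIVE_TB = 12

gain_per_node_tb = DRIVES_REPLACED_PER_NODE * (NEW_DRIVE_TB - OLD_DRIVE_TB)
gain_cluster_tb = NODES * gain_per_node_tb

print(f"Raw gain per node: {gain_per_node_tb} TB")
print(f"Raw gain for the cluster: {gain_cluster_tb} TB")
```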

Today we removed one drive from a node in Prism UI and waited approximately four hours before the system reported it was safe to remove the drive. During this time, Prism reported that the amount of data on the drive was decreasing. The cluster showed about 700MB/s of throughput during this rebuild, which seems really good for a cluster of this size.

We removed the 8TB HDD and inserted the 12TB HDD. The rebuild did not initiate until we ran NCC. By default, NCC runs every 24 hours, so the rebuild would eventually have started on its own, but we wanted to see it get started.

We watched the rebuild run for a while after it got underway. The system is writing around 80MB/s to the replacement drive. At this rate (roughly 288GB/hour), it should take about 17 hours to rebuild the roughly 4.5TiB of data, which means we can only do one drive replacement per day.
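A quick sanity check of that estimate, as a minimal sketch. It assumes the write rate stays constant, that MB means 10^6 bytes and TiB means 2^40 bytes, and that all twelve drives (three nodes, four drives each) are done one per day.

```python
# Rough rebuild-time estimate from the figures in this post.
# Assumptions: constant write rate, MB = 10**6 bytes, TiB = 2**40 bytes.
WRITE_RATE_MB_S = 80
DATA_TIB = 4.5
DRIVES_TO_REPLACE = 3 * 4  # three nodes, four drives each

bytes_per_hour = WRITE_RATE_MB_S * 10**6 * 3600
gb_per_hour = bytes_per_hour / 10**9
rebuild_hours = DATA_TIB * 2**40 / bytes_per_hour

print(f"Write rate: {gb_per_hour:.0f} GB/hour")
print(f"Rebuild time per drive: {rebuild_hours:.1f} hours")
print(f"Days at one drive per day: {DRIVES_TO_REPLACE}")
```

The graceful removal step (about four hours in our case) comes on top of the rebuild time for each drive.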

We have a few questions we are hoping someone can answer.

  1. We haven’t been very successful in finding good information on expanding capacity by replacing existing drives with larger ones. Is there a good reference out there?
  2. Are we better off performing the graceful drive removal, which takes around four hours, or can we just remove a drive and replace it with the larger one? Is one method safer than the other?
  3. The specs of the 12TB drive state it can sustain a transfer rate of roughly 240MB/s. Why is the drive rebuild only pushing a third of this potential bandwidth? 
  4. We believe that since we are on AOS 5.20, once all the drives in a node have been replaced, the node will provide the full capacity of the new drives to the cluster. Is that correct?
  5. Is there a better way for us to be upgrading the capacity of this cluster given our decision to simply replace drives?

Thanks for reading and replying.



Best answer by gabeo 8 June 2022, 08:33





Doing the slow removal and replacement is the safer way.

Sure, you can yank a disk and just replace it, but during that process your data is not resilient. When you pull the disk, the system will alert and begin “healing.” During this time, you are at a higher risk of data loss should another disk or node fail. It may save you time, but you will spend hours at an elevated risk, whereas if you do the slow removal and addition, the cluster stays in a resilient state throughout.