I have a concern about the number of nodes in a Nutanix cluster.
According to the license limits,
the Starter license only supports RF=2 and has a 12-node limit.
The Pro and Ultimate licenses support RF=3 and have no node limit per cluster.
In production environments, even if we set RF=3, that only means no data is lost when 2 nodes crash at the same time. But if the Nutanix cluster is very large, there is still a real risk of losing more than 2 nodes at the same time.
So is there a best practice for how many nodes a large Nutanix cluster should have to keep the risk low?
We could then divide a large number of nodes into several clusters and manage them with Prism Central.
Best answer by mmcghee
I think cluster size recommendations will ultimately depend on your specific environment and application RPO and RTO requirements. Here are some of the questions I would ask to help in making a decision. How large of a failure domain are you creating with a single large cluster for a given application? Is the data in the cluster protected elsewhere via backup or disaster recovery mechanisms should there be a simultaneous failure? How long would it take to recover the application if a particular cluster was unavailable? Operationally, how much overhead does managing multiple smaller clusters add to your organization?
The benefits of a large cluster are simplified management and better efficiency given a large pool of shared resources. But like you mentioned, Prism Central can be used to help simplify management where multiple clusters exist. Nutanix is also very flexible when it comes to adding and removing nodes/blocks from clusters. I can cite examples where a customer will move blocks between clusters based on resource utilization. This capability helps to mitigate resource imbalances or shortfalls where smaller clusters are used. You can also mix RF2 and RF3 containers in the same cluster, so there's flexibility there as well if storage space and resiliency need tuning based on application needs.
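To make the space versus resiliency trade-off between RF2 and RF3 containers concrete, here is a rough sketch. It assumes usable capacity is simply raw capacity divided by the replication factor, and it ignores CVM and metadata overhead, compression, deduplication and erasure coding. The 120 TiB figure and the function name are made up for illustration.

```python
def usable_tib(raw_tib: float, rf: int) -> float:
    """Rough usable capacity, assuming every write is stored rf times."""
    if rf not in (2, 3):
        raise ValueError("rf must be 2 or 3")
    return raw_tib / rf


raw = 120.0  # hypothetical raw capacity in TiB
for rf in (2, 3):
    # RF2 tolerates 1 simultaneous failure, RF3 tolerates 2
    print(f"RF{rf}: ~{usable_tib(raw, rf):.0f} TiB usable, "
          f"tolerates {rf - 1} simultaneous failure(s)")
```

Real numbers will differ, so treat this only as a starting point for deciding which workloads justify an RF3 container.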
In my opinion it would be difficult to give an exact best practice on cluster size. On the high end, Nutanix has many customers that have cluster sizes of 32 nodes (and larger) running RF2 or RF3. I've seen some recommendations that if you're over 32 nodes you should consider RF3, but we also have customers who use RF3 at smaller cluster sizes. Keep in mind that a Nutanix cluster will immediately heal following a drive or node failure and again be fully redundant and capable of sustaining a subsequent failure. Large clusters heal quickly as all drives and nodes participate in the rebuild.
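If you want a rough feel for how the risk of overlapping failures grows with cluster size, a simple binomial model can help. This is only a sketch: it assumes node failures are independent, and p_down (the chance that a given node is down during one rebuild window) is a made-up value, not a Nutanix figure. It also does not model the fact that larger clusters rebuild faster, which shortens that window.

```python
from math import comb


def prob_more_than_k_down(n_nodes: int, p_down: float, k: int) -> float:
    """P(more than k of n_nodes are down at once), assuming independent failures."""
    # Sum the binomial tail directly to avoid cancellation for tiny probabilities
    return sum(
        comb(n_nodes, i) * p_down**i * (1 - p_down) ** (n_nodes - i)
        for i in range(k + 1, n_nodes + 1)
    )


P_DOWN = 0.001  # hypothetical per-node probability of being down in one rebuild window

for n in (8, 16, 32, 64):
    rf2_risk = prob_more_than_k_down(n, P_DOWN, 1)  # RF2 tolerates 1 failure
    rf3_risk = prob_more_than_k_down(n, P_DOWN, 2)  # RF3 tolerates 2 failures
    print(f"{n:>3} nodes: P(>1 down)={rf2_risk:.2e}  P(>2 down)={rf3_risk:.2e}")
```

Under these assumptions the risk rises with node count for both settings, while RF3 stays well below RF2 at the same size. Plug in your own failure rate and rebuild time before drawing conclusions.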
I know this isn't an exact answer, but hopefully it's some good food for thought. Hopefully others will answer with their experience.