Question

How Nutanix works!

  • 29 November 2019
  • 2 replies
  • 2044 views


Hi all,

I have some questions that I’m trying to answer but … ;)

So I'd appreciate it if you could explain them or point me to the relevant resources.

 

# Questions

  1. Is it recommended or mandatory to configure containers as ReplicationFactor-3 when the cluster is RedundancyFactor-3?
  2. In case of ReplicationFactor-3, when reading, how many checks are done to validate data correctness?
  3. In RedundancyFactor-2, only 1 failure is tolerated, and the cluster will still work with (e.g.) 2 Zookeeper nodes. In RedundancyFactor-3 there are 5 Zookeeper nodes, so why can’t we tolerate up to 3 failures?
  4. What are the limitations that make it impossible to migrate VMs between containers without the export/import method?
  5. How will the cluster behave in case of a network separation issue (e.g. 4 nodes can communicate with each other, and the other 4 can too)?
  6. If I have 2 guest VMs in the same VLAN, will they communicate through the OVS br0, or will the traffic go out to the external switch and come back to the cluster?
  7. With the bond0 (br0.up) interface having 2 links to 2 physical switches that are interconnected, will a loop be formed in balance-slb mode?
  8. Can SED-based DARE use the Nutanix native KMS?
  9. Nutanix ILM: will data be moved to tier1 while there is space in tier0? When and how is data considered cold?

2 replies


Hi @Sam41 

Thank you for posting your questions. I have tried to answer your questions and provided relevant nutanixbible sections for further reading and understanding. 


1) Is it recommended or mandatory to configure containers as ReplicationFactor-3 when the cluster is RedundancyFactor-3?

With Redundancy Factor 3, you can have Replication Factor 2 or Replication Factor 3 containers.
Redundancy Factor = metadata availability / copies
Replication Factor = data availability / copies

This is a Tech-Topx video on Redundancy factor and Replication factor :


2) In case of ReplicationFactor-3, when reading, how many checks are done to validate data correctness?

Any time the data is read, the checksum is computed to ensure the data is valid.  In the event where the checksum and data don’t match, the replica of the data will be read and will replace the non-valid copy.

Data is also consistently monitored to ensure integrity even when active I/O isn't occurring. Stargate's scrubber operation will consistently scan through extent groups and perform checksum validation when disks aren't heavily utilised. This protects against things like bit rot or corrupted sectors.

Data Protection (section at nutanixbible.com)
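
The read-and-repair behaviour described above can be sketched roughly as follows. This is only an illustration, not Nutanix code: the use of CRC32 and the names `checksum` and `read_with_repair` are assumptions made for the example, since the thread does not say which checksum algorithm is used.

```python
import zlib
from typing import List


def checksum(data: bytes) -> int:
    # CRC32 is an assumption for illustration; the real algorithm is not stated here.
    return zlib.crc32(data)


def read_with_repair(replicas: List[bytearray], expected: int) -> bytes:
    """Return data from the first replica whose checksum matches `expected`,
    and overwrite any replica whose checksum does not match."""
    good = None
    for replica in replicas:
        if checksum(bytes(replica)) == expected:
            good = bytes(replica)
            break
    if good is None:
        raise IOError("no valid replica found")
    for replica in replicas:
        if checksum(bytes(replica)) != expected:
            replica[:] = good  # replace the non-valid copy with a valid one
    return good


payload = b"guest vm block"
copies = [bytearray(b"guest vm blocX"), bytearray(payload), bytearray(payload)]
data = read_with_repair(copies, checksum(payload))
```

In this sketch the first copy is corrupted, so the read is served from the second copy and the first one is rewritten with valid data.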

 

3) In RedundancyFactor-2, only 1 failure is tolerated, and the cluster will still work with (e.g.) 2 Zookeeper nodes. In RedundancyFactor-3 there are 5 Zookeeper nodes, so why can’t we tolerate up to 3 failures?

Redundancy Factor is the number of metadata copies, while Replication Factor is the number of data copies across the cluster, which can be 2 (Replication Factor 2) or 3 (Replication Factor 3).

Fault tolerance for metadata (Redundancy Factor) requires a quorum of the configured nodes to agree: 2 out of 3 with Redundancy Factor 2, or 3 out of 5 with Redundancy Factor 3. Because 3 of the 5 nodes must still be available in a Redundancy Factor 3 configuration, at most 2 failures can be tolerated, not 3.

Scalable Metadata
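
The quorum arithmetic behind this answer can be checked with a small sketch. It assumes a plain majority quorum; the function names `quorum_size` and `tolerated_failures` are made up for the illustration.

```python
def quorum_size(nodes: int) -> int:
    """Smallest strict majority of `nodes` voters."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many voters can fail while a majority is still available."""
    return nodes - quorum_size(nodes)


for n in (3, 5):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

With 3 nodes the quorum is 2 and one failure is tolerated. With 5 nodes the quorum is 3 and two failures are tolerated, which is why a third failure cannot be absorbed.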

 

4) What are the limitations that make it impossible to migrate VMs between containers without the export/import method?

Do you mean moving a VM / vDisk between containers on the same Acropolis cluster?
You can leverage the acli (Acropolis CLI) to move a virtual machine / disk to a different container, or to make an image of that VM.

AHV - vDisk Management


5) How will the cluster behave in case of a network separation issue (e.g. 4 nodes can communicate with each other, and the other 4 can too)?

Do you mean an 8-node cluster where 4 nodes are unable to communicate with the other 4 nodes due to network isolation? In that case the cluster will be down.

Depending on the Redundancy Factor, you can lose one node or two nodes.

The network must be healthy for any distributed system to work reliably. If there is a network failure between some nodes, there is a certain time threshold (which I cannot recall now 🙂) after which the affected CVMs / nodes are detached from the metadata ring.

Network failure: if all nodes in a cluster experience a network failure, the cluster will be down, and the VMs will be down as well.

Do note that all CVMs in a cluster perform network checks at very frequent intervals to ensure they have the network health required to maintain integrity and reliability.

Some failure scenarios

6) If I have 2 guest VMs in the same VLAN, will they communicate through the OVS br0, or will the traffic go out to the external switch and come back to the cluster?

If you have two VMs on the same AHV host, on the same network and in the same VLAN, their traffic won't hit the TOR switch. You can even create a network via Prism with a unique VLAN number and attach two or three virtual machines to it (provided they are on the same host), and they will communicate without leaving the host.

If you have two or more VMs on different hosts in a cluster with the same network membership, their traffic will of course go out to their respective switches in order to reach the other host. This is also the case with other hypervisors / vSwitches.

7) With the bond0 (br0.up) interface having 2 links to 2 physical switches that are interconnected, will a loop be formed in balance-slb mode?

Nutanix does not recommend balance-slb due to the multicast traffic caveats. To utilise the bandwidth of multiple links, consider using link aggregation with LACP and balance-tcp instead of balance-slb.

AHV Networking Best practices Guide

AHV networking - Bond modes

8) Can SED-based DARE use the Nutanix native KMS?
HW DARE (SED drives) requires a third-party KMS. SW DARE can use a third-party KMS (referred to as External Key Manager, EKM) or the Nutanix built-in KMS (referred to as Local Key Manager, LKM).

Data at rest encryption

9) Nutanix ILM: will data be moved to tier1 while there is space in tier0? When and how is data considered cold?

Data will be moved in certain scenarios:
1) If the data is no longer considered hot.
2) If the tier utilisation breaches a certain threshold, DSF ILM will kick in and down-migrate data (SSD to HDD tier). The data for down-migration is selected based on "last access time".

DSF ILM will constantly monitor the I/O patterns and migrate data down or up as necessary, as well as bring the hottest data local regardless of tier.
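
As a rough sketch of the second scenario, down-migration by "last access time" once a tier crosses a utilisation threshold could look like the following. This is not the actual DSF ILM implementation: the `Extent` class, the function name and the `threshold` parameter are hypothetical, and the real threshold value is not given in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    name: str
    size_gb: float
    last_access: float  # timestamp; smaller means colder


def pick_for_down_migration(extents: List[Extent], capacity_gb: float,
                            threshold: float) -> List[Extent]:
    """Select the coldest extents until tier usage is at or below
    capacity_gb * threshold. `threshold` is a hypothetical fraction (0..1)."""
    used = sum(e.size_gb for e in extents)
    target = capacity_gb * threshold
    selected = []
    for extent in sorted(extents, key=lambda e: e.last_access):
        if used <= target:
            break  # tier is back under the threshold, stop migrating
        selected.append(extent)
        used -= extent.size_gb
    return selected


ssd_tier = [
    Extent("a", 40, last_access=100),
    Extent("b", 40, last_access=300),
    Extent("c", 20, last_access=200),
]
to_move = pick_for_down_migration(ssd_tier, capacity_gb=100, threshold=0.5)
```

Nothing is selected while usage stays at or below the threshold, which matches the answer above: data is not pushed down a tier just because a lower tier exists.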

Also, it is important to size the cluster properly according to workloads and their working set sizes, especially in a hybrid (SSD + HDD) system. 

The following Nutanix Bible topics explain in detail when ILM kicks in and how it works:
Storage Tiering and prioritizations

Disk Balancing

IO Path and Cache

 

I hope this helps answer your queries. If not, please do not hesitate to clarify or ask further. Thanks.

 

BR


@Mutahir 

Did you mean the 30-minute threshold for evicting a Cassandra instance from the metadata ring?
