Question

HA restore after 1 node down

  • 11 October 2020

Hello Team,

I’m running a small 3-node cluster. During an AHV upgrade to the latest version, one of the servers failed to reboot, so AOS started a rebuild to restore HA. After several hours the rebuild failed, and looking at the logs I saw the message: “not enough space available”.

If I’m not mistaken, a 3-node cluster can tolerate only one node down. But let’s say the two running nodes had enough space: is it possible to restore HA? That would basically mean that, once the rebuild has finished, the cluster would still be up in single-node mode if one of the 2 remaining nodes failed.

Just curious, because it looked like AOS tried to rebuild HA despite only 2 nodes being up.


3 replies

Userlevel 3
Badge +4

A 3-node cluster can tolerate only 1 node failure. When a node fails, the data rebuild process kicks in to prevent data loss, and if there is enough storage space in the cluster, all the data will be safe even if another node fails. However, the cluster will not be able to run if 2 nodes go down, because some services (such as Cassandra, Zookeeper, etc.) cannot run on a single node and can tolerate only 1 node failure.
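
The quorum reasoning above can be sketched in a few lines of Python. This is a simplified model, assuming the metadata services need a strict majority of the configured nodes to stay available. The function names are illustrative and are not part of any Nutanix tool.

```python
def majority(total_nodes: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """True if the healthy members can still form a majority (assumed model)."""
    return healthy_nodes >= majority(total_nodes)


# 3-node cluster: check quorum with 0, 1 and 2 nodes down.
for down in range(3):
    healthy = 3 - down
    print(f"{down} node(s) down -> quorum: {has_quorum(3, healthy)}")
```

Under this assumption, a single surviving node out of 3 can never form a majority, regardless of how much free storage it has.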

 

Badge

Hello Sergei,

So if there is not enough free space, data loss is possible, and HA with only two healthy nodes does not work either (whereas in a RoBo setup it seems that the cluster can stay up with one node).

Well, I can’t remember whether the “not enough space available” message came from Cassandra, but it was quite weird because 90% of the space is free (AOS 5.18.1 shows very clearly the limit you should not exceed).

Userlevel 3
Badge +4

In RoBo clusters there is a difference that keeps them alive when 1 node goes down, and that difference is the witness. A 3-node cluster doesn’t have a witness, so the cluster will be completely down if a 2nd node dies, no matter whether there is enough storage space or not.
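
A rough way to picture the witness is as an extra vote. The Python sketch below assumes a simple majority-vote model, which is a simplification and not a description of how AOS actually implements it. `can_stay_up` and its parameters are hypothetical names chosen for illustration.

```python
def can_stay_up(nodes: int, healthy_nodes: int, witness: bool = False) -> bool:
    """Assumed model: the cluster stays up while healthy voters form a strict majority."""
    extra_vote = 1 if witness else 0          # witness assumed to be reachable
    voters = nodes + extra_vote
    healthy_voters = healthy_nodes + extra_vote
    return healthy_voters >= voters // 2 + 1


# 2-node RoBo cluster with a witness, one node left
print(can_stay_up(nodes=2, healthy_nodes=1, witness=True))
# 3-node cluster without a witness, one node left
print(can_stay_up(nodes=3, healthy_nodes=1, witness=False))
```

In this model the surviving RoBo node plus the witness still hold 2 of 3 votes, while the last node of a 3-node cluster holds only 1 of 3.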
