Solved

Node failed...Need an approach

  • 18 September 2018
  • 60 replies
  • 37 views

Userlevel 3
Badge +11
Hi all,

I just came back from holiday to find that one of my nodes had failed. I was able to log in to the node this morning via IPMI, get to the AHV host, and run a shutdown -r now. It rebooted normally and no errors were shown in the output.

Although the box is up again, it does not show up in PRISM. The CVM is missing as well. The other 2 nodes seem to run fine.

Any starting point for how I should approach this issue?

Kind regards

Christian

Best answer by Jon 23 September 2018, 20:44


This topic has been closed for comments

60 replies

Userlevel 3
Badge +11
Update:

Ok so after some testing....

I am able to log on to the CVM of Node 2 and Node 3, and I am able to log on to the AHV of Node 3.

I can no longer log on to the AHV or the CVM of Node 1 (incorrect password, even though it is supposed to be the same on all nodes), and I cannot log on to the AHV of Node 2 either, which is very strange. Also an incorrect password. I retried it a couple of times and verified keyboard settings and so on.

From PRISM and the couple of running VMs I can see that network connectivity seems to be a problem. The switches and the gateway (UniFi) are all fine.

No changes were made while I was on holiday. No one else has access, and physical access is blocked.

Userlevel 7
Badge +25
What is in the serial out of the broken CVM? (the /tmp/NTNX... file on the AHV)
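If it helps, something like this on the AHV host should show it (just a sketch; the exact file name varies per CVM, hence the wildcard):

code:
# on the AHV host; the serial log name varies per CVM, hence the wildcard
ls -l /tmp/NTNX*.serial.out*
tail -n 100 /tmp/NTNX*.serial.out.0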
Userlevel 3
Badge +11
Hi Jrack,

I will try to look into this.

I just checked the console output of NODE 2. There are issues too:

Userlevel 3
Badge +11
Hi again,

I managed to log in to NODE 1 (which is my primary concern). On the AHV (not on the CVM) I went to /tmp and opened the NTNX.serial.out.0 file.

Here is the content:

Userlevel 3
Badge +11
Is there anything I can provide from the problematic CVM directory?
Userlevel 3
Badge +11
I just restarted the CVM on NODE 2 (which had all the error messages I posted above). Now PRISM isn't working any more 😞
Userlevel 3
Badge +11
Here is the current state of the cluster:

Userlevel 6
Badge +16
Can you run:

code:
nodetool -h localhost ring
Userlevel 3
Badge +11
@Primzy

Where? On any CVM? On any AHV?

Regards
Userlevel 6
Badge +16
On a working CVM.
Userlevel 3
Badge +11
@Primzy

Userlevel 5
Badge +9
Hi,
I searched the forum a bit for cases where "nodetool -h localhost ring" returned "Error connection to remote JMX agent!", and it seems to me that you'll likely need to reinstall the node.
See e.g. https://next.nutanix.com/discussion-forum-14/connect-nodes-to-cluster-after-cvm-recovery-28522#post32492
Maybe someone else has a better idea...
Userlevel 7
Badge +25
Agreed with the reinstall. I think that first screenshot of the serial out indicates it can't bring up the metadata disk on that bad CVM. That would take the node out of Cassandra. The recover-node path is for a bad AHV state, so a reinstall would be needed. The concern is whether that SSD has issues.

Did another CVM fail, though?
Userlevel 3
Badge +11
@jrack @andrew_ct

Thanks for your effort in this case!

My main concern now is how I approach such a reinstall.
Will I lose all my data? Which NODE shall I start with?
Do I need to reinstall all 3 NODES?
How can I test if any of the SSDs has an issue?

Regards
Userlevel 5
Badge +9
Hi,
As for how to start a reinstall, see the link in my previous post (basically, you need to run a cleanup script first and then choose to repair/reinstall).
If a host repair helps (which I doubt in this case), then data can be preserved; for a CVM repair, the data on the SSDs is destroyed, and a full reinstall wipes everything.
Userlevel 7
Badge +25
Well, if you have 2 of your 3, then you won't have data loss. You would reinstall the node and add it back to the cluster, and Nutanix would take care of restoring the data from the remaining copy as needed.

Now if you lost 2 nodes and both were because the SSD failed, then there is likely a data-loss risk, as in a 3-node cluster Cassandra only has two copies (RF2) and there will be extents that were on those impacted devices.

So the SMART data can be reviewed with smartctl. On the original AHV instance you should be able to probe that SSD directly. Examples of use: https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl
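A minimal check from the AHV host would look something like this (the /dev/sda path is just a placeholder; pick whichever device the SSD actually shows up as):

code:
# assumption: the SATA SSD appears as /dev/sda on the AHV host; check with lsblk first
lsblk
smartctl -H /dev/sda    # quick overall health verdict
smartctl -a /dev/sda    # full SMART attributes and error log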

So what is the result of a genesis start?
Userlevel 3
Badge +11
@jrack

NODE 3 works fine. NODE 1 is definitely not responding. As written before, I restarted the CVM on NODE 2 just to see if that would help. Ever since, I cannot access PRISM any more.

I can access both CVMs on NODE 2 + NODE 3 via SSH.

What do you mean here: "So what is the result of a genesis start?"

Regards
Userlevel 7
Badge +25
So from n2 or n3, in the CVM shell run "genesis start".

With only 2 of the three, the cluster is running impaired: it does not have quorum, so it has to run active/passive on some activity to avoid the need for a tie breaker. This is why commercial deployments need 4 nodes to start; 4 allows the loss of 1 without losing quorum in the cluster.

I would take a look at n1 and see if that SSD still seems to be OK, then get to reinstalling. If PRISM is down, then you will need to use "cluster add-node" from the new CVM. Don't reinstall that node as a one-node cluster.
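For reference, the non-destructive part of that from the CVM shell is just this (the add-node step comes later, once the node is reinstalled):

code:
# on a working CVM (n2 or n3)
genesis start     # (re)start the Genesis service on this CVM
cluster status    # see which services are up on each CVM in the cluster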
Userlevel 3
Badge +11
Hi again,

ok this is strange.

All 3 NODES are the same.

1x 512GB NVMe drive
1x 2TB SSD
1x 32GB USB stick

I'm missing the 512GB disk here. Or is it missing because it is an NVMe?
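I assume something like this on the AHV host would at least show whether the device is still visible (just standard Linux tools, nothing Nutanix-specific):

code:
# on the AHV host: is the NVMe PCI device still present?
lspci -nn | grep -i nvme
# block devices the host itself currently sees
lsblk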

Userlevel 3
Badge +11
Result of running "genesis start" from the CVM shell on n2 and n3:


Userlevel 3
Badge +11
So from n2 or n3, in the CVM shell run "genesis start".

With only 2 of the three, the cluster is running impaired: it does not have quorum, so it has to run active/passive on some activity to avoid the need for a tie breaker. This is why commercial deployments need 4 nodes to start; 4 allows the loss of 1 without losing quorum in the cluster.

I would take a look at n1 and see if that SSD still seems to be OK, then get to reinstalling. If PRISM is down, then you will need to use "cluster add-node" from the new CVM. Don't reinstall that node as a one-node cluster.


Ok, I will plan to add a 4th NODE then for the future. 🙂
Userlevel 7
Badge +25
Yup NVMe is a bit unique as it is disconnected from the Host and passed through to the CVM. It doesn't use virtio to expose the block device to the CVM.

You have an NVMe+SSD setup where the 2TB SATA SSD seems to be being used as the metadata disk (the 4 partitions), which isn't preferred. It should have used the NVMe as the metadata disk and the SSD as capacity, but yours elected the opposite, at least in that one case.

Is this the same situation on all 3 nodes?

So Genesis is looking good. But Cassandra is down, likely because it can't make quorum between the 3 nodes in the ring. I would think it would run impaired, like during a CVM upgrade, but something is preventing it from doing so. I'm far from a ZooKeeper and Cassandra expert.

I don't have an NVMe lab so I'm flying a bit blind, but I think Nutanix was using VFIO for doing the passthrough. If you do a "virsh dumpxml BLAH-CVM" (use "virsh list --all" to get the name of your CVM and replace BLAH), you can grep for "vfio" and see your 512GB device directly mapped to the CVM through IOMMU/VFIO.
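Roughly like this (the CVM name below is a placeholder; take the real one from the list output):

code:
# on the AHV host
virsh list --all                                    # find the CVM domain name
virsh dumpxml NTNX-EXAMPLE-CVM | grep -i -A3 vfio   # EXAMPLE-CVM is a placeholder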
Userlevel 3
Badge +11
@jrack

Unfortunately I cannot log on to NODE 2 (AHV) because it's spamming the console. If I try it via PuTTY I get a login failure every time. I'm afraid of rebooting the whole node.

Userlevel 3
Badge +11

Well, if you have 2 of your 3, then you won't have data loss. You would reinstall the node and add it back to the cluster, and Nutanix would take care of restoring the data from the remaining copy as needed.

Now if you lost 2 nodes and both were because the SSD failed, then there is likely a data-loss risk, as in a 3-node cluster Cassandra only has two copies (RF2) and there will be extents that were on those impacted devices.

So the SMART data can be reviewed with smartctl. On the original AHV instance you should be able to probe that SSD directly. Examples of use: https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl

So what is the result of a genesis start?


I started the tests on NODE 1 and NODE 3. I'm running a long test; it will be ready in approximately 4 hours.
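In case the commands are useful for anyone else, this is roughly what I ran on each host (the device path is a placeholder for whichever disk the SSD is):

code:
smartctl -t long /dev/sda      # start the extended self-test (runs in the background)
smartctl -l selftest /dev/sda  # check progress and the result once it finishes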

Userlevel 3
Badge +11
Yup NVMe is a bit unique as it is disconnected from the Host and passed through to the CVM. It doesn't use virtio to expose the block device to the CVM.

You have an NVMe+SSD setup where the 2TB SATA SSD seems to be being used as the metadata disk (the 4 partitions), which isn't preferred. It should have used the NVMe as the metadata disk and the SSD as capacity, but yours elected the opposite, at least in that one case.

Is this the same situation on all 3 nodes?

So Genesis is looking good. But Cassandra is down, likely because it can't make quorum between the 3 nodes in the ring. I would think it would run impaired, like during a CVM upgrade, but something is preventing it from doing so. I'm far from a ZooKeeper and Cassandra expert.

I don't have an NVMe lab so I'm flying a bit blind, but I think Nutanix was using VFIO for doing the passthrough. If you do a "virsh dumpxml BLAH-CVM" (use "virsh list --all" to get the name of your CVM and replace BLAH), you can grep for "vfio" and see your 512GB device directly mapped to the CVM through IOMMU/VFIO.


From a capacity standpoint within PRISM, the cluster did use the 2TB disk to provide storage. If it had used the 512GB disks as storage, I would have had only about 512GB of usable space in the whole cluster, right?

With |grep I only get this (I'm a Linux amateur)



Here is the full output: (Do you see the 512GB disk? It should be SAMSUNG as well, like the 2TB one)

...hmm, I can't paste the virsh output here, and I cannot attach a txt file. The editor here screws up the code when I save it.
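One thing I could try (just an idea; the CVM name, user and IP below are placeholders) is dumping the XML to a file and pulling it off the host with scp instead of pasting it:

code:
# on the AHV host: save the full XML to a file
virsh dumpxml NTNX-EXAMPLE-CVM > /tmp/cvm-dump.xml
# copy it to another machine (user/IP are placeholders)
scp /tmp/cvm-dump.xml user@192.168.1.10:/tmp/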