Solved

CVM Crashed

  • 18 March 2020
  • 34 replies
  • 15379 views

Badge +2

All,

I have a 6-node 1065 system in 2 blocks. Recently one of the CVMs (node 2, block A) crashed. When rebooted, it just went into a boot loop. When diagnosed, it seemed the SSD (not the SATADOM) had failed, so we replaced it. When we try to boot the CVM, it still just loops.

We were told to boot that node with Phoenix, which the cluster provided for download. When I do that, Phoenix doesn’t load and I get errors instead.

I’m looking for suggestions on how to get the node back to 100%. At this point (and throughout), ESXi on the SATADOM has booted fine, so if I didn’t care about the storage side I could just ignore this, but I’d like the system to be fully healthy.

Any suggestion about how to get the CVM working again would be appreciated.

Thank you

Johan


Best answer by JohanITG 28 April 2020, 18:10


This topic has been closed for comments

34 replies

Badge +2

Hi Jeremy, thanks for the note. That’s what Chandru had said at the end of the day as well. We got a new drive from Sales (which took a while), got it installed, and things started to move forward. The posts in the last 48 hours have been with the new drive installed. Things are good and the cluster is back to full strength.

 

Thanks

Userlevel 3
Badge +4

Hello @JohanITG 

I believe it’s already clear, but as of the last update here the basic problem is that the CVM boot disk failed, and installing the CVM boot image onto the replacement SSD also failed.

I noted from your Phoenix install attempt screenshot earlier that the error “No suitable SVM boot disk found” was reported.

Typically this error is the result of Phoenix checking the detected SSD model against the list of supported SSD hardware models. In other words, it shows the Phoenix software refusing to install a CVM on a disk that we have not qualified for use as a CVM boot disk.

From your “lsscsi” output in the last image I’m seeing WDS100T1R0A, which a brief search identifies as a Western Digital Red SA500 SSD. Although a few Western Digital models have passed qualification, I don’t think this one has been qualified for use as a CVM boot disk; I do not see this model string reflected as a supported drive model.

If it’s the drive model I mentioned above, I don’t believe it could be a qualified drive: the spec I found for it shows 0.33 DWPD (drive writes per day, a standard SSD endurance metric), and I believe 3 DWPD is the requirement for a CVM boot disk.
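To put that endurance gap in rough numbers (a back-of-the-envelope sketch only; the 1 TB capacity and 5-year warranty window are assumptions taken from the public SA500 spec sheet, not from your cluster):

awk 'BEGIN {
  capacity_tb = 1; years = 5                            # assumed drive size and warranty period
  printf "0.33 DWPD -> %.0f TBW\n", 0.33 * capacity_tb * 365 * years
  printf "3.00 DWPD -> %.0f TBW\n", 3.00 * capacity_tb * 365 * years
}'

# prints roughly 602 TBW vs 5475 TBW of rated write endurance, i.e. about a 9x gap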

Maybe not the answer you were hoping for, but unless I’ve missed something, that’s why the Phoenix process isn’t working here.

Badge +2

All, there was an issue with RAM allocation on the CVM that was throwing a hidden error. Once the RAM was increased, the CVM joined the cluster and we’re good.
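In case it helps anyone hitting the same thing, here is a minimal read-only sketch for checking the CVM’s current memory allocation from the ESXi shell. The .vmx path below is just the usual Nutanix CVM location on the local datastore and is an assumption here; use whatever path getallvms actually reports. Any resize has to be done with the CVM powered off.

vim-cmd vmsvc/getallvms | grep -i ntnx      # find the CVM's VM ID and .vmx path

grep -i memSize /vmfs/volumes/NTNX-local-ds-*/ServiceVM_Centos/ServiceVM_Centos.vmx

# memSize is in MB; compare it against a healthy CVM on another node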

 

Thanks

Badge +2

I take it back…  I thought I was done.  Cluster Expansion failed:

 

Help?

Badge +2

Final update: I managed to get the SSD seen by the host and did a full wipe and reinstall. The host is now back to being part of the cluster.

 

Thanks for the help!

Badge +2

Hi there,

So I got the new SSD from Sales and installed it. I ran through the setup and it ended here (see photo).

 

What should I do now?

 

 

Userlevel 3
Badge +13

We don’t support ignoring the compatibility list, since a node with an unqualified SSD can cause performance issues for other nodes. Also, in the initial picture you shared with us, the model says Intel SSD. Was that the old drive? Your sales engineer should be able to provide you with the list of supported SSDs.

Badge +2

I’m more than willing to try something else. Can you send me the list? Or is there a way to ignore the “supported” drives?

 

Userlevel 3
Badge +13

OK, that seems to be the problem. The SSD model detected is unsupported as a boot drive. We have a list of SSDs supported for use as boot drives, and I can’t find this model in it. I’m not able to see any Western Digital drives in our supported list.

Badge +2

There are no partitions on the SSD or on any other disks beyond the SATADOM.

Screenshots:

 

Userlevel 3
Badge +13

Can you mount the Phoenix ISO again and, when it prompts for installation, just select cancel? It will take you to a shell. In the shell, run:

fdisk -l /dev/sd?

Confirm there are 4 partitions on the SSD, mount the first two partitions of the SSD, and then do an ls -la on each mount point. By the way, is the replacement SSD the same model as the SSDs in the other nodes?
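A condensed sketch of those steps, assuming lsscsi reports the SSD as /dev/sdb (substitute the device name from your own output):

fdisk -l /dev/sdb            # expect 4 partitions, sdb1..sdb4

mkdir -p /mnt/p1 /mnt/p2
mount /dev/sdb1 /mnt/p1
mount /dev/sdb2 /mnt/p2

ls -la /mnt/p1 /mnt/p2       # compare contents against the same partitions on a healthy node

umount /mnt/p1 /mnt/p2       # unmount again before re-running Phoenix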

Badge +2

It appears to be the same:

 

 

Userlevel 3
Badge +13

@JohanITG can you check whether the LSI controller is marked as passthrough for the CVM? You can compare it with a working CVM’s .vmx file or settings.
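A minimal sketch of that comparison from the ESXi shell; the .vmx path below is the usual Nutanix CVM location on a local datastore and is an assumption, so use whatever path getallvms actually reports on your host:

vim-cmd vmsvc/getallvms | grep -i ntnx      # note the CVM's .vmx path

grep -i pciPassthru /vmfs/volumes/NTNX-local-ds-*/ServiceVM_Centos/ServiceVM_Centos.vmx

# A working CVM should show pciPassthru0.present = "TRUE" plus id/deviceId lines
# pointing at the LSI SAS controller; compare that block against the .vmx of the
# CVM that will not boot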

Badge +2

Screenshot now:

 

Userlevel 3
Badge +13

The failure will be above those lines; please send me the logs as a private message.

Badge +2

These are the last lines:

 


2020-03-27 07:04:36,203 INFO  Running cmd ['touch /bootbank/Nutanix/firstboot/.firstboot_fail']

2020-03-27 07:04:36,224 INFO  Changing ESX hostname to 'Failed-Install'

2020-03-27 07:04:36,225 INFO  Running cmd ['esxcli system hostname set --fqdn Failed-Install']

2020-03-27 07:04:36,538 INFO  Running cmd ['esxcli network ip interface ipv4 set -i vmk0 -t dhcp -P true']

2020-03-27 07:43:16,207 INFO  First boot has been run before. Not running it again

 

Userlevel 3
Badge +13

OK, I see a .firstboot_fail marker file, which means the firstboot scripts failed. Can you check first_boot.log to see where the failure is?
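A short sketch for pulling that log up from the ESXi shell; the first_boot.log filename and location are assumptions based on the marker-file directory used above, so list the directory first to confirm what is actually there:

ls -la /bootbank/Nutanix/firstboot/

tail -n 100 /bootbank/Nutanix/firstboot/first_boot.log      # the failure is usually recorded near the end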

Badge +2

Here you go:

 

 

Userlevel 3
Badge +13

@JohanITG can you run the following command on the ESXi host?

 

ls -la /bootbank/Nutanix/firstboot/*firstboot*

 

Badge +2

In Phoenix there are 2 options, repair and install.  I’ve tried both, with the same error posted above for each.  Also, in VMware, I removed the old CVM to see if it made any difference, and it did not; after I remove the original one, no new CVM gets added in the hypervisor under VMware.

 

Is there a way to force Configure Hypervisor to run?

 

Userlevel 3
Badge +13

@JohanITG you’re not required to choose the hypervisor. Which option did you choose at the Phoenix prompt? The CVM is installed in VMware using a process called Configure Hypervisor. Did you just try Install CVM?

Badge +2

Hi there,

Did that: 

 

No partitions.

Restarted Phoenix.

Also, when Phoenix starts installing I see this error, if it means anything:

When Phoenix was back at a prompt, I rebooted to VMware and there was no CVM created.

Question:  When building the ISO, do I need to tell it that it’s for VMware?

Userlevel 3
Badge +13

@JohanITG I believe the failure is due to a partition being detected on the SSD. In the same screen, please run the following commands:

 

lsscsi

# the above command will list the drives detected; find the SSD and note its device name (e.g. sda, sdb)

 

fdisk -l  <device_name_from_above_command>

If you see a partition listed, delete it by following the article below:

https://www.cyberciti.biz/faq/linux-how-to-delete-a-partition-with-fdisk-command/

 

After all partitions on the drive are deleted, run the following command to start Phoenix again:

 

phoenix/phoenix
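A condensed sketch of the whole sequence, with /dev/sdb standing in for whatever device name lsscsi reports for the SSD:

lsscsi                   # identify the SSD, e.g. /dev/sdb

fdisk -l /dev/sdb        # confirm which partitions currently exist

fdisk /dev/sdb           # at the fdisk prompt: d deletes a partition (repeat for each),
                         # p prints the now-empty table, w writes the changes and exits

fdisk -l /dev/sdb        # verify no partitions remain

phoenix/phoenix          # relaunch the installer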

 

Badge +2

I’ve done a bunch of work and now have Phoenix running, but I get this:

 

 

Userlevel 4
Badge +12

Based on a recent experience with a similar problem in a production environment, I would say it’s faster to rebuild the node. That was the answer Nutanix support gave me: it would have taken longer to troubleshoot the issue than to rebuild, and that seems to be where your case is now. Sorry, not much help here, but if we were to put it to a vote, I’d say “rebuild it”.

 

PS. I also agree that the repair host boot device option does nothing :)