Solved

CVM Crashed

  • 18 March 2020
  • 34 replies
  • 6330 views

Badge +2

All,

I have a 6-node 1065 system in 2 blocks.  Recently one of the CVMs (node 2, block A) crashed.  When rebooted, it just kept looping.  Diagnosis showed that the SSD (not the SATADOM) had failed, so we replaced it.  When we try to boot the CVM, it still just loops.

We were told to boot that node with Phoenix, which the cluster provided for download.  When I do that, Phoenix doesn’t load and I get errors instead.

I’m looking for a suggestion of how to get the node back to 100%.  At this point (and throughout) the ESXi on the SATADOM has booted fine and I guess if I didn’t care about the storage side I could just ignore this but I’d like the system to be fully healthy.

Any suggestion about how to get the CVM working again would be appreciated.

Thank you

Johan


Best answer by JohanITG 28 April 2020, 18:10

All, there was an issue with RAM allocation on the CVM that was throwing a hidden error.  Once RAM was increased the CVM joined the cluster and we’re good.

 

Thanks



Userlevel 3
Badge +13

@JohanITG  Did you try the single-SSD replacement method in the document below, specifically the “REPAIRING A DRIVE” section?

https://support-portal.nutanix.com/page/documents/details?targetId=Boot-Metadata-Drive-Replacement-Single-SSD-Platform-v5_16-NX1000-NX6000:Boot-Metadata-Drive-Replacement-Single-SSD-Platform-v5_16-NX1000-NX6000

Badge +2

Chandru,

 

Thanks for this.  I’ve tried it without success.  Once the SSD was replaced, I went to the UI and selected “repair boot disk” for the node in question.  It errors, saying it can’t find the node, and then prompts me to download Phoenix.  I boot that ISO and wait.  After some time, and after an error about failing to mount the Phoenix CDROM, it reaches a final Linux prompt that says:  There is no screen to be attached matching centos_phoenix

 

When I get to this point I have gone back to the UI and clicked continue.  It just thinks for a minute and goes back to this:

 

Any suggestions?  I was curious: do I need to be using the on-board 1Gbps NICs to make this work?  If so, that could well be the issue, as we’re only using the 10G NICs.  From the console shown above, I can ping the other CVMs and the cluster IP.

Thanks

Userlevel 3
Badge +13

@JohanITG It looks like you are trying to repair the host boot disk, not the CVM boot disk. You need to select the replaced drive in the UI, and then you will see the Repair boot disk option. That option is the one for repairing the CVM boot disk.

 

 

Badge +2

@Chandru - Here’s the screenshot of the hardware from Prism:

I would love to know where the button you mention is.  Please note that because the CVM isn’t loading, nothing is reporting to Prism.  Also note that when the SSD died, the CVM crashed.  Since then the CVM only tries to boot and then resets itself after a few attempts to mount things.

As this is a VMware host, I have tried to fix the HDD both with VMware running and with the Phoenix image provided by the system.  Neither has been successful in any way. I would like to think I’m missing something obvious and simple.  I just don’t know what.

 

Thanks

 

Badge +2

This is what’s on the CVM screen:  

 

Userlevel 3
Badge +13

@JohanITG  Looks like the disk is not detected as failed. There is an ID for the disk, which means the system already knows about it. Did you replace the disk or just re-seat it?

Badge +2

The disk has been replaced.  The old one (we tested on another system and this one) had IO errors and was certainly bad.

There is an option to “remove disk”, but it won’t let me use it.  It says there is some way to force the removal, but nothing obvious shows how, and I didn’t want to try that without knowing it was the right thing to do.  Note that the node only shows up as its IP address in Prism because it’s not currently seen by the cluster.

Badge +2

I’ve removed the disk from the Prism system but there’s nothing new showing up.  I’m going to double check the new hardware and will write back when I’ve done that.  I’m still concerned that the CVM doesn’t boot at all.

Badge +2

Update:  At this point I’ve tried a second SSD in the unit and am still stuck.  

  1. The node boots ESXi
  2. The CVM starts but loops
  3. The Prism UI never sees the node (because the CVM is down) and just shows the node as an IP in the UI
  4. There is no option ever to repair the specific HDD
  5. The repair host boot device does nothing

At this point I think my only choice would be to rebuild the node from scratch unless someone has some idea how to get the CVM to do anything other than boot up, fail as shown above and then restart itself.

Thanks

Userlevel 4
Badge +12

After a recent experience with a similar problem in a production environment, I would say that it’s faster to rebuild the node.  That was the answer Nutanix support gave me: troubleshooting the issue would have taken longer than rebuilding, as it has in your case.  Sorry, not much help here, but if we were to put it to a vote, I’d say “rebuild it”.

 

PS. I also agree that the repair host boot device option does nothing :)

Badge +2

I’ve done a bunch of work and now have phoenix working but get this:

 

 

Userlevel 3
Badge +13

@JohanITG  I believe the failure is due to a partition being detected on the SSD. In the same screen, please run the following commands:

 

lsscsi

# the above command will list the drives detected, find the ssd drive and look for the device name like sda,sdb

 

fdisk -l  <device_name_from_above_command>

If you see a partition listed, delete it by following the article below:

https://www.cyberciti.biz/faq/linux-how-to-delete-a-partition-with-fdisk-command/

 

After all partitions are deleted on the drive, run the following command to start phoenix again

 

phoenix/phoenix
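
The partition check described above can also be scripted. This is only a minimal sketch, assuming `fdisk -l` prints each partition on a line that starts with the device path followed by a number (for example `/dev/sda1`); `count_partitions` is a hypothetical helper name, not part of Phoenix.

```shell
# Count partition rows for one device in `fdisk -l` output read from stdin.
# Assumption: partition rows start with the device path plus a number.
count_partitions() {
    dev=$1
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c "^${dev}[0-9]" || true
}
```

Usage would look like `fdisk -l /dev/sda | count_partitions /dev/sda`, where `/dev/sda` stands in for whatever device `lsscsi` reports for the SSD. A result of 0 means there are no partitions left to delete.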

 

Badge +2

Hi there,

Did that: 

 

no partitions.

Restarted phoenix, 

Also, when phoenix starts installing I see this error if it means anything:

When phoenix was just back at a prompt, I rebooted to VMware and there was no CVM created.

Question:  When building the ISO, do I need to tell it that it’s for VMware?

Userlevel 3
Badge +13

@JohanITG It is not required to choose the hypervisor. Which option did you choose at the Phoenix prompt? The CVM is installed in VMware by a process called Configure Hypervisor. Did you only try Install CVM?

Badge +2

In phoenix there are 2 options, repair and install.  I’ve tried both with the same error posted above for each.  Also, in VMware, I removed the old CVM to see if it makes any difference and it does not.  Also, it does not add a CVM in the hypervisor under VMware after I remove the original one.

 

Is there a way to force Configure Hypervisor to run?

 

Userlevel 3
Badge +13

@JohanITG Can you run the following command on the ESXi host?

 

ls -la /bootbank/Nutanix/firstboot/*firstboot*

 

Badge +2

Here you go:

 

 

Userlevel 3
Badge +13

OK, I see a .firstboot_fail marker file, which means the firstboot scripts failed. Can you check first_boot.log to see where the failure is?
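
The marker check can be reduced to a one-line test. A minimal sketch, assuming only what this thread shows: a `.firstboot_fail` file in the firstboot directory signals a failed run. `firstboot_status` is a hypothetical helper name, and the absence of the marker does not by itself prove that first boot succeeded.

```shell
# Report whether the failure marker exists in the given firstboot directory.
firstboot_status() {
    dir=$1
    if [ -e "$dir/.firstboot_fail" ]; then
        echo "fail marker present"
    else
        echo "no fail marker"
    fi
}
```

On the ESXi host this would be called as `firstboot_status /bootbank/Nutanix/firstboot`, the directory used in the `ls` command above.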

Badge +2

These are the last lines:

 


2020-03-27 07:04:36,203 INFO  Running cmd ['touch /bootbank/Nutanix/firstboot/.firstboot_fail']

2020-03-27 07:04:36,224 INFO  Changing ESX hostname to 'Failed-Install'

2020-03-27 07:04:36,225 INFO  Running cmd ['esxcli system hostname set --fqdn Failed-Install']

2020-03-27 07:04:36,538 INFO  Running cmd ['esxcli network ip interface ipv4 set -i vmk0 -t dhcp -P true']

2020-03-27 07:43:16,207 INFO  First boot has been run before. Not running it again

 

Userlevel 3
Badge +13

The failure will be above those lines. Please send me the logs as a private message.
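
To pull out the part of the log that matters, one option is to print the lines that come just before the marker file is created. This is a rough sketch, assuming the host's `awk` supports `index()` and arrays (standard awk does); `lines_before_marker` is a hypothetical helper, and the log path is passed in because its location is not given in this thread.

```shell
# Print up to N lines that precede the first line containing MARKER.
lines_before_marker() {
    file=$1
    marker=$2
    n=${3:-40}
    awk -v m="$marker" -v n="$n" '
        # On the first matching line, dump the buffered lines and stop.
        index($0, m) {
            for (i = NR - n; i < NR; i++)
                if (i > 0) print buf[i % n]
            exit
        }
        # Otherwise remember this line in a ring buffer of size n.
        { buf[NR % n] = $0 }
    ' "$file"
}
```

For example, `lines_before_marker <path to first_boot.log> '.firstboot_fail' 40` would show the 40 lines leading up to the `touch` command seen in the excerpt above, which is where the real error should appear.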

Badge +2

Screenshot now:

 

Userlevel 3
Badge +13

@JohanITG Can you check whether the LSI controller is marked as passthrough for the CVM? You can compare it with the .vmx file or settings of a working CVM.
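
The comparison with a working CVM can be done by extracting just the relevant settings from both .vmx files. A minimal sketch, assuming passthrough devices appear under keys beginning with `pciPassthru` and memory under `memSize` (common .vmx key names, but check them against your own files); `vmx_settings` and `compare_vmx` are hypothetical helpers, and both file paths are supplied by the caller.

```shell
# Extract passthrough and memory settings from a .vmx file, sorted for comparison.
vmx_settings() {
    grep -i -E '^(pciPassthru|memSize)' "$1" | sort
}

# Compare those settings between a working CVM and the failing one.
compare_vmx() {
    good=$(vmx_settings "$1")
    bad=$(vmx_settings "$2")
    if [ "$good" = "$bad" ]; then
        echo "match"
    else
        echo "differ"
        printf 'working:\n%s\nfailing:\n%s\n' "$good" "$bad"
    fi
}
```

Memory is included because the accepted answer to this thread traced the problem to RAM allocation on the CVM. Note that device IDs may legitimately differ between nodes, so a "differ" result needs to be read line by line.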

Badge +2

It appears to be the same:

 

 

Userlevel 3
Badge +13

Can you mount the Phoenix ISO again and, when it prompts for installation, just select cancel? It will take you to a shell. In the shell, run

fdisk -l /dev/sd?

Confirm there are 4 partitions on the SSD, mount the first two partitions of the SSD, and then do an ls -la on each mount point. By the way, is the replacement SSD the same model as the SSDs on the other nodes?

Badge +2

There are no partitions on the SSD or any other disks beyond the SATA DOM

Screenshots: