Solved

Nutanix CE - Single Node, NVMEs disappeared after power failure

  • 8 August 2022

Ladies/Gents,

This is the second time this has happened. The first time, I did not know how to fix it and did a reinstall. This time I am trying to avoid going down the full reinstall route…

The issue

After a power failure in my area, the power may come back for 5-60 seconds and then go off again. This may happen once or twice after the initial failure. When it does, AHV seems to mark the two NVMe drives as not available, and because of that the CVM and all VMs are gone. (I have one 120GB Western Digital SSD as the AHV boot drive, and two Samsung EVO 1TB NVMe drives.)

Booting off a USB stick with Linux Mint we can see the drives listed:

 

All I can find is info that requires the CVM, which is gone in this case…

So how can this be fixed? Is it a matter of booting with the Nutanix CE USB and doing a repair? Is there something that can be done on AHV (which of course I do have access to) that will 'unflag' the disks as bad, so they can be used as they were before the power outage? They are not actually bad: if I reinstall the whole thing and start from scratch, the disks can be formatted and work fine after that.

I cannot find a description anywhere of how this works, i.e. how AHV flags these disks as 'bad' and refuses to mount/see them (even though any Linux distro, and even the Nutanix CE USB installer, can see them).

Thanks a lot all! You guys are my only hope Obi-Wans :-)

CR

Best answer by tsmvp 8 August 2022, 19:22

5 replies

Just keeping this thread in parallel with the one on Reddit, since it’s the same issue.

I don't think AHV is marking anything as bad; I think it just can't locate the VM definition for the CVM (or the definition is corrupt). lspci will confirm whether CE can see the NVMe drives or not.

The VM XML for the CVM is stored in /var/run/libvirt/qemu, I believe (I'm travelling, so I don't have access to my CE cluster right now to double-check). It can be rebuilt if necessary.
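
For anyone who wants to run those two checks from the AHV shell, here is a minimal sketch. It assumes that lspci and virsh are on the host's PATH and that the NVMe drives show up under the usual PCI class name "Non-Volatile memory controller". The helper name count_nvme_controllers is made up for illustration.

```shell
#!/bin/sh
# Count NVMe controllers in `lspci` output read from stdin.
# Assumption: NVMe devices are listed as "Non-Volatile memory controller".
count_nvme_controllers() {
    grep -ci 'non-volatile memory controller'
}

# Only query real hardware when the tool exists (i.e. on the AHV host).
if command -v lspci >/dev/null 2>&1; then
    echo "NVMe controllers visible to the host: $(lspci | count_nvme_controllers)"
fi

# List every domain libvirt knows about, running or not.
# If the CVM is missing from this list, its definition is what got lost.
if command -v virsh >/dev/null 2>&1; then
    virsh list --all
fi
```

If the count matches the number of NVMe drives installed but the CVM is absent from the virsh list, the hardware is probably fine and the problem is the missing VM definition.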

Ok I got a PDF that explains how to rebuild the XML file. I just did it and rebooted the Supermicro (E200-8D, 128GB RAM, 120GB WD SSD for AHV, two Samsung EVO 1TB for VMs/CVM/etc). Let's see what happens…

CR

OK, no go. I tried the virsh undefine command after copying the NTNX-CVM.XML file to /var/run/libvirt/qemu (I modified it and copied it under the proper name, with the block serial, etc.) and I get an error:

Not sure what that means LOL. That said, if that does not work, virsh define also fails and I cannot start the CVM. I think the storage is there, as I can at least see something with lspci:

I assume this shows that the hardware can see the two Samsung NVMes…

CR

For those who come here later: the XML needed to be rebuilt, and he was able to get things back up and running.

Here’s the Reddit thread:

https://www.reddit.com/r/nutanix/comments/wiqcj2/nutanix_ce_storage_issue/

With help of a bunch of great Nutanix people, here is the fix. Keep in mind this is FOR CE ONLY and for a SINGLE NODE scenario. This has NOTHING TO DO WITH PRODUCTION systems and should NOT be used with these. For that, open a case with Nutanix so they assist you with issues with real, production hardware.

With that in mind…

Fix for CVM that disappears after a power failure

  • SSH into the AHV (I do hope you at least know the AHV IP...)
  • Find the CVM name:
    • ls -lah /var/log/libvirt/qemu
    • You should see a log file named NTNX-<BLOCK_SERIAL>-<NODE>-CVM.log (e.g., NTNX-ae965403-A-CVM.log). This is the CVM VM name (of course, without the .log extension).
  • Run pwd just to make sure you are in the /root directory.
  • Make a copy of the NTNX-VM.xml file:
    • cp NTNX-VM.xml NTNX-ae965403-A-CVM.xml (again, this is the name you got from the log file)
    • Edit the file (you can use nano, or vi if you do hate yourself) and check that the name inside is correct (the second line, which says <name>NTNX-ae965403-A-CVM</name>). If it is, simply exit. If not, change it to match the name you got from the log file.
    • All good? Now copy that file again but here:
      • cp NTNX-YOUR_CVM.xml /var/run/libvirt/qemu
    • Go to that directory and define the VM:
      • cd /var/run/libvirt/qemu
      • virsh define NTNX-YOUR_CVM.xml
    • Now start the VM (virsh start takes the VM name, not the file name, so leave off the .xml):
      • virsh start NTNX-YOUR_CVM
    • After a bit the VM should be up. You can check using:
      • virsh list --all
    • Now SSH into the actual CVM, from the AHV:
      • ssh nutanix@192.168.X.X (where 192.168.X.X is the IP you had for your CVM)
    • You may need to fix the VLAN (in case the CVM was in a different VLAN, like in my case)
      • change_cvm_vlan YY (where YY is the VLAN ID)
    • May need to fix this so management works (depends how you setup the AHV of course):
      • manage_ovs --bridge_name br1 --interfaces 1g update_uplinks

Once I did all the above, everything came back up and running. All VMs, etc. Just like nothing had happened before.

Again thanks a lot for the assistance guys and I do hope this may help others in the future.
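
The define/start part of the steps above can be sketched as a small script. This is only a sketch under the same assumptions as the write-up: CE, single node, a template called NTNX-VM.xml in /root, and exactly one NTNX-*-CVM.log on the host. The function names are made up for illustration, and the script writes the renamed XML straight into /var/run/libvirt/qemu instead of copying it twice. It leaves out the VLAN and uplink steps, since those depend on your network.

```shell
#!/bin/sh
# Sketch only: the paths mirror the manual steps; function names are hypothetical.

# Strip the directory and the .log suffix to get the libvirt VM name.
cvm_name_from_log() {
    basename "$1" .log
}

# Copy TEMPLATE to OUT, rewriting the first <name>...</name> element to NAME.
write_cvm_xml() {
    template=$1 name=$2 out=$3
    awk -v n="$name" '
        !done && /<name>[^<]*<\/name>/ {
            sub(/<name>[^<]*<\/name>/, "<name>" n "</name>")
            done = 1
        }
        { print }
    ' "$template" > "$out"
}

# Ties the steps together. Nothing runs until you call this on the AHV as root.
rebuild_cvm() {
    # Assumption: the first matching log belongs to the only CVM on this node.
    log=$(ls /var/log/libvirt/qemu/NTNX-*-CVM.log 2>/dev/null | head -n 1)
    [ -n "$log" ] || { echo "no CVM log found" >&2; return 1; }
    name=$(cvm_name_from_log "$log")
    write_cvm_xml /root/NTNX-VM.xml "$name" "/var/run/libvirt/qemu/$name.xml" || return 1
    virsh define "/var/run/libvirt/qemu/$name.xml" || return 1
    virsh start "$name" || return 1   # VM name, not the .xml file name
    virsh list --all
}
```

Source the file and then run rebuild_cvm by hand, so you can read each error as it comes up instead of letting the script run unattended.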

CR
