This is the second time this has happened. The first time, I did not know how to fix it and did a full reinstall. This time I am trying to find a way to avoid going down the full reinstall route…
The issue
After a power failure in my area, the power may come back for 5-60 seconds and then go off again, and this can happen once or twice in a row. When it does, AHV seems to mark the two NVMe drives as not available (I have one 120GB Western Digital SSD booting AHV - so the boot drive - and two Samsung EVO 1TB NVMes) and because of that, the CVM and all VMs are gone.
Booting off a USB stick with Linux Mint, I can see the drives listed:
All I can find is info that requires the CVM, which is exactly what is gone in this case…
So how can this be fixed? Is it a matter of booting with the Nutanix CE USB and doing a repair? Is there something that can be done on AHV (which, of course, I do have access to) that will 'unflag' the disks as bad (which they are not - if I reinstall the whole thing and start from scratch, the disks can be formatted and work fine after that), allowing these disks to be used as they were before the power outage?
I cannot find a description anywhere of how this is done - I mean, how AHV sets these disks as 'bad' and refuses to mount/see them (while, again, they can be seen by any Linux distro and even the Nutanix CE USB installer).
Thanks a lot all! You guys are my only hope Obi-Wans :-)
CR
Best answer by tsmvp
With the help of a bunch of great Nutanix people, here is the fix. Keep in mind this is FOR CE ONLY and for a SINGLE NODE scenario. This has NOTHING TO DO WITH PRODUCTION systems and should NOT be used on them. For those, open a case with Nutanix so they can assist you with issues on real, production hardware.
With that in mind…
Fix for CVM that disappears after a power failure
SSH into the AHV (I do hope you at least know the AHV IP...)
Find the CVM name:
ls -lah /var/log/libvirt/qemu
You should see a log file named NTNX-<BLOCK_SERIAL>-<NODE>-CVM.log (e.g., NTNX-ae965403-A-CVM.log). This is the CVM VM name (without the .log extension, of course).
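If you want to grab that name without retyping it, a quick one-liner (just a sketch; it assumes the log naming described above) is:
# prints something like NTNX-ae965403-A-CVM
ls /var/log/libvirt/qemu | grep '^NTNX-.*-CVM\.log$' | sed 's/\.log$//'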
Run pwd just to make sure you are in the /root directory.
Make a copy of the NTNX-VM.xml file:
cp NTNX-VM.xml NTNX-ae965403-A-CVM.xml (again, this is the name you got from the log file)
Edit the file (you can use nano, or vi if you hate yourself) and check that the name inside is correct (the second line, which says <name>NTNX-ae965403-A-CVM</name>). If it is, simply exit. If not, change it to match the name you got from the log file.
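If you'd rather not open an editor at all, a possible alternative is to check and rewrite the name with grep/sed (substitute your own CVM name; this assumes the <name> element sits on one line, as described above):
grep '<name>' NTNX-ae965403-A-CVM.xml
sed -i 's|<name>.*</name>|<name>NTNX-ae965403-A-CVM</name>|' NTNX-ae965403-A-CVM.xml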
All good? Now copy that file again, this time into the libvirt runtime directory:
cp NTNX-YOUR_CVM.xml /var/run/libvirt/qemu
Go to that directory and define the VM:
cd /var/run/libvirt/qemu
virsh define NTNX-YOUR_CVM.xml
Now start the VM
virsh start NTNX-YOUR_CVM (note: virsh start takes the VM name, without the .xml extension)
After a bit the VM will be up and running. You can check using:
virsh list --all
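If all went well, the output should look roughly like this (illustrative only; your Id and name will differ):
 Id   Name                  State
-------------------------------------
 1    NTNX-ae965403-A-CVM   running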
Now SSH into the actual CVM, from the AHV:
ssh nutanix@192.168.X.X (where 192.168.X.X is the IP you had for your CVM)
You may need to fix the VLAN (in case the CVM was in a different VLAN, like in my case)
change_cvm_vlan YY (where YY is the VLAN ID)
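Once you are back on the CVM, it is worth confirming that the cluster services come back up. This check is not part of the original steps, but on a single-node CE box the usual commands are:
cluster status
# and, if the services do not come up on their own:
cluster start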
You may also need to fix things so management works (depends on how you set up the AHV, of course):
Just keeping this thread in parallel with the one on Reddit, since it’s the same issue.
I don't think AHV is marking anything as bad; I think it just can't locate the VM definition itself for the CVM (or it is corrupt). lspci will validate whether CE can see the NVMe drives or not.
The VM xml for the CVM is stored in /var/run/libvirt/qemu, I believe (I'm travelling, so I don't have access to my CE cluster right now to double-check). It can be rebuilt if necessary.
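For example, something along these lines from the AHV shell should list the NVMe controllers if the hardware sees them (a generic lspci filter, nothing Nutanix-specific):
lspci | grep -iE 'non-volatile|nvme'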
Ok I got a PDF that explains how to rebuild the XML file. I just did it and rebooted the Supermicro (E200-8D, 128GB RAM, 120GB WD SSD for AHV, two Samsung EVO 1TB for VMs/CVM/etc). Let's see what happens…
Ok, no go. I tried the virsh undefine command after copying the NTNX-CVM.xml file to /var/run/libvirt/qemu (I modified it and copied it over with the proper name, block serial and all) and I get an error:
Not sure what that means LOL. That said, if that does not work, the virsh define also fails and I cannot start the CVM. The storage I think is there, as I can at least see something with lspci:
I assume this shows the hardware can see that I have the two Samsung NVMes…
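Another quick sanity check (again generic Linux, nothing CVM-dependent) is to see whether the kernel actually created block devices for them:
# expect nvme0, nvme0n1, nvme1, nvme1n1 if both drives enumerated
ls -l /dev/nvme*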