Solved

Reimage Cluster with ESXi Failing

  • 21 February 2020

Userlevel 1
Badge +1

I am reimaging my lab cluster after we had some serious problems with a conversion to AHV and a rollback to ESXi.  Two of my three nodes reimaged fine once I put a /firstboot directory into the existing ESXi hosts.  My third node is losing track of where the firstboot directory should be.

 

From the foundation log:

20200221 11:40:17 ERROR Command 'scp -i ~/.ssh/id_rsa /tmp/tmp.gkI7z6j7XG/nutanix_provision_network_utils-1.0-py2.7.egg root@192.168.5.1:./vmfs/volumes/4d501d79-cff63899-7951-75210fae7516/Nutanix ./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg' returned error code 127 stdout: stderr: FIPS mode initialized bash: line 1: ./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg: No such file or directory

20200221 11:40:17 ERROR Failed while copying file /tmp/tmp.gkI7z6j7XG/nutanix_provision_network_utils-1.0-py2.7.egg to host with error Command 'scp -i ~/.ssh/id_rsa /tmp/tmp.gkI7z6j7XG/nutanix_provision_network_utils-1.0-py2.7.egg root@192.168.5.1:./vmfs/volumes/4d501d79-cff63899-7951-75210fae7516/Nutanix ./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg' returned error code 127 stdout: stderr: FIPS mode initialized bash: line 1: ./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg: No such file or directory

20200221 11:40:18 ERROR Exception in running <ImagingStepProvisionNetwork(<NodeConfig(10.252.200.33) @6bf0>) @68d0>
Traceback (most recent call last):
  File "foundation\imaging_step.py", line 161, in _run
  File "foundation\imaging_step_provision_network.py", line 209, in run
  File "foundation\imaging_step_provision_network.py", line 119, in provision_network
StandardError: ('Failed to execute threaded_provision_network on %s, error (%s)', '10.252.200.33', "Command 'scp -i ~/.ssh/id_rsa /tmp/tmp.gkI7z6j7XG/nutanix_provision_network_utils-1.0-py2.7.egg root@192.168.5.1:./vmfs/volumes/4d501d79-cff63899-7951-75210fae7516/Nutanix\n./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg' returned error code 127\nstdout:\n\nstderr:\nFIPS mode initialized\r\nbash: line 1: ./vmfs/volumes/5e4c3359-4b398980-cc34-ac1f6bb9dd8a/Nutanix/firstboot/nutanix_provision_network_utils-1.0-py2.7.egg: No such file or directory\n")

Basically, it is going down the /bootbank/Nutanix/firstboot path instead of /firstboot.  I see the .egg file in the directory where it SCP’s to, but for SOME REASON, it is trying to go to a completely different directory to execute it.  I would love to just wipe this node clean and have AOS and ESXi install from scratch.  Are there any tricks that I can do to make that happen?
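One possible reading of the log, not confirmed anywhere in this thread: in the traceback, the scp destination contains a \n between two /vmfs/volumes/.../Nutanix paths. If the destination string really holds two lines, the shell on the host would treat the second line as a separate command to execute, which would match both "No such file or directory" and exit code 127. The sketch below only reproduces that shell behavior locally. The paths are made-up stand-ins, and echo stands in for the remote side of scp; it does not call Foundation or ESXi.

```python
import subprocess

# Hypothetical stand-ins for the two datastore paths seen in the log.
# Neither path is expected to exist on the machine running this sketch.
first_path = "./vmfs/volumes/example-datastore-a/Nutanix"
second_path = "./vmfs/volumes/example-datastore-b/Nutanix/firstboot/example.egg"

# Assumption: a lookup that matched two directories returned both,
# one per line, and the result was used as a single destination.
destination = first_path + "\n" + second_path

# 'echo' stands in for the command the remote shell would run.
# Because of the embedded newline, the shell sees two lines:
#   line 1: echo copying to <first_path>
#   line 2: <second_path>   (run as a command, which does not exist)
result = subprocess.run(
    "echo copying to " + destination,
    shell=True,
    capture_output=True,
    text=True,
)

print("return code:", result.returncode)
print("stdout:", result.stdout.strip())
print("stderr:", result.stderr.strip())
```

If this reading is right, the thing to check on the third node would be whether more than one datastore contains a Nutanix directory, since the first two nodes did not hit the problem.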


Best answer by JeremyJ 25 February 2020, 17:17



3 replies

Userlevel 6
Badge +5

Hi there!

Please bear with me as I try to help.

Firstly, if I understand correctly, you have created a /firstboot directory within the ESXi file system. This would not matter when converting to AHV, as AHV is installed fresh and independently of the pre-existing hypervisor.

Secondly, I would like to point out that Community Edition is not exactly equal to the full enterprise version. I am sure you understand that anyway.

An unsuccessful conversion of the cluster should be accompanied by a message in Prism inviting you to roll back the conversion. You could execute convert_cluster_status from a CVM in an attempt to find out the actual state of the node as well as any reasons for the imaging failure.

As for wiping the node clean: since it is a lab environment and a three-node cluster, which means evicting a node is not an option, you could re-image the entire cluster.

For some guidance, please see this section of the Prism Web Guide: In-place hypervisor conversion, which includes a Requirements and limitations section, among others.

I apologise for not being able to come up with a more helpful response. Please let us know how you go.

 

Userlevel 3
Badge +4

This is not a known issue with an already-recognized root cause. You should not have to make any manual changes during conversion.

I see a case is opened for this. I agree with that course of action. 

Much like the old method for replacing a failed hypervisor boot drive, it is possible to run a clean ESXi install, then download the Phoenix ISO and use it to run the “configure hypervisor” step. This does not complete any of the other configuration steps for you, so the remaining procedures would need to be completed manually. 

If you have important VMs not yet registered on a successfully converted ESXi host, it may be better to troubleshoot the conversion process. 

Userlevel 1
Badge +1

Firstly, I am running the enterprise 5.10.9 version for the lab, and not CE.  Sorry for not being clear about that.  Also, I see that when I first wrote the post, I messed up my original description.  I was able to “successfully” convert the cluster from AHV back to ESXi with the help of Support, since genesis was bombing out on network configurations it thought had changed but which, to the best of my knowledge, had not.

After getting it back to ESXi, LCM and other software upgrades were not working anymore due to network configuration issues.  That was when I decided to just save our test VMs and re-foundation the cluster.  Two nodes made it fine with just having to add the /firstboot directory so Foundation could do its thing (which may be an ENG that was fixed in 4.5.2).  The third node is the one that had a weird pathing issue.

Jeremy, I see you found the case I opened yesterday.  I am doing the boot-from-phoenix now and installing the CVM.  Once that is complete, I’ll manually run the cluster create.

Thanks all for the feedback!