Solved

Foundation process seems to be uncertain.

  • 17 April 2018

I am new to the Foundation process and still learning. It seems the team has always had problems with the process, and I am trying to make it smoother with the help of the gurus here. The first thing I notice is that if we are "Foundationing" machines that are not new (they have been used), the Foundation process does not discover them. These are Nutanix 3060-G5 machines. Because of that, we have to look up and enter the MAC addresses to have any hope of getting them foundationed. So my question is: why does Foundation not auto-discover them once they are no longer new? We are using Foundation version 3.9.1.

My second question: I notice the following error messages in the log for each node going through the Foundation process:
INFO Validating parameters. This may take few minutes
20180221 15:04:07 DEBUG Unable to ssh using private key
20180221 15:04:10 ERROR Unable to ssh using password
20180221 15:04:10 ERROR Unable to get a ssh session
20180221 15:04:13 ERROR Unable to ssh using password
20180221 15:04:13 ERROR Unable to get a ssh session

I want to know why these messages appear and whether they are related to my first question above. Is there a set username and password in the Foundation process that is no longer on the Nutanix block? If so, is there a way to sync them or make this process easier?

Currently, with these Foundation issues, it is taking all day to do each of our 5 four-node blocks, and the process should not be this time-consuming. Any help in making this a better process would be appreciated.

Best answer by cameron 17 April 2018, 23:03

Hi

  1. Always use the latest version of Foundation. 3.9.1 is now quite old; the latest is 4.0.2 at this time. In the 3.12+ versions, a lot of improvements have been made in the UI to guide users to where issues may lie. As newer models of Nutanix hardware are released, you'll need the newest versions anyway.
  2. Nodes will only ever broadcast that they can be 'discovered' when they are NOT part of an existing cluster. If you want to re-discover old nodes that used to be part of a working cluster, you need to destroy the cluster first. Search for "Destroying a Cluster" in the Advanced Admin Guide on the portal. Once that is done, you can discover the nodes again as if they had arrived fresh from the factory. This is done, of course, so that people do not accidentally blow away cluster nodes that are running in production. That would be a bad day.
  3. The 'ssh' errors above are cosmetic. I am sure we have fixed this in a newer release (or have a ticket open to remove them from the logs), so you can safely ignore them. The newer versions of Foundation have much better logging capability.
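
If the ssh noise makes the log hard to read in the meantime, a small filter can hide it. This is only a sketch: it assumes every line has the shape shown in the excerpt in the question (date, time, level, message), the level names other than DEBUG, INFO and ERROR are guesses, and the last sample line is made up for illustration.

```python
import re

# Assumed line shape, copied from the excerpt in the question:
# "20180221 15:04:10 ERROR Unable to ssh using password"
LOG_LINE = re.compile(
    r"^(\d{8} \d{2}:\d{2}:\d{2}) (DEBUG|INFO|WARNING|ERROR|CRITICAL) (.*)$"
)

# Messages treated as noise in this sketch (the ssh lines quoted in the question).
IGNORED = (
    "Unable to ssh using private key",
    "Unable to ssh using password",
    "Unable to get a ssh session",
)

def interesting_lines(lines):
    """Return (timestamp, level, message) for non-DEBUG/INFO lines not in IGNORED."""
    kept = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed format
        timestamp, level, message = match.groups()
        if level in ("DEBUG", "INFO"):
            continue
        if message in IGNORED:
            continue
        kept.append((timestamp, level, message))
    return kept

sample = [
    "20180221 15:04:07 DEBUG Unable to ssh using private key",
    "20180221 15:04:10 ERROR Unable to ssh using password",
    "20180221 15:04:10 ERROR Unable to get a ssh session",
    "20180221 15:05:00 ERROR Example failure worth reading",  # hypothetical line
]

for entry in interesting_lines(sample):
    print(entry)
```

Lines without a timestamp (like the "Validating parameters" line in the excerpt) are skipped by this filter, so adjust the pattern if you want to keep them.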
Cheers
Cameron

7 replies

Hey Gizmo,

First -- apologies that you're running into this!

To answer the first question you had, specifically around why a node is no longer discoverable after attempting to foundation it:

1) When we discover a node on the network, we do that by communicating with the CVM that is pre-provisioned in the factory on that node.
2) The Foundation process re-installs the CVM and/or the hypervisor, depending on your situation. If the Foundation process stopped part-way through, the node is likely stuck in a state where the CVM is not reachable. There are a couple of reasons this could occur:
2a) Host is booted into Phoenix. Phoenix is our recovery utility that we use for imaging during foundation and for manual recovery procedures.
2b) CVM on that host has become unreachable for some reason.
3) This can be alleviated by manually entering the MAC address found on the back of the unit, as you've been doing. This effectively removes the "discovery" portion of the foundation process and just connects you to the node directly.

Regarding the SSH errors you're seeing, it's hard to say what would cause them. If you'd like, I'd be happy to sit with you or a member of your team and look into this over a WebEx. Just respond to this message and we'll figure out the best way to do that.

Regards,
Stephan Mercatoris - Staff Systems Reliability Engineer @ Nutanix
Hi @Gizmo take a look at the replies from @cameron and @smerc

If they help, consider clicking the 'like' and 'best answer' link on the reply - that will help others find answers to similar questions much quicker, Thanks 👍
Cameron,
Thank you for your response. With your encouragement, I am certainly going to try to get the team here to upgrade the standalone Foundation laptop to 4.0.2. Hopefully, as you said, that will clean up the log so I can nitpick at the relevant items in trying to make this a less painful process. I will find the guide for that version so I can better understand the product.

All of the blocks I am working with were working just fine in an old cluster. That cluster was destroyed and taken away before these blocks were moved to their new home. According to what you said, they should be discoverable now, but when I bring up Foundation it never shows any blocks and I always have to add blocks manually. I have also noticed that if I set up the Foundation process to include the MAC addresses of the IPMI ports, the process ends up failing. If I take the time to boot each machine, go into the BIOS, and set the IPMI IP to the expected new IPMI IP, then the process will succeed. Each rack has 8 blocks with 4 nodes, and I have MANY racks of these, so as you can imagine that takes a long time. If I don't set the IPMI IP address, it is like some take and some don't. That means I have to wait for the ones that took to finish, then hit retry for the missed machines and catch them on the next 1 or 2 go-arounds. It almost seems like the process is not timed right to set the IPMI address on its own. Any clue why this is so unpredictable if I don't set the IPMI address?

@smerc
Thank you for your response. As I said above, none of the CVMs were stuck in an unknown state. All of these have ESXi as the hypervisor and were shut down from a working unit. Setting the MAC address alone does not seem to be reliable, and the only way to get them all to go through on the first round is for me to go into the Nutanix 3060 BIOS and preset the IPMI IP. I will try it again to see if I get the same results, but I am betting that if I don't set the IP address, not all of them will go through the first time around. Suggestions?
The "bare metal" IPMI/MAC address method assumes your laptop is on the same L2 network as the nodes' IPMI ports.

Details are in the Field Installation Guide: http://download.nutanix.com/documentation/Foundation_v4_0/Field-Installation-Guide-v4-0.pdf
Cameron,
We are using a Cisco SG102-24 unmanaged switch, which has no VLANs, so everything is in the same L2 network. I read in the field guide that Foundation uses IPv6 to discover things. Is IPv6 needed to make discovery work? I am not sure whether this low-level Cisco SG102 even supports IPv6.

I would also like to ask whether Foundation has a limit on the number of blocks or nodes it can do at once. I have tried 5 blocks of 4, but it would save a huge amount of time if I could do more blocks at a time. Thoughts?
Discovery requires IPv6 multicast.

The "Bare-Metal" / IPMI MAC address method (see the "Imaging Bare Metal Nodes" section in that doc) does not require it, as you are not discovering anything. I'm not sure what that particular Cisco switch may be blocking by default.

There is no real limit on the number of nodes, other than that you need to take into account that the required files (hypervisor image, any new AOS image) are copied to each node in parallel, so at scale 10Gbit is recommended to keep the time taken low. I've seen people do 80+ nodes at once on 10Gbit.
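
To get a feel for why link speed matters at scale, here is a rough back-of-the-envelope sketch. It assumes the Foundation machine's single network link is the bottleneck and that all copies share it evenly. The 6 GB payload per node and the 0.8 efficiency factor are made-up illustration values, not figures from Nutanix, and `copy_time_minutes` is a hypothetical helper.

```python
def copy_time_minutes(nodes, payload_gb, link_gbit, efficiency=0.8):
    """Rough time to push payload_gb to each of `nodes` nodes over one shared link.

    Assumes the imaging machine's link is the only bottleneck; `efficiency`
    is a guessed factor for protocol overhead.
    """
    total_gbit = nodes * payload_gb * 8  # gigabytes -> gigabits
    seconds = total_gbit / (link_gbit * efficiency)
    return seconds / 60

# Hypothetical 6 GB per node (hypervisor + AOS images); not a figure from this thread.
for nodes in (20, 80):
    for link in (1, 10):
        minutes = copy_time_minutes(nodes, 6, link)
        print(f"{nodes} nodes on {link} Gbit: ~{minutes:.0f} min of copying")
```

Under these assumptions, the copy step for 80 nodes comes out at roughly 80 minutes on 1Gbit versus about 8 minutes on 10Gbit. Real runs also include imaging and reboot time that this estimate ignores.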
