I was testing link failover and it did not work as expected. I have configured Active-Standby, and when I pull the cable from NIC1, traffic does not fail over to the standby link. After reconnecting the NIC, traffic was restored. I just want to understand the failover scenario for the NICs.

 

Note : Time seems to be correct in every CVM

 

Once the traffic is back, I try to access the CVM via the virtual IP and get the error below

 

 

Can you show us a screenshot from the network Dashboard in prism element? 


I am unable to log in to Prism Element, so I am not sure how I can show it. Can you guide me further?


Hi there, based on the error:

“upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED”

 

This appears to be a TLS/SSL verification error when accessing Prism via a virtual IP (VIP). The connection is failing due to certificate validation issues, likely triggered by:

  1. Untrusted or expired certificate on the CVM hosting the VIP.
  2. DNS or VIP misrouting, causing a mismatch between the expected hostname and the certificate CN/SAN.
  3. Client-side certificate validation strictness (e.g., using curl or gRPC libraries with hardened TLS settings).
  4. Virtual IP being reassigned to a different CVM that doesn't have the SSL state correctly initialized, especially after your failover event.

 

Action

1. Bypass TLS Verification Temporarily (Only for Testing)

If you're using a script or CLI and not a browser, you can test:

curl -k https://<VIP>:9440

If that works, it confirms that TLS certificate validation is the blocker.

2. Check Certificate Validity on VIP-Holding CVM

SSH to the CVM currently holding the VIP and run:

openssl s_client -connect localhost:9440 -showcerts

Look for:

  • Certificate subject (CN)
  • Certificate issuer
  • Certificate expiration date
  • Errors like self signed certificate, unable to verify the first certificate, or expired
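
As a rough sketch of step 2, the s_client output can be piped into openssl x509 so that only the subject, issuer, and validity dates are printed. The helper name cert_field is hypothetical and only filters that text; it assumes the key=value line format that openssl x509 -noout prints.

```shell
# Hypothetical helper: print the value of one field (subject, issuer,
# notBefore, notAfter) from `openssl x509 -noout ...` output on stdin.
cert_field() {
  sed -n "s/^$1=//p"
}

# Assumed usage on the CVM (not run here):
#   openssl s_client -connect localhost:9440 </dev/null 2>/dev/null \
#     | openssl x509 -noout -subject -issuer -dates \
#     | cert_field notAfter
```

Comparing the notBefore and notAfter values against the current clock on both the CVM and the client shows quickly whether the certificate is outside its validity window from either side's point of view.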

3. Force VIP Failover or Restart Prism Services

You can rotate the VIP and restart Prism on all CVMs:

allssh 'genesis restart'

Or only rotate VIP:

genesis --rotate_virtual_ip

This can reset Prism Envoy and regenerate proper routing.

4. Check VIP Ownership Consistency

genesis status | grep -i virtual

Ensure the CVM showing VIP ownership is healthy, has correct SSL bindings, and its prism service is running.

5. Browser Trust (if GUI)

If you're seeing this from a browser:

  • Open browser developer tools, view the certificate (padlock icon), and inspect CN/SAN.
  • Ensure the URL matches the certificate CN (common with IP vs hostname mismatch).
  • Add the CVM certificate to trusted cert store only if it's internal CA signed.

 

Permanent Fix Options

A. Replace the SSL Certificate

You can upload a new certificate signed by a trusted CA (or internal CA):

  1. Prepare .crt and .key files.
  2. Use Prism Element GUI or CLI:

ncli ssl-certificate import certificate-path=/tmp/your_cert.crt key-path=/tmp/your_key.key

B. Disable TLS Validation in Controlled Environments

Only in air-gapped or test environments, you can disable TLS client verification in automation/scripts. Never recommended for production-facing apps.

 

Note

This is not directly a failover bug, but likely the result of VIP reassignment after failover interacting with:

  • An invalid/self-signed certificate
  • A CVM missing proper SSL service state
  • Or a client expecting a trusted cert (which the new VIP owner doesn’t present properly)

Hope this helps...


I have logged in to Prism Central, and when I go to the Subnets tab it shows the same error

 

 


What you can do is reset/generate a new selfsigned certificate on the cluster via: https://portal.nutanix.com/page/documents/details?targetId=Nutanix-Security-Guide-v7_3:sec-security-self-signed-certificate-wc-t.html

 


The link provided by you is not understandable unfortunately. I cannot do it.


Hi Himanschu, just to give a bit of clarity on what has happened here: your NIC failover likely broke part of the trust path. Now Prism Central can’t securely talk to one of the CVMs (or the VIP), because the cert it’s getting back is either invalid or doesn’t match what it expects. Restarting Genesis and making sure the cert is valid usually clears it.

The same TLS error (CERTIFICATE_VERIFY_FAILED) is now happening within Prism Central, specifically when trying to load networking configuration (subnets, internal interfaces, virtual switches). This suggests:

  • Prism Central is attempting a secure gRPC/REST API call to a CVM or other Prism Element node, and the certificate presented is not valid or trusted.
  • This is not just a client-to-UI error, but a control plane internal trust issue between Prism Central and cluster services (likely over port 9440).
  • It’s likely triggered by recent failover or incorrect VIP handoff, which exposed an untrusted cert or stale service binding.

 

What this Confirms:

  • Your NIC failover did not complete cleanly.
  • As a result, the CVM that now holds a role (possibly VIP or service endpoint) is presenting either a self-signed, expired, or mismatched certificate.
  • Prism Central cannot establish trusted communication with the back-end component (e.g., Prism Element on the CVM).

 

Recommended Action

1. Check Certificate State on Affected CVM(s)

SSH to the CVM (or use the one hosting the VIP) and run:

openssl s_client -connect localhost:9440 -showcerts

Look for:

  • CN/SAN mismatches (e.g., IP used instead of FQDN)
  • Expired certs
  • Self-signed certs not trusted by the Prism Central trust store

2. Restart Genesis + Prism Services

To clear stale state and refresh VIP and certificate bindings:

allssh 'genesis restart && sudo service prism start'

Note: This is safe and non-disruptive in healthy clusters.

3. Check VIP Ownership + Accessibility

On any CVM:

genesis status | grep -i virtual

Verify that:

  • The VIP is assigned to a healthy CVM.
  • The CVM can serve valid HTTPS (try curl -k https://<VIP>:9440 for test).

4. Replace or Reimport SSL Certs if Needed

If you're using self-signed or internal certs, consider reuploading them:

ncli ssl-certificate import certificate-path=/tmp/cert.crt key-path=/tmp/key.key

Or do it via Prism GUI → Settings > SSL Certificate.

 


 

Point no. 2: screenshot

Point no. 3: Does not show anything in the CVM CLI

 

Point no. 4: I am unable to open the webpage

 


I am under very critical situation, help please


This is now very clearly a Genesis-level failure, not just a UI or cert issue.

Let me try to summarize what’s happening based on your screenshot:

 

What you are seeing:

  1. You're running:

allssh 'genesis restart && sudo service prism start'

  2. Genesis restarts successfully.
  3. prism.service fails with:

Failed to start prism.service: Unit prism.service not found.

  4. Running:

genesis status | grep -i virtual

...returns no VIP ownership or output.

 

What This Tells Us:

  • The prism.service you tried to start doesn’t exist because Prism is part of the genesis process tree and not managed via systemd on the CVMs.
  • The VIP assignment has not returned, likely due to a networking fault, misconfiguration, or certificate state issue that’s preventing Genesis from successfully reassigning it.

Next Steps

1. Skip the prism.service start; it's not needed.

Instead, just use:

allssh 'genesis restart'

Genesis handles Prism startup automatically on CVMs.

 

2. Check VIP assignment manually

Run this on each CVM:

ip a | grep 9440

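Look for which CVM (if any) holds the VIP. Note that ip a lists interface addresses, not ports, so a grep for 9440 will normally return nothing; grepping for the VIP address itself is the meaningful check.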

Also, try:

host <your-prism-vip>

and

ping <VIP>

to see if it's reachable.

 

3. Force VIP Reassignment (advanced)

You can try rotating the VIP manually:

genesis --rotate_virtual_ip

If that fails silently, try:

manage_ovs show_uplinks

...to ensure uplinks and bonds are healthy on each host.

4. Check Genesis Health Deeply

If Prism VIP isn’t coming back after restarting Genesis, check the Genesis logs:

cat ~/data/logs/genesis.out | grep -i vip

You're looking for lines like:

Assigning virtual IP ...

Binding VIP ...

Or errors like:

Failed to bind VIP due to ...

 

This is now beyond a simple cert issue; you have a VIP that’s not assigned, likely due to one of the following:

  • Bonding failure (interface not healthy or link not detected).
  • Misconfigured uplinks (Genesis can't bind VIP).
  • Inconsistent Genesis state (restart didn’t properly re-elect a CVM leader).

If this doesn’t resolve the issue and it's a production cluster, then I would urge you to contact Nutanix Support.


  1. All Genesis services restarted successfully.
  2. ip a | grep 9440 does not show anything.
  3. genesis --rotate_virtual_ip prints a lot of text; what should I look for in it?
    manage_ovs show_uplinks shows the output below.
  4. cat ~/data/logs/genesis.out | grep -i vip shows the output below.

     


This confirms Genesis and Prism services are up, but the VIP is not assigned, and that’s your root problem now.

 

Confirmed is:

  • br0 is in active-backup mode with eth0 and eth1 and no LACP; this is fine.
  • Genesis logs show VipMonitorService started, which is responsible for managing VIPs.
  • ip a | grep 9440 shows nothing. On its own this is inconclusive, because ip a lists addresses rather than ports, but so far no CVM has been shown to hold the VIP.
  • genesis --rotate_virtual_ip runs, but you’re unsure what to look for in the output.

 

What to Check Right Now:

1. Check Output of genesis --rotate_virtual_ip

Scroll through the output and look for lines like:

  • Releasing VIP from this node...
  • Assigning VIP 192.168.x.x to this node
  • Failed to assign VIP...
  • Unable to bind VIP to interface br0...

If there is any failure to bind, it will show there. Let me know or paste that snippet.

 

2. Check All CVMs for VIP via IP and Port

Run this on each CVM individually:

ip a | grep 192.168.249

Replace 192.168.249 with your VIP subnet. This ensures it's not hiding on another CVM.
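
A plain grep on the subnet prefix also matches every CVM's own address, so it cannot tell whether the VIP itself is bound. A minimal sketch of an exact-match check follows; has_ip is a hypothetical helper name, and it assumes the one-line format of ip -o -4 addr show, where the fourth field is address/prefix.

```shell
# Hypothetical helper: succeed only if the exact IPv4 address $1 appears
# in `ip -o -4 addr show` output read from stdin.
has_ip() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 }
                  END { exit !found }'
}

# Assumed usage on each CVM (replace <VIP> with the cluster virtual IP):
#   ip -o -4 addr show | has_ip <VIP> && echo "VIP is bound here"
```

Because the comparison is on the whole address, a CVM at 192.168.249.11 does not produce a false match for a VIP of 192.168.249.1.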

 

3. Check Uplink Health

You said manage_ovs show_uplinks shows “below”  but I can’t see that detail. You should look for:

  • State: up
  • A valid MAC address on each interface
  • Interface not disabled or down

If any bond member (eth0 or eth1) is showing down or link: false, Genesis may refuse to assign the VIP for safety.

 

Recovery Options:

If everything seems healthy but VIP is still unassigned, you can manually assign it temporarily for access (for emergency access only):

sudo ip addr add <VIP>/24 dev br0

Important: Do this on only one CVM, and only if you're confident it's safe.

Once you're in Prism:

  • Check Cluster > Settings > IP Addresses
  • Validate the VIP is properly configured there.

Then later you can let Genesis take over again by:

sudo ip addr del <VIP>/24 dev br0

genesis restart

 

TL;DR

Your system is healthy except VIP isn’t getting assigned. That’s likely due to:

  • Bond/interface state not healthy
  • Genesis logic refusing to bind
  • Network condition preventing VIP ARP announcement

Note, as I mentioned previously, if this is a “Production” Cluster you should seek Nutanix Support


Please involve support. That is the way to go. The AI answers given here can break more than you want.



  1. Nothing can be found in the genesis --rotate_virtual_ip output.
  2. The IP is properly assigned to the CVMs.
  3. sudo ip addr add <VIP>/24 dev br0: what are we doing here?
  4. Note: I can ping the virtual IP address.

 

 

I raised a ticket.


Interesting observation... but as I’ve mentioned earlier, if this is a production environment, then yes as JeroenTielen has also mentioned, engaging Nutanix Support is absolutely the right approach.


Hello Team,

 

Thank you so much for jumping in here and giving useful suggestions and feedback. The issue was finally resolved after logging a case with Nutanix. The actual issue was that, somehow after the network testing, the CVM time moved ahead of the PC time, which caused the browser access issue. After waiting for the time to catch up with the "not before" time, access was restored automatically.
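
For anyone hitting the same error, the resolution can be illustrated with a small check: a certificate is rejected while the verifying side's clock is still earlier than the certificate's notBefore time. This is only a sketch; cert_window_status is a hypothetical helper that compares epoch seconds, and the commented commands for obtaining those values assume GNU date and a certificate saved to a placeholder file.

```shell
# Hypothetical helper: classify a clock reading against a certificate's
# validity window. Arguments (all epoch seconds):
#   $1 = notBefore, $2 = notAfter, $3 = current time on the verifying side
cert_window_status() {
  if [ "$3" -lt "$1" ]; then
    echo "not-yet-valid"   # clock is earlier than notBefore
  elif [ "$3" -gt "$2" ]; then
    echo "expired"         # clock is later than notAfter
  else
    echo "valid"
  fi
}

# Assumed usage (cert.pem is a placeholder path; GNU date assumed):
#   nb=$(date -d "$(openssl x509 -in cert.pem -noout -startdate | cut -d= -f2)" +%s)
#   na=$(date -d "$(openssl x509 -in cert.pem -noout -enddate | cut -d= -f2)" +%s)
#   cert_window_status "$nb" "$na" "$(date +%s)"
```

A "not-yet-valid" result on one machine and "valid" on another points to clock drift between them rather than to a bad certificate.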