Recurring Daily Cluster Crash – Seeking Advice on Automatic Failover and Service Restarts | Nutanix Community
Question




Hello everyone,

I am currently experiencing a recurring issue with my Nutanix cluster: every day at exactly 9:35 PM, the cluster crashes and then restarts a short time afterward. Once it comes back online, the system automatically initiates “Repairing metadata for node…” tasks on all three nodes.

While checking the logs, I noticed the following events:

  • Power on VM after failover recovery
  • Multiple service restarts of acropolis, anduril, aplos, aplos_engine, catalog, cluster_config, delphi, ergon, flow, lazan, minerva_cvm, and uhura across all CVMs.
  • The most recent crashes for these services occurred on January 8, 2025, between 9:31 PM and 9:34 PM.
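
To check whether restarts like these really cluster around the same time each day, one option is to bucket the restart timestamps by time of day. The following is only a minimal sketch: the function name `restart_minutes`, the `YYYY-MM-DD HH:MM:SS` line prefix, and the sample lines are all made up for illustration and are not actual Nutanix log output, so the format string would need adjusting to whatever the exported events look like.

```python
from collections import Counter
from datetime import datetime


def restart_minutes(lines, fmt="%Y-%m-%d %H:%M:%S"):
    """Count events per time of day (HH:MM), ignoring the date.

    Assumes each relevant line starts with a 19-character timestamp
    in the given format; lines that do not parse are skipped.
    """
    counts = Counter()
    for line in lines:
        stamp = line[:19]
        try:
            when = datetime.strptime(stamp, fmt)
        except ValueError:
            continue  # not a timestamped line in the assumed format
        counts[when.strftime("%H:%M")] += 1
    return counts


# Hypothetical sample lines, not real cluster output.
sample = [
    "2025-01-07 21:32:10 acropolis restarted",
    "2025-01-08 21:32:45 aplos restarted",
    "2025-01-08 21:33:05 ergon restarted",
    "not a log line",
]
counts = restart_minutes(sample)
for minute, n in sorted(counts.items()):
    print(minute, n)
```

If most events land in the same one or two minutes across several days, that points toward something scheduled (a cron job, backup window, or log rotation) rather than a random fault.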

Has anyone encountered similar behavior, or does anyone have suggestions on how to resolve it? Any help or insight is greatly appreciated, as I'm struggling to pinpoint the root cause of these daily crashes.

Thank you in advance for your assistance!

2 replies

  • Voyager
  • 1 reply
  • January 8, 2025

What AOS version are you using? We encountered cluster crashes caused by the use of TCP rsyslog on AOS 6.8 and 6.10. The issue is supposed to be fixed in AOS 7, but we have not yet verified that in our environment.
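
For anyone wanting to tell TCP from UDP forwarding in a plain rsyslog configuration, here is a rough sketch. It relies only on standard rsyslog syntax (a legacy `@@host` target means TCP, a single `@host` means UDP, and the `omfwd` action takes `protocol="tcp"`). It assumes the forwarding rules are available as plain text lines, which may not match how a Nutanix cluster stores its remote syslog settings, and the helper name `uses_tcp_forwarding` and the host `loghost.example` are hypothetical.

```python
import re

# Legacy rsyslog syntax: "<selector> @@host:port" forwards over TCP,
# while a single "@" forwards over UDP.
LEGACY_TCP = re.compile(r"^\s*\S+\s+@@\S+")
# RainerScript syntax: action(type="omfwd" ... protocol="tcp")
OMFWD_TCP = re.compile(r'protocol\s*=\s*"tcp"', re.IGNORECASE)


def uses_tcp_forwarding(line):
    """Return True if a single config line forwards logs over TCP."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return False  # blank line or comment
    if LEGACY_TCP.match(stripped):
        return True
    return "omfwd" in stripped and bool(OMFWD_TCP.search(stripped))


# Hypothetical example lines.
examples = [
    "*.* @@loghost.example:514",
    "*.* @loghost.example:514",
    'action(type="omfwd" target="loghost.example" port="514" protocol="tcp")',
    "# *.* @@loghost.example:514",
]
results = [uses_tcp_forwarding(e) for e in examples]
```

Only the first and third example lines are reported as TCP forwarding; the UDP rule and the commented-out rule are not.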


  • Author
  • Voyager
  • 2 replies
  • January 13, 2025

Hello ASmith2,

Thank you for your response.

Currently, our cluster is running the following versions:

  • Prism Central: pc.2024.3
  • AHV: v.20230302.102005
  • AOS: 6.10.0.5

We are experiencing the issue you described with all virtual machines, regardless of the operating system in use.

Your mention of TCP rsyslog on AOS 6.10 causing similar issues is quite relevant. While we are aware this was flagged as resolved in AOS 7, we have not yet upgraded our environment.

Thank you for your insights and guidance!

Best regards,

