Question

AOS 6.8.X - Cerebro WAL stuck in pending action / Crash loop

  • May 2, 2026

Hi Experts!

We recently upgraded one of our Nutanix clusters to AOS 6.8.1.8 and started facing an issue with the Cerebro service shortly after.

Symptoms observed:

  • Remote replication alerts after the upgrade.
  • Remote site configuration warning indicating that some CVM/SVM IPs were not properly configured on the peer site.
  • Cerebro apparently entering a crash loop.
  • cluster status showing constantly changing/high PIDs for Cerebro.
  • cerebro.FATAL reporting an error similar to:
    Check failed: citer != (pd.second)->snapshot_uuid_map().end()
    with a snapshot stuck in a pending action.
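
To confirm the changing-PID symptom rather than eyeball it, a minimal sketch like the one below could compare a few saved `cluster status` outputs taken some minutes apart. The line shape it expects (`<Service>  UP  [pid, pid, ...]`) is an assumption about the output format, and `parse_pids` and `count_restarts` are hypothetical helper names, not part of any Nutanix tooling.

```python
import re

# Assumed line shape: "<Service>   UP   [pid, pid, ...]".
# This is a guess at the `cluster status` layout; verify it on your CVM.
SERVICE_LINE = re.compile(r"^\s*(\w+)\s+(UP|DOWN)\s+\[([\d,\s]*)\]")


def parse_pids(status_text, service="Cerebro"):
    """Return the set of PIDs listed for `service` in one saved status output."""
    pids = set()
    for line in status_text.splitlines():
        match = SERVICE_LINE.match(line)
        if match and match.group(1) == service:
            # Skip empty fragments so an empty "[]" list yields no PIDs.
            pids.update(int(p) for p in match.group(3).split(",") if p.strip())
    return pids


def count_restarts(snapshots, service="Cerebro"):
    """Count how many consecutive snapshots show a different PID set."""
    pid_sets = [parse_pids(text, service) for text in snapshots]
    return sum(1 for prev, cur in zip(pid_sets, pid_sets[1:]) if prev != cur)
```

Feeding it three or four outputs captured a few minutes apart and getting a count close to the number of intervals would be consistent with a crash loop; a count of zero would suggest the service is stable and the alerts have another cause.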

It looks like there is a stale or orphaned snapshot operation in the WAL / metadata, likely triggered after the upgrade while replication configuration was not fully consistent.

Has anyone already seen this behavior after an AOS upgrade?
Did you resolve it by fixing the remote site configuration only, or did it require Nutanix Support intervention to skip/clean the offending WAL operation?

Thanks :)

1 reply

jarrodl
  • Vanguard
  • May 2, 2026

Looking through the documentation, I found these articles. 
Do any of these sound like the issue you are having?

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO000000CXkz0AG
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LTJPCA4

Everything I can find related to a Cerebro crash loop suggests reaching out to Nutanix Support.