Question

AOS 6.8.X - Cerebro WAL stuck in pending action / Crash loop

  • May 2, 2026
  • 4 replies
  • 38 views

Hi Experts!

We recently upgraded one of our Nutanix clusters to AOS 6.8.1.8 and started facing an issue with the Cerebro service shortly after.

Symptoms observed:

  • Remote replication alerts after the upgrade.
  • Remote site configuration warning indicating that some CVM/SVM IPs were not properly configured on the peer site.
  • Cerebro entering a crash loop.
  • cluster status showing constantly changing/high PIDs for Cerebro.
  • cerebro.FATAL reporting an error similar to:
    Check failed: citer != (pd.second)->snapshot_uuid_map().end()
    with a snapshot stuck in a pending action.

It looks like there is a stale or orphaned snapshot operation in the WAL / metadata, likely triggered after the upgrade while the replication configuration was not fully consistent.
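For what it's worth, the "constantly changing PIDs" symptom can be checked mechanically rather than by eye. A minimal sketch of the idea (the PID samples below are illustrative, not real `cluster status` output):

```python
# Minimal sketch: decide whether a service is crash-looping from
# repeated PID samples, e.g. taken from successive `cluster status` runs.
# The sample data is illustrative, not real cluster output.

def is_crash_looping(pid_samples, min_changes=2):
    """Return True if the service PID changed at least `min_changes` times."""
    changes = sum(1 for a, b in zip(pid_samples, pid_samples[1:]) if a != b)
    return changes >= min_changes

# A stable service keeps one PID; a crash-looping one gets a new PID
# each time genesis restarts it.
stable = [4312, 4312, 4312, 4312]
looping = [4312, 5120, 5987, 6741]

print(is_crash_looping(stable))   # False
print(is_crash_looping(looping))  # True
```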

Has anyone already seen this behavior after an AOS upgrade?
Did you resolve it by fixing the remote site configuration only, or did it require Nutanix Support intervention to skip/clean the offending WAL operation?

Thanks :)

4 replies

jarrodl
  • Vanguard
  • May 2, 2026

Looking through the documentation, I found these articles. 
Do any of these sound like the issue you are having?

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO000000CXkz0AG
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LTJPCA4

Everything I see related to a Cerebro crash loop suggests reaching out to Support.


LMohammed
  • Trailblazer
  • May 5, 2026

Hi @AbdelazizO,

The Nutanix Engineering team is aware of the issue and is working on a solution. Please contact Nutanix Support for the workaround.

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO000000CXkz0AG

Is there any chance to upgrade to a newer AOS version?


  • Author
  • Voyager
  • May 6, 2026

Hello, 

It was a known race condition (ENG-679407) between the Garbage Collector and a replication task right after the AOS upgrade. Basically, the GC deleted a snapshot that was still tagged in a pending replication.

This corrupted the WAL. Every time Cerebro booted up, IsSnapshotInUse() tripped over the missing snapshot and crashed the service (Check failed: citer != ...).
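The failing CHECK is essentially a map lookup that assumes the snapshot still exists. Here's a rough Python analogue of what happens on every restart (the data structures and names are assumptions for illustration, not Cerebro's actual internals):

```python
# Rough analogue of the failed CHECK: WAL replay looks up a snapshot UUID
# that the Garbage Collector already deleted, and the lookup-must-succeed
# assertion kills the process. Names and structures are illustrative only.

class FatalCheckError(Exception):
    """Stand-in for a fatal CHECK failure that crashes the service."""

def is_snapshot_in_use(snapshot_uuid_map, snapshot_uuid):
    # Analogue of: Check failed: citer != (pd.second)->snapshot_uuid_map().end()
    if snapshot_uuid not in snapshot_uuid_map:
        raise FatalCheckError(
            f"Check failed: snapshot {snapshot_uuid} missing from uuid map"
        )
    return snapshot_uuid_map[snapshot_uuid]["in_use"]

# The GC removed "snap-42" while a pending replication still referenced it:
uuid_map = {"snap-7": {"in_use": False}}  # "snap-42" is gone

try:
    is_snapshot_in_use(uuid_map, "snap-42")
except FatalCheckError as e:
    print(e)  # service crashes here, genesis restarts it, and the loop repeats
```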

How the SRE fixed it:

  • Stopped Cerebro everywhere (genesis stop cerebro).
  • Dumped the WAL database to grab the exact MetaOp ID of the stuck task.
  • Injected an internal GFLAG so Cerebro ignores that specific operation on startup.
  • Started Cerebro. It bypassed the bad entry, finished its WAL recovery, and purged the ghost task automatically.
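To make step 2 concrete, here is a sketch of pulling the stuck MetaOp ID out of a WAL dump. The dump format and field names below are made-up placeholders (the real dump comes from internal support tooling, and the gflag injection is support-only), so treat this purely as an illustration of the idea:

```python
# Sketch of step 2: find the MetaOp stuck in a pending action in a WAL dump.
# The dump format here is a hypothetical placeholder; the real output of the
# internal tooling may look nothing like this.
import re

def find_stuck_meta_op(wal_dump_text):
    """Return the MetaOp ID of the first entry stuck in a pending action."""
    for line in wal_dump_text.splitlines():
        m = re.search(r"meta_op_id=(\d+).*state=kPendingAction", line)
        if m:
            return int(m.group(1))
    return None

dump = """\
meta_op_id=10021 type=SnapshotReplicate state=kComplete
meta_op_id=10457 type=SnapshotReplicate state=kPendingAction snapshot=snap-42
meta_op_id=10458 type=SnapshotGC state=kComplete
"""

print(find_stuck_meta_op(dump))  # 10457
```

That ID is what the SRE fed into the internal gflag so Cerebro skips the bad entry on startup.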

 


jarrodl
  • Vanguard
  • May 6, 2026

Glad to hear you got it resolved and thanks for the update.