Question

AOS 6.8.X - Cerebro WAL stuck in pending action / Crash loop

Forum|Forum|2 months ago
May 2, 2026
4 replies
91 views

AbdelazizO
Voyager

Hi Experts!

We recently upgraded one of our Nutanix clusters to AOS 6.8.1.8 and started facing an issue with the Cerebro service shortly after.

Symptoms observed:

Remote replication alerts after the upgrade.

Remote site configuration warning indicating that some CVM/SVM IPs were not properly configured on the peer site.

Cerebro entering a crash loop: cluster status shows constantly changing, high PIDs for Cerebro.

cerebro.FATAL reporting an error similar to:
Check failed: citer != (pd.second)->snapshot_uuid_map().end()
with a snapshot stuck in a pending action.
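
For anyone who wants to confirm the crash-loop symptom rather than eyeball it, here is a minimal sketch. It assumes you have noted the Cerebro PID from cluster status by hand a few times (say, once a minute). The function name, the threshold, and the sample PIDs are all made up for illustration; nothing here parses real cluster status output.

```python
def looks_like_crash_loop(pid_samples, min_restarts=3):
    """Return True if the PID changed at least min_restarts times across samples."""
    # Each change between consecutive samples means the service restarted.
    restarts = sum(
        1 for prev, cur in zip(pid_samples, pid_samples[1:]) if prev != cur
    )
    return restarts >= min_restarts


# Hypothetical Cerebro PIDs, noted by hand at regular intervals.
stable = [4121, 4121, 4121, 4121]
flapping = [4121, 9377, 15502, 21840]

print(looks_like_crash_loop(stable))    # False: the PID never changed
print(looks_like_crash_loop(flapping))  # True: three restarts in four samples
```

A single PID change can just be a one-off restart, which is why the sketch asks for several changes before calling it a loop.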

It looks like there is a stale or orphaned snapshot operation in the WAL / metadata, likely triggered after the upgrade while replication configuration was not fully consistent.

Has anyone already seen this behavior after an AOS upgrade?
Did you resolve it by fixing the remote site configuration only, or did it require Nutanix Support intervention to skip/clean the offending WAL operation?

Thanks :)

This topic has been closed for replies.

jarrodl
Vanguard
Forum|Forum|2 months ago
May 2, 2026

Looking through the documentation, I found these articles.
Do any of these sound like the issue you are having?

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO000000CXkz0AG
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LTJPCA4

Everything I see related to Cerebro crash loop is suggesting to reach out to support.

Platform Engineer | Nutanix Technology Champion 2026

LMohammed
Trendsetter
Forum|Forum|2 months ago
May 5, 2026

Hi @AbdelazizO

The Nutanix Engineering team is aware of the issue and is working on a solution. Please contact Nutanix Support for the workaround.

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO000000CXkz0AG

Is there any chance you could upgrade to a newer AOS version?

2*NCA & 3*NCP & NCS and Microsoft & Dell and VMware certified

AbdelazizO
Author
Voyager
Forum|Forum|2 months ago
May 6, 2026

Hello,

It was a known race condition (ENG-679407) between the Garbage Collector and a replication task right after the AOS upgrade. Basically, the GC deleted a snapshot that was still tagged in a pending replication.

This corrupted the WAL. Every time Cerebro booted up, IsSnapshotInUse() tripped over the missing snapshot and crashed the service (Check failed: citer != ...).
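
To make the failure mode concrete, here is a toy model of the race as I understand it. This is not Nutanix code: only the names snapshot_uuid_map and IsSnapshotInUse come from the FATAL message and the explanation above. The ProtectionDomain class, the CheckFailed exception, the garbage_collect method, and the IDs are hypothetical stand-ins.

```python
class CheckFailed(Exception):
    """Stands in for the fatal check that takes the service down."""


class ProtectionDomain:
    def __init__(self):
        self.snapshot_uuid_map = {}  # snapshot UUID -> snapshot metadata
        self.pending_ops = []        # (meta_op_id, snapshot_uuid) awaiting replay

    def garbage_collect(self, snapshot_uuid):
        # The racy path: the snapshot is dropped without checking pending ops.
        self.snapshot_uuid_map.pop(snapshot_uuid, None)

    def is_snapshot_in_use(self, snapshot_uuid):
        # Same shape as the failed check: the lookup is assumed to succeed.
        if snapshot_uuid not in self.snapshot_uuid_map:
            raise CheckFailed(f"snapshot {snapshot_uuid} missing from snapshot_uuid_map")
        return any(uuid == snapshot_uuid for _, uuid in self.pending_ops)


pd = ProtectionDomain()
pd.snapshot_uuid_map["snap-1"] = {"state": "replicating"}
pd.pending_ops.append((101, "snap-1"))  # replication still references snap-1

pd.garbage_collect("snap-1")            # GC wins the race

try:
    pd.is_snapshot_in_use("snap-1")     # what every startup then runs into
except CheckFailed as err:
    print("Check failed:", err)
```

Because the pending operation survives restarts while the snapshot does not, the same check fails on every start, which matches the crash loop.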

How SRE fixed it:

Stopped Cerebro everywhere (genesis stop cerebro).

Dumped the WAL database to grab the exact MetaOp ID of the stuck task.

Injected an internal GFLAG so Cerebro ignores that specific operation on startup.

Started Cerebro. It bypassed the bad entry, finished its WAL recovery, and purged the ghost task automatically.
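
The steps above can be pictured with a small sketch of a WAL replay that accepts a skip list. Again, this is only an illustration under my own assumptions: recover_wal, its parameters, and the IDs are invented, and the real GFLAG and WAL tooling are internal to Nutanix Support, so please do not try to reproduce this on a cluster without them.

```python
def recover_wal(wal_entries, snapshots, skip_meta_op_ids=frozenset()):
    """Replay pending WAL entries, returning (replayed_ids, purged_ids)."""
    replayed, purged = [], []
    for meta_op_id, snapshot_uuid in wal_entries:
        if meta_op_id in skip_meta_op_ids:
            # The operator told recovery to ignore this op, so it is dropped.
            purged.append(meta_op_id)
            continue
        if snapshot_uuid not in snapshots:
            # Without the skip list, a dangling reference aborts recovery.
            raise RuntimeError(f"MetaOp {meta_op_id} references missing snapshot")
        replayed.append(meta_op_id)
    return replayed, purged


# Hypothetical WAL: MetaOp 101 points at a snapshot the GC already deleted.
wal = [(100, "snap-a"), (101, "snap-gone"), (102, "snap-b")]
snapshots = {"snap-a", "snap-b"}

try:
    recover_wal(wal, snapshots)
except RuntimeError as err:
    print("recovery aborted:", err)

print(recover_wal(wal, snapshots, skip_meta_op_ids={101}))
# ([100, 102], [101])
```

The point of skipping by exact MetaOp ID is that every other pending operation is still replayed normally; only the one known-bad entry is bypassed.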

jarrodl
Vanguard
Forum|Forum|2 months ago
May 6, 2026

Glad to hear you got it resolved and thanks for the update.

Platform Engineer | Nutanix Technology Champion 2026
