Hi Experts!
We recently upgraded one of our Nutanix clusters to AOS 6.8.1.8 and started facing an issue with the Cerebro service shortly after.
Symptoms observed:
- Remote replication alerts after the upgrade.
- Remote site configuration warning indicating that some CVM/SVM IPs were not properly configured on the peer site.
- Cerebro entering a crash loop.
- cluster status showing constantly changing/high PIDs for Cerebro.
- cerebro.FATAL reporting an error similar to:
Check failed: citer != (pd.second)->snapshot_uuid_map().end()
with a snapshot stuck in a pending action.
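
For anyone who wants to confirm the same "changing PIDs" symptom on their side, here is a minimal sketch that compares several saved copies of the cluster status output and counts how often the Cerebro PID list changed between captures. The line shape it parses (service name, state, then a bracketed PID list) is an assumption based on what we see on our cluster, and the function names are purely illustrative; it only reads text you have already captured and does not talk to the cluster.

```python
import re

# Assumed line shape in saved `cluster status` output, for example:
#     Cerebro   UP   [12345, 12380, 12381]
# Adjust the pattern if your output looks different.
CEREBRO_LINE = re.compile(r"^\s*Cerebro\s+\w+\s+\[([\d,\s]*)\]")


def cerebro_pids(status_text):
    """Return one tuple of PIDs per Cerebro line found (one line per CVM)."""
    found = []
    for line in status_text.splitlines():
        match = CEREBRO_LINE.match(line)
        if match:
            pids = tuple(int(p) for p in match.group(1).split(",") if p.strip())
            found.append(pids)
    return found


def count_pid_changes(samples):
    """Count consecutive captures (oldest first) whose Cerebro PIDs differ."""
    changes = 0
    previous = None
    for text in samples:
        current = cerebro_pids(text)
        if previous is not None and current != previous:
            changes += 1
        previous = current
    return changes
```

If the count equals the number of intervals between captures, Cerebro restarted in every interval, which matches what we see here.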
It looks like there is a stale or orphaned snapshot operation in the WAL / metadata, likely triggered after the upgrade while replication configuration was not fully consistent.
Has anyone already seen this behavior after an AOS upgrade?
Did you resolve it by fixing the remote site configuration only, or did it require Nutanix Support intervention to skip/clean the offending WAL operation?
Thanks :)
