You may have encountered an alert about PE-PC Connection Failure, or regarding IDF db to db sync. It’s possible that you had a visible issue with Prism Central’s management and monitoring of clusters, but also possible that you didn’t.
What are these alerts for? How do we know if there’s really a problem? What can we do to fix it? I would like to answer these questions for you.
Prism Element and Prism Central need two-way communication for a number of reasons. Manageability, forecasting, alert visibility, and reporting in Prism Central all depend on periodic syncing of data from PE to PC. Prism Central’s VM and infrastructure management, image creation, DR orchestration, and other enhanced features like Calm require API communication from PC to PE.
PE-PC Connection Failure alerts indicate communication failure. This could be a short-lived issue and even an expected one, like when PC is rebooted during an upgrade, or it could reflect a more prolonged connection loss.
A failure of the idf_db_to_db_sync check indicates that data sync from PE to PC has not completed for an extended time. Each completed sync updates a timestamp held on either end, and the check looks at the age of that timestamp, so you’ll get a “fail” result if sync is too far behind. The cause could be a communication failure, but it could also be a performance problem on the PE cluster, in PC, or in the network in between. For more information on that check, see NCC Health Check: idf_db_to_db_sync_heartbeat_status_check.
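To make the timestamp idea concrete, here is a minimal Python sketch of an age-based check. It is only an illustration of the logic described above, not the actual NCC implementation: the function name and the 15-minute threshold are my own assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold for illustration only; the real limit used by
# the NCC check is not stated in this post.
MAX_SYNC_AGE = timedelta(minutes=15)


def sync_status(last_sync, now, max_age=MAX_SYNC_AGE):
    """Return "PASS" if the last completed sync is recent enough, else "FAIL"."""
    age = now - last_sync
    return "FAIL" if age > max_age else "PASS"


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
print(sync_status(now - timedelta(minutes=5), now))  # recent sync
print(sync_status(now - timedelta(hours=2), now))    # sync far behind
```

Note that a stale timestamp only says sync is behind. It doesn't say why, which is the reason the cause could be connectivity or performance.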
Knowing what triggers these alerts can help us know when further investigation is needed. Did they pop up during upgrades or planned maintenance? If so it’s probably fine, just a temporary disconnect. The system can recover from a short loss of communication.
If there’s no ready explanation, there are a few more things we can easily check to better understand the situation.
First, run a full health check and see whether PE-PC communication or sync gets flagged again. From PE, also watch the cluster_connectivity_status check, since it looks at PE-PC connectivity as well. If everything passes now, the problem was temporary. If the new health check fails, the issue is still ongoing.
Second, look at the cluster statistics and check for any gaps. Gaps in cluster performance or utilization data indicate times when PE wasn’t successfully syncing data to PC. If there aren’t any gaps, PE to PC communication has been working fine, or at least has been able to catch up in a short time.
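If you have exported the sample timestamps for a metric, spotting gaps doesn't have to be done by eye. The sketch below is one possible way to do it in Python. The function name, the 30-second sample interval, and the tolerance factor are assumptions for illustration, not values taken from Prism.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (start, end) pairs where consecutive samples are further
    apart than tolerance * expected_interval."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Illustrative data: five samples 30 seconds apart, then nothing until 00:12.
base = datetime(2020, 1, 1)
samples = [base + timedelta(seconds=30 * i) for i in range(5)]
samples.append(base + timedelta(minutes=12))

for start, end in find_gaps(samples, timedelta(seconds=30)):
    print(f"gap from {start} to {end} ({end - start})")
```

An empty result means the data is continuous at the chosen tolerance. Any pair returned marks a window worth lining up against your alert times.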
The third thing to do is check PC to PE connectivity directly. Can you use “launch Prism Element” from PC and get access to the cluster? If so, PC to PE communication is working now. Alternatively, you could attempt a VM management task from PC, like powering on a VM or updating its description text. If these complete, PC is able to communicate with PE.
If you’re seeing alerts but checks pass and you are not seeing an issue, make sure NCC is up to date. As described in the article “PE-PC Connection Failure alerts” this check needed some tuning to reduce unnecessary alarms. Those improvements came in NCC 3.9.4 so if your NCC version is lower I would suggest upgrading.
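If you are checking versions across several clusters with a script, be careful to compare them numerically rather than as text, since "3.10.0" sorts before "3.9.4" as a string. Here is a small Python sketch of that comparison. The helper names are hypothetical, and it assumes the version string is purely dotted numbers.

```python
def parse_version(text):
    """Turn a dotted version string such as "3.9.4" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# NCC 3.9.4 is the version mentioned above as containing the alert tuning.
MIN_NCC = parse_version("3.9.4")


def needs_ncc_upgrade(installed):
    """True if the installed NCC version is lower than 3.9.4."""
    return parse_version(installed) < MIN_NCC


for version in ["3.9.3", "3.9.4", "3.10.0"]:
    print(version, "upgrade suggested" if needs_ncc_upgrade(version) else "ok")
```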
If PE to PC sync appears fine and there are no gaps in monitoring data, but you can’t launch Prism Element from PC, the cause is often a firewall or proxy configuration issue. For more detail on that, please check out my earlier post here.
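As a quick first test for a firewall block, you can check whether a plain TCP connection to the cluster opens at all. The Python sketch below does that from any machine with Python installed. It assumes Prism is listening on TCP port 9440, and the function name is my own. It only tests reachability from wherever you run it, so it will not reveal proxy misconfiguration on PC itself.

```python
import socket

# Assumption: Prism is reachable on TCP port 9440.
PRISM_PORT = 9440


def can_connect(host, port=PRISM_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("10.0.0.50") with your cluster virtual IP in place of that made-up address returns False if something in the path is dropping or refusing the connection. A True result only shows the port is open, not that the API calls behind it succeed.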
I hope this helps to clear some confusion around these alerts. If you have questions, ask them in the comments.