We've recently been undertaking a project to implement Nutanix clusters within our organisation.
Part of the operational acceptance testing (OAT) before go-live is reviewing our monitoring and alerting processes. We currently have an automated alerting system that pages our on-call staff if a problem is found out of hours.
It works from emails, applying specific rules based on the content of each email received.
We have identified a few alerts during our implementation that we want to flag, but we are wondering if there is a list of critical conditions we really should respond to?
The severity of the alerts should give us a good basis, but there have been a few "Critical" alerts we probably wouldn't want to wake our staff up for!
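To illustrate the kind of rule we have in mind, here is a minimal Python sketch of severity-plus-content filtering on an alert email. It is only an assumption of how such a rule might look, not our actual configuration: the function names and the two patterns are hypothetical placeholders, not real Nutanix alert names.

```python
import re
from email.message import EmailMessage

# Hypothetical placeholder patterns; NOT real Nutanix alert names.
PAGE_PATTERNS = [re.compile(r"example-must-page", re.IGNORECASE)]
SUPPRESS_PATTERNS = [re.compile(r"example-can-wait", re.IGNORECASE)]


def make_alert(subject: str, body: str) -> EmailMessage:
    """Build a simple plain-text email, standing in for a received alert."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def should_page(msg: EmailMessage) -> bool:
    """Decide whether an alert email should wake the on-call engineer."""
    subject = msg.get("Subject", "")
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    text = f"{subject}\n{body}"

    # Explicit suppressions win, even if the alert is marked Critical.
    if any(p.search(text) for p in SUPPRESS_PATTERNS):
        return False
    # Explicitly listed conditions always page.
    if any(p.search(text) for p in PAGE_PATTERNS):
        return True
    # Otherwise fall back to the severity word in the subject line.
    return "critical" in subject.lower()
```

The suppression list is checked first so that a known-noisy "Critical" alert can be excluded without changing the severity fallback.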
Any advice / guidance greatly appreciated!