Genesis alarm

  • 25 October 2016
  • 6 replies
  • 1242 views

Userlevel 2
Badge +11
My cluster was looking very nice in Prism (as usual) yesterday, but I happened to take a look at my NCC summary output and found that Genesis wasn't running on one of my CVMs. I was able to issue a 'genesis start' command and all was good again, but I'm curious whether I should expect the Genesis service status to be reported up to Prism.

I gather that, without Genesis running, the CVM won't play in any of the cluster reindeer games, so its running status seems kind of important to me. Am I wrong in this assumption? And if not, is there some way that I can get an alarm should this happen again (other than running/reviewing NCC)?

Thanks in advance.

6 replies

Userlevel 6
Badge +29
Genesis is the "bootstrap" service, meaning it's the service that starts the other services and does some housekeeping, but it isn't in the critical path (meaning data is still safe and everything generally hums along if it's not running).

That said, sure, if a service is down, it should whine. What version of AOS and NCC are you on?
Userlevel 4
Badge +19
Cluster services will still carry on as normal, but if they stopped for any reason, they wouldn't be able to restart, as Genesis is needed for that. It would be great to get a support ticket opened so support can grab the logs and see why it went down in the first place.
Userlevel 2
Badge +11
Thanks, Jon. I'm running AOS 4.7.1 and NCC 2.2.7 in this cluster.
Userlevel 2
Badge +11
Thanks, dlink7. It appears that the Genesis log contained the following message:

2016-10-20 05:51:48 ERROR command.py:156 Failed to execute sudo service iptables status: [Errno 12] Cannot allocate memory

It was about 20 hours afterwards, though, that I found the issue and sent the start command, which made everything happy again.
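For anyone who wants to catch this kind of entry without waiting on an NCC run, here is a minimal Python sketch that scans log lines for ERROR entries mentioning Errno 12. The regular expression is modeled only on the single line quoted above, so other Genesis log lines may not match it, and the function name `find_memory_errors` is purely illustrative. Where the log file lives and how you would raise an alert from the result are left out, since those depend on your environment.

```python
import re
from datetime import datetime

# Pattern modeled on the one log line quoted in this thread (assumption):
# "<date> <time> <LEVEL> <file>:<line> <message>"
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>\S+:\d+) "
    r"(?P<message>.*)"
)


def find_memory_errors(lines):
    """Return (timestamp, source, message) for ERROR lines mentioning Errno 12."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # line does not follow the assumed layout
        if match.group("level") == "ERROR" and "[Errno 12]" in match.group("message"):
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            hits.append((ts, match.group("source"), match.group("message")))
    return hits


if __name__ == "__main__":
    sample = [
        "2016-10-20 05:51:48 ERROR command.py:156 Failed to execute "
        "sudo service iptables status: [Errno 12] Cannot allocate memory",
    ]
    for ts, source, message in find_memory_errors(sample):
        print(ts.isoformat(), source, message)
```

Feeding it the lines of a log file (for example via `open(path)`) and alerting when the returned list is non-empty would be one way to use it, though it only detects this one symptom, not whether Genesis itself is running.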
Userlevel 6
Badge +29
Please file a support case with Dell for your XC nodes, and have them escalate it up to us just to verify, but I rooted around our internal bug DB and I think you're hitting Bug ID ENG-46371.

As far as Genesis not alarming correctly, we're working on improving the alerting for Genesis under internal tickets ENG-59035 (to get this in NCC properly) and FEAT-3021 (to get this in the GUI properly).
Userlevel 6
Badge +29
I followed up with our NCC team, and we've introduced the first phase of alerting to cover this specific case in NCC 2.3, which is actually GA already. Read the release notes and upgrade at your leisure. You can do NCC upgrades separately from NOS upgrades.

Though if memory serves me correctly, NCC 2.3 is integrated in the NOS 4.7.2.1 upgrade as well, so you can go either route.

Touch base with support first just to triple check on the previous note I posted.