Question

Nutanix Cluster API v2 reports normal status when a node is "down"


Userlevel 1
Badge +3
Yesterday our Nutanix CE cluster suffered a node loss (the onboard NIC went offline somehow; we are investigating). HA failover performed correctly, but Zabbix, which I had monitoring the cluster through API 2.0, did not report a problem. The only difference was that the host stopped being visible in discovery by its proper name and remained only by IP. Every other reading was "normal" and "not degraded", and even the cluster state was reported as "normal" while the host was effectively down. WTF?!

Also, WTF is with having to re-post a question to another forum when asked to? Do you maybe allow cross-links for asking the same question in a different subset of forums?

8 replies

Userlevel 4
Badge +19
@Maxim Grishin

How many nodes do you have in this cluster?

And can you share the API URL that you used to check the cluster status?
Userlevel 1
Badge +3
sandeepmp wrote:

@Maxim Grishin

How many nodes do you have in this cluster?

And can you share the API URL that you used to check the cluster status?


4 (CE).

The URL was "GET /api/nutanix/2.0/cluster/". I was using Zabbix's JSON parser to extract the value of "operation_mode", which I took to be the current cluster status. Am I correct, or are there other parameters I should watch?
Userlevel 1
Badge +3
Yep, I also monitor the "is_degraded" value from "/api/nutanix/2.0/hosts/UUID", which actually reported zeroes for hosts that were down. "Not degraded but failed" is a weird state, you know.
Userlevel 4
Badge +19
@Maxim Grishin

operation_mode is used to identify whether the cluster is "Single node", "multi node", etc.

To identify the "Data resiliency" status, please use the APIs below.

Request URL:

V1
https://cluster_ip:9440/PrismGateway/services/rest/v1/cluster/domain_fault_tolerance_status/

V2
https://cluster_ip:9440/PrismGateway/services/rest/v2.0/cluster/domain_fault_tolerance_status/
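For anyone polling these endpoints from a script instead of a browser, here is a minimal sketch of how the request could be built using only the Python standard library. The function names, the use of HTTP basic authentication, and the `verify_tls` switch are assumptions made for illustration and are not taken from this thread. Only the URL layout comes from the post above.

```python
import base64
import json
import ssl
import urllib.request


def fault_tolerance_url(cluster_ip, api_version="v1"):
    """Build the domain_fault_tolerance_status URL quoted in the post above."""
    return (
        f"https://{cluster_ip}:9440/PrismGateway/services/rest/"
        f"{api_version}/cluster/domain_fault_tolerance_status/"
    )


def basic_auth_header(username, password):
    """Return an HTTP basic auth header (assumed authentication scheme)."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def fetch_fault_tolerance(cluster_ip, username, password,
                          api_version="v1", verify_tls=True):
    """Fetch and decode the fault tolerance status as JSON."""
    headers = {"Accept": "application/json"}
    headers.update(basic_auth_header(username, password))
    request = urllib.request.Request(
        fault_tolerance_url(cluster_ip, api_version), headers=headers
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Hypothetical convenience for clusters with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `fetch_fault_tolerance("cluster_ip", "user", "password")` would then return the decoded JSON body, with the placeholders replaced by real values.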
Userlevel 4
Badge +19
https://next.nutanix.com/api-31/powershell-cdmlets-or-rest-api-to-get-data-resiliency-status-30979
Userlevel 4
Badge +19
The "is_degraded" flag is used to identify the Degraded Node status.

https://portal.nutanix.com/#/page/docs/details?targetId=Web-Console-Guide-Prism-v510:man-node-degraded-wc-c.html
Userlevel 1
Badge +3
@sandeepmp The v2 version of this API doesn't work; it returns an XML-formatted error, "java.lang.NullPointerException". v1 does display the data, although in a somewhat unfriendly way: the failures tolerable are listed per service, and no "minimum" value is readily available. There is also a nice value for under-replicated data, which can be used to signal that a disk has failed.

Still, is there an API under /api/nutanix/ (which seems to be the more modern way of querying the cluster) that would deliver similar info?
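
Since the v1 output lists failures tolerable per service without a "minimum", that value can be derived on the monitoring side. The sketch below assumes the decoded v1 response is a list of domains, each with a "domainType" and a "componentFaultToleranceStatus" map whose entries carry "numberOfFailuresTolerable". Those field names and the sample data are assumptions and should be checked against the JSON your cluster actually returns.

```python
def min_failures_tolerable(status, domain_type="NODE"):
    """Return the smallest per-service failures-tolerable value, or None.

    `status` is the decoded JSON from the v1 endpoint (assumed layout).
    """
    values = []
    for domain in status:
        if domain.get("domainType") != domain_type:
            continue
        components = domain.get("componentFaultToleranceStatus") or {}
        for component in components.values():
            value = component.get("numberOfFailuresTolerable")
            if value is not None:
                values.append(int(value))
    return min(values) if values else None


# Hypothetical sample shaped like the assumed v1 response.
sample = [
    {
        "domainType": "NODE",
        "componentFaultToleranceStatus": {
            "EXTENT_GROUPS": {"numberOfFailuresTolerable": 0},
            "ZOOKEEPER": {"numberOfFailuresTolerable": 1},
        },
    },
    {
        "domainType": "DISK",
        "componentFaultToleranceStatus": {
            "EXTENT_GROUPS": {"numberOfFailuresTolerable": 1},
        },
    },
]

print(min_failures_tolerable(sample))          # minimum across NODE services
print(min_failures_tolerable(sample, "DISK"))  # minimum across DISK services
```

A monitoring trigger could then fire whenever the NODE minimum drops below 1, which is the kind of single reading the per-service output does not provide on its own.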
