Nutanix Guest Tools dead. Restarting.

Hi,

The Nutanix Guest Tools service keeps restarting on the CVMs.

After starting NGT master initialization, it fails:

I20241113 06:08:09.987768Z 23965 ngt_master.cc:320] Starting Ngt master initialization
I20241113 06:08:09.987829Z 23965 ngt.cc:506] Ngt master changed to 192.168.6.73:2073+15769496029
I20241113 06:08:09.987845Z 23965 ngt.cc:519] Initializing ngt guest interface to 192.168.6.73:2073
I20241113 06:08:09.987959Z 23967 zeus_leadership_ops.cc:2354] FetchLeaderOp(174))arithmos lk:0 watch_id:15 inv_cbk:1 wait:0]: old leader handle:: new leader handle:192.168.6.64:2025
I20241113 06:08:09.987975Z 23967 zeus_leadership_ops.cc:2242] Updating leader cache for arithmos to 192.168.6.64:2025
I20241113 06:08:09.988018Z 23967 zeus_leadership_ops.cc:2547] StartLeaderChangeWatchOp(173))arithmos watch_id:15]: Successfully set the watch for arithmos leadership
I20241113 06:08:09.988027Z 23967 zeus.cc:1793] Finished StartLeaderChangeWatchOp(173))arithmos watch_id:15]
I20241113 06:08:09.988273Z 23965 ngt_master.cc:411] Obtained Arithmos master handle : 192.168.6.64:2025
I20241113 06:08:09.988297Z 23965 ngt_master.cc:428] Arithmos master changed to 192.168.6.64
I20241113 06:08:10.026528Z   549 wal_cassandra_rows_metadata.cc:1255] Recovery for ngt_master finished. Checking for unfinished wal records.
I20241113 06:08:10.031661Z   549 wal_cassandra_rows_metadata.cc:1281] Done deleting unfinished wal records for entity: ngt_master
I20241113 06:08:10.031678Z   549 wal_cassandra_backend.cc:1859] Recovery done for : ngt_master
I20241113 06:08:10.031843Z 23965 ngt_master_WAL.cc:172] Ngt master WAL recovery completed
I20241113 06:08:10.031867Z 23965 ngt_master.cc:343] Creating zeus capabilities node if not present
I20241113 06:08:10.033236Z 23967 ngt.cc:1120] Read of the zknode /appliance/logical/ngt_capabilities succeeded
I20241113 06:08:10.033267Z 23965 ngt_master.cc:353] Initiating inital sync of NGT capabilities in zeus
I20241113 06:08:10.034096Z 23965 ngt_master.cc:363] Sync capabilities done. Resuming initialization
I20241113 06:08:10.034148Z 23965 ngt_master.cc:1352] Ngt internal guest server setup successfull
I20241113 06:08:10.063593Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137779
I20241113 06:08:10.063719Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137777
I20241113 06:08:10.063733Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137775
I20241113 06:08:10.063758Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137773
I20241113 06:08:10.063771Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137745
I20241113 06:08:10.063779Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137743
I20241113 06:08:10.063788Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137739
I20241113 06:08:10.063793Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137741
I20241113 06:08:10.063804Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137749
I20241113 06:08:10.063813Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137737
I20241113 06:08:10.063818Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137751
I20241113 06:08:10.063824Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137735
I20241113 06:08:10.063829Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137733
I20241113 06:08:10.063838Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137747
I20241113 06:08:10.063863Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137753
I20241113 06:08:10.063874Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137755
I20241113 06:08:10.063879Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137757
I20241113 06:08:10.063884Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137759
I20241113 06:08:10.063889Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137761
I20241113 06:08:10.063894Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137763
I20241113 06:08:10.063906Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137765
I20241113 06:08:10.063915Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137767
I20241113 06:08:10.063923Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137769
I20241113 06:08:10.063931Z 23965 vm_tools_entity.cc:291] Resuming meta op with id 137771
I20241113 06:08:10.063938Z 23965 ngt_master.cc:395] Ngt master initialization completed
F20241113 06:08:10.138688Z 23965 ergon_utils.cc:706] Check failed: ret->task_list().size() 
*** Check failure stack trace: ***
First FATAL tid: 23965
Installed ExitTimer with timeout 30 secs and interval 5 secs
Leak checks complete
Flushed log files
Initialized FiberPool BacktraceGenerator
Collected stack-frames for threads
I20241113 06:08:10.153867Z 23965 fiberpool.cc:450] Collected backtraces of all fibers.
Collected stack-frames for all Fibers
Symbolized all thread stack-frames
Symbolized all fiber stack-frames
Obtained stack traces of threads responding to SIGPROF
Failed to create stack trace file: /home/nutanix/data/cores/nutanix_guest_t.23965.20241113-090810.stack_trace.txt
Stack traces are dumped here.
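
The restart loop is visible from any CVM with standard commands, for example (the FATAL log file glob below is an assumption based on the usual glog file naming):

allssh 'genesis status | grep nutanix_guest_tools'                              # service PIDs change on every restart
allssh 'ls -ltr /home/nutanix/data/logs/nutanix_guest_tools*FATAL* | tail -3'   # recent FATAL crash logs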

@AlexanderK2, could you please check the NTP configuration across your whole environment?
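
For example, from any CVM (allssh, hostssh and ncli are standard CVM tooling; the host-side command assumes AHV hosts that still run ntpd rather than chrony):

ncli cluster get-ntp-servers                  # NTP servers configured on the cluster
allssh ntpq -pn                               # NTP sync status on every CVM
hostssh ntpq -pn                              # NTP sync status on every host
ncc health_checks network_checks check_ntp    # dedicated NCC NTP check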


 

@AndrePulia1, thank you for your reply! NTP is OK. The time is the same on all hosts in the cluster. No NCC errors.


Hello @AlexanderK2,
The log shows that the NGT internal guest server setup was successful ("Ngt internal guest server setup successfull"), which can indicate that the NGT components within the guest VM have been initialized properly up to this point.

However, this line "F20241113 06:08:10.138688Z 23965 ergon_utils.cc:706] Check failed: ret->task_list().size()" indicates that a check failed due to an empty or invalid task list.


1. Can you review the task handling to determine why the task list is empty or invalid? (A sketch of how to inspect the Ergon task list from a CVM follows below.)
2. Can you provide the NCC output?
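
For example, the Ergon task list can be inspected from any CVM with the ecli utility (the parameter name below is an assumption based on common usage; check ecli's built-in help on your AOS version):

ecli task.list include_completed=false                  # list only pending/running tasks (parameter name assumed)
ecli task.list include_completed=false | grep -i ngt    # narrow the list down to NGT-related tasks
ecli task.get <task_uuid>                               # inspect a single task in detail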


Hello @LMohammed,

  1. Unfortunately, I couldn't find the ergon_utils.cc code. How can I determine that the task list is broken and repair it?
  2. NCC checks: the full output is too long, so here is the WARN/FAIL/ERR section:

/health_checks/data_protection_checks/protection_domain_checks/backup_schedule_check   [ WARN ]
/health_checks/hypervisor_checks/host_cpu_contention                                   [ WARN ]
/health_checks/system_checks/check_license_compliance                                  [ WARN ]
/health_checks/hardware_checks/disk_checks/vg_space_usage_check                        [ WARN ]
/health_checks/hypervisor_checks/vm_checks                                             [ FAIL ]
/health_checks/ngt_checks/ngt_client_cert_expiry_check                                 [ ERR  ]


Detailed information for ngt_installer_version_check:
Node 192.168.6.64:
INFO: Following VMs do not have latest NGT version installed:
 VM: atttest - NGT installed version: 2.1.4, NGT latest version: 2.1.5
 VM: rdsham-teach01 - NGT installed version: 2.1.4, NGT latest version: 2.1.5

Refer to KB 5487 (http://portal.nutanix.com/kb/5487) for details on ngt_installer_version_check or Recheck with: ncc health_checks ngt_checks ngt_installer_version_check

Detailed information for backup_schedule_check:
Node 192.168.6.64:
WARN: Backup schedule does not exist for protection domain Veeam_PD_Job_Cluster_6_50 protecting some entities. If this protection domain is created for use by backup software, this warning can be ignored.
WARN: Backup schedule does not exist for protection domain VeeamBackupProtection protecting some entities. If this protection domain is created for use by backup software, this warning can be ignored.
Refer to KB 1910 (http://portal.nutanix.com/kb/1910) for details on backup_schedule_check or Recheck with: ncc health_checks data_protection_checks protection_domain_checks backup_schedule_check

Detailed information for host_cpu_contention:
Node 192.168.6.52:
WARN: High host CPU utilization on host 192.168.6.51: 86 (Threshold: 75).
Node 192.168.6.73:
WARN: High host CPU utilization on host 192.168.6.72: 89 (Threshold: 75).
Refer to KB 2797 (http://portal.nutanix.com/kb/2797) for details on host_cpu_contention or Recheck with: ncc health_checks hypervisor_checks host_cpu_contention --cvm_list=192.168.6.52,192.168.6.73

Detailed information for check_license_compliance:
Node 192.168.6.64:
WARN: Cluster license has expired. Detailed license info: [u'LIC-********,ANY,Pro,2024-07-19', u'LIC-01206703,ANY,Pro,2024-07-19', u'FILES-FREE-TIB,ANY,File,2024-07-19']
Refer to KB 2469 (http://portal.nutanix.com/kb/2469) for details on check_license_compliance or Recheck with: ncc health_checks system_checks check_license_compliance

Detailed information for vg_space_usage_check:
Node 192.168.6.67:
WARN: Volume Group Veeam-ed704fe5-7397-4f9c-a93f-fcb856a73648 space usage (79 %) above 75 %
Refer to KB 4600 (http://portal.nutanix.com/kb/4600) for details on vg_space_usage_check or Recheck with: ncc health_checks hardware_checks disk_checks vg_space_usage_check --cvm_list=192.168.6.67

Detailed information for vm_checks:
Node 192.168.6.52:
FAIL: CVM 'NTNX-3d-ntnx-0201-CVM' has cpu utilization (100 %) above threshold (90 %)

FAIL: VM 'Mail00' has transmitted packet drop rate (290) above threshold (0)

Node 192.168.6.73:
FAIL: CVM 'NTNX-3d-ntnx-0208-CVM' has cpu utilization (100 %) above threshold (90 %)

Refer to KB 2733 (http://portal.nutanix.com/kb/2733) for details on vm_checks or Recheck with: ncc health_checks hypervisor_checks vm_checks --cvm_list=192.168.6.52,192.168.6.73

Detailed information for ngt_client_cert_expiry_check:
Node 192.168.6.64:
ERR : NgtRpcError
Refer to KB 10075 (http://portal.nutanix.com/kb/10075) for details on ngt_client_cert_expiry_check or Recheck with: ncc health_checks ngt_checks ngt_client_cert_expiry_check
+-----------------------+
| State         | Count |
+-----------------------+
| Pass          | 235   |
| Info          | 1     |
| Warning       | 4     |
| Fail          | 1     |
| Error         | 1     |
| Total Plugins | 244   |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
 


Hello,
I found that all cluster services certificates in /home/cert have expired. How can I renew them? Is this the cause of the NGT problem?
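
For reference, the expiry dates can be listed with openssl, for example (assuming the files under /home/cert are PEM-encoded certificates):

for crt in /home/cert/*; do
  echo "== $crt"
  openssl x509 -in "$crt" -noout -subject -enddate   # print subject and notAfter date for each file
done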

Hi @AlexanderK2, I assume this is a production cluster. Please involve Nutanix Support on this; they are there to help you.

