As with many other services, NTP is something you do not think about until it breaks. Then strange symptoms start creeping into your environment.
Symptoms:
- Users are not able to log in to the Prism web console using LDAP or other directory integrated services.
- Cluster services do not start, or the cluster does not function correctly, because of major time skew after an outage or maintenance.
- Log collection is inaccurate.
- Health checks that rely on accurate time frames and event correlation return inaccurate results.
- Incorrect and skewed graphs in Prism.
- User VMs start on hypervisor hosts with an inaccurate RTC (real-time clock), causing guest OS time skew.
- Third-party software products like Veeam or CommVault have trouble interacting with the cluster.
- Snapshots expire too early or too late when the time between a cluster and a remote site is out of sync.
If in doubt, run the NCC check_ntp health check. Any status other than PASS indicates that troubleshooting is needed.
Troubleshooting NTP issues in a nutshell
- Confirm that the NTP server is an entity external to the cluster. (It is possible to use a VM running on the cluster as the NTP server, but this does not work well and is not recommended.)
- Where NTP is configured using an FQDN, validate that the entity using it can resolve the NTP server's FQDN.
- Confirm that the NTP server IP address is reachable. If ping fails, validate that ping traffic is allowed by pinging another destination that is known to respond to ping.
- Verify that the NTP server returns a valid and accurate response. In other words, query the NTP server at the application layer.
- Check the status of NTP synchronization on all CVMs and hosts. This shows the sync source and the time skew for each CVM.
- Check the NTP configuration on all hosts and look for inconsistencies or missing configuration.
If the CVM time is in the future, DO NOT manually set the clock backward! Contact Nutanix Support for assistance and provide the above output.
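The name-resolution and application-layer checks from the list above can be sketched with a few lines of standard-library Python. This is a minimal illustration of a plain SNTP client query, not a Nutanix tool and not a replacement for check_ntp. The function names (parse_response, query_ntp) and the "usable" criteria (server mode, leap indicator not "unsynchronized", stratum between 1 and 15) are assumptions made for this example. It only reads time and never sets a clock.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 (NTP epoch) to 1970-01-01 (Unix epoch)
PACKET_FORMAT = "!BBbb11I"     # 48-byte NTP header: flags, stratum, poll, precision, 11 x 32-bit words


def _to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (seconds, fraction) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def parse_response(packet, t_sent, t_received):
    """Decode a server reply and estimate the local clock offset in seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    fields = struct.unpack(PACKET_FORMAT, packet[:48])
    leap = fields[0] >> 6          # 3 means the server clock is unsynchronized
    mode = fields[0] & 0x7         # 4 means "server"
    stratum = fields[1]            # 0 and 16 are not usable time sources
    t_server_rx = _to_unix(fields[11], fields[12])  # server receive timestamp
    t_server_tx = _to_unix(fields[13], fields[14])  # server transmit timestamp
    # Standard NTP offset estimate: positive means the server is ahead of the local clock.
    offset = ((t_server_rx - t_sent) + (t_server_tx - t_received)) / 2
    usable = mode == 4 and leap != 3 and 1 <= stratum <= 15
    return {"stratum": stratum, "leap": leap, "offset": offset, "usable": usable}


def query_ntp(server, timeout=3.0):
    """Resolve the server name, send one SNTP request over UDP 123, and parse the reply."""
    # Name resolution: raises socket.gaierror if the FQDN does not resolve.
    family, _, _, _, address = socket.getaddrinfo(server, 123, type=socket.SOCK_DGRAM)[0]
    request = b"\x1b" + 47 * b"\0"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t_sent = time.time()
        sock.sendto(request, address)
        packet, _ = sock.recvfrom(512)  # raises socket.timeout if there is no reply
        t_received = time.time()
    return parse_response(packet, t_sent, t_received)
```

Calling query_ntp with the configured server name (for example query_ntp("ntp.example.com"), a placeholder hostname) separates the failure modes listed above: a resolution error points at DNS, a timeout points at reachability or filtering of UDP port 123, and a reply with "usable" set to False means the server answered but is not itself synchronized.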
NTP hygiene
- Synchronizing a Nutanix AOS/PC cluster with a Windows-based time source is known to cause issues over time. Nutanix does not recommend synchronizing cluster time with Windows time sources. Use reliable non-Windows time sources instead.
- Use an NTP source that is external to your cluster.
- In AHV-based environments, configuring NTP servers via Prism or ncli updates both the CVMs and the AHV hosts.
- In ESXi-based environments, configuring NTP sources in the Prism web console or ncli does not trigger an automatic update of the /etc/ntp.conf file on the hosts. After you add the NTP servers in Prism, you must also manually configure those NTP servers on the ESXi hosts.
- In a mixed-hypervisor cluster (AHV + ESXi), AHV hosts will be configured via Prism while ESXi hosts must be updated manually.
- In a Hyper-V cluster, the check_ntp plugin validates only the CVM NTP configuration. The NTP/time configuration of the Windows Hyper-V hosts is not checked, so the check does not return FAIL even if the hosts are misconfigured or out of sync. Manually confirm that your Hyper-V hosts and Domain Controllers have a healthy Windows time hierarchy. The AD PDC(s) should use reliable upstream NTP time sources, preferably the same ones used by the CVMs (see below).
- Ideally, to simplify the comparison of logs and to avoid complex time sync issue triage, the hypervisors and the Controller VMs should all be using the same NTP servers. Should the hypervisors and the Controller VMs use different NTP sources, the NCC health check may produce INFO output to raise awareness and ensure the configuration is intended.
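Where hosts must be configured manually, comparing their NTP server lists against the intended list is easy to automate. The sketch below is an illustration only: it assumes the configuration text has already been collected from each host by some other means, and that it uses ntp.conf-style "server" or "pool" lines. The function names and the host names in the usage note are hypothetical.

```python
def ntp_servers(conf_text):
    """Return the set of sources named on 'server' or 'pool' lines of ntp.conf-style text."""
    servers = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("server", "pool"):
            servers.add(parts[1].lower())
    return servers


def find_inconsistencies(expected, configs):
    """Compare each entity's configured servers with the expected set.

    expected: iterable of server names that every entity should use.
    configs:  dict mapping an entity name to its ntp.conf-style text.
    Returns a dict containing only the entities that differ, with the
    servers they are missing and the extra servers they have.
    """
    expected = {name.lower() for name in expected}
    report = {}
    for entity, text in configs.items():
        actual = ntp_servers(text)
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        if missing or extra:
            report[entity] = {"missing": missing, "extra": extra}
    return report
```

For example, passing the servers configured in Prism as expected and a dict such as {"esxi-1": "...", "esxi-2": "..."} as configs returns an empty dict when every host matches, and otherwise lists per host which servers still need to be added or removed.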
Further reading:
KB-4519 NCC Health Check: check_ntp - exhaustive NTP issues troubleshooting guide.
KB-3851 Troubleshooting NTP Sync to Windows Time Servers.
Recommendations for Time Synchronization in the Prism Web Console Guide.
For more information about configuring NTP servers on the ESXi hosts, see VMware KB Configuring Network Time Protocol (NTP) on ESX/ESXi hosts using the vSphere Client (2012069).