NTP pool wrong time issue 2/5/20

We point our servers to the pool, and yesterday morning roughly 80% of our ESXi hosts in several data centers picked up time that was about 40 minutes fast. We are still trying to figure out what happened, as we host servers for customers and all their time was off as well. If it were one location it would make sense, but it was not isolated to a single part of the country. We started a reddit thread and saw at least two other people who had this issue.

[ moderator edit: this was also discussed on reddit at https://www.reddit.com/r/sysadmin/comments/ez9qwf/poolntporg_incorrect/ ]

Hi, please can you confirm which pool addresses the servers point to? Are they all configured the same way? Do all your servers point to pool hosts, or do you have a local hierarchy in place? Are you able to be more specific about the time it happened? (I'd guess there'll be something in the logs from when the time jumped!)

This thread is a good place to point the other two people at, if they can provide the same details. Thanks.

I pointed the other people to this thread.

So we have VMware clusters, and each VMware host is configured to point to

0.us.pool.ntp.org through 3.us.pool.ntp.org
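
On the ESXi side that ends up as four plain server lines in the host's NTP settings, something like this (from memory, so treat it as a sketch):

server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
server 3.us.pool.ntp.org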

We are in the US Eastern time zone; we started noticing issues (and getting client calls) around 6am EST, and it went on until about 12pm EST.

Can you post a link to the Reddit thread?

Do you have DNS or NTP logs that could show which NTP servers you were using?

here is the thread

I am trying to see if we can get some logs; we were busy putting out fires for over 100 clients. I am going to see if our firewall logs have anything.

Not sure which ESXi version you guys are running, but the link below has some troubleshooting steps from VMware.

It also mentions the standard ntpd max jump of 1000 s, which is 16-and-a-bit minutes, so a 40-minute jump, if it was in one go, would be surprising.

https://kb.vmware.com/s/article/1005092
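
For reference, that 1000 s limit is ntpd's panic threshold. The relevant ntp.conf knobs look like this (the values shown are the documented defaults, if I remember right):

tinker step 0.128    # offsets above this (in seconds) are stepped rather than slewed
tinker panic 1000    # offsets above this make ntpd exit instead of stepping, unless it was started with -g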

Does the ESXi system use a “proper” NTP client or just a one-off SNTP implementation? Does your system say which IPs it was talking to?

Looking in the monitoring data (it's in a public BigQuery bucket if anyone else wants to look, too), I see two servers that were active in the pool and returned an offset > 30 seconds (both about 45-50 seconds off). The monitoring system would have taken them out within 10-20 minutes (the monitoring interval plus a few minutes). The servers were both outside the US, though.
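
If anyone wants to repeat the check with the Python BigQuery client, a rough sketch is below; note that the project/dataset/table and column names are illustrative stand-ins, not the real ones, so adjust to whatever the public dataset actually calls them:

# Sketch only: the dataset/table/column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT server_ip, ts, offset
FROM `ntp-pool-public.monitoring.log_scores`  -- hypothetical name
WHERE ts BETWEEN '2020-02-04' AND '2020-02-06'
  AND ABS(offset) > 30  -- more than 30 seconds off
ORDER BY ABS(offset) DESC
"""
for row in client.query(sql):  # iterating waits for the job and streams rows
    print(row.server_ip, row.ts, row.offset)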

Oh, I just realized you said 40 minutes. I looked back in the monitoring data since January 1st for measurements with a 30 to 50 minute offset, and I see one server from France on January 21st that had a 31-minute offset (obviously it was promptly removed from the pool after that until it had better time). An IPv6 server in the UK has a single measurement with a 47-minute offset on January 18th.

For what it’s worth, it looks like in the ~21 million checks done in 2019 there was just one server that had two measurements like that, so this year is on track to be many times worse. :slight_smile:

40 minutes is weird though, and as others said a careful NTP implementation would ignore that.

At some points we would get drifts of up to 8 minutes on some hosts while others were fine. This was across our Dallas, New Jersey, New York, and Boston data centers; we have VMware hosts in all those locations. I got the log from the firewall, but it's too large to attach to the forum. Let me figure out how to share it.

I looked at the Newark data starting February 1 and found no pool servers with an offset > 1 minute. An NTP server IP list would help.

How are your VMware servers syncing? Was NTPD saying everything was okay even with the time off by 40 minutes, or did it say all sources were invalid (and your VM was free-running)?

Also, it’s unlikely that numerous hosts in numerous data centers would all be using the exact same pool servers…

The ESX server can be configured with several NTP servers (though I have no idea if it’s a full NTP implementation or SNTP round-robin)…

But then each of the VM guests can either sync periodically to the host they are on, OR free-run and require their own NTP service running… Depending on which route you go, there’s a little extra configuration you have to do.
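
For what it’s worth, if you go the free-run route, the VMware Tools periodic sync should be off inside the guest. Assuming VMware Tools / open-vm-tools is installed, you can check from inside a Linux guest with:

vmware-toolbox-cmd timesync status    # prints Enabled or Disabled
vmware-toolbox-cmd timesync disable   # turn the periodic host sync off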

40 minutes fast is impossible; NTPD will not shift that much once it is running unless you made some not-well-documented config changes.

FYI, 4 servers is the MINIMUM you would want in your ntp.conf… There is no reason to be stingy, as NTP packets are very small and only sent at long intervals. Realistically you would want around 7 servers to have good redundancy against falsetickers.
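
One low-effort way to get there, if your ntpd is recent enough: the ‘pool’ directive resolves the name repeatedly and keeps several associations on its own, e.g.:

pool 0.us.pool.ntp.org iburst
pool 1.us.pool.ntp.org iburst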

The few times I’ve seen people complaining about their time being wildly off, they were always running VMs… Unfortunately nobody to date has had any sort of logging, or even bothered to do an ‘ntpq -pn’ before rebooting, so there’s no way to know what went on…
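
For anyone who does catch it live: ‘ntpq -pn’ output looks roughly like the below (the addresses here are just documentation examples). ‘*’ marks the current sync source, ‘+’ a usable candidate, ‘-’ an outlier, ‘x’ a falseticker; the offset column is in milliseconds.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   34   64  377    0.512   -0.021   0.033
+198.51.100.5    192.0.2.10       2 u   12   64  377    8.312    1.204   0.810
-203.0.113.7     198.51.100.5     2 u   45   64  377   40.120    5.991   2.414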

VMware was configured with the same settings in all our clusters: the 0-3 us pool servers. The initial red flag for us was the 40-minute time skew. If you look at the reddit thread, there is also another person who had the same issue 2 days ago with a 40-minute skew. We initially thought it was something on our end, and when we could not find anything we started the reddit thread. As we kept testing later in the day we were getting 4-8 minute skews. I would restart NTP on the host and it would be normal for about 5-10 minutes, then it would start to drift. While I don't have the logs from the host, I was able to grab the logs from the firewall, and I am going to post those now.

I was able to grab the firewall logs, pulled all the NTP requests, and found all the IPs:

dstip=38.141.40.220
dstip=216.126.233.109
dstip=50.205.244.37
dstip=38.141.40.220
dstip=204.93.207.12
dstip=107.189.11.246
dstip=207.244.103.95
dstip=198.98.60.13
dstip=162.159.200.123
dstip=44.190.6.254
dstip=23.239.24.67
dstip=172.16.1.70
dstip=97.127.48.73


Just an idea: I have had servers running for years without rebooting, and the IP addresses the NTP client picked up at start time were still answering queries. Is it possible that the ESX clusters were synchronizing to IP addresses which were in the pool a long time ago and have since been removed? I guess it would be a good idea to run a check on historical IP addresses (not in the pool database any more) to see their offset, if they still answer NTP queries.
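
If someone wants to script that check, here is a minimal sketch using the third-party Python ‘ntplib’ package (pip install ntplib); the two IPs are just examples from the firewall list above:

import ntplib

# Two of the destination IPs from the firewall log earlier in the thread.
SUSPECTS = ["38.141.40.220", "97.127.48.73"]

client = ntplib.NTPClient()
for ip in SUSPECTS:
    try:
        resp = client.request(ip, version=3, timeout=2)
        print("%s: offset %+.3f s, stratum %d" % (ip, resp.offset, resp.stratum))
    except Exception as exc:  # no answer, or not serving NTP any more
        print("%s: no usable response (%s)" % (ip, exc))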

According to the VMware link I posted above, there are log files on the ESX host which record the NTPD messages, such as whether the system applied a time correction, and which show which NTP server the ESX host was syncing with. It also says “There are known NTP server synchronization issues with certain versions of ESXi 5.x”. When things calm down it’s worth running through :slight_smile:

https://kb.vmware.com/s/article/1005092

I will see if the host logs are still there, but I am running 6.5 and 6.7 everywhere.

Going through that list of IPs: except for two of them, they are all on different networks and different user accounts. None had any reported issues with their time except one, which had a brief I/O timeout that pulled it out of the pool for a few hours. One of those IPs is even a Cloudflare IP (which they’ve added to the pool; it’s legit).

One of those IPs, 172.16.1.70, doesn’t pull up any info; that has never been a pool IP. And 97.127.48.73 was removed from the pool on 2018-07-01.

Are you using NTPD or Chrony? Can you post your config file?

Just curious, the other people reporting issues, were their servers also running as VM guests, or were any of them physical servers? Just trying to narrow things down.

It’s extremely weird that the time would jump forward, and by that far. Usually, if anything, a server falls behind by a few minutes (due to delays) before someone notices.

Looks like the other guys were also running VMware with Windows guests, from what I can tell.

172.16.1.70 is an unroutable (local LAN) address.

Even if just that one was giving dodgy time, ntpd will reject it. It also won’t jump more than 16.67 minutes in one go (unless the config has been strangely modified). There’s something else happening here. ESX is the common thread so far. We need to understand ESX versions, know the flavour and version of NTPD, see config files and look through logs if we are to make an informed comment.