NTP pool wrong time issue 2/5/20

Oh, just realized you said 40 minutes. I looked back through the monitoring data since January 1st for measurements with a 30 to 50 minute offset, and I see one server from France on January 21st that had a 31 minute offset (it was promptly removed from the pool after that until it had better time). An IPv6 server in the UK has a single measurement with a 47 minute offset on January 18th.

For what it’s worth, it looks like in the ~21 million checks done in 2019 there was just one server that had two measurements like that, so this year is on track to be many times worse. :slight_smile:

40 minutes is weird though, and as others said a careful NTP implementation would ignore that.


At some points we would get drifts of up to 8 minutes on some hosts while others were fine. This was across the Dallas, New Jersey, New York, and Boston datacenters. We have VMware hosts in all those locations. I got the log from the firewall, but it's too large to attach to the forum. Let me figure out how to share it.

I looked at the Newark data starting February 1 and found no pool servers with an offset greater than 1 minute. An NTP server IP list would help.

How are your VMware servers syncing? Was NTPD saying everything was okay even with the time off by 40 minutes, or did it say all sources were invalid (and your VM was free-running)?

Also, with numerous hosts across numerous data centers, it’s unlikely they would all be using the exact same pool servers…

The ESX server can be configured with several NTP servers (though I have no idea if it’s a full NTP implementation or SNTP round-robin)…

But then each of the VM guests can either sync periodically to the host they are on, OR free-run and require their own NTP service running… Depending on which route you go, there’s a little extra configuring you have to do.
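If you go the guest-runs-its-own-NTP route, a rough sketch of checking and turning off the VMware Tools periodic sync in a Linux guest (assuming open-vm-tools or VMware Tools is installed; the commands are not something confirmed from your setup) looks like:

```
# Check whether VMware Tools periodic time sync is currently enabled
vmware-toolbox-cmd timesync status

# If the guest runs its own ntpd/chronyd, disable the Tools sync so the
# two mechanisms don't fight over the clock
vmware-toolbox-cmd timesync disable
```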

A 40-minute-fast jump shouldn’t be possible: NTPD won’t shift the clock that much once it is running unless you made some not-well-documented config changes.

FYI, 4 servers is the MINIMUM you would want in your ntp.conf… There is no reason to be stingy as NTP packets are very small and only sent at long intervals. Realistically you would want around 7 servers to have good redundancy against falsetickers.
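As a rough sketch (the hostnames assume the US pool zone mentioned later in the thread), an ntp.conf along these lines gives ntpd enough sources to out-vote a falseticker:

```
# Sketch of the source section of /etc/ntp.conf, assuming the US pool zone.
# "pool" lets ntpd resolve each name to several servers and re-resolve over
# time; "iburst" just speeds up the initial synchronization.
driftfile /var/lib/ntp/ntp.drift   # typical default path, may differ per distro

pool 0.us.pool.ntp.org iburst
pool 1.us.pool.ntp.org iburst
pool 2.us.pool.ntp.org iburst
pool 3.us.pool.ntp.org iburst
```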

The few times I’ve seen people complaining about their time being wildly off, they were always running VMs… Unfortunately nobody to date has had any sort of logging, or even bothered to do an ‘ntpq -pn’ before rebooting, so there’s no way to know what went on…
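For anyone who hits this, capturing the peer state before restarting anything is the single most useful thing. Something like:

```
# Dump the peer list (numeric) and basic daemon variables before touching anything
ntpq -pn
ntpq -c rv

# On chrony-based systems the rough equivalent would be:
# chronyc sources -v && chronyc tracking
```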

VMware was configured with the same settings in all our clusters: the 0-3 US pool servers. The initial red flag for us was a 40 min time skew. If you look at the Reddit thread, there is also another person who had the same issue 2 days ago with a 40 min skew. We initially thought it was something on our end, and when we could not find anything we started the Reddit thread. As we kept testing later in the day we were getting 4-8 min skews. I would restart NTP on the host and it would be normal for about 5-10 mins, then it would start to drift. While I don’t have the logs from the host, I was able to grab the logs from the firewall, and I’m going to post that now.

I was able to grab the firewall logs, pulled all the NTP requests, and found all the IPs:

dstip=38.141.40.220
dstip=216.126.233.109
dstip=50.205.244.37
dstip=38.141.40.220
dstip=204.93.207.12
dstip=107.189.11.246
dstip=207.244.103.95
dstip=198.98.60.13
dstip=162.159.200.123
dstip=44.190.6.254
dstip=23.239.24.67
dstip=172.16.1.70
dstip=97.127.48.73
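(For anyone wanting to do the same extraction, a one-liner along these lines pulls the unique destination IPs out of a raw firewall export; the filename is just a placeholder.)

```
# Extract the unique NTP destination IPs from the firewall export
grep -oE 'dstip=[0-9.]+' fw-ntp.log | cut -d= -f2 | sort -u
```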

Just an idea: I have had servers running for years without rebooting, and the IP addresses the NTP client picked up at start time were still answering queries. Is it possible that the ESX clusters were synchronizing with IP addresses that were in the pool a long time ago and have since been removed? It might be a good idea to run a check on historical IP addresses (no longer in the pool database) to see what offset they report if they still answer NTP queries.
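A quick-and-dirty way to spot-check a list of old IPs (assuming ntpdate is installed; "-q" only queries and never sets the clock):

```
# Query each suspect IP once and print its reported offset without
# touching the local clock. IPs here are the two odd ones from the log.
for ip in 97.127.48.73 172.16.1.70; do
    echo "== $ip =="
    ntpdate -q "$ip"
done
```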

According to the VMware link I posted above, there are log files on the ESX host which record the NTPD messages, such as whether the system applied a time correction, and which show which NTP server the ESX host was syncing with. It also says “There are known NTP server synchronization issues with certain versions of ESXi 5.x”. When things calm down it’s worth running through :slight_smile:

https://kb.vmware.com/s/article/1005092
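On the ESXi host shell, something along these lines should show whether ntpd logged any step or correction messages around the incident; treat the log file path as an assumption, since the exact file names differ between ESXi versions.

```
# Look for ntpd messages in the host syslog (path is an assumption and
# varies by ESXi version)
grep -i ntpd /var/log/syslog.log | tail -n 50
```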

I will see if the host logs are still there, but I am running 6.5 and 6.7 everywhere.

Going through that list of IPs, except for two of them, they are all on different networks and under different user accounts. None had issues with their reported time except one, which had a brief i/o timeout that pulled it out of the pool for a few hours. One of those IPs is even a Cloudflare IP (which they’ve added to the pool; it’s legit).

One of those IPs doesn’t pull up any info: ‘172.16.1.70’ has never been a pool IP, and ‘97.127.48.73’ was removed from the pool on 2018-07-01.

Are you using NTPD or Chrony? Can you post your config file?
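If it isn’t obvious which daemon is in use, something like this will usually tell you and grab whichever config exists in one go (the paths are common defaults, not guaranteed on every distro):

```
# Which time daemon is actually running?
ps -e | grep -E 'ntpd|chronyd'

# Grab whichever config file is present (usual default locations)
cat /etc/ntp.conf /etc/chrony.conf /etc/chrony/chrony.conf 2>/dev/null
```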

Just curious: were the other people reporting issues also running their servers as VM guests, or were any of them physical servers? Just trying to narrow things down.

It’s extremely weird that the time would jump forward, and by that much. Usually, if anything, delays cause a server to fall behind by a few minutes before someone notices.

Looks like the other guys were also running VMware and Windows guests, from what I can tell.

172.16.1.70 is an unroutable (local LAN) address.

Even if just that one was giving dodgy time, ntpd will reject it. It also won’t jump more than 16.67 minutes in one go (unless the config has been strangely modified). There’s something else happening here. ESX is the common thread so far. We need to understand ESX versions, know the flavour and version of NTPD, see config files and look through logs if we are to make an informed comment.
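For context, the “strangely modified” case would be something like an explicit panic override in ntp.conf. The default panic threshold is 1000 seconds (the 16.67 minutes above), beyond which ntpd exits rather than stepping, so a 40 minute step implies a non-default change along these lines:

```
# ntp.conf lines of the kind that would remove the normal safety limit.
# Default is "tinker panic 1000" (seconds); setting 0 disables the panic
# exit entirely. (The -g startup flag only permits one large initial step.)
tinker panic 0
```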

Doh. Yeah, didn’t catch that… Still drinking my coffee this morning. LOL…


Kinda implies server names aren’t being refreshed: an unroutable address plus a very stale DNS lookup (1 year 7 months old, assuming the server was removed on that date!). That IP doesn’t respond to NTP requests from a web-based “check my NTP server” page, so it seems to be both out of the pool and not replying to NTP queries.

Almost 3pm here - lots of caffeine already :grinning:

Any new information?


Something to consider, in light of that 172.16 server: if you had that NTP server on your network giving bad time, and firewall rules that prevented your servers from obtaining time from anywhere else (either outright blocks, NAT of all NTP traffic to the same host, or broken NAT), then that could be the cause, as your servers would have had no other correct sources to compare it against. What does ntpdate -q 172.16.1.70 say? That could explain why none of the pool addresses you were configured to use were providing bad time, yet you still ended up with a massive offset.

I’ve seen this happen before on a network where somebody set up an NTP server for public access, disabled connection tracking for all UDP port 123 traffic, and didn’t realise that doing so also meant none of the clients behind NAT could reach public NTP servers (without connection tracking the router has no idea where to route the replies, so it simply drops them). The only NTP server whose replies could be received was an internal one with a private IP, which had a clock that was way off and was unmaintained. It was also being advertised by DHCP, but I’m not sure how many clients picked it up that way vs manual config or DNS.
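An easy way to check for that failure mode from an affected client is to watch whether NTP replies actually come back, e.g.:

```
# Watch NTP traffic for a minute or so: you should see a reply (server -> client)
# for each request; requests with no replies point at NAT/firewall breakage
tcpdump -n -i any udp port 123
```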


When I pulled the logs, that internal server was the last in the set of requests in the list. Everything earlier in the list was public NTP servers.

@steve2030 As others said, spot-checking the IPs from your logs, they have all had good performance. Looking in the monitoring logs, “40 minutes off” hasn’t happened in tens of millions of monitoring checks (I only looked back a year).

Without more trouble reports, it being 40 minutes for multiple people running the same platform(?) seems too much like a coincidence.