NTP pool wrong time issue 2/5/20

Doh. Yeah, didn’t catch that… Still drinking my coffee this morning. LOL…

That kind of implies the server names aren’t being re-resolved when an address becomes unroutable, and that the client is working from a very stale DNS lookup (1 year 7 months old, assuming the server was removed on that date!). That IP doesn’t respond to NTP requests from a web-based “check my NTP server” page, so it seems to be both out of the pool and not replying to NTP queries.

Almost 3pm here - lots of caffeine already :grinning:

Any new information?

Something to consider, in light of that 172.16 server: if that NTP server on your network was giving bad time, and firewall rules prevent your servers from getting time from anywhere else (outright blocks, NAT that sends all NTP traffic to the same host, or broken NAT), then that could be the cause, as your servers would have no correct sources to compare it against. That would explain how you still ended up with a massive offset even though none of the pool addresses you were configured to use were providing bad time. What does ntpdate -q 172.16.1.70 say?
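
If ntpdate isn’t handy, here is a rough Python sketch of the same kind of one-shot query, in case it helps. It sends a plain SNTP client packet and applies the standard offset formula, ((t2 - t1) + (t3 - t4)) / 2. The function names are made up for this post, it does no sanity checks on the reply (stratum, leap bits, origin timestamp), and it uses floats, so treat the result as good to roughly a microsecond at best. It is not what ntpdate does internally, just the same idea.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 (NTP) to 1970-01-01 (Unix)


def to_ntp(unix_time):
    """Convert Unix seconds to seconds on the NTP timescale."""
    return unix_time + NTP_EPOCH_OFFSET


def build_request(t1_unix):
    """Build a 48-byte SNTP client packet (LI=0, VN=4, Mode=3)."""
    t1 = to_ntp(t1_unix)
    secs = int(t1)
    frac = int((t1 - secs) * 2**32)
    # First byte 0x23, 39 zero bytes, then the transmit timestamp at bytes 40-47.
    return struct.pack("!B39xII", 0x23, secs, frac)


def _read_timestamp(data, position):
    """Read a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction)."""
    secs, frac = struct.unpack_from("!II", data, position)
    return secs + frac / 2**32


def parse_offset(data, t1_unix, t4_unix):
    """Return the server-minus-client clock offset in seconds."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    t1 = to_ntp(t1_unix)            # client send time
    t2 = _read_timestamp(data, 32)  # server receive timestamp
    t3 = _read_timestamp(data, 40)  # server transmit timestamp
    t4 = to_ntp(t4_unix)            # client receive time
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(host, timeout=2.0):
    """Send one request to host on 123/UDP and return the measured offset."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(build_request(t1), (host, 123))
        data, _ = sock.recvfrom(512)
        t4 = time.time()
    return parse_offset(data, t1, t4)


# Example (needs network access to the server in question):
# print(query_offset("172.16.1.70"))
```

A result near 2400 would correspond to the 40 minutes being discussed. A timeout would instead suggest that host isn’t answering NTP at all.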

I’ve seen this happen before in a network where somebody set up an NTP server for public access, disabled connection tracking for all 123/UDP, and didn’t realise that doing so also stops every client behind NAT from reaching public NTP servers (because with no connection tracking the router has no idea where to route the replies, and simply drops them). The only NTP server the clients could get replies from was an internal one with a private IP, which had a clock that was way off and was unmaintained. It was also being advertised by DHCP, but I’m not sure how many clients picked it up that way vs manual config or DNS.
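
To make the reply-dropping part concrete, here is a toy Python model of a NAT router. It is a deliberately simplified sketch, not how any particular firewall is implemented: the class name, the port numbers, and the addresses (documentation ranges plus a made-up 172.16 client) are all assumptions for illustration. The only point is that an inbound reply can be delivered only if the outbound packet left an entry in the tracking table.

```python
class ToyNatRouter:
    """Minimal NAT model: replies are routed only via a tracking table."""

    def __init__(self, public_ip, track_ntp=True):
        self.public_ip = public_ip
        self.track_ntp = track_ntp
        # (public_port, remote_ip, remote_port) -> (client_ip, client_port)
        self.conntrack = {}
        self._next_port = 40000

    def outbound(self, client_ip, client_port, remote_ip, remote_port):
        """Rewrite the source address; record the flow unless it is untracked NTP."""
        public_port = self._next_port
        self._next_port += 1
        if self.track_ntp or remote_port != 123:
            key = (public_port, remote_ip, remote_port)
            self.conntrack[key] = (client_ip, client_port)
        return (self.public_ip, public_port, remote_ip, remote_port)

    def inbound(self, remote_ip, remote_port, public_port):
        """Return the internal destination for a reply, or None if it is dropped."""
        return self.conntrack.get((public_port, remote_ip, remote_port))


def ntp_round_trip(router):
    """Send one NTP request through the router and try to deliver the reply."""
    _, public_port, remote_ip, remote_port = router.outbound(
        "172.16.1.20", 123, "192.0.2.10", 123
    )
    return router.inbound(remote_ip, remote_port, public_port)


tracked = ToyNatRouter("203.0.113.1", track_ntp=True)
untracked = ToyNatRouter("203.0.113.1", track_ntp=False)
print(ntp_round_trip(tracked))    # reply reaches the internal client
print(ntp_round_trip(untracked))  # None: no table entry, reply is dropped
```

An internal server on the same private network never crosses the NAT, so it keeps answering, which matches what the clients in that network ended up seeing.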

When I pulled the logs, the internal server was the last one in the list of requests. Everything earlier was NTP servers.

@steve2030 As others said, spot checking the IPs from your logs shows they have all had good performance. Looking in the monitoring logs, “40 minutes off” hasn’t happened in tens of millions of monitoring checks (I only looked back a year).

Without more trouble reports, it being 40 minutes for multiple people running the same platform(?) seems too much like a coincidence.