On the 15th of August between 17 and 20 UTC a number of Danish street light controllers got a time stamp from the above Stratum 1 server that was more than four months off (31 March). Has this been reported elsewhere, and should we expect that the issue has been resolved?
That IP was removed from the pool at 2023-08-15 14:42:29 (UTC) when it wasn’t responding. When it came back online around 14:59 it had an unsynchronized time signal (“leap=3”) and remained out of the pool.
A couple recommendations:
- Don’t use a single IP to get the time
- Don’t hold on to a DNS result for an excessive period of time
- Check the leap flag, when it’s 3 the clock isn’t synchronized
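For anyone wanting to see what the leap check in the last recommendation looks like on the wire, here is a minimal sketch in Python. It assumes you already have the raw 48-byte reply in hand; the function names parse_ntp_header and is_usable are made up for illustration and do not come from chrony or any other client.

```python
import struct

LEAP_UNSYNCHRONIZED = 3  # leap indicator 0b11: server clock is not synchronized
MODE_SERVER = 4          # NTP mode value used in server replies


def parse_ntp_header(packet: bytes) -> dict:
    """Extract leap indicator, version, mode and stratum from an NTP packet."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    first, stratum = struct.unpack("!BB", packet[:2])
    return {
        "leap": first >> 6,             # top 2 bits of the first byte
        "version": (first >> 3) & 0x7,  # next 3 bits
        "mode": first & 0x7,            # low 3 bits
        "stratum": stratum,             # second byte
    }


def is_usable(packet: bytes) -> bool:
    """Return True only for server replies that claim to be synchronized."""
    header = parse_ntp_header(packet)
    return (
        header["mode"] == MODE_SERVER
        and header["leap"] != LEAP_UNSYNCHRONIZED
        and 1 <= header["stratum"] <= 15  # 0 and 16 are not usable strata
    )


# Hand-built example packets: only the first two bytes matter for this check.
synced = bytes([0x24, 1]) + bytes(46)    # LI=0, VN=4, mode=4, stratum 1
unsynced = bytes([0xE4, 1]) + bytes(46)  # LI=3, VN=4, mode=4, stratum 1
print(is_usable(synced), is_usable(unsynced))
```

A real client does much more validation than this (origin timestamp matching, root distance and so on); the sketch only covers the leap and stratum fields.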
Who programmed just 1 timeserver in those streetlights instead of a dns-pool?
That person should be held accountable.
Sorry, but the pool only hands out servers that are active and have been checked to be correct.
The problem is not caused by us, but by your programmer/supplier of those lights.
Thank you – I see from logs on the units that leap was indeed 3, thank you for pointing that out. The problem seems to be the way we have been using a quite old version of chrony for the time synchronization. It does invalidate the response at one point, but still registers a time skew, which after a chrony “makestep” sets the time incorrectly.
The units do not use a single IP, but use a DNS result from the pool. They do, however, store the IP address for some time (which should only be as long as they receive valid results); this was originally done to save bandwidth. I believe I have the information to get this fixed, thank you.
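The "keep the address only as long as it gives valid results" policy described above could be sketched roughly like this in Python. This is not how the units or chrony actually do it; the class name CachedServer and the max_failures threshold are assumptions made up for the example.

```python
class CachedServer:
    """Keep a resolved address only while it keeps returning valid replies."""

    def __init__(self, max_failures: int = 3):
        self.address = None            # no address until DNS has been queried
        self.failures = 0              # consecutive invalid or missing replies
        self.max_failures = max_failures

    def needs_resolve(self) -> bool:
        """True when a fresh DNS lookup is required before the next poll."""
        return self.address is None

    def set_address(self, address: str) -> None:
        """Store a freshly resolved address and reset the failure counter."""
        self.address = address
        self.failures = 0

    def record_reply(self, valid: bool) -> None:
        """Count consecutive bad replies; drop the address at the threshold."""
        if valid:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.address = None        # forces a new DNS lookup
```

A reply with leap=3 would count as invalid here, so DNS is only queried again once the cached server has stopped being useful, which keeps the extra traffic small.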
I don’t know where you got that they are programmed that way? They use a dns-pool, but due to bandwidth limitations they can’t do a dns update on every ntp request. If you only have a few available kB per month, that is indeed an issue. But as you see from my other answer, there is clearly a bug in that they don’t correctly discard the response with leap=3.
Trusting only one IP for time is not advisable. There is no SLA nor guarantee on the NTP Pool service. If you are going to trust a single IP you really need to be the one operating the software on that IP at the very least (and even then it’s not ideal).
What stops someone joining the pool with many IPs, waiting until they see one of your devices as a client, and then giving you false time with leap deliberately set to 0? I don’t think that anything the pool does, nor anything you do, can stop that, if you trust only one IP address at a time.
This is a very bad idea that should not be replicated by anyone else.
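To illustrate why more than one source helps against a single bad (or malicious) server, here is a deliberately simplified majority check in Python. It is not the selection algorithm that ntpd or chrony implement; the function name agreed_offset and the 0.5 second tolerance are assumptions for the sake of the example.

```python
from statistics import median


def agreed_offset(offsets, tolerance=0.5):
    """Return an offset only if a strict majority of sources agree on it.

    offsets: measured clock offsets in seconds, one per server.
    Returns the mean of the agreeing offsets, or None if there are fewer
    than three sources or no strict majority lies within `tolerance`
    seconds of the median.
    """
    if len(offsets) < 3:
        return None  # with one or two sources a liar cannot be outvoted
    mid = median(offsets)
    agreeing = [o for o in offsets if abs(o - mid) <= tolerance]
    if len(agreeing) * 2 <= len(offsets):
        return None  # no strict majority, so do not touch the clock
    return sum(agreeing) / len(agreeing)


# Three sane servers and one that is months off: the outlier is ignored.
print(agreed_offset([0.01, -0.02, 0.03, -11836800.0]))
# A single source can never be confirmed.
print(agreed_offset([-11836800.0]))
```

With a single address there is nothing to compare against, which is the point made above: a wrong answer with leap set to 0 looks exactly like a right one.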
You can request time but a dns-request is too expensive?
And that for several months?
Normally you have a DNS cache that holds a working result for some time, usually hours, but this can be extended.
However, DNS requests take just a few bytes.
Also, the ntppool doesn’t use the leap-system.
Looks to me your programmer is way wrong on NTP.
What I don’t understand is, if you have limited bandwidth, why not start your own NTP stratum 2/3/4 server on a static IP that you control yourself???
Then let that server connect to the pool. I mean, a simple Raspberry Pi 4 is up to that task.
Placing streetlights all over town costs millions per street, but running a server to keep bandwidth low is too expensive? For Denmark? One of the richest countries in the world… weird.
@amplex Sorry, from the scenario I assumed it was a (too) “dumb” SNTP client!
I’m surprised that chrony didn’t discard the answers from this server and use answers from the other configured IPs instead.
If you can tell which version of chrony and share your configuration, @mlichvar might have suggestions for making it behave appropriately. (Though maybe “don’t run an ancient version” will be the first suggestion!)
Since they are using an NTP daemon (rather than an SNTP client) that’s less relevant; we expect NTP daemons to hold on to the IPs they get for weeks or longer.
No, we try to support leap seconds (as much as I’m looking forward to them going away)! (Not that it’s relevant in this scenario, leap=3 doesn’t have anything to do with leap seconds).
This is getting a bit out of context, but FWIW: It is not uncommon that M2M-simcards are limited to 1MB of data per month, even if this seems absurd. As a DNS request is normally around 500 bytes, while I figure an ntp request is around 50, it makes a lot of sense, to me at least, to limit the number of DNS requests. This doesn’t make the current handling correct, and as others have suggested the solution seems to be to switch to our own ntp server.
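To put rough numbers on that trade-off, here is a small back-of-the-envelope calculation in Python. The 500 and 50 byte figures are the estimates from the post above, not measured on-wire sizes, and the hourly polling rate is an assumption picked only for illustration.

```python
def monthly_bytes(polls_per_day, dns_every_n_polls,
                  ntp_bytes=50, dns_bytes=500, days=30):
    """Estimate monthly traffic for NTP polling plus periodic DNS lookups.

    ntp_bytes and dns_bytes are the rough per-request estimates quoted in
    the thread; real traffic also includes replies and protocol overhead.
    """
    polls = polls_per_day * days
    lookups = polls / dns_every_n_polls
    return polls * ntp_bytes + lookups * dns_bytes


# Assumed: one NTP poll per hour.
every_poll = monthly_bytes(24, 1)    # DNS lookup before every poll
once_a_day = monthly_bytes(24, 24)   # DNS lookup once per day
print(every_poll, once_a_day)        # 396000.0 51000.0
```

Under these assumptions a lookup before every poll would use roughly 40% of a 1 MB monthly allowance, while one lookup per day uses about 5%, so limiting DNS traffic is understandable even if caching the address indefinitely is not.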
No problem, and I understand why you would think that. The units that we are talking about are running a very old version of chrony (1.45), where requesting a “makestep” seems to not take the leap=3 into account. We will look into switching to a more recent version as well, there is probably no need to spend more time on the problem mentioned here. Thanks for your kind assistance, though.
I don’t. My Chrony has the support turned off.
@ask please enlighten me, as leap=3 is something not to be found with google, so I assumed it’s the leap-second-support-system in the daemons and clients.
Please tell me what it is, thanks.
(Ask had originally assumed SNTP so posted that RFC, but the same packet is relevant for NTP and is also described in RFC1305.)
Yes. I don’t know what chrony or ntpd do with leap=3 (“not synchronized”) packets, but it’d be reasonable if they’d be wary of them, in particular when stepping the clock. The monitoring system doesn’t give a negative score for “not in sync” responses if the responses are still accurate (though I have a todo item to do something else with them …).
@davehart or @mlichvar, what is the recommended client behavior when getting leap=“not in sync” responses? Should the NTP Pool monitor remove servers with that response (that aren’t also KoD responses, which are already being removed)?
@amplex I’d recommend you configure chrony with the pool keyword to make sure it’s using multiple servers (or multiple server statements if your version is too old to have
Ntpd ignores upstream leap=3/unsynched responses as far as timekeeping, but it considers the source still responsive as far as the reach register goes.
As the (S)NTP RFCs surely state, clients must not use the time when leap is 0b11 (i.e. 3). This occurs routinely during ntpd startup before the first system peer selection.