Very long jumps in system time

Hi

I’m investigating a strange phenomenon we’re seeing with a fleet of IoT devices. We’re seeing significant time jumps on the system clocks. These jumps can be minutes, but some are months. The erroneous time never lasts long (estimated at most an hour, but usually a matter of minutes). Across a fleet of approximately 2000 devices in diverse locations this occurs perhaps once a week.

We have very little evidence to explain why this is happening. However, the system log does show the SNTP client updating the system clock both at the moment the clock jumps to the wrong time and when it jumps back again. The client is set to the default NTP pool for the base image as provided by our supplier: [0,1,2,3].debian.pool.ntp.org.

Obviously we’re investigating software / hardware issues on our side, but having spent some weeks investigating this I’d like to get a word on the reliability of the Debian pool. As I understand it SNTP is much less fault tolerant than NTP. How likely is it that we receive sporadic incorrect times from one of these servers? Here “incorrect” means more than a minute out.

Does anyone have any other suggestions about possible causes, or ways we can guard against this? We have already spotted instances of home routers reconfiguring the time server via DHCP, and we’ve now overridden this as a precaution. Is anyone aware of any ISP messing with ntp.org DNS domain names or intercepting NTP packets?

All suggestions welcome.

SNTP is a one-shot deal. It queries one server in a few burst packets back & forth to determine the time, then updates the clock, and that’s it…
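To make the one-shot exchange concrete, here is a rough Python sketch of what a minimal SNTP client does with a single reply. The function names are made up for illustration, and it takes a shortcut a real client would not: it leaves the request's transmit timestamp at zero. The offset and delay arithmetic is the standard four-timestamp calculation.

```python
import socket
import struct
import time

NTP_UNIX_DELTA = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01
HEADER = struct.Struct("!BBbbII4sIIIIIIII")  # 48-byte NTP header, no extensions


def build_request(version: int = 4) -> bytes:
    """Client request: LI=0, VN=version, Mode=3 (client), everything else zero."""
    return bytes([(version << 3) | 3]) + bytes(47)


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (two 32-bit halves) to Unix seconds."""
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def parse_response(packet: bytes, t1: float, t4: float) -> dict:
    """t1 and t4 are the client's own send and receive times in Unix seconds."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than 48 bytes")
    f = HEADER.unpack(packet[:HEADER.size])
    t2 = ntp_to_unix(f[11], f[12])  # server receive timestamp
    t3 = ntp_to_unix(f[13], f[14])  # server transmit timestamp
    return {
        "leap": f[0] >> 6,    # 3 means the server says it is unsynchronised
        "mode": f[0] & 0x7,   # 4 means a server reply
        "stratum": f[1],      # 0 is a kiss-o'-death, not a usable time source
        "offset": ((t2 - t1) + (t3 - t4)) / 2,
        "delay": (t4 - t1) - (t3 - t2),
    }


def query(server: str, timeout: float = 2.0) -> dict:
    """One-shot query; returns the parsed reply plus the IP that answered."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(build_request(), (server, 123))
        packet, (ip, _port) = sock.recvfrom(512)
        t4 = time.time()
    result = parse_response(packet, t1, t4)
    result["server_ip"] = ip
    return result
```

With only one server and one sample there is nothing to cross-check the reply against, which is why a single bad answer can move the clock.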

All the NTP servers in the pool are regularly queried (about every 10-15 minutes) by a monitor to ensure they are responding and have valid time.

I’ve never heard of ISPs intercepting or redirecting NTP packets. It’s a very small UDP packet, so it doesn’t consume any noticeable bandwidth unless someone is doing something malicious. There was a time when some ISPs blocked all NTP traffic, back when a bug allowed an amplification attack, but that was many years ago.

If the time jump only occurs during updating, then I would start debugging your SNTP client to see what time it is sending, and what is being returned. Likewise I would check to see if it is one particular IP from the pool causing issues (maybe a malicious server?), or if it comes from multiple sources then you can probably rule the pool out and focus more on your IoT’s software / hardware.
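As a sketch of that "one IP or many?" check: assuming you can extract (server IP, size of clock step) pairs from your logs, something like the following would show whether the bad syncs cluster on particular servers. The function name, the input shape and the 60-second threshold are assumptions for illustration; the addresses in the usage example are documentation placeholders, not real pool servers.

```python
from collections import defaultdict


def bad_syncs_by_server(events, threshold_s=60.0):
    """events: iterable of (server_ip, clock_step_seconds) pairs taken from logs.

    Returns only the servers that produced at least one step larger than
    threshold_s, with their total sync count and bad sync count.
    """
    totals = defaultdict(lambda: {"syncs": 0, "bad": 0})
    for server_ip, step in events:
        totals[server_ip]["syncs"] += 1
        if abs(step) > threshold_s:
            totals[server_ip]["bad"] += 1
    return {ip: counts for ip, counts in totals.items() if counts["bad"]}


# Illustrative input only (made-up values, placeholder addresses).
example = [
    ("192.0.2.10", 0.02),
    ("192.0.2.10", 7_776_000.0),
    ("198.51.100.7", -0.5),
    ("192.0.2.10", -7_776_000.0),
]
print(bad_syncs_by_server(example))
```

If the bad steps are spread evenly over many unrelated IPs, the pool becomes a much less likely culprit than the device itself.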

Also, depending on your SNTP implementation, you may be able to limit large time jumps (which should be the default for most), in which case the client will just consider the time invalid and not apply it.
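If your client has no such setting, the idea is simple enough to sketch. This is a hypothetical guard, not any particular client's behaviour: the function name, the 300-second default and the first-sync exception are all assumptions. The exception is there because a device without a battery-backed clock may boot with a wildly wrong time and needs one large step.

```python
def should_apply(offset_s: float, max_step_s: float = 300.0,
                 first_sync: bool = False) -> bool:
    """Decide whether a measured clock offset is plausible enough to apply.

    first_sync=True allows an arbitrarily large correction once after boot;
    afterwards anything beyond max_step_s is treated as invalid and ignored.
    """
    if first_sync:
        return True
    return abs(offset_s) <= max_step_s


print(should_apply(0.8))                           # small correction
print(should_apply(7_776_000.0))                   # ~90 days: rejected
print(should_apply(7_776_000.0, first_sync=True))  # allowed once after boot
```

Rejected offsets are worth logging together with the server IP, since those are exactly the events you are trying to explain.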

Are these IoT devices always connected to the internet, or is it intermittent? More details would be helpful…

Have you considered using a modern standard full NTP package like NTPD or Chrony to give better fault tolerance and stability?
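To illustrate why several sources help: a full NTP client compares multiple servers and discards the ones that disagree with the rest. The sketch below is a deliberately simplified stand-in for that idea, not the actual selection algorithm used by ntpd or chrony. The function name and the one-second agreement window are assumptions.

```python
from statistics import median


def consensus_offset(offsets, max_spread_s=1.0):
    """Return an offset only if a strict majority of servers agree.

    offsets: clock offsets in seconds, one per server queried.
    Returns None when there are fewer than three samples or no majority
    lies within max_spread_s of the median.
    """
    if len(offsets) < 3:
        return None
    mid = median(offsets)
    agreeing = [o for o in offsets if abs(o - mid) <= max_spread_s]
    if len(agreeing) * 2 <= len(offsets):
        return None
    return median(agreeing)


# One server is months out; the other three outvote it.
print(consensus_offset([0.01, -0.02, 0.03, 7_776_000.0]))
# No two servers agree, so nothing is applied.
print(consensus_offset([0.0, 100.0, 200.0]))
```

A single-server SNTP client has no equivalent of this vote, so one wrong reply is enough to step the clock.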

Okay. That would suggest a jump of months is highly unlikely, though of course errors do happen. I wonder if there are any stats on servers being removed from the pool?

The weird thing is we have seen logs stating “Updating time from 192.168.1.1” which was how we caught the DHCP thing. After disabling the DHCP client from re-configuring the time servers the issue persisted.

If you / anyone else knows how to do this for systemd-timesyncd then that will save me trawling a not-so-helpful manual.

I am of course making the case for doing so. There was a reluctance at first because it was seen as “heavy weight” and “a server”. sigh

First, if these are commercial devices, you might want to consider looking at the following page and applying for your own vendor zone.

https://www.ntppool.org/en/vendors.html

You would need the specific IP(s) of the server(s) to see its pool stats. But yes, information about each server is logged.

Ah, okay… That is DHCP Option 042. I’ve never seen an ISP (or even off-the-shelf) router implement that! Basically the DHCP server has a bunch of options it can send out to clients, and option 42 happens to be a default NTP server for the client to query.
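For anyone capturing DHCP traffic to confirm this: the option 42 payload is just a list of 4-byte IPv4 addresses in preference order. A small Python sketch of decoding it, assuming you have already pulled the raw option bytes out of the DHCP packet (the function name is made up):

```python
import ipaddress


def parse_option_42(payload: bytes):
    """Decode a DHCP option 42 payload into a list of IPv4 addresses.

    payload is the option data only, without the code and length bytes.
    """
    if not payload or len(payload) % 4:
        raise ValueError("option 42 data must be a non-empty multiple of 4 bytes")
    return [ipaddress.IPv4Address(payload[i:i + 4])
            for i in range(0, len(payload), 4)]


# A router advertising itself as the NTP server.
print(parse_option_42(bytes([192, 168, 1, 1])))
```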

I’ve never really messed with SNTP on any modern system, though I’m sure google will provide plenty of information. There’s not a lot of options for SNTP, it’s a very basic query utility.

It’s no different than any other running service, but yes, it does require minimal configuration (once you have it configured right you will likely never need to change it again). Yes, I would recommend disabling its ability to serve time, as there is no need for your IoT device to be an NTP server.

Thanks I’ll be raising that internally too. Alas we live in an age where people buy off the shelf devices and expect them to just “sync with the internet” forgetting there need to be servers on the other side to make that work. One positive to come out of this has been that we as a company are now more aware of time sync issues. We do of course endeavour to be “good citizens”.

Hi, https://manpages.debian.org/stretch/systemd/timesyncd.conf.5.en.html doesn’t show a lot of options other than having a space-separated list of servers! Best practice would be to have four servers in the list, so [0,1,2,3].debian.pool.ntp.org as you say in your first post. Not sure how it decides which server(s) to query.

If it logs the IP of the server it synced from (which your posts imply it does), then you should be able to backtrack from that and see what the pool thought of that server.

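192.168.x.x is obviously a private address that isn’t routable on the public internet, so it can’t be a pool server, though it looks like you ruled that out by stopping DHCP from reconfiguring the time servers.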
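If you are scanning logs across the fleet, a quick way to flag sync sources that cannot be pool servers is Python's standard `ipaddress` module. The helper name is hypothetical; note that `is_private` covers more than just the 192.168.x.x, 10.x.x.x and 172.16.x.x ranges.

```python
import ipaddress


def is_local_time_source(ip: str) -> bool:
    """True if the address cannot be a public pool server."""
    addr = ipaddress.ip_address(ip)
    # is_private is broader than RFC 1918; loopback and link-local are
    # checked explicitly for clarity even though they overlap with it.
    return addr.is_private or addr.is_loopback or addr.is_link_local


print(is_local_time_source("192.168.1.1"))  # a home router
print(is_local_time_source("8.8.8.8"))      # a public address
```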