Inconsistent Time Drift Across Distributed Nodes Despite Using NTP Pool Servers

Hi everyone,

I’m currently managing a small distributed setup where multiple nodes (mix of Linux VMs and a few edge devices) are synchronized using the NTP Pool Project via regional pool servers.

While most systems stay within acceptable offset ranges, I’ve noticed intermittent time drift on a subset of nodes — sometimes jumping beyond 100–200 ms before re-syncing. This becomes noticeable in log correlation and time-sensitive processes.

A few details about the setup:

  • Using default pool.ntp.org entries (no dedicated servers)

  • Standard NTP client (chrony on Linux, systemd-timesyncd on some nodes)

  • Nodes distributed across different regions with varying network latency

  • No strict firewall blocking, but NAT is involved in some cases

One interesting side effect I ran into: during testing, even minor time offsets caused inconsistencies when syncing timestamps with externally generated assets (e.g., preview files from a video editor used in a separate workflow), which made debugging more confusing than expected.

I’m trying to understand whether this behavior is expected when relying purely on the public pool, or if there are recommended best practices to improve consistency.

Hello @alishawinson, and welcome to the community!

On your questions: I don’t think NAT is much of an issue, unless it is CGNAT done by the ISP, which may have unknown packet-handling properties. Apart from that, I think your first three bullet items already hold the hints:

  • When using the Pool, you get more or less randomly assigned servers, especially when you access it from different locations (different countries or even continents). While each of the upstreams you get may be “accurate” in the sense of being “close enough” to true time, they can still have different characteristics. They may drift differently, e.g., one off by +1 ms while another is off by -1 ms at the same moment. Or one may have a constant bias in one direction, e.g., due to asymmetric network paths between server and client. Because of those and other factors, clients following different servers also start from different points. Imagine a client behind the +1 ms server itself being off by another +1 ms (so +2 ms in total), and a client behind the -1 ms server being off by another -1 ms (so -2 ms in total): the two clients end up 4 ms apart. So this has nothing to do with the Pool or the quality of its servers, only with the fact that your distributed systems get somewhat different upstreams. My recommendation would therefore be to pick upstreams that are known to be well synchronized across regions, because they are tied more tightly to a common reference (typically some GNSS-based system) than a random Pool server, where that is simply unknown. Many of the big Internet companies provide such a service; I (and the Pool itself in various places) like to use Apple’s servers.
  • systemd-timesyncd is not your friend in this case, as it is not a full NTP client but an SNTP-only client (the “S” standing for “simple”). A “true” NTP client will track multiple upstream servers continuously and steer the local system clock continuously, “disciplining” it, i.e., adjusting its parameters such as the frequency at which it ticks, so that ideally it would keep proper time by itself (were it not for disturbing factors such as temperature changes). An SNTP client is simpler: it typically contacts a single server per sync interval instead of tracking several, and that may even be a different server each time. And it does not continuously steer the clock, but periodically makes bigger corrections, with the clock pretty much drifting in between. I think that is where the sudden jumps you observed come from (there is a quick check sketched after this list to see what each node is actually tracking).
  • As mentioned in the first bullet, distributed clients make it more difficult to keep common time, so pick upstreams that are more closely synchronized to each other than an “ordinary” random Pool server would be. All are synchronized to UTC eventually, but some are more tightly synchronized, i.e., their so-called sync distance is smaller, because they have less delay to a reference clock than others.
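
If it helps, here is a quick way to check what each node is actually doing (a rough sketch; exact commands and output fields differ between systemd and chrony versions):

    # Node running systemd-timesyncd: shows the single server it is currently using
    timedatectl timesync-status

    # Node running chrony: all tracked upstreams and their measured offsets
    chronyc sources -v
    # Overall state of the local clock (offset, frequency, root delay/dispersion)
    chronyc tracking

If the timesyncd nodes report a different server from one check to the next while the chrony nodes keep a stable source set, that matches the jump pattern described above.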

Others may have additional points, more detail, other perspectives, or (hopefully just) slightly different or additional recommendations.


Thanks for the detailed explanation — that actually clarifies a lot, especially the part about upstream variance and how offsets can compound across distributed clients. The example with ±1 ms servers leading to multi-ms divergence between nodes makes perfect sense in hindsight.

A couple of follow-ups based on your suggestions:

  • For more tightly synchronized upstreams (like Apple or other large providers), would you recommend configuring a fixed set of known servers across all nodes globally, or still keeping some level of diversity per region for resilience? I’m trying to balance consistency vs fault tolerance.

  • On the client side, it sounds like standardizing on chrony everywhere is the right move. In your experience, does tuning things like makestep, maxsources, or polling intervals significantly reduce those visible jumps, or is the main gain simply from continuous discipline vs SNTP behavior?

  • Lastly, regarding “sync distance” — is there a practical way to monitor or compare this across nodes in production (e.g., via chronyc tracking or similar metrics), so I can actually verify improvements after switching upstreams?

Your point about this not being a Pool “quality” issue but rather a distributed system characteristic was especially helpful — I was initially treating the drift as something abnormal rather than expected under mixed upstream conditions.

Appreciate the insights!

+1 to @MagicNTP’s recommendations. The application requirements are a primary consideration: what accuracy is required? Are brief periods of inaccuracy acceptable? Network asymmetries of tens of milliseconds or even more are fairly common.

The Apple NTP servers are very good, though not perfect; they typically operate at stratum 1 or 2. Be aware that individual servers sometimes stop responding (decommissioned? maintenance?). Google’s servers are good too, but requests to them may be routed to distant anycast instances.

Manually selected, nearby NTP servers can give good results, but this necessitates ongoing monitoring. If you take this option, I’d recommend at least four, preferably eight, servers.


Nothing to add to @stevesommars’ recommendations on your first point. On the other points:

  • My expectation would be that simply switching to chronyd would get rid of the jumps. The makestep directive is only for stepping the clock when chronyd starts up and the time has been set from a somewhat inaccurate RTC, or there isn’t an RTC available at all. maxsources is only relevant in conjunction with the pool directive. You could use it, e.g., in the context of the first bullet item: configure time.apple.com and/or time.google.com with the pool directive, but limit it to at most 2 servers from each (instead of the default 4). If you hand-pick individual servers from among the big providers, or even a local one, e.g., from a national metrology institute in the server’s country or a neighboring one, you would probably rather use the server directive (see the configuration sketch after this list). But as per Steve’s description, you need to experiment a little to find out what works best for you. Regarding the polling interval, I usually don’t touch it and let chronyd adjust it on its own. As said, the idea with NTP is that the local clock is “tuned” so that, if there were no changing disturbances, it would run fine on its own. So chronyd does some averaging over time to smooth any perceived short-term drift of an upstream clock (either actual drift, or changing network conditions). When the polling interval gets too short, the local clock will tend to follow each small deviation coming from the upstream, making it less stable. With longer intervals, those upstream deviations are averaged out more smoothly (averaging happens either way, but the longer the interval, the smoother it is over time). The only exception is when your local clock itself is unstable and drifts a lot in the short term, e.g., because of frequently shifting CPU or network load; then coupling it more tightly to a more stable upstream clock by shortening the polling interval can be beneficial.
  • I usually don’t go into that level of detail with my own servers, so others might be better positioned to advise. But I believe the “Root delay” in the output of chronyc tracking is the metric that indicates how “far away” the local clock is from the reference clock, from which an upper bound can be derived for the clock error due to path asymmetries. “Root dispersion” is a metric for other errors such as finite clock resolution and delays when reading the clock. The two can be combined to form the “root distance”, which is the overall worst-case timing error that can accumulate between a stratum 1 server and a client (a small check along these lines is sketched below). But my understanding is that chronyd already places a lot of weight on this metric when selecting sources (in comparison to ntpd), and less on, e.g., stratum.
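
To make the pool/server distinction from the first bullet concrete, a hypothetical chrony.conf fragment could look like this (time.apple.com and time.google.com are the examples from this thread; ptbtime1.ptb.de is only an illustration of a metrology-institute server, so substitute your own picks):

    # Let chronyd use at most 2 servers from each provider pool (default is 4)
    pool time.apple.com iburst maxsources 2
    pool time.google.com iburst maxsources 2
    # Or pin individual, hand-picked servers with the server directive
    server ptbtime1.ptb.de iburst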

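For the root distance question, here is a rough sketch of how one could compare it across nodes, using the “Root delay” and “Root dispersion” fields of chronyc tracking and the usual formula root distance = root dispersion + root delay / 2:

    # Print root delay, root dispersion and the derived root distance on one node
    chronyc tracking | awk '
        /^Root delay/      { delay = $4 }
        /^Root dispersion/ { disp  = $4 }
        END { printf "root distance ~= %.6f s\n", disp + delay / 2 }'

Running the same on each node (e.g., over ssh) and comparing before and after the upstream change should show whether the new sources are actually “closer” to their reference clocks.
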
So all in all, I think the first step is to move all clients to chronyd, then find an initial set of “good” upstreams for each client (a quick initial pick). Then check whether that meets your needs, and if not, iteratively refine the choice of upstream servers until you are satisfied. Thereafter, continuously assess whether the configured servers still perform well and are still available - e.g., Apple has individual names/addresses for the servers behind time.apple.com, but as Steve pointed out, those are not stable, i.e., they may get removed at any time.

This process of continuous assessment may sound tedious, but you don’t even need to check each of the servers closely all the time. Rather, I’d assume you’ll find that a clock isn’t working well anymore when the alignment of the timestamps from your various sources starts to noticeably diverge again.

So as pointed out by Steve, it all depends on your needs, and on how much time and effort you are willing to invest.

I agree with the advice already given. Particularly important to answer is @stevesommars’ question: What are your actual requirements for synchronisation?

However, I’d be reluctant to suggest manually curating hosts or using the server directive, since that requires closer ongoing management. My suggestions for next steps:

  1. Switch the systemd-timesyncd hosts to chronyd per @MagicNTP’s suggestion
  2. Enable logging of chrony statistics - there should be log and logdir directives in the default config, and you should be able to just uncomment them and restart chrony to activate them (see the fragment after this list)
  3. If it’s something you want to revisit often, use a time series database to record the chrony stats. I use InfluxDB and telegraf with my script NTPmon to gather the chrony logs and Grafana to graph the results, but there are various other choices.
  4. Post the output of chronyc -n sources; chronyc -n tracking from 2 or 3 of the hosts in your network so we can see how well the pool selections are working for you.
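
For step 2, the relevant lines in a stock chrony.conf usually look roughly like this (paths and the exact set of log types vary by distribution, so treat it as a sketch):

    # Uncomment (or add) in chrony.conf, then restart chronyd
    log tracking measurements statistics
    logdir /var/log/chrony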