No PR action for project and dramatic reduction in the number of active servers

Your example uses time.google.com, which is a very bad source for an NTP server in the NTP pool. Google uses a leap smear (see Leap Smear | Public NTP | Google for Developers), which handles leap seconds in a totally different way than the NTP protocol does. The NTP protocol adds or deletes one second at the end of a minute, whereas Google smears the adjustment out over a period of 24 hours.
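To make the difference concrete, here is a minimal sketch comparing a smeared adjustment with a stepped one. It assumes the smear is linear across the 24-hour window and that the step falls in the middle of that window; the function names are made up for illustration.

```python
SMEAR_WINDOW_S = 24 * 60 * 60  # 24-hour smear window, as described above


def smeared_adjustment(elapsed_s, leap_s=1.0, window_s=SMEAR_WINDOW_S):
    """Part of the leap second already applied by a linear smear (assumed shape)."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    progress = min(max(elapsed_s / window_s, 0.0), 1.0)
    return leap_s * progress


def stepped_adjustment(elapsed_s, leap_s=1.0, step_at_s=SMEAR_WINDOW_S / 2):
    """Step behaviour: nothing before the step, the whole second after it."""
    return leap_s if elapsed_s >= step_at_s else 0.0


# Compare both clocks at a few points inside the window.
for hours in (0, 6, 12, 18, 24):
    t = hours * 3600
    diff = smeared_adjustment(t) - stepped_adjustment(t)
    print(f"{hours:2d} h into the window: smeared minus stepped = {diff:+.2f} s")
```

Under these assumptions the two clocks disagree by up to half a second around the step, which is far more than the offsets a pool server is normally expected to stay within.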

You can find nice plots of the difference between NTP time and Google time in this discussion here: Google Public NTP - #9 by ddrown

OK, thank you for the replies. I have just one question:
How many servers do I need to add in /etc/ntp.conf? 4? 6? 10, or maybe 12? What is the maximum number? I understand it should be 4 or more, but what is the optimal value?

Hi Alex, four is a good compromise between having enough in case one becomes unreachable and putting unnecessary load on the servers. There’s some guidance here: https://www.ntppool.org/en/use.html

If your daemon supports the “pool” keyword then a single “pool” entry should automatically keep enough active servers. (See https://www.eecis.udel.edu/~mills/ntp/html/confopt.html#pool)

If you are operating Stratum 2 servers as you mentioned a few posts above, then the pool statement won’t work, because there is no pool of exclusively Stratum 1 servers available. Also, if your goal is to add your servers to the NTP pool, then pooling your servers with the pool itself might not be a good idea. You may want to use some independent external time sources instead. In that case it is better to find some stable Stratum 1 NTP servers in your neighborhood. A good starting point for that is the list of public Stratum 1 servers at https://support.ntp.org/bin/view/Servers/StratumOneTimeServers

Yes, I have a stratum 2 time server, and my question is: how many servers do I need to add to ntp.conf? I have now added 6 servers, all stratum 1. Is that right?

NTPD’s limit is a maximum of 10 servers. I don’t know about chrony, but I would assume the same. If you add more than 10, NTPD will only monitor the extra ones; they aren’t used in any of the calculations.

The general rule is to have 2n+1 servers to protect against “n” falsetickers. Four upstream time servers will protect you against only one falseticker. Five upstreams will protect you against two falsetickers, seven will protect you against three falsetickers, etc…
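The arithmetic of that rule of thumb can be sketched as follows. The helper names are hypothetical, and this only encodes the 2n+1 rule as quoted, not what any particular daemon does internally.

```python
def servers_needed(falsetickers: int) -> int:
    """Servers required to outvote the given number of falsetickers (2n+1 rule)."""
    if falsetickers < 0:
        raise ValueError("falsetickers must not be negative")
    return 2 * falsetickers + 1


def falsetickers_tolerated(servers: int) -> int:
    """Largest n such that 2n+1 <= servers."""
    if servers < 1:
        raise ValueError("at least one server is required")
    return (servers - 1) // 2


for count in (4, 5, 7):
    print(f"{count} servers tolerate {falsetickers_tolerated(count)} falseticker(s)")
```

Note that a fourth server does not tolerate more falsetickers than three do; it only gives you a spare if one becomes unreachable.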

Really it’s QUALITY over quantity. You want to choose servers that have low delay & low jitter, and have similar offsets.

If I’m not using upstream sources that I control, then I aim for having seven servers in my config, but five is my minimum. Like I said, it’s about finding sources that agree with each other and are reliable, not about just adding as many as you can into your configuration and leaving it at that…
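As a rough illustration of “quality over quantity”, the sketch below shortlists candidates with low delay and low jitter whose offsets agree with the group median. The thresholds, the `Candidate` type and the sample measurements are all assumptions for illustration; in practice you would take the values from your daemon’s own statistics.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Candidate:
    name: str
    delay_ms: float
    jitter_ms: float
    offset_ms: float


def shortlist(candidates, max_delay_ms=50.0, max_jitter_ms=5.0,
              max_disagreement_ms=10.0):
    """Keep low-delay, low-jitter candidates whose offset is near the group median."""
    steady = [c for c in candidates
              if c.delay_ms <= max_delay_ms and c.jitter_ms <= max_jitter_ms]
    if not steady:
        return []
    middle = median(c.offset_ms for c in steady)
    agreeing = [c for c in steady
                if abs(c.offset_ms - middle) <= max_disagreement_ms]
    return sorted(agreeing, key=lambda c: (c.delay_ms, c.jitter_ms))


# Made-up measurements, not real servers.
candidates = [
    Candidate("a.example", 12.0, 0.8, 0.4),
    Candidate("b.example", 18.0, 1.1, -0.6),
    Candidate("c.example", 25.0, 0.9, 0.1),
    Candidate("d.example", 140.0, 2.0, 0.3),   # too distant
    Candidate("e.example", 15.0, 1.0, 480.0),  # disagrees with the rest
]
print([c.name for c in shortlist(candidates)])
```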

Note that some servers may belong to one group, be located in the same place, and share a single connection to the Internet (this can be checked by IP address). That is a common point of failure. For reliability, it is better to use several servers from different groups that are geographically close to you and have good delay/jitter. I guess 5-7 servers from the stratum-1 list will be OK. I also highly recommend that you leave the default poll time and do not create additional load, otherwise the servers may ban you.

You can add up to 10 servers, leave them for 1-2 days, then see which servers ntp chose and remove the others. You can also enable stats and use ntppoltter (Net Tools) to quickly analyze server offset/jitter over time.

Thank you. I have one last question:
Why can’t I use more than 7 servers? Can I use 10 servers? Or not, because that would overload the upstreams? Is that right?

You can use 10… But that is really overkill unless you are doing some specific monitoring. In my experience, 5-7 that are stable is the “sweet spot” for NTP to run stable and not bounce around too much changing its selection. Also, like I mentioned before, you really want a selection of servers that all agree on time. Just sticking in 10 random servers will likely get a majority of them tossed out of the time solution and ignored.

I select

  1. servers in my country, national standard preferred, ping<100ms. I see no reason to choose distant servers
  2. different groups/locations/subnets
  3. 2 servers as stable primaries, 2 as backups (sometimes servers stop for maintenance), and some extra backups in case parts of the Internet fail. It’s like DNS servers: you need a primary and a secondary, possibly more, but only as an emergency backup. In reality ntpd selects and uses 1-2 servers constantly. The rest is just observation/statistics/overkill.
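A minimal sketch of the first two criteria: drop anything at or above 100 ms and keep only the fastest server per subnet. Treating a /24 as “one group” is an assumption (real operators may span several prefixes), and the addresses are documentation-range placeholders, not real servers.

```python
import ipaddress


def one_per_subnet(servers, max_rtt_ms=100.0, prefix_len=24):
    """servers: iterable of (ip_string, rtt_ms). Returns the fastest server
    per subnet among those with rtt below max_rtt_ms, sorted by rtt."""
    best = {}
    for ip, rtt_ms in servers:
        if rtt_ms >= max_rtt_ms:
            continue  # too distant
        # strict=False lets us derive the enclosing network from a host address
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        if net not in best or rtt_ms < best[net][1]:
            best[net] = (ip, rtt_ms)
    return sorted(best.values(), key=lambda item: item[1])


# Made-up measurements using documentation address ranges.
measured = [
    ("192.0.2.10", 8.0),
    ("192.0.2.11", 6.5),     # same /24 as the one above
    ("198.51.100.7", 22.0),
    ("203.0.113.5", 180.0),  # over the 100 ms limit
]
print(one_per_subnet(measured))
```

The default prefix length only makes sense for IPv4; for IPv6 you would need to pass a different `prefix_len`.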

Hello – I added some related notes in another thread: Monitoring and packet loss in transit - #3

All the software is on GitHub and patches are welcome, even if I don’t always apply them super fast (sorry, @lammert!). I don’t remember it, but I am guessing you updating the PR made the email thread bump to the top of my inbox so I saw it again. As you might have surmised, I could use help reviewing and screening the various pull requests that come in, in particular for the translations; I don’t know why that’d need to be stuck on me doing it, except that nobody is helping.

I am personally just, well, one person and the NTP Pool isn’t my only commitment in life, so I only have the time I have. There’s been some years where virtually all the time goes with just one little tweak or fix or distraction after another when I sit down to work on it, so what I often do is “cycle” between areas of concern to get some needed maintenance or a certain improvement put in and then move on to the next.

For example some years ago I spent a lot of time getting the system running in Kubernetes (and getting the k8s cluster up and running and maintained). It was frustrating not spending the time writing software or otherwise improving or adding something new, but the old setup was extremely hard to upgrade in production and the Kubernetes setup completely unlocked that so updates are relatively safe and easy which was great.

Last winter a lot of time went with making the new archiver and getting it in production and getting all the old data properly archived. The code I ended up writing was probably just a few days of work here and there, but I’d previously failed an embarrassing number of times at getting it structured and written in a way that felt like it’d be reliable and low maintenance for the next decade (or whatever) – or the systems I tried using for archiving turned out to not work as well as I’d hoped. The new system is great though and minus a few changes is ready for getting a much higher volume of monitoring data.

In the spring we got told on short notice that the co-location facility we’d been in for a long time was going away. Getting everything moved to another facility and getting the system moved in a way that didn’t cause downtime and then the ongoing “tidying up” afterwards didn’t leave much time for anything else for a while.

As I posted in the other thread linked above, the beta system has some new features that I wanted to build before adding more features on manage.ntppool.org that’d have to be migrated later. I could really use some help testing it though.

This past holiday period I was working on making it easier to run the full Kubernetes setup locally for development (instead of using docker-compose). The motivation for this was that the old system (ksonnet) I’d been using to deploy the application worked great but isn’t maintained anymore and is starting to fall apart on the version of Kubernetes the cluster has been upgraded to.

Anyway – for the server count, I agree that the biggest “cost” on that was the DDOS attacks years ago and servers needing upgrading, etc.

I have maintained the system for about 15 years now and we still have more servers than we ever dreamed about in the first handful of years. We also have more servers than ever with higher capacity and servers in more countries than ever, so … well, I don’t think it’s so bleak. I don’t know if any of the other “free access NTP” systems handle more devices or queries than we do (as @Bas said).

The version that’s been in production the last couple of years is at GitHub - ntppool/monitor: monitoring agent for the NTP Pool (and it’s been there all along I believe).

One of the problems with this is that you are now one of the users requiring NTP servers (and, for the pool, also the DNS servers) to have many, many times more capacity than the average load, because every 15 minutes (and in particular every hour, and in particular when it’s midnight somewhere) MANY more queries come in.

I looked to see the code review requests you’d put up, but I couldn’t find them! Since you, apparently, care more than me I’m happy that you’ve put time into improving the system.

No, he’s proven that some transit providers are (sometimes) dropping NTP queries. One of them is a global carrier mostly focused in Europe. We’ve talked to them and they believe the risk of being affected by ongoing NTP based DDOS attacks is too great to disable the filters.

Not sure if it’s helpful, but maybe they would make an exception just for traffic to the Pool monitoring host IPs?

I wouldn’t be surprised if the problem is with Telia. I’m seeing big problems reliably connecting my Stratum 1 server in the office in the Netherlands with my Stratum 2 servers in the Hetzner data center in Germany. The connection between the Stratum 1 server and my Stratum 2 servers in France, the US and Kazakhstan is problem free.

The connection with Germany is the shortest in distance but the only one which is routed through telia.net, and suffers the highest packet loss. It only happens with traffic originating from port 123. I use UDP port 1123 as a backup port and that traffic is not affected. Also, all packets to port 123 reach the Stratum 1 server as far as I have checked, but the responses are filtered.
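A rough way to reproduce that comparison is to send the same client request to both ports and count missing replies; the replies then originate from port 123 in one case and from the backup port in the other. The sketch below uses only the Python standard library. The host name in the usage note is a placeholder, and it assumes the server really answers NTP on the second port.

```python
import socket
import struct

NTP_UNIX_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """48-byte client request: LI=0, VN=4, Mode=3 (client), remainder zero."""
    return bytes([0x23]) + bytes(47)


def parse_transmit_time(reply: bytes) -> float:
    """Server transmit timestamp (bytes 40..47) converted to Unix seconds."""
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send one request; True if a full-length reply arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and resolution errors are OSError subclasses
            return False
    return len(reply) >= 48


def loss_rate(host: str, port: int, count: int = 20) -> float:
    """Fraction of probes that received no reply."""
    lost = sum(1 for _ in range(count) if not probe(host, port))
    return lost / count
```

For example, comparing `loss_rate("ntp.example.net", 123)` with `loss_rate("ntp.example.net", 1123)` from the affected network would show whether only replies from port 123 go missing. Keep `count` small and do not run it in a tight loop against servers you do not operate.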

A little OT:
I’m using Hetzner as well, and the Telia backbone is sh*t even if you try to reach a server located in Germany from within Germany (not the ones in Finland). This doesn’t only affect UDP 123, it’s all traffic. Sometimes SSH connections drop or just pause, and other people report that downloads and web traffic are horribly slow.
Hetzner will route your server via DTAG (Deutsche Telekom AG) for an extra fee of 5€ per server. That backbone works very well.

@lammert, I don’t think the problem is with Telia. I have asymmetric routing and reach the monitoring through Telia, and there are no problems. Also, routes from Russia to half of Europe go via Telia. Can you PM me your server addresses? I’ll add them to my server in Russia and then we can analyze peerstats.

You can also use a network tool like MTR or PingPlotter. The latter can save the history for every ping packet and shows errors/delay graphically in a very intuitive way. Only in the case of asymmetric routing are the problems not always clear.

Forgot to mention: another thing to look out for is making sure each server’s upstream source is unique. Sometimes servers in one physical area will end up using the same sources (including yourself).

3 posts were split to a new topic: Number of configured servers