No PR action for project and dramatic reduction in the number of active servers

I know the power of a community. I have been supplying data-center based servers to the NTP pool since 2008 and added the first NTP pool site translation in the Dutch language. What changed over time is an increased distance between the community part, i.e. the servers in the field actually serving the time, and the infrastructure part, i.e. the DNS servers, monitoring system and website. The GitHub account of the project doesn’t contain the current version of the software anymore, and community pull requests take a long time to be considered or are ignored.

The NTP world has changed since fall 2013 as you can see in the server graph which I posted earlier in this thread. The massive amplification attack possible through the ntpd monlist command caused NTP traffic to become prone to filtering by parties all over the world.

The NTP pool project never adapted to this new reality. The monitoring system stayed central, with pool servers being kicked in and out of the pool depending on whether and how well traffic from the single monitor station was able to reach them.

I have been providing professional NTP services to my customers since 2007. My servers are much more stable than the network connection with the pool monitor. I have provided my excess bandwidth to the NTP pool until now. But seeing the constant jigsaw patterns in the pool graphs, getting automated emails about bad performance, and even having to relocate some servers between data centers because some data centers cannot be reached by the pool monitor at all, has now changed my stance. If Ask is not willing or able to change this, and the community is not allowed to help because the source code repository is out of date and pull requests are ignored, so be it. There is a limit to the power of the community and a limit to the amount of effort I personally want to put in beating a dead horse.

You are right in many ways. I will add this: in my experience, administering a project alone becomes very tiresome over time; the project gets neglected and starts running on autopilot. Any desire to make changes, or to do anything at all, disappears. After a while, the project dies. In the case of the pool, it could be taken over by a large company with its own CDN (as Google did when it created its own DNS).

The problem here is that the NTP protocol, like DNS, is very old, prone to attacks, and not all that necessary for end users (even ntpd is old and developed almost single-handedly). Although every SOHO router and toilet bowl synchronizes its time today, many of them do not really need it. How often do you look at the logs on your SOHO router? Never. If your production equipment needs accurate time, you prefer to use your own NTP servers or stable national standard servers. I have never used pool servers on production hardware (a lot of this hardware is even cut off from the Internet). I think up to 90% of NTP traffic and servers represents unnecessary load. Do companies need this? I don’t know.

And again we have connectivity problems at 19:45-20:15, 20:35-21:05, 21:27-21:29 MSK (GMT+3), and again my score drops to 3.9. It’s kindergarten. Ask, if the foolish monitoring drops the score so quickly, then let it restore the score just as quickly.

I guess I will propose in my company that we create a pool zone on our own DNS servers and redirect our customers to our own servers.

For more stability I’ve added ntpdate to the crontab on my stratum 2 servers.
Is that OK? What do you think about this?

*/15 * * * *	/usr/sbin/ntpdate -4 -u time.google.com >/dev/null 2>&1

You have attached a fifth wheel (from a tractor) to the car. It’s a big mistake.

Computers and devices have different oscillators, CPU load, temperature variation, etc. Therefore, their hardware clocks run at slightly different speeds. ntpd looks at the local clock error and calculates a correction, speeding the clock up or slowing it down. Once this correction is applied, the speed of the local clock matches the reference (not just the time, but the speed of the clock!), like a watchmaker tightening the adjustment screw.
Even if the connection to the reference disappears, the applied correction keeps the speed of the local clock correct and fairly accurate.

ntpdate just gets the time from the reference and sets it locally, like a clock owner who sets the clock by the radio time signal every morning. If you do that every 15 minutes, the watchmaker can no longer work out how to adjust the speed. All of the watchmaker’s calculations and speed adjustments become incorrect, which defeats the logic of ntpd.
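To put rough numbers on the watchmaker analogy, here is a minimal sketch in Python. It is not how ntpd is implemented (ntpd runs a feedback loop over many measurements); it only contrasts "set the time" with "correct the speed". The 50 ppm oscillator error and the one-hour outage are assumed values chosen for illustration.

```python
# Toy model: a local clock with a constant rate error (assumed value).
DRIFT_PPM = 50.0      # assumption: local clock runs 50 ppm fast
INTERVAL_S = 900      # 15 minutes, as in the cron entry
OUTAGE_S = 3600       # assumption: reference unreachable for one hour


def offset_after(seconds, freq_error_ppm):
    """Offset in seconds accumulated by a clock with the given rate error."""
    return seconds * freq_error_ppm * 1e-6


# Stepping only: each step zeroes the offset, but the rate error stays.
step_worst_between_syncs = offset_after(INTERVAL_S, DRIFT_PPM)
step_after_outage = offset_after(OUTAGE_S, DRIFT_PPM)

# Rate correction: derive the rate error from the offset gained over one
# interval, then cancel it. Only the residual error keeps accumulating.
measured_offset = offset_after(INTERVAL_S, DRIFT_PPM)
estimated_ppm = measured_offset / INTERVAL_S * 1e6
residual_ppm = DRIFT_PPM - estimated_ppm
disciplined_after_outage = offset_after(OUTAGE_S, residual_ppm)

print(f"stepped clock, worst offset between syncs: {step_worst_between_syncs:.3f} s")
print(f"stepped clock, offset after outage:        {step_after_outage:.3f} s")
print(f"rate-corrected clock, offset after outage: {disciplined_after_outage:.6f} s")
```

In this idealized model the stepped clock keeps drifting at the full rate error whenever the reference is away, while the rate-corrected clock stays close. A real oscillator also wanders with temperature, so the residual is never exactly zero.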

If you use ntpd, ntpdate should be used only at startup, before ntpd runs. For example, a Raspberry Pi has no battery-backed, always-powered clock, so it needs to correct a huge date gap in one jump at startup (ntpd -g can do that too). After that, the best solution is to use ntpd and give it the opportunity to do the fine-tuning work.
Also, some devices do not need finely tuned time or do not have the resources for ntpd (microcontrollers, Arduino, etc.), so they just use ntpdate/sntp to perform a rough synchronization occasionally (once an hour or so). An error of milliseconds (or even seconds) is not a problem for these devices.

Just google it, man.

Your example uses time.google.com which is a very bad source for an NTP server in the NTP pool. Google uses Leap smear https://developers.google.com/time/smear which handles leap seconds in a totally different way than the NTP protocol does. The NTP protocol adds or deletes one second at the end of a minute, whereas Google smears the adjustment out over a period of 24 hours.
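As a rough illustration of what a 24-hour smear means for a client, the sketch below computes how much of the leap second has been absorbed at a given point in the smear window. The linear shape is an assumption made for this sketch; the linked Google page is the authority on the actual behaviour, and `smear_fraction` is a hypothetical helper, not part of any NTP software.

```python
SMEAR_WINDOW_S = 24 * 3600   # smear period of 24 hours, as stated above


def smear_fraction(elapsed_s):
    """Fraction of the leap second absorbed after elapsed_s seconds in the
    smear window, assuming a linear smear (an assumption of this sketch)."""
    clamped = min(max(elapsed_s, 0), SMEAR_WINDOW_S)
    return clamped / SMEAR_WINDOW_S


# Rate change needed to absorb one second over the window, in ppm.
smear_rate_ppm = 1 / SMEAR_WINDOW_S * 1e6

print(f"absorbed after 6 h:  {smear_fraction(6 * 3600):.2f} s")
print(f"absorbed after 12 h: {smear_fraction(12 * 3600):.2f} s")
print(f"clock rate change:   {smear_rate_ppm:.1f} ppm")
```

Under this assumption a smeared source and a standard NTP source can disagree by up to about half a second around the leap second, which is far more than the offsets ntpd normally deals with. That is why mixing the two kinds of sources is a problem.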

You can find nice plots of the difference between NTP time and Google time in this discussion here: Google Public NTP

OK, thank you for the replies. I have just one more question:
How many servers do I need to add to /etc/ntp.conf? 4? 6? 10, or maybe 12? What is the maximum number? I understand it should be 4 or more, but what is the optimal value?

Hi Alex, four is a good compromise between having enough in case one becomes unreachable and putting unnecessary load on the servers. There’s some guidance here: https://www.ntppool.org/en/use.html

If your daemon supports the “pool” keyword then a single “pool” entry should automatically keep enough active servers. (See https://www.eecis.udel.edu/~mills/ntp/html/confopt.html#pool)

If you are operating Stratum 2 servers as you mentioned a few posts above, then the pool statement won’t work, because there is no pool of exclusively Stratum 1 servers available. Also if your goal is to add your servers to the NTP pool, then pooling your servers with the pool itself might not be a good idea. You may want to use some independent external time sources instead. In that case it is better to find some stable Stratum 1 NTP servers in your neighborhood. A good starting point for that is the list of public Stratum 1 servers at https://support.ntp.org/bin/view/Servers/StratumOneTimeServers

Yes, I have a stratum 2 time server, and my question is: “how many servers do I need to add to ntp.conf?” I have now added 6 servers, all stratum 1. Is that right?

NTPD’s limit is a maximum of 10 servers. I don’t know about chrony, but I would assume the same. If you add more than 10, NTPD will only monitor the extra ones; they aren’t used in any of the calculations.

The general rule is for 2n+1 to protect against “n” falsetickers. Four upstream time servers will protect you against only one falseticker. Five upstreams will protect you against two falsetickers, seven will protect you against three falsetickers, etc…
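The arithmetic behind the 2n+1 rule can be written down in a few lines of Python. The function names are made up for this example; they are not part of ntpd or any other tool.

```python
def tolerated_falsetickers(servers):
    """Largest n such that 2n + 1 <= servers."""
    if servers < 1:
        raise ValueError("need at least one server")
    return (servers - 1) // 2


def servers_needed(falsetickers):
    """Smallest server count that satisfies the 2n + 1 rule."""
    if falsetickers < 0:
        raise ValueError("falsetickers must be non-negative")
    return 2 * falsetickers + 1


for count in (3, 4, 5, 6, 7):
    print(count, "servers tolerate", tolerated_falsetickers(count), "falseticker(s)")
```

This also shows why going from four servers to five is worth it while going from five to six is not: only odd counts raise the number of tolerated falsetickers.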

Really it’s QUALITY over quantity. You want to choose servers that have low delay & low jitter, and have similar offsets.

If I’m not using upstream sources that I control, then I aim for having seven servers in my config, but five is my minimum. Like I said, it’s about finding sources that agree with each other and are reliable, not about just adding as many as you can into your configuration and leaving it at that…

Note that some servers may belong to one group, be located at one geographic point, and share one uplink to the Internet (this can be checked by IP address). That is a common point of failure. For reliability, it is better to use several servers from different groups that are geographically close to you and have good delay/jitter. I guess 5-7 servers from the stratum 1 list will be OK. I also highly recommend that you keep the default poll interval and do not create additional load; otherwise the servers may ban you.

You can add up to 10 servers, leave them for 1-2 days, then see which servers ntpd chose and remove the others. You can also enable stats and use NTPplotter (https://www.satsignal.eu/software/net.htm#NTPplotter) to quickly analyze server offset/jitter over time.

Thank you, and I have one last question:
Why shouldn’t I use more than 7 servers? Can I use 10 servers? Or not, because that would overload the upstreams; is that right?

You can use 10… But that is really overkill unless you are doing some specific monitoring. In my experience, 5-7 that are stable is the “sweet spot” for NTP to run stable and not bounce around too much changing its selection. Also like I mentioned before, you really want a selection of servers that all agree on time; just sticking in 10 random servers will likely get a majority of them tossed out of the time solution and ignored.

I select

  1. servers in my country, national standard preferred, ping<100ms. I see no reason to choose distant servers
  2. different groups/location/subnets
  3. 2 servers as stable primaries, 2 as backup (servers sometimes go down for maintenance), and a few extra backups in case parts of the Internet fail. It’s like DNS servers: you need a primary and a secondary, possibly more, but only as an emergency backup. In practice ntpd selects and uses 1-2 servers constantly. The rest is just observation/statistics/overkill.

Hello – I added some related notes in another thread: Monitoring and packet loss in transit

All the software is on GitHub and patches are welcome, even if I don’t always apply them super fast (sorry, @lammert!). I don’t remember it, but I am guessing your update to the PR bumped the email thread to the top of my inbox so I saw it again. As you might have surmised, I could use help reviewing and screening the various pull requests that come in, in particular for the translations; I don’t know why that’d need to be stuck on me doing it except nobody is helping.

I am personally just, well, one person and the NTP Pool isn’t my only commitment in life, so I only have the time I have. There’s been some years where virtually all the time goes with just one little tweak or fix or distraction after another when I sit down to work on it, so what I often do is “cycle” between areas of concern to get some needed maintenance or a certain improvement put in and then move on to the next.

For example some years ago I spent a lot of time getting the system running in Kubernetes (and getting the k8s cluster up and running and maintained). It was frustrating not spending the time writing software or otherwise improving or adding something new, but the old setup was extremely hard to upgrade in production and the Kubernetes setup completely unlocked that so updates are relatively safe and easy which was great.

Last winter a lot of time went with making the new archiver and getting it in production and getting all the old data properly archived. The code I ended up writing was probably just a few days of work here and there, but I’d previously failed an embarrassing number of times at getting it structured and written in a way that felt like it’d be reliable and low maintenance for the next decade (or whatever) – or the systems I tried using for archiving turned out to not work as well as I’d hoped. The new system is great though and minus a few changes is ready for getting a much higher volume of monitoring data.

In the spring we got told on short notice that the co-location facility we’d been in for a long time was going away. Getting everything moved to another facility and getting the system moved in a way that didn’t cause downtime and then the ongoing “tidying up” afterwards didn’t leave much time for anything else for a while.

As I posted in the other thread linked above, the beta system has some new features that I wanted to build before adding more features on manage.ntppool.org that’d have to be migrated later. I could really use some help testing it though.

This past holiday period I was working on making it easier to run the full Kubernetes setup locally for development (instead of using docker-compose). The motivation for this was that the old system (ksonnet) I’d been using to deploy the application worked great but isn’t maintained anymore and is starting to fall apart on the version of Kubernetes the cluster has been upgraded to.

Anyway – for the server count, I agree that the biggest “cost” on that was the DDOS attacks years ago and servers needing upgrading, etc.

I have maintained the system for about 15 years now and we still have more servers than we ever dreamed about in the first handful of years. We also have more servers than ever with higher capacity and servers in more countries than ever, so … well, I don’t think it’s so bleak. I don’t know if any of the other “free access NTP” systems handle more devices or queries than we do (as @Bas said).

The version that’s been in production the last couple of years is at https://github.com/ntppool/monitor (and it’s been there all along I believe).

One of the problems with this is that you are now one of the users requiring NTP servers (and for the pool also the DNS servers) to have many many times more capacity than the average load because every 15 minutes (and in particular every hour and in particular when it’s midnight somewhere) MANY more queries come in.
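A back-of-the-envelope sketch of that capacity problem, in Python. The client count is an invented number used only for the arithmetic; it is not a measurement from the pool.

```python
CLIENTS = 9000        # assumption: hypothetical number of cron-driven clients
PERIOD_MIN = 15       # each client queries once per 15 minutes


def peak_to_average(per_minute_counts):
    """Ratio of the busiest minute to the average minute."""
    average = sum(per_minute_counts) / len(per_minute_counts)
    return max(per_minute_counts) / average


# Everyone fires at minute 0 of the period (a "*/15" style schedule).
aligned = [CLIENTS] + [0] * (PERIOD_MIN - 1)

# The same clients spread evenly over the period.
spread = [CLIENTS // PERIOD_MIN] * PERIOD_MIN

print("aligned peak/average:", peak_to_average(aligned))
print("spread peak/average: ", peak_to_average(spread))
```

The total number of queries is identical in both cases, but the aligned schedule forces the servers to be sized for a peak 15 times the average. The effect is even stronger within the minute, because cron jobs typically start in the first second.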

I looked to see the code review requests you’d put up, but I couldn’t find them! Since you, apparently, care more than me I’m happy that you’ve put time into improving the system.

No, he’s proven that some transit providers are (sometimes) dropping NTP queries. One of them is a global carrier mostly focused in Europe. We’ve talked to them and they believe the risk of being affected by ongoing NTP based DDOS attacks is too great to disable the filters.

Not sure if it’s helpful, but maybe they would make an exception just for traffic to the Pool monitoring host IPs?