Some clients really can't behave

Definition of Round-Robin:

Round-robin DNS is a technique of load distribution, load balancing, or fault-tolerance provisioning multiple, redundant Internet Protocol service hosts, e.g., Web server, FTP servers, by managing the Domain Name System’s (DNS) responses to address requests from client computers according to an appropriate statistical model.[1]

As the pool is using DNS-round-robin it’s not an issue.
Every client gets 4 servers.
It doesn’t matter if one doesn’t respond.
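To make the "every client gets 4 servers" point concrete, here is a minimal sketch of a round-robin zone in Python. It is a toy model only: the class name, the strict rotation, and the 192.0.2.x documentation addresses are assumptions for illustration, and the pool's real server selection is not shown here.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a round-robin DNS zone (not the pool's real logic)."""

    def __init__(self, servers, per_answer=4):
        self._servers = deque(servers)
        self._per_answer = per_answer

    def answer(self):
        """Return the next `per_answer` addresses, then rotate the ring."""
        reply = list(self._servers)[: self._per_answer]
        self._servers.rotate(-self._per_answer)
        return reply


# Ten made-up servers from the documentation range 192.0.2.0/24.
zone = RoundRobinZone([f"192.0.2.{i}" for i in range(1, 11)])
first = zone.answer()   # one client's four servers
second = zone.answer()  # the next client gets a different four
```

If one of the four does not respond, the client still has three others to fall back on, which is why a single silent server is not an issue.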

However, if you are a bad client, you are probably being dropped by all the servers you are presented with.
It’s your own fault if that happens.

Normal clients shouldn’t poll this much; if they do, it’s their mistake.

Don’t forget, it’s 1 IP that polls so much, not just 1 computer.
For IPv6 this won’t happen quickly but large (LAZY) IPv4 networks behind NAT will probably suffer.
There is no reason why those large networks can’t set up their own local NTP server (as a proxy), as a router can handle this perfectly well.

It’s just they do not do it. Their fault, not mine.

Linux Mint and Ubuntu use DHCP-NTP by default: when your router offers an NTP server, it is used and overrides the pool, which is only set as a backup.

1 Like

Feels like there’s a bunch of possible things in play here, but also I feel like ISPs SHOULDN’T transparently proxy/intercept NTP calls.

Possible problems:

  • CGNAT : You might have lots of different independent clients coming from the same public IP address. You cannot think one public IP address is equivalent to one device, person, home, office, or even building anymore. Although I don’t like CGNAT, you can’t blame the providers much for this, as IPv4 addresses are scarce now.
  • DHCP and routers: ISPs not providing NTP servers through DHCP and configuring their provided routers to use it by default and broadcast it to the local network is bad. Yep, you can blame them for that.
  • Router vendors (and some ISPs): some vendors hardcode their routers’ NTP configuration to an IP address and/or don’t ask for an NTP pool vendor zone. That’s really bad.
  • DNS cache: reducing TTL could distribute load better between servers, although I think right now it’s not too big (140 seconds, I think) and there are a bunch of things that could affect that, like OS, router or ISP caches. But I feel like ISP caches are way less common nowadays.
  • Countries with not a lot of NTP servers: ¯_(ツ)_/¯. I think I get around 7% of all my country NTP pool traffic, so what can I say.

About the ISP level NTP proxies and why I wouldn’t do it:

That would work for most of the common devices, but would break or impair time sensitive clients. All the NTP queries would be routed to just one NTP server that could be good or not, and any small disturbance or sync problems would distribute to all the clients. You wouldn’t be able to choose what NTP servers you use to sync. You couldn’t add several different NTP servers as targets.

So it could go either kinda good, or terrifically bad.

This feels like years ago, when DNS queries were intercepted at the ISP level. ISP DNS servers were comically bad, slow, unmaintained… and often misused to spy or to inject ads and stuff in your browsing. Not good.

2 Likes

I’m not talking about intercepting.

I’m talking about configuring their routers to use the pool (or their own NTP server, with the pool as backup).
The router’s built-in NTP server should connect to the pool, and DHCP-NTP should then hand out that NTP server (I call it a proxy) to keep the network behind NAT on time.

IPv4 NAT networks are common, and most routers have many devices behind them.

However, apart from a few exceptions, they do not do this. They point DHCP-NTP at the pool, so ALL devices behind the router go to the pool, instead of just the router.

As providers often use pre-configured routers for their clients, they can easily do this.
And even better, install their own NTP-pool-servers for their clients.

Then we do not need to feed entire countries of users with NTP, but just the ISPs and of course people that want to access the pool directly.

I’m NOT talking about forced intercepting or anything like that, just trying to move them to use their routers as the local NTP time server instead of pointing all DHCP clients to the pool.

Makes no difference for the user, nor for the ISP, but reduces our load a lot.

3 Likes

So, the root of the issue must be fixed by people who are likely blissfully unaware of it. Should the owners of offending IP addresses systematically be contacted and informed through the abuse email address found in whois? Or how do we raise the issue with the right people?

1 Like

Unless it is an IPv4 address and involves a CGNAT pool?

I have some regions with insane client behavior too, but I hesitate to block or rate limit in case it causes collateral damage. One would have to check every offender manually and see if the PTR provides any clues. Some ISPs will put something like 2321.cust.village.city.dsl.cgnat.isp.com there.

With the volume and number of bad IPs seen every day, that would be a lot of work. OTOH it is super annoying to burn several TB of transfer each month knowing that ~80% of that traffic comes from ~3% of the client IPs seen.
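A figure like "~80% of traffic from ~3% of client IPs" can be checked against per-IP request counts with a few lines of Python. This is only a sketch: the function name and the sample numbers are made up for illustration, and real counts would have to come from your own server's logs or client statistics.

```python
def share_from_top_clients(requests_per_ip, top_fraction):
    """Fraction of all requests sent by the busiest `top_fraction` of IPs."""
    counts = sorted(requests_per_ip.values(), reverse=True)
    if not counts:
        return 0.0
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)


# Made-up numbers: 97 quiet clients and 3 heavy hitters.
sample = {f"198.51.100.{i}": 10 for i in range(97)}
sample.update({f"203.0.113.{i}": 1300 for i in range(3)})
share = share_from_top_clients(sample, 0.03)  # share of the top 3% of IPs
```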

Perhaps it would not be unreasonable for the NTP Pool project to declare a queries/IP/second that should be supported by a pool volunteer and then everything after that is accepted to be at risk of being dropped?
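As an illustration of what a declared queries/IP/second budget could look like on the server side, here is a minimal token-bucket sketch in Python. The class, its parameters, and the numbers are hypothetical; it is not how any particular NTP server implements rate limiting, and it ignores details such as expiring idle entries.

```python
import time


class PerIpRateLimiter:
    """Token bucket per source IP: `rate` queries/second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # ip -> (tokens, time of last query)

    def allow(self, ip):
        """Return True if this query is within budget, False if it may be dropped."""
        now = self.clock()
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill for the time elapsed since this IP's previous query.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[ip] = (tokens - 1, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False
```

Note that a limiter like this treats a CGNAT address exactly like a single host, which is the collateral-damage concern raised above.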

As IPv4 depletion gets worse and more gets put behind CGNAT, I worry this is not sustainable growth in per-IP query traffic for a volunteer project. Maybe it is acceptable to say that large CGNAT populations can’t be accommodated and should query using IPv6?

Of course, whether you drop the traffic or not, you can’t stop it arriving, which is also a problem. Can we brainstorm ways that pool volunteers can more quickly temporarily remove themselves from the pool when they get overwhelmed? Lower DNS TTL and a simple API for a volunteer to say “enough is enough, remove me until I say I want back in”?

If one were designing an NTP Pool v2 what radical changes would one make to try to address abusive clients?

Off the top of my head:

With the rise of cheap bring-your-own-IP VM providers like Vultr, how about an NTP pool that is dispersed in say 12 or so locations worldwide and uses its own IPv6 space anycasted?

The pool proxies to its volunteers at their own IPv4 /32 and/or IPv6 /128 but only ever presents its own IPs (just on IPv6 - it’s a new world out there) to the clients, distributes client queries amongst its local volunteers. Since NTP clients keep track of servers by IP you should not put different servers behind different IPs, so each volunteer would need a unique IPv6 address to be known to the world by, but there are billions of them in a /64.
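To give a feel for the address budget, here is a small Python sketch that maps a numeric volunteer ID to a stable address inside a single /64. The prefix is taken from the 2001:db8::/32 documentation range, and both the ID scheme and the function name are assumptions made purely for illustration.

```python
import ipaddress

# Documentation prefix; a real pool would use its own address space.
POOL_PREFIX = ipaddress.IPv6Network("2001:db8:0:123::/64")


def volunteer_address(volunteer_id):
    """Map a numeric volunteer ID to a stable address inside the pool /64."""
    if not 0 < volunteer_id < POOL_PREFIX.num_addresses:
        raise ValueError("volunteer_id does not fit in the prefix")
    return POOL_PREFIX[volunteer_id]


addr = volunteer_address(42)           # 2001:db8:0:123::2a
capacity = POOL_PREFIX.num_addresses   # 2**64 addresses in a /64
```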

The thing is, if a volunteer says they have had enough they can disable the proxying and then the pool itself sinks the traffic, not the volunteer. In this way, the idea of allocating a given queries-per-second and monthly GB of traffic to be donated to the pool could be normalised, as once you say “enough”, you do not get any more traffic from the pool.

A very understanding upstream may even let the pool blackhole an individual IP at its border using a BGP community, but that is an advanced topic.

There would be increased latency introduced by the proxying stage between the pool point-of-presence and the volunteer server, but as long as the pool was a bit selective about which volunteers it accepted at which PoP (e.g. automatically drop candidates whose latency/jitter grows too severe), I think the NTP protocol should cope with this. I think it would be better than entire regions going dark because all volunteers were DDoS’ed out.

Fund the project through Patreon at the lowest feasible monthly contribution, or through another subscription method compatible with small transactions. If necessary, only give out per-client-org DNS names to paying users, in the same way that various DNSBLs do.

A pool volunteer would stop being able to tell who their clients were, though (they’d only see pool IPs). It could be taken a step further and every pool volunteer iBGP peers an IPv6 /64 with the pool. That way they directly receive the NTP traffic to their individual /64 but can tear down the BGP session whenever they like, becoming unreachable. The proxying step becomes a routing step.

There is probably a fatal flaw in all of this that I have overlooked.

2 Likes

No, the problem should be fixed by the people who route a lot of IPv4 addresses behind NAT.

Use DHCP-NTP to assign NTP servers and point them to your own proxy server.
This way unaware users/devices will ask their servers and not put pressure on the pool.

Many ISPs, cloud providers and others are simply lazy… This laziness should change: make them aware that running their own NTP proxies (Stratum 2/3/4/etc.) benefits them too.
They also capture traffic before they need to pay peers. A win for them as well.

Maybe something that @ask should put on the main page, to inform those ISPs of the benefits.
Also, maybe it is a good idea to make a special pool for them to access stratum 1 servers, so they have good time while taking load off the pool as a whole.

So they get the best time, in return for putting proxies (local NTP servers!) in place.

I do not mind feeding them from my Stratum1 servers when they run a proxy for their clients.

1 Like

I suspect it has never occurred to them that this problem exists, that all the clients behind the same DNS cache will hit the same 4 servers, and they probably never paid attention to how much bandwidth they’re consuming with this. It needs to be brought to their attention, or they will not realise there’s a problem.
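The effect of a shared DNS cache can be sketched with a small simulation in Python. Everything here is a simplification chosen for illustration: the function name, the 40 made-up servers, one query per client, and uniform random selection are all assumptions, not a model of how the pool actually hands out addresses.

```python
import random
from collections import Counter


def simulate(clients, servers, per_answer=4, shared_cache=True, seed=0):
    """Count queries per server when clients share one cached DNS answer or not."""
    rng = random.Random(seed)
    load = Counter()
    cached = rng.sample(servers, per_answer)  # the answer held by the resolver
    for _ in range(clients):
        answer = cached if shared_cache else rng.sample(servers, per_answer)
        load[rng.choice(answer)] += 1  # each client queries one server
    return load


servers = [f"192.0.2.{i}" for i in range(1, 41)]
shared = simulate(10_000, servers, shared_cache=True)
independent = simulate(10_000, servers, shared_cache=False)
```

With the shared cache, all 10,000 simulated clients land on at most four servers; with independent answers the same load is spread across the whole list.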

1 Like

I like the idea, but it seems to me the pool needs to first start returning IPv6 addresses for all DNS queries. As it is, only queries to 2.*.pool.ntp.org return IPv6 as well as IPv4. I believe the initial reasoning for this was that a number of clients had broken configurations where IPv6 addresses were assigned but couldn’t reach the internet, while IPv4 worked. I think we’ve outlived that justification, as IPv6 use has increased and such broken clients would be having symptoms with other software that would have triggered the networks to clean up by now.

6 Likes

That exists, they can lower their configured bit rate to “Monitoring only”. There’s undoubtedly a way to automate this, but I don’t know how difficult it is. It’s trivial using the web management.

3 Likes

Just to connect some dots, the topic, and the challenge involved in realizing something like an API for this, have been discussed in this thread.

1 Like
Hostname                                  NTP   Drop Int IntL Last     Cmd   Drop Int  Last                              
============================================================================================                              
2001:e68:5e28:3a00:215:5dff:fe9a:7b00    4632   3016   2   2     2       0      0   -     -                  
2001:b011:2008:1880:4a0:48d8:d9d:b770     128     48   3   4     1       0      0   -     -                  
2001:e68:7d68:fa00:215:5dff:fe00:9b00   14857   9681   2   2     3       0      0   -     -                  
2001:e68:6241:8c00:215:5dff:fe00:c800    2314   1494   2   2     2       0      0   -     -                  
2001:e68:6385:7401:215:5dff:fe00:6a00   13475   8764   2   2     1       0      0   -     -                  
2001:e68:8381:8500:215:5dff:fe00:1601   11309   7382   2   2     0       0      0   -     -                  
2001:e68:8342:9500:215:5dff:fe00:b600   11321   7365   2   2     2       0      0   -     -                  

Yeah, we’re talking about clients that don’t behave? There is no reason to be polling an internet NTP server once every 4 seconds.
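For readers wondering where "every 4 seconds" comes from: the sketch below assumes the listing is `chronyc clients` output, where the first `Int` column is the base-2 logarithm of the average interval in seconds, so a value of 2 means 2^2 = 4 seconds. The function name is made up, and the column interpretation is an assumption based on how the post reads the table.

```python
def summarize_client(line):
    """Parse one data row of the table above into a small summary dict."""
    fields = line.split()
    ntp_hits, ntp_drops, interval_log2 = (int(f) for f in fields[1:4])
    return {
        "address": fields[0],
        "interval_s": 2 ** interval_log2,  # log2 seconds -> seconds
        "dropped_pct": round(100 * ntp_drops / ntp_hits, 1),
    }


row = ("2001:e68:7d68:fa00:215:5dff:fe00:9b00   14857   9681"
       "   2   2     3       0      0   -     -")
summary = summarize_client(row)
```

For this row the interval works out to 4 seconds, with roughly two thirds of the requests dropped.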

Not sure about this one since they are from different /48 prefixes, so it can’t be the same dude, but they all have Microsoft MAC addresses if you reverse the EUI-64 address.
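"Reversing the EUI-64 address" means undoing the modified EUI-64 construction: take the lower 64 bits, check for the ff:fe marker in the middle, drop it, and flip the universal/local bit of the first byte. A minimal Python sketch follows; the function name is made up for illustration.

```python
import ipaddress


def mac_from_eui64(address):
    """Recover the MAC from a modified EUI-64 interface ID, or None if it isn't one."""
    iid = ipaddress.IPv6Address(address).packed[8:]  # lower 64 bits
    if iid[3:5] != b"\xff\xfe":
        return None  # not derived from a MAC (e.g. privacy or manual address)
    # Flip the universal/local bit and drop the ff:fe filler bytes.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


mac = mac_from_eui64("2001:e68:5e28:3a00:215:5dff:fe9a:7b00")
```

For the first address in the listing this yields 00:15:5d:9a:7b:00, i.e. the 00:15:5d prefix that the post attributes to Microsoft. The 2001:b011 row does not carry the ff:fe marker, so the function returns None for it.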

2 Likes

Is this sufficient? It’s my perception (haven’t checked) that resolvers are sometimes caching for longer than the pool’s TTL and broken clients are holding onto resolved pool IPs, so it is hard to persuade clients that are broken to the level of abuse to go away.

If just removing yourself from the pool’s DNS is sufficient then yes, this becomes a lot easier.

I think there is a misunderstanding here. Actually, that is how “full” NTP clients are supposed to operate. Note the remark on the “Join the pool” page:

because of how the ntp clients operate it will take weeks, months or even YEARS before the traffic completely goes away

These clients - e.g., but not necessarily limited to, chrony, NTPSec, classic ntpd - will continually poll, but with typically sensible minimum interval, increasing that interval if circumstances permit. And those polls are typically independent, i.e., naturally spread evenly.

The real challenge is SNTP clients and similar, which do not continuously keep track of and poll their upstreams. In general, that is not a problem if they randomize their polls, avoiding polling synchronization between clients.

Many such SNTP clients aren’t an issue, either, as they poll infrequently and implement mechanisms to avoid poll synchronization as per the guidance given on the Information for vendors page.

It is “cheap” clients that do not observe that guidance that cause issues, e.g., because they poll at certain “fixed” times so their poll synchronization leads to the traffic peaks mentioned in this and other threads, or because they don’t adapt their polling interval or have fixed low polling intervals to begin with. Examples are ntpdate run periodically from a cron script, or a “full” NTP client periodically being restarted from a cron script, forcing re-resolution of server names and subsequent polling peaks when the clients work to re-establish synchronization quickly. Simple “home-baked” implementations of the NTP protocol, written for purported simplicity, fall into the same group. This is why the NTP pool pages encourage using a “reputable” NTP implementation (mentioning ntpd classic only, but I think there is consensus that at least chrony and NTPSec also fall into that category nowadays).
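What "randomize polls and adapt the interval" can mean for a simple client is sketched below in Python. The function name, the doubling on failure, and the 1024-second cap are illustrative assumptions, not values taken from the pool's vendor guidance.

```python
import random


def next_poll_delay(base_interval, failures=0, max_interval=1024.0, rng=random):
    """Seconds to wait before the next query: exponential backoff plus jitter."""
    # Double the interval for every consecutive failure, up to a cap.
    interval = min(max_interval, base_interval * 2 ** failures)
    # Pick a random point between half and the full interval, so clients
    # started at the same moment do not all fire at the same instant.
    return rng.uniform(interval / 2, interval)
```

A client would sleep for `next_poll_delay(...)` after each query instead of firing at a fixed wall-clock time such as the top of the hour.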

Please also note the Use the pool page, which already mentions many items that have been suggested above to be added to it. Maybe the language could be stronger here or there, or more elaborate. But not sure how much that helps if people simply don’t read that page, or the others provided by the NTP pool.

2 Likes

To add to this: Once a server does not get regular pool traffic anymore, it should be able to handle whatever residual traffic still arrives. I once “removed” one of my servers from the pool for maintenance with the “Monitoring only” option, and traffic went down by ~80% after just a couple of minutes (= the DNS ttl).

3 Likes

I think you’ll find the DNS software is not to blame, it’s all on the NTP clients, including ntpd. IIRC Miroslav has improved Chrony in this regard recently, and I’m planning to improve ntpd to re-resolve hostnames in non-pool associations (typically “server”) quickly when they stop contributing, and every few weeks even if they are responding, both to allow traffic to decay more quickly when a server is taken down, and to allow a server which changes IP in DNS to see more clients follow the change.

The current ntpd behavior is to resolve non-pool source just once at startup and hold the IP as long as ntpd stays up.
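The proposed policy (re-resolve quickly when a source stops contributing, and every few weeks regardless) could be expressed roughly as follows. This is a hypothetical Python sketch, not ntpd code: the function name and both thresholds are made up for illustration.

```python
def should_re_resolve(now, last_resolved, last_reply,
                      unreachable_after=3600.0, refresh_after=21 * 86400.0):
    """Decide whether a 'server' association should look its hostname up again."""
    if now - last_reply > unreachable_after:
        return True  # source stopped answering: re-resolve quickly
    # Otherwise only refresh periodically, even while the source is healthy.
    return now - last_resolved > refresh_after
```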

4 Likes

DNS resolvers that cache stuff too long certainly exist but may be a fairly insignificant source of NTP traffic.

I’m not sure this would be an improvement. Obviously, in this forum, we mostly have the perspective of a pool server operator, where such behavior on the part of the clients would be desirable.

However, if I configure a specific server rather than the pool, I want for that one to be used (even if initially, only one IP address from potentially multiple returned by DNS is going to be picked). I have some kind of “relationship” with it, and be it only that I found it to have served me well so far. I don’t want it to be switched automatically. If it stops working for me in some way (not limited to not being reachable anymore), I will notice, and take manual action. E.g., replace it with another, again carefully selected upstream server.

If I want even for a “statically” configured server to be updated automatically in the way you describe, then the pool directive works fine for any domain name to point to multiple IP addresses, not only the actual NTP pool.

So I would prefer for the distinction between the server directive (which already has a preempt option by the way which might also play a role), and the pool directive not to be blurred.

Rather, I think we should push more for use of the pool directive when accessing the pool, e.g., by finally updating the configuration example on the Use the pool page.

1 Like

No misunderstanding here. That’s my point - all ntp clients both “good” and “bad” sometimes hold onto pool IPs longer than would be desired, but that’s only a problem when it’s the misbehaving ones doing it.

It’s all very well the join page saying that traffic may take years to go away, but I’m not sure it adequately spells it out that you may be subjected to a lot of traffic from broken clients that can overwhelm your connectivity and that might not go away for an indeterminate period of time. That is, I don’t think the average prospective volunteer appreciates that abusive traffic could be what’s tenacious.

We can’t assume that anyone using the pool as a client will follow any of our advice so while it is good to advise about decent client implementations and the configuration of them, it’s not going to do much to stop problem clients.

But, there are positive reports in this thread that switching to “monitoring only” does make traffic go away pretty quickly, so that’s encouraging. That maybe could be mentioned alongside any mention of abusive traffic levels.

I agree with what Dave Hart (and others, going back years) have said about enabling IPv6 on all the pool names. We can’t keep complaining about CGNAT when we’re not giving an alternative.

2 Likes

I’m with you that some things might well be spelled out more explicitly for newcomers, e.g., the possibility of “abuse” by non-well-behaved clients, and how to potentially deal with it.

I am not seeing that

traffic from broken clients that can overwhelm your connectivity

might not go away for an indeterminate period of time

My understanding is that such traffic is mostly coming from poor implementations of the protocol, rather than from deliberate attacks (not susceptible to the bandwidth controls) or well-behaved implementations (not directly/immediately susceptible to the bandwidth controls). I.e., “cheap” implementations not adhering to some of the best practices. Part of that “poor” implementation is for them to try to be simple, e.g., when configured with a server name, re-resolving that name when needed, rather than having arguably more complex logic to resolve once and then keep track of an address received to use again in the future (rather than simply re-resolving the name every time a request is to be sent).

I would be happy to learn that that understanding is incorrect, e.g., that the majority of issues are coming from otherwise “good” implementations of the protocol so that tweaking such an implementation would actually address what I understand to be the main problem (traffic peaks, vs. constant traffic load which should be manageable via the bandwidth settings). Not sure simply blocking traffic to teach clients a lesson will have the effect we desire, i.e., for various players to see the wrongs of their ways and to change their behavior. Could be even counter-productive, e.g., in the context of a server behind a router being overwhelmed dropping packets. Sure, “full” clients will eventually move on. “Poor” clients will simply keep hitting the addresses they get from DNS as they don’t have the memory/intelligence to remember when a certain IP address did not respond in the past.

Enabling IPv6 on all the pool names would already go some way to mitigate some of the issues seen with IPv4 (e.g., from CGNAT aggregating many distinct clients behind a single IPv4 address, or large number of cheap clients that, for now at least, often do not support IPv6, …).

Regarding IPv4, I believe Ask is planning to move much more to a “proximity”-based server assignment mechanism anyway (rather than the current primarily zone-based one). That would help spread the overall load more evenly, not capping peaks, but reducing the impact on individual servers. In that context, he is also running an experiment to understand better how DNS resolvers and servers are being used to access the pool. Not sure when that will have any outcome, but hoping when it comes, it will also improve the situation.

3 Likes