Monitoring and packet loss in transit

@ask please answer: will there be any changes in monitoring? Every weekend, some servers in Russia lose score because of packet loss (between he.net and Rascom, if that matters). Honestly, this question is getting old. Right now I see about 57% loss, and even this post is being sent incredibly sloooooowly. Pool monitoring is affected, and I suspect the pool DNS servers are too.

I understand that errors in server scores do not play a big role in how the pool works worldwide, but this is still bad. Clients get redirected to non-optimal servers, and one of the tasks of the pool is to give clients the geographically nearest good servers. A single monitoring server undermines this “GeoDNS server” principle. I am seriously thinking of adding a dummy pool zone to my DNS and steering all my ISP customers to my own server to get rid of the score problem. I don’t like this idea, but what are the alternatives?

Errors in transit do play a big role in the entire pool. I recently tried to sign up two servers to the pool and the monitor was not able to reach them. After a few tries I gave up. How many potential servers never reach the pool because the monitor doesn’t see them and the potential new user gives up before ever being able to provide their server? What is active in the pool now are only those servers that have a reasonably error-free network path to the monitor server. That list is biased.

Hi!

A lot to answer here (and in other threads)! I’ll put down some thoughts here that hopefully answer your comments (and some of the others from other threads).

The DNS system is massively distributed. I haven’t heard about DNS queries not being answered (not even when it’s midnight somewhere with a lot of computers and we get many many times the normal number of queries for a brief moment). It’s also, as far as I know, been working very reliably in terms of giving a relatively “even” level of queries for server operators in countries that aren’t massively underserved (which is a whole other area that I’d love to spend some time fixing).

For the monitoring system discussion:

Yes, it’s really frustrating when clients get sub-optimal servers because of problems between the monitoring system and the server. The trouble is that even with more monitors it’s difficult to tell which clients would have trouble, too. Just adding more monitors will add more data, but without some care it’s not obvious that it’ll be BETTER for the clients. Yes, in the situation you describe it probably will, but in others it might not.

Most of all it’ll just be more data and honestly so far there are very few of you who are genuinely helpful in trying to “debug the internet”, so I’m concerned just having more monitoring systems will just give more things to “point fingers at”.

All that being said, my plan still is to have a system to add more monitors. I have put in a bunch of the infrastructure groundwork. The data collection for the monitoring data is ready, but there’s a lot of other development work still.

  • The graphs on the site need a round of updating.
  • A system to authenticate the monitors needs to be built (my plan is to have them register and then get a TLS certificate issued by Vault).
  • A new component to manage the queue of servers to be checked needs to be built. I have a bunch of ideas but I’m not really sure how this should work yet.
  • The monitoring system needs to log the round trip time of the query. This should be used to “assign” monitors appropriately to each server.
  • The monitoring software has a simple “self-check” it runs to make sure it’s probably working fine. It’ll likely need some refinements to work elsewhere.
  • Some kind of management of monitors coming and going (or monitors having trouble) needs to be built. Something like “have each server monitored by at least X and no more than Y monitors; at least Z of which should be well-established monitors with a track record”.
  • The monitor should learn to do traceroutes.
  • The system should then schedule automatic traceroutes when problems occur to make it more likely that we can figure out which internet paths are dropping NTP.
  • Traceroute data needs to be stored and some minimal UI to pull up the data to start (and fancier software to analyze it later).
  • The monitoring software (linked above) needs packaging in RPM / .deb / whatever formats and repositories need to be set up to make it easy to install.
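
To make the “at least X and no more than Y monitors; at least Z of which should be well-established” idea concrete, here is a minimal sketch in Python. It is not how the pool does or will do it: the `Monitor` class, the `assign_monitors` function, and its parameter names are all hypothetical, and it assumes each monitor has already logged a round trip time to the server in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    name: str
    rtt_ms: float      # round trip time from this monitor to the server (assumed already measured)
    established: bool  # True if the monitor has a track record


def assign_monitors(monitors, min_total, max_total, min_established):
    """Pick monitors for one server, preferring the lowest round trip time.

    Returns the chosen monitors sorted by RTT, or None when the
    constraints cannot be met with the monitors available.
    """
    if min_established > max_total or min_total > max_total:
        return None
    by_rtt = sorted(monitors, key=lambda m: m.rtt_ms)
    established = [m for m in by_rtt if m.established]
    if len(established) < min_established or len(by_rtt) < min_total:
        return None

    # Reserve the closest established monitors first, then fill the
    # remaining slots with whichever monitors are closest overall.
    chosen = established[:min_established]
    for m in by_rtt:
        if len(chosen) >= max_total:
            break
        if m not in chosen:
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.rtt_ms)


if __name__ == "__main__":
    candidates = [
        Monitor("a", 10.0, False),
        Monitor("b", 20.0, False),
        Monitor("c", 30.0, True),
        Monitor("d", 40.0, True),
        Monitor("e", 5.0, False),
    ]
    picked = assign_monitors(candidates, min_total=2, max_total=3, min_established=1)
    print([m.name for m in picked])
```

This sketch always fills up to the maximum when enough monitors exist; a real scheduler would also have to handle monitors coming and going, which is left out here.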

We don’t need all of this to add more monitors, but a bunch of it for sure. And yes, I could definitely use help. I only have so much time each week.

The beta site has a new system for managing accounts where multiple people can have access to the same account (instead of it being one user per account). I put this in place to make it easier for teams to manage servers or monitors together. Changing email addresses and the like on accounts used to need careful manual updating by me; in the new system it’s self-service, and the volunteers helping with the support requests are able to do it too.

The new features could use some more testing and feedback though! I’d like to get it rolled out to the production site and move on to focusing on the next new features.

As for your note about it being slow to post to community.ntppool.org: this site is hosted by Discourse, so if it’s slow for you that’s completely unrelated to the rest of the NTP Pool system (except for the general case that if the internet connection from your part of the world to/from the US is crappy then all the services being run from here won’t perform as well as they should).

Happy to help. Not sure what to look at… :slightly_smiling_face:


Thanks for your responses @ask, they are really appreciated. I didn’t know a separate GitHub account existed specifically for the monitor tool. The monitor sources under the abh GitHub account seemed pretty old to me. Cool to see the latest sources are available.

I have a number of development systems specifically for generating packages for various operating systems and processor architectures for my own projects. If you agree, I can take on the task of making the monitor tool more robust in different environments and building the various distribution packages and repositories.


Hi Ask!

That’s good news, I will look at the beta.

I think you approach the task as a programmer, which makes it non-trivial and complex. I approach it as a network engineer. I don’t think the pool project needs to “debug and repair the network” at all. Just create as much independent, stable infrastructure as possible and route around network connectivity problems, or at least most of them.

First, every continent has hundreds or thousands of ISPs and a dense mesh of high-capacity links. Within a single continent, zone, or region, the situation is more stable. Even if there is an outage somewhere, there are always low-cost, high-bandwidth routes around the problem, and local ISPs react quickly. Between continents there are only a few links. If one breaks, it has a big effect on ALL traffic. But we don’t care. If it weren’t for pool monitoring, I would not pay attention to the loss. I have no ongoing interest in a route through a problematic link. But the pool is affected.

So? Divide and rule.

I mentioned DNS and CDNs because they are distributed systems in which each part is relatively independent. It might be worth doing the same with the monitoring system. I think the more independent each monitoring server is, the better. Perhaps each monitoring server should become a regional server.

For example, each monitoring server could monitor only the servers in its zone and manage its own DNS instance. The project does not need to be complex and aggregate large amounts of data from many servers to make a “complex good server” decision. Every monitoring server can make its own decision. For statistics, the main site can receive summaries from the remote monitoring servers. Another approach: every monitoring server queries all servers and makes its own good/bad decision for its own region.
Here I need to refresh my knowledge, and I’m not ready to propose a solution for the DNS system right away. But I feel this is more of a CDN and anycast task than a DNS task.
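
A rough sketch of the “every monitor decides for itself and only sends a summary” idea, in Python. Everything here is an assumption for illustration: the function names, the 20% loss threshold, and the shape of the summary are made up, and the real pool scoring works differently.

```python
def local_decision(probe_results, max_loss=0.2):
    """Decide good/bad per server from this monitor's own probes only.

    probe_results maps a server address to a list of booleans
    (True = reply received). max_loss is an assumed threshold.
    """
    decisions = {}
    for server, replies in probe_results.items():
        if not replies:
            continue  # no probes yet, so no decision for this server
        loss = 1 - sum(replies) / len(replies)
        decisions[server] = loss <= max_loss
    return decisions


def summary(region, decisions):
    """Build the small statistics record a regional monitor would report."""
    good = sum(decisions.values())
    return {"region": region, "good": good, "bad": len(decisions) - good}


if __name__ == "__main__":
    results = {
        "192.0.2.1": [True] * 9 + [False],          # 10% loss
        "192.0.2.2": [True, False, False, True],    # 50% loss
    }
    decisions = local_decision(results)
    print(decisions)
    print(summary("ru", decisions))
```

The point of the sketch is only that the good/bad decision needs no data from other regions; the central site sees counts, not raw probes.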

Hint: instead of traceroute monitoring, you can use the TTL in query answers to roughly estimate server distance and detect big route changes. (Over long distances, significant route changes may not change the TTL much, so it is only rough.) But, as I said, I see no reason to try to “debug the internet” and complicate the system too much. Also, you know that in an IP network the routes “to server” and “from server” can differ, so you have twice as many potential problems in transit. Detecting a problem on the reverse path is not trivial until you get a reverse traceroute, and you can’t automate that. Asymmetric routing is especially common over long distances, such as intercontinental paths.
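
The TTL hint could look roughly like this in Python. This is a sketch under assumptions: it guesses that the server sent its reply with one of the common initial TTL values (64, 128 or 255), the function names and the change threshold of 3 hops are hypothetical, and reading the TTL off a received packet is left out because it is platform-specific. Since the TTL on a reply is decremented on the way back, the estimate only describes the “from server” direction.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumed typical defaults; a server may use something else


def estimate_hops(received_ttl):
    """Guess the hop count on the reply path from the TTL seen on arrival."""
    if not 1 <= received_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    # Assume the sender started from the smallest common value
    # that is not below what we received.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= received_ttl)
    return initial - received_ttl


def route_changed(previous_ttl, current_ttl, threshold=3):
    """Flag a likely route change when the hop estimate shifts by `threshold` or more."""
    shift = abs(estimate_hops(current_ttl) - estimate_hops(previous_ttl))
    return shift >= threshold


if __name__ == "__main__":
    print(estimate_hops(52))       # reply arrived with TTL 52, assumed initial TTL 64
    print(route_changed(52, 47))   # hop estimate moved by 5
    print(route_changed(52, 51))   # hop estimate moved by 1
```

A route can change a lot without the hop count changing, so this can only hint that something moved, not where.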
