Monitoring and packet loss in transit

Hi!

A lot to answer here (and in the other threads)! I’ll put down some thoughts that hopefully answer your comments, and some of the others from the other threads.

The DNS system is massively distributed. I haven’t heard of DNS queries going unanswered (not even when it’s midnight somewhere with a lot of computers and we briefly get many times the normal number of queries). As far as I know it’s also been working reliably in terms of giving server operators a relatively “even” level of queries, at least in countries that aren’t massively underserved (which is a whole other area that I’d love to spend some time fixing).

For the monitoring system discussion:

Yes, it’s really frustrating when clients get sub-optimal servers because of problems between the monitoring system and the server. The trouble is that even with more monitors, it’s difficult to tell which clients would have trouble too. Just adding more monitors will add more data, but without some care it’s not obvious that it’ll be BETTER for the clients. In the situation you describe it probably would be, but in others it might not.

Most of all it’ll just be more data, and honestly, so far very few of you have been genuinely helpful in trying to “debug the internet”, so I’m concerned that more monitoring systems will just give people more things to “point fingers at”.

All that being said, my plan is still to have a system for adding more monitors. I’ve put in a bunch of the infrastructure groundwork. The data collection for the monitoring data is ready, but there’s a lot of other development work left:

  • The graphs on the site need a round of updating.
  • A system to authenticate the monitors needs to be built (my plan is to have them register and then get a TLS certificate issued by Vault).
  • A new component to manage the queue of servers to be checked needs to be built. I have a bunch of ideas but I’m not really sure how this should work yet.
  • The monitoring system needs to log the round trip time of the query. This should be used to “assign” monitors appropriately to each server.
  • The monitoring software has a simple “self-check” it runs to make sure it’s most likely working correctly. It’ll need some refinement to work elsewhere.
  • Some kind of management of monitors coming and going (or monitors having trouble) needs to be built. Something like “have each server monitored by at least X and no more than Y monitors; at least Z of which should be well-established monitors with a track record”.
  • The monitor should learn to do traceroutes.
  • The system should then schedule automatic traceroutes when problems occur to make it more likely that we can figure out which internet paths are dropping NTP.
  • Traceroute data needs to be stored and some minimal UI to pull up the data to start (and fancier software to analyze it later).
  • The monitoring software (linked above) needs packaging in RPM / .deb / whatever formats, and repositories need to be set up to make it easy to install.
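
On the authentication bullet: the rough shape, assuming Vault’s PKI secrets engine, is that a registered monitor gets a client TLS certificate via the `issue` endpoint (`POST /v1/<mount>/issue/<role>`). Here’s a minimal sketch of building that API request in Go; the mount path `pki`, the role name `monitor`, and the host and function names are my assumptions, not the actual setup:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// issueCertRequest builds a request against Vault's PKI secrets engine:
// POST {addr}/v1/{mount}/issue/{role} with the monitor's name as the
// certificate common name. The caller would send it with an http.Client
// and pull data.certificate / data.private_key out of the JSON response.
func issueCertRequest(addr, token, mount, role, commonName string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"common_name": commonName})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/v1/%s/issue/%s", addr, mount, role)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Vault authenticates API calls with this header.
	req.Header.Set("X-Vault-Token", token)
	return req, nil
}

func main() {
	req, err := issueCertRequest("https://vault.example.net:8200", "s.example-token", "pki", "monitor", "monitor-abc.example.net")
	if err != nil {
		panic(err)
	}
	// Prints: POST https://vault.example.net:8200/v1/pki/issue/monitor
	fmt.Println(req.Method, req.URL.String())
}
```

In practice the registration step would hand the monitor a short-lived token scoped to just that role, so a compromised monitor can’t issue certificates for other names.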
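
For the RTT logging and the “at least X / no more than Y / at least Z established” bullets, the assignment logic could look roughly like this. This is only a toy sketch: the `Monitor` type, `selectMonitors`, and the numbers are made up for illustration, not what the real system will do:

```go
package main

import (
	"fmt"
	"sort"
)

// Monitor is a hypothetical record: a monitor name, its median
// round-trip time (ms) to the server being assigned, and whether
// it's a well-established monitor with a track record.
type Monitor struct {
	Name        string
	MedianRTTms float64
	Established bool
}

func byRTT(ms []Monitor) {
	sort.Slice(ms, func(i, j int) bool { return ms[i].MedianRTTms < ms[j].MedianRTTms })
}

// selectMonitors assigns between minTotal and maxTotal monitors to a
// server, preferring low RTT, while guaranteeing at least minEstablished
// of them have a track record. Returns nil if the constraints can't be
// met with the available candidates.
func selectMonitors(candidates []Monitor, minTotal, maxTotal, minEstablished int) []Monitor {
	var established, rest []Monitor
	for _, m := range candidates {
		if m.Established {
			established = append(established, m)
		} else {
			rest = append(rest, m)
		}
	}
	if len(established) < minEstablished || len(candidates) < minTotal {
		return nil
	}
	// Reserve the closest established monitors first...
	byRTT(established)
	picked := append([]Monitor{}, established[:minEstablished]...)
	// ...then fill the remaining slots with whoever is closest,
	// established or not, up to maxTotal.
	pool := append(append([]Monitor{}, established[minEstablished:]...), rest...)
	byRTT(pool)
	for _, m := range pool {
		if len(picked) == maxTotal {
			break
		}
		picked = append(picked, m)
	}
	return picked
}

func main() {
	candidates := []Monitor{
		{"mon-de", 12.1, true},
		{"mon-us-east", 85.0, true},
		{"mon-fr", 9.4, false},
		{"mon-jp", 210.3, false},
		{"mon-nl", 7.8, false},
	}
	// "at least 3, no more than 4, at least 2 established":
	// prints the two established monitors first, then the
	// closest of the rest.
	for _, m := range selectMonitors(candidates, 3, 4, 2) {
		fmt.Printf("%s (%.1f ms, established=%v)\n", m.Name, m.MedianRTTms, m.Established)
	}
}
```

The real version would also have to handle monitors coming and going, which mostly means re-running something like this when the candidate set changes.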

We don’t need all of this to add more monitors, but a bunch of it for sure. And yes, I could definitely use help. I only have so much time each week.

The beta site has a new system for managing accounts where multiple people can have access to the same account (instead of it being one user per account). I put this in place to make it easier for teams to manage servers or monitors together. Helping people change email addresses and similar account details used to require careful manual updates by me; in the new system it’s self-service, and the volunteers helping with support requests can do it as well.

The new features could use some more testing and feedback though! I’d like to get it rolled out to the production site and move on to focusing on the next new features.

For your note about it being slow to post to community.ntppool.org – this site is hosted by Discourse, so if it’s slow for you that’s completely unrelated to the rest of the NTP Pool system (except for the general case that if the internet connection between your part of the world and the US is poor, then all the services run from here won’t perform as well as they should).