Add more monitoring stations


#1

Hey,

is there a possibility to add new monitoring stations in other regions?
Could this also be done by the community, like the pool servers?

Maybe this could reduce the round-trip time to some regions during the check.


#2

It would be good to know the requirements for a monitoring station; it might help people figure out if they can donate a VM to the cause.

That is, assuming the monitoring has the ability to be split across different nodes.


#3

I agree wholeheartedly that regional monitoring is desperately needed.
Of late, I’ve been getting repeated problems with asymmetry on the transatlantic leg of connections.
My server is staying in synchronisation with the GPS/PPS source I use as my reference clock (within about 5µs except when there are sudden changes in temperature, when it may briefly go to a few tens of µs offset) and within a few ms of all other servers that I track in Europe, but is showing tens to hundreds of ms offset from the monitor, meaning I keep dropping out of the pool.
I doubt that such a problem is limited to my connection (I monitor the connection, and use bandwidth management to ensure that there is always sufficient capacity for NTP traffic) or my ISP (if it were my ISP, why would my offset to other EU servers stay within reasonable ranges?), so other servers must also be affected, which will reduce the number of available servers in the pool whenever this situation arises.
It also appears as if just having one other monitor in Europe would be sufficient to ensure service is maintained here (and presumably, another one in any other area so affected).
Until we have such additional monitoring capability, the pool is dangerously lacking in resilience and losing capacity.
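
For anyone wondering how path asymmetry turns into a clock offset, here is a minimal sketch of the standard four-timestamp NTP offset calculation. The function name, the parameters and the 150 ms / 50 ms delays are made up for illustration; they are not measurements from my server.

```python
def measured_offset(true_offset_ms, delay_out_ms, delay_back_ms):
    """Offset a client computes for a server, given one-way delays in ms.

    true_offset_ms is (server clock - client clock). All values are
    hypothetical; server processing time is assumed to be zero.
    """
    t1 = 0.0                                   # client sends (client clock)
    t2 = t1 + delay_out_ms + true_offset_ms    # server receives (server clock)
    t3 = t2                                    # server replies immediately
    t4 = t1 + delay_out_ms + delay_back_ms     # client receives (client clock)
    # Standard NTP offset estimate; it assumes both directions take equally long.
    return ((t2 - t1) + (t3 - t4)) / 2


# A perfectly synchronised server seen over a 150 ms / 50 ms path
# appears to be off by half the difference between the two delays.
print(measured_offset(0.0, 150.0, 50.0))
```

With symmetric delays the same function returns the true offset, which would be consistent with my offsets to nearby EU servers staying small.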


#4

Thank you for asking. Arnold Schekkerman had a good summary of my concerns and questions about implementing something like this in a recent email thread: http://lists.ntp.org/pipermail/pool/2016-October/007965.html

Typically people ask for this because the California system can’t talk to their server and it seems unfair. However, the point of the system is to take out servers that are potentially bad for some clients somewhere, so adding more monitoring systems would likely exclude more servers more of the time.

That being said, it’s something I want to do, but there’s some outstanding work to do first and some decisions to make. I also want the monitoring checks to be at a higher frequency.

Work:

  • Write a better “scheduler” to allow multiple monitors to “eat” at the queue.
  • Rewrite how the monitor data is stored (short term and in particular long term). Currently they’re just in an SQL table and then moved to an archive table after X weeks. The archive table is a MySQL table with the “archive” engine and it’s not very reliable (or useful). Adding more data to this system won’t work (not with the current hardware anyway).
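
To illustrate the scheduler item, here is a rough sketch of several monitors "eating" from one queue. `CheckQueue`, `claim` and the 60 second interval are hypothetical names and values, not the pool's actual code, and a real version would need locking or a database transaction around the claim.

```python
import heapq


class CheckQueue:
    """Hypothetical shared queue of servers waiting to be checked."""

    def __init__(self, servers, interval=60.0):
        self.interval = interval  # seconds between checks of one server (assumed)
        # Min-heap of (next_due_time, server); every server is due at time 0.
        self._heap = [(0.0, server) for server in servers]
        heapq.heapify(self._heap)

    def claim(self, now, limit):
        """Hand out up to `limit` due servers and reschedule each of them."""
        claimed = []
        while self._heap and len(claimed) < limit and self._heap[0][0] <= now:
            _, server = heapq.heappop(self._heap)
            claimed.append(server)
        for server in claimed:
            heapq.heappush(self._heap, (now + self.interval, server))
        return claimed


queue = CheckQueue(["a", "b", "c"])
batch_for_monitor_1 = queue.claim(now=0.0, limit=2)
batch_for_monitor_2 = queue.claim(now=0.0, limit=2)
```

Because a claimed server is rescheduled right away, two monitors asking at the same moment get disjoint batches.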

Decisions:

  • What happens when monitoring stations disagree? I’d likely go with the one signaling a worse outcome. Maybe with exceptions for countries/regions that aren’t well connected to the rest of the internet (Cough .cn cough)?
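
As a sketch of the "worse outcome wins" rule, assuming each monitor reports one of three made-up outcome levels (the `Outcome` enum and `combine_worst` are hypothetical, and the exception for poorly connected regions is not modelled):

```python
from enum import IntEnum


class Outcome(IntEnum):
    """Hypothetical outcome levels; a higher value is a worse result."""
    OK = 0
    DEGRADED = 1
    UNREACHABLE = 2


def combine_worst(results):
    """Return the worst outcome reported by any monitor.

    `results` maps a monitor name to the Outcome it reported.
    """
    if not results:
        raise ValueError("no monitor results to combine")
    return max(results.values())


verdict = combine_worst({"california": Outcome.OK, "europe": Outcome.UNREACHABLE})
```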

There’s also the trouble of just managing more monitoring stations. Either it has to be done very carefully, or code has to be written to decide from moment to moment if a particular monitoring station is to be trusted. The latter could also enable using RIPE Atlas probes for tests.

There are lots of possible implementations of this.

In the email thread referenced above, @mlichvar suggested that NTP servers could monitor each other and upload the data. That’d be a whole new type of work and exploration. I recall maybe very early on the system worked like this; or Adrian was talking about having it work like that? (See http://lists.ntp.org/pipermail/pool/2016-October/007964.html) Anyway, it doesn’t feel like a practical approach to me, but I reference it just to point out that there are many paths to take here. There’s no “just add more monitoring stations” solution that’ll make anything obviously better.


#5

What about breaking it up regionally, so only one monitoring server monitors each NTP server?

  • California monitors Americas, Asia and Oceania.
  • China monitors China
  • Europe monitors Europe and Africa
  • Nobody monitors Antarctica :sob:

That would be complicated in a different way.
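
Roughly what I have in mind, as a sketch only. The region keys and monitor names are made up for illustration, and how a server gets assigned to a region is left open:

```python
# Hypothetical region -> monitor table matching the list above.
MONITOR_FOR_REGION = {
    "americas": "california",
    "asia": "california",
    "oceania": "california",
    "china": "china",
    "europe": "europe",
    "africa": "europe",
}


def monitor_for(region):
    """Return the single monitor responsible for a region, or None if there is none."""
    return MONITOR_FOR_REGION.get(region)


print(monitor_for("africa"))      # europe
print(monitor_for("antarctica"))  # None
```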


#6

That’d be natural and surely better if the goal was to most accurately measure the performance of each server, but it’s not. The goal is to remove servers from the system that might not be available to some clients (temporarily most of the time).

Whenever I see localized problems that the system catches (and the operators are frustrated because it works from some/other/most places) I don’t think “oh, the monitoring system shouldn’t have caught that”, I think of all the other servers that also are unavailable from somewhere but we didn’t have something to catch that. :slight_smile:

Of course somewhere someone always has a problem on the internet, so we couldn’t, say, use RIPE Atlas and exclude anything that has a problem sometimes from somewhere.

One monitoring station is at least simple.

The observant reader might have noticed that the beta system actually has multiple monitors configured, but only one is running at the moment. I was testing that it worked some years ago after starting support for multiple monitor systems. I never finished the work to improve how the metrics are stored; or rather, I’ve had multiple false starts on it. Hopefully one of the next attempts will be successful at making something that’ll be simple to operate, scale and last a long time.

http://www.beta.grundclock.com/scores/128.138.141.172?


#7

Some penguin should set up at least one pool server there first. :penguin:


#8

How about this penguin :smiley:?


#9

You need at least three monitors, to form a quorum. If two say a server is bad, it is bad.

Your stats for any NTP server could be an average of three monitors, representing the average NTP client and not any specific location.
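
A minimal sketch of both ideas, assuming each monitor reports a bad/good verdict and an offset in ms. The function names and the sample numbers are hypothetical:

```python
from statistics import mean


def quorum_says_bad(verdicts, needed=2):
    """True if at least `needed` monitors flagged the server as bad.

    `verdicts` maps a monitor name to True (bad) or False (good).
    """
    return sum(1 for is_bad in verdicts.values() if is_bad) >= needed


def average_offset_ms(offsets_ms):
    """Plain average of the offsets (in ms) seen by all monitors."""
    return mean(offsets_ms.values())


verdicts = {"us": True, "eu": True, "asia": False}
offsets = {"us": 1.0, "eu": 2.0, "asia": 6.0}
print(quorum_says_bad(verdicts), average_offset_ms(offsets))
```

With three monitors and `needed=2`, a single monitor with a bad path can no longer take a server out on its own.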


#10

That’d make the system better at detecting if the server itself is healthy or not. It’d also be ignoring some network problems. For people using the NTP Pool, though, either will cause a failure.