Beta system now has multiple monitors


#1

The beta system has a few new changes.

Just to be really clear: this only applies to the beta system.

  • Two active “monitors” (!), which means we’re getting closer to being able to run monitoring systems in Europe and Asia (and the US East Coast?). This has been way too long in the works.
  • One of the monitors does 4 queries (2 seconds apart).
  • The monitors weren’t running consistently for a little while; they’re back now (though run semi-manually while I’m testing and debugging).
  • The old Perl monitor has been replaced with a new agent with more safety checks, more features, and improved performance.
  • The CSV log has gained an “error code” field; it’s the “Kiss of Death” code (reference ID) or the socket error (“i/o timeout”), as appropriate.
  • To see data for the individual monitors, add a monitor=* query parameter; the lines with an empty “monitor_name” are the aggregate score. (See the example below.)
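
For example, pulling the per-monitor CSV log for a server from the beta scores site would look roughly like this (the hostname and URL shape are taken from the log link quoted later in this thread; the server IP is a placeholder):

curl 'https://web.beta.grundclock.com/scores/<server-ip>/log?limit=250&monitor=*'

Each row shows which monitor took the sample along with the new error code; rows with an empty monitor_name are the combined score.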

So far the “four samples” approach is interesting – before it goes to the production system there’s some work to do to figure out whether it’s too aggressive (or whether servers that don’t answer four requests in 8 seconds really are misbehaving).

I can find examples in the logs, but posting them feels like “outing” operators. I’m not sure that concern really makes sense, but it means I won’t post examples right now. If you are running a server that’s affected by this change, I invite you to post an example so we can discuss it.

Also, if you don’t have your server in the beta system, please consider adding it. :slight_smile:


#5

Added an IPv4 and an IPv6 server. :slight_smile: The v6 one didn’t automatically work out the country - not sure if that’s a problem.


#6

Hi Ask,

What are the requirements for a monitoring server? Are there any links describing this and how to compile/configure it?

Kind regards
Hans


#7

That it’s well connected and keeps good time itself. Probably no virtual machines, and either right next to a stratum 1 system or, ideally, with a high-quality clock (GPS-driven or otherwise).

For the beta system it matters less, in particular right now when we’re just testing …

To compile, install Go and then run this:

go get -u -v github.com/ntppool/monitor/cmd/ntppool-monitor

and the binary should be in ~/go/bin/ntppool-monitor. Email me (ask at ntppool org) for a “monitoring key”.
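
(Side note: on newer Go releases where module mode is the default, installing a binary with go get -u is deprecated; assuming the repository still builds as a module at the same import path, the module-aware equivalent would be

go install github.com/ntppool/monitor/cmd/ntppool-monitor@latest

and the binary lands in the same ~/go/bin directory.)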

The monitor needs some more metrics/monitoring itself and a daemontools / systemd / etc configuration.
Patches welcome. :smiley:
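
As a very rough starting point, a minimal systemd unit could look something like the sketch below; the user, the binary path, and the assumption that the binary runs in the foreground with no arguments (picking up its monitoring key from its own configuration) are placeholders, not the real setup:

[Unit]
Description=NTP Pool monitor (beta)
After=network-online.target
Wants=network-online.target

[Service]
# Hypothetical user; the build step above leaves the binary in ~/go/bin
User=ntpmon
ExecStart=/home/ntpmon/go/bin/ntppool-monitor
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target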


#8

Thanks!

That was a long-standing issue; fixed today! https://github.com/abh/ntppool/commit/45110786d0ed1d29fe225674f548e750989a7e7a


#9

Any idea of the bandwidth requirements?


#10

For the beta system, something that’d round down to zero.

For the production system, something like the NTP packet size times 24 times 5 times ~4,000 per day. So… tens of megabytes per day? :slight_smile:
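
Working that out (reading the “24 times 5” as roughly five checks per hour over 24 hours, and assuming ~90 bytes on the wire per NTP packet once UDP/IP/Ethernet headers are added to the 48-byte payload):

5 × 24 × ~4,000 servers ≈ 480,000 queries per day
480,000 × ~90 bytes ≈ 43 MB per day, in each direction

so indeed tens of megabytes per day.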


#11

@ask I think the score algorithm is too aggressive in the beta. My server here had problems with Los Angeles but not with Zurich: https://web.beta.grundclock.com/scores/45.76.113.31


Considering that the server is in Australia, I think it is reasonable to assume that Australians did not notice any downtime. Due to the changing nature of internet links, I would surmise that the link to the monitoring server had issues and not the NTP server.

On the production site, 5 servers would also be impacted by this: “(-5) active 1 day ago” in Oceania.


#12

Yeah, it looks like it was indeed an internet link issue (“i/o timeout”s – https://web.beta.grundclock.com/scores/45.76.113.31/log?limit=250&monitor=12).

In general it’s hard to know who’d still have access when some don’t. Isn’t it better for the system to err on the side of caution when including servers in the DNS responses?

Somewhere on the todo list: automatic traceroutes when the status changes. Also, maybe a way to have many more monitors.
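
Until that’s automated, a manual check from the monitor’s side against the affected server (the IP from the score page above) would look something like this, assuming traceroute and mtr are installed:

traceroute -n 45.76.113.31
mtr --report --report-cycles 60 45.76.113.31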


#13

Yeah, the current system does work quite well. But I do think the system could be improved to add a bit of resiliency when, for example, an internet link goes down. Specifically, I don’t think it would be beneficial to remove servers from DNS if a major part of a region’s bandwidth is removed (say, 30%-ish of the servers). Maybe just remove those from pool.ntp.org but keep them on region.pool.ntp.org? Or have an upper limit on removed servers per day per region? I know it’s a bit of a rare event, but what do you think?

Detecting this would become easier with multiple monitors or automatic traceroutes.

Note: I’m talking about servers that still answer queries from at least one monitoring server.