Beta system now has multiple monitors

ask · February 10, 2018, 10:11am

The beta system has a few new changes.

Just to be really clear – this is just on the beta system.

Two active “monitors”(!) which means that we’re getting closer to being able to run monitoring systems in Europe and Asia (and US East Coast?). This has been way too long in the works.
One of the monitors does 4 queries (2 seconds apart).
The monitors weren’t running consistently for a little while; they’re back now (though run semi-manually while I’m testing and debugging).
The old perl monitor has been replaced with a new agent with more safety checks, features and improved performance.
The CSV log has added an “error code”; it’s the “Kiss of Death” code (reference ID) or the socket error (“i/o timeout”) as appropriate.
To see data for the individual monitors, add a monitor=* query parameter. The lines with an empty “monitor_name” are the aggregate score.
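
A minimal sketch of how the per-monitor lines could be separated from the aggregate lines once the CSV log has been downloaded. The URL shape follows the log links posted in this thread; the sample rows and every column other than monitor_name are invented for illustration, so the real CSV layout may well differ.

```python
import csv
import io
from urllib.parse import urlencode

BETA_BASE = "https://web.beta.grundclock.com/scores"


def log_url(ip, limit=250, monitor="*"):
    """Build a log URL of the shape used in this thread (no request is made)."""
    query = urlencode({"limit": limit, "monitor": monitor}, safe="*")
    return f"{BETA_BASE}/{ip}/log?{query}"


def split_by_monitor(csv_text):
    """Separate aggregate rows (empty monitor_name) from per-monitor rows."""
    aggregate, per_monitor = [], {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("monitor_name") or "").strip()
        if name:
            per_monitor.setdefault(name, []).append(row)
        else:
            aggregate.append(row)
    return aggregate, per_monitor


# Invented sample data; only the monitor_name column is taken from the post.
SAMPLE = """ts,score,error,monitor_name
2018-02-10 10:00:00,19.5,,
2018-02-10 10:00:00,19.8,,Zurich
2018-02-10 10:00:00,14.2,i/o timeout,Los Angeles
"""

aggregate, per_monitor = split_by_monitor(SAMPLE)
print(log_url("45.76.113.31"))
print(len(aggregate), sorted(per_monitor))
```

Grouping by monitor name makes it easier to see whether timeouts come from a single monitor or from all of them.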

So far the “four samples” thing is interesting – before it goes to the production system there’s some work to do to figure out if it’s too aggressive (or if servers that don’t answer four requests in 8 seconds are too picky).

I can find examples in the logs, but it feels like “outing” operators. I’m not sure if that really makes sense, but it means I won’t post examples right now. If you are running a server that’s affected by this change, I invite you to post the example so we can discuss.

Also, if you don’t have your server in the beta system, please consider adding it.

elljay · February 10, 2018, 1:45pm

Added an IPv4 and an IPv6 server. The IPv6 one didn’t automatically work out the country; not sure if that’s a problem.

HansMayer · February 10, 2018, 10:20pm

Hi Ask,

What are the requirements for a monitoring server? Are there any links describing this and how to compile and configure it?

Kind regards
Hans

ask · February 10, 2018, 11:38pm

That it is well connected and keeps good time itself. Probably no virtual machines and either right next to a stratum 1 system or ideally with a high quality clock (GPS driven or otherwise).

For the beta system it matters less, in particular right now when we’re just testing …

To compile, install Go and then run this:

go get -u -v github.com/ntppool/monitor/cmd/ntppool-monitor

and the binary should be in ~/go/bin/ntppool-monitor. Email me (ask at ntppool org) for a “monitoring key”.

The monitor needs some more metrics/monitoring itself and a daemontools / systemd / etc configuration.
Patches welcome.

ask · February 11, 2018, 9:16am

Thanks!

That was a long-standing issue; fixed today! See commit abh/ntppool@4511078 on GitHub (“Use geoip http service rather than a local database”).

elljay · February 11, 2018, 9:59am

Any idea of the bandwidth requirements?

ask · February 11, 2018, 11:20am

For the beta system something that’d round down to zero.

For the production system something like ntp-packet-size times 24 times 5 times ~4000 per day. So… tens of megabytes per day?
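
As a rough check of that estimate, here is the same multiplication written out. The 48-byte payload is the size of an NTP packet without extension fields, and the 28 bytes of overhead assume IPv4 plus UDP headers; counting both the request and the reply is my own assumption about what should be included.

```python
NTP_PAYLOAD_BYTES = 48        # NTP packet without extension fields
UDP_IPV4_OVERHEAD_BYTES = 28  # 20-byte IPv4 header + 8-byte UDP header

# The factors from the post: 24 times 5 times ~4000 per day.
PACKETS_PER_DAY = 24 * 5 * 4000


def daily_bytes(packet_bytes, both_directions=True):
    """Bytes per day for the given per-packet size."""
    packets = PACKETS_PER_DAY * (2 if both_directions else 1)
    return packets * packet_bytes


payload_only = daily_bytes(NTP_PAYLOAD_BYTES, both_directions=False)
on_the_wire = daily_bytes(NTP_PAYLOAD_BYTES + UDP_IPV4_OVERHEAD_BYTES)

print(f"payload only, one direction: {payload_only / 1e6:.1f} MB/day")
print(f"with headers, both directions: {on_the_wire / 1e6:.1f} MB/day")
```

Either way the result stays in the tens of megabytes per day.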

publicarray · February 19, 2018, 8:40pm

@ask I think the score algorithm is too aggressive in the beta. My server here had problems with Los Angeles but not with Zurich: https://web.beta.grundclock.com/scores/45.76.113.31

Considering that the server is in Australia I think it is reasonable to assume that Australians did not notice any down time. Due to the changing nature of internet links, I would surmise that the link to the monitoring server had issues and not the ntp server.

On the production site 5 servers would also be impacted by this: (-5) active 1 day ago in oceania.

ask · February 20, 2018, 1:01am

Yeah, it looks like it was indeed an internet link problem (“i/o timeout” errors: https://web.beta.grundclock.com/scores/45.76.113.31/log?limit=250&monitor=12 )

Generally it’s hard to generalize about who’d still have access when some don’t. Isn’t it better for the system to err on the side of caution when including servers in the DNS response?

Somewhere on the todo list: automatic traceroutes when the status changes. Also, maybe a way to have many more monitors.

publicarray · February 21, 2018, 8:07am

Yea the current system does work quite well. But I do think the system could be improved to add a bit of resiliency when, for example, an internet link goes down. Specifically, I don’t think it would be beneficial to remove servers from DNS if a major part of a region’s bandwidth is removed (e.g. say 30% ish of the servers). Maybe just remove those from pool.ntp.org but keep them on region.pool.ntp.org? Or have an upper limit of removed servers per day per region? I know it’s a bit of a rare event if that happens, but what do you think?

Detecting this would become easier with multiple monitors or automatic traceroutes.

Note: I’m talking about servers that still answer queries from at least one monitoring server.

arnold · March 12, 2018, 8:26pm

Hi Ask,

I think your focus on KoD is too aggressive. My server uses “discard average 5” (instead of 3), requiring clients to keep an average interval of 32 seconds. Clients should not set minpoll lower than the default of 64 seconds, so this average setting seems perfectly normal to me.

However, it has the side effect of not only increasing the minimum average interval by a factor of four, but also decreasing the allowed number of packets in the iburst by a factor of four. This limits the iburst to two valid time replies, with a KoD at the third one.

This should not affect SNTP clients as (i)burst is useless for them anyway.

RFC-compliant ntpd clients should all be able to handle KoD gracefully. In any case, they should only use iburst and not burst, so it will only affect the first few packets of a long-running association. Therefore, I think the monitor should just honour the KoD and ignore those packets, using only the valid time responses when calculating the score.

The alternative is to do rate limiting in my firewall. There it is easy to both allow an initial burst of 8 packets and limit the average (after the burst) to only 1 packet every 32 seconds. That way I’ll satisfy the monitor and iburst. However, RFC-compliant clients no longer get the KoD/RATE warning, as my firewall will just drop the request. I think replying with KoD/RATE is better.

Cheers,
Arnold

mnordhoff · March 12, 2018, 8:57pm

But if there are 3 clients behind the same NAT IP address restarted at the same time, some of their packets would be blocked.

arnold · March 12, 2018, 10:00pm

Any server operator enabling the “limited” option in a “restrict” config line will serve a limited number of clients behind NAT. In the default situation the 9th client won’t get a reply. With IoT 9 is not a lot either… However, if they honour the KoD, they can increase the delay between requests. If I implement other rate limiting measures without KoD, the clients don’t have a clue they should increase the delay.

With all the DDoS attacks going on using UDP port 123, I want to implement rate limiting per IP address.

It would be better if ntpd allowed me to set the average polling time without the side effect of limiting the burst size, but that’s not the case. Even better would be if the burst size is a separate configuration option as well.

ask · March 13, 2018, 5:10am

Doesn’t the “check 4 times over 8-9 seconds” mimic ‘iburst’? What period do you calculate the average over? (I’d expect virtually all servers in the NTP Pool to see so many clients that they don’t keep an average count for very long?).

arnold · March 13, 2018, 4:29pm

First of all, I didn’t do much.

I enabled “limited” and “kod” on my “restrict default …” lines (IPv4 and IPv6).
I specified “discard average 5” (while the default is 3).

That’s all I did.

Then ntpd does its magic… To my understanding, these three documents are relevant.
https://www.eecis.udel.edu/~mills/ntp/html/poll.html
https://www.eecis.udel.edu/~mills/ntp/html/rate.html
https://www.eecis.udel.edu/~mills/ntp/html/assoc.html

The second link (about rate) describes the “Minimum Average Headway Time (MAH)”. This is what the “discard average” parameter sets.

From the webpage, I understand it works like this. Upon arrival of a packet from a client, the “input counter” for that client is increased by MAH. This counter is then reduced by 1/s. A burst can be 8 packets. If the counter started at 0, this counter reaches a “ceiling” of 8 MAH. “If the input counter is greater than the ceiling, the packet is discarded” (or KoD if KoD is enabled).

According to this description, I expect the “ceiling” to be dynamically set to 8 MAH. With a default MAH of 8, the “ceiling” is 64. With my MAH of 32, I expect the “ceiling” to be 8 * 32 = 256.

However, my version of ntpd (4.2.6p5) seems to set the ceiling at 8 “default MAH”. So at the first packet, the input counter is set from 0 to 32. Two seconds later, the counter is 30. Then I receive the second packet in the burst, so the input counter goes to 62. Two seconds later, the counter is 60. Then the third packet arrives and the counter should go to 92, well below 256, which is 8 MAH. However, my server sends a KoD. I guess because the counter is now larger than 64, the ceiling with the default MAH…

To my understanding, there’s no option to set this “ceiling” manually.
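
The counter behaviour described above can be sketched as a small simulation. This models only the description in this post, not ntpd’s actual source; in particular, letting rejected packets still add to the counter is an assumption, and the function and parameter names are made up for the example.

```python
def simulate(mah, ceiling, arrivals):
    """Return 'ok' or 'kod' per packet for the counter model described above.

    mah      -- seconds added to the counter per packet (2 ** "discard average")
    ceiling  -- threshold above which a packet is rejected
    arrivals -- packet arrival times in seconds, ascending
    """
    counter = 0.0
    last = None
    results = []
    for t in arrivals:
        if last is not None:
            # The counter drains by 1 per second and never goes below zero.
            counter = max(0.0, counter - (t - last))
        last = t
        counter += mah  # assumption: rejected packets still count
        results.append("kod" if counter > ceiling else "ok")
    return results


BURST = [0, 2, 4, 6]  # four queries, 2 seconds apart, like the beta monitor

default_setup = simulate(mah=8, ceiling=64, arrivals=BURST)    # "discard average 3"
fixed_ceiling = simulate(mah=32, ceiling=64, arrivals=BURST)   # "discard average 5", ceiling 8 * 8
scaled_ceiling = simulate(mah=32, ceiling=256, arrivals=BURST) # "discard average 5", ceiling 8 * 32

print(default_setup)
print(fixed_ceiling)
print(scaled_ceiling)
```

With a ceiling fixed at 64 and an MAH of 32, the third and fourth queries are rejected, which matches the KoD seen on the third packet. With a ceiling of 8 * 32 the whole burst would pass.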

So, a lot of ‘guesses’, ‘seems’ and ‘expects’ in the above text. I am sorry, but I just want to specify “discard average 5”. I have no intention of reverse engineering ntpd, nor do I want to claim to know better than Mr. Mills what ntpd should do.

Cheers, Arnold

alica · March 24, 2018, 2:30am

Seems like the beta system is not satisfied with my server…
https://web.beta.grundclock.com/scores/103.226.213.30
http://www.pool.ntp.org/scores/103.226.213.30 (for reference)
My restrict setting:
restrict default limited kod nomodify notrap nopeer noquery
restrict -6 default ignore

ask · March 24, 2018, 3:55pm

At a glance it looks like both Los Angeles and Zurich have periodic timeouts. It’d be interesting to see how it’d look if we had a monitor in Asia, too.

https://web.beta.grundclock.com/scores/103.226.213.30/log?limit=500&monitor=*

jpp · March 24, 2018, 7:34pm

I don’t suppose changing the sort order of servers to use the DNS name would be on your todo list? It bugs me when the IPv6 and IPv4 plots of the same server are at opposite ends of the list due to sorting by IP address. The more servers I add the worse it gets.

Majkl · March 26, 2018, 2:43pm

Hi, just got five e-mails about having crappy servers.
https://web.beta.grundclock.com/user/cjdagr7bpnfjppb3n2v74

Out of curiosity, could anyone explain what’s going on? Both servers have a score of 20 in the production pool.

Thanks.

ask · March 27, 2018, 5:07am

The monitor that does 4 samples (2 seconds apart) is getting two bits set in the leap flag in the response (“3” in the leap column):

https://web.beta.grundclock.com/scores/89.221.210.188/log?limit=500&monitor=12

All monitors:

https://web.beta.grundclock.com/scores/89.221.210.188/log?limit=500&monitor=*
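
For context, the leap indicator is the top two bits of the first byte of an NTP packet, and a value of 3 means the server is reporting its clock as unsynchronized. A small sketch of decoding that byte; the example byte values are made up and not taken from the logs above.

```python
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def decode_first_byte(first_byte):
    """Split the first byte of an NTP header into leap indicator, version, mode."""
    leap = (first_byte >> 6) & 0b11
    version = (first_byte >> 3) & 0b111
    mode = first_byte & 0b111
    return leap, version, mode


# 0xE4 = 11 100 100 in binary: leap 3, version 4, mode 4 (server).
leap, version, mode = decode_first_byte(0xE4)
print(leap, version, mode, LEAP_MEANINGS[leap])
```

A healthy server reply would normally carry leap 0, for example a first byte of 0x24 for version 4 in server mode.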
