Beta system now has multiple monitors

beta
monitoring

#1

The beta system has a few new changes.

Just to be really clear – this is just on the beta system.

  • Two active “monitors”(!) which means that we’re getting closer to being able to run monitoring systems in Europe and Asia (and US East Coast?). This has been way too long in the works.
  • One of the monitors does 4 queries (2 seconds apart).
  • The monitors weren’t running consistently for a little while; they’re back now (though run semi-manually while I’m testing and debugging).
  • The old perl monitor has been replaced with a new agent with more safety checks, features and improved performance.
  • The CSV log now includes an “error code”; it’s either the “Kiss of Death” code (reference ID) or the socket error (e.g. “i/o timeout”), as appropriate.
  • To see data for the individual monitors, add a monitor=* query parameter. The lines with an empty “monitor_name” are the aggregate score.
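As a rough illustration of the last bullet, here is a minimal Python sketch that builds a log URL with the monitor parameter and separates aggregate rows from per-monitor rows. Only the URL shape and the “monitor_name” column come from this thread; the other column names and the sample rows are made up for the example, and it assumes the log parses as ordinary CSV with a header row.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlencode

BETA_BASE = "https://web.beta.grundclock.com"


def log_url(ip, limit=250, monitor="*", base=BETA_BASE):
    """Build a score-log URL; monitor="*" asks for every individual monitor."""
    query = urlencode({"limit": limit, "monitor": monitor}, safe="*")
    return f"{base}/scores/{ip}/log?{query}"


def split_by_monitor(csv_text):
    """Return (per_monitor_rows, aggregate_rows).

    Rows with an empty "monitor_name" are treated as the aggregate score.
    """
    per_monitor = defaultdict(list)
    aggregate = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("monitor_name") or "").strip()
        if name:
            per_monitor[name].append(row)
        else:
            aggregate.append(row)
    return dict(per_monitor), aggregate


# Hypothetical sample: only "monitor_name" is a column name taken from the thread.
SAMPLE = """ts,score,monitor_name,error
1,19.5,,
1,19.8,Los Angeles,
1,18.9,Zurich,i/o timeout
"""

if __name__ == "__main__":
    print(log_url("45.76.113.31"))
    monitors, aggregate = split_by_monitor(SAMPLE)
    print(sorted(monitors), len(aggregate))
```

The parsing is kept separate from any download step so it can be tried on a saved copy of the log.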

So far the “four samples” thing is interesting – before it goes to the production system there’s some work to do to figure out if it’s too aggressive (or if servers that don’t answer four requests in 8 seconds are too picky).

I can find examples in the logs, but it feels like “outing” operators. I’m not sure if that really makes sense, but it means I won’t post examples right now. If you are running a server that’s affected by this change, I invite you to post the example so we can discuss.

Also, if you don’t have your server in the beta system, please consider adding it. :slight_smile:


#5

Added an IPv4 and an IPv6 address. :slight_smile: The v6 one didn’t automatically work out the country - not sure if that’s a problem.


#6

Hi Ask,

What are the requirements for a monitoring server? Are there any links describing this and how to compile/configure it?

Kind regards
Hans


#7

That it is well connected and keeps good time itself. Probably no virtual machines and either right next to a stratum 1 system or ideally with a high quality clock (GPS driven or otherwise).

For the beta system it matters less, in particular right now when we’re just testing …

To compile, install Go and then run this:

go get -u -v github.com/ntppool/monitor/cmd/ntppool-monitor

and the binary should be in ~/go/bin/ntppool-monitor. Email me (ask at ntppool org) for a “monitoring key”.

The monitor needs some more metrics/monitoring itself and a daemontools / systemd / etc configuration.
Patches welcome. :smiley:


#8

Thanks!

That was a long-standing issue; fixed today! https://github.com/abh/ntppool/commit/45110786d0ed1d29fe225674f548e750989a7e7a


#9

Any idea of the bandwidth requirements?


#10

For the beta system something that’d round down to zero.

For the production system something like ntp-packet-size times 24 times 5 times ~4000 per day. So… tens of megabytes per day? :slight_smile:
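To make that estimate concrete, here is a small Python sketch of the arithmetic. It assumes the “24 times 5” means 5 checks per hour over 24 hours, uses the 48-byte NTP packet (no extension fields or MAC), and counts one direction only; the 28 bytes of UDP and IPv4 headers are an optional extra.

```python
NTP_PAYLOAD_BYTES = 48        # basic NTP packet, no extension fields or MAC
UDP_IPV4_HEADER_BYTES = 8 + 20  # UDP header + IPv4 header without options


def daily_bytes(servers=4000, checks_per_hour=5, packet_bytes=NTP_PAYLOAD_BYTES):
    """Bytes per day in one direction: packet size x 24 hours x checks x servers."""
    return packet_bytes * 24 * checks_per_hour * servers


if __name__ == "__main__":
    payload_only = daily_bytes()
    on_the_wire = daily_bytes(packet_bytes=NTP_PAYLOAD_BYTES + UDP_IPV4_HEADER_BYTES)
    print(f"payload only: {payload_only / 1e6:.1f} MB/day")
    print(f"with UDP/IPv4 headers: {on_the_wire / 1e6:.1f} MB/day")
```

Under those assumptions the result is in the low tens of megabytes per day per direction, in line with the estimate above. A monitor that sends several samples per check would multiply this accordingly.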


#11

@ask I think the score algorithm is too aggressive in the beta. My server here had problems with Los Angeles but not with Zurich. https://web.beta.grundclock.com/scores/45.76.113.31


Considering that the server is in Australia I think it is reasonable to assume that Australians did not notice any down time. Due to the changing nature of internet links, I would surmise that the link to the monitoring server had issues and not the ntp server.

On the production site 5 servers would also be impacted by this: (-5) active 1 day ago in oceania.


#12

Yeah, it looks like it was indeed an internet link fuss thing (“i/o timeout”'s – https://web.beta.grundclock.com/scores/45.76.113.31/log?limit=250&monitor=12 )

Generally it’s hard to generalize about who’d still have access when some don’t. Isn’t it better for the system to err on the side of caution when including servers in the DNS response?

Somewhere on the todo list: automatic traceroutes when the status changes. Also, maybe a way to have many more monitors.


#13

Yeah, the current system does work quite well. But I do think the system could be improved to add a bit of resiliency when, for example, an internet link goes down. Specifically, I don’t think it would be beneficial to remove servers from DNS if a major part of a region’s bandwidth is removed (e.g. say 30%-ish of the servers). Maybe just remove those from pool.ntp.org but keep them on region.pool.ntp.org? Or have an upper limit of removed servers per day per region? I know it’s a bit of a rare event if that happens, but what do you think?

Detecting this would become easier with multiple monitors or automatic traceroutes.

Note: I’m talking about servers that still answer queries from at least one monitoring server.


#14

Hi Ask,

I think your focus on KoD is too aggressive. My server uses “discard average 5” (instead of 3), requiring clients to keep an average interval of 32 seconds. Clients should not set minpoll lower than the default of 64 seconds, so this average setting seems perfectly normal to me.

However, it has the side effect of not only increasing the minimum average interval by a factor of four, but also decreasing the number of packets allowed in the iburst by a factor of four. This limits the iburst to two valid time replies, giving a KoD at the third one.

This should not affect SNTP clients as (i)burst is useless for them anyway.

RFC-compliant ntpd clients should all be able to handle KoD gracefully. Anyway, they should only use iburst and not burst, so it will only affect the first few packets in a long-running association. Therefore, I think the monitor should just honour the KoD and ignore those packets, using only the valid time responses in the calculation of the score.

The alternative is to do rate limiting in my firewall. There it is easy to both allow an initial burst of 8 packets and limit the average (after the burst) to only 1 packet every 32 seconds. That way I’ll satisfy the monitor and iburst. However, RFC-compliant clients no longer get the KoD/RATE warning, as my firewall will just drop the request. I think replying with KoD/RATE is better :wink:
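The firewall behaviour described above (a burst of 8, then an average of 1 packet per 32 seconds) is essentially a token bucket. Below is a minimal Python sketch of that idea for a single client; it is an illustration only, not a firewall configuration, and the class and parameter names are invented for the example.

```python
class TokenBucket:
    """Allow an initial burst, then refill one token every `seconds_per_token`."""

    def __init__(self, burst=8, seconds_per_token=32.0):
        self.capacity = float(burst)
        self.rate = 1.0 / seconds_per_token  # tokens gained per second
        self.tokens = float(burst)           # start full, so a burst is allowed
        self.last = None                     # time of the previous packet

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) passes."""
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # silently dropped: the client gets no KoD/RATE hint


if __name__ == "__main__":
    bucket = TokenBucket()
    # Nine packets, two seconds apart: the first eight pass, the ninth is dropped.
    print([bucket.allow(t) for t in range(0, 18, 2)])
```

Per-address limiting would keep one such bucket per source IP. As noted above, a drop here is invisible to the client, which is the trade-off against replying with KoD/RATE.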

Cheers,
Arnold


#15

But if there are 3 clients behind the same NAT IP address restarted at the same time, some of their packets would be blocked.


#16

Any server operator enabling the “limited” option in a “restrict” config line will serve a limited number of clients behind NAT. In the default situation the 9th client won’t get a reply. With IoT 9 is not a lot either… However, if they honour the KoD, they can increase the delay between requests. If I implement other rate limiting measures without KoD, the clients don’t have a clue they should increase the delay.

With all DDoS’es going on using UDP port 123, I want to implement rate limiting per IP address.

It would be better if ntpd allowed me to set the average polling time without the side effect of limiting the burst size, but that’s not the case. Even better would be if the burst size is a separate configuration option as well.


#17

Doesn’t the “check 4 times over 8-9 seconds” mimic ‘iburst’? What period do you calculate the average over? (I’d expect virtually all servers in the NTP Pool to see so many clients that they don’t keep an average count for very long?).


#18

First of all, I didn’t do much :wink:

  • I enabled “limited” and “kod” on my “restrict default …” lines (IPv4 and IPv6).
  • I specified “discard average 5” (while the default is 3).

That’s all I did.

Then ntpd does its magic… To my understanding, these three documents are relevant.
https://www.eecis.udel.edu/~mills/ntp/html/poll.html
https://www.eecis.udel.edu/~mills/ntp/html/rate.html
https://www.eecis.udel.edu/~mills/ntp/html/assoc.html

The second link (about rate) describes the “Minimum Average Headway Time (MAH)”. This is what the “discard average” parameter sets.

From the webpage, I understand it works like this. Upon arrival of a packet from a client, the “input counter” for that client is increased by MAH. This counter is then reduced by 1/s. A burst can be 8 packets. If the counter started at 0, this counter reaches a “ceiling” of 8 MAH. “If the input counter is greater than the ceiling, the packet is discarded” (or KoD if KoD is enabled).

According to this description, I expect the “ceiling” to be dynamically set to 8 MAH. With a default MAH of 8, the “ceiling” is 64. With my MAH of 32, I expect the “ceiling” to be 8 * 32 = 256.

However, my version of ntpd (4.2.6p5) seems to set the ceiling at 8 “default MAH”. So at the first packet, the input counter is set from 0 to 32. Two seconds later, the counter is 30. Then I receive the second packet in the burst, so the input counter goes to 62. Two seconds later, the counter is 60. Then the third packet arrives and the counter should go to 92, well below 256, which is 8 MAH. However, my server sends a KoD. I guess because the counter is now larger than 64, the ceiling with the default MAH…

To my understanding, there’s no option to set this “ceiling” manually.
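The counter behaviour described above can be replayed with a short Python sketch. It models only the description in this post (add MAH per packet, drain 1 per second, reject when the counter exceeds the ceiling); it is not taken from the ntpd source, and it assumes the counter is still incremented for a rejected packet.

```python
def simulate_burst(mah, ceiling, arrivals):
    """Replay packet arrival times (seconds) against the input counter.

    Returns one boolean per packet: True if it gets a normal reply,
    False if it would be discarded (or answered with KoD).
    """
    counter = 0.0
    last = None
    accepted = []
    for t in arrivals:
        if last is not None:
            # The counter drains by 1 per second and never goes below zero.
            counter = max(0.0, counter - (t - last))
        last = t
        counter += mah
        accepted.append(counter <= ceiling)
    return accepted


if __name__ == "__main__":
    burst = [0, 2, 4, 6]  # four samples, two seconds apart
    print(simulate_burst(8, 64, burst))    # default MAH, default ceiling
    print(simulate_burst(32, 64, burst))   # MAH 32, ceiling fixed at 8 x default MAH
    print(simulate_burst(32, 256, burst))  # MAH 32, ceiling scaled to 8 x MAH
```

With MAH 32 and a ceiling of 64, the third and fourth packets are rejected, which matches what the four-sample monitor sees. With a ceiling of 256, all four would pass.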

So, a lot of ‘guesses’, ‘seems’ and ‘expects’ in the above text. I am sorry, but I just want to specify “discard average 5”. I have no intention of reverse engineering ntpd nor do I want to claim to know better what ntpd should do than mister Mills :wink:

Cheers, Arnold


#19

Seems like the beta system is not satisfied with my server…
https://web.beta.grundclock.com/scores/103.226.213.30
http://www.pool.ntp.org/scores/103.226.213.30 (for reference)
My restrict setting:
restrict default limited kod nomodify notrap nopeer noquery
restrict -6 default ignore


#20

At a glance it looks like both Los Angeles and Zurich have periodic timeouts. It’d be interesting to see how it’d look if we had a monitor in Asia, too.

https://web.beta.grundclock.com/scores/103.226.213.30/log?limit=500&monitor=*


#21

I don’t suppose changing the sort order of servers to use the DNS name would be on your todo list? It bugs me when the IPv6 and IPv4 plots of the same server are at opposite ends of the list due to sorting by IP address. The more servers I add, the worse it gets.


#22

Hi, just got five e-mails about having crappy servers. :grinning:
https://web.beta.grundclock.com/user/cjdagr7bpnfjppb3n2v74

Out of curiosity, could anyone explain what’s going on? Both servers have a score of 20 in the production pool.

Thanks.


#23

The monitor that does 4 samples (2 seconds apart) is getting two bits set in the leap flag in the response (“3” in the leap column):

https://web.beta.grundclock.com/scores/89.221.210.188/log?limit=500&monitor=12

All monitors:

https://web.beta.grundclock.com/scores/89.221.210.188/log?limit=500&monitor=*
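For reference, the leap value is the top two bits of the first byte of an NTP packet, and RFC 5905 defines the value 3 (both bits set) as “unknown (clock unsynchronized)”. A minimal Python sketch of decoding that byte follows; the function name is invented for the example.

```python
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def decode_first_byte(first_byte):
    """Split the first NTP header byte into leap indicator, version and mode."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must fit in one byte")
    return {
        "leap": first_byte >> 6,            # top 2 bits
        "version": (first_byte >> 3) & 0x7,  # next 3 bits
        "mode": first_byte & 0x7,            # low 3 bits (4 = server)
    }


if __name__ == "__main__":
    fields = decode_first_byte(0xE4)  # leap 3, version 4, mode 4
    print(fields, LEAP_MEANINGS[fields["leap"]])
```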