Add server produces: "Could not check NTP status".

You don’t need to remove your server from the pool to register it on the beta site as well. Servers on the beta site aren’t going into the pool; the beta site is more for testing / monitoring.

Thanks for that, but it doesn’t answer the question regarding the monitor algorithm.

Checking the score page again today, I find this:

[Screenshot: pool.ntp.org statistics for 81.133.34.141, 2020-11-01]

Whereas usually we get something like this:

[Screenshot: pool.ntp.org statistics for 81.133.34.141, 2020-10-31]

So why has the scaling suddenly changed, compressing the actual offset values?

You have one bad sample with a 2000 msec delay; see the red dot at the bottom right of the image. The scale just adapts to the max and min values.

The actual data is:

1604238592,"2020-11-01 13:49:52",-562377441.518062,-5,4.8,6,"Newark, NJ, US",3,RATE

Rate limiting reply?

Thanks for that; it’s the first time I’ve seen the scale change, but I will re-download the monitor data and have another look.

It would be interesting to know whether, if the monitor gets an i/o timeout, it retries without enough delay to take account of any rate limiting, KoD, or whatever…

I wrote a simple script earlier to hammer the server as fast as possible with ntpdate -q requests, and it always fails on the second request with “no server suitable for synchronization found”. Comment out the restrict entries in ntp.conf and it works fine.

It looks like the restrict mechanism might need a revisit?

What “restrict” options have you got set in the conf file?

Restrict args are:

restrict default kod nomodify notrap noquery nopeer
restrict source kod nomodify notrap noquery

I had also included the limited keyword originally, but when I started to stress-test the server with continuous ntpdate -q requests in a tight loop yesterday, only the first request was answered. From what I can see, that’s to be expected, since more than the allowed number of requests was arriving within the rate-limit period. That of course breaks the case where several hosts behind a NAT router, all with the same source IP, send their NTP requests within the client limit period. Most such cases could probably be handled if that period could be specified in milliseconds rather than seconds.
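
For reference, here’s roughly where that period lives in ntp.conf. This is a sketch based on the standard ntpd rate-limiting options as I understand them; check the access-control docs for your ntpd version, as the values quoted in the comment are from memory:

discard average 3 minimum 2   # average headway in log2 seconds (3 = 8 s), minimum spacing in seconds
restrict default kod limited nomodify notrap noquery nopeer

As far as I can tell the minimum spacing only takes whole seconds, which is exactly why the NAT-burst case above is awkward to tune for.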

Anyway, here’s the plot for the last 24 hours or so; currently at 14, with a peak of more than 200 hits per minute.

[Screenshot: pool.ntp.org statistics for 81.133.34.141, 2020-11-02]

I restarted ntpd last night after removing the limited keyword, and the packet stats are:

root@ntp-host:~ # ntpq -c iostats | grep -e sent -e received

received packets: 25125
packets sent: 25120

5 replies adrift, but that hasn’t increased since late last night…

Please run
tcpdump -i any -w /tmp/monewr.pcap host 139.178.64.42
for 12 hours & mail the pcap to stevesommarsntp@gmail.com. I will compare packet-by-packet with what Newark sees.

Hi,

I will run that from later this evening. From the tcpdump tap on the server, I am seeing a lot of clusters of up to 5 requests from the same IP well within a second. Different IPs over time of course, but if the limited keyword had been active, only the first request would have had a reply…

That’s running now, and I can see the remote host and the replies to it. Monitor probes look like groups of three, with a 5-second delay between each.

Looking at the downloaded monitor data, there doesn’t seem to be any correlation between the severe demerits on the score and your own monitor data. Also, the monitor poll doesn’t seem to do any retries, which could help with the unreliable UDP protocol…

Chris sent me a sample tcpdump. I compared it to the Newark monitor and reported back.
Each Newark poll consists of 3 NTP requests separated by 5 seconds. If a response is received for any of these three, that’s a successful poll.

Here’s a summary at the NTP level. During the time of Chris’s data collection:
Newark sent 135 NTP requests
Server received 131 NTP requests. 4 were lost somewhere on the Internet
Server sent 131 NTP responses.
Newark received 103 NTP responses. 28 were lost somewhere on the Internet <<< PROBLEM

This is not just a problem with Newark. I see NTP loss between my NTP clients and Chris’s server.

I’ve written about NTP filtering: https://weberblog.net/ntp-filtering-delay-blockage-in-the-internet/
We can sometimes determine where NTP filtering is done, but I’ve had no success in convincing ISPs to stop filtering NTP.


Take a look at my first screenshot. I think the problem is at points 9 and 11. The route just changed while “pinging” the end.

Steve,

I think you have to assume the ISP problem won’t go away, and instead find a
solution that compensates for it. Otherwise we’re just shifting the
blame, in a way :-).

Overall, what this suggests to me is that there is a mismatch between
the monitoring process’s expectations and real-life NTP operation.

NTP itself is quite robust by design. A typical client will have a
varying list of pool servers to refer to, and it makes little practical
difference if one server in the list is unreachable for a short period,
as that will be compensated for by the flywheel effect of the other
servers in the list. In contrast, correct me if I’m wrong, monitoring makes
just 3 poll attempts at 5-second intervals and gives up if none of them is
answered within 15 seconds. That seems far too blunt compared to the
robustness of NTP itself, and is perhaps the reason for the rapid demeriting
of what are in fact perfectly good time servers. Perhaps it needs a slightly
better-matched filter?

Work here is mainly real-time embedded, hardware and software. I have
limited Unix systems programming experience, but a lot of comms work over
potentially unreliable channels. For that, a retry mechanism is normally
considered essential, often state-driven and with counters and timers, so
that the system can degrade gracefully. In this case, perhaps something
like a 2^n-scaled time delay / backoff between each poll might involve
minimal code change and might be worth trying. I haven’t been able to find
any docs on the monitoring algorithm, but that would help understanding as well.
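
To make that concrete, here is a rough sketch of what I mean, in Go only because I believe that is what the monitor itself is written in (correct me if not). sntpQuery is a stand-in for whatever single-request helper the real monitor uses, and the attempt count and timeouts are just placeholders:

    package monitorsketch

    import "time"

    // queryFunc stands in for a hypothetical sntpQuery(addr, timeout) helper
    // that sends one client-mode NTP request and returns the server time.
    type queryFunc func(addr string, timeout time.Duration) (time.Time, error)

    // pollWithBackoff treats the poll as successful if any attempt gets a
    // reply; between failed attempts it sleeps 1 s, 2 s, 4 s, ... instead of
    // a fixed 5 s, so a brief outage or rate limit doesn't fail the whole poll.
    func pollWithBackoff(query queryFunc, addr string, attempts int) (time.Time, error) {
        var lastErr error
        for i := 0; i < attempts; i++ {
            t, err := query(addr, 2*time.Second)
            if err == nil {
                return t, nil
            }
            lastErr = err
            time.Sleep(time.Duration(1<<uint(i)) * time.Second) // 2^i second backoff
        }
        return time.Time{}, lastErr
    }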

I’m not sure how I can help more with this, but it would be good to contribute
to a solution if possible. I have hardware and software capability, test gear,
standards, etc., and a fair amount of free time at present…

Chris.

We’ve been round this loop a huge number of times. The pool is trying to give a real-world experience to clients, where UDP packets do get lost on the internet, but one monitor is obviously not massively representative, especially for clients accessing servers over routes different to the ones the monitor uses (i.e. not via America). The beta system does have a number of monitors in place. But all that said, the bottleneck is @Ask’s time, as he’s the only one who can put changes in place. (Just search for vendor zone requests… :frowning_face:) Personally I don’t think it will ever get fixed unless @ask wins the lottery and suddenly has more time to spend on the pool, or extra people are able to update the code and/or hardware. :man_shrugging:


Steve, elljay,

Just a few thoughts…

One of the problems is that at any point in time there will be an unknown number of ISP connections in the path from client to server, each doing its own thing, switching circuits, etc. Some packet loss is inevitable, so the system should be able to deal with that. Either by accident or design, the present monitoring system is weighted towards servers closest to the monitoring systems and takes no account of actual server accessibility or uptime.

For monitoring, there’s a conflict between polling the whole list of servers in a reasonable time and having enough time to properly evaluate an individual server. Think of Unix inetd: it accepts incoming service requests on a given interface, forks a child process to handle the request for that port, and returns straight away to handle the next request… The key thing is the fork, which allows normal processing to continue while the forked process works asynchronously in the background.
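
A very rough sketch of that hand-off-and-carry-on idea, using goroutines rather than fork() and keeping to the same hypothetical names as my earlier snippet; the poll argument stands in for whatever per-server poll (with or without retries) the monitor really does:

    package monitorsketch

    import (
        "log"
        "sync"
        "time"
    )

    // scanServers hands each server to a background worker and moves straight
    // on to the next one, so a slow or retrying poll doesn't hold up the rest
    // of the list. It waits for all polls to finish before returning.
    func scanServers(addrs []string, poll func(addr string) (time.Time, error)) {
        var wg sync.WaitGroup
        for _, addr := range addrs {
            wg.Add(1)
            go func(a string) {
                defer wg.Done()
                if t, err := poll(a); err != nil {
                    log.Printf("%s: poll failed: %v", a, err)
                } else {
                    log.Printf("%s: server time %s", a, t.UTC())
                }
            }(addr)
        }
        wg.Wait()
    }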

Timeouts? Do we have enough data to estimate the optimum timeout for the initial reply: 1 second, 2 seconds, or what? Also, do we need a model and test rig to allow a reasonable estimate of how long a circuit is down, to more accurately weed out broken servers while not penalising servers suffering from temporary connection loss?

A test rig for data collection might scan the server list with, say, a 0.5-second initial timeout, but for each failure start a second timer and keep checking that server until a reply is seen or a final timeout expires, then save the timer value. Of course, those values will be all over the place, but they might allow a reasonable prediction of temporary connection-loss times. That might help to specify a more accurate retry schedule.
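
In the same sketch style as before (queryFunc is the hypothetical single-request helper again, and the 0.5-second initial timeout, the retry spacing, and the final timeout are just placeholder choices):

    package monitorsketch

    import "time"

    // measureOutage probes once with a tight timeout; if that fails, it keeps
    // re-probing and records how long it took before a reply was finally seen,
    // up to finalTimeout. The bool reports whether a reply arrived at all.
    func measureOutage(query queryFunc, addr string, finalTimeout time.Duration) (time.Duration, bool) {
        if _, err := query(addr, 500*time.Millisecond); err == nil {
            return 0, true // answered the initial probe, nothing to record
        }
        start := time.Now()
        for time.Since(start) < finalTimeout {
            if _, err := query(addr, time.Second); err == nil {
                return time.Since(start), true // apparent connection-loss duration
            }
            time.Sleep(2 * time.Second) // spaced out so we don't trip rate limiting
        }
        return finalTimeout, false // still unreachable at the final timeout
    }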

To make that more robust, if a poll does ultimately fail, that server could be put into a suspect list that schedules a retry or two at a later time, perhaps up to minutes away, either via a randomised or a scaled backoff delay. That sort of evaluation would more closely match the robust characteristics of NTP itself.

I did have a look at the GitHub monitor code. I’ve never used GitHub and am really only fluent in C here (well, assembler years ago), but I can’t find the code that calculates the scoring, nor any build info, makefile or docs. There may be a good reason for that, but it makes it difficult to see how it works.

As for hardware and software, I have a variety of old machines and OSes that could be brought online at reasonably short notice. Mainly x86, but some Sun SPARC as well. FreeBSD is the OS of choice, but pre-systemd Linux and Windows are here as well. Code-wise, anything that can be built with the GNU tools. It would be interesting to do some modelling of connection-loss failures, so I wonder if there is an NTP library with functions such as GetNtpTime(ip_address) and the like? Either object code, or (preferably) buildable from source.
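
I don’t know of a library with exactly that interface, but the query itself is small enough to sketch. Here is a minimal, unauthenticated GetNtpTime-style example, again in Go to match the monitor-side sketches above (assumptions: plain NTPv4 client mode, no sanity checks beyond packet length, and it simply trusts the server’s transmit timestamp):

    package main

    import (
        "encoding/binary"
        "fmt"
        "net"
        "os"
        "time"
    )

    // getNtpTime sends one client-mode request to addr (e.g. "81.133.34.141:123")
    // and returns the server's transmit timestamp.
    func getNtpTime(addr string, timeout time.Duration) (time.Time, error) {
        conn, err := net.Dial("udp", addr)
        if err != nil {
            return time.Time{}, err
        }
        defer conn.Close()
        conn.SetDeadline(time.Now().Add(timeout))

        req := make([]byte, 48)
        req[0] = 4<<3 | 3 // LI=0, VN=4, Mode=3 (client)
        if _, err := conn.Write(req); err != nil {
            return time.Time{}, err
        }

        resp := make([]byte, 48)
        n, err := conn.Read(resp)
        if err != nil {
            return time.Time{}, err
        }
        if n < 48 {
            return time.Time{}, fmt.Errorf("short NTP response: %d bytes", n)
        }

        // Transmit timestamp: seconds since 1900 in bytes 40-43, fraction in 44-47.
        secs := binary.BigEndian.Uint32(resp[40:44])
        frac := binary.BigEndian.Uint32(resp[44:48])
        const ntpToUnix = 2208988800 // seconds between the 1900 and 1970 epochs
        nsec := int64(frac) * 1_000_000_000 >> 32
        return time.Unix(int64(secs)-ntpToUnix, nsec), nil
    }

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: getntptime <server-address>")
            os.Exit(2)
        }
        t, err := getNtpTime(os.Args[1]+":123", 2*time.Second)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Println("server time:", t.UTC())
    }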

We’ve just started another month’s lockdown here in the UK, and I’m unlikely to be formally working until at least the new year, so this could make good use of the time. I know one should never volunteer for anything, but this does look interesting…

Chris

It seemed to be doing OK last night, but this morning:

[Screenshot: pool.ntp.org statistics for 81.133.34.141, 2020-11-05]

No change in traffic levels to the server and 100% system uptime, so what’s the story? Perhaps a ghost in the machine?…