No PR activity for the project and a dramatic reduction in the number of active servers

It’s not filtering. At least with regard to Russia, it’s heavy packet loss from the Telia network at 213.155.130.28 (nyk-b2-link.telia.net) to Packet at 62.115.175.183 (packethost-ic-345229-nyk-b2.c.telia.net).
My UDP traceroute is lost there (the last responding hop is Telia; Packet does not reply). ICMP ping and traceroute are lost there as well. Even HTTP/TCP is lost, and sometimes we cannot even open the pool.ntp.org: the internet cluster of ntp servers page.

When packets of multiple protocols are lost, this is not a filter. It’s bad connectivity: link overload (though with overload, packets usually drop intermittently; in our case ping disappears completely for 20-120 seconds, which is not typical of overload), link/interface/route/routing-protocol flapping, or DDoS. The problem is between Telia and Packet (or within Telia). Looking at the trace from RU to the pool servers, the problem is clear as day. The reverse route and trace go through he.net, bypassing Telia, and show no trouble (and yes, the route is asymmetric there).

On 22 December I wrote to Packet and opened ticket [XHCD-6557-HDBQ] with ping/trace details. Since you are a Packet client, you can join the ticket and push them to resolve this issue ASAP.

It is a serious design flaw in the system when a single monitor server can kick a cluster of Stratum 1 servers connected to a central national internet exchange and a country’s reference clocks from the pool due to random packet loss at an intermediate connection outside of the country/continent where the ntp servers and primary users are located.

1 Like

That is why I keep saying that each continent/large zone should have its own monitoring server, preferably connected to a major local IX. Many services set up cache/CDN servers closer to the user; the pool management infrastructure goes in the opposite direction.

1 Like

Because an overwhelming majority of the queries are basic SNTP from embedded devices. SNTP is a one-shot query: if the server doesn’t reply, the client just gives up (until the next time the command is run). Likewise, there are many existing OSes out there running ntpd that don’t support the “pool” server directive and won’t replace a server with a new IP if it stops responding (or never responds in the first place).
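To make that behaviour concrete, here is a minimal Python sketch of such a one-shot SNTP query (the default host and the 2-second timeout are just placeholders; real embedded clients are typically small C implementations, so this only approximates the pattern described above):

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def sntp_once(host="pool.ntp.org", timeout=2.0):
    """Send a single SNTP request; give up silently on timeout, like many embedded clients."""
    # 48-byte request: LI=0, VN=4, Mode=3 (client) in the first byte, rest zero
    packet = b"\x23" + 47 * b"\x00"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(packet, (host, 123))
        reply, _ = sock.recvfrom(48)
    except socket.timeout:
        return None  # one shot: no retry, no alternate server
    finally:
        sock.close()
    # Transmit timestamp: seconds field at bytes 40-43 of the reply
    tx_seconds = struct.unpack("!I", reply[40:44])[0]
    return tx_seconds - NTP_EPOCH_OFFSET

server_time = sntp_once()
print("no reply" if server_time is None else time.ctime(server_time))
```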

Likewise, if there were no monitoring, a person could theoretically flood the pool with non-working IPs, or even worse, flood the pool with servers serving incorrect time (possibly for some malicious purpose against a server or device that uses the pool).

Even major links can have issues… Not saying that is the case in the above instance, but it’s not unheard of in the history of the Internet…

Also, remember that pulling a server out of the pool only TEMPORARILY removes it from DNS rotation, so it doesn’t receive any new clients until its score stabilizes. All the existing clients will continue to query it (so long as the server didn’t really go down)…

3 Likes

When the server score drops, the number of requests to it decreases severalfold. At score 20 it gets 3.5-7k queries per second; at score -5, only about 500 queries per second. When a working server stops receiving requests, that is bad, but it doesn’t play a big role for the pool project itself: the load is simply redistributed. When many servers from one region are dropped by mistake, though, that is a problem.

2 Likes

If that were the only problem, nobody would complain.

Some people cannot even get any monitor to work with their server, yet ntpdate, when queried, works fine.
That suggests the monitor is not too good.

But if that were the only problem, still not many would complain.

A lot of people get only 1 or 2 out of 3 monitors working on the beta system, yet ntpdate doesn’t fail.
Failure of 1 monitor results in a daily mail to pester the sysop.

However, if that were just one of the problems, still most wouldn’t complain.

If you ask for help on how to make the monitor work, you are bullied into the woods in the hope you never get out.

The biggest problem of this project is the total lack of willingness by Ask to even start resolving the issue.
Instead you get nonsense replies and are told there is a beta system… yeah, I used it, and it confirmed the issues.
So I reported all findings… you guessed it: loads of mails and NO changes to the system.

My Stratum 1 server waits until this is fixed; it’s running, and running well, but it won’t join the pool… until when?

I agree on this, there must be monitoring.

But having a monitor just to make IoT devices happy seems silly to me; those devices should be better programmed overall. They are not only bad at NTP, their security is almost non-existent too.
The NTP pool should not have to worry about those devices and should simply state they may only query once every 24 hours or less. A 1-second difference is not a major issue for e.g. a webcam.

We need a solid monitoring system with, say, 10 monitors all over the world, where just 1 out of 10 needs to see you online and on time, just 1, and then it’s OK.
If e.g. all 10 fail, then you are removed from DNS for e.g. 1 hour; it should be that simple.
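Purely as an illustration of that rule (the monitor names, the result format and the one-hour ban are assumptions taken from the proposal above, not from the actual pool code), the decision logic could be sketched like this:

```python
from datetime import datetime, timedelta, timezone

BAN_DURATION = timedelta(hours=1)  # the hypothetical 1-hour removal from the proposal above

def evaluate(server, monitor_results, now=None):
    """monitor_results maps monitor name -> True if that monitor saw the
    server online and on time in the last check round."""
    now = now or datetime.now(timezone.utc)
    if any(monitor_results.values()):
        # At least 1 of the 10 monitors succeeded: the server stays in DNS.
        return {"server": server, "in_dns": True, "banned_until": None}
    # All monitors failed: temporarily remove the server from DNS rotation.
    return {"server": server, "in_dns": False, "banned_until": now + BAN_DURATION}

# Example: 9 hypothetical monitors fail, 1 succeeds -> the server stays in DNS.
results = {f"monitor-{i}": False for i in range(1, 10)}
results["monitor-10"] = True
print(evaluate("203.0.113.5", results))
```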

1 Like

Hi, guys, it’s me again. Reporting:

I am in correspondence with the ISPs along the route. For now, some changes have been made, and the packet loss situation has significantly improved for me. It is not yet clear whether this improvement is temporary or final; tomorrow will show. Presumably the problem was in filters put in place against a DDoS at an intermediate ISP.
In any case, the problem of a single monitoring station having to reach the whole world remains very relevant.

4 Likes

Hi Umike,

Been there, done that. It won’t help.
I have been speaking with my ISP, the carriers and Packet.
They are all willing, they all check and make changes (so they say, at least), but it doesn’t help.
The NTP UDP packets seem to have very little priority.
I wouldn’t be surprised if they are dropped into darkness in order to keep Netflix and the like happy.

The weird thing is, if you check with ntpdate you practically always get a decent response.
I had somebody here in Belgium (who left the pool too) who wasn’t even able to get any monitor response to his server.
So we tested with ntpdate: bingo, no problem at all, every single time ntpdate reported the time and date.

Maybe Ask will post the source code of the monitor so we can see how it’s done.
I did find a monitor of his, but he said it was old and not the current one.

I kept hitting walls and nothing changes :cry:

It’s a little easier for me to influence the situation, since I’m an ISP engineer and approach it as an ISP problem (our customers suffer from the packet loss too!): running diagnostics, testing from Looking Glasses and interacting with ISP NOC teams worldwide. This is one of the reasons why I am extremely indignant at the pool monitoring. I understand the network a bit and can see exactly what is happening. (At the same time I am very disappointed with NTP clients; about half of the NTP traffic is just garbage.)

After correspondence with one of the intermediate ISPs, my situation has improved significantly. No score drops today.

You’re not quite right about low UDP priority. Voice/SIP, online games… Chrome’s QUIC protocol and many Google services like YouTube are transmitted over UDP when possible. In general, lower UDP priority means a lot (a LOT!!!) of customer complaints (gamers will kill you, yep :slight_smile: ). But NTP specifically can be filtered, since it is used in DDoS attacks.
I have seen customers who run an open NTP/DNS server (especially on MikroTik) and constantly generate 10-30 Mbit/s of outbound NTP/DNS traffic. That’s a problem for an ISP: not so much because of channel/hardware load, but because of the garbage traffic and the poor network reputation. Therefore we try to identify such “flooder” customers and keep our network clean. We are a small ISP and we do it; many large ISPs do not. Some ISPs may simply filter NTP as a radical solution to the flood/DDoS problem.

I think monitoring should look like a “main server” plus probes. A probe could even be a container or virtual machine image (anyone could deploy it). The probe syncs to geographically local servers, does simplified monitoring of NTP server availability and a rough accuracy check, then periodically uploads its stats to the primary server over TCP. TCP is not filtered as heavily as UDP/NTP (because of DDoS) and is less prone to problems. (Another solution is to use UDP proxies/traffic redirection/tunneling instead of probes.) This could greatly improve the adequacy of the monitoring.
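Just to illustrate the idea (the report URL, the target server list and the probe name below are made-up placeholders, not anything from the real pool infrastructure), a probe’s main loop might look roughly like this in Python:

```python
import json
import socket
import struct
import time
import urllib.request

NTP_EPOCH_OFFSET = 2208988800
TARGETS = ["192.0.2.10", "192.0.2.11"]             # placeholder NTP servers to check
REPORT_URL = "https://monitor.example.org/report"  # placeholder main-server endpoint

def check_ntp(host, timeout=2.0):
    """One SNTP query; return the |local - server| offset in seconds, or None on failure."""
    packet = b"\x23" + 47 * b"\x00"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        t_sent = time.time()
        sock.sendto(packet, (host, 123))
        reply, _ = sock.recvfrom(48)
        t_recv = time.time()
    except OSError:
        return None
    finally:
        sock.close()
    secs, frac = struct.unpack("!II", reply[40:48])       # transmit timestamp
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    return abs(server_time - (t_sent + t_recv) / 2)       # crude offset estimate

def run_once():
    results = [{"server": h, "offset": check_ntp(h)} for h in TARGETS]
    body = json.dumps({"probe": "ru-probe-1", "results": results}).encode()
    req = urllib.request.Request(REPORT_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)   # report to the main server over TCP

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(300)   # report every 5 minutes
```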

In Europe I would suggest hosting a probe at Hetzner. It is a well-known German hosting provider with good reliability. I think they also have their own NTP servers, which could be used as the primary reference.

2 Likes

Thanks for your thoughtful reply @umike.

Just wanted to comment on your suggestion of Hetzner: I have four dedicated servers there, all working as NTP servers. Their network is reliable and fast, but their NTP servers are Stratum 2, with a very high root dispersion to the upstream Stratum 1 server: between 12 and 39 msec currently. My servers at Hetzner use my own Stratum 1 server in the Netherlands and two Stratum 1 servers in Germany as a reference and report a root dispersion between 0.6 and 1.0 msec. So while their network is good and they do not apply any DDoS filtering AFAIK, their NTP servers are low quality.
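For anyone who wants to verify such root dispersion figures themselves, here is a small sketch that reads the root dispersion field directly from an NTP response; the host name is a placeholder, and tools such as `ntpq -c rv` or `chronyc tracking` report the same value more conveniently:

```python
import socket
import struct

def root_dispersion(host, timeout=2.0):
    """Query an NTP server once and return (stratum, root dispersion in milliseconds)."""
    packet = b"\x23" + 47 * b"\x00"   # LI=0, VN=4, Mode=3 (client)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(packet, (host, 123))
        reply, _ = sock.recvfrom(48)
    finally:
        sock.close()
    stratum = reply[1]
    # Root dispersion: 32-bit unsigned fixed-point (16.16) seconds at bytes 8-11.
    rootdisp = struct.unpack("!I", reply[8:12])[0] / 2**16
    return stratum, rootdisp * 1000.0

# Placeholder host name; substitute the server you want to inspect.
print(root_dispersion("ntp1.example.net"))
```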

2 Likes

It’s just a simplified suggestion.
You could also act like Tor: rent a minimal server in the region (an output/exit node) and start a tunnel from the main server to that node. Route requests to the region/zone through this output node, or do a double check, from the main server and from the output nodes, and then aggregate the results.

There could also be a variant with proxy nodes but without tunnels. The main server sends UDP to the proxy node from a port > 1024 to a port > 1024, which avoids NTP filtering. On the proxy node, iptables rewrites the IPs/ports (a simple double NAT translation) and routes the query to the NTP server, then receives the answer and translates/routes it back to the main server.
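As an alternative illustration of the same idea in user space (a tiny Python relay instead of iptables NAT rules; the listen port and target address below are placeholders), the proxy node could simply forward each datagram and relay the reply:

```python
import socket

LISTEN_ADDR = ("0.0.0.0", 11123)   # high port the main server talks to (placeholder)
TARGET_NTP = ("192.0.2.10", 123)   # NTP server being monitored (placeholder)

def relay():
    """Forward each datagram from the main server to the NTP server and relay the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(LISTEN_ADDR)
    while True:
        data, main_server = listener.recvfrom(512)
        upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        upstream.settimeout(2.0)
        try:
            upstream.sendto(data, TARGET_NTP)   # the query goes out to port 123
            reply, _ = upstream.recvfrom(512)
        except OSError:
            continue                            # timeout: the main server sees no reply
        finally:
            upstream.close()
        listener.sendto(reply, main_server)     # back to the main server, high port to high port

if __name__ == "__main__":
    relay()
```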

There are many options. The most reliable, of course, is a separate, fully functional monitoring server for the zone, which checks the servers, manages the regional DNS zone and stays in sync with the main server, rather like a distributed DNS system. Even if the main server fails, the regional server would be able to keep the zone running for some time.

I wonder if the recent Russia issue was caused by their “test” to “unplug from the internet” and isolate the Russian network from the rest of the world?

1 Like

The issue occurred a week or two earlier, and then again 3-4 days before the “test”. The “test” was empty big words for politicians who are not versed in network technologies. It’s as if you announced rebuilding a car into a truck, but really only replaced the air freshener in the cabin. It is very sad to realize this, and it’s not a topic for this community anyway.

I added my servers to the pool again; that lasted less than a week before Newark started to bitch again.

Ask, why don’t you fix this monitoring system? It is not good and has been proven to fail.
Newark is not stable enough as a monitor for the whole world.

We have been asking for it to be fixed for months and months now, and nothing happens.
Ask, do you really care about this project?

Steve has been feeding you data since September, proving Newark is wrong.

The simple reason, which I already stated near the beginning of the topic: nobody is paying for it.

Ask explained in this thread that he has to do it between other obligations. While there is a serious design flaw in the monitoring system, the more serious design flaw is the lack of a monetary and organizational foundation for the project. Remember that the NTP pool is not a Network Time Foundation project but a private project of Ask and his consultancy company Develooper LLC.

1 Like

Can you lay off? Ignoring what people post and harassing them instead doesn’t help anything.

1 Like

Nobody pays the server owners either. You underestimate the power of the community.
I think that some hosters/IXes already run their own NTP and might agree to provide a server or virtual machine for a monitoring station for free, but I don’t see such requests on the site.
I agree there are a lot of servers in the project and “single station” monitoring errors do not have a significant overall impact, but they are still very bad for the “best nearest” service in zones/areas worldwide.

I know the power of a community. I have been supplying data-center based servers to the NTP pool since 2008 and added the first NTP pool site translation in the Dutch language. What changed over time is an increased distance between the community part, i.e. the servers in the field actually serving the time, and the infrastructure part, i.e. the DNS servers, monitoring system and website. The GitHub account of the project doesn’t contain the current version of the software anymore, and community pull requests take a long time to be considered or are ignored.

The NTP world has changed since fall 2013, as you can see in the server graph I posted earlier in this thread. The massive amplification attack made possible by the ntpd monlist command caused NTP traffic to become prone to filtering by parties all over the world.

The NTP pool project never adapted to this new reality. The monitoring system stayed centralized, with pool servers being kicked on and off the pool depending on whether and how traffic from the single monitoring station was able to reach them.

I have been providing professional NTP services to my customers since 2007. My servers are much more stable than the network connection to the pool monitor, and I have provided my excess bandwidth to the NTP pool until now. But seeing the constant zig-zag patterns in the pool graphs, getting automated emails about bad performance, and even having to relocate some servers between data centers because some data centers cannot be reached by the pool monitor at all have together changed my stance. If Ask is not willing or able to change this, and the community is not allowed to help because the source code repository is out of date and pull requests are ignored, so be it. There is a limit to the power of the community and a limit to the amount of effort I personally want to put into beating a dead horse.

You are right in many ways. I will add this: in my experience of administering projects alone, over time it becomes very tiresome; the project is effectively abandoned and starts running on autopilot. Any desire to make changes, or to do anything at all, disappears, and after a while the project dies. In the case of the pool, it could be taken over by one of the large companies with its own CDN (just as Google created its own DNS).

The problem here is that the NTP protocol, like DNS, is very old, prone to attacks, and not that necessary for end users (even ntpd is old and developed almost single-handedly). Although every SOHO router and toilet bowl synchronizes its time today, many of them don’t really need it. How often do you look at the logs on your SOHO router? Never. If your production equipment needs accurate time, you prefer to use your own NTP or stable national standard servers. I have never used pool servers on production hardware (a lot of that hardware is even cut off from the Internet). I think up to 90% of NTP traffic and servers represent an unnecessary load. Do companies need this? I don’t know.

2 Likes

And again we have connectivity problems at 19:45-20:15, 20:35-21:05 and 21:27-21:29 MSK (GMT+3), and again my score drops to 3.9. It’s kindergarten. Ask, if the pool monitoring drops the score so quickly, then let it restore the score just as quickly.

I guess I will propose within my company that we create a pool zone on our own DNS servers and intercept/redirect our customers’ pool queries to ourselves.

1 Like