Disable monitoring emails for a single server

Hey,
is it possible to disable the notification emails for a single server? I am getting 5-10 emails a day for one of my servers (5.223.49.159). But there is nothing I can do about that: the server is stable, the datacenter is stable and the server keeps perfect time. According to staff this is an issue with monitoring. So is there a way to disable the emails for this server until the monitoring in this region is fixed?

Thanks and greetings

I don’t know what exactly you have been told, but based on pool.ntp.org: Statistics for 5.223.49.159 I wouldn’t blame monitoring. When a problem with your server occurs, multiple monitors detect the issue. This is something on your end. Fixing the problem will also stop the nag emails.

Here’s a screenshot of the score page for reference:

Maybe your server or the connection can’t handle the load. Singapore is a fairly busy zone. Try setting the speed on the pool server management page to the lowest setting (512 Kbit) and wait a few days. If the server seems to handle the lower load, you can try increasing the speed setting gradually.

I was told exactly that. According to the email the monitoring is currently not optimal in this region. So servers get flagged as bad for no reason.

Trust me, it can handle the load. :wink: I have 48 cores, 128GB RAM, 4TB M.2 SSD and a 10Gbit connection directly to the rack router and currently there is nothing else on that server except ntpd. (The server has to do a special task once a week for 90 minutes, the rest of the time it is idling.)

There are currently only around 20-50 requests per second on that server. So being too busy is clearly not the issue.

I would assume that a lot of the monitors are in EU/US and have bad routing to this location. That would explain the issues and that’s basically what I was told in the email.

The reason why you’re right now receiving only around 20-50 requests per second is that your server’s score is less than 10 and it’s not included in the pool DNS rotation. I’m guessing that once the server’s score exceeds 10 your server will be included in the DNS rotation again and it’ll start getting quite a lot more traffic again, which may cause the experienced issues.

One of my pool servers is approximately 1 ms away from your server and it seems to be doing fine:
https://www.ntppool.org/scores/94.237.79.110
https://kaguaani.miuku.net/stats/ntppackets.html
The most usual problem is some sort of overflowing connection tracking, either at the server or some upstream firewall. Disabling connection tracking for NTP traffic (both inbound and outbound) may help. It really does not matter how many cores your server has if the firewall has been misconfigured. For reference, my NTP server in Singapore has exactly one CPU core and 1GB of RAM.

If this is a server at Hetzner, they may also have some sort of traffic limits in place. Maybe ask them as well.

That is interesting. I have a great connection to your server as well. Around 11 hops according to traceroute. The firewall configuration and traffic limits are the same as on the other servers I added to the pool. Some in this account, some in the company account (around 20 servers) and they all reach a score of 20 without any issues.

So I would assume that it’s not an issue with my tech stack and my company monitoring proves me correct. Uptime around 80 days.

If I read the monitor logs correctly, this server was never part of the pool before. At least not when I owned the IP address. Always right before I reach 10, I get dropped back again.

Correct. I just used this as an example showing that the server is not busy with other tasks that would cause lag.

So how do we get this fixed? Could someone else see if they have connection problems to this datacenter? Maybe some traceroutes would help?

In all seriousness, try setting the netspeed to the lowest speed on the pool management page. Doing so will give us an additional data point for troubleshooting. Then we can figure out what to do next.

Just as reference: With a 3MBit/s setting for IPv4, I get peaks of up to 6 MBit/s (sometimes a bit more) on one of my two instances in Singapore (including less than 250 kbit/s IPv6 traffic at a 3Gbps setting).

On another instance, with the minimum setting of 512 kbit/s, I get peaks above 1 Mbit/s (which is why IPv4 is disabled on that instance as it exceeds the contracted bandwidth, only IPv6 is enabled at the 3 Gbps setting).

Looking at https://www.ntppool.org/scores/5.223.49.159/log?limit=20000&monitor=recentmedian, one can see that your server has passed a score of 10 several times. I haven’t thoroughly checked every instance, but it looks as if the score always dropped shortly, almost immediately, after the threshold of 10 was crossed. That strongly suggests that the issue is traffic load related.
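The pattern described here (score crosses 10, then falls back almost immediately) can also be searched for mechanically instead of by eye. The following is only a minimal sketch: it assumes the overall scores have already been extracted from the log into a plain list ordered by time, and the function name, the `window` parameter and the sample values are all made up for illustration.

```python
def drops_after_crossing(scores, threshold=10.0, window=5):
    """Return the indices at which the score rises to `threshold` or above
    and then falls below it again within the next `window` samples."""
    hits = []
    for i in range(1, len(scores)):
        crossed_up = scores[i - 1] < threshold <= scores[i]
        if crossed_up:
            following = scores[i + 1 : i + 1 + window]
            if any(s < threshold for s in following):
                hits.append(i)
    return hits


# Made-up sample values, NOT taken from the real score log.
sample = [7.5, 9.1, 10.3, 10.8, 4.2, -3.0, 1.0, 5.5, 8.9, 10.1, 2.0]
print(drops_after_crossing(sample))  # [2, 9]
```

If nearly every upward crossing shows up in the result, that would support the suspicion that the drops are tied to the server entering the DNS rotation rather than to random monitoring noise.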

Unlike other cases, however, the drop in score is way more pronounced, going way into the negative area (as nicely visible in the graph as well). That strongly suggests that this is not just some overload that would go away once the load eases a bit, causing kind of a saw-tooth pattern with a relatively small amplitude, but that something is seriously blocking traffic in response to this overload, and for longer than the actual overload condition itself probably persists*.

I.e., my suspicion would also be that there is some “protection” mechanism in place somewhere that kicks in upon the sudden onslaught of traffic when the score threshold of 10 is crossed*.

So as @avij suggests, I concur that the way to systematically troubleshoot this is by starting with the lowest “netspeed” setting available, and take it from there. I.e., see what happens, and slowly increase until the characteristic pattern starts again, and then investigate what is going on around that threshold.

And just to re-emphasize what @avij already stated as well: The mere bandwidth of the link is no indicator of how well the system can handle loads of NTP traffic. NTP traffic does not consist of relatively few large packets, but of a very large number of very small packets from a huge number of different sources.

Handling high rates of small packets is something that network equipment typically is a bit more challenged in dealing with than with larger-packet traffic because it needs to do more lookups and processing for the same amount of data transferred. Where only one forwarding lookup and passing of all the handling up and down the protocol stack is needed for a single 1500 octet packet like, e.g., common for video streaming, the system needs to do almost 20 lookups when carrying the same amount of data in small NTP packets.

In addition, the connection tracking of stateful firewalls/NAT/port forwarding mechanisms can very easily get overloaded by the sheer number of traffic sources a server sees when included in the pool.
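The "almost 20 lookups" figure from the packet-size comparison above can be reproduced with a quick back-of-the-envelope calculation. This is a rough sketch only; the header sizes are my assumptions (plain 48-byte NTP message, UDP over IPv4 without options, link-layer framing ignored).

```python
# Assumed sizes in bytes; real traffic may carry NTP extension fields or use IPv6.
NTP_PAYLOAD = 48      # basic NTP message without extensions
UDP_HEADER = 8
IPV4_HEADER = 20      # no IP options
LARGE_PACKET = 1500   # octets, the packet size used in the comparison above


def ntp_packets_per_large_packet(large=LARGE_PACKET):
    """How many NTP/UDP/IPv4 packets carry as many octets as one large packet."""
    ntp_ip_packet = NTP_PAYLOAD + UDP_HEADER + IPV4_HEADER  # 76 bytes
    return large / ntp_ip_packet


ratio = ntp_packets_per_large_packet()
print(f"{ratio:.1f} NTP packets per 1500-octet packet")  # 19.7
```

So for the same number of octets, a device on the path has to make roughly twenty forwarding (and, where stateful, connection tracking) decisions instead of one.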

Note that ISPs may have different traffic limits in different regions, even if everything else regarding the setup appears to be the same. E.g., I have the impression that bandwidth is more expensive in Asia than, e.g., in Europe. E.g., I only found out by re-reading the fine-print on the first instance mentioned above that for the Singapore site and another site in the wider region of that hoster, there is a traffic volume limit in place that does not exist in the European and North American data centers (which is another reason why it’s only at the 3Mbit/s setting even though the 1 core/1 GB memory instance could handle loads of 20Mbit/s actual traffic and more).

* Note in this context that due to past exploits of NTP for traffic reflection and amplification attacks, many “protection” systems (and admins) even nowadays are very sensitive to NTP traffic, even when current NTP implementations are no longer susceptible to that kind of attack. E.g., that is one reason why my IPv4 instance is limited to the 3Mbit/s setting only, as the next higher value would drive the traffic volume into regions where my ISP’s protection mechanisms occasionally trigger traffic blocking for 15 minutes because of such a suspected reflection/amplification attack (even though the traffic itself clearly does not fit the pattern of such an attack, e.g., no amplification whatsoever).

I assume you have reduced the “netspeed” setting since yesterday as the score pattern looks “better” now. In the sense that while there are still drops, they don’t dip into double-digit negative values anymore, but exhibit this smaller amplitude sawtooth pattern that I was referring to. It’s not easily visible in the graph, unfortunately, as the overall score (“recentmedian”) gets drowned out among the score points of the individual monitors. But it is visible in the data. And when you click/tap on the “overall score” line in the table, the other scores get faded a bit so that the overall score becomes a bit more visible (might take a few seconds after clicking/tapping).

Also started a measurement yesterday to collect some additional data points:

https://atlas.ripe.net/measurements/80809594/results/

This also reflects the behavior. The good news is that this doesn’t seem to show the kind of (near) complete blocking that was visible yesterday, but rather bursts of packet drops.

Looking at the RTT of those probes that get through even when there are drops nearby, one can see that the latency doesn’t seem to go up noticeably. To me, that indicates that it is likely not basic congestion/processing overload somewhere that is causing those drops, as that typically would entail increased latency as packets get queued up in various network buffers (even if buffer bloat shouldn’t be a big thing anymore these days).

Rather, as also hinted at by @avij, it seems more likely that some connection tracking table is being overwhelmed, resulting in packet drops. Or something is doing some explicit rate limiting.

E.g., while you write that the node is mostly not doing anything major from a CPU point of view, maybe it is doing some light other work that is relevant from the networking point of view described above.

E.g., if the node is acting as DNS server as well, that traffic could also indirectly affect the NTP traffic. Not from a volume/throughput point of view, but, e.g., filling the connection tracking tables of some unit handling the combined traffic. And maybe it is not the node hosting the NTP service itself that is causing this overflow, but another node sharing the same upstream infrastructure.

Hope this gives you some hints as to what to further look at. But keep in mind, it is difficult to diagnose an issue such as this from afar with certainty, based only on the few visible data points available. Those suggest a high likelihood for the analysis described above, but it could also be very different, which could be visible in data/observations/information not available to us.

Always on the lookout for interesting hosting offerings, I took a look at the Hetzner cloud offering, and there are a few items in there that relate to what was discussed above.

  • “Our stateful Firewalls make it easy to secure your infrastructure […]” The operative word here being “stateful”, hinting at the potential of connection tracking table overflow. Or if port forwarding from some public IP address to an internal, private one is needed (“Floating IPs”).
  • “Hetzner […] safeguard your […] cloud servers using the latest hardware appliances and sophisticated perimeter security technologies, providing you with first-rate protection against large-scale DDoS attacks.” Hinting at there being protection mechanisms in place. While they say those are the “latest” and “sophisticated”, it would really need to be seen how those are effectively treating high volumes/rates of NTP traffic.
  • “You’ll get at least 20 TB of inclusive traffic for cloud servers at EU and US locations and 1 TB in Singapore.” Hinting at traffic limits in Singapore being lower than for otherwise identical instances in other locations. Though unlike my provider, they seem to just charge more for traffic above the included volume, rather than throttling traffic when the quota is exceeded. But it might impact how sensitive the firewall and other protections react to high volumes/rates of NTP traffic vs. how they are tuned in other locations.
  • “Let your servers communicate through a private network and setup complex network topologies.” This, and other items I don’t quote here for brevity, hint at multiple servers potentially sharing the same infrastructure for their uplink to the Internet. I.e., there could be a base load from some other server/service on that uplink infrastructure, and adding the NTP traffic from the pool on top of that just pushes that infrastructure “over the edge”.

As I type this note, the server’s score is 19.0, which is good. Glad that progress is being made.

I’ve been monitoring the server closely from my monitors.

I have some concerns. See top chart. The server’s time is drifting noticeably. This may be driven by the “time since last update”. See RFC 5905:

Reference Timestamp: Time when the system clock was last set or
   corrected, in NTP timestamp format.

I don’t know what causes the sawtooth, but something isn’t right. If the server’s time is drifting, the NTP software should be making adjustments.
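For anyone who wants to look at the "time since last update" themselves: it can be derived from a captured server reply as the transmit timestamp minus the reference timestamp. The snippet below is a minimal sketch assuming the standard 48-byte NTP header layout from RFC 5905 (reference timestamp at bytes 16-23, transmit timestamp at bytes 40-47); the function names and the synthetic packet are mine and purely illustrative, this is not how my monitors compute it.

```python
import struct

NTP_UNIX_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def ntp_ts_to_unix(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp (era 0) to Unix seconds."""
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32


def seconds_since_last_update(packet):
    """Transmit timestamp minus reference timestamp of an NTP header."""
    if len(packet) < 48:
        raise ValueError("an NTP header is at least 48 bytes")
    ref_s, ref_f = struct.unpack("!II", packet[16:24])
    tx_s, tx_f = struct.unpack("!II", packet[40:48])
    return ntp_ts_to_unix(tx_s, tx_f) - ntp_ts_to_unix(ref_s, ref_f)


# Synthetic header with invented timestamps, only to exercise the parser.
header = bytearray(48)
header[0] = 0x24  # LI=0, VN=4, Mode=4 (server)
struct.pack_into("!II", header, 16, 3_900_000_000, 0)      # reference timestamp
struct.pack_into("!II", header, 40, 3_900_000_064, 2**31)  # transmit timestamp
print(seconds_since_last_update(bytes(header)))  # 64.5
```

A value that keeps growing between corrections and then snaps back would be one way a sawtooth could show up in the data.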

The main time sources are Cloudflare (stratum 3) and Google (stratum 1). Both of these use IP anycast, so I can’t check their status directly. Both organizations have good history with their NTP servers’ accuracy.

Loss is much better than a week ago, but there are some patterns I don’t understand.

If you want to further discuss the server accuracy we’ll need software details.

As a complementary observation, it looks as if the packet loss is not specific to NTP. Here are the latency graphs of ongoing pings from four vantage points within Singapore, with vertical red dashed lines indicating packet loss:

Live updated observations can be viewed at https://atlas.ripe.net/latencymon/80924914/.

Hey,
sorry for replying so late, I had to check back with Hetzner to make sure that there is nothing going on on their part.

I took the server out of the pool into monitoring-only mode. It stabilized a bit, but it still randomly drops into negative ratings. But not as frequently as before.

Yeah, I never go on traffic alone. PPS is much more important in the case of NTP. Hetzner has an optional firewall that comes with a limit of 10 000 new connections per second. However, I never use this firewall (precisely because of these limitations) and I always use my own where no such limits apply.

True, but I checked with Hetzner support. According to them there are no such limits.

The server is adjusting the time to keep it correct.

$ doas ntpctl -s status
8/9 peers valid, constraint offset -1s, clock synced, stratum 3

My config looks like this on the latest OpenBSD release.

listen on *

servers time.google.com weight 10
servers time.cloudflare.com weight 10
servers ntp1.hetzner.de weight 5
servers ntp2.hetzner.com weight 5
servers ntp3.hetzner.net weight 5

constraints from "https://www.google.com"
constraints from "https://www.microsoft.com"

sensor *

I could get rid of the Hetzner time servers in this config. They are basically just a backup and Google and Cloudflare are preferred.

   wt tl st  next  poll          offset       delay      jitter
216.239.35.4 from pool time.google.com
    5  6  1    1s    9s        -4.281ms     1.440ms     0.079ms
216.239.35.8 from pool time.google.com
    5  8  1    5s   31s        -4.245ms     1.134ms     0.055ms
216.239.35.12 from pool time.google.com
    5 10  1    4s   33s        -4.234ms     1.524ms     0.099ms

Thanks for the screenshot. I sent it to Hetzner support, so they can take another look at it.

So far, I don’t think it’s my setup or something internal at Hetzner. When I use another server in that rack to really stress the NTP daemon, it’s no problem at all, even with a couple of million packets sent at the server. So my guess is that somewhere upstream, some peering has a firewall that is very aggressive at filtering out too many UDP connections from different sources.

I am currently waiting to hear back from support. I will keep you updated.

Thanks for all the replies!

@stevesommars and others have more experience regarding NTP server implementations and their behavior to make more definitive statements about OpenNTPD and its suitability for the pool, but I think that this implementation might be a possible explanation for the oddities with the time the instance is serving that @stevesommars pointed out, and that are now also better visible in the monitoring graph (besides the packet drops). See the frequently rather unsteady green lines especially during times with increased packet drops.

The implementation might not be causing the packet drops (but it also might), but it doesn’t seem to deal well with them. The varying offsets could obviously also be caused by varying delay asymmetries related to the same (or other) causes also behind the packet drops, but the delay variations don’t seem to reach a level that an NTP daemon shouldn’t be able to handle a bit more gracefully than what we see.

Do you have the chance to quickly, without too much effort, temporarily spin up another NTP daemon implementation instead of OpenNTPD, one of chronyd or NTP classic or NTPsec? Just to see how that one would behave under the same circumstances.

I think there is vastly more experience with running one of those and their behavior than with OpenNTPD. That would give a much better baseline to do further troubleshooting.

I infer that the server currently is at the 6Mbit setting. Would it be possible to reduce that to 512Kbit and leave it there for the time being?

You obviously know when you change the setting, and can correlate that with subsequent behavior at that setting, as you describe. For the rest of us, though, it is more difficult to correlate a behavior with a setting when we don’t know for sure what the latter is, or when it changes.

Here is the daily load profile of my main Singapore instance, the load changes throughout the day are way more pronounced than on my German instances. I think with a bit of imagination, one can see some correlation with the error profile visible, e.g., in above offset/score graph. (Not sure why there apparently is a higher peak yesterday, but this is all traffic, not NTP only.)

Hi @leo, I’d like to bring this issue to some sort of a conclusion.

According to the score page your server is still experiencing problems whenever the score increases to over 10 and the server gets included in the pool.

To refute some of the claims about bad monitors or bad routing, I set up a temporary server at Hetzner’s cloud in the same datacenter in Singapore. Its score page looks much better. Note that this server is in “monitoring only” so it receives very little NTP traffic compared to the other servers. Nevertheless, this temporary server’s score should prove that the pool monitors work fine and the general routing towards Hetzner’s SG datacenter is also working properly.

Here are some statistics that may help with your diagnostics:

As measured from another server in Singapore but not at Hetzner, whenever ping shows packet loss to your server, there’s also NTP packet loss. This packet loss does not occur when I measure packet loss to my temp server at Hetzner (except briefly on Monday for whatever reason).

Edit: I thought I had already set up monitoring from the temp server to your server and my other server, but apparently it hadn’t been set up properly. In any case, here are fresh stats that I started collecting today. They will likely start showing useful data in a few days.

Edit2: Even though there’s only two hours’ worth of data in these latter statistics at the moment, I find it concerning that these stats already show packet loss. Traceroute from 5.223.43.189 to 5.223.49.159 shows only five hops, all within Hetzner’s datacenter. At the same time, traffic to/from this temporary server to a different server in Singapore at another ISP seems to flow fine.

I hope your server doesn’t have Hetzner’s “protection” enabled:

Did you really receive a couple of million packets back? Because that’s what counts – the server’s ability to respond to queries. Try again and record (with tcpdump or similar) the actual traffic and the responses. If the server starts dropping requests, it will decrease your score in the pool as expected.
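To illustrate what I mean by counting responses rather than requests sent, here is a minimal sketch of a probe that sends SNTP client queries one at a time and counts the valid server replies. The function names, defaults and pacing are my own choices, it only handles IPv4, and a late reply to an earlier query would be counted against a later one. Only point it at a server you operate, and keep the rate low.

```python
import socket
import time


def build_sntp_request():
    """48-byte client request: first byte 0x23 = LI 0, VN 4, Mode 3 (client)."""
    return b"\x23" + bytes(47)


def is_server_reply(data):
    """True if the datagram is long enough and carries Mode 4 (server)."""
    return len(data) >= 48 and (data[0] & 0x07) == 4


def count_replies(host, queries=20, timeout=1.0, interval=1.0, port=123):
    """Send `queries` requests, one every `interval` seconds, and return
    how many valid server replies arrived within `timeout` each."""
    answered = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(queries):
            sock.sendto(build_sntp_request(), (host, port))
            try:
                data, _addr = sock.recvfrom(512)
            except socket.timeout:
                data = b""
            if is_server_reply(data):
                answered += 1
            time.sleep(interval)
    return answered


# Example (not executed here): count_replies("your.server.example", queries=60)
```

Comparing the returned count with the number of queries sent, from a vantage point outside the datacenter, gives the loss figure that actually matters for the pool score.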

Quite frankly, if your company monitoring says your NTP server is doing fine, your company monitoring is not doing its job properly. Maybe your company monitoring monitors from a rack next door to your NTP server, hiding problems elsewhere in the network. Maybe set up additional monitoring from an entirely different location, if you don’t trust the pool monitoring.

Pinging @leo again. I’m fairly confident that the problem you are seeing can be resolved one way or another by either yourself or Hetzner.

I’m planning to shut down my temporary server next week. I have not shut it down yet, because it may be useful for diagnostics purposes, ie. you can show Hetzner that this temporary server is doing fine in the pool and yours isn’t.

Hey,

yeah, the server is still struggling a bit in the ratings. But I was busy with work the last few days so I did not look into it further.

I get what you mean, however if I put my server into monitoring-only mode I get perfect scores as well. So that’s not really helpful.

My guess is that somewhere upstream someone has a router that cuts off traffic at certain limits. But that’s not in the Hetzner DC or in my internal setup. I am not quite sure how we could debug this.

I would suggest leaving it for now, maybe that person fixes/upgrades their router and that automatically fixes the issue.

Yes of course, I stressed the server a lot and I always record network traffic using tcpdump for 2 hours or up to 5TB (whichever is reached first, company policy).

Yeah no, that monitoring is solid.

I am thankful for your help looking into this. I would recommend we leave this for now. Let’s wait a couple of weeks/months and see, maybe the defective router gets replaced/updated on its own by someone else.

Greetings
Leo

I’m not giving up yet, because Singapore is a fairly busy zone and additional properly working servers would be very welcome to share the load.

I changed the netspeed of my temporary server to 6 Mbit/s two weeks ago and it seems it’s handling peaks of 32k requests/sec just fine.
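For a sense of scale, a request rate can be converted into an approximate bit rate per direction. This is just a rough sketch: the 76 bytes per packet (48-byte NTP message plus UDP and IPv4 headers, no link-layer framing) is my assumption, and the function name is made up.

```python
def ntp_rate_mbit(requests_per_second, ip_packet_bytes=76):
    """Approximate one-direction bit rate in Mbit/s for a given NTP request rate."""
    return requests_per_second * ip_packet_bytes * 8 / 1_000_000


print(ntp_rate_mbit(32_000))  # 19.456
```

Under that assumption the peak corresponds to roughly 19 Mbit/s in each direction, well above the configured 6 Mbit/s, which fits the earlier observation in this thread that actual traffic can exceed the netspeed value.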



I don’t know the netspeed setting of your server, but it must be at most 6 Mbit/s because at 12 Mbit/s the server would get added to the global @ zone and yours doesn’t seem to be in it.

Do note that there is also packet loss between your server and my temporary cloud server at Hetzner. There are five network hops between these servers, all within Hetzner. Therefore I don’t think “wait and see” is going to be the winning strategy here.

Hey,

that looks good. I mean there is no reason even a halfway modern server should be unable to handle a couple thousand requests a second. Especially with small NTP packets, that’s nothing.

$ traceroute sgh.miuku.net
traceroute to sgh.miuku.net (5.223.43.189), 64 hops max, 40 byte packets
 1  172.31.1.1 (172.31.1.1)  5.171 ms  1.491 ms  1.345 ms
 2  26688.your-cloud.host (5.223.8.9)  0.357 ms  0.289 ms  0.198 ms
 3  * * *
 4  26721.your-cloud.host (5.223.8.36)  2.015 ms  0.303 ms  0.208 ms
 5  sgh.miuku.net (5.223.43.189)  0.455 ms  0.312 ms  0.276 ms

When I ping you back everything seems fine so far. I will keep a ping running overnight and observe the stats (I do not usually log ICMP traffic).

But yeah, it seems that there is a problem somewhere in there. I mean, I am not married to the IP address of that server. I think I could just move this to another one of the host instances over there. Maybe this host from Hetzner is just defective or so.

The interesting thing is, I am a long-time Hetzner customer, since the early days. So I know a lot of people there and I asked them about this, mostly to get non-responses. They even closed the currently open ticket for that issue with some bullshit arguments that looked a lot like a ChatGPT response.

If the overnight ping does not show anything useful, I will go ahead and move this instance to a new host. Maybe that will show us something.

Greetings
Leo