As I’ve mentioned before, I believe CGNAT can lead to issues similar to the one you described - even if that wasn’t the case here. That’s why I think we need better IPv6 support in the pool.
I think there still is a misunderstanding:
The point is not that one can do without knowing them.
The point is that knowing them probably would not help in your situation.
Whatever one labels the traffic that is overrunning your server, the server apparently is still being overrun, despite all the countermeasures taken so far. Thus, what good would it be to white list the monitor IP addresses when the monitors’ packets probably get dropped due to overload before they even reach the white list, and subsequently the server behind it?
As was mentioned before, the monitor servers are extremely unlikely to behave in a way that would get them added to a black list as per the criteria you mentioned. And indeed, if the latest revision of the list on GitHub is what you are using, still none of the current monitor IP addresses appear on that black list.
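(If you want to double-check that against whatever revision of the list you are running, a quick sketch along the lines below would do. The file names are just placeholders: a text file with the monitor addresses, and the downloaded list with one address or CIDR prefix per line.)

```python
# Check whether any monitor address is covered by the block list.
# "monitors.txt" and "blocklist.txt" are placeholder file names:
# one monitor IP per line, and one IP or CIDR prefix per line, respectively.
import ipaddress

with open("monitors.txt") as f:
    monitors = [ipaddress.ip_address(line.strip())
                for line in f if line.strip()]

with open("blocklist.txt") as f:
    blocked = [ipaddress.ip_network(line.split("#")[0].strip(), strict=False)
               for line in f if line.split("#")[0].strip()]

for mon in monitors:
    hits = [net for net in blocked if mon in net]
    status = f"covered by {', '.join(map(str, hits))}" if hits else "not on the list"
    print(f"{mon}: {status}")
```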
So, what difference would it make if the monitors were white listed?
Thus, it would be necessary to better understand why your server is apparently being overrun. Various pieces of information, as mentioned throughout this thread, would be helpful for that - among them the still open question of which country zone the server in question is registered in, and, in conjunction with that, what its netspeed setting is.
While it is somewhat speculative without having the information mentioned above, I cannot shake the feeling, in line with @gombadi’s suspicion, that the server in question might be located in what some people call an underserved zone, and/or be configured with too high a netspeed setting for its capabilities in relation to the zone it is in.
If it is indeed the case that the server is located in an underserved zone, and the netspeed setting is already at its lowest “official”* setting, then you might simply be out of luck, as the pool infrastructure currently supports neither clients nor servers in such underserved zones very well, as documented in various threads throughout this forum. And while the issue of underserved zones is a well-known and long-standing one, there is no indication of concrete attempts being made to address it.
So if the server in question is indeed affected by the underserved zone issue, don’t expect any improvement regarding that topic anytime soon.
* “Official” in the sense of what the web UI lets one configure without resorting to manual tweaking of the underlying API.
Today I see one monitor always showing -92.9, while other monitors show positive values.
So I will enable NTP traffic limiting to prevent network downtime, and then block unauthorized access attempts from that traffic.
Glad to hear it.
It’s obviously entirely up to you, and there are valid reasons not to. But if you’d care to share the IP address of the server in question, that would be another good way to allow people to take a look and help figure out what is going on.
I have only this name “usdaa1-1tcp71g”
I think your server’s IP would be more useful
Indeed. Sorry, I should have been clearer.
Still, here is the assessment by that monitor of one of my servers, while all other monitors score full points (20):
usday1-1tcp71g -2.4
It’s a similar picture for some other servers of mine, though far from all of them are seeing a low score from that monitor. So it is probably mostly due to some glitches in the Internet’s forwarding fabric between the remote networks involved.
Thanks, very helpful! In the sense that all seems rather well now. The low score in the table from that one monitor is apparently bogus, because it is reflected neither in the graph nor in the log data.

I have noticed that since around the time of some recent changes in the pool infrastructure, I am seeing similar artifacts in higher numbers. E.g., there have “always” been some discrepancies between the overall score values given in various places at the same time, I guess due to the different lags with which the score value is derived from the underlying data for display in those different places. But recently, I have been seeing much larger discrepancies, for more diverse data points, and for longer, than before those recent changes (not implying a causal link, even though the temporal correlation might suggest one).
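(As an aside, when I want to cross-check such discrepancies, I pull the per-monitor CSV log for the server and compare the most recent sample from each monitor against what the table and graph show. A rough sketch of that is below; the URL pattern and the column names (ts, score, monitor_name) are written down from memory and may need adjusting to whatever the CSV actually contains, and 192.0.2.1 is of course just a placeholder address.)

```python
# Fetch the per-monitor CSV log for a server from its score page and print
# each monitor's most recent sample, to compare against the score shown in
# the table and graph. The URL pattern and the column names (ts, score,
# monitor_name) are assumptions from memory; adjust them to the real header.
import csv
import io
import urllib.request

SERVER_IP = "192.0.2.1"  # placeholder; use the server's actual address
URL = f"https://www.ntppool.org/scores/{SERVER_IP}/log?limit=500&monitor=*"

with urllib.request.urlopen(URL) as resp:
    reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    latest = {}  # monitor_name -> first row seen
    for row in reader:
        # assumed newest-first; if the log is oldest-first, overwrite instead
        latest.setdefault(row["monitor_name"], row)

for name, row in sorted(latest.items()):
    print(f"{name:30} ts={row['ts']} score={row['score']}")
```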
Why there was a near-total outage for most monitors before the morning of 2025-05-05 is strange, though. Especially given that it really looks like a full outage for the majority of monitors, while a small number of others (1 or 2 or so) at least had intermittent connectivity, or recovered noticeably earlier than the majority of the rest.
So I’d suggest you consider whether you changed anything on your end on the morning of the fifth - in your approach to managing the influx, in the configuration of the server (netspeed setting), or anything else - and reflect on how that might have impacted the outcome. That could give a clue as to what was going on. E.g., overzealous blocking, too high a netspeed setting for the capabilities of your system in relation to the zone it is in, …
Not sure you are aware, but there was a major issue with the RU zone not long ago. It wasn’t very stable to begin with, and the roll-out of some buggy software to a huge number of systems (I think it was some kind of wireless speaker system) created a significant surge in traffic that pushed the zone over the edge (see the sharp dip in the number of servers at the end of 2024).
Due to a concerted effort, the surge has been quelled at the source (people from this forum reached out to the vendor, who then fixed the issue and rolled out the new version - one of the rare successful attempts to get an issue fixed at the source), and the number of servers in the zone has increased significantly after a call to action in local communities.
But I don’t have a server in that zone, so I cannot judge how well that zone is being served right now, i.e., what the effective relation between netspeed and actual traffic volume is, and how easy it is to knock oneself out with too high a netspeed setting. Though the complete outage of the majority of monitors suggests this might have been caused by more deliberate, indiscriminate blocking rather than overload (with overload, one would not see a complete outage, but rather a sawtooth pattern as the server moves into and out of the pool).
There is no problem with your server; this monitor does not seem to function at all. Otherwise you would see some red dots in the bottom right part of the image. The monitor will disappear from the list of monitors when its last measurement data falls out of the score file.
Hmm, you are right. I thought I had seen entries for that monitor in the CSV logs of two servers (the one being discussed in this thread, and one of mine) dating to just a few minutes before my last post, but now, checking again, I see the most recent entries are indeed as old as last Sunday. Not sure anymore what it was I looked at earlier… I think I might have mixed up usdaa1-1tcp71g and usday1-1tcp71g…
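(For next time, a quick way to see which monitor last reported, and how long ago, from a downloaded CSV log. The file name and the column names (ts_epoch, monitor_name) are again just my assumptions about the log format:)

```python
# Report, per monitor, the age of its newest entry in a downloaded CSV log,
# to spot monitors whose data is about to fall out of the score file.
# "log.csv" and the column names (ts_epoch, monitor_name) are assumptions.
import csv
import time

newest = {}  # monitor_name -> newest ts_epoch seen
with open("log.csv", newline="") as f:
    for row in csv.DictReader(f):
        ts = float(row["ts_epoch"])
        name = row["monitor_name"]
        newest[name] = max(ts, newest.get(name, 0.0))

now = time.time()
for name, ts in sorted(newest.items(), key=lambda kv: kv[1]):
    print(f"{name:30} last seen {(now - ts) / 86400:5.1f} days ago")
```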
This question has been on my mind for a while, and perhaps someone has already explained it here before, but what reference clocks do the monitoring systems themselves rely on? Are they independent of the pool?