You’re missing case #5:
- User configures a bandwidth setting on their server that is far higher than it can handle, and DDoSes themselves or nearby third parties.
This isn’t such a big deal in well-served zones like the US, where a server is lucky to see even 20% of its requested traffic level. Half of the pool can disappear with little effect on pool users or server operators. Arguably, it would be better to tune the scoring so the bottom 10% of the pool is rejected all the time, i.e. silently drop the servers below the 10th percentile accuracy to raise the average accuracy of the pool. I’m not proposing we do that, just pointing out that “include as many nodes in the pool as possible” optimizes one quality metric at the expense of another.
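For concreteness, a minimal sketch of what that culling rule might look like, assuming scores are available as a simple map of server address to accuracy score (the names are illustrative, not the pool’s actual schema):

```python
import statistics

def cull_bottom_decile(scores: dict[str, float]) -> dict[str, float]:
    """Silently drop servers below the 10th percentile accuracy."""
    # quantiles(n=10) returns 9 cut points; the first is the 10th percentile.
    cutoff = statistics.quantiles(scores.values(), n=10)[0]
    return {addr: s for addr, s in scores.items() if s >= cutoff}
```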
In underserved zones like Asia, a pool node can get many times its requested traffic level. A naive user can easily trigger a devastating tsunami of packets landing on their NTP host, or on their entire country. Fast eviction from the pool is important, because the DDoS effect gets worse the longer the target IP address is in the pool: more clients learn the address and keep querying it long after it leaves DNS. We can’t rely on the target removing themselves from the pool because:
- the target might be unable to reach the ntppool admin interface because they are flooded by NTP packets
- the affected entity might be a third party (wrong IP address or shared network infrastructure) who has no idea what the NTP pool is or how to make the NTP traffic stop
The second case is especially ugly for the NTP pool: abuse complaints, multi-ISP cooperative investigations, all the expensive, non-revenue-generating things ISPs hate to do. In the worst cases, ISPs start dropping NTP packets at their borders and evicting NTP pool servers from their hosting services because they’re too much hassle.
When in doubt, refrain from pointing a firehose of network traffic at inexperienced strangers, and be ready to turn it off immediately at the first sign of trouble.
> Start with monitor Newark… [lots of conditionals and communication between monitor servers]
That sounds far more complicated than it needs to be. There are only 4000 servers in the pool to monitor, and each monitor can score them all.
“Run N monitor servers, use the score from the monitor in the same zone if there is one, and the highest score if there isn’t” is probably good enough for small N. A truly unreachable or broken host will have a low score on every monitor, while a host with half-broken, half-working peering will have a high score on at least one monitor. A bad monitor will have a much lower average score for its pool nodes than its peers. If N is 10 or more, the median or 90th percentile score can be used to weed out false positives and negatives. The DNS service can choose the percentile to set its tolerance for network partitioning failures.
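A sketch of that per-server selection rule, assuming each monitor reports a (zone, score) pair for each pool server; the types and the N ≥ 10 cutover are assumptions for illustration, not the pool’s actual implementation:

```python
from dataclasses import dataclass
from statistics import median, quantiles

@dataclass
class Observation:
    zone: str     # zone the monitor sits in, e.g. "us" or "asia"
    score: float  # that monitor's score for one pool server

def effective_score(server_zone: str, obs: list[Observation],
                    percentile: int = 50) -> float:
    # Prefer a monitor in the server's own zone, if any.
    same_zone = [o.score for o in obs if o.zone == server_zone]
    if same_zone:
        return max(same_zone)
    scores = [o.score for o in obs]
    # Small N: take the most optimistic view, so a host with
    # half-broken peering isn't evicted by one unlucky monitor;
    # a truly dead host scores low everywhere and is still evicted.
    if len(scores) < 10:
        return max(scores)
    # Larger N: a percentile (50 = median, 90 = 90th) weeds out
    # false positives and negatives; the DNS service picks the
    # value to set its tolerance for network partitioning failures.
    if percentile == 50:
        return median(scores)
    return quantiles(scores, n=100)[percentile - 1]
```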
The scoring algorithm itself doesn’t need to change, other than to pick a favorable observation point to measure the score from. If the monitoring servers can reach the NTP pool hosting networks reliably, most pool nodes will have no problem staying in the pool. Dropping out of the pool for an hour or two of maintenance and score ramp-up barely affects the query rate, and ntpdate users will be glad that servers undergoing maintenance are excluded from the pool for their one-shot queries.
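As a rough illustration of why the ramp-up is cheap: with an exponentially-weighted score of the kind the pool uses, a returning server climbs back over the inclusion threshold in a bounded number of successful checks. The constants below (0.95 decay, +1 per good check, −5 per bad one, score above 10 to be included) are assumptions for illustration, not the production parameters:

```python
def update(score: float, ok: bool) -> float:
    # Each check pulls the score toward step / (1 - 0.95),
    # i.e. +20 for steady successes or -100 for steady failures.
    step = 1.0 if ok else -5.0
    return score * 0.95 + step

score, checks = 0.0, 0
while score <= 10.0:  # assumed inclusion threshold
    score = update(score, ok=True)
    checks += 1
print(f"back in the pool after {checks} successful checks")  # 14
```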
N > 1 monitors has been a work in progress entering its fourth year this month. It’s fun to talk about how we’d use 10 or 50 monitoring stations, but it’s a moot point while there’s still only one monitor running in production.