Ville Mattila
2013-Nov-06  13:26 UTC
[nsd-users] Frequent RRL false negatives when using multiple server processes on Linux
Hi,

Please advise how to use Response Rate Limiting on a server which has multiple NSD server processes (nsd.conf server section has server-count > 1).

We have a problem with NSD v3.2.16 repeatedly unblocking and blocking again a single source which is flooding positive queries at a ~steady 700 qps rate. The rrl-ratelimit setting is the default 200 qps. The unblock-block happens multiple times a minute. This is causing false negatives: NSD bursts out 200 responses on every unblock:

Nov 6 10:11:18 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:19 dnstest1 nsd[6876]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:20 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:21 dnstest1 nsd[6875]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:23 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:25 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:27 dnstest1 nsd[6879]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:28 dnstest1 nsd[6877]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:29 dnstest1 nsd[6879]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:29 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:30 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:42 dnstest1 nsd[6878]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:11:42 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:12:30 dnstest1 nsd[6877]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:12:31 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:12:31 dnstest1 nsd[6882]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:13:30 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:13:31 dnstest1 nsd[6876]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov 6 10:14:31 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS

Noting how the PIDs change in the log lines, I'm guessing that the operating system's (RHEL 6; Linux kernel v2.6.32) process scheduler decides every now and then to start using a different NSD server process to handle the incoming data on the socket / NIC receive queue. The newly chosen process has an empty rrl hash bucket for the flooding source/type, and it starts blocking only after sending 200 replies.

(NB: The behaviour/interval of changing to a different process may depend on what NIC / Linux kernel version / cpu scheduler / irq&cpu affinity settings etc. one is using, and of course cannot be controlled by NSD. In this example case the query flood source is our lab nameserver 193.166.5.1 itself, but I'm afraid we can expect our production Linux server to behave ~similarly with external flood sources.)

If my guess is correct I think the options would be:

1. Do nothing and use RRL even though it's per-process. Even if the flood gets unblocked multiple times a minute, RRL may still make the attack ineffective enough.

2. Make use of the multiple receive queues / irq affinity of the server network interface card so that queries from a specific source IP always end up being processed by the same CPU, and configure process scheduling to tie a single NSD server process to each of those CPUs. (Too complex for us! And of course this has its drawbacks, too, wrt load distribution at least. And unfortunately our Intel igb NICs can only choose the receive queue based on IPv4 srcip,dstip tuples, while all IPv6 packets always end up in the same queue.)

FWIW, the unblocking seems to be triggered every time by this, around line 425 of rrl.c from nsd-3.2.16:

-----
    } else if(now - b->stamp > 0) {
        /* older bucket */
        int olderblock = used_to_block(b->rate, b->counter, lm);
        rrl_attenuate_bucket(b, now - b->stamp);
        if(olderblock && b->rate < lm)
            rrl_msg(query, "unblock");
        b->counter = 1;
        b->stamp = now;
    }
-----

Thanks,

-- 
Ville Mattila, CSC
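The per-process effect guessed at above can be illustrated with a small, self-contained sketch. This is not NSD code: rrl_sketch_responses() and its parameters are hypothetical names, and the model is deliberately simplified to a fixed one-second window with one private counter per process, whereas NSD's rrl.c tracks a smoothed rate per hash bucket. It assumes the kernel hands the flood to the next process after every switch_every queries.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of NSD.
 * Counts how many responses a single flooding source receives within one
 * second when each of nproc processes keeps its own private counter and the
 * flood moves to the next process after every switch_every queries.
 * Returns -1 if memory cannot be allocated.
 */
int rrl_sketch_responses(int qps, int limit, int nproc, int switch_every)
{
    assert(qps >= 0 && limit >= 0 && nproc > 0 && switch_every > 0);

    /* One private counter per process, all starting empty. */
    int *counter = calloc((size_t)nproc, sizeof *counter);
    if (counter == NULL)
        return -1;

    int sent = 0;
    for (int i = 0; i < qps; i++) {
        /* Process that happens to receive query number i. */
        int p = (i / switch_every) % nproc;
        if (counter[p] < limit) {
            /* This process has not reached the limit yet: it answers. */
            counter[p]++;
            sent++;
        }
    }

    free(counter);
    return sent;
}
```

Under these assumptions, rrl_sketch_responses(700, 200, 1, 700) returns 200 (one process sees the whole flood and caps it), while rrl_sketch_responses(700, 200, 8, 100) returns 700, because no single process ever counts more than 100 queries and so none of them reaches the limit.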
Matthijs Mekking
2013-Nov-07  09:26 UTC
[nsd-users] Frequent RRL false negatives when using multiple server processes on Linux
Hi Ville,

On 11/06/2013 02:26 PM, Ville Mattila wrote:
> Please advise how to use Response Rate Limiting on a server which has
> multiple NSD server processes (nsd.conf server section has
> server-count > 1).
>
> [...]
>
> If my guess is correct I think the options would be:
> 1. Do nothing and use RRL even though it's per-process. Even if the
> flood gets unblocked multiple times a minute, RRL may still make the
> attack ineffective enough.
> 2. Make use of the multiple receive queues / irq affinity of the server
> network interface card so that queries from a specific source IP
> always end up being processed by the same CPU, and configure process
> scheduling to tie a single NSD server process to each of those CPUs.
> (Too complex for us! And of course this has its drawbacks, too, wrt
> load distribution at least. And unfortunately our Intel igb NICs can
> only choose the receive queue based on IPv4 srcip,dstip tuples, while
> all IPv6 packets always end up in the same queue.)

Yes, this is a problem. A third solution could be to maintain the RRL table globally, so that all processes make use of the same table. However, I suspect that such a solution would have a huge performance impact.

I am afraid that option 1 (do nothing) is currently the best option. You may want to tweak the threshold a bit: lower "rrl-ratelimit:" to shorten the period during which false negatives are sent.

Best regards,
Matthijs
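The "third solution" mentioned in the reply, one table shared by all server processes, could look roughly like the sketch below. It is only an illustration under stated assumptions, not how NSD works: struct shared_bucket, bucket_allow() and rrl_shared_sketch() are hypothetical names, the table is reduced to a single bucket with no time window or hashing, and it assumes Linux with lock-free C11 atomics placed in an anonymous shared mapping created before fork().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical single bucket shared by every server process. */
struct shared_bucket {
    atomic_int queries;   /* queries counted by all processes together */
    atomic_int responses; /* responses actually sent */
};

/* Returns non-zero while the shared count is still below the limit. */
static int bucket_allow(struct shared_bucket *b, int limit)
{
    return atomic_fetch_add(&b->queries, 1) < limit;
}

/*
 * Forks nproc children that each handle per_proc queries from the same
 * source and returns the total number of responses sent, or -1 on error.
 */
int rrl_shared_sketch(int nproc, int per_proc, int limit)
{
    assert(nproc > 0 && per_proc >= 0 && limit >= 0);

    /* Anonymous shared mapping: stays shared across fork(). */
    struct shared_bucket *b = mmap(NULL, sizeof *b, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED)
        return -1;
    atomic_init(&b->queries, 0);
    atomic_init(&b->responses, 0);

    int started = 0;
    for (int p = 0; p < nproc; p++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: every query is checked against the shared counter. */
            for (int i = 0; i < per_proc; i++)
                if (bucket_allow(b, limit))
                    atomic_fetch_add(&b->responses, 1);
            _exit(0);
        }
        started++;
    }

    /* Parent: wait for all children that were actually started. */
    for (int p = 0; p < started; p++)
        wait(NULL);

    int sent = atomic_load(&b->responses);
    munmap(b, sizeof *b);
    return started == nproc ? sent : -1;
}
```

With four processes each receiving 175 of the 700 queries, rrl_shared_sketch(4, 175, 200) returns 200 regardless of how the children interleave, since the limit is enforced on the common counter. Every query now touches the same shared memory from all processes, which is presumably the kind of cost behind the performance concern raised above.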