thr3ads.net - nsd users - [nsd-users] Frequent RRL false negatives when using multiple server processes on Linux [Nov 2013]

Ville Mattila

2013-Nov-06 13:26 UTC

[nsd-users] Frequent RRL false negatives when using multiple server processes on Linux

Hi,

Please advise how to use Response Rate Limiting on a server which has
multiple NSD server processes (nsd.conf server section has
server-count > 1).

We have a problem with NSD v3.2.16 repeatedly unblocking and blocking
again a single source which is flooding positive queries at a ~steady
700 qps rate.  rrl-ratelimit setting is the default 200 qps.  The
unblock-block happens multiple times a minute.  This is causing false
negatives: NSD bursts out 200 responses on every unblock:

Nov  6 10:11:18 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:19 dnstest1 nsd[6876]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:20 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:21 dnstest1 nsd[6875]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:23 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:25 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:27 dnstest1 nsd[6879]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:28 dnstest1 nsd[6877]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:29 dnstest1 nsd[6879]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:29 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:30 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:42 dnstest1 nsd[6878]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:42 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:30 dnstest1 nsd[6877]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:31 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:31 dnstest1 nsd[6882]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:13:30 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi.
type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:13:31 dnstest1 nsd[6876]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:14:31 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type
positive target 193.166.5.0/24 query 193.166.5.1 NS

Noting how the PIDs change in the log lines, I'm guessing what
happens here is that the operating system (RHEL 6; Linux kernel v2.6.32)
process scheduler decides to start using a different NSD server process
every now and then to handle the incoming data on the socket / NIC
receive queue.  The newly chosen process has the rrl hash bucket for the
flooding source/type empty and only after sending 200 replies it starts
blocking.  (NB: The behaviour/interval of changing to a different
process may depend on what NIC / Linux kernel version / cpu scheduler /
irq&cpu affinity settings etc. one is using, and of course cannot be
controlled by NSD.  In this example case the query flood source is our
lab nameserver 193.166.5.1 itself, but I'm afraid we can expect our
production Linux server to behave ~similarly with external flood sources.)
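
The guess above can be checked against a toy model.  The following is only
a sketch under stated assumptions, not NSD's rrl.c: struct bucket,
bucket_allow() and simulate_flood() are hypothetical names, the smoothing
rule (average with the last second, then halve per idle second) is a
simplification chosen for illustration, and the kernel is modelled as
handing the whole flood to one process at a time, round robin.

```c
#include <stdint.h>

#define MAX_PROCS 16

/* Hypothetical, simplified per-process bucket; not NSD's struct. */
struct bucket {
    int32_t  stamp;    /* second in which counter was last updated */
    uint32_t counter;  /* queries seen during second `stamp` */
    uint32_t rate;     /* smoothed qps carried over from earlier seconds */
};

/* Returns 1 if the query would be answered, 0 if it would be dropped. */
static int bucket_allow(struct bucket *b, int32_t now, uint32_t limit)
{
    if (now != b->stamp) {
        /* Fold the finished second into the average, then halve the
         * average once for every second this process sat idle. */
        int32_t idle = now - b->stamp - 1;
        b->rate = b->rate / 2 + b->counter / 2;
        while (idle-- > 0 && b->rate > 0)
            b->rate /= 2;
        b->counter = 0;
        b->stamp = now;
    }
    b->counter++;
    return b->rate < limit && b->counter <= limit;
}

/* Sends `qps` queries per second for `seconds` seconds.  All queries of a
 * given second go to one process; the receiving process changes every
 * `switch_every` seconds.  Returns the number of answered queries. */
uint64_t simulate_flood(unsigned nprocs, unsigned switch_every,
                        unsigned seconds, uint32_t qps, uint32_t limit)
{
    struct bucket procs[MAX_PROCS] = {0};
    uint64_t answered = 0;

    if (nprocs == 0 || nprocs > MAX_PROCS || switch_every == 0)
        return 0;
    for (unsigned t = 0; t < seconds; t++) {
        struct bucket *b = &procs[(t / switch_every) % nprocs];
        for (uint32_t q = 0; q < qps; q++)
            answered += (uint64_t)bucket_allow(b, (int32_t)t, limit);
    }
    return answered;
}
```

In this model a single process answers only the first 200 queries of a
60 second, 700 qps flood, while 8 processes that take turns every 5 seconds
answer 200 queries at every hand-over, because the newly chosen bucket has
decayed in the meantime.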

If my guess is correct I think the options would be:
1. Do nothing and use RRL even though it's per-process.  Even if the
flood gets unblocked multiple times a minute RRL may still make the
attack ineffective enough.
2. Make use of the multiple receive queues / irq affinity of the server
network interface card so that queries from a specific source IP
always end up being processed by the same CPU, and configure process
scheduling to tie a single NSD server process to each of those CPUs.
(Too complex for us!  And of course this has its drawbacks, too, wrt
load distribution at least.  And unfortunately our Intel igb NICs can
only choose the receive queue based on IPv4 srcip,dstip tuples, while
all IPv6 packets always end up in the same queue.)
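
To gauge option 1, the leak can be estimated from the log excerpt above.
This is a rough sketch: leaked_fraction() is a hypothetical helper, and it
assumes that every burst lets through about rrl-ratelimit responses and
that the excerpt shows all events in the interval.  The excerpt has 10
"block" and 9 "unblock" lines between 10:11:18 and 10:14:31 (193 seconds),
so 10 bursts is used as the upper figure.

```c
/* Fraction of flood queries that were answered, assuming each burst lets
 * through roughly `limit` responses before blocking starts again. */
double leaked_fraction(unsigned bursts, unsigned limit,
                       unsigned qps, unsigned seconds)
{
    double queries = (double)qps * (double)seconds;

    if (queries == 0.0)
        return 0.0;
    return ((double)bursts * (double)limit) / queries;
}
```

With leaked_fraction(10, 200, 700, 193) this gives 2000 answered out of
135100 queries, i.e. roughly 1.5% of the flood under these assumptions.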

FWIW, the unblocking seems to be triggered every time by this, around
line 425 of rrl.c from nsd-3.2.16:
-----
        } else if(now - b->stamp > 0) {
                /* older bucket */
                int olderblock = used_to_block(b->rate, b->counter, lm);
                rrl_attenuate_bucket(b, now - b->stamp);
                if(olderblock && b->rate < lm)
                        rrl_msg(query, "unblock");
                b->counter = 1;
                b->stamp = now;
        }
-----

Thanks,
-- 
Ville Mattila, CSC

Matthijs Mekking

2013-Nov-07 09:26 UTC

Hi Ville,

On 11/06/2013 02:26 PM, Ville Mattila wrote:
> Hi,
> 
> Please advise how to use Response Rate Limiting on a server which has
> multiple NSD server processes (nsd.conf server section has
> server-count > 1).
> 
> We have a problem with NSD v3.2.16 repeatedly unblocking and blocking
> again a single source which is flooding positive queries at a ~steady
> 700 qps rate.  rrl-ratelimit setting is the default 200 qps.  The
> unblock-block happens multiple times a minute.  This is causing false
> negatives: NSD bursts out 200 responses on every unblock:
> 
> Nov  6 10:11:18 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:19 dnstest1 nsd[6876]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:20 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:21 dnstest1 nsd[6875]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:23 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:25 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:27 dnstest1 nsd[6879]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:28 dnstest1 nsd[6877]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:29 dnstest1 nsd[6879]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:29 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:30 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:42 dnstest1 nsd[6878]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:11:42 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:12:30 dnstest1 nsd[6877]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:12:31 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:12:31 dnstest1 nsd[6882]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:13:30 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi.
> type positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:13:31 dnstest1 nsd[6876]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> Nov  6 10:14:31 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type
> positive target 193.166.5.0/24 query 193.166.5.1 NS
> 
> Noting how the PIDs change in the log lines, I'm guessing what
> happens here is that the operating system (RHEL 6; Linux kernel v2.6.32)
> process scheduler decides to start using a different NSD server process
> every now and then to handle the incoming data on the socket / NIC
> receive queue.  The newly chosen process has the rrl hash bucket for the
> flooding source/type empty and only after sending 200 replies it starts
> blocking.  (NB: The behaviour/interval of changing to a different
> process may depend on what NIC / Linux kernel version / cpu scheduler /
> irq&cpu affinity settings etc. one is using, and of course cannot be
> controlled by NSD.  In this example case the query flood source is our
> lab nameserver 193.166.5.1 itself, but I'm afraid we can expect our
> production Linux server to behave ~similarly with external flood sources.)
> 
> If my guess is correct I think the options would be:
> 1. Do nothing and use RRL even though it's per-process.  Even if the
> flood gets unblocked multiple times a minute RRL may still make the
> attack ineffective enough.
> 2. Make use of the multiple receive queues / irq affinity of the server
> network interface card so that queries from a specific source IP
> always end up being processed by the same CPU, and configure process
> scheduling to tie a single NSD server process to each of those CPUs.
> (Too complex for us!  And of course this has its drawbacks, too, wrt
> load distribution at least.  And unfortunately our Intel igb NICs can
> only choose the receive queue based on IPv4 srcip,dstip tuples, while
> all IPv6 packets always end up in the same queue.)

Yes, this is a problem. A third solution could be to maintain the RRL
table globally, so that all server processes share the same table.
However, I suspect that such a solution would have a huge performance
impact.
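
As a rough illustration of what a shared table would involve (this is not
NSD code; struct shared_bucket and shared_limit_demo() are hypothetical
names), the sketch below keeps a single bucket in anonymous shared memory
and lets several forked workers count against it with C11 atomics.  It
assumes Linux, lock-free atomic_uint, and it leaves out hashing and time
handling entirely.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bucket visible to all forked workers. */
struct shared_bucket {
    atomic_uint counter;   /* queries seen by any worker */
    atomic_uint answered;  /* queries that stayed within the limit */
};

/* Forks `nworkers` children that each submit `per_worker` queries against
 * one shared bucket.  Returns how many queries were answered in total,
 * or -1 on error. */
long shared_limit_demo(unsigned nworkers, unsigned per_worker, unsigned limit)
{
    struct shared_bucket *b = mmap(NULL, sizeof *b, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    unsigned started = 0;
    int failed = 0;
    long result;

    if (b == MAP_FAILED)
        return -1;
    atomic_init(&b->counter, 0);
    atomic_init(&b->answered, 0);

    for (; started < nworkers; started++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Child: every query touches the same shared counter. */
            for (unsigned i = 0; i < per_worker; i++) {
                unsigned seen = atomic_fetch_add(&b->counter, 1) + 1;
                if (seen <= limit)
                    atomic_fetch_add(&b->answered, 1);
            }
            _exit(0);
        }
    }
    /* Parent: reap every child that was started. */
    for (unsigned i = 0; i < started; i++)
        if (wait(NULL) < 0)
            failed = 1;

    result = failed ? -1 : (long)atomic_load(&b->answered);
    munmap(b, sizeof *b);
    return result;
}
```

Four workers sending 700 queries each against a limit of 200 get 200
answers in total, whichever worker receives them.  Every query updates the
same memory location from all processes, which is the kind of contention
that may lie behind the performance concern.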

I am afraid that option 1 (do nothing) is currently the best option. You
may want to tweak the threshold a bit: lower the "rrl-ratelimit:" value
to shorten the bursts of false negatives.
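
For example, a fragment along these lines in nsd.conf (the values are
only illustrative assumptions, not recommendations; server-count and
rrl-ratelimit are the options discussed in this thread, and the default
rrl-ratelimit is 200):

```
server:
    server-count: 8
    rrl-ratelimit: 100
```

With a lower limit each process answers fewer queries before it starts
blocking, at the cost of also limiting legitimate sources sooner.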

Best regards,
  Matthijs
> 
> FWIW, the unblocking seems to be triggered every time by this, around
> line 425 of rrl.c from nsd-3.2.16:
> -----
>         } else if(now - b->stamp > 0) {
>                 /* older bucket */
>                 int olderblock = used_to_block(b->rate, b->counter, lm);
>                 rrl_attenuate_bucket(b, now - b->stamp);
>                 if(olderblock && b->rate < lm)
>                         rrl_msg(query, "unblock");
>                 b->counter = 1;
>                 b->stamp = now;
>         }
> -----
> 
> Thanks,

