thr3ads.net - netfilter buglog - [Bug 1082] New: Hard lockup when inserting nft rules (esp. ct rule) [Aug 2016]

bugzilla-daemon at netfilter.org

2016-Aug-17 17:52 UTC

[Bug 1082] New: Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Bug ID: 1082
Summary: Hard lockup when inserting nft rules (esp. ct rule)
Product: nftables
Version: unspecified
Hardware: x86_64
OS: Debian GNU/Linux
Status: NEW
Severity: blocker
Priority: P5
Component: kernel
Assignee: pablo at netfilter.org
Reporter: larkwang at gmail.com

We are switching from openvpn to strongswan (ipsec) for the VPN links between
our branch offices and headquarters.

We use nftables for better performance and a cleaner ruleset. The ruleset is:

-----snip-----
#!/usr/sbin/nft -f

flush ruleset

table inet filter {
    set allowed_addr {
        type ipv4_addr
        elements = { <about 40+ IPs> }
    }
    set allowed_port {
        type inet_service
        elements = { 80,443,<other about 10 ports> }
    }

    chain forward {
        type filter hook forward priority 0;
        ip saddr { 10.xx.210.0-10.xx.217.255, 10.xx.0.12 } ip daddr 10.xx.0.0/16 counter accept
        ip saddr 10.xx.0.0/16 ip daddr @allowed_addr tcp dport @allowed_port counter accept
        ip saddr 10.xx.0.0/16 ip daddr { 10.xx.254.0/24, 10.xx.yy.zz } counter accept
        ip saddr 10.xx.0.0/16 ip daddr 10.0.0.0/8 ip protocol tcp ct state invalid,new counter reject
    }
}
-----snip-----

The vpn server (debian jessie with bpo) uses these:

linux-image 4.6.4-1~bpo8+1 (also 4.5.5-1)
nftables 0.6-1~bpo8+1
libnftnl4 1.0.6-1~bpo8+1
libmnl0 1.0.3-5

The ruleset is loaded without problem before we begin to transit vpn links.
After we transit all links, we want to update the ruleset to add a new open IP.
But loading the modified ruleset causes this machine hard lockup immediately.

We then had to move the high-load vpn link back to the openvpn server. With the
remaining vpn links, we can reproduce the hard lockup 100% of the time.

After quick pinpoints, we are sure:

1. The unmodified ruleset can cause lockup too
2. The lockup is caused by the last "ct state" rule (if commented, no
lockup)

We move most of vpn links to a backup server after work time, which has the
same hardware and software. Loading ruleset in this backup server doesn't
cause hard lockup. Loading ruleset in the aforementioned now unloaded server
doesn't cause hard lockup, either.

We are sure:

3. Certain traffic load is a factor for the hard lockup

Please look into this issue.

--
You are receiving this mail because:
You are watching all bug changes.

bugzilla-daemon at netfilter.org

2016-Aug-17 17:53 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Wang Jian <larkwang at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |larkwang at gmail.com

bugzilla-daemon at netfilter.org

2016-Aug-18 01:44 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Pablo Neira Ayuso <pablo at netfilter.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #1 from Pablo Neira Ayuso <pablo at netfilter.org> ---
(In reply to Wang Jian from comment #0)
[...]
> The ruleset is loaded without problem before we begin to transit vpn links.
> After we transit all links, we want to update the ruleset to add a new open
> IP. But loading the modified ruleset causes this machine hard lockup
> immediately.

What do you mean by loading the "modified ruleset"? So as soon as you invoke
some specific command you experience problems?

> After quick pinpoints, we are sure:
>
> 1. The unmodified ruleset can cause lockup too
> 2. The lockup is caused by the last "ct state" rule (if commented, no lockup)

This is confusing.

Now you say that the lockup only happens if the last rule using 'reject' is
there?

> We move most of vpn links to a backup server after work time, which has the
> same hardware and software. Loading ruleset in this backup server doesn't
> cause hard lockup. Loading ruleset in the aforementioned now unloaded server
> doesn't cause hard lockup, either.

I'm getting confused here. So the backup server does not experience any
problem at all with this ruleset?

> We are sure:
>
> 3. Certain traffic load is a factor for the hard lockup

Please provide more specific information to make sure this is a bug in
nftables, such as backtraces.

bugzilla-daemon at netfilter.org

2016-Aug-18 05:07 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #2 from Wang Jian <larkwang at gmail.com> ---
> What do you mean by loading the "modified ruleset"? So as soon as you invoke
> some specific command you experience problems?

Sorry for the confusion. I am trying to replay the situation.

Whether the ruleset was modified or not is clearly not relevant. We loaded the
ruleset before we added traffic to the server, with no problem. When we loaded
the ruleset again after adding some traffic (about 200-500 Mbps), in order to
grant an additional permission, the server locked up.

The modification just gave us the chance to catch the problem.

> Now you say that the lockup only happens if the last rule using 'reject' is
> there?

I didn't say 'reject'. I said 'ct state'. But to be honest, I didn't check
which one of 'reject' or 'ct state' is the culprit.

> I'm getting confused here. So the backup server does not experience any
> problem at all with this ruleset?

No, the backup server doesn't experience the problem. We did this after work
time, and there was not much traffic load on it at that time.

> Please provide more specific information to make sure this is a bug in
> nftables, such as backtraces.

I will if I can.

It is a true hard lockup: the server just freezes. Not a single character is
emitted on the console or in the logs.
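
Because nothing reaches the console or the logs, one possible way to obtain the requested backtrace is to let the kernel's lockup detector panic instead of freezing silently. The following is only a sketch and was not part of the original report: it assumes a kernel built with the hard lockup detector (NMI watchdog), and the helper name `write_lockup_sysctl` and the destination path are made up for illustration. The panic output still has to go somewhere that survives the freeze, such as a serial console, netconsole, or kdump.

```shell
#!/bin/sh
# Hypothetical helper: write sysctl settings that turn a detected lockup
# into a kernel panic, so that a backtrace is printed instead of a silent
# freeze. Assumes the kernel has the lockup detectors compiled in.
write_lockup_sysctl() {
    # $1 = destination file, e.g. /etc/sysctl.d/90-lockup.conf (example path)
    cat > "$1" <<'EOF'
# Enable the NMI watchdog used by the hard lockup detector
kernel.nmi_watchdog = 1
# Panic (and print a backtrace) when a hard lockup is detected
kernel.hardlockup_panic = 1
# Do the same for soft lockups
kernel.softlockup_panic = 1
EOF
}
```

As root, something like `write_lockup_sysctl /etc/sysctl.d/90-lockup.conf && sysctl --system` would apply the settings before the next reproduction attempt.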

After we moved the traffic off the server, loading the ruleset no longer caused
a lockup.

My wild guess is that under high traffic (when there are many connection
tracking operations going on), inserting the 'ct state' rule is racy. When
traffic is low, the problem is not triggered.
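
Since it was not checked whether 'ct state' or 'reject' is the culprit, one way to narrow it down would be to load two variants of the last rule separately while under load. This is a sketch only: `last_rule_variant` is a hypothetical helper, the `10.xx` addresses are the redacted placeholders from the ruleset in comment #0, and using `drop` as a stand-in verdict is an assumption. Note that the reject-only variant also rejects packets of established connections, so it is only suitable for a short test window.

```shell
#!/bin/sh
# Hypothetical helper: print one variant of the last forward-chain rule so
# that the conntrack match and the reject verdict can be tested separately.
last_rule_variant() {
    # Common match part, copied from the last rule of the reported ruleset
    match='ip saddr 10.xx.0.0/16 ip daddr 10.0.0.0/8 ip protocol tcp'
    case "$1" in
        ct-only)
            # Keep the conntrack match, replace reject with a plain drop
            echo "$match ct state invalid,new counter drop" ;;
        reject-only)
            # Keep the reject verdict, remove the conntrack match
            echo "$match counter reject" ;;
        *)
            echo "usage: last_rule_variant ct-only|reject-only" >&2
            return 1 ;;
    esac
}
```

If only the `ct-only` variant freezes the machine, the conntrack expression is the more likely suspect; if only `reject-only` does, the reject path is.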

I can't reproduce the lockup on the live server at will, because certain vpn
links are business critical. I have at most a 5 minute window per day,
including the reboot (a reboot takes 1 minute).

BTW, we tried various kernels and excluded hardware problems (though not 100%
excluded). I will stress test it.

bugzilla-daemon at netfilter.org

2016-Dec-19 08:19 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #3 from Wang Jian <larkwang at gmail.com> ---
The following are steps to reproduce. It's different from our production
setup, though.

== network setup

HostB <= ipsec =>  HostA <= ipsec => HostC

HostA
eth0: 10.2.16.13/24
eth1: 192.168.235.12/24

HostB
eth0: 10.2.16.14/24
eth1: 192.168.234.12/24

HostC
eth0: 10.2.16.18/24
eth1: 192.168.236.12/24

IPsec config

HostA  /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no
conn base
    leftid=host-a at peers
    left=10.2.16.13
conn host-b
    leftsubnet=192.168.235.0/24,192.168.236.0/24
    right=10.2.16.14
    rightid=host-b at peers
    rightsubnet=192.168.234.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
conn host-c
    leftsubnet=192.168.235.0/24,192.168.234.0/24
    right=10.2.16.18
    rightid=host-c at peers
    rightsubnet=192.168.236.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--


HostB /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no
conn base
    leftid=host-b at peers
    left=10.2.16.14
conn host-a
    leftsubnet=192.168.234.0/24
    right=10.2.16.13
    rightid=host-a at peers
    rightsubnet=192.168.235.0/24,192.168.236.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--


HostC /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no
conn base
    leftid=host-c at peers
    left=10.2.16.18
conn host-a
    leftsubnet=192.168.236.0/24
    right=10.2.16.13
    rightid=host-a at peers
    rightsubnet=192.168.234.0/24,192.168.235.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--


All /var/lib/strongswan/ipsec.secrets.inc
--snip--
host-a at client.bytedance.net host-b at client.bytedance.net : PSK 0sPJ6QU/WlSrbj8caGCcXxO6qBcyxdbMbh8RVTRhDDNXM
host-a at client.bytedance.net host-c at client.bytedance.net : PSK 0sPJ6QU/WlSrbj8caGCcXxO6qBcyxdbMbh8RVTRhDDNXM
--snip--


== test method

1. run ab on HostC against HostA's webserver (such as nginx)

$ ab -n 10000000 -c <concurrency> http://192.168.234.12/

2. load/reload the nftables ruleset on HostA while ab is running

# ./rules.nft

If the ab concurrency is 1000 or more, HostA freezes without any panic
information.
A smaller concurrency may or may not trigger the freeze.

We tried to trigger the freeze without ipsec involved, but have not managed to
so far.
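
The two steps above could be scripted roughly as follows. This is a sketch under assumptions, not the reporter's script: the variable names, the reload count, and the `DRY_RUN` switch (which defaults to only printing the commands) are made up for illustration, and `nft -f ./rules.nft` is used as the equivalent of executing `./rules.nft` directly.

```shell
#!/bin/sh
# Sketch of the reproduction: run "load" on HostC and "reload" on HostA.
# All variable names and defaults below are illustrative assumptions.
TARGET_URL="${TARGET_URL:-http://192.168.234.12/}"
CONCURRENCY="${CONCURRENCY:-1000}"
RULESET="${RULESET:-./rules.nft}"
RELOADS="${RELOADS:-10}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (HostC): generate HTTP load through the tunnels with ab.
start_load() {
    run ab -n 10000000 -c "$CONCURRENCY" "$TARGET_URL"
}

# Step 2 (HostA): reload the ruleset repeatedly while the load is running.
reload_ruleset() {
    i=1
    while [ "$i" -le "$RELOADS" ]; do
        run nft -f "$RULESET"
        i=$((i + 1))
        [ "$DRY_RUN" = "1" ] || sleep 1
    done
}

# Dispatch on the first argument; with no argument nothing is executed.
case "${1:-}" in
    load)   start_load ;;
    reload) reload_ruleset ;;
esac
```

With `DRY_RUN=0`, `sh repro.sh load` would be started on HostC first and `sh repro.sh reload` on HostA afterwards (`repro.sh` being an arbitrary file name).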

== software

It's a mix of debian jessie/jessie-backports and a home-built strongswan

HostA kernel: 4.6.4-1~bpo8+1
strongswan:   5.5.0-1
nftables:     0.6-1~bpo8+1

The debian jessie backport kernels 4.7.8-1~bpo8+1 and 4.8.11-1~bpo8+1 are not
affected in this test setup.

However, 4.7.8-1~bpo8+1 is affected on our production server setup. We can't
test 4.8.11-1~bpo8+1 on our production server.

== rules.nft

It's not suitable for public post. I will mail it privately.

bugzilla-daemon at netfilter.org

2019-Jul-12 10:05 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #4 from Pablo Neira Ayuso <pablo at netfilter.org> ---
This is three years old, we need more information to know if this bug exists
these days.

Thanks for reporting.

bugzilla-daemon at netfilter.org

2019-Jul-12 10:05 UTC

[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)

https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Pablo Neira Ayuso <pablo at netfilter.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |WONTFIX
