bugzilla-daemon at netfilter.org
2016-Aug-17 17:52 UTC
[Bug 1082] New: Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

            Bug ID: 1082
           Summary: Hard lockup when inserting nft rules (esp. ct rule)
           Product: nftables
           Version: unspecified
          Hardware: x86_64
                OS: Debian GNU/Linux
            Status: NEW
          Severity: blocker
          Priority: P5
         Component: kernel
          Assignee: pablo at netfilter.org
          Reporter: larkwang at gmail.com

We are switching from openvpn to strongswan (ipsec) for the VPN links between our branch offices and headquarters. We use nftables for better performance and a cleaner ruleset.

The ruleset is

-----snip-----
#!/usr/sbin/nft -f

flush ruleset

table inet filter {
    set allowed_addr {
        type ipv4_addr
        elements = { <about 40+ IPs> }
    }
    set allowed_port {
        type inet_service
        elements = { 80,443,<other about 10 ports> }
    }
    chain forward {
        type filter hook forward priority 0;
        ip saddr { 10.xx.210.0-10.xx.217.255, 10.xx.0.12 } ip daddr 10.xx.0.0/16 counter accept
        ip saddr 10.xx.0.0/16 ip daddr @allowed_addr tcp dport @allowed_port counter accept
        ip saddr 10.xx.0.0/16 ip daddr { 10.xx.254.0/24, 10.xx.yy.zz } counter accept
        ip saddr 10.xx.0.0/16 ip daddr 10.0.0.0/8 ip protocol tcp ct state invalid,new counter reject
    }
}
-----snip-----

The vpn server (debian jessie with bpo) uses these:

linux-image 4.6.4-1~bpo8+1 (also 4.5.5-1)
nftables 0.6-1~bpo8+1
libnftnl4 1.0.6-1~bpo8+1
libmnl0 1.0.3-5

The ruleset loaded without problems before we began to migrate the VPN links. After we had migrated all links, we wanted to update the ruleset to add a new open IP. But loading the modified ruleset caused the machine to hard lock up immediately. We then had to move the high-load VPN link back to the openvpn server. With the remaining VPN links, we can reproduce the hard lockup 100% of the time.

After some quick narrowing down, we are sure:

1. The unmodified ruleset can cause the lockup too
2. The lockup is caused by the last "ct state" rule (if it is commented out, there is no lockup)

After work hours we moved most of the VPN links to a backup server, which has the same hardware and software. Loading the ruleset on this backup server doesn't cause a hard lockup. Loading the ruleset on the aforementioned, now unloaded, server doesn't cause a hard lockup either.

We are sure:

3. A certain traffic load is a factor in the hard lockup

Please look into this issue.

-- 
You are receiving this mail because:
You are watching all bug changes.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.netfilter.org/pipermail/netfilter-buglog/attachments/20160817/52ea5b73/attachment.html>
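For the "add a new open IP" case in this report, a named set can in principle be updated in place instead of flushing and reloading the whole ruleset. The following is only a sketch: the table and set names (inet filter, allowed_addr) come from the ruleset quoted above, the address 192.0.2.10 is a placeholder, and the APPLY switch is a hypothetical convenience. Whether an in-place update avoids the lockup has not been tested in this thread.

```shell
#!/bin/sh
# Add one address to the named set "allowed_addr" in table "inet filter".
# Dry run by default: the command is only printed.
# Set APPLY=1 and run as root to actually call nft.
add_allowed_addr() {
    addr="$1"
    if [ "${APPLY:-0}" = "1" ]; then
        nft add element inet filter allowed_addr "{ $addr }"
    else
        echo "nft add element inet filter allowed_addr { $addr }"
    fi
}

# 192.0.2.10 is a documentation address used as a placeholder.
add_allowed_addr 192.0.2.10
```

Note that this only touches the set; the rule referencing "ct state" is not re-inserted, which is the operation the report suspects.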
bugzilla-daemon at netfilter.org
2016-Aug-17 17:53 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Wang Jian <larkwang at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |larkwang at gmail.com
bugzilla-daemon at netfilter.org
2016-Aug-18 01:44 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Pablo Neira Ayuso <pablo at netfilter.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #1 from Pablo Neira Ayuso <pablo at netfilter.org> ---
(In reply to Wang Jian from comment #0)
[...]
> The ruleset is loaded without problem before we begin to transit vpn links.
> After we transit all links, we want to update the ruleset to add a new open
> IP. But loading the modified ruleset causes this machine hard lockup
> immediately.

What do you mean by loading the "modified ruleset"? So as soon as you invoke some specific command you experience problems?

> After quick pinpoints, we are sure:
>
> 1. The unmodified ruleset can cause lockup too
> 2. The lockup is caused by the last "ct state" rule (if commented, no lockup)

This is confusing. Now you say that the lockup only happens if the last rule using 'reject' is there?

> We move most of vpn links to a backup server after work time, which has the
> same hardware and software. Loading ruleset in this backup server doesn't
> cause hard lockup. Loading ruleset in the aforementioned now unloaded server
> doesn't cause hard lockup, either.

I'm getting confused here. So the backup server does not experience any problem at all with this ruleset?

> We are sure:
>
> 3. Certain traffic load is a factor for the hard lockup

Please provide more specific information to make sure this is a bug in nftables, such as backtraces.
bugzilla-daemon at netfilter.org
2016-Aug-18 05:07 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #2 from Wang Jian <larkwang at gmail.com> ---
> What do you mean by loading the "modified ruleset"? So as soon as you invoke some specific command you experience problems?

Sorry for the confusion. I was trying to replay the situation. It is now clear that whether the ruleset was modified or not is not relevant.

We loaded the ruleset before we added traffic to the server, with no problem. After we had added some traffic (about 200-500 Mbps), we wanted to load the ruleset again to grant an additional permission, and the server locked up. The modification merely gave us the occasion to hit the problem.

> Now you say that the lockup only happens if the last rule using 'reject' is there?

I didn't say 'reject'. I said 'ct state'. But to be honest, I didn't check which of 'reject' or 'ct state' is the culprit.

> I'm getting confused here. So the backup server does not experience any problem at all with this ruleset?

No, the backup server doesn't experience the problem. We did this after work hours. There was not much traffic load on it at that time.

> Please provide more specific information to make sure this is a bug in nftables, such as backtraces.

I will if I can. The hard lockup really is a hard lockup: the server just freezes. Not a single character is emitted on the console or in the logs.

After we move traffic off the server, loading the ruleset doesn't cause a lockup.

My wild guess is that under high traffic (when there are many connection tracking operations in flight), inserting the 'ct state' rule is racy. When traffic is low, the problem is not triggered.

I can't reproduce the lockup online at will, because certain VPN links are business critical. I have at most a 5 minute window per day, including the reboot (a reboot takes 1 minute).

BTW, we tried various kernels, and excluded hardware problems (though not 100%).

I will stress test it.
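Since the freeze described in comment #2 emits nothing on the console, one possible way to obtain the requested backtrace is to let the NMI watchdog turn a detected hard lockup into a panic. The sketch below assumes a kernel built with the hard lockup detector (CONFIG_HARDLOCKUP_DETECTOR) and some channel that survives the freeze, such as a serial console, netconsole, or kdump. The 30 second reboot delay and the APPLY switch are arbitrary choices for illustration, and it is not known whether the watchdog would fire for this particular lockup.

```shell
#!/bin/sh
# Sysctl settings that ask the kernel to panic, and therefore print a
# backtrace, when the NMI watchdog detects a hard lockup.
lockup_debug_sysctls() {
    cat <<'EOF'
kernel.nmi_watchdog=1
kernel.hardlockup_panic=1
kernel.panic=30
EOF
}

# Dry run by default: print one "sysctl -w" command per setting.
# Set APPLY=1 and run as root to apply the settings.
lockup_debug_sysctls | while read -r kv; do
    if [ "${APPLY:-0}" = "1" ]; then
        sysctl -w "$kv"
    else
        echo "sysctl -w $kv"
    fi
done
```

kernel.panic=30 makes the machine reboot 30 seconds after the panic, which matters on a host with only a short daily test window.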
bugzilla-daemon at netfilter.org
2016-Dec-19 08:19 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #3 from Wang Jian <larkwang at gmail.com> ---
The following are steps to reproduce. The setup differs from our production setup, though.

== network setup

HostB <= ipsec => HostA <= ipsec => HostC

HostA
  eth0: 10.2.16.13/24
  eth1: 192.168.235.12/24
HostB
  eth0: 10.2.16.14/24
  eth1: 192.168.234.12/24
HostC
  eth0: 10.2.16.18/24
  eth1: 192.168.236.12/24

IPsec config

HostA /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no

conn base
    leftid=host-a@peers
    left=10.2.16.13

conn host-b
    leftsubnet=192.168.235.0/24,192.168.236.0/24
    right=10.2.16.14
    rightid=host-b@peers
    rightsubnet=192.168.234.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever

conn host-c
    leftsubnet=192.168.235.0/24,192.168.234.0/24
    right=10.2.16.18
    rightid=host-c@peers
    rightsubnet=192.168.236.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--

HostB /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no

conn base
    leftid=host-b@peers
    left=10.2.16.14

conn host-a
    leftsubnet=192.168.234.0/24
    right=10.2.16.13
    rightid=host-a@peers
    rightsubnet=192.168.235.0/24,192.168.236.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--

HostC /var/lib/strongswan/ipsec.conf.inc
--snip--
conn %default
    ikelifetime=1440m
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    keyexchange=ikev2
    mobike=no

conn base
    leftid=host-c@peers
    left=10.2.16.18

conn host-a
    leftsubnet=192.168.236.0/24
    right=10.2.16.13
    rightid=host-a@peers
    rightsubnet=192.168.234.0/24,192.168.235.0/24
    also=base
    auto=start
    dpdaction=restart
    keyingtries=%forever
--snip--

All /var/lib/strongswan/ipsec.secrets.inc
--snip--
host-a@client.bytedance.net host-b@client.bytedance.net : PSK 0sPJ6QU/WlSrbj8caGCcXxO6qBcyxdbMbh8RVTRhDDNXM
host-a@client.bytedance.net host-c@client.bytedance.net : PSK 0sPJ6QU/WlSrbj8caGCcXxO6qBcyxdbMbh8RVTRhDDNXM
--snip--

== test method

1. run ab on HostC against HostA's webserver (such as nginx)

   $ ab -n 10000000 -c <concurrency> http://192.168.234.12/

2. load/reload the nftables ruleset on HostA while ab is running

   # ./rules.nft

If the ab concurrency is 1000 or more, HostA freezes without any panic information. A smaller concurrency may or may not trigger the freeze.

We tried to trigger the freeze without ipsec involved, but have not managed to so far.

== software

It's a mix of debian jessie/jessie-backports and a home-built strongswan

HostA
  kernel: 4.6.4-1~bpo8+1
  strongswan: 5.5.0-1
  nftables: 0.6-1~bpo8+1

The debian jessie backport kernels 4.7.8-1~bpo8+1 and 4.8.11-1~bpo8+1 are not affected in this test setup, BUT 4.7.8-1~bpo8+1 is affected on our production server setup. We can't test 4.8.11-1~bpo8+1 on our production server.

== rules.nft

It's not suitable for a public post. I will mail it privately.
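Step 2 of the test method in comment #3 reloads the ruleset by hand. If the freeze depends on timing, repeating the reload in a loop while ab runs on HostC may make it easier to hit. This is a sketch only: ./rules.nft is the script named in the report, while the reload count, the pause, and the RULES and APPLY variables are hypothetical parameters introduced here.

```shell
#!/bin/sh
# Reload the ruleset repeatedly on HostA while ab is running on HostC.
# RULES defaults to the script named in the report.
RULES="${RULES:-./rules.nft}"

# reload_loop COUNT PAUSE
# Dry run by default: each reload is only announced.
# Set APPLY=1 and run as root to execute the ruleset script.
reload_loop() {
    count="$1"
    pause="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "${APPLY:-0}" = "1" ]; then
            "$RULES" || return 1
        else
            echo "reload $i: $RULES"
        fi
        sleep "$pause"
        i=$((i + 1))
    done
}

# Three reloads with no pause between them.
reload_loop 3 0
```

A larger count with a short pause (for example `reload_loop 100 1`) would spread the reloads across different moments of the ab run.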
bugzilla-daemon at netfilter.org
2019-Jul-12 10:05 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

--- Comment #4 from Pablo Neira Ayuso <pablo at netfilter.org> ---
This is three years old, we need more information to know if this bug exists these days.

Thanks for reporting.
bugzilla-daemon at netfilter.org
2019-Jul-12 10:05 UTC
[Bug 1082] Hard lockup when inserting nft rules (esp. ct rule)
https://bugzilla.netfilter.org/show_bug.cgi?id=1082

Pablo Neira Ayuso <pablo at netfilter.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |WONTFIX