thr3ads.net - netfilter buglog - [Bug 1743] New: Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout

bugzilla-daemon at netfilter.org

2024-Apr-04 20:55 UTC

[Bug 1743] New: Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

Bug ID: 1743
Summary: Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged
Product: nftables
Version: 1.0.x
Hardware: x86_64
OS: other
Status: NEW
Severity: normal
Priority: P5
Component: kernel
Assignee: pablo at netfilter.org
Reporter: tim at muppetz.com

Created attachment 739
--> https://bugzilla.netfilter.org/attachment.cgi?id=739&action=edit
Session where Conntrack Changed to 300

Kernel: 6.6.21

I have a TCP flow between an Android phone and Google's Firebase Cloud
Messaging (FCM). FCM uses TCP port 5228 and is a very low-traffic connection:
it can be up to 28 minutes before a keepalive packet crosses it. It is used
for push messaging (and probably a lot of other things too).

First, with the flowtable disabled, I watch the FCM flow in conntrack like
this:

watch -n 1 "sudo conntrack -L -p TCP -s 192.168.0.128 -d 142.251.12.188 --dport 5228"

I quite often see the flow's conntrack timeout change from ~432000 down to
300. To determine whether this was nf_conntrack_tcp_timeout_unacknowledged or
nf_conntrack_tcp_timeout_max_retrans, I altered both sysctl entries and found
that if I change nf_conntrack_tcp_timeout_unacknowledged to 400, the timeout
drops to 400 seconds instead of 300.
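
The reason both sysctls had to be altered can be sketched in a few lines of Python: with the usual defaults both are 300 seconds, so an observed timeout of 300 is ambiguous until one of them is changed. This is only an illustration; the helper names are hypothetical, and the sample values are the defaults mentioned in this report (432000 and 300), not values read from the affected router.

```python
from pathlib import Path

# Standard location of the conntrack sysctls on Linux.
SYSCTL_DIR = Path("/proc/sys/net/netfilter")


def read_tcp_timeouts(sysctl_dir=SYSCTL_DIR):
    """Return {sysctl_name: seconds} for every TCP conntrack timeout found."""
    timeouts = {}
    for path in sorted(sysctl_dir.glob("nf_conntrack_tcp_timeout_*")):
        try:
            timeouts[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry: skip it
    return timeouts


def sysctls_matching(value, timeouts):
    """Names of all sysctls whose configured value equals `value`."""
    return sorted(name for name, secs in timeouts.items() if secs == value)


# Assumed sample data: the default values discussed in the report.
SAMPLE = {
    "nf_conntrack_tcp_timeout_established": 432000,
    "nf_conntrack_tcp_timeout_max_retrans": 300,
    "nf_conntrack_tcp_timeout_unacknowledged": 300,
}

# With defaults, 300 matches two sysctls, so the source is ambiguous.
print(sysctls_matching(300, SAMPLE))

# After changing one of them to 400, a timeout of 400 identifies it uniquely.
changed = dict(SAMPLE, nf_conntrack_tcp_timeout_unacknowledged=400)
print(sysctls_matching(400, changed))
```

On a live system, `read_tcp_timeouts()` could replace `SAMPLE` to work from the values actually configured.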

So my first question is: why does a flow in the Established state change to
the unacknowledged timeout? It only changes for a second, though; then I
assume another packet comes in and the timeout jumps back to 5 days.
This appears odd to me, but it is probably normal behaviour that I just don't
understand.
[I tested with OpenWRT with kernel 5.15.150 (also using nftables) and I see it
do the same thing, conntrack timeout dropping to 300 for a second before
bouncing back to $nf_conntrack_tcp_timeout_established so this must be expected
behaviour.]

My real issue comes about when I enable Flow Offload. With the same sort of
packet flow, I will see the following:

The flow enters the OFFLOAD state in conntrack. When it comes out of OFFLOAD
it will be in one of 3 states:

A timeout of ~432000 (Seems odd, I expect ~86400)
A timeout of ~86400 (This is what I expect)
A timeout of 300 ($nf_conntrack_tcp_timeout_unacknowledged) minus anywhere up
to 30 seconds. So values like 260, 274, 283 are all values I've seen.
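
Observations like the three above could be collected automatically by parsing conntrack listing lines. The following Python sketch assumes the usual `conntrack -L` layout (protocol name, protocol number, optional timeout, optional TCP state, then key=value pairs and bracketed flags). The two sample lines are hypothetical, including the ports and the 203.0.113.5 address, and the exact layout may differ between conntrack-tools versions.

```python
from typing import NamedTuple, Optional


class CtEntry(NamedTuple):
    proto: str
    timeout: Optional[int]  # seconds; None if the listing shows no timeout
    state: Optional[str]    # e.g. "ESTABLISHED"; None if not shown
    offloaded: bool         # True if the [OFFLOAD] flag is present


def parse_conntrack_line(line: str) -> CtEntry:
    """Extract protocol, timeout, TCP state and OFFLOAD flag from one line."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"not a conntrack entry: {line!r}")
    proto, rest = fields[0], fields[2:]  # fields[1] is the protocol number
    timeout = None
    state = None
    if rest and rest[0].isdigit():
        timeout = int(rest[0])
        rest = rest[1:]
    if rest and "=" not in rest[0] and not rest[0].startswith("["):
        state = rest[0]
    return CtEntry(proto, timeout, state, "[OFFLOAD]" in fields)


# Hypothetical sample lines, for illustration only.
ESTABLISHED_LINE = (
    "tcp      6 431999 ESTABLISHED src=192.168.0.128 dst=142.251.12.188 "
    "sport=40000 dport=5228 src=142.251.12.188 dst=203.0.113.5 "
    "sport=5228 dport=40000 [ASSURED] mark=0 use=1"
)
OFFLOAD_LINE = (
    "tcp      6 src=192.168.0.128 dst=142.251.12.188 sport=40000 dport=5228 "
    "src=142.251.12.188 dst=203.0.113.5 sport=5228 dport=40000 "
    "[OFFLOAD] mark=0 use=2"
)

for sample in (ESTABLISHED_LINE, OFFLOAD_LINE):
    print(parse_conntrack_line(sample))
```

Logging each parsed entry with a timestamp would show when the flow leaves OFFLOAD and which timeout it is given at that moment.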

A major problem comes about when it enters the table with the
nf_conntrack_tcp_timeout_unacknowledged timeout of ~300. Because there is so
little traffic on this session, it will often age out and leave the conntrack
table. When this happens, the FCM session dies and Android devices on the
network no longer receive push messages until they are woken up, realise the
session is dead and establish a new one.
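
The arithmetic behind this failure is simple and can be written out as a small Python sketch. The 28-minute idle gap and the timeout values are taken from this report; the function name is hypothetical.

```python
def entry_survives(timeout_s: int, idle_gap_s: int) -> bool:
    """True if the next packet arrives before the conntrack entry expires."""
    return idle_gap_s < timeout_s


# Up to 28 minutes between keepalive packets, per the report.
FCM_MAX_IDLE_S = 28 * 60

# Timeouts seen after leaving OFFLOAD: established, unacknowledged, and
# unacknowledged minus a few seconds.
for timeout in (432000, 300, 260):
    verdict = "survives" if entry_survives(timeout, FCM_MAX_IDLE_S) else "expires"
    print(f"timeout {timeout:>6}s vs idle gap {FCM_MAX_IDLE_S}s: entry {verdict}")
```

Any timeout shorter than the idle gap means the entry can be gone before the next keepalive arrives, which is consistent with the dead FCM sessions described above.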

Attached is a tcpdump of a Google FCM session where I saw the timeout drop to
$nf_conntrack_tcp_timeout_unacknowledged at approx packet 23.

I have tried watching conntrack with -E, but I see no events generated for
this session when the timeouts change.

Other details:

This is happening on a Vyos 1.4.0-epa2 release Router.
My WAN interface is a PPPoE interface, my LAN Interface is an Ethernet
interface (virtio, the router is virtualised)
There are two patches in the Vyos kernel that are "non-standard". I have
looked at them and I can't see how they could interfere with Offload. Here is
the link to them:
https://github.com/vyos/vyos-build/tree/sagitta/packages/linux-kernel/patches/kernel

Please let me know what other details I can provide that might help locate the
issue.

Thank you very much.
Tim

--
You are receiving this mail because:
You are watching all bug changes.

bugzilla-daemon at netfilter.org

2024-Apr-08 03:17 UTC

[Bug 1743] Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

--- Comment #1 from Tim Harman <tim at muppetz.com> ---
In fact on further reading/investigation, I don't know why I thought a
timeout of ~86400 was expected.  I don't see this value anywhere in
/proc/sys/net/netfilter

Also I have
ct state { established, related } meta l4proto { tcp, udp }
as my offload rule.  Should that be ALL traffic, or is the established+related
correct?
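
For context, here is a sketch of how a match expression like this typically sits inside an nftables ruleset with a flowtable. It is written as a small Python renderer so the device list can be varied. The table, flowtable and chain names and the device names eth0/eth1 are hypothetical, and this is not the ruleset VyOS actually generates.

```python
def render_flowtable_ruleset(
    devices,
    match="ct state { established, related } meta l4proto { tcp, udp }",
):
    """Render a minimal nftables ruleset that offloads matching flows."""
    if not devices:
        raise ValueError("a flowtable needs at least one device")
    devs = ", ".join(devices)
    lines = [
        "table inet sketch {",
        "    flowtable ft {",
        f"        hook ingress priority 0; devices = {{ {devs} }};",
        "    }",
        "    chain forward {",
        "        type filter hook forward priority 0; policy accept;",
        # Flows that match are added to the flowtable and take the fast path.
        f"        {match} flow add @ft",
        "    }",
        "}",
    ]
    return "\n".join(lines)


# Hypothetical device names, for illustration only.
print(render_flowtable_ruleset(["eth0", "eth1"]))
```

The rendered text could be checked with `nft -c -f <file>` before loading it.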

bugzilla-daemon at netfilter.org

2024-Apr-16 01:46 UTC

[Bug 1743] Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

--- Comment #2 from Tim Harman <tim at muppetz.com> ---
Further testing shows that it doesn't matter if I use established/related or
just accept everything, the same odd timeouts persist.

bugzilla-daemon at netfilter.org

2024-Apr-30 22:29 UTC

[Bug 1743] Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

--- Comment #3 from Tim Harman <tim at muppetz.com> ---
I have recently moved ISPs.
My old ISP required PPPoE, my new ISP doesn't (uses DHCP)
Since moving to my new ISP, I have been 100% unable to reproduce this problem.
I have run my previously easy-to-reproduce test 100 times and I can't
reproduce it.

I wonder if the issue that I was encountering was related to this fix I see in
6.6.29, and moving away from PPPoE has stopped the problem from appearing?

---- begin ----

commit 4ed82dd368ad883dc4284292937b882f044e625d
Author: Pablo Neira Ayuso <pablo at netfilter.org>
Date:   Thu Apr 11 00:09:00 2024 +0200

    netfilter: flowtable: incorrect pppoe tuple

    [ Upstream commit 6db5dc7b351b9569940cd1cf445e237c42cd6d27 ]

    pppoe traffic reaching ingress path does not match the flowtable entry
    because the pppoe header is expected to be at the network header offset.
    This bug causes a mismatch in the flow table lookup, so pppoe packets
    enter the classical forwarding path.

--- end ----

bugzilla-daemon at netfilter.org

2024-May-02 07:36 UTC

[Bug 1743] Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

--- Comment #4 from Pablo Neira Ayuso <pablo at netfilter.org> ---
Hi,

flowtable PPPoE was broken in software mode.

The flow entry was created in the flowtable, but it did not match. That is,
listing with conntrack -L shows an entry with the OFFLOAD flag, but it was
never matched, and you still see packets hitting the forward chain, which is
not correct. Once the flowtable fast path is set up, packets are seen at the
ingress and egress hooks.

Basically, PPPoE encapsulated packets were pushed back to the classic path
because the tuple was not correctly set up; only one direction of the flow
followed the fast path.

I managed to reproduce this in a small testbed with a PPPoE server/client,
hence the fix I posted.

I have a more permanent testbed to test PPPoE. It would be good to integrate
this into a script that can run in nftables tests/shell with containers, to
make sure this does not break again in the future. I have to look into this.

Please, note that this patch is also convenient to have for those that require
PPPoE:

From: Pablo Neira Ayuso <pablo at netfilter.org>

[ Upstream commit 87b3593bed1868b2d9fe096c01bcdf0ea86cbebf ]

Ensure there is sufficient room to access the protocol field of the
PPPoe header. Validate it once before the flowtable lookup, then use a
helper function to access protocol field.

Reported-by: syzbot+b6f07e1c07ef40199081 at syzkaller.appspotmail.com
Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo at netfilter.org>
Signed-off-by: Sasha Levin <sashal at kernel.org>

These two patches have been enqueued to -stable kernels.
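
To illustrate what both commit messages are concerned with, here is a Python sketch of where the inner IP header sits in a PPPoE session frame and why the length has to be checked before reading the PPP protocol field. It is not the kernel code. It relies only on the standard frame layout: ethertype 0x8864, a 6-byte PPPoE header, then a 2-byte PPP protocol field (0x0021 for IPv4, 0x0057 for IPv6). The function name and the test frame are hypothetical.

```python
import struct

ETH_HLEN = 14            # dst MAC (6) + src MAC (6) + ethertype (2)
ETH_P_PPP_SES = 0x8864   # ethertype of PPPoE session frames
PPPOE_HLEN = 6           # ver/type (1) + code (1) + session id (2) + length (2)
PPP_PROTO_LEN = 2        # PPP protocol field following the PPPoE header
PPP_IP = 0x0021
PPP_IPV6 = 0x0057


def inner_l3_offset(frame: bytes):
    """Return (offset_of_ip_header, ppp_protocol), or None if not usable.

    The length is validated before the PPP protocol field is read, so a
    truncated frame is rejected instead of being read out of bounds.
    """
    if len(frame) < ETH_HLEN:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_PPP_SES:
        return None
    need = ETH_HLEN + PPPOE_HLEN + PPP_PROTO_LEN
    if len(frame) < need:
        return None  # not enough room for the protocol field
    (ppp_proto,) = struct.unpack_from("!H", frame, ETH_HLEN + PPPOE_HLEN)
    if ppp_proto not in (PPP_IP, PPP_IPV6):
        return None
    return need, ppp_proto


# Hypothetical frame: zeroed MACs, PPPoE session 1, IPv4 payload placeholder.
frame = (
    bytes(12)
    + struct.pack("!H", ETH_P_PPP_SES)
    + bytes([0x11, 0x00])            # version 1, type 1, code 0 (session data)
    + struct.pack("!HH", 0x0001, 22)  # session id, payload length
    + struct.pack("!H", PPP_IP)
    + bytes(20)                       # stand-in for an IPv4 header
)
print(inner_l3_offset(frame))
```

A lookup that assumes the IP header starts directly after the Ethernet header would read the wrong bytes for such a frame, which is the kind of mismatch the first commit message describes.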

bugzilla-daemon at netfilter.org

2024-May-02 18:11 UTC

[Bug 1743] Flowtable: Flows exiting OFFLOAD State being assigned value of nf_conntrack_tcp_timeout_unacknowledged

https://bugzilla.netfilter.org/show_bug.cgi?id=1743

Tim Harman <tim at muppetz.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Tim Harman <tim at muppetz.com> ---
Hi Pablo,

Thank you very much for your detailed explanation of the issue and obviously
for the patches you've issued that have made it into the latest stable
kernel.

When I get a chance I will test PPPoE with a v6.6.29+ kernel.

But I think it's pretty safe to say this can be closed.

Thanks again for all your hard work on the Netfilter system.
