On Thursday, February 11, 2021 7:57:43 AM CET Helge Oldach wrote:
> Hi Stefan,
>
> Stefan Ehmann wrote on Thu, 11 Feb 2021 02:50:35 +0100 (CET):
> > On Wednesday, February 10, 2021 7:46:25 AM CET Helge Oldach wrote:
> > > Hi,
> > >
> > > Stefan Ehmann wrote on Tue, 09 Feb 2021 23:23:32 +0100 (CET):
> > > > I'm having issues with stale TCP connections after the upgrade from
> > > > 12.2 to 13.0-BETA1.
> > > >
> > > > Symptoms:
> > > > Outgoing TCP connections no longer receive data after being idle.
> > > >
> > > > I can do more testing later, but I think these ipfw rules trigger the
> > > > problem:
> > > > - check-state
> > > > - allow tcp from me to any setup keep-state
> > > > - deny ip from any to any
> > > >
> > > > After establishing an outgoing connection (e.g., via netcat), I see a
> > > > new dynamic rule and the 300s counter running down via
> > > > # ipfw -Da list
> > > >
> > > > net.inet.ip.fw.dyn_keepalive is set to 1, so the timer should be
> > > > refreshed via keep-alive on idle connections.
> > > >
> > > > Don't know if it's deterministic, but from what I've seen so far:
> > > > - When counter gets low the first time, it is reset to 300 as
> > > >   expected.
> > > > - When the counter nears zero for the second time, the dynamic rule is
> > > >   deleted and I get ipfw denies.
> > >
> > > I am afraid I can't reproduce. I have followed your test case however
> > > I'm seeing that a TCP keepalive reliably triggers a timer refresh. For
> > > example (sleep 1 loop over ipfw -Da list | grep):
> >
> > Tested in VirtualBox with amd64.vmdk from:
> >
> > https://download.freebsd.org/ftp/releases/VM-IMAGES/13.0-BETA1/
>
> We do agree on amd64, right?
>
> I precisely followed your steps (VirtualBox 6.1.18), except:
[...]
For some reason, the issue only occurs with bridged network, not NAT network
(virtualbox-ose-5.2.44_4).

>
> I am seeing keepalives every 5 minutes and the ipfw timer has fired
> every time, resetting the dynamic rule to 300 secs TTL. I am also seeing
> keepalives received and replied in the tcpdump. Everything according
> to the books I am afraid. My nc session is still sending after some 45
> minutes.
>
> > Updated to 187492ef639f, but nothing changed.
>
> Hmmm. I'm out of ideas. Are you 100% sure the remote session is not torn
> down routinely after something between 300-600 seconds silence?
Finally did a git bisect:
283c76c7c3f2f634f19f303a771a3f81fe890cab is the first bad commit

There is PR 252449, where the sysctl net.inet.tcp.tolerate_missing_ts was
introduced. As far as I understand, tolerate_missing_ts should only be
necessary for communicating with broken TCP stacks. I think there's another
problem, because I'm also seeing this issue using epair devices.
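
A minimal sketch of the setup described in this thread, for anyone trying to
reproduce it. The rule numbers (100, 200, 65000) are arbitrary assumptions and
do not come from the original report; only the rule bodies, the sysctl names
and the "ipfw -Da list" command are taken from the messages above. It needs
root on a FreeBSD host with ipfw loaded, and the final deny rule will cut off
inbound connections such as a remote SSH session, so run it from the console.

```sh
# Rule numbers are assumed for illustration; the rule bodies are from the thread.
ipfw add 100 check-state
ipfw add 200 allow tcp from me to any setup keep-state
ipfw add 65000 deny ip from any to any

# Confirm that keep-alives for dynamic rules are enabled (the report had 1).
sysctl net.inet.ip.fw.dyn_keepalive

# Show the current value of the knob mentioned in connection with PR 252449.
sysctl net.inet.tcp.tolerate_missing_ts

# Print the dynamic rules once per second to watch the TTL count down.
while :; do
    ipfw -Da list
    sleep 1
done
```

With an idle outgoing connection open (e.g., via netcat), the thing to watch is
whether the dynamic rule's counter is reset to 300 each time it runs low, or
whether the rule disappears on the second pass as reported.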