Julien Charbon
2017-Aug-15 21:33 UTC
mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Hi Ben,

On 8/11/17 11:32 AM, Ben RUBSON wrote:
>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>
>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>
>>> Suggested fix attached.
>>
>> I agree with your conclusion. Just for the record, more precisely this
>> regression seems to have been introduced with:
>> (...)
>> Thus good catch, and your patch looks good. I am just going to verify
>> the other in_pcbrele_wlocked() calls in the TCP stack.
>
> Julien, do you plan to make this fix reach 11.0-p12?

I am checking whether your issue is another flavor of the issue fixed by:

https://svnweb.freebsd.org/base?view=revision&revision=307551
https://reviews.freebsd.org/D8211

This fix is not in 11.0 but is in 11.1. So far I have not found how an
inp in INP_TIMEWAIT state could have been INP_FREED without its tw
already having been set to NULL, except through the issue fixed by r307551.

Could you therefore try applying this patch:

https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch

and see if you can still reproduce this issue?

In the spirit of the r307551 fix, and based on Hans's patch, I will also
propose adding a kernel log message describing the issue instead of
entering an infinite loop when INVARIANTS is not set.

--
Julien
Ben RUBSON
2017-Aug-16 09:02 UTC
> On 15 Aug 2017, at 23:33, Julien Charbon <jch at freebsd.org> wrote:
>
> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>>
>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>
>>>> Suggested fix attached.
>>>
>>> I agree with your conclusion. Just for the record, more precisely this
>>> regression seems to have been introduced with:
>>> (...)
>>> Thus good catch, and your patch looks good. I am just going to verify
>>> the other in_pcbrele_wlocked() calls in the TCP stack.
>>
>> Julien, do you plan to make this fix reach 11.0-p12?
>
> I am checking whether your issue is another flavor of the issue fixed by:
>
> https://svnweb.freebsd.org/base?view=revision&revision=307551
> https://reviews.freebsd.org/D8211
>
> This fix is not in 11.0 but is in 11.1. So far I have not found how an
> inp in INP_TIMEWAIT state could have been INP_FREED without its tw
> already having been set to NULL, except through the issue fixed by r307551.
>
> Could you therefore try applying this patch:
>
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>
> and see if you can still reproduce this issue?

Thank you for your answer, Julien.

Unfortunately, I am not at all sure how to reproduce the issue. I have
other servers that are 100% identical to this one (same workload, same
multi-month uptime), but they have not triggered the bug yet.

If other network stack experts (I am not one) agree with your analysis,
we could then certainly go further with D8211 / r307551.

One thing that might help:

# netstat -an | grep TIME_WAIT$ | wc -l
468

Note that because of this ongoing bug, sendmail has great difficulty
sending outgoing mail. As soon as I run the above netstat command, I
receive a batch of held-back mails (more than 20 this time), as if
netstat were somehow able to help...
The number of TIME_WAIT connections, however, does not decrease; it increases.

> In the spirit of the r307551 fix, and based on Hans's patch, I will also
> propose adding a kernel log message describing the issue instead of
> entering an infinite loop when INVARIANTS is not set.

Which should then never be triggered :) Good idea, I think!

Thank you again!

Ben