thr3ads.net - freebsd stable - mlx4en, timer irq @100%... (11.0 stuck on high network load ???) [Aug 2017]

Ben RUBSON

2017-Aug-16 09:02 UTC

mlx4en, timer irq @100%... (11.0 stuck on high network load ???)

> On 15 Aug 2017, at 23:33, Julien Charbon <jch at freebsd.org> wrote:
> 
> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>> 
>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>> 
>>>> Suggested fix attached.
>>> 
>>> I agree with your conclusion.  Just for the record, more precisely this
>>> regression seems to have been introduced with:
>>> (...)
>>> Thus good catch, and your patch looks good.  I am going to just verify
>>> the other in_pcbrele_wlocked() calls in TCP stack.
>> 
>> Julien, do you plan to make this fix reach 11.0-p12 ?
> 
> I am checking if your issue is another flavor of the issue fixed by:
> 
> https://svnweb.freebsd.org/base?view=revision&revision=307551
> https://reviews.freebsd.org/D8211
> 
> This fix is not in 11.0 but in 11.1.  Currently I did not find how an
> inp in INP_TIMEWAIT state can have been INP_FREED without having its tw
> set to NULL already except the issue fixed by r307551.
> 
> Thus could you try to apply this patch:
> 
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
> 
> and see if you can still reproduce this issue?

Thank you for your answer, Julien.
Unfortunately, I am not sure at all how to reproduce the issue.
I have other servers that are 100% identical to this one, with the same
workload and the same uptime of several months, but they have not
triggered the bug yet.

If other network stack experts (I'm not) agree with your analysis,
we could then certainly go further with D8211 / r307551.

One thing that perhaps might help :
# netstat -an | grep TIME_WAIT$ | wc -l
468

Note that because of this ongoing bug, sendmail has great difficulty sending
outgoing mail.
As soon as I run the above netstat command, I receive a lot of backlogged
mails (more than 20 this time).
As if netstat were somehow able to help...

The number of TIME_WAIT connections, however, does not decrease; it increases.
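
A minimal sketch for tracking that count over time, building on the netstat command above. It assumes a POSIX shell; the function names count_time_wait and sample_time_wait and the 60-second default interval are hypothetical and not part of this thread:

```shell
# Count sockets in TIME_WAIT from `netstat -an` style output on stdin.
# Same idea as `grep TIME_WAIT$ | wc -l`, but tolerant of trailing spaces.
count_time_wait() {
    grep -c 'TIME_WAIT[[:space:]]*$'
}

# Hypothetical sampling loop: print a UTC timestamp and the current
# TIME_WAIT count every N seconds (default 60) until interrupted.
sample_time_wait() {
    interval="${1:-60}"
    while :; do
        printf '%s %s\n' \
            "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
            "$(netstat -an | count_time_wait)"
        sleep "$interval"
    done
}
```

For example, `netstat -an | count_time_wait` prints the current count, and `sample_time_wait 30` logs one line every 30 seconds, which makes it easier to see whether the number keeps growing.
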
> And in the spirit of r307551 fix and based on Hans patch I will also
> propose to add a kernel log describing the issue instead of starting an
> infinite loop when INVARIANT is not set.

Which should then never be triggered :)
Good idea, I think!

Thank you again !

Ben

Ben RUBSON

2017-Aug-28 08:25 UTC

mlx4en, timer irq @100%... (11.0 stuck on high network load ???)

> On 16 Aug 2017, at 11:02, Ben RUBSON <ben.rubson at gmail.com> wrote:
> 
>> On 15 Aug 2017, at 23:33, Julien Charbon <jch at freebsd.org> wrote:
>> 
>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>>> 
>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>> 
>>>>> Suggested fix attached.
>>>> 
>>>> I agree with your conclusion.  Just for the record, more precisely this
>>>> regression seems to have been introduced with:
>>>> (...)
>>>> Thus good catch, and your patch looks good.  I am going to just verify
>>>> the other in_pcbrele_wlocked() calls in TCP stack.
>>> 
>>> Julien, do you plan to make this fix reach 11.0-p12 ?
>> 
>> I am checking if your issue is another flavor of the issue fixed by:
>> 
>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>> https://reviews.freebsd.org/D8211
>> 
>> This fix is not in 11.0 but in 11.1.  Currently I did not find how an
>> inp in INP_TIMEWAIT state can have been INP_FREED without having its tw
>> set to NULL already except the issue fixed by r307551.
>> 
>> Thus could you try to apply this patch:
>> 
>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>> 
>> and see if you can still reproduce this issue?
> 
> Thank you for your answer, Julien.
> Unfortunately, I am not sure at all how to reproduce the issue.
> I have other servers that are 100% identical to this one, with the same
> workload and the same uptime of several months, but they have not
> triggered the bug yet.
> 
> If other network stack experts (I'm not) agree with your analysis,
> we could then certainly go further with D8211 / r307551.
> 
> One thing that perhaps might help :
> # netstat -an | grep TIME_WAIT$ | wc -l
> 468
> 
> Note that because of this ongoing bug, sendmail has great difficulty sending
> outgoing mail.
> As soon as I run the above netstat command, I receive a lot of backlogged
> mails (more than 20 this time).
> As if netstat were somehow able to help...
> 
> The number of TIME_WAIT connections, however, does not decrease; it increases.
> 
>> And in the spirit of r307551 fix and based on Hans patch I will also
>> propose to add a kernel log describing the issue instead of starting an
>> infinite loop when INVARIANT is not set.
> 
> Which should then never be triggered :)
> Good idea, I think!

What about:
D8211/r307551
+ Hans' patch
+ Julien's idea of a kernel log (sort of "We should not be here but we are")

And backporting all of this to 11.0 (and so to 11.1 too)?

As this bug can affect any FreeBSD machine or server and leave the system
unavailable or unreachable (this is how mine ended up), backporting sounds
like a good thing for production stability.

Thank you very much !

Ben
