Julien Charbon
2017-Aug-28 09:27 UTC
mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Hi Ben,

On 8/28/17 10:25 AM, Ben RUBSON wrote:
>> On 16 Aug 2017, at 11:02, Ben RUBSON <ben.rubson at gmail.com> wrote:
>>
>>> On 15 Aug 2017, at 23:33, Julien Charbon <jch at freebsd.org> wrote:
>>>
>>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch at freebsd.org> wrote:
>>>>>
>>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>>>
>>>>>> Suggested fix attached.
>>>>>
>>>>> I agree with your conclusion. Just for the record, more precisely this
>>>>> regression seems to have been introduced with:
>>>>> (...)
>>>>> Thus good catch, and your patch looks good. I am going to just verify
>>>>> the other in_pcbrele_wlocked() calls in TCP stack.
>>>>
>>>> Julien, do you plan to make this fix reach 11.0-p12 ?
>>>
>>> I am checking if your issue is another flavor of the issue fixed by:
>>>
>>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>>> https://reviews.freebsd.org/D8211
>>>
>>> This fix is not in 11.0 but in 11.1. Currently I did not find how an
>>> inp in INP_TIMEWAIT state can have been INP_FREED without having its tw
>>> set to NULL already except the issue fixed by r307551.
>>>
>>> Thus could you try to apply this patch:
>>>
>>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>>>
>>> and see if you can still reproduce this issue?
>>
>> Thank you for your answer Julien.
>> Unfortunately, I'm not sure at all how to reproduce the issue.
>> I have other servers which are 100% identical to this one, same workload,
>> same some-months uptime, but they did not trigger the bug yet.
>>
>> If other network stack experts (I'm not) agree with your analysis,
>> we could then certainly go further with D8211 / r307551.
>>
>> One thing that perhaps might help :
>> # netstat -an | grep TIME_WAIT$ | wc -l
>> 468
>>
>> Note that due to this running bug, sendmail has lots of difficulties to send outgoing mails.
>> As soon as I run the above netstat command, I receive a lot of stacked mails (more than 20 this time).
>> As if netstat was able to somehow help...
>>
>> Number of TIME_WAIT connections however does not decrease, but increases.
>>
>>> And in the spirit of r307551 fix and based on Hans patch I will also
>>> propose to add a kernel log describing the issue instead of starting an
>>> infinite loop when INVARIANTS is not set.
>>
>> Which should then never be triggered :)
>> Good idea I think !
>
> What about :
> D8211/r307551
> + Hans' patch
> + Julien's idea of a kernel log (sort of "We should not be here but we are")

I did this change and I am testing it. On your side, did you try this
patch applied on 11.0?

https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch

> And backporting all this to 11.0 (and so to 11.1 too) ?
>
> As this bug can impact every FreeBSD machine / server,
> leading to an unavailable / unreachable system (this is how mine ended),
> sounds like it could inevitably be a good thing, for production stability purpose.

The main fix for your issue is (I believe):

  Fix a double-free when an inp transitions to INP_TIMEWAIT state
  after having been dropped.
  https://svnweb.freebsd.org/base?view=revision&revision=307551

This fix has been MFC-ed to both stable/11 and stable/10, is already
included in 11.1, and will be in 10.4. To push it into the 11.0 release
directly, I guess you have to promote this change to an Errata (never did
that myself):

https://www.freebsd.org/security/notices.html
https://www.freebsd.org/security/security.html#reporting

While waiting for this errata to be accepted, it is better to use the patch.

My 2 cents.

--
Julien
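The TIME_WAIT count quoted above (`netstat -an | grep TIME_WAIT$ | wc -l`) is easier to follow over days or weeks if it is scripted. The following is only a minimal Python sketch, assuming `netstat` is in PATH and that the TCP state is the last column of each line, as in the FreeBSD output format; the function names are hypothetical and not part of any FreeBSD tool.

```python
import subprocess


def count_time_wait(netstat_output: str) -> int:
    """Count lines whose last whitespace-separated field is TIME_WAIT.

    This mirrors `grep TIME_WAIT$ | wc -l` on the text of `netstat -an`.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "TIME_WAIT":
            count += 1
    return count


def sample_time_wait() -> int:
    """Run `netstat -an` once and return the TIME_WAIT count.

    Assumption: netstat is installed and prints the state in the last column.
    """
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return count_time_wait(result.stdout)
```

Calling `sample_time_wait()` from a periodic job and recording the value would show whether the number keeps growing, which is the symptom reported in this thread.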
Ben RUBSON
2017-Aug-31 16:04 UTC
mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
> On 28 Aug 2017, at 11:27, Julien Charbon <jch at freebsd.org> wrote:
>
> On 8/28/17 10:25 AM, Ben RUBSON wrote:
>> (...)
>>
>> What about :
>> D8211/r307551
>> + Hans' patch
>> + Julien's idea of a kernel log (sort of "We should not be here but we are")
>
> I did this change and I am testing it

Good news !

> on your side did you try this patch applied on 11.0?
>
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch

Yes, patch applied and running correctly,
however hard to say whether or not it solves this issue,
as there is no easy way to reproduce it.

>> And backporting all this to 11.0 (and so to 11.1 too) ?
>>
>> As this bug can impact every FreeBSD machine / server,
>> leading to an unavailable / unreachable system (this is how mine ended),
>> sounds like it could inevitably be a good thing, for production stability purpose.
>
> The main fix for your issue is (I believe):
>
> Fix a double-free when an inp transitions to INP_TIMEWAIT state
> after having been dropped.
> https://svnweb.freebsd.org/base?view=revision&revision=307551
>
> This fix has been MFC-ed to both stable/11 and stable/10, is already
> included in 11.1, and will be in 10.4. To push it into the 11.0 release
> directly, I guess you have to promote this change to an Errata (never did
> that myself):
>
> https://www.freebsd.org/security/notices.html
> https://www.freebsd.org/security/security.html#reporting

Mail sent to FreeBSD Security Team !

Many thanks, let's stay tuned !

Ben
Julien Charbon
2017-Sep-07 15:22 UTC
mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Hi Ben,

On 8/31/17 12:04 PM, Ben RUBSON wrote:
>> On 28 Aug 2017, at 11:27, Julien Charbon <jch at freebsd.org> wrote:
>>
>> On 8/28/17 10:25 AM, Ben RUBSON wrote:
>>> (...)
>>>
>>> What about :
>>> D8211/r307551
>>> + Hans' patch
>>> + Julien's idea of a kernel log (sort of "We should not be here but we are")
>>
>> I did this change and I am testing it
>
> Good news !
>
>> on your side did you try this patch applied on 11.0?
>>
>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>
> Yes, patch applied and running correctly,
> however hard to say whether or not it solves this issue,
> as there is no easy way to reproduce it.

No problem, it is just a matter of not seeing the issue anymore during
a long enough period. I created a review that includes Hans's patch and
uses the same log(LOG_ERR) logic as r307551:

https://reviews.freebsd.org/D12267

On my side, TCP smoke tests are ok. And I am going to launch our TCP QA
on it while receiving review comments.

> Mail sent to FreeBSD Security Team !
>
> Many thanks, let's stay tuned !

Thanks to you and Hans for reporting that issue. And in summary:

 - Applying r307551 on top of 11.0 should prevent this case from happening
 - D12267 will prevent the tcp_tw_2msl_scan() infinite loop while
   reporting the error, in case a regression defeating r307551 is
   introduced

Thanks.

--
Julien
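The guard described in the summary can be illustrated with a small user-space model. This is only a sketch in Python, not the kernel code from D12267: the `Entry` class, its `stale` flag, and `scan()` are hypothetical stand-ins for a time-wait queue whose head entry is in an unexpected state. A loop that skips such an entry without removing it would pick the same head again forever; with the guard, the scan logs the problem and stops.

```python
import logging
from collections import deque
from dataclasses import dataclass

log = logging.getLogger("tw_scan_model")


@dataclass
class Entry:
    name: str
    expired: bool
    # Hypothetical flag: the entry is still queued although its owner
    # was already released, i.e. the "should not be here" case.
    stale: bool = False


def scan(queue):
    """Reap expired entries from the head of the queue.

    Returns (reaped_names, hit_stale). When a stale head is found, the
    scan logs an error and stops instead of retrying the same head.
    """
    reaped = []
    while queue:
        head = queue[0]
        if not head.expired:
            break  # queue is time-ordered; nothing further has expired
        if head.stale:
            # Guard: report and leave the loop rather than spin on this head.
            log.error("stale entry %s at head of queue, stopping scan", head.name)
            return reaped, True
        queue.popleft()
        reaped.append(head.name)
    return reaped, False
```

In this model the stale entry stays on the queue, so the condition is reported again on the next scan rather than being hidden; whether the real patch behaves the same way should be checked against the review itself.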