On Mon, Sep 5, 2016 at 10:46 AM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> On Mon, Sep 05, 2016 at 10:14:59AM -0600, Warner Losh wrote:
>
>> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
>> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>> >
>> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
>> >> > I am trying 11.0 on a dual E5-2620 (no X2APIC).
>> >> > Under high network load, and maybe some additional condition, the
>> >> > system goes into an unresponsive state -- no reaction to the network
>> >> > or the console (USB IPMI emulation). INVARIANTS gives too high an
>> >> > overhead. Is there some way to debug this?
>> >>
>> >> Can you panic it from console to get to db> to get a backtrace and
>> >> other info when it goes unresponsive?
>> >
>> > no
>> > no reaction
>>
>> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
>> get you to the debugger?
>
> I haven't tried that (and didn't know about it).
> Can you explain a bit?
The BMC sends the NMI to the CPU.
> Does FreeBSD by default catch the NMI and enter the debugger?
Yes.
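It needs KDB/DDB in the kernel (GENERIC should have them), and the
NMI -> panic -> ddb path is controlled by a couple of sysctls. Off the
top of my head (double-check the names on your release), something like
this in /etc/sysctl.conf:

    # panic on an NMI from the BMC, then drop into ddb instead of rebooting
    machdep.panic_on_nmi=1
    debug.debugger_on_panic=1

Then 'ipmitool chassis power diag' against the BMC should leave you at a
db> prompt on the console. You can dry-run the debugger path with
'sysctl debug.kdb.panic=1' on a box you don't mind panicking.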
> How does that interoperate with the USB stack (I worry the USB keyboard
> may be locked up)?
I've just done serial console, so I'm not sure. I think that it works...
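If you need to set up a serial console for this, something like the
usual loader.conf bits should be enough (adjust the speed to whatever
your BMC's serial-over-LAN expects):

    boot_multicons="YES"
    console="comconsole,vidconsole"
    comconsole_speed="115200"

That way ddb stays reachable over serial even if the USB keyboard is
wedged along with everything else.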
>> I've seen this at Netflix on one variant of our flash offload box with
>> an Intel E5-2697v2 running with the Chelsio driver. We're working
>> around it by having fewer receive threads than CPUs in the system. The
>> only way the boxes would come back was with the watchdog. The load was
>> streaming video > ~36Gbps out 4 lagged 10G ports. The console is totally
>> unresponsive as well. This is on our FreeBSD-10 stable based fork.
>> From my debugging, we go from totally fine as far as I can tell from
>> ps, etc in the moments leading to the hang to being totally wedged. It
>> seems a very sudden-onset condition. Sound at all familiar?
>>
>> Warner
>
> Not sure.
> This is a less powerful box and serves only 20Gbit, using Intel
> cards (lagg 2x10G). Until a day ago I was running 10-STABLE on this box
> without such an issue. (I don't remember clearly; maybe some months ago
> this box crashed with this issue -- at that time I didn't have any ideas
> about the crash.)
OK.
> Maybe the hang is caused by some bad (too big) memory request from nginx
> (an attempt to parse some malformed files). Or by frequent nginx core
> dumps (from these malformed files).
OK. We're using nginx too, with our modified sendfile.
> 11.0 on two different, more powerful boxes serves from 40 to 55Gbit
> without hanging. But without malformed files (i.e. without bogus memory
> requests and without nginx crashes). Not sure about the correlation.
In our case it seems like a timing issue between too many threads. The
same hardware can handle 1x40G no problem...
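If you want to try the same workaround, we just cap the number of NIC
queues below the CPU count with loader tunables. For the Chelsio driver
that's something like the lines below (example values, and the Intel
driver spells its queue-count tunable differently -- check the driver
man page on your release):

    # keep rx/tx queues per 10G port below the number of CPUs
    hw.cxgbe.nrxq10g="8"
    hw.cxgbe.ntxq10g="8"

No idea yet whether it's the same bug, but it's a cheap experiment.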
Warner