On Mon, Sep 5, 2016 at 10:46 AM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> On Mon, Sep 05, 2016 at 10:14:59AM -0600, Warner Losh wrote:
>
>> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
>> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>> >
>> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
>> >> > I am trying 11.0 on a dual E5-2620 (no X2APIC).
>> >> > Under high network load, and maybe some additional condition, the
>> >> > system goes into an unresponsive state -- no reaction to the network
>> >> > or the console (USB IPMI emulation). INVARIANTS gives too high an
>> >> > overhead. Is there some way to debug this?
>> >>
>> >> Can you panic it from console to get to db> to get a backtrace and
>> >> other info when it goes unresponsive?
>> >
>> > no
>> > no reaction
>>
>> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
>> get you to the debugger?
>
> I haven't tried that (and didn't know about it).
> Can you explain a bit?
The BMC sends the NMI to the CPU.
> Does FreeBSD by default catch the NMI and enter the debugger?
Yes.
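It needs KDB/DDB in the kernel (GENERIC should have them), and the
NMI -> panic -> ddb path is controlled by a couple of sysctls. Off the
top of my head (double-check the names on your release), something like
this in /etc/sysctl.conf:

    # panic on an NMI from the BMC, then drop into ddb instead of rebooting
    machdep.panic_on_nmi=1
    debug.debugger_on_panic=1

Then 'ipmitool chassis power diag' against the BMC should leave you at a
db> prompt on the console. You can dry-run the debugger path with
'sysctl debug.kdb.panic=1' on a box you don't mind panicking.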
> How does that interoperate with the USB stack (I worry the USB keyboard
> may be locked up)?
I've just done serial console, so I'm not sure. I think that it works...
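If you need to set up a serial console for this, something like the
usual loader.conf bits should be enough (adjust the speed to whatever
your BMC's serial-over-LAN expects):

    boot_multicons="YES"
    console="comconsole,vidconsole"
    comconsole_speed="115200"

That way ddb stays reachable over serial even if the USB keyboard is
wedged along with everything else.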
>> I've seen this at Netflix on one variant of our flash offload box with
>> an Intel E5-2697v2 running with the Chelsio driver. We're working
>> around it by having fewer receive threads than CPUs in the system. The
>> only way the boxes would come back was with the watchdog. The load was
>> streaming video > ~36Gbps out 4 lagged 10G ports. The console is totally
>> unresponsive as well. This is on our FreeBSD-10 stable based fork.
>> From my debugging, we go from totally fine as far as I can tell from
>> ps, etc in the moments leading to the hang to being totally wedged. It
>> seems a very sudden-onset condition. Sound at all familiar?
>>
>> Warner
>
> Not sure.
> This is a less powerful box and serves only 20Gbit, using Intel
> cards (lagg 2x10G). Until a day ago I was running 10-STABLE on this box
> without such an issue. (I don't remember clearly; maybe some months ago
> this box crashed with this issue -- at that time I didn't have any ideas
> about the crash.)
OK.
> Maybe the hang is caused by some bad (too big) memory request from nginx
> (an attempt to parse some malformed files). Or by frequent nginx core
> dumps (from these malformed files).
OK. We're using nginx too, with our modified sendfile.
> 11.0 on two different, more powerful boxes serves from 40 to 55Gbit
> without hanging. But without malformed files (i.e. without bogus memory
> requests and without nginx crashes). Not sure about the correlation.
In our case it seems like a timing issue between too many threads. The
same hardware can handle 1x40G no problem...
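If you want to try the same workaround, we just cap the number of NIC
queues below the CPU count with loader tunables. For the Chelsio driver
that's something like the lines below (example values, and the Intel
driver spells its queue-count tunable differently -- check the driver
man page on your release):

    # keep rx/tx queues per 10G port below the number of CPUs
    hw.cxgbe.nrxq10g="8"
    hw.cxgbe.ntxq10g="8"

No idea yet whether it's the same bug, but it's a cheap experiment.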
Warner