Hi,

I have some recursive nameservers running unbound on 7.2-STABLE #0: Wed Sep 2 13:37:17 CEST 2009, on a bunch of HP BL460c machines (bce interfaces). These work OK.

During the process of migrating to 8.x, I've upgraded one of these machines to 8.0-STABLE #25: Tue Mar 9 18:15:34 CET 2010 (the dates indicate the approximate time when the source was checked out from cvsup.hu.freebsd.org; I don't know the exact revision).

The first problem was that the machine occasionally lost network access for some minutes. I could log in on the console and see the processes involved in network I/O stuck in "keglim" state, but couldn't do any network I/O. This lasted for some minutes, then everything came back to normal. I could fix this issue by raising kern.ipc.nmbclusters to 51200 (doubling its default size); since then I haven't seen these blackouts.

But now the machine freezes. It can run for about a day, and then it just freezes. I can't even break into the debugger by sending an NMI to it. top says:

last pid: 92428;  load averages: 0.49, 0.40, 0.38    up 0+21:13:18  07:41:43
43 processes:  2 running, 38 sleeping, 1 zombie, 2 lock
CPU:  1.3% user,  0.0% nice,  1.3% system, 26.0% interrupt, 71.3% idle
Mem: 1682M Active, 99M Inact, 227M Wired, 5444K Cache, 44M Buf, 5899M Free
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
45011 bind        4  49    0  1734M  1722M RUN     2  37:42 22.17% unbound
  712 bind        3  44    0 70892K 19904K uwait   0  71:07  3.86% python2.6

The common element in these freezes seems to be the high interrupt count. Normally, under load the CPU times look like this:

CPU:  3.5% user,  0.0% nice,  1.8% system,  0.4% interrupt, 94.4% idle

I could observe a "freeze" where top remained running and everything was 0% except interrupt, which was exactly 25% (the machine has four cores), and another where I could save the following console output:

CPU:  0.0% user,  0.0% nice,  0.2% system, 50.0% interrupt, 49.8% idle
.......(partial, broken line)....32M  2423M *udp    1  50:16 10.89% unbound
  714 bind        3  44    0 70892K 26852K uwait   3   8:41  4.69% python2.6
61004 root        1  62    0 37428K 10876K *udp    1   0:00  1.56% python
  706 root        1  44    0  2696K   624K piperd  1   0:07  0.00% readproctit

Both unbound and python accept DNS requests, and it seems that when 25% interrupt happens, only unbound is in the *udp state, while at 50% both programs are in that state.
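[For reference, a minimal sketch of how the kern.ipc.nmbclusters change described above can be inspected and made persistent. These are standard FreeBSD commands; the value 51200 is simply the one mentioned above, and whether the sysctl is writable at runtime depends on the release, so the loader.conf entry is the safe route:

  # current limit and mbuf cluster usage
  sysctl kern.ipc.nmbclusters
  netstat -m

  # raise it at runtime (on releases where the sysctl is writable)
  sysctl kern.ipc.nmbclusters=51200

  # make it persistent across reboots via /boot/loader.conf
  kern.ipc.nmbclusters="51200"

Watching netstat -m during load shows whether the cluster pool is actually being exhausted, which is what the "keglim" state suggests.]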
--On Thursday, March 25, 2010 3:22 PM +0100 Attila Nagy <bra@fsn.hu> wrote:

<...>
> Both unbound and python accept DNS requests, and it seems that when 25%
> interrupt happens, only unbound is in the *udp state, while at 50% both
> programs are in that state.

Try turning off hardware TSO/checksum offload, if it's available on your chipset?

ifconfig <interface> -rxcsum -txcsum -tso

I'm only using nfe chips right now, but with TSO/CSUM on they lock up constantly under high load. We're pretty sure it's mostly the nfe driver, or the chips themselves, but have never ruled out some generic 8.x hardware offload issue.
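[A hedged sketch of how the suggested offload change might be tested on the original poster's hardware: bce0 is an assumed interface name (the report only says "bce interfaces"), and the rc.conf line assumes a static address, with 192.0.2.10/24 as a placeholder:

  # see which offload options are currently enabled
  ifconfig bce0 | grep options

  # disable RX/TX checksum offload and TSO for testing
  ifconfig bce0 -rxcsum -txcsum -tso

  # if it helps, persist the setting in /etc/rc.conf
  ifconfig_bce0="inet 192.0.2.10/24 -rxcsum -txcsum -tso"

Disabling the offloads one at a time would also narrow down which feature, if any, is implicated.]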
On Thu, Mar 25, 2010 at 03:22:04PM +0100, Attila Nagy wrote:
> Hi,
>
> I have some recursive nameservers running unbound on 7.2-STABLE #0:
> Wed Sep 2 13:37:17 CEST 2009, on a bunch of HP BL460c machines (bce
> interfaces). These work OK.
<...>
> The common element in these freezes seems to be the high interrupt count.
> Normally, under load the CPU times look like this:
> CPU:  3.5% user,  0.0% nice,  1.8% system,  0.4% interrupt, 94.4% idle
>
> I could observe a "freeze" where top remained running and everything was
> 0% except interrupt, which was exactly 25% (the machine has four cores),
> and another where I could save the following console output:
> CPU:  0.0% user,  0.0% nice,  0.2% system, 50.0% interrupt, 49.8% idle

When you see the high number of interrupts, could you check whether they come from bce(4)? I guess you can use systat(1) to check how many interrupts are generated by bce(4).

> .......(partial, broken line)....32M  2423M *udp    1  50:16 10.89% unbound
>   714 bind        3  44    0 70892K 26852K uwait   3   8:41  4.69% python2.6
> 61004 root        1  62    0 37428K 10876K *udp    1   0:00  1.56% python
>   706 root        1  44    0  2696K   624K piperd  1   0:07  0.00% readproctit
>
> Both unbound and python accept DNS requests, and it seems that when 25%
> interrupt happens, only unbound is in the *udp state, while at 50% both
> programs are in that state.
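[A small sketch of how the interrupt source could be checked, assuming the suspect interfaces are the bce ones from the original report; both commands are standard FreeBSD utilities:

  # live per-device interrupt rates, updated every second
  systat -vmstat 1

  # cumulative interrupt counts and average rates per source
  vmstat -i | grep -i bce

If the bce rows dominate during a "freeze", that points at the driver or the NIC rather than at unbound itself.]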