Hi,

I have some recursive nameservers running unbound on 7.2-STABLE #0: Wed Sep 2 13:37:17 CEST 2009, on a bunch of HP BL460c machines (bce interfaces). These work OK.

During the process of migrating to 8.x, I've upgraded one of these machines to 8.0-STABLE #25: Tue Mar 9 18:15:34 CET 2010 (the dates indicate the approximate time when the source was checked out from cvsup.hu.freebsd.org; I don't know the exact revision).

The first problem was that the machine occasionally lost network access for some minutes. I could log in on the console and see the processes involved in network I/O stuck in "keglim" state, but couldn't do any network I/O. This lasted for some minutes, then everything came back to normal. I could fix this issue by raising kern.ipc.nmbclusters to 51200 (doubling its default size); since then I haven't seen these blackouts.

But now the machine freezes. It can run for about a day, and then it just freezes. I can't even break into the debugger by sending an NMI to it. top says:

last pid: 92428;  load averages: 0.49, 0.40, 0.38    up 0+21:13:18  07:41:43
43 processes:  2 running, 38 sleeping, 1 zombie, 2 lock
CPU:  1.3% user,  0.0% nice,  1.3% system, 26.0% interrupt, 71.3% idle
Mem: 1682M Active, 99M Inact, 227M Wired, 5444K Cache, 44M Buf, 5899M Free
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
45011 bind        4  49    0  1734M  1722M RUN     2  37:42 22.17% unbound
  712 bind        3  44    0 70892K 19904K uwait   0  71:07  3.86% python2.6

The common element in these freezes seems to be the high interrupt count. Normally, under load the CPU times look like this:

CPU:  3.5% user,  0.0% nice,  1.8% system,  0.4% interrupt, 94.4% idle

I could observe a "freeze" where top remained running and everything was 0% except interrupt, which was exactly 25% (the machine has four cores), and another where I could save the following console output:

CPU:  0.0% user,  0.0% nice,  0.2% system, 50.0% interrupt, 49.8% idle
.......(partial, broken line)....32M  2423M *udp    1  50:16 10.89% unbound
  714 bind        3  44    0 70892K 26852K uwait   3   8:41  4.69% python2.6
61004 root        1  62    0 37428K 10876K *udp    1   0:00  1.56% python
  706 root        1  44    0  2696K   624K piperd  1   0:07  0.00% readproctit

Both unbound and python accept DNS requests, and it seems that when 25% interrupt happens, only unbound is in the *udp state, while at 50% both programs are in that state.
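[For reference, a minimal sketch of how the kern.ipc.nmbclusters change described above can be inspected and made persistent. These are standard FreeBSD commands; the value 51200 is simply the one mentioned above, and whether the sysctl is writable at runtime depends on the release, so the loader.conf entry is the safe route:

  # current limit and mbuf cluster usage
  sysctl kern.ipc.nmbclusters
  netstat -m

  # raise it at runtime (on releases where the sysctl is writable)
  sysctl kern.ipc.nmbclusters=51200

  # make it persistent across reboots via /boot/loader.conf
  kern.ipc.nmbclusters="51200"

Watching netstat -m during load shows whether the cluster pool is actually being exhausted, which is what the "keglim" state suggests.]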
--On Thursday, March 25, 2010 3:22 PM +0100 Attila Nagy <bra@fsn.hu> wrote:

<...>
> Both unbound and python accept DNS requests, and it seems that when 25%
> interrupt happens, only unbound is in the *udp state, while at 50% both
> programs are in that state.

Try turning off hardware TSO/checksum offload, if it's available on your chipset?

ifconfig <interface> -rxcsum -txcsum -tso

I'm only using nfe chips right now, but with TSO/CSUM on they lock up constantly under high load. We're pretty sure it's mostly the nfe driver, or the chips themselves, but have never ruled out some generic 8.x hardware offload issue.
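[A hedged sketch of how the suggested offload change might be tested on the original poster's hardware: bce0 is an assumed interface name (the report only says "bce interfaces"), and the rc.conf line assumes a static address, with 192.0.2.10/24 as a placeholder:

  # see which offload options are currently enabled
  ifconfig bce0 | grep options

  # disable RX/TX checksum offload and TSO for testing
  ifconfig bce0 -rxcsum -txcsum -tso

  # if it helps, persist the setting in /etc/rc.conf
  ifconfig_bce0="inet 192.0.2.10/24 -rxcsum -txcsum -tso"

Disabling the offloads one at a time would also narrow down which feature, if any, is implicated.]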
On Thu, Mar 25, 2010 at 03:22:04PM +0100, Attila Nagy wrote:
> Hi,
>
> I have some recursive nameservers running unbound on 7.2-STABLE #0:
> Wed Sep 2 13:37:17 CEST 2009, on a bunch of HP BL460c machines (bce
> interfaces). These work OK.
<...>
> The common element in these freezes seems to be the high interrupt count.
> Normally, under load the CPU times look like this:
> CPU:  3.5% user,  0.0% nice,  1.8% system,  0.4% interrupt, 94.4% idle
>
> I could observe a "freeze" where top remained running and everything was
> 0% except interrupt, which was exactly 25% (the machine has four cores),
> and another where I could save the following console output:
> CPU:  0.0% user,  0.0% nice,  0.2% system, 50.0% interrupt, 49.8% idle

When you see the high number of interrupts, could you check whether they come from bce(4)? I guess you can use systat(1) to check how many interrupts are generated by bce(4).

> .......(partial, broken line)....32M  2423M *udp    1  50:16 10.89% unbound
>   714 bind        3  44    0 70892K 26852K uwait   3   8:41  4.69% python2.6
> 61004 root        1  62    0 37428K 10876K *udp    1   0:00  1.56% python
>   706 root        1  44    0  2696K   624K piperd  1   0:07  0.00% readproctit
>
> Both unbound and python accept DNS requests, and it seems that when 25%
> interrupt happens, only unbound is in the *udp state, while at 50% both
> programs are in that state.
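[A small sketch of how the interrupt source could be checked, assuming the suspect interfaces are the bce ones from the original report; both commands are standard FreeBSD utilities:

  # live per-device interrupt rates, updated every second
  systat -vmstat 1

  # cumulative interrupt counts and average rates per source
  vmstat -i | grep -i bce

If the bce rows dominate during a "freeze", that points at the driver or the NIC rather than at unbound itself.]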