The following is a reproducible problem on a couple of our DNS servers
(one running 6.2-STABLE, one running 7.0-PRERELEASE):

pid 52308 (named), uid 53: exited on signal 6
Oct 18 12:10:21 anubis named[52308]: /usr/src/lib/bind/isc/../../../contrib/bind9/lib/isc/task.c:1238: INSIST((((manager->tasks).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Oct 18 12:10:21 anubis named[52308]: exiting (due to assertion failure)

The problem only occurs when using "/etc/rc.d/named restart". Doing a
manual "/etc/rc.d/named stop" then "/etc/rc.d/named start" does not
induce the problem.

There was one random Internet user who posted about the same issue:

http://forums.devshed.com/dns-36/weird-loggs-470845.html

There's nothing bizarre about our BIND configuration on these boxes.
I've re-written it (by hand) a couple times hoping it might be some
syntax problem or other oddity, but it doesn't appear to be. We're not
chrooting, and there's no jails. Only thing "non-standard" in rc.conf
that's named-related is named_flags="-4".

Both boxes exhibiting this problem are running on identical hardware
(C2Ds, same memory amount, etc.), with an SMP kernel. The 7.0 box uses
the ULE scheduler, while the 6.2 box uses the 4BSD scheduler. I mention
this because the master server (running 6.2-STABLE on different
hardware, non-SMP kernel, single-core P4 CPU) uses CPUTYPE?=prescott and
does not have this problem.

I haven't tried adding "-n 1" to named_flags to see if this is a BIND
worker thread problem.

I can't provide access to these boxes, but I can provide the
configuration files and zones (there are not many) to those I trust
(dougb@ that means you :) ). If a core is needed, I can likely get one
without too much trouble.

--
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |

On Thu, Oct 18, 2007 at 12:33:22PM -0700, Jeremy Chadwick wrote:
> The following is a reproducible problem on a couple of our DNS servers
> (one running 6.2-STABLE, one running 7.0-PRERELEASE):

Gack, there's an error in my report. I swore I was looking at the right
PuTTY window...

It's specific to two boxes running 6.2-STABLE whose hardware differs
completely. The rc.conf files are the same, but make.conf differs (box 1
uses CPUTYPE?=pentium3, box 2 uses CPUTYPE?=nocona). Both are using the
4BSD scheduler.

A third box running 6.2-STABLE (the one using the single-core P4 CPU and
CPUTYPE?=prescott, 4BSD scheduler) does not exhibit the problem.

I tried to reproduce the problem on our 7.0-PRERELEASE box (using
identical hardware to that of box 2) and couldn't. So it may indeed be
some BIND bug...

Jeremy,

I saw this on Thursday, but I also saw that Mark had answered you and I
had to focus on $REAL_LIFE so sorry for the delay.

On Thu, 18 Oct 2007, Jeremy Chadwick wrote:

> The following is a reproducible problem on a couple of our DNS servers
> (one running 6.2-STABLE, one running 7.0-PRERELEASE):
>
> pid 52308 (named), uid 53: exited on signal 6
> Oct 18 12:10:21 anubis named[52308]: /usr/src/lib/bind/isc/../../../contrib/bind9/lib/isc/task.c:1238: INSIST((((manager->tasks).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
> Oct 18 12:10:21 anubis named[52308]: exiting (due to assertion failure)
>
> The problem only occurs when using "/etc/rc.d/named restart". Doing a
> manual "/etc/rc.d/named stop" then "/etc/rc.d/named start" does not
> induce the problem.

I'm currently working on some improvements to the rc.d/named script that
should help with that issue (unrelated to the bug Mark mentioned in BIND
9.3.4).

> There was one random Internet user who posted about the same issue:
>
> http://forums.devshed.com/dns-36/weird-loggs-470845.html
>
> There's nothing bizarre about our BIND configuration on these boxes.
> I've re-written it (by hand) a couple times hoping it might be some
> syntax problem or other oddity, but it doesn't appear to be. We're not
> chrooting,

You probably should be. :) You're correct in thinking that it's not a
factor for this issue though.

> and there's no jails. Only thing "non-standard" in rc.conf that's
> named-related is named_flags="-4".

Yeah, that's both harmless and common.

> Both boxes exhibiting this problem are running on identical hardware
> (C2Ds, same memory amount, etc.), with an SMP kernel. The 7.0 box uses
> the ULE scheduler, while the 6.2 box uses the 4BSD scheduler. I mention
> this because the master server (running 6.2-STABLE on different
> hardware, non-SMP kernel, single-core P4 CPU) uses CPUTYPE?=prescott and
> does not have this problem.

If you're running on 6.x and/or BIND 9.3.x you should definitely not use
threads, and your idea of using -n1 is probably a good idea as well
(even if the bug were not present). I saw your followup to this post so
I'm a little unclear as to what hardware we're talking about, but if
you're using a dual core or SMP machine I strongly encourage you to
upgrade to 7.0 and BIND 9.4.1-P1. Both new versions have significant
improvements in how they handle threads, and Kris has done some great
work profiling that combination and shown that it significantly
outperforms 6.2 and 9.3.x.

> I can't provide access to these boxes, but I can provide the
> configuration files and zones (there are not many) to those I trust
> (dougb@ that means you :) ).

Heh, thanks.

hth,

Doug

--
This .signature sanitized for your protection
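
For anyone following along, the "-n 1" idea discussed in this thread would look roughly like the /etc/rc.conf fragment below. This is only a sketch, assuming the stock rc.d/named script passes named_flags through to named unchanged; it has not been tested on the affected boxes. The "-4" flag is the one already in use in the original report, and "-n 1" asks named for a single worker thread.

```sh
# /etc/rc.conf fragment (sketch only, untested on the affected boxes)
named_enable="YES"
# -4   : use IPv4 only (already present in the original report)
# -n 1 : create a single worker thread, to test the worker-thread theory
named_flags="-4 -n 1"
```

Since the report says the assertion fires only on "/etc/rc.d/named restart", it is probably safest to apply the change with a separate "/etc/rc.d/named stop" followed by "/etc/rc.d/named start".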