Hello all,

About three weeks ago, I upgraded my 5.3-RELEASE boxes to 5.4-RELEASE. I also
turned on procmail globally on our mail server. Here is our current FreeBSD
server setup:

URANUS  - primary ldap
CALIBAN - secondary ldap
ORION   - primary mail

Orion was the first one to crash, about three weeks ago. Orion is constantly
talking to uranus, because uranus is our primary ldap server (we have a planet
naming scheme), and caliban is our secondary ldap server. I ran an email flood
test on orion to see if I could crash it again. This time, the high request
load on uranus caused uranus to crash. With two different servers on two
different hardware setups crashing, I had to start thinking about what could be
causing the problem.

Memory tests on both servers came back OK. Orion had some ECC errors which it
was able to correct. I wasn't able to catch orion's first crash, but I was able
to catch uranus's first crash:

http://paste.atopia.net/126

I have the other crashes written down in pencil at work; they all say mostly
the same thing. I assume caliban would also experience this behavior, but
because it receives very little load (it only does anything when uranus dies),
I am not able to confirm this.

The only thing the boxes have in common is that all three have two processors
and are running SMP. Orion had hyperthreading turned on, but I disabled it in
the BIOS, to no avail. Someone with similar experiences running SMP advised me
last week to upgrade to -STABLE. For almost a week, orion ran fine. This
evening, however, orion crashed once again, its fourth time in three weeks.
Uranus has been stable for a few days, but I am expecting it to crash again any
day now (the crashes usually come 4-6 days apart). So now I am stuck: I have
two -STABLE machines which continue to hit kernel traps.
Tomorrow, I am going to compile a debugging kernel on orion and let it crash
again to see what kind of errors it reports, but I was wondering if anyone
else is experiencing these problems.

Thanks in advance,

Matt Juszczak
On Mon, Jun 27, 2005 at 01:01:09AM -0400, Matt Juszczak wrote:
M> About three weeks ago, I upgraded my 5.3-RELEASE boxes to 5.4-RELEASE.
M> I also turned on procmail globally on our mail server. Here is our
M> current FreeBSD server setup:
M>
M> URANUS - primary ldap
M> CALIBAN - secondary ldap
M> ORION - primary mail
M>
M> Orion was the first one to crash, about three weeks ago. Orion is
M> constantly talking to uranus, because uranus is our primary ldap server
M> (we have a planet scheme), and caliban is our secondary ldap server. I
M> ran an email flood test on orion to see if I could crash it again. This
M> time, the high requests on Uranus caused Uranus to crash. With two
M> different servers on two different hardware setups crashing, I had to
M> start thinking of what could be causing the problem.
M>
M> Memory tests on both servers came back OK. Orion had some ECC errors
M> which it was able to fix. I wasn't able to catch orion's first crash,
M> but I was able to catch uranus's first crash:
M>
M> http://paste.atopia.net/126

Can you please build a kernel with debugging and obtain a crash dump?

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
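For readers following the thread, Glebius's request can be satisfied roughly as
follows on FreeBSD 5.x. This is a hedged sketch, not the poster's exact
procedure: the kernel config name MYKERNEL and the dump device are placeholders
for your own setup.

```shell
# Sketch: building a debugging kernel and enabling crash dumps on FreeBSD 5.x.
# "MYKERNEL" and the swap device below are assumptions; adjust for your box.

# 1. In the kernel config (/usr/src/sys/i386/conf/MYKERNEL), add:
#      makeoptions   DEBUG=-g      # compile the kernel with debug symbols
#      options       KDB           # kernel debugger framework
#      options       DDB           # interactive in-kernel debugger
cd /usr/src
make buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL

# 2. Point crash dumps at swap so savecore can collect them on the next boot
#    (lines for /etc/rc.conf):
#      dumpdev="/dev/ad0s1b"       # your swap partition
#      dumpdir="/var/crash"
```

After a panic, the dump lands in swap and `savecore` writes it to
`/var/crash/vmcore.N` during boot.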
On Wed, 6 Jul 2005, Kris Kennaway wrote:

> Please obtain the backtrace with kgdb.

Here you go:

[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
#0  doadump () at pcpu.h:159
159     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:159
#1  0xc044b006 in db_fncall (dummy1=0, dummy2=0, dummy3=-1067606609,
    dummy4=0xe4b6c9d0 "????(\205]?????????\222\a")
    at /usr/src5/sys/ddb/db_command.c:531
#2  0xc044ae14 in db_command (last_cmdp=0xc0674644, cmd_table=0x0,
    aux_cmd_tablep=0xc064226c, aux_cmd_tablep_end=0xc0642270)
    at /usr/src5/sys/ddb/db_command.c:349
#3  0xc044aedc in db_command_loop () at /usr/src5/sys/ddb/db_command.c:455
#4  0xc044ca75 in db_trap (type=12, code=0) at /usr/src5/sys/ddb/db_main.c:221
#5  0xc04e6599 in kdb_trap (type=12, code=0, tf=0xe4b6cb3c)
    at /usr/src5/sys/kern/subr_kdb.c:468
#6  0xc05f4c79 in trap_fatal (frame=0xe4b6cb3c, eva=36)
    at /usr/src5/sys/i386/i386/trap.c:812
#7  0xc05f43e9 in trap (frame=
      {tf_fs = -1040580584, tf_es = -1029439472, tf_ds = 16,
       tf_edi = -1038000128, tf_esi = -1066898900, tf_ebp = -457782384,
       tf_isp = -457782424, tf_ebx = -1040530304, tf_edx = -1040524364,
       tf_ecx = -1040524544, tf_eax = 0, tf_trapno = 12, tf_err = 0,
       tf_eip = -1068574101, tf_cs = 8, tf_eflags = 65683, tf_esp = 180,
       tf_ss = 0}) at /usr/src5/sys/i386/i386/trap.c:255
#8  0xc05e283a in calltrap () at /usr/src5/sys/i386/i386/exception.s:140
#9  0xc1fa0018 in ?? ()
#10 0xc2a40010 in ?? ()
#11 0x00000010 in ?? ()
#12 0xc2216000 in ?? ()
#13 0xc0686a2c in tcbinfo ()
#14 0xe4b6cb90 in ?? ()
#15 0xe4b6cb68 in ?? ()
#16 0xc1fac480 in ?? ()
#17 0xc1fadbb4 in ?? ()
#18 0xc1fadb00 in ?? ()
#19 0x00000000 in ?? ()
#20 0x0000000c in ?? ()
#21 0x00000000 in ?? ()
#22 0xc04eda6b in propagate_priority (td=0xc2216000)
    at /usr/src5/sys/kern/subr_turnstile.c:243
#23 0xc04ee225 in turnstile_wait (ts=0xc1fadb00, lock=0xc0686a2c,
    owner=0xc2216000) at /usr/src5/sys/kern/subr_turnstile.c:556
#24 0xc04c5ced in _mtx_lock_sleep (m=0xc0686a2c, td=0xc1fac480, opts=0,
    file=0x0, line=0) at /usr/src5/sys/kern/kern_mutex.c:552
#25 0xc0559ad8 in tcp_usr_rcvd (so=0x0, flags=0)
    at /usr/src5/sys/netinet/tcp_usrreq.c:602
#26 0xc0506103 in soreceive (so=0xc27bf798, psa=0x0, uio=0xe4b6cc88,
    mp0=0x0, controlp=0x0, flagsp=0x0)
    at /usr/src5/sys/kern/uipc_socket.c:1395
#27 0xc04f4bd9 in soo_read (fp=0x0, uio=0xe4b6cc88, active_cred=0xc2884a80,
    flags=0, td=0xc1fac480) at /usr/src5/sys/kern/sys_socket.c:91
#28 0xc04ee865 in dofileread (td=0xc1fac480, fp=0xc2e17bb0, fd=10, buf=0x0,
    nbyte=4096, offset=Unhandled dwarf expression opcode 0x93
    ) at file.h:233
#29 0xc04ee72f in read (td=0xc1fac480, uap=0xe4b6cd14)
    at /usr/src5/sys/kern/sys_generic.c:107
#30 0xc05f4fe7 in syscall (frame=
      {tf_fs = 47, tf_es = 47, tf_ds = -1078001617, tf_edi = 10,
       tf_esi = 300, tf_ebp = -1077942168, tf_isp = -457781900,
       tf_ebx = 134822152, tf_edx = 0, tf_ecx = 10, tf_eax = 3,
       tf_trapno = 0, tf_err = 2, tf_eip = 672556795, tf_cs = 31,
       tf_eflags = 658, tf_esp = -1077942212, tf_ss = 47})
    at /usr/src5/sys/i386/i386/trap.c:1009
#31 0xc05e288f in Xint0x80_syscall () at /usr/src5/sys/i386/i386/exception.s:201
#32 0x0000002f in ?? ()
#33 0x0000002f in ?? ()
#34 0xbfbf002f in ?? ()
#35 0x0000000a in ?? ()
#36 0x0000012c in ?? ()
#37 0xbfbfe868 in ?? ()
#38 0xe4b6cd74 in ?? ()
#39 0x08093908 in ?? ()
#40 0x00000000 in ?? ()
#41 0x0000000a in ?? ()
#42 0x00000003 in ?? ()
#43 0x00000000 in ?? ()
#44 0x00000002 in ?? ()
#45 0x281666fb in ?? ()
#46 0x0000001f in ?? ()
#47 0x00000292 in ?? ()
#48 0xbfbfe83c in ?? ()
#49 0x0000002f in ?? ()
#50 0x00000000 in ?? ()
#51 0x00000000 in ?? ()
#52 0x00000000 in ?? ()
#53 0x00000000 in ?? ()
#54 0x2c75b000 in ?? ()
#55 0xc22de000 in ?? ()
#56 0xc1fac480 in ?? ()
#57 0xe4b6ccac in ?? ()
#58 0xe4b6cc94 in ?? ()
#59 0xc1f26000 in ?? ()
#60 0xc04ded13 in sched_switch (td=0x12c, newtd=0x8093908,
    flags=Cannot access memory at address 0xbfbfe878
    ) at /usr/src5/sys/kern/sched_4bsd.c:881
Previous frame inner to this frame (corrupt stack?)
(kgdb) quit
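For reference, a backtrace like the one above is produced by pointing kgdb at
the debug kernel and the saved core. This is a generic sketch, not the
poster's exact command line; the kernel and vmcore paths are assumptions.

```shell
# Sketch: extracting a backtrace from a FreeBSD 5.x crash dump with kgdb.
# Paths are placeholders; savecore numbers dumps sequentially in /var/crash.
kgdb /usr/obj/usr/src/sys/MYKERNEL/kernel.debug /var/crash/vmcore.0

# Then, at the (kgdb) prompt:
#   bt          # full backtrace, as posted above
#   frame 22    # jump to an interesting frame, e.g. propagate_priority
#   list        # show the source lines around that frame
#   quit
```

The interesting frames here are #22-#25: a thread blocked in
`propagate_priority`/`turnstile_wait` while taking the `tcbinfo` mutex from
`tcp_usr_rcvd`, which is what the list regulars focus on next.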
On Tue, 12 Jul 2005, Matt Juszczak wrote:

> So far a 13 day up time after switching from IPF to PF. If thats not the
> problem, I hope I find it soon considering this is a production server ...
> but it seems to be more stable.

For me, 5 days of uptime after switching from IPF to PF. Before the switch, a
couple of hours of uptime was the maximum. It seems like the crashes are
caused by ipfilter.
> For me, 5 days up time after switching from IPF to PF. Before the switch a
> couple of hours of uptime was the maximum. Seems like the crashes are caused
> by ipfilter.

Still the same for me :) Uptime almost 20 days now after switching to PF.
On Mon, 18 Jul 2005 14:32:09 -0400 (EDT) Matt Juszczak <matt@atopia.net> wrote:

> > For me, 5 days up time after switching from IPF to PF. Before the switch a
> > couple of hours of uptime was the maximum. Seems like the crashes are
> > caused by ipfilter.
>
> Still same for me :) Uptime almost 20 days now after switching to PF.

I find these messages kind of weird. Are you saying your servers only stay up
for long periods with pf and *not* with ipf? I run a server and almost never
take it down. IPF performs very well, including a lot of NATting for my home
network.

-- 
dick -- http://nagual.st/ -- PGP/GnuPG key: F86289CE
++ Running FreeBSD 4.11-stable ++ FreeBSD 5.4
+ Nai tiruvantel ar vayuvantel i Valar tielyanna nu vilja
> I find this messages kind of weird. Are you saying your servers only run
> long periods of uptime with pf and *not* with ipf? I run a server and almost
> never put it down. IPF performs very well, including a lot of natting for my
> home network.

Correct. IPF is unstable, most of the time, with our SMP-based 5.x boxes.
VERY unstable. VERY VERY unstable.

-Matt
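For anyone wanting to try the same workaround the thread converged on, here is
a minimal sketch of switching from ipfilter to pf on FreeBSD 5.4. The rc.conf
knobs are the standard ones; the pf.conf rules shown are placeholders, not the
posters' actual policies.

```shell
# Sketch: replacing ipfilter with pf on FreeBSD 5.4.

# 1. In /etc/rc.conf, disable ipfilter/ipnat and enable pf:
#      ipfilter_enable="NO"
#      ipnat_enable="NO"
#      pf_enable="YES"
#      pf_rules="/etc/pf.conf"
#      pflog_enable="YES"

# 2. A minimal placeholder /etc/pf.conf (your real ruleset will differ):
#      set skip on lo0
#      pass in  all keep state
#      pass out all keep state

# 3. Validate and load the rules without rebooting:
pfctl -nf /etc/pf.conf   # -n: parse and validate only, load nothing
pfctl -f /etc/pf.conf    # load the ruleset
pfctl -e                 # enable pf
```

Translating an existing ipf.rules/ipnat.rules policy to pf syntax has to be
done by hand; the stateful `keep state` behavior above is pf's rough
equivalent of ipf's `keep state` rules.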