Hi guys,

I'm going to take another stab at getting some help.

For the last 6 months my FBSD gateway has been locking up every few days, usually about once a week. No panic, no reboots, just a hard lock with no response on the console or over the net.

I've replaced literally every piece of hardware with the exception of the case and power supply. No change.

I've upgraded from 5.3- to 6.0- to 6.1-STABLE. No change.

I've researched as much as I know how and still come up with hardly anything. I have turned on BREAK_TO_DEBUGGER, WITNESS and INVARIANTS and the only indication I've gotten is a lock order reversal that's *similar* to http://sources.zabbadoz.net/freebsd/lor/017.html. The line numbers in pf.c don't match up with LOR 017, but that's about all I can tell.

I'm reasonably certain the issue is with pf, since I have 3 other non-gateway servers humming along with no problems. The hardware is nearly identical - their RAID cards are different, but I've tried running my gateway on just a single SCSI drive and had the same lockup issue. Of course, the issue could be somewhere else, but I'm at a loss as to how to find it.

I'm running my console over serial so I can log anything that's necessary. I've been able to break to the debugger, but to be honest, I don't know what to look for. I've seen several posts on the lists about posting the output of debug commands, but I figured it to be in poor taste to just dump my output here before someone asked.

I'm getting a lot of heat from the boss since our VoIP phones don't work when the gateway locks up.

If someone can help identify and/or eliminate this issue, I'm more than happy to do everything I can to provide the necessary information.

Thanks.

On Sun, Jun 11, 2006 at 12:09:05PM -0600, Brad Waite wrote:
> Hi guys,
>
> I'm going to take another stab at getting some help.
>
> For the last 6 months my FBSD gateway has been locking up every few
> days, usually about once a week. No panic, no reboots, just a hard lock
> with no response on the console or over the net.
>
> I've replaced literally every piece of hardware with the exception of
> the case and power supply. No change.

One of my machines had random lockup problems, and then the power supply died. The problems were gone after it was replaced. You might want to swap out the power supply. And test the RAM.

> I've upgraded from 5.3- to 6.0- to 6.1-STABLE. No change.
>
> I've researched as much as I know how and still come up with hardly
> anything. I have turned on BREAK_TO_DEBUGGER, WITNESS and INVARIANTS
> and the only indication I've gotten is a lock order reversal that's
> *similar* to http://sources.zabbadoz.net/freebsd/lor/017.html. The line
> numbers in pf.c don't match up with LOR 017, but that's about all I can
> tell.
>
> I'm reasonably certain the issue is with pf, since I have 3 other
> non-gateway servers humming along with no problems. The hardware is
> nearly identical - their RAID cards are different, but I've tried
> running my gateway on just a single SCSI drive and had the same lockup
> issue. Of course, the issue could be somewhere else, but I'm at a loss
> as to how to find it.

Could you try swapping two machines? That would be the ultimate check of whether it's hardware related.

> I'm running my console over serial so I can log anything that's
> necessary. I've been able to break to the debugger, but to be honest, I
> don't know what to look for. I've seen several posts on the lists about
> posting the output of debug commands, but I figured it to be in poor
> taste to just dump my output here before someone asked.

Posting a stack backtrace (bt) might be a good start.

> I'm getting a lot of heat from the boss since our VoIP phones don't work
> when the gateway locks up.

Sometimes POTS isn't so bad after all. ;-)

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)

On Sun, Jun 11, 2006 at 12:09:05PM -0600, Brad Waite wrote:
> Hi guys,
>
> I'm going to take another stab at getting some help.
>
> For the last 6 months my FBSD gateway has been locking up every few
> days, usually about once a week. No panic, no reboots, just a hard lock
> with no response on the console or over the net.
>
> I've replaced literally every piece of hardware with the exception of
> the case and power supply. No change.
>
> I've upgraded from 5.3- to 6.0- to 6.1-STABLE. No change.
>
> I've researched as much as I know how and still come up with hardly
> anything. I have turned on BREAK_TO_DEBUGGER, WITNESS and INVARIANTS
> and the only indication I've gotten is a lock order reversal that's
> *similar* to http://sources.zabbadoz.net/freebsd/lor/017.html. The line
> numbers in pf.c don't match up with LOR 017, but that's about all I can
> tell.

We need to know the LOR before anyone can tell what is going wrong ;-)

Kris

Kris Kennaway wrote:
> We need to know the LOR before anyone can tell what is going wrong ;-)

Ask and you will receive:

lock order reversal:
 1st 0xc077a440 pf task mtx (pf task mtx) @ /usr/src/sys/contrib/pf/net/pf.c:6331
 2nd 0xc07d3fac tcp (tcp) @ /usr/src/sys/contrib/pf/net/pf.c:2719
KDB: stack backtrace:
witness_checkorder(c07d3fac,9,c06fd2f7,a9f) at witness_checkorder+0x55c
_mtx_lock_flags(c07d3fac,0,c06fd2f7,a9f,c07d3fac) at _mtx_lock_flags+0x40
pf_socket_lookup(e35ccacc,e35ccad0,1,e35ccb8c,0) at pf_socket_lookup+0x103
pf_test_tcp(e35ccb3c,e35ccb34,1,c4d3e400,c5027c00,14,c5032810,e35ccb8c,e35ccb40,e35ccb44,0,0) at pf_test_tcp+0x10d6
pf_test(1,c4ba3c00,e35ccc2c,0,0) at pf_test+0xb77
pf_check_in(0,e35ccc2c,c4ba3c00,1,0) at pf_check_in+0x37
pfil_run_hooks(c07d3b60,e35ccccc,c4ba3c00,1,0) at pfil_run_hooks+0xee
ip_input(c5027c00,18,c07d3138,e35cccec,c05dcd63) at ip_input+0x1b2
netisr_processqueue(c4b24500,c4b28000,0,e35ccd0c,c055877b) at netisr_processqueue+0xf
swi_net(0,c4b28038,c4ad6d80,c0558590,c4ad520c) at swi_net+0x8b
ithread_loop(c4ab68d0,e35ccd38,c4ab68d0,c0558590,0) at ithread_loop+0x1eb
fork_exit(c0558590,c4ab68d0,e35ccd38) at fork_exit+0x7d
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe35ccd6c, ebp = 0 ---
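
For anyone reading along who hasn't met a lock order reversal before, here is a rough sketch of the idea behind the report above. WITNESS remembers the order in which locks have been taken and complains when a later acquisition contradicts it. The toy checker below is NOT the kernel's WITNESS code; the class name `OrderChecker` and the "path A" ordering (tcp first, then pf) are assumptions made purely for illustration, since the report itself only shows the second path (pf task mtx first, then tcp).

```python
from collections import defaultdict


class OrderChecker:
    """Toy lock-order checker (hypothetical; not the FreeBSD WITNESS code)."""

    def __init__(self):
        # after[a] holds every lock that has been acquired while a was held
        self.after = defaultdict(set)
        self.held = []        # locks currently held by the one simulated thread
        self.reversals = []   # (1st, 2nd) pairs that contradict an earlier order

    def acquire(self, name):
        for h in self.held:
            # If h was once taken while 'name' was held, taking 'name'
            # while holding h is the reverse of that earlier order.
            if h in self.after[name]:
                self.reversals.append((h, name))
            self.after[h].add(name)
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


checker = OrderChecker()

# Path A (assumed for illustration): tcp lock first, then the pf lock.
checker.acquire("tcp")
checker.acquire("pf task mtx")
checker.release("pf task mtx")
checker.release("tcp")

# Path B (the order shown in the report): pf lock first, then tcp.
checker.acquire("pf task mtx")
checker.acquire("tcp")
checker.release("tcp")
checker.release("pf task mtx")

print(checker.reversals)  # [('pf task mtx', 'tcp')]
```

A reversal on its own is only a warning that two code paths disagree about ordering. It turns into a deadlock only if both paths run at the same time and each ends up holding the lock the other is waiting for, so the report is a lead rather than proof that pf is the cause of the hang.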