thr3ads.net - freebsd stable - 5.4-RELEASE lockups on amd64 SMP [Jun 2005]

If this information is useful, please help other people find it:
Share via:

Grooms, Matthew

2005-Jun-06 23:54 UTC

5.4-RELEASE lockups on amd64 SMP

My appologies. With the debug options listed in my previous post ( should have
read 5.4 not 5.3 ), I got a lock order reversal. After a while, it paniced and
spat out this ...

lock order reversal
1st 0xffffffff80752ec0 pf task mtx (pf task mtx) @
contrib/pf/net/if_pfsync.c:1621
2nd 0xffffffff8076e9f0 user map (user man) @ vm/vm_map.c:2998
KDB: stack backtrace:
witness_checkorder() at witness_checkorder+0x654
_sx_xlock() at _sx_xlock+0x51
vm_map_lookup() at vm_map_lookup+0x44
vm_fault() at vm_fault+0xba
trap() at trap+0x1c5
alltraps_with_regs_pushed() at alltraps_with_regs_pushed+0x5
pf_state_tree_lan_ext_RB_REMOVE() at pf_state_tree_lan_ext_RB_REMOVE+0x10c
pf_purge_expired_states() at pf_purge_expired_states+0xab
pfsync_input() at ip_input+0x10f
netisr_processqueue() at netisr_processqueue+0x17
swi_net() at swi_net+0xa8
ithread_loop() at ithread_loop+0xd9
fork_exit() at fork_exit+0xc3
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
KDB: enter: withness_ckeckorder
[thread pid 110 tid 100089]
Stopped at      kdb_enter+0x2f: nop
db> panic blockable sleep lock (sleep mutex) tty @ kern/kern_event.c:1453
cpuid = 0
boot() called on cpu#0
Uptime: 10m40s
Dumping 4864 mB
 16 32 .........

After a reboot, I received another panic.

Tracing pid 603 tid 100140 td 0xffffff012efda500
kdb_enter() at kdb_enter+02f
panic() at panic+0x249
ffs_blkfree() at ffs_blkfree+0x483
indir_trunc() at indir_trunc+0x190
indir_trunc() at indir_trunc+0x1fb
handle_workitem_freeblocks() at handle_workitem_freeblocks+0x228
softdep_setup_freeblocks() at softdep_setup_freeblocks+0x730
ffs_truncate() at ffs_truncate+0x1c9
ffs_snapshot() at ffs_snapshot+0x717
ffs_omount() at ffs_omount+0x16e
vfs_domount() at vfs_domount+0x5a0
mount() at mount+0xd8
syscall() at syscall+0x1fb
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall(21, FreeBSD ELF64, mount), rip = 0800697580, rsp = 0x7fffffffec58,
fbp = 0x515b10 ---

I am guessing this is related to background fsck processes being launched
because it happened consistently until I disabled background fsck and performed
one manually in single user mode. Now I can boot normally into multi user mode.

Not sure where to go from here except to watch the system and wait for more
kernel debug output.

BTW : To answer a reply to my previous post, I have 6 em interfaces.

-Matthew

-----Original Message-----
From: Grooms, Matthew
Sent: Mon 6/6/2005 12:06 PM
To: freebsd-stable@freebsd.org
Subject: Debug help - 5.3 lockups on amd64 SMP
 
All,

      I am experiencing lockups on a production 5.4 amd64 SMP system. 
Its lightly loaded and seems to last about 3-5 days before it stops 
responding to network or even console interaction. The system is acting 
as a firewall and runs a mostly stock kernel with IPV6 removed and SMP, 
PF, PFLOG, CARP and ALTQ added. The only other thing I can think to note 
is that tcpdump is running constantly on the pflog interface to coax 
human readable firewall logs out of pf.

     I have an identical hot spare server with SMP disabled that has 
taken over flawlessly every time the live lock occurs so I am willing to 
leave the primary in the production environment to do testing and gather 
debug info. I have added the following options to primary fw kernel 
config ...

# Debug Options
makeoptions     DEBUG=-g
options         DDB
options         KDB
options         BREAK_TO_DEBUGGER
options         INVARIANT_SUPPORT
options         INVARIANTS
options         WITNESS
options         WITNESS_KDB
options         WITNESS_SKIPSPIN

... and the following to the rc.conf ...

dumpdev="/dev/amrd0s1h"
dumpdir="/var/crash"

Will this do it or should I add anything else?

Thanks in advance,

-Matthew

Kris Kennaway

2005-Jun-06 23:57 UTC

head link

5.4-RELEASE lockups on amd64 SMP

On Mon, Jun 06, 2005 at 06:54:05PM -0500, Grooms, Matthew
wrote:> My appologies. With the debug options listed in my previous post ( should
have read 5.4 not 5.3 ), I got a lock order reversal. After a while, it paniced
and spat out this ...
> 
> lock order reversal
> 1st 0xffffffff80752ec0 pf task mtx (pf task mtx) @
contrib/pf/net/if_pfsync.c:1621
> 2nd 0xffffffff8076e9f0 user map (user man) @ vm/vm_map.c:2998
> KDB: stack backtrace:
> witness_checkorder() at witness_checkorder+0x654
> _sx_xlock() at _sx_xlock+0x51
> vm_map_lookup() at vm_map_lookup+0x44
> vm_fault() at vm_fault+0xba
> trap() at trap+0x1c5
> alltraps_with_regs_pushed() at alltraps_with_regs_pushed+0x5
> pf_state_tree_lan_ext_RB_REMOVE() at pf_state_tree_lan_ext_RB_REMOVE+0x10c
> pf_purge_expired_states() at pf_purge_expired_states+0xab
> pfsync_input() at ip_input+0x10f
> netisr_processqueue() at netisr_processqueue+0x17
> swi_net() at swi_net+0xa8
> ithread_loop() at ithread_loop+0xd9
> fork_exit() at fork_exit+0xc3
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
> KDB: enter: withness_ckeckorder
> [thread pid 110 tid 100089]
> Stopped at      kdb_enter+0x2f: nop
> db> panic blockable sleep lock (sleep mutex) tty @
kern/kern_event.c:1453
> cpuid = 0
> boot() called on cpu#0
> Uptime: 10m40s
> Dumping 4864 mB
>  16 32 .........
> 
> After a reboot, I received another panic.
> 
> Tracing pid 603 tid 100140 td 0xffffff012efda500
> kdb_enter() at kdb_enter+02f
> panic() at panic+0x249
> ffs_blkfree() at ffs_blkfree+0x483
> indir_trunc() at indir_trunc+0x190
> indir_trunc() at indir_trunc+0x1fb
> handle_workitem_freeblocks() at handle_workitem_freeblocks+0x228
> softdep_setup_freeblocks() at softdep_setup_freeblocks+0x730
> ffs_truncate() at ffs_truncate+0x1c9
> ffs_snapshot() at ffs_snapshot+0x717
> ffs_omount() at ffs_omount+0x16e
> vfs_domount() at vfs_domount+0x5a0
> mount() at mount+0xd8
> syscall() at syscall+0x1fb
> Xfast_syscall() at Xfast_syscall+0xa8
> --- syscall(21, FreeBSD ELF64, mount), rip = 0800697580, rsp =
0x7fffffffec58, fbp = 0x515b10 ---
> 
> I am guessing this is related to background fsck processes being launched
because it happened consistently until I disabled background fsck and performed
one manually in single user mode. Now I can boot normally into multi user mode.
You may have had disk corruption that was not repaired by bg fsck.

Kris
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20050606/04e41803/attachment.bin

Grooms, Matthew

2005-Jun-07 16:36 UTC

head link

5.4-RELEASE lockups on amd64 SMP

All,

     Here is an update with more info. In addition to the lock order reversal,
this is the third panic that I have seen that looked like this ...

Tracing id 110 tid 100089 td 0xffffff012f3f0c80
kdb_enter() at kdb_enter+0x2f
panic() at panic+0x249
uma_dbg_free() at uma_dbg_free+0x188
uma_zfree_arg() at uma_zfree_arg+0x1b0
pf_purge_expired_states() at pf_purge_expired_states+0x41
pfsync_input at pfsync_input+xb35
pf_input() at ip_input+0x10f
netisr_processqueue() at netisr_processqueue+0x17
swi_net() at swi_net+0xa8
ithread_loop() at ithread_loop+0xd9
fork_exit() at fork_exit+0xc3
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
db> continue
boot() called on cpu#0
Uptime: 13h42m43s
Dumping 4864 MB
 16 32 ...

     I was hoping to get a crash dump but unfortunately can't seem to get
one to complete. In any case, this particular install is now toast. Im surprised
it lasted as long as it did considering all the pushishment it took. Im not a
kernel hacker, but it would seem to me that somthing is up with pfsync. Should
it matter that I am running an AMD64 kernel in SMP mode?

Matthew Grooms

-----Original Message-----
From: Grooms, Matthew
Sent: Mon 6/6/2005 6:54 PM
To: freebsd-stable@freebsd.org
Subject: 5.4-RELEASE lockups on amd64 SMP
 
My appologies. With the debug options listed in my previous post ( should have
read 5.4 not 5.3 ), I got a lock order reversal. After a while, it paniced and
spat out this ...

lock order reversal
1st 0xffffffff80752ec0 pf task mtx (pf task mtx) @
contrib/pf/net/if_pfsync.c:1621
2nd 0xffffffff8076e9f0 user map (user man) @ vm/vm_map.c:2998
KDB: stack backtrace:
witness_checkorder() at witness_checkorder+0x654
_sx_xlock() at _sx_xlock+0x51
vm_map_lookup() at vm_map_lookup+0x44
vm_fault() at vm_fault+0xba
trap() at trap+0x1c5
alltraps_with_regs_pushed() at alltraps_with_regs_pushed+0x5
pf_state_tree_lan_ext_RB_REMOVE() at pf_state_tree_lan_ext_RB_REMOVE+0x10c
pf_purge_expired_states() at pf_purge_expired_states+0xab
pfsync_input() at ip_input+0x10f
netisr_processqueue() at netisr_processqueue+0x17
swi_net() at swi_net+0xa8
ithread_loop() at ithread_loop+0xd9
fork_exit() at fork_exit+0xc3
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
KDB: enter: withness_ckeckorder
[thread pid 110 tid 100089]
Stopped at      kdb_enter+0x2f: nop
db> panic blockable sleep lock (sleep mutex) tty @ kern/kern_event.c:1453
cpuid = 0
boot() called on cpu#0
Uptime: 10m40s
Dumping 4864 mB
 16 32 .........

After a reboot, I received another panic.

Tracing pid 603 tid 100140 td 0xffffff012efda500
kdb_enter() at kdb_enter+02f
panic() at panic+0x249
ffs_blkfree() at ffs_blkfree+0x483
indir_trunc() at indir_trunc+0x190
indir_trunc() at indir_trunc+0x1fb
handle_workitem_freeblocks() at handle_workitem_freeblocks+0x228
softdep_setup_freeblocks() at softdep_setup_freeblocks+0x730
ffs_truncate() at ffs_truncate+0x1c9
ffs_snapshot() at ffs_snapshot+0x717
ffs_omount() at ffs_omount+0x16e
vfs_domount() at vfs_domount+0x5a0
mount() at mount+0xd8
syscall() at syscall+0x1fb
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall(21, FreeBSD ELF64, mount), rip = 0800697580, rsp = 0x7fffffffec58,
fbp = 0x515b10 ---

I am guessing this is related to background fsck processes being launched
because it happened consistently until I disabled background fsck and performed
one manually in single user mode. Now I can boot normally into multi user mode.

Not sure where to go from here except to watch the system and wait for more
kernel debug output.

BTW : To answer a reply to my previous post, I have 6 em interfaces.

-Matthew

-----Original Message-----
From: Grooms, Matthew
Sent: Mon 6/6/2005 12:06 PM
To: freebsd-stable@freebsd.org
Subject: Debug help - 5.3 lockups on amd64 SMP
 
All,

      I am experiencing lockups on a production 5.4 amd64 SMP system. 
Its lightly loaded and seems to last about 3-5 days before it stops 
responding to network or even console interaction. The system is acting 
as a firewall and runs a mostly stock kernel with IPV6 removed and SMP, 
PF, PFLOG, CARP and ALTQ added. The only other thing I can think to note 
is that tcpdump is running constantly on the pflog interface to coax 
human readable firewall logs out of pf.

     I have an identical hot spare server with SMP disabled that has 
taken over flawlessly every time the live lock occurs so I am willing to 
leave the primary in the production environment to do testing and gather 
debug info. I have added the following options to primary fw kernel 
config ...

# Debug Options
makeoptions     DEBUG=-g
options         DDB
options         KDB
options         BREAK_TO_DEBUGGER
options         INVARIANT_SUPPORT
options         INVARIANTS
options         WITNESS
options         WITNESS_KDB
options         WITNESS_SKIPSPIN

... and the following to the rc.conf ...

dumpdev="/dev/amrd0s1h"
dumpdir="/var/crash"

Will this do it or should I add anything else?

Thanks in advance,

-Matthew

Grooms, Matthew

2005-Jun-08 23:22 UTC

head link

5.4-RELEASE lockups on amd64 SMP

Matthew,

can you try the attached diff.  Available for 5 and CURRENT.  I recall that 
this problem was seen before, strange that I didn't see the problem.  Sounds
familiar to you?  Please try the patch and let me know if that helps.  Thanks 
a lot.

On Wednesday 08 June 2005 01:35, Matthew Grooms wrote:> Once again, here are the backtraces for the panic and lor ...
>
> Tracing id 110 tid 100089 td 0xffffff012f3f0c80
> kdb_enter() at kdb_enter+0x2f
> panic() at panic+0x249
> uma_dbg_free() at uma_dbg_free+0x188
> uma_zfree_arg() at uma_zfree_arg+0x1b0
> pf_purge_expired_states() at pf_purge_expired_states+0x41
> pfsync_input at pfsync_input+xb35
> pf_input() at ip_input+0x10f
> netisr_processqueue() at netisr_processqueue+0x17
> swi_net() at swi_net+0xa8
> ithread_loop() at ithread_loop+0xd9
> fork_exit() at fork_exit+0xc3
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
> db> continue
> boot() called on cpu#0
> Uptime: 13h42m43s
> Dumping 4864 MB
>   16 32 ...
>
> lock order reversal
...> alltraps_with_regs_pushed() at alltraps_with_regs_pushed+0x5
> pf_state_tree_lan_ext_RB_REMOVE() at pf_state_tree_lan_ext_RB_REMOVE+0x10c
This LOR is a consequence of the fault, so it can be disregarded.

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

Grooms, Matthew

2005-Jun-08 23:23 UTC

head link

5.4-RELEASE lockups on amd64 SMP

Max,

     With your patch applied, I get a panic very quickly during the boot cycle
with output that looks like this ...

net.inet.carp.preempt: 0 -> 1
Setting hostname: ---.
em: Link is up 100 Mbps Full Duplex
panic: mtx_lock() of spin mutex (null) @ ../../../net/if.c:1983
cpuid = 1
KDB: enter: panic
[thread pid 282 tid 100157 ]
Stopped at      kdb_enter+0x2f: nop
db> trace
Tracing pid 282 tid 100157 td 0xffffff000af78280
kdb_enter() at kdb_enter+0x2f
panic() at panic+0x249
_mtx_lock_flags() at _mtx_lock_flags+0xd6
if_handoff() at if_handoff+0x49
pfsync_sendout() at pfsync_sendout+0x268
pfsyncioctl() at pfsyncioctl+0x497
in_control() at in_control+0x8cb
ifioctl() at ifioctl+0x178
sooo_ioctl() at soo_ioctl+0x2d6
ioctl() at ioctl+0xfc
syscall() at syscall+0x4ab
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x800793340, rsp = 0x7fffffffeca8,
rbp = 0x7fffffffef8b ---
db> show locks
eclusive sleep mutex pf task mtx r = 0 (0xffffffff80752f60) locked @
contrib/pf/net/if_pfsync.c:973

Rebooting the machine with the same kernel produces an identical panic. Let me
know what else I can do to help. Right now I have just been rebooting back to a
UP kernel which has never shown any sign of problems.

Matthew Grooms

-----Original Message-----
From: Grooms, Matthew
Sent: Wed 6/8/2005 6:22 PM
To: Max Laier
Cc: Palle Girgensohn; Kris Kennaway; freebsd-stable@freebsd.org;
glebius@freebsd.org; pf@freebsd.org
Subject: RE: 5.4-RELEASE lockups on amd64 SMP
 
Matthew,

can you try the attached diff.  Available for 5 and CURRENT.  I recall that 
this problem was seen before, strange that I didn't see the problem.  Sounds
familiar to you?  Please try the patch and let me know if that helps.  Thanks 
a lot.

On Wednesday 08 June 2005 01:35, Matthew Grooms wrote:> Once again, here are the backtraces for the panic and lor ...
>
> Tracing id 110 tid 100089 td 0xffffff012f3f0c80
> kdb_enter() at kdb_enter+0x2f
> panic() at panic+0x249
> uma_dbg_free() at uma_dbg_free+0x188
> uma_zfree_arg() at uma_zfree_arg+0x1b0
> pf_purge_expired_states() at pf_purge_expired_states+0x41
> pfsync_input at pfsync_input+xb35
> pf_input() at ip_input+0x10f
> netisr_processqueue() at netisr_processqueue+0x17
> swi_net() at swi_net+0xa8
> ithread_loop() at ithread_loop+0xd9
> fork_exit() at fork_exit+0xc3
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffffffb44f9d00, rbp = 0 ---
> db> continue
> boot() called on cpu#0
> Uptime: 13h42m43s
> Dumping 4864 MB
>   16 32 ...
>
> lock order reversal
...> alltraps_with_regs_pushed() at alltraps_with_regs_pushed+0x5
> pf_state_tree_lan_ext_RB_REMOVE() at pf_state_tree_lan_ext_RB_REMOVE+0x10c
This LOR is a consequence of the fault, so it can be disregarded.

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

Max Laier

2005-Jun-08 23:45 UTC

head link

5.4-RELEASE lockups on amd64 SMP

Skipped content of type multipart/mixed-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20050609/93c26107/attachment.bin

freebsd stable - Jun 2005 - 5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP

5.4-RELEASE lockups on amd64 SMP