thr3ads.net - freebsd stable - UFS_DIRHASH panics on a dozen server within 30 hours [Sep 2011]

Andreas Longwitz

2011-Sep-05 09:35 UTC

UFS_DIRHASH panics on a dozen server within 30 hours

Hi,

a week ago a dozen of my FreeBSD servers crashed within a time span of
30 hours. The servers run very different applications, and some of them
were only on standby. All servers run the same FreeBSD 6-STABLE kernel,
and there had been no problems for years until the "black Monday".

Yes, I know that FreeBSD 6 is out of date now, but I don't like to
change a system that is running very well. Another reason is that my
hardware needs the amr driver, and because the amr_ioctl problem
described in kern/155658 is still unresolved, I cannot upgrade my
production systems without changing hardware.

Now I have a dozen core dumps and am trying to understand what happened.
All dumps look very similar, and the panic is always a "page fault"
in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free,
because the mtx_object in use has been overwritten with zeros by someone
before _mtx_lock_sleep is called.

A typical stack trace and some kgdb output follow:

(kgdb) where
#0  doadump () at pcpu.h:165
#1  0xc03c5b25 in boot (howto=260)
               at ../../../kern/kern_shutdown.c:410
#2  0xc03c5e7d in panic (fmt=0xc05931cb "%s")
               at ../../../kern/kern_shutdown.c:566
#3  0xc0564606 in trap_fatal (frame=0xec6ed77c, eva=256)
               at ../../../i386/i386/trap.c:838
#4  0xc0563d1e in trap (frame = {tf_fs = 8, tf_es = -328335320,
      tf_ds = -328335320, tf_edi = -901761536, tf_esi = 0,
      tf_ebp = -328280120, tf_isp = -328280152,
      tf_ebx = -827089920, tf_edx = 0, tf_ecx = 2, tf_eax = 1,
      tf_trapno = 12, tf_err = 0, tf_eip = -1069829895, tf_cs = 32,
      tf_eflags = 65538, tf_esp = -827089920, tf_ss = 2})
               at ../../../i386/i386/trap.c:270
#5  0xc054ddda in calltrap () at ../../../i386/i386/exception.s:139
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760,
      opts=0, file=0x0, line=0)
               at ../../../kern/kern_mutex.c:550
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230)
               at ../../../ufs/ufs/ufs_dirhash.c:1035
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084)
               at ../../../ufs/ufs/ufs_dirhash.c:173
#9  0xc04ebbdd in ufs_lookup (ap=0xec6ed920)
               at ../../../ufs/ufs/ufs_lookup.c:202
#10 0xc057116c in VOP_CACHEDLOOKUP_APV (vop=0x1, a=0x0)
               at vnode_if.c:150
#11 0xc04164fa in vfs_cache_lookup (ap=0x1)
               at vnode_if.h:82
#12 0xc05710fb in VOP_LOOKUP_APV (vop=0xc05f90a0, a=0xec6ed9c0)
               at vnode_if.c:99
#13 0xc041add4 in lookup (ndp=0xec6edbcc)
               at vnode_if.h:56
#14 0xc041a66a in namei (ndp=0xec6edbcc)
               at ../../../kern/vfs_lookup.c:216
#15 0xc042ec31 in vn_open_cred (ndp=0xec6edbcc, flagp=0xec6edccc,
      cmode=384, cred=0xc9bceb80, fdidx=97)
               at ../../../kern/vfs_vnops.c:183
#16 0xc042e982 in vn_open (ndp=0x0, flagp=0xec6edccc, cmode=384,
      fdidx=97)
               at ../../../kern/vfs_vnops.c:91
#17 0xc042749a in kern_open (td=0xca403600, path=0x1 <Address 0x1
       out of bounds>, pathseg=UIO_SYSSPACE, flags=1, mode=438)
               at ../../../kern/vfs_syscalls.c:1016
#18 0xc04271d2 in open (td=0xca403600, uap=0xec6edd04)
               at ../../../kern/vfs_syscalls.c:971
#19 0xc056494b in syscall (frame = {tf_fs = -1082195909,
      tf_es = -1082195909, tf_ds = -1082195909,
      tf_edi = -1082141792, tf_esi = -1082155856, tf_ebp = -1082151736,
      tf_isp = -328278684, tf_ebx = -1982551028, tf_edx = 41,
      tf_ecx = 0, tf_eax = 5, tf_trapno = 0, tf_err = 2,
      tf_eip = -2008413713, tf_cs = 51, tf_eflags = 642,
      tf_esp = -1082155972, tf_ss = 59})
               at ../../../i386/i386/trap.c:984
#20 0xc054de2f in Xint0x80_syscall ()
               at ../../../i386/i386/exception.s:200

(kgdb) f 8
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084)
               at ../../../ufs/ufs/ufs_dirhash.c:173
173                     if (ufsdirhash_recycle(memreqd) != 0)
(kgdb) p *ip
$1 = {i_nextsnap = {tqe_next = 0x0, tqe_prev = 0x0}, i_vnode = 0xca6c0bb0,
  i_ump = 0xc9bd3300, i_flag = 0, i_dev = 0xc9b4f400,
  i_number = 4686848, i_effnlink = 2, i_fs = 0xc9ba5800, i_dquot
  = {0x0, 0x0}, i_modrev = 14753454826293, i_lockf = 0x0, i_count = 24,
  i_endoff = 112640, i_diroff = 72704, i_offset = 73056, i_ino = 3357131,
  i_reclen = 16, i_un = {dirhash = 0x0, snapblklist = 0x0}, i_ea_area
  = 0x0, i_ea_len = 0, i_ea_error = 0, i_mode = 16832, i_nlink = 2,
  i_size = 112640, i_flags = 0, i_gen = -1337636365, i_uid = 60,
  i_gid = 60, dinode_u = {din1 = 0xca6c7d00, din2 = 0xca6c7d00}}

(kgdb) f 7
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230)
               at ../../../ufs/ufs/ufs_dirhash.c:1035
1035                    DIRHASH_LOCK(dh);
(kgdb) p dh
$2 = (struct dirhash *) 0xceb39c00
(kgdb) p *dh
$3 = {dh_mtx = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type
  = 0x0, lo_flags = 0, lo_list = { tqe_next = 0x0, tqe_prev = 0x0},
  lo_witness = 0x0}, mtx_lock = 2, mtx_recurse = 0}, dh_hash = 0x0,
  dh_narrays = 0, dh_hlen = 0, dh_hused = 0, dh_blkfree = 0x0, dh_nblk
  = 0, dh_dirblks = 0, dh_firstfree = { 0 <repeats 46 times>, -16777216,
  -1 <repeats 21 times>}, dh_seqopt = 1, dh_seqoff = 3440, dh_score = 64,
  dh_onlist = 1, dh_list = {tqe_next = 0xcf919a00, tqe_prev = 0xc063cfb0

(kgdb) f 6
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0,
    file=0x0, line=0) at ../../../kern/kern_mutex.c:550
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
(kgdb) p m
$4 = (struct mtx *) 0xceb39c00
(kgdb) p *m
$5 = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0,
  lo_flags = 0, lo_list = {tqe_next = 0x0, tqe_prev = 0x0}, lo_witness
  = 0x0}, mtx_lock = 2, mtx_recurse = 0}
(kgdb) p &Giant
$6 = (struct mtx *) 0xc062a0e0
(kgdb) p owner
$7 = (volatile struct thread *) 0x0
(kgdb) info local
owner = (volatile struct thread *) 0x0
v = 0
(kgdb) list
545                      */
546                     owner = (struct thread *)(v & MTX_FLAGMASK);
547     #ifdef ADAPTIVE_GIANT
548                     if (TD_IS_RUNNING(owner)) {
549     #else
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
551     #endif

The crash occurs at line 550 because owner is NULL, where it should
point to the thread that holds the dirhash mutex. By the time
_mtx_lock_sleep is called, mtx_object is already filled with zeros; in
particular, mtx_lock should be either 4 (UNOWNED) or the thread id of
the owner.
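
As a rough illustration of why owner ends up NULL, the sketch below
replays the masking from line 546 of the listing in user space. The flag
values are assumptions modelled on FreeBSD 6's sys/mutex.h and should be
checked against the actual tree; lock_owner, classify_lock and
print_lock_word are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed flag values, modelled on FreeBSD 6 <sys/mutex.h>; verify locally. */
#define MTX_RECURSED  ((uintptr_t)0x1)
#define MTX_CONTESTED ((uintptr_t)0x2)
#define MTX_UNOWNED   ((uintptr_t)0x4)
#define MTX_FLAGMASK  (~(MTX_RECURSED | MTX_CONTESTED))

enum lock_state { LOCK_UNOWNED, LOCK_OWNED, LOCK_NO_OWNER };

/* Same masking as line 546 of the kern_mutex.c listing: strip the flag bits. */
uintptr_t lock_owner(uintptr_t v)
{
    return v & MTX_FLAGMASK;
}

/* Hypothetical classifier for a raw mtx_lock word. */
enum lock_state classify_lock(uintptr_t v)
{
    if (v == MTX_UNOWNED)
        return LOCK_UNOWNED;
    if (lock_owner(v) == 0)
        return LOCK_NO_OWNER;   /* neither free nor held by any thread */
    return LOCK_OWNED;
}

/* Print one lock word together with its decoded owner and state. */
void print_lock_word(uintptr_t v)
{
    printf("mtx_lock=%#lx owner=%#lx state=%d\n",
        (unsigned long)v, (unsigned long)lock_owner(v),
        (int)classify_lock(v));
}
```

Under these assumptions, both values seen in the dump (v = 0 and
mtx_lock = 2) decode to a NULL owner, so TD_IS_RUNNING(owner) would
dereference a null pointer, whereas a healthy lock word decodes to
either UNOWNED or a thread pointer such as td=0xca403600.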

What could be the reason that the panics never occurred before and then
hit a dozen servers within a short time? There have been no further
crashes for a week now.

Any hints are welcome.

-- 
Dr. Andreas Longwitz

Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau

John Baldwin

2011-Sep-06 15:04 UTC

UFS_DIRHASH panics on a dozen server within 30 hours

On Monday, September 05, 2011 5:15:42 am Andreas Longwitz wrote:
> Hi,
> 
> a week ago a dozen of my FreeBSD servers crashed within a time span of
> 30 hours. The servers run very different applications, and some of them
> were only on standby. All servers run the same FreeBSD 6-STABLE kernel,
> and there had been no problems for years until the "black Monday".
> 
> Yes, I know that FreeBSD 6 is out of date now, but I don't like to
> change a system that is running very well. Another reason is that my
> hardware needs the amr driver, and because the amr_ioctl problem
> described in kern/155658 is still unresolved, I cannot upgrade my
> production systems without changing hardware.

Hmm, the patch in that PR should still apply to newer versions.  Also, you
could just change the malloc() call to always allocate the maximum size
(instead of using a static buffer) for a smaller diff.  It seems though that
a specific command is overrunning its buffer.

> Now I have a dozen core dumps and am trying to understand what happened.
> All dumps look very similar, and the panic is always a "page fault"
> in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free,
> because the mtx_object in use has been overwritten with zeros by someone
> before _mtx_lock_sleep is called.

I don't know of anything in particular that would explain this, esp. as to
why you would see them all occur at the same time.  Maybe look to see if the
machines were doing something unusual at that time (a cron job, etc.)?
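
The idea mentioned above of always allocating the maximum size can be
sketched in user-space C roughly as follows. CMD_DATA_MAX and
cmd_buffer_alloc are hypothetical names, and the 4096-byte bound is an
arbitrary placeholder; the real amr(4) code and the patch in kern/155658
are not shown in this thread, and a kernel version would use malloc(9)
rather than calloc().

```c
#include <stdlib.h>

/* Hypothetical upper bound on any command's data size. The real limit
 * would come from the driver and is not stated in this thread. */
#define CMD_DATA_MAX 4096

/*
 * Allocate a zeroed data buffer for a pass-through command. The buffer is
 * always CMD_DATA_MAX bytes, whatever length the caller declares, so a
 * command that writes more than it declared still stays inside the
 * allocation (as long as it respects CMD_DATA_MAX).
 * Returns NULL if requested_len exceeds the bound or allocation fails.
 */
void *cmd_buffer_alloc(size_t requested_len, size_t *alloc_len)
{
    if (requested_len > CMD_DATA_MAX)
        return NULL;

    void *buf = calloc(1, CMD_DATA_MAX);
    if (buf != NULL && alloc_len != NULL)
        *alloc_len = CMD_DATA_MAX;   /* report the real allocation size */
    return buf;
}
```

This trades a little memory per request for a smaller change than
tracking per-command sizes; it only hides an overrun if the misbehaving
command stays within the chosen maximum.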

-- 
John Baldwin

Andreas Longwitz

2011-Sep-19 10:20 UTC

busdma MFC broke ipfw fwd for RELENG_6

Eugene Grosbein wrote:
> Well, given that before busdma commit that hardware worked just fine
> with stock driver, it could be less overhead for me to rollback that
> one busdma small chunk :-)
> Who knows, which drivers got broken then in 2010 in 6.4-STABLE with
> busdma change besides re(4)...

Another example is de(4), as mentioned in kern/151941.

-- 
Dr. Andreas Longwitz

Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau
