Andreas Longwitz
2011-Sep-05 09:35 UTC
UFS_DIRHASH panics on a dozen servers within 30 hours
Hi,

a week ago a dozen of my FreeBSD servers crashed within a time span of
30 hours. The servers run very different applications, and some of them
were only on standby. All servers have the same kernel with FreeBSD 6 STABLE,
and there were no problems for years until the "black Monday".

Yes, I know that FreeBSD 6 is out of date now, but I don't like to
change a system that is running very well. Another reason is that my hardware
needs the amr driver, and because the solution to the amr_ioctl problem
described in kern/155658 is still outstanding, it is not possible for me
to upgrade my production systems without changing hardware.

Now I have a dozen core dumps and am trying to understand what happened.
All dumps look very similar, and the panic is always "page fault"
in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free,
because the mtx_object in use has been overwritten with zeros by someone
before _mtx_lock_sleep is called.

A typical stack trace and some kgdb output follow:

(kgdb) where
#0  doadump () at pcpu.h:165
#1  0xc03c5b25 in boot (howto=260) at ../../../kern/kern_shutdown.c:410
#2  0xc03c5e7d in panic (fmt=0xc05931cb "%s") at ../../../kern/kern_shutdown.c:566
#3  0xc0564606 in trap_fatal (frame=0xec6ed77c, eva=256) at ../../../i386/i386/trap.c:838
#4  0xc0563d1e in trap (frame={tf_fs = 8, tf_es = -328335320, tf_ds = -328335320,
      tf_edi = -901761536, tf_esi = 0, tf_ebp = -328280120, tf_isp = -328280152,
      tf_ebx = -827089920, tf_edx = 0, tf_ecx = 2, tf_eax = 1, tf_trapno = 12,
      tf_err = 0, tf_eip = -1069829895, tf_cs = 32, tf_eflags = 65538,
      tf_esp = -827089920, tf_ss = 2}) at ../../../i386/i386/trap.c:270
#5  0xc054ddda in calltrap () at ../../../i386/i386/exception.s:139
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0, file=0x0, line=0)
    at ../../../kern/kern_mutex.c:550
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230) at ../../../ufs/ufs/ufs_dirhash.c:1035
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084) at ../../../ufs/ufs/ufs_dirhash.c:173
#9  0xc04ebbdd in ufs_lookup (ap=0xec6ed920) at ../../../ufs/ufs/ufs_lookup.c:202
#10 0xc057116c in VOP_CACHEDLOOKUP_APV (vop=0x1, a=0x0) at vnode_if.c:150
#11 0xc04164fa in vfs_cache_lookup (ap=0x1) at vnode_if.h:82
#12 0xc05710fb in VOP_LOOKUP_APV (vop=0xc05f90a0, a=0xec6ed9c0) at vnode_if.c:99
#13 0xc041add4 in lookup (ndp=0xec6edbcc) at vnode_if.h:56
#14 0xc041a66a in namei (ndp=0xec6edbcc) at ../../../kern/vfs_lookup.c:216
#15 0xc042ec31 in vn_open_cred (ndp=0xec6edbcc, flagp=0xec6edccc, cmode=384,
    cred=0xc9bceb80, fdidx=97) at ../../../kern/vfs_vnops.c:183
#16 0xc042e982 in vn_open (ndp=0x0, flagp=0xec6edccc, cmode=384, fdidx=97)
    at ../../../kern/vfs_vnops.c:91
#17 0xc042749a in kern_open (td=0xca403600, path=0x1 <Address 0x1 out of bounds>,
    pathseg=UIO_SYSSPACE, flags=1, mode=438) at ../../../kern/vfs_syscalls.c:1016
#18 0xc04271d2 in open (td=0xca403600, uap=0xec6edd04) at ../../../kern/vfs_syscalls.c:971
#19 0xc056494b in syscall (frame={tf_fs = -1082195909, tf_es = -1082195909,
      tf_ds = -1082195909, tf_edi = -1082141792, tf_esi = -1082155856,
      tf_ebp = -1082151736, tf_isp = -328278684, tf_ebx = -1982551028, tf_edx = 41,
      tf_ecx = 0, tf_eax = 5, tf_trapno = 0, tf_err = 2, tf_eip = -2008413713,
      tf_cs = 51, tf_eflags = 642, tf_esp = -1082155972, tf_ss = 59})
    at ../../../i386/i386/trap.c:984
#20 0xc054de2f in Xint0x80_syscall () at ../../../i386/i386/exception.s:200

(kgdb) f 8
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084) at ../../../ufs/ufs/ufs_dirhash.c:173
173             if (ufsdirhash_recycle(memreqd) != 0)
(kgdb) p *ip
$1 = {i_nextsnap = {tqe_next = 0x0, tqe_prev = 0x0}, i_vnode = 0xca6c0bb0,
  i_ump = 0xc9bd3300, i_flag = 0, i_dev = 0xc9b4f400, i_number = 4686848,
  i_effnlink = 2, i_fs = 0xc9ba5800, i_dquot = {0x0, 0x0},
  i_modrev = 14753454826293, i_lockf = 0x0, i_count = 24, i_endoff = 112640,
  i_diroff = 72704, i_offset = 73056, i_ino = 3357131, i_reclen = 16,
  i_un = {dirhash = 0x0, snapblklist = 0x0}, i_ea_area = 0x0, i_ea_len = 0,
  i_ea_error = 0, i_mode = 16832, i_nlink = 2, i_size = 112640, i_flags = 0,
  i_gen = -1337636365, i_uid = 60, i_gid = 60,
  dinode_u = {din1 = 0xca6c7d00, din2 = 0xca6c7d00}}

(kgdb) f 7
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230) at ../../../ufs/ufs/ufs_dirhash.c:1035
1035            DIRHASH_LOCK(dh);
(kgdb) p dh
$2 = (struct dirhash *) 0xceb39c00
(kgdb) p *dh
$3 = {dh_mtx = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0,
      lo_flags = 0, lo_list = {tqe_next = 0x0, tqe_prev = 0x0}, lo_witness = 0x0},
    mtx_lock = 2, mtx_recurse = 0}, dh_hash = 0x0, dh_narrays = 0, dh_hlen = 0,
  dh_hused = 0, dh_blkfree = 0x0, dh_nblk = 0, dh_dirblks = 0,
  dh_firstfree = {0 <repeats 46 times>, -16777216, -1 <repeats 21 times>},
  dh_seqopt = 1, dh_seqoff = 3440, dh_score = 64, dh_onlist = 1,
  dh_list = {tqe_next = 0xcf919a00, tqe_prev = 0xc063cfb0

(kgdb) f 6
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0, file=0x0, line=0)
    at ../../../kern/kern_mutex.c:550
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
(kgdb) p m
$4 = (struct mtx *) 0xceb39c00
(kgdb) p *m
$5 = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0, lo_flags = 0,
    lo_list = {tqe_next = 0x0, tqe_prev = 0x0}, lo_witness = 0x0},
  mtx_lock = 2, mtx_recurse = 0}
(kgdb) p &Giant
$6 = (struct mtx *) 0xc062a0e0
(kgdb) p owner
$7 = (volatile struct thread *) 0x0
(kgdb) info local
owner = (volatile struct thread *) 0x0
v = 0
(kgdb) list
545                      */
546                     owner = (struct thread *)(v & MTX_FLAGMASK);
547     #ifdef ADAPTIVE_GIANT
548                     if (TD_IS_RUNNING(owner)) {
549     #else
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
551     #endif

The crash occurs in line 550 because owner is zero; it should be the
thread that holds the dirhash mutex. When _mtx_lock_sleep is called,
the mtx_object is already filled with zeros, and in particular mtx_lock
should be 4 (UNOWNED) or the thread id of the owner.

What may be the reason that the panics never occurred before and then
hit a dozen servers in a short time? There have been no further crashes
for a week now.

Any hints are welcome.

-- 
Dr. Andreas Longwitz
Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau
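
To make the reasoning about line 546 of the listing concrete, here is a minimal userland sketch of how the owner is derived from the lock word. The flag values and the mask are assumptions modelled on sys/mutex.h in FreeBSD 6 and should be checked against the actual source tree; mtx_owner_bits and mtx_word_plausible are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Assumed flag values, modelled on sys/mutex.h in FreeBSD 6; verify locally. */
#define MTX_RECURSED   ((uintptr_t)0x1)
#define MTX_CONTESTED  ((uintptr_t)0x2)
#define MTX_UNOWNED    ((uintptr_t)0x4)
/* Assumed mask: strips the two low flag bits and keeps the owner pointer. */
#define MTX_FLAGMASK   (~(MTX_RECURSED | MTX_CONTESTED))

/* Owner part of a lock word, mirroring "owner = v & MTX_FLAGMASK" (line 546). */
uintptr_t mtx_owner_bits(uintptr_t v)
{
    return v & MTX_FLAGMASK;
}

/*
 * Hypothetical sanity check: a lock word looks plausible if it is the
 * "free" cookie or if it carries a non-null owner pointer.
 */
int mtx_word_plausible(uintptr_t v)
{
    return v == MTX_UNOWNED || mtx_owner_bits(v) != 0;
}
```

Under these assumptions, the dumped value mtx_lock = 2 has only the contested bit set, so mtx_owner_bits(2) is 0. That is consistent with owner = 0x0 in frame 6, where TD_IS_RUNNING(owner) then dereferences a null thread pointer.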
On Monday, September 05, 2011 5:15:42 am Andreas Longwitz wrote:
> Hi,
>
> a week ago a dozen of my FreeBSD servers crashed within a time span of
> 30 hours. On the servers run very different applications, some of them
> were only standby. All servers have the same kernel with FreeBSD 6 STABLE
> and there were no problems for years until the "black monday".
>
> Yes I know that FreeBSD 6 is out of date now, but I don't like to
> change a very good running system. Another reason is that my hardware
> needs the amr driver and because of the outstanding solution of the
> amr_ioctl problem described in kern/155658 it is not possible for me
> to upgrade my production systems without changing hardware.

Hmm, the patch in that PR should still apply to newer versions. Also, you
could just change the malloc() call to always allocate the maximum size
(instead of using a static buffer) for a smaller diff. It seems though that
a specific command is overrunning its buffer.

> Now I have a dozen core dumps and try to understand what happened.
> All dumps look very similar and the panic is always "page fault"
> in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free
> because the used mtx_object is overwritten with zeros by someone
> before _mtx_lock_sleep is called.

I don't know of anything in particular that would explain this, esp. as to
why you would see them all occur at the same time. Maybe look to see if the
machines were doing something unusual at that time (a cron job, etc.)?

-- 
John Baldwin
Eugene Grosbein wrote:
> Well, given that before busdma commit that hardware worked just fine
> with stock driver, it could be less overhead for me to rollback that
> one busdma small chunk :-)
> Who knows, which drivers got broken then in 2010 in 6.4-STABLE with
> busdma change besides re(4)...

Another example is de(4) as mentioned in kern/151941.

-- 
Dr. Andreas Longwitz
Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau