Don Lewis
2006-Jan-03 13:29 UTC
Recurring problem: processes block accessing UFS file system
On 3 Jan, Greg Rivers wrote:> On Tue, 22 Nov 2005, I wrote: > >> On Mon, 21 Nov 2005, Kris Kennaway wrote: >> >>> It may not be the same problem. You should also try to obtain a trace when >>> snapshots are not implicated. >>> >> >> Agreed. I'll do so at the first opportunity. >> > > First, my thanks to all of you for looking into this. > > It's taken more than a month, but the problem has recurred without > snapshots ever having been run. I've got a good trace of the machine in > this state (attached). My apologies for the size of the debug output, but > the processes had really stacked up this time before I noticed it. > > I have enough capacity that I can afford to have this machine out of > production for a while, so I've left it suspended in kdb for the time > being in case additional information is needed. Please let me know if > there's anything else I can do to facilitate troubleshooting this. > Thanks!There are large number of sendmail processes waiting on vnode locks which are held by other sendmail processes that are waiting on other vnode locks, etc. until we get to sendmail pid 87150 which is holding a vnode lock and waiting to lock a buf. Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7 acquire(ec26a748,120,60000,15c2e6e0,0) at acquire+0x89 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c ffs_vget(c8870000,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f This doesn't appear to be a buf/memory exhausting problem because syncer, bufdaemon, and pagedaemon all appear to be idle. What does "show lockedbufs" say?
Greg Rivers
2006-Jan-03 16:37 UTC
Recurring problem: processes block accessing UFS file system
On Tue, 3 Jan 2006, Don Lewis wrote:> There are large number of sendmail processes waiting on vnode locks > which are held by other sendmail processes that are waiting on other > vnode locks, etc. until we get to sendmail pid 87150 which is holding a > vnode lock and waiting to lock a buf. > > Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480 > sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158 > mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5 > sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f > sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11 > msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7 > acquire(ec26a748,120,60000,15c2e6e0,0) at acquire+0x89 > lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f > getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211 > breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52 > bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c > ffs_vget(c8870000,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383 > ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154 > ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61 > ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36 > VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c > vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe > vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33 > kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca > open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36 > syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324 > Xint0x80_syscall() at Xint0x80_syscall+0x1f > > This doesn't appear to be a buf/memory exhausting problem because > syncer, bufdaemon, and pagedaemon all appear to be idle. > > What does "show lockedbufs" say? >db> show lockedbufs buf at 0xdc625678 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xddcef000, b_blkno = 365233216 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a088, 0x53175000),(0xc8984108, 0x2b8a089, 0x7ae56000),(0xc8984108, 0x2b8a08a, 0xd3f57000),(0xc8984108, 0x2b8a08b, 0xd7d58000) buf at 0xdc6b8ab0 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xdf9ab000, b_blkno = 365257760 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8ac84, 0x60b1000),(0xc8984108, 0x2b8ac85, 0x454d2000),(0xc8984108, 0x2b8ac86, 0x1b273000),(0xc8984108, 0x2b8ac87, 0x47b74000) buf at 0xdc6c3cc8 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xdfbd7000, b_blkno = 365265888 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8b07c, 0xba549000),(0xc8984108, 0x2b8b07d, 0x92eaa000),(0xc8984108, 0x2b8b07e, 0xbdf4b000),(0xc8984108, 0x2b8b07f, 0x2090c000) buf at 0xdc6d9e68 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0027000, b_blkno = 365240832 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a440, 0x61b8d000),(0xc8984108, 0x2b8a441, 0xb0f4e000),(0xc8984108, 0x2b8a442, 0x5f98f000),(0xc8984108, 0x2b8a443, 0x5c210000) buf at 0xdc6e5458 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe025f000, b_blkno = 364617056 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b773ac, 0x4e539000),(0xc8984108, 0x2b773ad, 0x6e13a000),(0xc8984108, 0x2b773ae, 0xc653b000),(0xc8984108, 0x2b773af, 0x14e3c000) buf at 0xdc6fd4b8 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe070f000, b_blkno = 365224960 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b89c80, 0x37c6d000),(0xc8984108, 0x2b89c81, 0x2e40e000),(0xc8984108, 0x2b89c82, 0xa39af000),(0xc8984108, 0x2b89c83, 0x27ff0000) buf at 0xdc713f50 b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000) buf at 0xdc765f50 b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc9c68b60), b_data = 0xe1b7b000, b_blkno = 364555424 lockstatus = 2, excl count = 1, excl owner 0xce9d4c00 b_npages = 4, pages(OBJ, IDX, PA): (0xcff70000, 0xffffffd0, 0xafa36000),(0xcff70000, 0xffffffd1, 0x6bcd7000),(0xcff70000, 0xffffffd2, 0x970d8000),(0xcff70000, 0xffffffd3, 0x89959000) buf at 0xdc7ae590 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe299b000, b_blkno = 364555424 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b75594, 0xd5081000),(0xc8984108, 0x2b75595, 0xa6722000),(0xc8984108, 0x2b75596, 0x402e3000),(0xc8984108, 0x2b75597, 0xb1d04000) buf at 0xdc7d7590 b_flags = 0x20000000<vmio> b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe319b000, b_blkno = 364607808 lockstatus = 2, excl count = 1, excl owner 0xfffffffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b76f28, 0x387d5000),(0xc8984108, 0x2b76f29, 0x28896000),(0xc8984108, 0x2b76f2a, 0x40d17000),(0xc8984108, 0x2b76f2b, 0x27d98000) db>