On 9/3/07, Dale Ghent <daleg at elemental.org> wrote:
> I saw a putback this past week from M. Maybee regarding this, but I
> thought I'd post here that I saw what is apparently an incarnation of
> 6569719 on a production box running s10u3 x86 w/ latest (on sunsolve)
> patches. I have 3 other servers configured the same way WRT work
> load, zfs pools and hardware resources, so if this occurs again I'll
> see about logging a case and getting a relief patch. Anyhow, perhaps
> a backport to s10 may be in order
[note: the patches I mention are s10 sparc-specific. Translation to
x86 required.]
As of a few weeks ago s10u3 with latest patches did not have this
problem for me, but s10u4 beta and snv69 did. My situation was on
sun4v, not i386. More specifically:
S10 118833-36, 118833-07, 118833-10:
# zpool import
  pool: zfs
    id: 679728171331086542
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        zfs         FAULTED  corrupted data
          c0d1s3    FAULTED  corrupted data
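
(Aside, in case anyone else hits this: dumping the vdev labels from the
blamed slice should show whether the on-disk nvlists themselves are
damaged. The device path below is from my config; this is a sketch of a
diagnostic step, not a verified recovery procedure.)

# Dump all four vdev labels on the slice the pool reports as corrupted.
# If zdb cannot parse any of the labels, the corruption is in the labels
# themselves rather than in the pool metadata behind them.
zdb -l /dev/dsk/c0d1s3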
snv_69, s10u4beta:
Boot device: /virtual-devices at 100/channel-devices at 200/network at 0:dhcp
File and args: -s
SunOS Release 5.11 Version snv_69 64-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring /dev
Using DHCP for network configuration information.
Requesting System Maintenance Mode
SINGLE USER MODE
# zpool import
panic[cpu0]/thread=300028943a0: dangling dbufs (dn=3000392dbe0, dbuf=3000392be08)
000002a10076f270 zfs:dnode_evict_dbufs+188 (3000392dbe0, 0, 1, 1, 2a10076f320, 7b729000)
  %l0-3: 000003000392ddf0 0000000000000000 0000000000000000 000003000392ddf8
  %l4-7: 000002a10076f320 0000000000000001 000003000392bf20 0000000000000003
000002a10076f3e0 zfs:dmu_objset_evict_dbufs+100 (2, 0, 0, 7b722800, 0, 30000516900)
  %l0-3: 000000007b72ac00 000000007b724510 000000007b724400 0000030000516a70
  %l4-7: 000003000392dbe0 0000030000516968 000000007b7228c1 0000000000000001
...
Sun offered me an IDR against 125100-07, but since I could not
reproduce the problem on that kernel, I never tested it. This implies
that they believe 125100-07 has a dangling dbufs problem and that they
have a fix for it available to support-paying customers. Perhaps that
is the problem, and the related fix, that you would be interested in.
The interesting thing with my case is that the backing store for this
device is a file on a ZFS file system, served up as a virtual disk in
an LDOM. From the primary LDOM, there is no corruption. An
unexpected reset (panic, I believe) of the primary LDOM seems to have
caused the corruption in the guest LDOM. What was that about having
the redundancy as close to the consumer as possible? :)
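
For anyone trying to picture the setup, it was roughly like this
(file names, volume names, and the domain name "guest" are made up for
illustration; the actual commands on my box differed):

# On the primary (service) domain: export a file on ZFS as a virtual
# disk to the guest domain.
mkfile 10g /tank/ldoms/guestdisk0.img
ldm add-vdsdev /tank/ldoms/guestdisk0.img guestdisk0@primary-vds0
ldm add-vdisk vdisk0 guestdisk0@primary-vds0 guest

# Inside the guest, my pool sat on a slice of that single vdisk,
# so the guest-side ZFS had no redundancy of its own:
zpool create zfs c0d1s3

The moral about keeping redundancy near the consumer would be to give
the guest two vdisks with independent backing stores and let the guest
mirror them itself, e.g.:

zpool create zfs mirror c0d0s3 c0d1s3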
--
Mike Gerdts
http://mgerdts.blogspot.com/