On 9/3/07, Dale Ghent <daleg at elemental.org> wrote:
> I saw a putback this past week from M. Maybee regarding this, but I
> thought I'd post here that I saw what is apparently an incarnation of
> 6569719 on a production box running s10u3 x86 w/ latest (on sunsolve)
> patches. I have 3 other servers configured the same way WRT work
> load, zfs pools and hardware resources, so if this occurs again I'll
> see about logging a case and getting a relief patch. Anyhow, perhaps
> a backport to s10 may be in order
[note: the patches I mention are s10 sparc-specific. Translation to
x86 required.]
As of a few weeks ago s10u3 with latest patches did not have this
problem for me, but s10u4 beta and snv69 did. My situation was on
sun4v, not i386. More specifically:
S10 118833-36, 118833-07, 118833-10:
# zpool import
  pool: zfs
    id: 679728171331086542
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        zfs         FAULTED  corrupted data
          c0d1s3    FAULTED  corrupted data
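
(Aside, in case anyone else hits this: dumping the vdev labels from the
blamed slice should show whether the on-disk nvlists themselves are
damaged. The device path below is from my config; this is a sketch of a
diagnostic step, not a verified recovery procedure.)

# Dump all four vdev labels on the slice the pool reports as corrupted.
# If zdb cannot parse any of the labels, the corruption is in the labels
# themselves rather than in the pool metadata behind them.
zdb -l /dev/dsk/c0d1s3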
snv_69, s10u4beta:
Boot device: /virtual-devices at 100/channel-devices at 200/network at 0:dhcp
File and args: -s
SunOS Release 5.11 Version snv_69 64-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring /dev
Using DHCP for network configuration information.
Requesting System Maintenance Mode
SINGLE USER MODE
# zpool import
panic[cpu0]/thread=300028943a0: dangling dbufs (dn=3000392dbe0, dbuf=3000392be08)
000002a10076f270 zfs:dnode_evict_dbufs+188 (3000392dbe0, 0, 1, 1, 2a10076f320, 7b729000)
  %l0-3: 000003000392ddf0 0000000000000000 0000000000000000 000003000392ddf8
  %l4-7: 000002a10076f320 0000000000000001 000003000392bf20 0000000000000003
000002a10076f3e0 zfs:dmu_objset_evict_dbufs+100 (2, 0, 0, 7b722800, 0, 30000516900)
  %l0-3: 000000007b72ac00 000000007b724510 000000007b724400 0000030000516a70
  %l4-7: 000003000392dbe0 0000030000516968 000000007b7228c1 0000000000000001
...
Sun offered me an IDR against 125100-07, but since I could not
reproduce the problem on that kernel, I never tested it. This implies
that they believe 125100-07 has a dangling dbufs problem and that they
have a fix for it available to support-paying customers. Perhaps that
is the problem, and the related fix, that you would be interested in.
The interesting thing with my case is that the backing store for this
device is a file on a ZFS file system, served up as a virtual disk in
an LDOM. From the primary LDOM, there is no corruption. An
unexpected reset (panic, I believe) of the primary LDOM seems to have
caused the corruption in the guest LDOM. What was that about having
the redundancy as close to the consumer as possible? :)
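
For anyone trying to picture the setup, it was roughly like this
(file names, volume names, and the domain name "guest" are made up for
illustration; the actual commands on my box differed):

# On the primary (service) domain: export a file on ZFS as a virtual
# disk to the guest domain.
mkfile 10g /tank/ldoms/guestdisk0.img
ldm add-vdsdev /tank/ldoms/guestdisk0.img guestdisk0@primary-vds0
ldm add-vdisk vdisk0 guestdisk0@primary-vds0 guest

# Inside the guest, my pool sat on a slice of that single vdisk,
# so the guest-side ZFS had no redundancy of its own:
zpool create zfs c0d1s3

The moral about keeping redundancy near the consumer would be to give
the guest two vdisks with independent backing stores and let the guest
mirror them itself, e.g.:

zpool create zfs mirror c0d0s3 c0d1s3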
--
Mike Gerdts
http://mgerdts.blogspot.com/