Since people are using zdb, I decided to try it...

# zdb -s data
error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
[L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1893280 fill=445
cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
Abort (core dumped)
# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t8d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
#

Is the above a known bug?

# uname -av
SunOS enterprise 5.11 snv_27 sun4u sparc SUNW,Ultra-2
#

Okay, then I tried this other command:

# zdb -bb p-16-32
zdb: can't open p-16-32: error 2
#
# zdb -u data
error: ZFS: bad checksum (read on raidz off 478bbe800: zio 1006b5c00
[L0 DMU objset] vdev=1 offset=478bbe800 size=400L/400P/800A fletcher4
uncompressed BE contiguous birth=1893320 fill=445
cksum=2e63ca20e:265d9024c07:ff39f90c092f:4723fcdebe07ea): error 50
Abort (core dumped)
#

No hard errors are reported on the drives. The first raidz group is four 9.0 GB 10k RPM SCA drives; the second is five 4.3 GB 7,200 RPM drives.

James Dickens
Hello James,

Thursday, March 9, 2006, 3:41:03 AM, you wrote:

JD> # zdb -bb p-16-32
JD> zdb: can't open p-16-32: error 2

You don't have a pool named p-16-32, do you?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
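For reference, zpool list enumerates the pools that actually exist on a system (the output below is illustrative, not from James's machine):

# zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
data    55.5G   12.3G   43.2G    22%  ONLINE     -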
On 3/8/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello James,
>
> Thursday, March 9, 2006, 3:41:03 AM, you wrote:
>
> JD> # zdb -bb p-16-32
> JD> zdb: can't open p-16-32: error 2
>
> You don't have a pool named p-16-32, do you?
>

Nope, I didn't realize that p-16-32 was supposed to be a pool name. Okay, re-running with my pool name:

# zdb -bb data
error: ZFS: bad checksum (read on raidz off 127330c00: zio 1006df140
[L0 DMU objset] vdev=1 offset=127330c00 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1894214 fill=445
cksum=743221094:323e84de013:b099ee6cf963:1a469da16421b2): error 50
Abort (core dumped)
#

> --
> Best regards,
> Robert                          mailto:rmilkowski at task.gda.pl
> http://milek.blogspot.com
>
On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> Since people are using zdb, I decided to try it...
>
> # zdb -s data
> error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> lzjb BE contiguous birth=1893280 fill=445
> cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> Abort (core dumped)

If data in your pool is currently being modified, zdb doesn't always work. The '-L' flag is meant to handle live pools, but it doesn't always work either; if you run it a few times with -L you might get lucky. I actually just ran into this bug myself yesterday, and have filed:

6396042 'zdb -L' should work as described (ie. on live pools)

Sorry about that. I should have at least mentioned -L when I asked you to run zdb before... From your other emails it looks like you got it working anyway.

--matt
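A minimal sketch of the retry approach Matt describes (the retry count and sleep interval are arbitrary, and combining -L with -s is assumed to work on this build):

# Retry zdb -L a few times; failures against a live pool can be transient.
for i in 1 2 3 4 5; do
    zdb -L -s data && break
    echo "zdb attempt $i failed, retrying..." >&2
    sleep 2
done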
On Thu, 2006-03-09 at 04:11, Matthew Ahrens wrote:
> On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> > Since people are using zdb, I decided to try it...
> >
> > # zdb -s data
> > error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> > [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> > lzjb BE contiguous birth=1893280 fill=445
> > cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> > Abort (core dumped)

So, I, too, saw that error message and thought "eek! pool damage!" and fired off a zpool scrub just to be sure.

> I should have at least mentioned -L when I asked you
> to run zdb before... From your other emails it looks like you got it
> working anyway.

I think there's a second bug lurking here. I just filed:

6396160 zdb should not needlessly worry sysadmins when run on a live pool

since it would be clever if zdb noticed that it was aimed at a live pool without -L and failed more gracefully.

- Bill
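Until that's fixed, a rough approximation of such a guard can be scripted around zdb (a sketch only, assuming an imported pool shows up in zpool list; this is not the actual fix for 6396160):

# Warn before pointing zdb at an imported (live) pool.
pool=data
if zpool list "$pool" >/dev/null 2>&1; then
    echo "warning: pool '$pool' is imported; use zdb -L or expect" \
         "spurious checksum errors" >&2
fi
zdb -L -bb "$pool"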
On 3/9/06, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> On Thu, 2006-03-09 at 04:11, Matthew Ahrens wrote:
> > On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> > > Since people are using zdb, I decided to try it...
> > >
> > > # zdb -s data
> > > error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> > > [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> > > lzjb BE contiguous birth=1893280 fill=445
> > > cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> > > Abort (core dumped)
>
> So, I, too, saw that error message and thought "eek! pool damage!" and
> fired off a zpool scrub just to be sure.
>
> > I should have at least mentioned -L when I asked you
> > to run zdb before... From your other emails it looks like you got it
> > working anyway.
>

I wanted to test a bit further. I unmounted several ZFS filesystems, and then executed /etc/init.d/nfs.server stop. Then the machine crashed, so I have a crash dump if anyone is interested.

I booted into single-user mode to see if zdb would still fail:

# zdb -bb data
error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
[L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1902279 fill=445
cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50

No ZFS filesystems were mounted, and I have the core file for this if anyone is interested. I forgot the -L, but since no ZFS filesystems were mounted I figure it's unnecessary.

Let me know if anyone is interested in either of these files... currently running build 27 of Solaris Express.

James Dickens

> I think there's a second bug lurking here. I just filed:
>
> 6396160 zdb should not needlessly worry sysadmins when run on a live
> pool
>
> since it would be clever if zdb noticed that it was aimed at a live pool
> without -L and failed more gracefully.
>
> - Bill
>
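As an aside, unmounting the filesystems does not export the pool; it stays imported, so the kernel can still write to it. A quick way to confirm the state (a sketch, assuming this build's zfs and zpool accept these options; the dataset names and output are illustrative):

# zfs list -o name,mounted
NAME        MOUNTED
data        no
data/home   no
# zpool list -H -o name
data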
On Thu, Mar 09, 2006 at 11:44:22AM -0600, James Dickens wrote:
> I unmounted several ZFS filesystems, and then executed
> /etc/init.d/nfs.server stop. Then the machine crashed, so I have a
> crash dump if anyone is interested.

We'd definitely like to see at least the stack trace and panic message. You can get these by running:

# mdb <dump>
> ::status
> ::stack

> I booted into single-user mode to see if zdb would still fail:
>
> # zdb -bb data
> error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
> [L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
> lzjb BE contiguous birth=1902279 fill=445
> cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50
>
> No ZFS filesystems were mounted, and I have the core file for this if
> anyone is interested. I forgot the -L, but since no ZFS filesystems
> were mounted I figure it's unnecessary.

That's curious. Let's tackle the kernel panic first, as it may be related.

--matt
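The same two dcmds can also be captured non-interactively (a sketch; it assumes mdb reads dcmds from standard input and uses the standard savecore file names):

# printf '::status\n::stack\n' | mdb unix.0 vmcore.0 > panic.txt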
On 3/9/06, Matthew Ahrens <ahrens at sun.com> wrote:
> On Thu, Mar 09, 2006 at 11:44:22AM -0600, James Dickens wrote:
> > I unmounted several ZFS filesystems, and then executed
> > /etc/init.d/nfs.server stop. Then the machine crashed, so I have a
> > crash dump if anyone is interested.
>
> We'd definitely like to see at least the stack trace and panic message.
> You can get these by running:
>
> # mdb <dump>
> > ::status
> > ::stack

# mdb unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
fcp fctl emlxs nca md audiosup random zfs nfs sppp crypto ptm lofs ipc
logindmux cpc fcip wrsmd ]
>

Okay, here it is:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
fcp fctl emlxs nca md audiosup random zfs nfs sppp crypto ptm lofs ipc
logindmux cpc fcip wrsmd ]
> ::status
debugging crash dump vmcore.0 (64-bit) from enterprise
operating system: 5.11 snv_27 (sun4u)
panic message: BAD TRAP: type=31 rp=2a100a16fe0 addr=50 mmu_fsr=0
occurred in module "ip" due to a NULL pointer dereference
dump content: kernel pages only
> ::stack
tcp_fuse_rcv_drain+0x1c4(0, 3000446bb40, 3000446bf68, 70030c00, 0, 0)
tcp_fuse_disable_pair+0xb8(300025dfb40, 1, a38ef238e99, 3000446bb40, 300025dff80, 3000446bf80)
tcp_unfuse+0xc(300025dfb40, 30000e406a0, 180e580, 3000446bb40, 800000001ba84676, 0)
tcp_close_output+0x104(300025df980, 6, 300025dfd18, 300025dfb40, 6, 18)
squeue_enter+0x3ac(60000515f00, 300025dfe40, 1369850, 300025df980, 6, 0)
tcp_close+0x7c(6000553ef70, 300025df980, 0, 300025dfe30, 300025dfb40, 0)
qdetach+0x90(6000553ef70, 700310a0, 83, 600004008f0, 0, 20204032)
strclose+0x3b4(600055d2380, 6000553ef70, 600004008f0, 30004e90470, 200000, 40000)
device_close+0x94(60007295e00, 83, 30002073200, 600004008f0, 2100, 4)
spec_close+0x1a0(60007295e00, 0, 420, 600054c40a0, 600004008f0, 600054c4018)
fop_close+0x20(60007295e00, 83, 1, 0, 600004008f0, 11efe88)
closef+0x4c(30021bd85b0, 0, 18a5400, 18ab800, 30021bd85b0, 0)
closeall+0x4c(300101bb200, f, 360ee5dc, 7, 7855dc00, 6)
proc_exit+0x388(1, 0, ffff, 1856c00, 600006a9c40, 3000559bde0)
exit+8(1, 0, ffbff9f0, 1, ff3707a8, ff269f31)
syscall_trap32+0xcc(0, 0, ffbff9f0, 1, ff3707a8, ff269f31)
>

James

> > I booted into single-user mode to see if zdb would still fail:
> >
> > # zdb -bb data
> > error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
> > [L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
> > lzjb BE contiguous birth=1902279 fill=445
> > cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50
> >
> > No ZFS filesystems were mounted, and I have the core file for this if
> > anyone is interested. I forgot the -L, but since no ZFS filesystems
> > were mounted I figure it's unnecessary.
>
> That's curious. Let's tackle the kernel panic first, as it may be
> related.
>
> --matt
>