Since people are using zdb, I decided to try it...

# zdb -s data
error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
[L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1893280 fill=445
cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
Abort (core dumped)
# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t8d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
#

Is the above a known bug?

# uname -av
SunOS enterprise 5.11 snv_27 sun4u sparc SUNW,Ultra-2
#

Okay, then I tried this other command:

# zdb -bb p-16-32
zdb: can't open p-16-32: error 2
#
# zdb -u data
error: ZFS: bad checksum (read on raidz off 478bbe800: zio 1006b5c00
[L0 DMU objset] vdev=1 offset=478bbe800 size=400L/400P/800A fletcher4
uncompressed BE contiguous birth=1893320 fill=445
cksum=2e63ca20e:265d9024c07:ff39f90c092f:4723fcdebe07ea): error 50
Abort (core dumped)
#

No hard errors are reported on the drives. The first raidz group is four 9.0 GB 10k RPM SCA drives; the second is five 4.3 GB 7,200 RPM drives.

James Dickens
Hello James,

Thursday, March 9, 2006, 3:41:03 AM, you wrote:

JD> # zdb -bb p-16-32
JD> zdb: can't open p-16-32: error 2

You don't have a pool named p-16-32, do you?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
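For reference, zpool list enumerates the pools that actually exist on a system (the output below is illustrative, not from James's machine):

# zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
data    55.5G   12.3G   43.2G    22%  ONLINE     -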
On 3/8/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello James,
>
> Thursday, March 9, 2006, 3:41:03 AM, you wrote:
>
> JD> # zdb -bb p-16-32
> JD> zdb: can't open p-16-32: error 2
>
> You don't have a pool named p-16-32, do you?
>

Nope, I didn't realize that p-16-32 was supposed to be a pool name. Okay, re-running with my pool name:

# zdb -bb data
error: ZFS: bad checksum (read on raidz off 127330c00: zio 1006df140
[L0 DMU objset] vdev=1 offset=127330c00 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1894214 fill=445
cksum=743221094:323e84de013:b099ee6cf963:1a469da16421b2): error 50
Abort (core dumped)
#

> --
> Best regards,
> Robert                          mailto:rmilkowski at task.gda.pl
> http://milek.blogspot.com
>
On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> Since people are using zdb, I decided to try it...
>
> # zdb -s data
> error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> lzjb BE contiguous birth=1893280 fill=445
> cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> Abort (core dumped)

If data in your pool is currently being modified, zdb doesn't always work. The '-L' flag is meant to handle live pools, but it doesn't always work either; if you run it a few times with -L you might get lucky. I actually just ran into this bug myself yesterday, and have filed:

6396042 'zdb -L' should work as described (ie. on live pools)

Sorry about that. I should have at least mentioned -L when I asked you to run zdb before... From your other emails it looks like you got it working anyway.

--matt
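A minimal sketch of the retry approach Matt describes (the retry count and sleep interval are arbitrary, and combining -L with -s is assumed to work on this build):

# Retry zdb -L a few times; failures against a live pool can be transient.
for i in 1 2 3 4 5; do
    zdb -L -s data && break
    echo "zdb attempt $i failed, retrying..." >&2
    sleep 2
done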
On Thu, 2006-03-09 at 04:11, Matthew Ahrens wrote:
> On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> > Since people are using zdb, I decided to try it...
> >
> > # zdb -s data
> > error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> > [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> > lzjb BE contiguous birth=1893280 fill=445
> > cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> > Abort (core dumped)

So, I, too, saw that error message and thought "eek! pool damage!" and fired off a zpool scrub just to be sure.

> I should have at least mentioned -L when I asked you
> to run zdb before... From your other emails it looks like you got it
> working anyway.

I think there's a second bug lurking here. I just filed:

6396160 zdb should not needlessly worry sysadmins when run on a live pool

since it would be clever if zdb noticed that it was aimed at a live pool without -L and failed more gracefully.

- Bill
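Until that's fixed, a rough approximation of such a guard can be scripted around zdb (a sketch only, assuming an imported pool shows up in zpool list; this is not the actual fix for 6396160):

# Warn before pointing zdb at an imported (live) pool.
pool=data
if zpool list "$pool" >/dev/null 2>&1; then
    echo "warning: pool '$pool' is imported; use zdb -L or expect" \
         "spurious checksum errors" >&2
fi
zdb -L -bb "$pool"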
On 3/9/06, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> On Thu, 2006-03-09 at 04:11, Matthew Ahrens wrote:
> > On Wed, Mar 08, 2006 at 08:41:03PM -0600, James Dickens wrote:
> > > Since people are using zdb, I decided to try it...
> > >
> > > # zdb -s data
> > > error: ZFS: bad checksum (read on raidz off 17ac77800: zio 100699380
> > > [L0 DMU objset] vdev=1 offset=17ac77800 size=400L/200P/400A fletcher4
> > > lzjb BE contiguous birth=1893280 fill=445
> > > cksum=c4165ec9d:535d1f8b21f:11fb8c9c3c44e:29fdff742931f8): error 50
> > > Abort (core dumped)
>
> So, I, too, saw that error message and thought "eek! pool damage!" and
> fired off a zpool scrub just to be sure.
>
> > I should have at least mentioned -L when I asked you
> > to run zdb before... From your other emails it looks like you got it
> > working anyway.
>

I wanted to test a bit further. I unmounted several ZFS filesystems, and then executed /etc/init.d/nfs.server stop. Then the machine crashed, so I have a crash dump if anyone is interested.

I booted into single-user mode to see if zdb would still fail:

# zdb -bb data
error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
[L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
lzjb BE contiguous birth=1902279 fill=445
cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50

No ZFS filesystems were mounted, and I have the core file for this if anyone is interested. I forgot the -L, but since no ZFS filesystems were mounted I figure it's unnecessary.

Let me know if anyone is interested in either of these files... currently running build 27 of Solaris Express.

James Dickens

> I think there's a second bug lurking here. I just filed:
>
> 6396160 zdb should not needlessly worry sysadmins when run on a live
> pool
>
> since it would be clever if zdb noticed that it was aimed at a live pool
> without -L and failed more gracefully.
>
> - Bill
>
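As an aside, unmounting the filesystems does not export the pool; it stays imported, so the kernel can still write to it. A quick way to confirm the state (a sketch, assuming this build's zfs and zpool accept these options; the dataset names and output are illustrative):

# zfs list -o name,mounted
NAME        MOUNTED
data        no
data/home   no
# zpool list -H -o name
data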
On Thu, Mar 09, 2006 at 11:44:22AM -0600, James Dickens wrote:
> I unmounted several ZFS filesystems, and then executed
> /etc/init.d/nfs.server stop. Then the machine crashed, so I have a
> crash dump if anyone is interested.

We'd definitely like to see at least the stack trace and panic message. You can get these by running:

# mdb <dump>
> ::status
> ::stack

> I booted into single-user mode to see if zdb would still fail:
>
> # zdb -bb data
> error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
> [L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
> lzjb BE contiguous birth=1902279 fill=445
> cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50
>
> No ZFS filesystems were mounted, and I have the core file for this if
> anyone is interested. I forgot the -L, but since no ZFS filesystems
> were mounted I figure it's unnecessary.

That's curious. Let's tackle the kernel panic first, as it may be related.

--matt
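The same two dcmds can also be captured non-interactively (a sketch; it assumes mdb reads dcmds from standard input and uses the standard savecore file names):

# printf '::status\n::stack\n' | mdb unix.0 vmcore.0 > panic.txt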
On 3/9/06, Matthew Ahrens <ahrens at sun.com> wrote:
> On Thu, Mar 09, 2006 at 11:44:22AM -0600, James Dickens wrote:
> > I unmounted several ZFS filesystems, and then executed
> > /etc/init.d/nfs.server stop. Then the machine crashed, so I have a
> > crash dump if anyone is interested.
>
> We'd definitely like to see at least the stack trace and panic message.
> You can get these by running:
>
> # mdb <dump>
> > ::status
> > ::stack

# mdb unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
fcp fctl emlxs nca md audiosup random zfs nfs sppp crypto ptm lofs ipc
logindmux cpc fcip wrsmd ]
>

Okay, here it is:

# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
fcp fctl emlxs nca md audiosup random zfs nfs sppp crypto ptm lofs ipc
logindmux cpc fcip wrsmd ]
> ::status
debugging crash dump vmcore.0 (64-bit) from enterprise
operating system: 5.11 snv_27 (sun4u)
panic message: BAD TRAP: type=31 rp=2a100a16fe0 addr=50 mmu_fsr=0
occurred in module "ip" due to a NULL pointer dereference
dump content: kernel pages only
> ::stack
tcp_fuse_rcv_drain+0x1c4(0, 3000446bb40, 3000446bf68, 70030c00, 0, 0)
tcp_fuse_disable_pair+0xb8(300025dfb40, 1, a38ef238e99, 3000446bb40, 300025dff80, 3000446bf80)
tcp_unfuse+0xc(300025dfb40, 30000e406a0, 180e580, 3000446bb40, 800000001ba84676, 0)
tcp_close_output+0x104(300025df980, 6, 300025dfd18, 300025dfb40, 6, 18)
squeue_enter+0x3ac(60000515f00, 300025dfe40, 1369850, 300025df980, 6, 0)
tcp_close+0x7c(6000553ef70, 300025df980, 0, 300025dfe30, 300025dfb40, 0)
qdetach+0x90(6000553ef70, 700310a0, 83, 600004008f0, 0, 20204032)
strclose+0x3b4(600055d2380, 6000553ef70, 600004008f0, 30004e90470, 200000, 40000)
device_close+0x94(60007295e00, 83, 30002073200, 600004008f0, 2100, 4)
spec_close+0x1a0(60007295e00, 0, 420, 600054c40a0, 600004008f0, 600054c4018)
fop_close+0x20(60007295e00, 83, 1, 0, 600004008f0, 11efe88)
closef+0x4c(30021bd85b0, 0, 18a5400, 18ab800, 30021bd85b0, 0)
closeall+0x4c(300101bb200, f, 360ee5dc, 7, 7855dc00, 6)
proc_exit+0x388(1, 0, ffff, 1856c00, 600006a9c40, 3000559bde0)
exit+8(1, 0, ffbff9f0, 1, ff3707a8, ff269f31)
syscall_trap32+0xcc(0, 0, ffbff9f0, 1, ff3707a8, ff269f31)
>

James

> > I booted into single-user mode to see if zdb would still fail:
> >
> > # zdb -bb data
> > error: ZFS: bad checksum (read on raidz off 348005400: zio 100698040
> > [L0 DMU objset] vdev=1 offset=348005400 size=400L/200P/400A fletcher4
> > lzjb BE contiguous birth=1902279 fill=445
> > cksum=60dbfbf2b:2911140c148:8dee8e87c45e:14d3bcde264186): error 50
> >
> > No ZFS filesystems were mounted, and I have the core file for this if
> > anyone is interested. I forgot the -L, but since no ZFS filesystems
> > were mounted I figure it's unnecessary.
>
> That's curious. Let's tackle the kernel panic first, as it may be
> related.
>
> --matt
>