Today my production server crashed 4 times. THIS IS A NIGHTMARE! Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.

I cannot fsck it, there's no such tool.
I cannot scrub it, it crashes 30-40 minutes after the scrub starts.
I cannot use it, it crashes several times every day! And with every crash the number of checksum failures grows:

    NAME    STATE   READ WRITE CKSUM
    box5    ONLINE     0     0     0
...after a few hours...
    box5    ONLINE     0     0     4
...after a few hours...
    box5    ONLINE     0     0    62
...after another few hours...
    box5    ONLINE     0     0   120
...crash! and we start again...
    box5    ONLINE     0     0     0
...etc...

Actually 120 is the record; sometimes it crashes as soon as it boots.

And there is always a permanent error:

errors: Permanent errors have been detected in the following files:
        box5:<0x0>

and very wise self-healing advice from http://www.sun.com/msg/ZFS-8000-8A:
"Restore the file in question if possible. Otherwise restore the entire pool from backup."

Thanks, but if I restore it from backup it won't be ZFS anymore, that's for sure.

It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is "wait" for repair (I have 10U4, non-configurable). Then why does it panic?

Recently there were discussions on the failure of the OpenSolaris community. It's now been more than half a month since I reported this error. Nobody even posted something like "RTFM". Come on guys, I know you are there and busy with enterprise customers... but at least give me some troubleshooting ideas. I'm totally lost.

Just to remind you, it's a heavily loaded FS with 3-4 million files and folders.

Link to original post: http://www.opensolaris.org/jive/thread.jspa?threadID=57425

This message posted from opensolaris.org
On Thu, 1 May 2008, Rustam wrote:
> Today my production server crashed 4 times. THIS IS A NIGHTMARE!
> Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.
>
> I cannot fsck it, there's no such tool.
> I cannot scrub it, it crashes 30-40 minutes after the scrub starts.
> I cannot use it, it crashes several times every day! And with every crash the number of checksum failures grows:

Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is it non-redundant? If non-redundant, then there is not much that ZFS can really do if a device begins to fail.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Phillip Wagstrom -- Area SSE MidAmerica
2008-May-01 21:47 UTC
[zfs-discuss] ZFS still crashing after patch
Rustam wrote:
> Today my production server crashed 4 times. THIS IS A NIGHTMARE!
> Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.
>
> I cannot fsck it, there's no such tool. I cannot scrub it, it crashes
> 30-40 minutes after the scrub starts. I cannot use it, it crashes
> several times every day! And with every crash the number of checksum
> failures grows:
>
>     NAME    STATE   READ WRITE CKSUM
>     box5    ONLINE     0     0     0
> ...after a few hours...
>     box5    ONLINE     0     0     4
> ...after a few hours...
>     box5    ONLINE     0     0    62
> ...after another few hours...
>     box5    ONLINE     0     0   120
> ...crash! and we start again...
>     box5    ONLINE     0     0     0
> ...etc...
>
> Actually 120 is the record; sometimes it crashes as soon as it boots.
>
> And there is always a permanent error:
> errors: Permanent errors have been detected in the following files:
>         box5:<0x0>
>
> and very wise self-healing advice from http://www.sun.com/msg/ZFS-8000-8A:
> "Restore the file in question if possible. Otherwise restore the
> entire pool from backup."
>
> Thanks, but if I restore it from backup it won't be ZFS anymore,
> that's for sure.

That's a bit harsh. ZFS is telling you that you have corrupted data based on the checksums. Other types of filesystems would likely simply pass the corrupted data on silently.

> It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is "wait"
> for repair (I have 10U4, non-configurable). Then why does it panic?

Do you have the panic messages? ZFS won't cause panics based on bad checksums. It will by default cause a panic if it can't write data out to any device, or if it completely loses access to non-redundant devices or loses both redundant devices at the same time.

> Recently there were discussions on the failure of the OpenSolaris community.
> It's now been more than half a month since I reported this error.
> Nobody even posted something like "RTFM". Come on guys, I know you
> are there and busy with enterprise customers... but at least give me
> some troubleshooting ideas. I'm totally lost.
>
> Just to remind you, it's a heavily loaded FS with 3-4 million files and
> folders.
>
> Link to original post:
> http://www.opensolaris.org/jive/thread.jspa?threadID=57425

This shows the same number of checksum errors across 2 different channels and 4 different drives. Given that, I'd assume that this is likely a dual-channel HBA of some sort. It would appear that you either have bad hardware or some sort of driver issue.

Regards,
Phil
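For anyone following along: on Solaris 10 the panic message and the panicking thread's stack can usually be recovered from a saved crash dump with mdb(1). A minimal sketch, assuming savecore has written dump pair 0 into the default /var/crash/<hostname> directory (the index 0 is hypothetical; use the most recent unix.N/vmcore.N pair):

    # cd /var/crash/`hostname`
    # mdb unix.0 vmcore.0
    > ::status
    > ::stack
    > ::msgbuf
    > $q

::status prints the OS version and panic string, ::stack the stack of the panicking thread, and ::msgbuf the kernel messages leading up to the panic. That is exactly the kind of output quoted later in this thread.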
> Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is
> it non-redundant? If non-redundant, then there is not much that ZFS
> can really do if a device begins to fail.

It's RAID 10 (more info here: http://www.opensolaris.org/jive/thread.jspa?threadID=57425):

    NAME        STATE   READ WRITE CKSUM
    box5        ONLINE     0     0     4
      mirror    ONLINE     0     0     2
        c1d0    ONLINE     0     0     4
        c2d0    ONLINE     0     0     4
      mirror    ONLINE     0     0     2
        c2d1    ONLINE     0     0     4
        c1d1    ONLINE     0     0     4

Actually, there's no damaged data so far. I don't get any "unable to read/write" kind of errors. It's just very strange checksum errors synchronized over all disks.

> That's a bit harsh. ZFS is telling you that you have corrupted data
> based on the checksums. Other types of filesystems would likely simply
> pass the corrupted data on silently.

Checksums are good, no complaints about that.

> Do you have the panic messages? ZFS won't cause panics based on bad
> checksums. It will by default cause a panic if it can't write data out to
> any device, or if it completely loses access to non-redundant devices or
> loses both redundant devices at the same time.

A number of panic messages and crash dump stack traces are attached to the original post (http://www.opensolaris.org/jive/thread.jspa?threadID=57425). Here is a short snippet:

> ::status
debugging crash dump vmcore.5 (64-bit) from core
operating system: 5.10 Generic_127112-07 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffe800017f8d0 addr=238 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only

> ::stack
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()
traverse_callback+0x6a()
traverse_segment+0x118()
traverse_more+0x7b()
spa_scrub_thread+0x147()
thread_start+8()

> This shows the same number of checksum errors across 2
> different channels and 4 different drives. Given that, I'd assume that
> this is likely a dual-channel HBA of some sort. It would appear that
> you either have bad hardware or some sort of driver issue.

You're right, this is the dual-channel Intel ICH6 SATA controller. 10U4 has native support/drivers for this SATA controller (AHCI drivers, AFAIK). The thing is that this hardware and ZFS have been in production for almost 2 years (OK, not the best argument). However, the problem appeared only recently (20 days ago). It's even stranger because I didn't make any OS/driver upgrade or patch during the last 2-3 months.

However, this is a good point. I've seen some new SATA/AHCI drivers available in 10U5. Maybe I should try to upgrade and see if it helps.

Thanks Phil.

--
Rustam

This message posted from opensolaris.org
On Thu, 1 May 2008, Rustam wrote:
> operating system: 5.10 Generic_127112-07 (i86pc)

Seems kind of old. I am using Generic_127112-11 here.

Probably many hundreds of nasty bugs have been eliminated since the version you are using.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> Seems kind of old. I am using Generic_127112-11 here.
>
> Probably many hundreds of nasty bugs have been
> eliminated since the version you are using.

I've updated to the latest available kernel, 127128-11 (from 28 Apr), which included a number of fixes to the AHCI SATA driver and ZFS.

Didn't help. It keeps crashing.

The worst thing is that I don't know where the problem is. Any more ideas on how to find it?

This message posted from opensolaris.org
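One low-impact way to split "bad hardware" from "driver or ZFS bug" on Solaris 10 is to see what the fault manager and the per-device error counters have already recorded. A sketch of that check (these commands exist on stock Solaris 10, though the level of detail reported varies by release and driver):

    # fmdump -eV | more
    # fmadm faulty
    # iostat -En

fmdump -eV lists the raw error reports (ereports), including ZFS checksum events; fmadm faulty lists any faults FMA has actually diagnosed; iostat -En shows the soft/hard/transport error counters the disk drivers keep per device.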
Rustam <rustam <at> code.az> writes:
> Didn't help. Keeps crashing.
> The worst thing is that I don't know where the problem is. Any more ideas on
> how to find it?

Lots of CKSUM errors like you see are often indicative of bad hardware. Run memtest for 24-48 hours.

-marc
I don't think this is a hardware issue, though I don't rule it out. I'll try to explain why.

1. I've replaced all the memory modules, which are the most likely cause of such a problem.

2. There are many different applications running on that server (Apache, PostgreSQL, etc.). However, if you look at the four different crash dump stack traces you see the same picture:

------ crash dump st1 ------
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()

------ crash dump st2 ------
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

------ crash dump st3 ------
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

------ crash dump st4 ------
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

All four crash dumps show the problem at zio_read/zio_buf_alloc. Three of them occurred during metadata prefetch (dmu_prefetch) and one during scrubbing. I don't think that's a coincidence. IMHO, the checksum errors are the result of this inconsistency. I tend to think the problem is in ZFS and that it exists even in the latest Solaris version (maybe OpenSolaris as well).

> Lots of CKSUM errors like you see are often indicative
> of bad hardware. Run
> memtest for 24-48 hours.
>
> -marc

This message posted from opensolaris.org
Hello Rustam,

Saturday, May 3, 2008, 9:16:41 AM, you wrote:

R> I don't think this is a hardware issue, though I don't rule it out. I'll try to explain why.

R> 1. I've replaced all the memory modules, which are the most likely cause of such a problem.

R> 2. There are many different applications running on that server
R> (Apache, PostgreSQL, etc.). However, if you look at the four
R> different crash dump stack traces you see the same picture:

R> ------ crash dump st1 ------
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> spa_scrub_io_start+0xf1()
R> spa_scrub_cb+0x13d()

R> ------ crash dump st2 ------
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> ------ crash dump st3 ------
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> ------ crash dump st4 ------
R> mutex_enter+0xb()
R> zio_buf_alloc+0x1a()
R> zio_read+0xba()
R> arc_read+0x3cc()
R> dbuf_prefetch+0x11d()
R> dmu_prefetch+0x107()
R> zfs_readdir+0x408()
R> fop_readdir+0x34()

R> All four crash dumps show the problem at zio_read/zio_buf_alloc. Three
R> of them occurred during metadata prefetch (dmu_prefetch) and one
R> during scrubbing. I don't think that's a coincidence. IMHO, the
R> checksum errors are the result of this inconsistency.

Which would happen if you have a problem with HW and you're getting wrong checksums on both sides of your mirrors. Maybe the PS (power supply)?

Try memtest anyway, or SunVTS.

--
Best regards,
Robert Milkowski                       mailto:milek at task.gda.pl
                                       http://milek.blogspot.com
Hello Robert,

> Which would happen if you have a problem with HW and you're getting
> wrong checksums on both sides of your mirrors. Maybe the PS (power supply)?
>
> Try memtest anyway, or SunVTS

Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest requires too much downtime, which I cannot afford right now.

However, I have some interesting observations, and now I can reproduce the crash. It seems that I have bad checksum(s), and ZFS crashes each time it tries to read them. Below are two cases.

Case 1: I got a checksum error not striped over the mirrors; this time it was the checksum for a file and not <0x0>. I tried to read the file twice. The first try returned an I/O error, the second try caused a panic. Here's the log:

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     2
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     2
            c2d1    ONLINE       0     0     4
            c1d1    ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# ll /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
-rw-------   1 user    group        489 Apr 20  2006 /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
cat: input error on /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file: I/O error

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     4
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     4
            c2d1    ONLINE       0     0     8
            c1d1    ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
(Kernel Panic: BAD TRAP: type=e (#pf Page fault) rp=fffffe8001112490 addr=fffffe80882b7000)

... (after system boot up)

core# rm /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        box5:<0x0>
        box5:<0x4a049a>

core# mdb unix.17 vmcore.17
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba uhci fctl nca lofs zfs random nfs ipc sppp crypto ptm ]
> ::status
debugging crash dump vmcore.17 (64-bit) from core
operating system: 5.10 Generic_127128-11 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffe8001112490 addr=fffffe80882b7000
dump content: kernel pages only
> ::stack
fletcher_2_native+0x13()
zio_checksum_verify+0x27()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_children_done+0x15()
zio_next_stage+0x65()
zio_vdev_io_assess+0x84()
zio_next_stage+0x65()
vdev_cache_read+0x14c()
vdev_disk_io_start+0x135()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
zio_vdev_io_start+0x116()
zio_next_stage+0x65()
zio_ready+0xec()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_children_ready+0x15()
zio_next_stage_async+0xae()
zio_nowait+9()
arc_read+0x414()
dbuf_read_impl+0x1a0()
dbuf_read+0x95()
dmu_buf_hold_array_by_dnode+0x217()
dmu_buf_hold_array+0x81()
dmu_read_uio+0x49()
zfs_read+0x13c()
fop_read+0x31()
read+0x188()
read32+0xe()
_sys_sysenter_post_swapgs+0x14b()
> ::msgbuf
MESSAGE
...
panic[cpu0]/thread=ffffffff8b98e240:
BAD TRAP: type=e (#pf Page fault) rp=fffffe8001112490 addr=fffffe80882b7000

cat:
#pf Page fault
Bad kernel fault at addr=0xfffffe80882b7000
pid=17993, pc=0xfffffffff1192923, sp=0xfffffe8001112588, eflags=0x10287
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: fffffe80882b7000 cr3: df609000 cr8: c
rdi: fffffe80882b7000 rsi: fffffe80884ae200 rdx: fffffe80011125c0
rcx: 30fceb4cd1a3ca3f  r8: 93e4f7b6a3198113  r9: 73457f74abfdcdf5
rax:   dc3bf3d311149f rbx: ffffffff8f68a700 rbp: fffffe8001112630
r10: ffffffff8ecb5c30 r11:                0 r12: fffffe80884ae200
r13:               c0 r14:           200200 r15: fffffe80882ae000
fsb: ffffffff80000000 gsb: fffffffffbc24e40  ds:               43
 es:               43  fs:                0  gs:              1c3
trp:                e err:                0 rip: fffffffff1192923
 cs:               28 rfl:            10287 rsp: fffffe8001112588
 ss:               30

fffffe80011123a0 unix:real_mode_end+71e1 ()
fffffe8001112480 unix:trap+5e6 ()
fffffe8001112490 unix:cmntrap+140 ()
fffffe8001112630 zfs:zfsctl_ops_root+30d9707b ()
fffffe8001112650 zfs:zio_checksum_verify+27 ()
fffffe8001112660 zfs:zio_next_stage+65 ()
fffffe8001112690 zfs:zio_wait_for_children+49 ()
fffffe80011126a0 zfs:zio_wait_children_done+15 ()
fffffe80011126b0 zfs:zio_next_stage+65 ()
fffffe80011126f0 zfs:zio_vdev_io_assess+84 ()
fffffe8001112700 zfs:zio_next_stage+65 ()
fffffe80011127e0 zfs:vdev_cache_read+14c ()
fffffe8001112820 zfs:vdev_disk_io_start+135 ()
fffffe8001112830 zfs:vdev_io_start+12 ()
fffffe8001112860 zfs:zio_vdev_io_start+7b ()
fffffe8001112870 zfs:zfsctl_ops_root+30db75c6 ()
fffffe8001112880 zfs:zio_nowait+9 ()
fffffe80011128e0 zfs:vdev_mirror_io_start+a9 ()
fffffe80011128f0 zfs:vdev_io_start+12 ()
fffffe8001112920 zfs:zio_vdev_io_start+7b ()
fffffe8001112930 zfs:zfsctl_ops_root+30db75c6 ()
fffffe8001112940 zfs:zio_nowait+9 ()
fffffe80011129a0 zfs:vdev_mirror_io_start+a9 ()
fffffe80011129d0 zfs:zio_vdev_io_start+116 ()
fffffe80011129e0 zfs:zio_next_stage+65 ()
fffffe8001112a00 zfs:zio_ready+ec ()
fffffe8001112a10 zfs:zio_next_stage+65 ()
fffffe8001112a40 zfs:zio_wait_for_children+49 ()
fffffe8001112a50 zfs:zio_wait_children_ready+15 ()
fffffe8001112a60 zfs:zfsctl_ops_root+30db75c6 ()
fffffe8001112a70 zfs:zio_nowait+9 ()
fffffe8001112af0 zfs:arc_read+414 ()
fffffe8001112b70 zfs:dbuf_read_impl+1a0 ()
fffffe8001112bb0 zfs:zfsctl_ops_root+30d812dd ()
fffffe8001112c20 zfs:dmu_buf_hold_array_by_dnode+217 ()
fffffe8001112c70 zfs:dmu_buf_hold_array+81 ()
fffffe8001112cd0 zfs:zfsctl_ops_root+30d84641 ()
fffffe8001112d30 zfs:zfs_read+13c ()
fffffe8001112d80 genunix:fop_read+31 ()
fffffe8001112eb0 genunix:read+188 ()
fffffe8001112ec0 genunix:read32+e ()
fffffe8001112f10 unix:brand_sys_sysenter+1f2 ()

syncing file systems... 14 10 8 ... 8 done (not all i/o completed)
dumping to /dev/dsk/c0d0s1, offset 215547904, content: kernel
> ^D

-----------------------------------------

Case 2: At this point I started thinking about how to reproduce the error. Obviously, I need to read all blocks to stumble upon the bad checksum(s). So I decided to back up the whole box5 pool using zfs send/receive to another pool I had in the system, box7. I got very interesting results:

# zpool status -v
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        box5:<0x0>

  pool: box7
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box7        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c3t6d0  ONLINE       0     0     0
            c3t7d0  ONLINE       0     0     0

errors: No known data errors

# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
box5           175G   282G   175G  /u02
box7          4.26G   681G  4.26G  /box7

# zfs snapshot box5@backup1
# zfs send box5@backup1 | zfs receive box7/backup1 &
[1] 11056

# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
box5           175G   282G   175G  /u02
box5@backup1  2.07M      -   175G  -
box7          4.26G   681G  4.26G  /box7
box7/backup1  1.50K   681G  1.50K  /box7/backup1

(Kernel Panic. See panic message below.)

... (after system boot up)

# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
box5           175G   282G   175G  /u02
box5@backup1  24.2M      -   175G  -
box7          6.35G   679G  6.35G  /box7

# mdb unix.20 vmcore.20
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba uhci fctl nca lofs zfs random nfs ipc sppp crypto ptm ]
> ::status
debugging crash dump vmcore.20 (64-bit) from core
operating system: 5.10 Generic_127128-11 (i86pc)
panic message:
ZFS: bad checksum (read on <unknown> off 0: zio ffffffff8b504c80 [L0 DMU dnode] 4000L/e00P DVA[0]=<0:711cca000:e00> DVA[1]=<1:2535927000:e00> fletcher4 lzjb LE contiguous birth=10088934 fill=19 cksum=dd5d13d3d3:1b19d851501a2:2034a615f84743c:c3ddc5ca046ccb96): error 50
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffff11b1af4()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_children_done+0x15()
zio_next_stage+0x65()
zio_vdev_io_assess+0x84()
zio_next_stage+0x65()
vdev_mirror_io_done+0xc1()
zio_vdev_io_done+0x14()
taskq_thread+0xbc()
thread_start+8()
> ::msgbuf
MESSAGE
...
panic[cpu0]/thread=fffffe800081dc80:
ZFS: bad checksum (read on <unknown> off 0: zio ffffffff8b504c80 [L0 DMU dnode] 4000L/e00P DVA[0]=<0:711cca000:e00> DVA[1]=<1:2535927000:e00> fletcher4 lzjb LE contiguous birth=10088934 fill=19 cksum=dd5d13d3d3:1b19d851501a2:2034a615f84743c:c3ddc5ca046ccb96): error 50

fffffe800081dac0 zfs:zfsctl_ops_root+30db624c ()
fffffe800081dad0 zfs:zio_next_stage+65 ()
fffffe800081db00 zfs:zio_wait_for_children+49 ()
fffffe800081db10 zfs:zio_wait_children_done+15 ()
fffffe800081db20 zfs:zio_next_stage+65 ()
fffffe800081db60 zfs:zio_vdev_io_assess+84 ()
fffffe800081db70 zfs:zio_next_stage+65 ()
fffffe800081dbd0 zfs:vdev_mirror_io_done+c1 ()
fffffe800081dbe0 zfs:zio_vdev_io_done+14 ()
fffffe800081dc60 genunix:taskq_thread+bc ()
fffffe800081dc70 unix:thread_start+8 ()

syncing file systems... 5 3 1 ... 1 done (not all i/o completed)
dumping to /dev/dsk/c0d0s1, offset 215547904, content: kernel
>

You can see checksum type fletcher4 (with lzjb compression?) in the panic message. However, all of the ZFS file systems use fletcher2 without compression, so this is obviously a corrupted checksum.

I've run the backup (send/receive) test described above 3 times. Every time I got the same error, and each time the system panicked approximately 45 minutes after the start.

From all this I conclude that it happens when "zfs send" tries to read the same bad checksum. The bad checksum could be the result of a software or hardware bug. Either way, it's obvious that ZFS cannot properly handle and fix this checksum problem.

I also tried to run zdb, but it ran out of memory:

# zdb box5
    version=4
    name='box5'
    state=0
    txg=11057710
    pool_guid=989471958079851180
    vdev_tree
        type='root'
        id=0
        guid=989471958079851180
        children[0]
                type='mirror'
                id=0
                guid=9879820675701757866
                metaslab_array=16
                metaslab_shift=31
                ashift=9
                asize=250045530112
                children[0]
                        type='disk'
                        id=0
                        guid=6440359594760637663
                        path='/dev/dsk/c1d0s0'
                        devid='id1,cmdk@AST3250820AS=____________9QE0LYMD/a'
                        whole_disk=1
                        DTL=125
                children[1]
                        type='disk'
                        id=1
                        guid=80751879044845160
                        path='/dev/dsk/c2d0s0'
                        devid='id1,cmdk@AST3250824AS=____________5ND3M6BQ/a'
                        whole_disk=1
                        DTL=23
        children[1]
                type='mirror'
                id=1
                guid=4781215615782453677
                whole_disk=0
                metaslab_array=13
                metaslab_shift=31
                ashift=9
                asize=250045530112
                children[0]
                        type='disk'
                        id=0
                        guid=4849048357332929360
                        path='/dev/dsk/c2d1s0'
                        devid='id1,cmdk@AST3250824AS=____________5ND3LQLB/a'
                        whole_disk=1
                        DTL=21
                children[1]
                        type='disk'
                        id=1
                        guid=15140711491939156235
                        path='/dev/dsk/c1d1s0'
                        devid='id1,cmdk@AST3250820AS=____________3QE01C2H/a'
                        whole_disk=1
                        DTL=19
Uberblock
        magic = 0000000000bab10c
        version = 4
        txg = 11061627
        guid_sum = 5267891425222527909
        timestamp = 1209925302 UTC = Sun May  4 23:21:42 2008
Dataset mos [META], ID 0, cr_txg 4, 176M, 201 objects
Dataset box5@backup1 [ZPL], ID 199, cr_txg 11060758, 175G, 4997266 objects
Dataset box5 [ZPL], ID 5, cr_txg 4, 175G, 4998785 objects
Traversing all blocks to verify checksums and verify nothing leaked ...
out of memory -- generating core dump
Abort (core dumped)

So what am I supposed to do now with all these 5 million objects? I cannot even back them up.

Regards,
Rustam.

This message posted from opensolaris.org
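zdb is an unsupported debugging tool and its option set varies between releases, so treat the following only as a sketch of how one might inspect pool metadata without the full "traverse all blocks" pass that ran out of memory above; the exact flags may behave differently on this Solaris 10 build:

    # zdb -C box5
    # zdb -uuu box5
    # zdb -l /dev/dsk/c1d0s0

-C prints the cached pool configuration, -u dumps the active uberblock, and -l dumps the on-disk vdev labels from one side of a mirror, none of which require walking the whole block tree.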
Rustam wrote:
> Hello Robert,
>
>> Which would happen if you have a problem with HW and you're getting
>> wrong checksums on both sides of your mirrors. Maybe the PS (power supply)?
>>
>> Try memtest anyway, or SunVTS
>>
> Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest requires too much downtime, which I cannot afford right now.
>

Sometimes if you read the docs, you can get confused by people who intend to confuse you. SunVTS does work on a wide variety of hardware, though it may not be "supported." To put it in perspective, SunVTS is used by Sun in the manufacturing process; it is the set of tests run on hardware before shipping to customers. It is not intended to be a generic "test whatever hardware you find lying around" product.

-- richard
Hello,

If you believe that the problem could be related to the ZIL code, you can try disabling it to debug (isolate) the problem. If it is not a file server (NFS), disabling the ZIL should not impact consistency.

Leal.

This message posted from opensolaris.org
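For anyone who wants to run this experiment: on Solaris 10 of this era the usual way to disable the ZIL was the zil_disable tunable, set either in /etc/system (applied at the next reboot) or poked into the running kernel with mdb. This is a debugging sketch only; the tunable affects synchronous write semantics rather than on-disk consistency, already-mounted filesystems may need to be remounted for it to take effect, and it was removed in later releases:

    * /etc/system, takes effect after reboot
    set zfs:zil_disable = 1

    # live change on a running kernel (use with care)
    # echo "zil_disable/W0t1" | mdb -kw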
Hello Leal,

I've already been warned (http://www.opensolaris.org/jive/message.jspa?messageID=231349) that the ZIL could be a cause, and I tested with the ZIL disabled. I ran a scrub and the system crashed after exactly the same period with the same error. The ZIL is known to cause some problems on writes, while all my problems are with zio_read and checksum_verify.

This is an NFS file server, but it crashed even with NFS unshared and nfs/server disabled. So this is not an NFS problem.

I reduced the panic frequency by setting zfs_prefetch_disable. This lets me avoid unnecessary reads and reduces the chances of hitting the bad checksums. For now I've had 24 hours without a crash, which is much better than a few times a day. However, I know the bad checksums are still there and I need to fix them somehow.

--
Rustam

This message posted from opensolaris.org
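For reference, setting zfs_prefetch_disable on Solaris 10 of this vintage was typically done in /etc/system, or applied to the live kernel with mdb; the lines below are a sketch of those two approaches, not a recommendation:

    * /etc/system, takes effect after reboot
    set zfs:zfs_prefetch_disable = 1

    # live change on a running kernel
    # echo "zfs_prefetch_disable/W0t1" | mdb -kw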
On Mon, 5 May 2008, Marcelo Leal wrote:
> Hello, If you believe that the problem could be related to the ZIL code,
> you can try disabling it to debug (isolate) the problem. If it is
> not a file server (NFS), disabling the ZIL should not impact
> consistency.

In what way is NFS special when it comes to ZFS consistency? If NFS consistency is lost by disabling the ZIL, then local consistency is also lost.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On May 5, 2008, at 1:43 PM, Bob Friesenhahn wrote:
> On Mon, 5 May 2008, Marcelo Leal wrote:
>
>> Hello, If you believe that the problem could be related to the ZIL code,
>> you can try disabling it to debug (isolate) the problem. If it is
>> not a file server (NFS), disabling the ZIL should not impact
>> consistency.
>
> In what way is NFS special when it comes to ZFS consistency? If NFS
> consistency is lost by disabling the ZIL, then local consistency is
> also lost.

That's not true:
http://blogs.sun.com/erickustarz/entry/zil_disable

Perhaps people are using "consistency" to mean different things here...

eric
On Mon, 5 May 2008, eric kustarz wrote:
>
> That's not true:
> http://blogs.sun.com/erickustarz/entry/zil_disable
>
> Perhaps people are using "consistency" to mean different things here...

Consistency means that fsync() assures that the data will be written to disk so no data is lost. It is not the same thing as "no corruption". ZFS will happily lose some data in order to avoid some corruption if the system loses power.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, 5 May 2008, Marcelo Leal wrote:
> I'm calling consistency "a coherent local view"...
> I think that was one option to debug (if not an NFS server), without
> generating a corrupted filesystem.

In other words, your flight reservation will not be lost if the system crashes.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On May 5, 2008, at 4:43 PM, Bob Friesenhahn wrote:
> On Mon, 5 May 2008, eric kustarz wrote:
>>
>> That's not true:
>> http://blogs.sun.com/erickustarz/entry/zil_disable
>>
>> Perhaps people are using "consistency" to mean different things
>> here...
>
> Consistency means that fsync() assures that the data will be written
> to disk so no data is lost. It is not the same thing as "no
> corruption". ZFS will happily lose some data in order to avoid some
> corruption if the system loses power.

OK, that makes more sense. You're talking from the application perspective, whereas my blog entry is from the file system's perspective (disabling the ZIL does not compromise on-disk consistency).

eric
Hello Richard,

Monday, May 5, 2008, 4:12:23 PM, you wrote:

RE> Rustam wrote:
>> Hello Robert,
>>
>>> Which would happen if you have a problem with HW and you're getting
>>> wrong checksums on both sides of your mirrors. Maybe the PS (power supply)?
>>>
>>> Try memtest anyway, or SunVTS
>>>
>> Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest requires too much downtime, which I cannot afford right now.
>>

RE> Sometimes if you read the docs, you can get confused by people who
RE> intend to confuse you. SunVTS does work on a wide variety of
RE> hardware, though it may not be "supported." To put it in
RE> perspective, SunVTS is used by Sun in the manufacturing process;
RE> it is the set of tests run on hardware before shipping to customers.
RE> It is not intended to be a generic "test whatever hardware you find
RE> lying around" product.

Nevertheless, you can actually "persuade" it to run on non-Sun HW; it's even in the manual page, IIRC.

--
Best regards,
Robert Milkowski                       mailto:milek at task.gda.pl
                                       http://milek.blogspot.com