Matt Ingenthron
2008-Mar-23 01:30 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
Hi all,

I'm migrating to a new laptop from one which has had hardware issues lately. I kept my home directory on ZFS, so in theory it should be straightforward to send/receive, but I've had issues. I've moved the disk out of the faulty system, though I saw the same issue there.

The behavior I see is that the zfs send will start and work up to a point (about 792 MByte), but then all ZFS activity with the source pool/filesystems hangs. When I was doing it across the network, all zpool/zfs commands on the sender would hang after hitting this point; now that both disks are on the same system, zpool/zfs commands touching the sending pool hang once it gets to this point.

Is there anything I can do to recover from this condition? As near as I can tell, I can mount the filesystem and access things, so I could try just a bulk copy, but I'd like to know why this won't work. Is there any data I can gather which may help identify the cause? It seems that zpool/zfs commands working with that pool shouldn't hang, even if one zfs send/receive is hitting this condition. I did some searches for bugs, but didn't find anything helpful.

Details:

snv_79b
imported pool: "oldspace"

messages:
# tail /var/adm/messages
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: winlock0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: pm0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
Mar 22 17:28:36 hancock ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.
Mar 22 17:28:36 hancock rdc: [ID 517869 kern.info] @(#) rdc: built 20:52:37 Nov 27 2007
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: rdc0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] rdc0 is /pseudo/rdc@0
Mar 22 17:49:13 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.

no messages from fmadm:
# fmadm faulty
#

command:
# zfs send oldspace/home/mi109165@laptopmigration | zfs receive space/homes/mi109165

output from zdb:

oldspace
    version=3
    name='oldspace'
    state=0
    txg=3360658
    pool_guid=986377251057668768
    hostid=630972017
    hostname='hancock'
    vdev_tree
        type='root'
        id=0
        guid=986377251057668768
        children[0]
            type='disk'
            id=0
            guid=5072910331796803983
            path='/dev/dsk/c3t0d0s3'
            devid='id1,sd@f259bde7147e5a411000cc9950000/d'
            phys_path='/pci@0,0/pci1179,1@1a,7/storage@1/disk@0,0:d'
            whole_disk=0
            metaslab_array=14
            metaslab_shift=28
            ashift=9
            asize=32221691904
            is_log=0
            DTL=110
space
    version=10
    name='space'
    state=0
    txg=4
    pool_guid=2349984716539065036
    hostid=630972017
    hostname='hancock'
    vdev_tree
        type='root'
        id=0
        guid=2349984716539065036
        children[0]
            type='disk'
            id=0
            guid=11913572021826904359
            path='/dev/dsk/c1t0d0s7'
            devid='id1,sd@f00000000479cc692000ab26a0001/h'
            phys_path='/pci@0,0/pci1179,1@1f,2/disk@0,0:h'
            whole_disk=0
            metaslab_array=14
            metaslab_shift=29
            ashift=9
            asize=58098450432
            is_log=0

# iostat -E
sd0   Soft Errors: 172 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: FUJITSU MHW2120B Revision: 0013 Serial No:
Size: 120.03GB <120034123776 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 172 Predictive Failure Analysis: 0
sd1   Soft Errors: 0 Hard Errors: 409 Transport Errors: 0
Vendor: MATSHITA Product: DVD-RAM UJ-852S Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 409 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd3   Soft Errors: 305 Hard Errors: 0 Transport Errors: 0
Vendor: ST912082 Product: 1A Revision: 0014 Serial No:
Size: 120.03GB <120034123776 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 305 Predictive Failure Analysis: 0

# zpool status space
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        space       ONLINE       0     0     0
          c1t0d0s7  ONLINE       0     0     0

errors: No known data errors

# zpool status oldspace
^C  (process not responding...)

# zfs list -r space
NAME                       USED  AVAIL  REFER  MOUNTPOINT
space                     1.39G  51.8G    19K  /space
space/homes               1.39G  51.8G    18K  /space/homes
space/homes/mi109165       792M  51.8G   792M  /space/homes/mi109165
space/homes/new-mi109165   630M  51.8G   630M  /new-export/home/mi109165

# zfs list -r oldspace
^C  (process not responding...)

Thanks in advance for any help/pointers,

- Matt

This message posted from opensolaris.org
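[Since the filesystem is still mountable and readable, the bulk-copy fallback mentioned above can be sketched as a tar pipeline that walks the mounted filesystem instead of streaming the pool's block tree. This is a generic sketch, not a confirmed workaround; the paths in the example call are hypothetical stand-ins for the real mountpoints:]

```shell
# Minimal sketch of a bulk-copy fallback when zfs send hangs.
# Copies one directory tree to another, preserving permissions.
bulk_copy() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    # tar pipeline: archive to stdout, extract from stdin.
    # -C keeps archive member paths relative to the tree roots;
    # -p preserves permissions on extract.
    tar cf - -C "$src" . | tar xpf - -C "$dst"
}

# Example invocation (substitute the real mountpoints):
# bulk_copy /oldspace/home/mi109165 /space/homes/mi109165
```

[Unlike zfs send, a file-level copy touches each file independently, so a bad region should at worst fail one file's read rather than wedging the whole stream.]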
Matt Ingenthron
2008-Mar-23 02:49 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
One update to this: I tried a scrub. This found a number of errors on old snapshots (long story; I'd once done a zpool replace from an old disk with hardware errors to this disk). I destroyed the snapshots since they weren't needed. The snapshot I was trying to send did not have any errors.

After getting rid of those snapshots, I ran zpool clear on the device, but another zpool status still shows the errors (without metadata):

bash-3.2# zpool status -v
  pool: oldspace
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        oldspace    ONLINE       0     0     0
          c3t0d0s3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x33>:<0x2e00>
        <0x33>:<0x6000>
        <0x33>:<0x8a00>
        <0x33>:<0x2e01>
        <0x33>:<0x6001>
        <0x33>:<0x2e02>
        (with much more output...)

Despite a zpool clear and an export/import, those stuck around in the zpool status. Then, looking at the zpool man page, it looked like setting failmode=continue could help, but it wasn't clear how that would affect a zfs send. Another attempt at the zfs send/receive again failed with this in messages:

Mar 22 19:38:47 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.

Any pointers on what "manual intervention" to use would be greatly appreciated.

- Matt
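[For reference, the sequence hinted at by the man page would look something like the following. This is a sketch of standard zpool administration commands, not a confirmed fix: failmode=continue only changes how the pool reacts to further fatal I/O errors (return EIO instead of blocking), it does not repair anything, and the persistent error list is normally only cleared by a scrub that finds the blocks gone or repaired.]

```shell
# Sketch only -- run against the affected pool at your own risk.
zpool set failmode=continue oldspace   # return EIO on fatal errors instead of suspending I/O
zpool clear oldspace                   # reset the pool's error counters
zpool scrub oldspace                   # re-verify every block; updates the persistent error list
zpool status -v oldspace               # re-check which objects are still flagged
```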
Matt Ingenthron
2008-Mar-23 04:33 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
One more scrub later, and the snapshot I was trying to send, @laptopmigration, is now showing errors, but the errors on the old snapshots are gone since I destroyed them.

Is this expected behavior? Should the errors only show on one snapshot at a time? I have a suspicion that if I destroy this snapshot as well, they'll show on the actual underlying filesystem.

- Matt
Tim
2008-Mar-23 04:39 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
On Sat, Mar 22, 2008 at 11:33 PM, Matt Ingenthron <matt.ingenthron at sun.com> wrote:

> One more scrub later, and now the snapshot I was trying to send,
> @laptopmigration, is now showing errors but the errors on the old
> snapshots are gone, since I destroyed the snapshots.
>
> Is this expected behavior?  Should the errors only show on one snapshot
> at a time?  I have a suspicion that if I destroy this snapshot as well,
> they'll show on the actual underlying filesystem.
>
> - Matt

...You keep discussing errors on snapshots. Snapshots are simply pointers to data blocks... have you been addressing the errors in the actual data blocks? Deleting snapshots isn't going to fix anything if you have 10 snapshots all sharing a corrupted data block.
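[Tim's point, that snapshots are references to shared blocks rather than copies, can be illustrated with an ordinary-filesystem analogy using hard links. This is an illustration only: ZFS snapshots are copy-on-write block references, not hard links, but the failure mode is the same, since removing one name for damaged data does not repair the data itself.]

```shell
# Analogy: two names (a "filesystem" and a "snapshot") sharing one data block.
dir=$(mktemp -d)
echo "good data" > "$dir/live"   # the live file
ln "$dir/live" "$dir/snapshot"   # second reference to the same underlying data

echo "CORRUPT" > "$dir/live"     # damage the shared data; both names now see it

rm "$dir/snapshot"               # "destroying the snapshot"...
cat "$dir/live"                  # ...does not restore the data: prints CORRUPT
```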
Matt Ingenthron
2008-Mar-23 15:27 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
> ...You keep discussing errors on snapshots. Snapshots are simply pointers
> to data blocks... have you been addressing the errors in the actual data
> blocks? Deleting snapshots isn't going to fix anything if you have 10
> snapshots all sharing a corrupted data block.

I agree with you, but until now all of the scrubs had only shown errors on the oldest snapshot, not on the underlying filesystem. I don't doubt the same files were in an error condition on the filesystem, but it was odd that zpool status was only showing the errors on the snapshot. Oddly, zpool status is now showing errors on the files themselves too.
It was also unfortunate that the zfs send/receive would hang, and that even the zpool and zfs commands would hang when working with those pools/datasets. I guess the reason I was posting is that I was looking for the right "manual intervention" (which may be restoring or deleting the files and then removing all snapshots?), and the hanging behavior of zpool and zfs send seemed somewhat disconcerting. They would even block the system from both "init 5" and "reboot -n".

The reality with all of these files is that I either have them elsewhere or they can be safely deleted, since they're easily recreated. I'll try removing all of the snapshots, then recover/delete the underlying files, then run another scrub.

- Matt
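[The plan in the last paragraph would translate to roughly the following sequence. Dataset, snapshot, and file names here are illustrative placeholders; the real file paths have to be taken from the zpool status -v error list:]

```shell
# Sketch of the remediation plan; names are illustrative, not the actual datasets.
zfs list -t snapshot -r oldspace                     # enumerate remaining snapshots
zfs destroy oldspace/home/mi109165@laptopmigration   # remove snapshots referencing the bad blocks
rm /oldspace/home/mi109165/damaged-file              # delete (or restore from elsewhere) each flagged file
zpool scrub oldspace                                 # re-verify; a clean scrub should clear the error list
zpool status -v oldspace                             # confirm no permanent errors remain
```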
eric kustarz
2008-Mar-24 22:59 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
> messages:
> # tail /var/adm/messages
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: winlock0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: pm0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
> Mar 22 17:28:36 hancock ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.
> Mar 22 17:28:36 hancock rdc: [ID 517869 kern.info] @(#) rdc: built 20:52:37 Nov 27 2007
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: rdc0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] rdc0 is /pseudo/rdc@0
> Mar 22 17:49:13 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.
>
> no messages from fmadm
> # fmadm faulty
> #

This bug will cover this:

6623234 better FMA integration for 'failmode' property

eric