Matt Ingenthron
2008-Mar-23 01:30 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
Hi all,

I'm migrating to a new laptop from one which has had hardware issues lately. I kept my home directory on ZFS, so in theory it should be straightforward to send/receive, but I've had issues. I've moved the disk out of the faulty system, though I saw the same issue there.

The behavior I see is that the zfs send will start and work up to a point (about 792 MByte), but then all ZFS activity with the source pool/filesystems hangs. When I was doing it across the network, all zpool/zfs commands on the sender would hang after hitting this point; now that both disks are on the same system, zpool/zfs commands touching the sending pool hang once it gets to this point.

Is there anything I can do to recover from this condition? As near as I can tell, I can mount the filesystem and access things, so I could try just a bulk copy, but I'd like to know why this won't work. Is there any data I can gather which may help identify the cause? It seems that zpool/zfs commands working with that pool shouldn't hang, even if one zfs send/receive is hitting this condition. I did some searches for bugs, but didn't find anything helpful.

Details:

snv_79b
imported pool: "oldspace"

messages:
# tail /var/adm/messages
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: winlock0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: pm0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
Mar 22 17:28:36 hancock ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.
Mar 22 17:28:36 hancock rdc: [ID 517869 kern.info] @(#) rdc: built 20:52:37 Nov 27 2007
Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: rdc0
Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] rdc0 is /pseudo/rdc@0
Mar 22 17:49:13 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.

no messages from fmadm:
# fmadm faulty
#

command:
# zfs send oldspace/home/mi109165@laptopmigration | zfs receive space/homes/mi109165

output from zdb:

oldspace
    version=3
    name='oldspace'
    state=0
    txg=3360658
    pool_guid=986377251057668768
    hostid=630972017
    hostname='hancock'
    vdev_tree
        type='root'
        id=0
        guid=986377251057668768
        children[0]
            type='disk'
            id=0
            guid=5072910331796803983
            path='/dev/dsk/c3t0d0s3'
            devid='id1,sd@f259bde7147e5a411000cc9950000/d'
            phys_path='/pci@0,0/pci1179,1@1a,7/storage@1/disk@0,0:d'
            whole_disk=0
            metaslab_array=14
            metaslab_shift=28
            ashift=9
            asize=32221691904
            is_log=0
            DTL=110
space
    version=10
    name='space'
    state=0
    txg=4
    pool_guid=2349984716539065036
    hostid=630972017
    hostname='hancock'
    vdev_tree
        type='root'
        id=0
        guid=2349984716539065036
        children[0]
            type='disk'
            id=0
            guid=11913572021826904359
            path='/dev/dsk/c1t0d0s7'
            devid='id1,sd@f00000000479cc692000ab26a0001/h'
            phys_path='/pci@0,0/pci1179,1@1f,2/disk@0,0:h'
            whole_disk=0
            metaslab_array=14
            metaslab_shift=29
            ashift=9
            asize=58098450432
            is_log=0

# iostat -E
sd0   Soft Errors: 172 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: FUJITSU MHW2120B Revision: 0013 Serial No:
Size: 120.03GB <120034123776 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 172 Predictive Failure Analysis: 0
sd1   Soft Errors: 0 Hard Errors: 409 Transport Errors: 0
Vendor: MATSHITA Product: DVD-RAM UJ-852S Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 409 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd3   Soft Errors: 305 Hard Errors: 0 Transport Errors: 0
Vendor: ST912082 Product: 1A Revision: 0014 Serial No:
Size: 120.03GB <120034123776 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 305 Predictive Failure Analysis: 0

# zpool status space
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        space       ONLINE       0     0     0
          c1t0d0s7  ONLINE       0     0     0

errors: No known data errors

# zpool status oldspace
^C  (process not responding...)

# zfs list -r space
NAME                       USED  AVAIL  REFER  MOUNTPOINT
space                     1.39G  51.8G    19K  /space
space/homes               1.39G  51.8G    18K  /space/homes
space/homes/mi109165       792M  51.8G   792M  /space/homes/mi109165
space/homes/new-mi109165   630M  51.8G   630M  /new-export/home/mi109165

# zfs list -r oldspace
^C  (process not responding...)

Thanks in advance for any help/pointers,

- Matt

This message posted from opensolaris.org
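[Since the filesystem is still mountable and readable, the bulk-copy fallback mentioned above can be sketched as a tar pipeline that walks the mounted filesystem instead of streaming the pool's block tree. This is a generic sketch, not a confirmed workaround; the paths in the example call are hypothetical stand-ins for the real mountpoints:]

```shell
# Minimal sketch of a bulk-copy fallback when zfs send hangs.
# Copies one directory tree to another, preserving permissions.
bulk_copy() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    # tar pipeline: archive to stdout, extract from stdin.
    # -C keeps archive member paths relative to the tree roots;
    # -p preserves permissions on extract.
    tar cf - -C "$src" . | tar xpf - -C "$dst"
}

# Example invocation (substitute the real mountpoints):
# bulk_copy /oldspace/home/mi109165 /space/homes/mi109165
```

[Unlike zfs send, a file-level copy touches each file independently, so a bad region should at worst fail one file's read rather than wedging the whole stream.]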
Matt Ingenthron
2008-Mar-23 02:49 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
One update to this: I tried a scrub. This found a number of errors on old snapshots (long story; I'd once done a zpool replace from an old disk with hardware errors to this disk). I destroyed the snapshots since they weren't needed. The snapshot I was trying to send did not have any errors.

After getting rid of those snapshots, I ran zpool clear on the device, but another zpool status still shows the errors (without metadata):

bash-3.2# zpool status -v
  pool: oldspace
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        oldspace    ONLINE       0     0     0
          c3t0d0s3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x33>:<0x2e00>
        <0x33>:<0x6000>
        <0x33>:<0x8a00>
        <0x33>:<0x2e01>
        <0x33>:<0x6001>
        <0x33>:<0x2e02>
        (with much more output...)

Despite a zpool clear and an export/import, those stuck around in the zpool status. Then, looking at the zpool man page, it looked like setting failmode=continue could help, but it wasn't clear how that would affect a zfs send. Another attempt at the zfs send/receive again failed with this in messages:

Mar 22 19:38:47 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.

Any pointers on what "manual intervention" to use would be greatly appreciated.

- Matt
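[For reference, the sequence hinted at by the man page would look something like the following. This is a sketch of standard zpool administration commands, not a confirmed fix: failmode=continue only changes how the pool reacts to further fatal I/O errors (return EIO instead of blocking), it does not repair anything, and the persistent error list is normally only cleared by a scrub that finds the blocks gone or repaired.]

```shell
# Sketch only -- run against the affected pool at your own risk.
zpool set failmode=continue oldspace   # return EIO on fatal errors instead of suspending I/O
zpool clear oldspace                   # reset the pool's error counters
zpool scrub oldspace                   # re-verify every block; updates the persistent error list
zpool status -v oldspace               # re-check which objects are still flagged
```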
Matt Ingenthron
2008-Mar-23 04:33 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
One more scrub later, and the snapshot I was trying to send, @laptopmigration, is now showing errors, but the errors on the old snapshots are gone since I destroyed them.

Is this expected behavior? Should the errors only show on one snapshot at a time? I have a suspicion that if I destroy this snapshot as well, they'll show on the actual underlying filesystem.

- Matt
Tim
2008-Mar-23 04:39 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
On Sat, Mar 22, 2008 at 11:33 PM, Matt Ingenthron <matt.ingenthron at sun.com> wrote:

> One more scrub later, and now the snapshot I was trying to send,
> @laptopmigration, is now showing errors but the errors on the old
> snapshots are gone, since I destroyed the snapshots.
>
> Is this expected behavior?  Should the errors only show on one snapshot
> at a time?  I have a suspicion that if I destroy this snapshot as well,
> they'll show on the actual underlying filesystem.
>
> - Matt

...You keep discussing errors on snapshots. Snapshots are simply pointers to data blocks... have you been addressing the errors in the actual data blocks? Deleting snapshots isn't going to fix anything if you have 10 snapshots all sharing a corrupted data block.
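[Tim's point, that snapshots are references to shared blocks rather than copies, can be illustrated with an ordinary-filesystem analogy using hard links. This is an illustration only: ZFS snapshots are copy-on-write block references, not hard links, but the failure mode is the same, since removing one name for damaged data does not repair the data itself.]

```shell
# Analogy: two names (a "filesystem" and a "snapshot") sharing one data block.
dir=$(mktemp -d)
echo "good data" > "$dir/live"   # the live file
ln "$dir/live" "$dir/snapshot"   # second reference to the same underlying data

echo "CORRUPT" > "$dir/live"     # damage the shared data; both names now see it

rm "$dir/snapshot"               # "destroying the snapshot"...
cat "$dir/live"                  # ...does not restore the data: prints CORRUPT
```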
Matt Ingenthron
2008-Mar-23 15:27 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
> ...You keep discussing errors on snapshots. Snapshots are simply pointers
> to data blocks... have you been addressing the errors in the actual data
> blocks? Deleting snapshots isn't going to fix anything if you have 10
> snapshots all sharing a corrupted data block.

I agree with you, but until now all of the scrubs had only shown errors on the oldest snapshot, not on the underlying filesystem. I don't doubt the same files were in an error condition on the filesystem, but it was odd that zpool status was only showing the errors on the snapshot. Oddly, zpool status is now showing errors on the files themselves too.
It was also unfortunate that the zfs send/receive would hang, and that even the zpool and zfs commands would hang when working with those pools/datasets. I guess the reason I was posting is that I was looking for the right "manual intervention" (which may be restoring or deleting the files and then removing all snapshots?), and the hanging behavior of zpool and zfs send seemed somewhat disconcerting. They would even block the system from both "init 5" and "reboot -n".

The reality with all of these files is that I either have them elsewhere or they can be safely deleted, since they're easily recreated. I'll try removing all of the snapshots, then recover/delete the underlying files, then run another scrub.

- Matt
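[The plan in the last paragraph would translate to roughly the following sequence. Dataset, snapshot, and file names here are illustrative placeholders; the real file paths have to be taken from the zpool status -v error list:]

```shell
# Sketch of the remediation plan; names are illustrative, not the actual datasets.
zfs list -t snapshot -r oldspace                     # enumerate remaining snapshots
zfs destroy oldspace/home/mi109165@laptopmigration   # remove snapshots referencing the bad blocks
rm /oldspace/home/mi109165/damaged-file              # delete (or restore from elsewhere) each flagged file
zpool scrub oldspace                                 # re-verify; a clean scrub should clear the error list
zpool status -v oldspace                             # confirm no permanent errors remain
```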
eric kustarz
2008-Mar-24 22:59 UTC
[zfs-discuss] uncorrectable error during zfs send; what are the right next steps?
> messages:
> # tail /var/adm/messages
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: winlock0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: pm0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
> Mar 22 17:28:36 hancock ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.
> Mar 22 17:28:36 hancock rdc: [ID 517869 kern.info] @(#) rdc: built 20:52:37 Nov 27 2007
> Mar 22 17:28:36 hancock pseudo: [ID 129642 kern.info] pseudo-device: rdc0
> Mar 22 17:28:36 hancock genunix: [ID 936769 kern.info] rdc0 is /pseudo/rdc@0
> Mar 22 17:49:13 hancock zfs: [ID 664491 kern.warning] WARNING: Pool 'oldspace' has encountered an uncorrectable I/O error. Manual intervention is required.
>
> no messages from fmadm
> # fmadm faulty
> #

This bug will cover this:

6623234 better FMA integration for 'failmode' property

eric