I have a zpool on a JBOD SE3320 that I was using for data with Solaris 10
(the root/usr/var filesystems were all UFS). Unfortunately, we had a bit of
a mixup with SCSI cabling and I believe that we created a SCSI target clash.
The system was unloaded and nothing happened until I ran "zpool status" at
which point things broke. After correcting all the cabling, Solaris panic'd
before reaching single user.

Sun Support could only suggest restoring from backups - but unfortunately,
we do not have backups of some of the data that we would like to recover.

Since OpenSolaris has a much newer version of ZFS, I thought I would give
OpenSolaris a try, and it looks slightly more promising, though I still
can't access the pool. The following is using snv_125 on a T2000.

root@als253:~# zpool import -F data
Nov 17 15:26:46 opensolaris zfs: WARNING: can't open objset for data/backup
root@als253:~# zpool status -v data
  pool: data
 state: FAULTED
status: An intent log record could not be read.
        Waiting for adminstrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     3  bad intent log
          raidz2-0   DEGRADED     0     0    18
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0
root@als253:~# zpool online data c2t8d0
Nov 17 15:28:42 opensolaris zfs: WARNING: can't open objset for data/backup
cannot open 'data': pool is unavailable
root@als253:~# zpool clear data
cannot clear errors for data: one or more devices is currently unavailable
root@als253:~# zpool clear -F data
cannot open '-F': name must begin with a letter
root@als253:~# zpool status data
  pool: data
 state: FAULTED
status: One or more devices are faulted in response to persistent errors.
        There are insufficient replicas for the pool to continue functioning.
action: Destroy and re-create the pool from a backup source. Manually marking
        the device repaired using 'zpool clear' may allow some data to be
        recovered.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     1  corrupted data
          raidz2-0   FAULTED      0     0     6  corrupted data
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0
root@als253:~#

Annoyingly, data/backup is not a filesystem I'm especially worried about -
I'd just like to get access to the other filesystems on it. Is it possible
to hack the pool to make data/backup just disappear?

For that matter:
1) Why is the whole pool faulted when n-2 vdevs are online?
2) Given that metadata is triplicated, where did the objset go?

-- 
Peter Jeremy
There is a new PSARC case (in b126?) that allows rolling back to the latest
functioning uberblock. Maybe it can help you?
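I haven't used the new code myself, so treat the following only as a rough
sketch of what the interface is supposed to look like; the exact options may
differ in the build that ships it. The dry-run form reports how far back the
pool would be rewound and how much would be discarded, without modifying
anything:

    # zpool import -nF data

If that looks acceptable, the same command without -n performs the actual
rewind and import:

    # zpool import -F data

There is also supposed to be an "extreme rewind" option that searches
further back through older uberblocks (and so may discard more
transactions); I'd treat it as a last resort if plain -F fails:

    # zpool import -nFX data
    # zpool import -FX data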
Peter Jeremy wrote:
> I have a zpool on a JBOD SE3320 that I was using for data with Solaris
> 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had
> a bit of a mixup with SCSI cabling and I believe that we created a
> SCSI target clash. The system was unloaded and nothing happened until
> I ran "zpool status" at which point things broke. After correcting
> all the cabling, Solaris panic'd before reaching single user.

Do you have a crash dump of this panic saved?

> Sun Support could only suggest restoring from backups - but
> unfortunately, we do not have backups of some of the data that we
> would like to recover.
>
> Since OpenSolaris has a much newer version of ZFS, I thought I would
> give OpenSolaris a try, and it looks slightly more promising, though I
> still can't access the pool. The following is using snv_125 on a T2000.
>
> root@als253:~# zpool import -F data
> Nov 17 15:26:46 opensolaris zfs: WARNING: can't open objset for data/backup
> root@als253:~# zpool status -v data
> [... first 'zpool status -v data' output trimmed ...]
> root@als253:~# zpool online data c2t8d0
> Nov 17 15:28:42 opensolaris zfs: WARNING: can't open objset for data/backup
> cannot open 'data': pool is unavailable
> root@als253:~# zpool clear data
> cannot clear errors for data: one or more devices is currently unavailable
> root@als253:~# zpool clear -F data
> cannot open '-F': name must begin with a letter

Option -F is a new one, added with the pool recovery support, so it will
only be available from build 128 onwards.

> root@als253:~# zpool status data
> [... second 'zpool status data' output trimmed ...]
>
> Annoyingly, data/backup is not a filesystem I'm especially worried
> about - I'd just like to get access to the other filesystems on it.

I think it should be possible, at least in read-only mode. I cannot tell
whether full recovery will be possible, but there is at least a good chance
of getting some data back.
You can try build 128 as soon as it becomes available, or you can try to
build BFU archives from source and apply them to your build 125 BE.

> Is it possible to hack the pool to make data/backup just disappear?
> For that matter:
> 1) Why is the whole pool faulted when n-2 vdevs are online?

RAID-Z2 should survive two disk failures. But in this case, as you mention,
there was a misconfiguration on the storage side that may have caused a
SCSI target clash. ZFS verifies checksums, and here it looks like some
critical metadata block(s) in the most recent pool state fail checksum
verification, so corruption is present on some of the online disks too.
With one disk faulted and another degraded, ZFS is not able to identify
which of the remaining disks is bad by using combinatorial reconstruction.

> 2) Given that metadata is triplicated, where did the objset go?

Metadata replication helps to protect against failures localized in space,
but as all copies of metadata are written at the same time, it cannot
protect against failures localized in time.

regards,
victor
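P.S. If you want to see what ZFS has to work with before attempting any
recovery, it can be worth dumping the on-disk labels (which contain the
uberblock array) from one of the ONLINE disks:

    # zdb -l /dev/rdsk/c2t9d0s0

and examining the non-imported pool and its active uberblock:

    # zdb -e -u data

(The s0 slice name assumes the usual whole-disk EFI labelling, and zdb
options and output vary somewhat between builds, so treat this as a rough
sketch rather than a recipe.)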
On 2009-Nov-19 02:57:31 +0300, Victor Latushkin <Victor.Latushkin@Sun.COM> wrote:
>> all the cabling, Solaris panic'd before reaching single user.
>
> Do you have a crash dump of this panic saved?

Yes. It was provided to Sun Support.

> Option -F is a new one, added with the pool recovery support, so it will
> only be available from build 128 onwards.

OK, thanks - I knew it was new, but I wasn't certain exactly which build it
had been integrated into.

> I think it should be possible, at least in read-only mode. I cannot tell
> whether full recovery will be possible, but there is at least a good
> chance of getting some data back.

That's what I was hoping.

> You can try build 128 as soon as it becomes available, or you can try to
> build BFU archives from source and apply them to your build 125 BE.

I'm currently discussing this off-line with Tim Haley.

> Metadata replication helps to protect against failures localized in
> space, but as all copies of metadata are written at the same time, it
> cannot protect against failures localized in time.

Thanks for that. I suspected it might be something like this.

-- 
Peter Jeremy
On 2009-Nov-18 08:40:41 -0800, Orvar Korvar <knatte_fnatte_tjatte@yahoo.com> wrote:
> There is a new PSARC case (in b126?) that allows rolling back to the
> latest functioning uberblock. Maybe it can help you?

It's in b128, and the feedback I've received suggests it will work. I've
been trying to get the relevant ZFS bits for my b127 system but haven't
managed to get them to work so far.

-- 
Peter Jeremy
Hello,

This sounds similar to a problem I had a few months ago:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869512

I don't have a solution, but information from this possibly related bug
may help.

Andrew
On 2009-Nov-18 11:50:44 +1100, I wrote:
> I have a zpool on a JBOD SE3320 that I was using for data with Solaris
> 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had
> a bit of a mixup with SCSI cabling and I believe that we created a
> SCSI target clash. The system was unloaded and nothing happened until
> I ran "zpool status" at which point things broke. After correcting
> all the cabling, Solaris panic'd before reaching single user.

I wound up installing OpenSolaris snv_128a on some spare disks and this
enabled me to recover the data. Thanks to Tim Haley and Victor Latushkin
for their assistance.

As a first attempt, 'zpool import -F data' said "Destroy and re-create the
pool from a backup source."

'zpool import -nFX data' initially ran the system out of swap (I hadn't
attached any swap and it only has 8GB RAM):

WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
INIT: Couldn't write persistent state file `/etc/svc/volatile/init.state'.

After rebooting and adding some swap (which didn't seem to ever get used),
it did work (though it took several hours - unfortunately, I didn't record
exactly how long):

# zpool import -nFX data
Would be able to return data to its state as of Thu Jan 01 10:00:00 1970.
Would discard approximately 369 minutes of transactions.
# zpool import -FX data
Pool data returned to its state as of Thu Jan 01 10:00:00 1970.
Discarded approximately 369 minutes of transactions.
cannot share 'data/backup': share(1M) failed
cannot share 'data/JumpStart': share(1M) failed
cannot share 'data/OS_images': share(1M) failed
#

I notice that the two times aren't consistent, but the data appears to be
present and a 'zpool scrub' reported no errors. I have reverted back to
Solaris 10 and successfully copied all the data off.

-- 
Peter Jeremy
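P.S. For anyone else who hits the swap exhaustion during 'zpool import
-nFX': on an OpenSolaris system with a ZFS root, temporary swap can be
added on a zvol along these lines (the pool and volume names here are just
examples - adjust to your own setup and memory size):

    # zfs create -V 16G rpool/swap2
    # swap -a /dev/zvol/dsk/rpool/swap2
    # swap -l

and then re-run the dry-run import:

    # zpool import -nFX data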