I would like to pick the brains of the ZFS experts on this list: what would
you do next to try and recover this ZFS pool?

I have a ZFS RAIDZ1 pool named bank0 that I cannot import. It was composed
of four 1.5 TiB disks. One disk is totally dead. Another had SMART errors,
but using GNU ddrescue I was able to copy all the data off successfully.

I have copied all three remaining disks as images using dd onto another
filesystem. Using loopback devices I can treat these images as if they were
real disks. I've made a snapshot of the filesystem the disk images are on
so that I can try things and roll back the changes if needed.

"gir" is the computer these disks are hosted on. It used to be a Nexenta
server, but is now Ubuntu 11.10 with the ZFS on Linux modules.

I have tried booting the Solaris Express 11 Live CD and doing
"zpool import -fFX bank0", which ran for ~6 hours and put out:
"one or more devices is currently unavailable".

I have tried "zpool import -fFX bank0" on Linux with the same results.

I have tried moving the drives back into the controller configuration they
were in before, and booted my old Nexenta root disk where the
/etc/zfs/zpool.cache still had an entry for bank0. I was not able to get
the filesystems mounted. I can't remember what errors I got. I can do it
again if the errors might be useful.

Here is the output of the different utils:

root@gir:/bank3/hd# zpool import -d devs
  pool: bank0
    id: 3936305481264476979
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        bank0          FAULTED  corrupted data
          raidz1-0     DEGRADED
            loop0      ONLINE
            loop1      ONLINE
            loop2      ONLINE
            c10t2d0p0  UNAVAIL

root@gir:/bank3/hd# zpool import -d devs bank0
cannot import 'bank0': pool may be in use from other system, it was
last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
use '-f' to import anyway

root@gir:/bank3/hd# zpool import -f -d devs bank0
cannot import 'bank0': I/O error
        Destroy and re-create the pool from
        a backup source.

root@gir:/bank3/hd# zdb -e -p devs bank0

Configuration for import:
        vdev_children: 1
        version: 26
        pool_guid: 3936305481264476979
        name: 'bank0'
        state: 0
        hostid: 661351
        hostname: 'gir'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 3936305481264476979
            children[0]:
                type: 'raidz'
                id: 0
                guid: 10967243523656644777
                nparity: 1
                metaslab_array: 23
                metaslab_shift: 35
                ashift: 9
                asize: 6001161928704
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 13554115250875315903
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
                    whole_disk: 0
                    DTL: 57
                    create_txg: 4
                    path: '/bank3/hd/devs/loop0'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 17894226827518944093
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
                    whole_disk: 0
                    DTL: 62
                    create_txg: 4
                    path: '/bank3/hd/devs/loop1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 9087312107742869669
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
                    whole_disk: 0
                    DTL: 61
                    create_txg: 4
                    faulted: 1
                    aux_state: 'err_exceeded'
                    path: '/bank3/hd/devs/loop2'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 13297176051223822304
                    path: '/dev/dsk/c10t2d0p0'
                    devid: 'id1,sd@SATA_____ST31500341AS________________9VS32K25/q'
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
                    whole_disk: 0
                    DTL: 60
                    create_txg: 4

zdb: can't open 'bank0': No such file or directory
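A minimal sketch of the dd/loopback arrangement described above; the source
device, image file names, and dataset name are illustrative, not necessarily
the exact ones used:

    # Image each surviving disk onto the scratch filesystem.
    dd if=/dev/sdb of=/bank3/hd/disk0.img bs=1M conv=noerror,sync

    # Snapshot the filesystem holding the images so any experiment can be
    # rolled back.
    zfs snapshot bank3/hd@before-recovery

    # Attach each image as a loop device and collect symlinks in one
    # directory, so zpool/zdb can be pointed at it with -d/-p.
    losetup /dev/loop0 /bank3/hd/disk0.img
    mkdir -p /bank3/hd/devs
    ln -s /dev/loop0 /bank3/hd/devs/loop0

    # Then, from /bank3/hd:
    zpool import -d devs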
Did you try "zpool clear -F bank0" with the latest Solaris Express?

Sent from my iPad

On Nov 5, 2011, at 2:35 PM, Myers Carpenter <myers at maski.org> wrote:

> I would like to pick the brains of the ZFS experts on this list: what
> would you do next to try and recover this ZFS pool?
> [...]
On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter <myers at maski.org> wrote:

> I would like to pick the brains of the ZFS experts on this list: what
> would you do next to try and recover this ZFS pool?

I hate running across threads that ask a question where the person who asked
never comes back to say what they eventually did, so...

To summarize: in late October I had two drives fail in a raidz1 pool. I was
able to recover all the data from one drive, but the other could not be seen
by the controller. "zpool import" was not working. I had 3 of the 4 drives;
why couldn't I mount this?

I read about every option in zdb and tried the ones that might tell me
something more about what was on the recovered drive. I eventually hit on:

    zdb -p devs -vvvve -lu /bank4/hd/devs/loop0

where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0, where I had set
up the disk image of the recovered drive.

This showed the uberblocks, which looked like this:

Uberblock[1]
        magic = 0000000000bab10c
        version = 26
        txg = 23128193
        guid_sum = 13396147021153418877
        timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
        rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400> DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous unique triple size=800L/200P birth=23128193L/23128193P fill=255 cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40

Then it all became clear: this drive had encountered errors one month before
the other drive failed, and ZFS had stopped writing to it.

So the lesson here: don't be a dumbass like me. Set up Nagios or some other
system to alert you when a pool has become degraded. ZFS works very well
with one drive out of the array; you probably aren't going to notice
problems unless you are proactively looking for them.

myers
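A quick way to compare the newest uberblocks across all of the loop-backed
devices at once, sketched here with the same illustrative paths and the same
zdb flags as above; a device whose highest txg/timestamp lags far behind the
others stopped being written to long before the pool failed:

    # Run from the directory containing devs/ (here /bank4/hd).
    for d in /bank4/hd/devs/loop0 /bank4/hd/devs/loop1 /bank4/hd/devs/loop2
    do
        echo "== $d =="
        # Dump the labels and uberblocks, keep only the txg/timestamp lines.
        zdb -p devs -vvvve -lu "$d" | grep -E 'txg = |timestamp = '
    done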
On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter <myers at maski.org> wrote:

> So the lesson here: don't be a dumbass like me. Set up Nagios or some other
> system to alert you when a pool has become degraded. ZFS works very well
> with one drive out of the array; you probably aren't going to notice
> problems unless you are proactively looking for them.

Or, if you aren't scrubbing on a regular basis, just change your zpool
failmode property. Had you set it to wait or panic, it would've been very
clear, very quickly, that something was wrong.

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

--Tim
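For reference, a minimal sketch of checking and changing the property Tim
mentions, using the pool name from this thread; failmode only governs
behavior on catastrophic pool failure, and the default is "wait":

    # Show the current setting for the pool.
    zpool get failmode bank0

    # Valid values are wait, continue, and panic.
    zpool set failmode=panic bank0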
On Thu, Dec 22, 2011 at 11:25 AM, Tim Cook <tim at cook.ms> wrote:

> On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter <myers at maski.org> wrote:
>
>> So the lesson here: don't be a dumbass like me. Set up Nagios or some other
>> system to alert you when a pool has become degraded. ZFS works very well
>> with one drive out of the array; you probably aren't going to notice
>> problems unless you are proactively looking for them.
>
> Or, if you aren't scrubbing on a regular basis, just change your zpool
> failmode property. Had you set it to wait or panic, it would've been very
> clear, very quickly, that something was wrong.
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

I'm not sure this will help, as a single failed drive in a raidz1 (or two in
a raidz2) will make the zpool DEGRADED and not FAULTED. I believe this
parameter governs behavior for a FAULTED zpool.

We have a very simple shell script that runs hourly, does a `zpool status -x`,
and generates an email to the admins if any pool is in any state other than
ONLINE. As soon as a zpool goes DEGRADED we get notified and can initiate the
correct response (opening a case with Oracle to replace the failed drive is
the usual one).

Here is the snippet from the script that does the actual health check (not my
code, I would have done it differently, but this works):

not_ok=`${zfs_path}/zpool status -x | egrep -v "all pools are healthy|no pools available"`

if [ "X${not_ok}" != "X" ]
then
        fault_details="There is at least one zpool error."
        let fault_count=fault_count+1
        new_faults[${fault_count}]=${fault_details}
fi

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
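A minimal sketch of wiring a check like the one above into cron; the script
name, install path, and admin address are assumptions, not Paul's actual
setup:

    #!/bin/sh
    # /usr/local/bin/check_zpools.sh -- mail the admins if any pool is in
    # any state other than ONLINE.
    not_ok=`zpool status -x | egrep -v "all pools are healthy|no pools available"`
    if [ "X${not_ok}" != "X" ]
    then
            echo "${not_ok}" | mailx -s "zpool problem on `hostname`" admins@example.com
    fi

    # root crontab entry to run it hourly:
    # 0 * * * * /usr/local/bin/check_zpools.sh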