Olaf Seibert
2012-Feb-15 13:49 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
At the moment I am feverishly seeking advice for how to fix a broken ZFS
raidz2 I have (using FreeBSD 8.2-STABLE). This is the current status:

$ zpool status
  pool: tank
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
  scan: scrub repaired 0 in 49h3m with 2 errors on Fri Jan 20 15:10:35 2012
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     FAULTED      0     0     2
          raidz2-0               DEGRADED     0     0     8
            da0                  ONLINE       0     0     0
            da1                  ONLINE       0     0     0
            da2                  ONLINE       0     0     0
            da3                  ONLINE       0     0     0
            3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
            da5                  ONLINE       0     0     0

The strange thing is that the pool is FAULTED while its part is merely
DEGRADED.

da4 failed recently and was replaced with a new disk, but no resilvering
is taking place.

I've already tried lots of things with this, including exporting and then
"zpool import -nFX tank". (I only got it imported back with "zpool import
-V tank".) The -nFX ("extreme rewind") option gives no output, but there
is a lot of I/O activity going on, as if it is rewinding forever, or in a
loop, or something like that.

One thing that may, or may not, complicate things is the following.
Already quite a while ago there suddenly was a directory that was so
corrupted that zfs reported I/O errors for various files in it. I could
not even remove them; in the end I moved the other files to a new
directory, put the original directory to the side, and made it mode 000.
(If rewinding wants to go back to before this happened, I can understand
that it takes a while, but I left it running overnight and it didn't make
visible progress.)

zdb and various other commands complain about the pool not being
available, or I/O errors. For instance:

fourquid.1:~$ sudo zpool clear -nF tank
fourquid.1:~$ sudo zpool clear -F tank
cannot clear errors for tank: I/O error
fourquid.1:~$ sudo zpool clear -nFX tank
(no output, uses some cpu, some I/O)

zdb -v                       ok
zdb -v -c tank               zdb: can't open 'tank': input/output error
zdb -v -l /dev/da[01235]     ok
zdb -v -u tank               zdb: can't open 'tank': Input/output error
zdb -v -l -u /dev/da[01235]  ok
zdb -v -m tank               zdb: can't open 'tank': Input/output error
zdb -v -m -X tank            no output, uses cpu and I/O
zdb -v -i tank               zdb: can't open 'tank': Input/output error
zdb -v -i -F tank            zdb: can't open 'tank': Input/output error
zdb -v -i -X tank            no output, uses cpu and I/O

Are there any hints you can give me? I have the full FreeBSD source
online, so I can modify some tools if needed.

Thanks in advance,
-Olaf.

-- 
Pipe rene = new PipePicture();  assert(Not rene.GetType().Equals(Pipe));
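P.S. One more thing I am considering, mainly to stop anything else from
being written while I experiment, is a read-only import. I am not sure
whether my pool/ZFS version on 8.2-STABLE supports the readonly property
at import time, so this is only a sketch of what I have in mind:

sudo zpool import -f -R /mnt -o readonly=on tank

(-R /mnt so nothing gets mounted over live filesystems, -o readonly=on so
that no writes can make matters worse while I poke at it.)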
Tiemen Ruiten
2012-Feb-15 14:24 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On 02/15/2012 02:49 PM, Olaf Seibert wrote:
> This is the current status:
>
> $ zpool status
>   pool: tank
>  state: FAULTED
> status: One or more devices could not be opened.  There are insufficient
>         replicas for the pool to continue functioning.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-3C
>   scan: scrub repaired 0 in 49h3m with 2 errors on Fri Jan 20 15:10:35 2012
> config:
>
>         NAME                     STATE     READ WRITE CKSUM
>         tank                     FAULTED      0     0     2
>           raidz2-0               DEGRADED     0     0     8
>             da0                  ONLINE       0     0     0
>             da1                  ONLINE       0     0     0
>             da2                  ONLINE       0     0     0
>             da3                  ONLINE       0     0     0
>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>             da5                  ONLINE       0     0     0
>
> The strange thing is that the pool is FAULTED while its part is merely
> DEGRADED.
>
> da4 failed recently and was replaced with a new disk, but no resilvering
> is taking place.

The correct sequence to replace a failed drive in a ZFS pool is:

zpool offline tank da4
shutdown and replace the drive
zpool replace tank da4

You can see a history of the modifications you've made to your pool with:

zpool history

Probably you haven't gone through this sequence correctly, and now ZFS is
still referring to the old/wrong UUID (the number you see instead of da4)
and therefore thinks the disk is unavailable.

Hope that helps,

Tiemen
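P.S. Since the pool now shows the GUID instead of da4, it may also be
necessary to name the old vdev explicitly in the replace command, along
these lines (untested here, and it may well refuse while the pool is
FAULTED rather than merely DEGRADED):

zpool replace tank 3758301462980058947 da4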
Paul Kraus
2012-Feb-15 15:02 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On Wed, Feb 15, 2012 at 9:24 AM, Tiemen Ruiten <tiemen at dgr.am> wrote:
> The correct sequence to replace a failed drive in a ZFS pool is:
>
> zpool offline tank da4
> shutdown and replace the drive
> zpool replace tank da4

Are you saying that you cannot replace a failed drive without shutting
down the system? If that is the case with FreeBSD, then I suggest that
FreeBSD is not ready for production use. I know that under Solaris you
_can_ replace failed drives with no downtime to the end users; we do it
on a regular basis.

I suspect there is a method to replace a failed drive under FreeBSD with
no outage (assuming the drive is in a hot-swap capable enclosure), but as
I am not familiar with FreeBSD I do not know what it is.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
Tiemen Ruiten
2012-Feb-15 15:23 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On 02/15/2012 04:02 PM, Paul Kraus wrote:
> Are you saying that you cannot replace a failed drive without
> shutting down the system? If that is the case with FreeBSD, then I
> suggest that FreeBSD is not ready for production use. I know that
> under Solaris you _can_ replace failed drives with no downtime to the
> end users; we do it on a regular basis.
>
> I suspect there is a method to replace a failed drive under
> FreeBSD with no outage (assuming the drive is in a hot-swap capable
> enclosure), but as I am not familiar with FreeBSD I do not know what
> it is.

Hm, no, that's not what I meant; I guess I shouldn't have included that
line. Simply offlining the device (to make sure no attempts to access it
are made) should be sufficient, assuming the drive is indeed in a
hot-swap bay.

Tiemen
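P.S. For completeness, I would expect the no-shutdown version on FreeBSD
to look roughly like this; I don't have a FreeBSD machine at hand, so
treat the camcontrol steps as an assumption rather than tested advice:

zpool offline tank da4
(pull the failed disk, insert the new one in the hot-swap bay)
camcontrol rescan all
(check that the new disk shows up with "camcontrol devlist")
zpool replace tank da4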
Olaf Seibert
2012-Feb-16 10:57 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On Wed 15 Feb 2012 at 14:49:14 +0100, Olaf Seibert wrote:
>         NAME                     STATE     READ WRITE CKSUM
>         tank                     FAULTED      0     0     2
>           raidz2-0               DEGRADED     0     0     8
>             da0                  ONLINE       0     0     0
>             da1                  ONLINE       0     0     0
>             da2                  ONLINE       0     0     0
>             da3                  ONLINE       0     0     0
>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>             da5                  ONLINE       0     0     0

Current status: I've been running "zdb -bcsvL -e -L -p /dev tank", a
magical command I found at
http://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/.
I apparently had to export the tank first.

It has been running overnight now, and the only output so far was

fourquid.0:/tmp$ sudo zdb -bcsvL -e -L -p /dev tank

Traversing all blocks to verify checksums ...
zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 0> DVA[0]=<0:508c6a90c00:3000> DVA[1]=<0:1813ba6c800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1c00P birth=244334305L/244334305P fill=18480533 cksum=2a43556fd2b:95a3245729a27:15e3e48f3c6a490e:70fa77061df61a76 -- skipping
zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 3> DVA[0]=<0:508c6aa2000:3000> DVA[1]=<0:1813ba72800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1e00P birth=244334321L/244334321P fill=16777409 cksum=2ad6a555e8f:a1dcced71be6c:191abf84e5905b05:e8564e4004372491 -- skipping

with the "error 122" messages appearing after an hour or so.

Would these 2 errors be the "2" in the CKSUM column?

I haven't tried yet whether this has automagically fixed / unlinked these
blocks, but if it didn't, how would I do that? How can I see whether
these blocks are "important", i.e. whether they are required for access
to much data?

Would running it on OpenIndiana or so instead of on FreeBSD make a
difference?

-Olaf.

-- 
Pipe rene = new PipePicture();  assert(Not rene.GetType().Equals(Pipe));
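P.S. If I read the zdb output correctly, the <42, 0, 3, 0> tuple is
<objset, object, level, blockid>, and object 0 together with "DMU dnode"
would mean the dnode (metadata) object of objset 42. To find out which
dataset objset 42 is, I plan to try something along these lines (guessing
a bit at the exact options needed for an exported pool):

fourquid.0:/tmp$ sudo zdb -e -p /dev -d tank

and then look for the dataset whose ID is 42 in the listing.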
2012-02-16 14:57, Olaf Seibert wrote:
> On Wed 15 Feb 2012 at 14:49:14 +0100, Olaf Seibert wrote:
>>         NAME                     STATE     READ WRITE CKSUM
>>         tank                     FAULTED      0     0     2
>>           raidz2-0               DEGRADED     0     0     8
>>             da0                  ONLINE       0     0     0
>>             da1                  ONLINE       0     0     0
>>             da2                  ONLINE       0     0     0
>>             da3                  ONLINE       0     0     0
>>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>>             da5                  ONLINE       0     0     0
>
> Current status: I've been running "zdb -bcsvL -e -L -p /dev tank", a
> magical command I found at
> http://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/.
> I apparently had to export the tank first.
>
> It has been running overnight now, and the only output so far was
>
> fourquid.0:/tmp$ sudo zdb -bcsvL -e -L -p /dev tank
>
> Traversing all blocks to verify checksums ...
> zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 0> DVA[0]=<0:508c6a90c00:3000> DVA[1]=<0:1813ba6c800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1c00P birth=244334305L/244334305P fill=18480533 cksum=2a43556fd2b:95a3245729a27:15e3e48f3c6a490e:70fa77061df61a76 -- skipping
> zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 3> DVA[0]=<0:508c6aa2000:3000> DVA[1]=<0:1813ba72800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1e00P birth=244334321L/244334321P fill=16777409 cksum=2ad6a555e8f:a1dcced71be6c:191abf84e5905b05:e8564e4004372491 -- skipping
>
> with the "error 122" messages appearing after an hour or so.
>
> Would these 2 errors be the "2" in the CKSUM column?
>
> I haven't tried yet whether this has automagically fixed / unlinked
> these blocks, but if it didn't, how would I do that?

ZDB so far is supposed to do only read-only checks, by directly accessing
the storage hardware. Whatever errors it finds are not propagated to the
kernel and do not get fixed, nor do they cause kernel panics if they are
unfixable by the current algorithms. And it doesn't use the ARC cache, so
zdb is often quite slow (fetching my pool's DDT into a text file for
grep-analysis took over a day).

Well, the way I've been sent off a number of times is "you can see it in
the source"; namely, go to src.illumos.org and search the freebsd-gate
for the error message text, the error number, or the function name
zdb_blkptr_cb(). From there you can try to figure out the logic that led
to the error.

I've only got error=50 reported so far, and apparently those were blocks
that unrecoverably mismatched their previously known checksums (CKSUM
error counts at the raidz and pool levels). I've also had more errors at
the raidz level (2) than at the pool level (1); I guess this means that
the logical block had two ditto copies in ZFS, and both were erroneous.
For the pool these were counted as a single block with two DVA addresses,
and for the raidz these were two separately stored and broken blocks. I
may be wrong in that interpretation, though.

> How can I see whether
> these blocks are "important", i.e. whether they are required for access
> to much data?

These are L3 blocks, so they stand above at least three layers of
indirect block pointers (in sets of up to 128 entries in each
intermediate block). Like this:

L3
  L2[0] ... L2[127]
    L1[0,0] ... L1[0,127]  ...  L1[127,0] ... L1[127,127]
      lots of L0[0,0,0] - L0[127,127,127] block pointers referencing your
      userdata (including redundant ditto copies)

Overall, if your data sizes require it, there can be up to L7 blocks in
the structure of each ZFS object, to ultimately address its zillions of
L0 blocks. Each LN block has a "fill" field which tells you how many L0
blocks it addresses in the end.
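As a rough upper bound, and assuming the default 16K indirect blocks
(128 block pointers per indirect level), one L3 pointer can stand above
at most

  128 * 128 * 128 = 2,097,152 L0 blocks

and since yours are blocks of the DMU dnode object, where (if I recall
the on-disk layout right) each 16K L0 block packs 32 dnodes of 512 bytes,
that works out to somewhere up to roughly 67 million dnodes, i.e.
files/objects, below a single L3 block.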
In your case fill=18480533 and fill=16777409 amount to quite a lot of
data.

> Would running it on OpenIndiana or so instead of on FreeBSD make a
> difference?

Not sure about that... I THINK most of the code should be the same, so it
probably depends on which platform you're more used to working on (and
recompiling some test kernels on, in particular). I haven't tried FreeBSD
in over a decade, so I can't help there :)

I'm still trying to punch my pool into a sane state, so I can say that
the current OpenIndiana code does not contain a miraculous recovery
wizard ;)

> -Olaf.

Good luck, really,
//Jim