Olaf Seibert
2012-Feb-15 13:49 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
At the moment I am feverishly seeking advice for how to fix a broken ZFS
raidz2 I have (using FreeBSD 8.2-STABLE). This is the current status:

$ zpool status
  pool: tank
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
  scan: scrub repaired 0 in 49h3m with 2 errors on Fri Jan 20 15:10:35 2012
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     FAULTED      0     0     2
          raidz2-0               DEGRADED     0     0     8
            da0                  ONLINE       0     0     0
            da1                  ONLINE       0     0     0
            da2                  ONLINE       0     0     0
            da3                  ONLINE       0     0     0
            3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
            da5                  ONLINE       0     0     0

The strange thing is that the pool is FAULTED while its part is merely
DEGRADED.

da4 failed recently and was replaced with a new disk, but no resilvering
is taking place.

I've already tried lots of things with this, including exporting and then
"zpool import -nFX tank". (I only got it imported back with "zpool import
-V tank".) The -nFX ("extreme rewind") option gives no output, but there
is a lot of I/O activity going on, as if it is rewinding forever, or in a
loop, or something like that.

One thing that may, or may not, complicate things is the following.
Already quite a while ago there suddenly was a directory that was so
corrupted that zfs reported I/O errors for various files in it. I could
not even remove them; in the end I moved the other files to a new
directory, put the original directory to the side, and made it mode 000.
(If rewinding wants to go back to before this happened, I can understand
that it takes a while, but I left it running overnight and it didn't make
visible progress.)

zdb and various other commands complain about the pool not being
available, or I/O errors. For instance:

fourquid.1:~$ sudo zpool clear -nF tank
fourquid.1:~$ sudo zpool clear -F tank
cannot clear errors for tank: I/O error
fourquid.1:~$ sudo zpool clear -nFX tank
(no output, uses some cpu, some I/O)

zdb -v                       ok
zdb -v -c tank               zdb: can't open 'tank': input/output error
zdb -v -l /dev/da[01235]     ok
zdb -v -u tank               zdb: can't open 'tank': Input/output error
zdb -v -l -u /dev/da[01235]  ok
zdb -v -m tank               zdb: can't open 'tank': Input/output error
zdb -v -m -X tank            no output, uses cpu and I/O
zdb -v -i tank               zdb: can't open 'tank': Input/output error
zdb -v -i -F tank            zdb: can't open 'tank': Input/output error
zdb -v -i -X tank            no output, uses cpu and I/O

Are there any hints you can give me? I have the full FreeBSD source
online, so I can modify some tools if needed.

Thanks in advance,
-Olaf.

-- 
Pipe rene = new PipePicture();  assert(Not rene.GetType().Equals(Pipe));
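P.S. One more thing I am considering, mainly to stop anything else from
being written while I experiment, is a read-only import. I am not sure
whether my pool/ZFS version on 8.2-STABLE supports the readonly property
at import time, so this is only a sketch of what I have in mind:

sudo zpool import -f -R /mnt -o readonly=on tank

(-R /mnt so nothing gets mounted over live filesystems, -o readonly=on so
that no writes can make matters worse while I poke at it.)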
Tiemen Ruiten
2012-Feb-15 14:24 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On 02/15/2012 02:49 PM, Olaf Seibert wrote:
> This is the current status:
>
> $ zpool status
>   pool: tank
>  state: FAULTED
> status: One or more devices could not be opened.  There are insufficient
>         replicas for the pool to continue functioning.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-3C
>   scan: scrub repaired 0 in 49h3m with 2 errors on Fri Jan 20 15:10:35 2012
> config:
>
>         NAME                     STATE     READ WRITE CKSUM
>         tank                     FAULTED      0     0     2
>           raidz2-0               DEGRADED     0     0     8
>             da0                  ONLINE       0     0     0
>             da1                  ONLINE       0     0     0
>             da2                  ONLINE       0     0     0
>             da3                  ONLINE       0     0     0
>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>             da5                  ONLINE       0     0     0
>
> The strange thing is that the pool is FAULTED while its part is merely
> DEGRADED.
>
> da4 failed recently and was replaced with a new disk, but no resilvering
> is taking place.

The correct sequence to replace a failed drive in a ZFS pool is:

zpool offline tank da4
shutdown and replace the drive
zpool replace tank da4

You can see a history of the modifications you've made to your pool with:

zpool history

Probably you haven't gone through this sequence correctly, and now ZFS is
still referring to the old/wrong UUID (the number you see instead of da4)
and therefore thinks the disk is unavailable.

Hope that helps,

Tiemen
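P.S. Since the pool now shows the GUID instead of da4, it may also be
necessary to name the old vdev explicitly in the replace command, along
these lines (untested here, and it may well refuse while the pool is
FAULTED rather than merely DEGRADED):

zpool replace tank 3758301462980058947 da4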
Paul Kraus
2012-Feb-15 15:02 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On Wed, Feb 15, 2012 at 9:24 AM, Tiemen Ruiten <tiemen at dgr.am> wrote:
> The correct sequence to replace a failed drive in a ZFS pool is:
>
> zpool offline tank da4
> shutdown and replace the drive
> zpool replace tank da4

Are you saying that you cannot replace a failed drive without shutting
down the system? If that is the case with FreeBSD, then I suggest that
FreeBSD is not ready for production use. I know that under Solaris you
_can_ replace failed drives with no downtime to the end users; we do it
on a regular basis.

I suspect there is a method to replace a failed drive under FreeBSD with
no outage (assuming the drive is in a hot-swap capable enclosure), but as
I am not familiar with FreeBSD I do not know what it is.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
Tiemen Ruiten
2012-Feb-15 15:23 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On 02/15/2012 04:02 PM, Paul Kraus wrote:
> Are you saying that you cannot replace a failed drive without
> shutting down the system? If that is the case with FreeBSD, then I
> suggest that FreeBSD is not ready for production use. I know that
> under Solaris you _can_ replace failed drives with no downtime to the
> end users; we do it on a regular basis.
>
> I suspect there is a method to replace a failed drive under
> FreeBSD with no outage (assuming the drive is in a hot-swap capable
> enclosure), but as I am not familiar with FreeBSD I do not know what
> it is.

Hm, no, that's not what I meant; I guess I shouldn't have included that
line. Simply offlining the device (to make sure no attempts to access it
are made) should be sufficient, assuming the drive is indeed in a
hot-swap bay.

Tiemen
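P.S. For completeness, I would expect the no-shutdown version on FreeBSD
to look roughly like this; I don't have a FreeBSD machine at hand, so
treat the camcontrol steps as an assumption rather than tested advice:

zpool offline tank da4
(pull the failed disk, insert the new one in the hot-swap bay)
camcontrol rescan all
(check that the new disk shows up with "camcontrol devlist")
zpool replace tank da4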
Olaf Seibert
2012-Feb-16 10:57 UTC
[zfs-discuss] [O.Seibert@cs.ru.nl: A broken ZFS pool...]
On Wed 15 Feb 2012 at 14:49:14 +0100, Olaf Seibert wrote:
>         NAME                     STATE     READ WRITE CKSUM
>         tank                     FAULTED      0     0     2
>           raidz2-0               DEGRADED     0     0     8
>             da0                  ONLINE       0     0     0
>             da1                  ONLINE       0     0     0
>             da2                  ONLINE       0     0     0
>             da3                  ONLINE       0     0     0
>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>             da5                  ONLINE       0     0     0

Current status: I've been running "zdb -bcsvL -e -L -p /dev tank", a
magical command I found at
http://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/.
I apparently had to export the tank first.

It has been running overnight now, and the only output so far was

fourquid.0:/tmp$ sudo zdb -bcsvL -e -L -p /dev tank

Traversing all blocks to verify checksums ...
zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 0> DVA[0]=<0:508c6a90c00:3000> DVA[1]=<0:1813ba6c800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1c00P birth=244334305L/244334305P fill=18480533 cksum=2a43556fd2b:95a3245729a27:15e3e48f3c6a490e:70fa77061df61a76 -- skipping
zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 3> DVA[0]=<0:508c6aa2000:3000> DVA[1]=<0:1813ba72800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1e00P birth=244334321L/244334321P fill=16777409 cksum=2ad6a555e8f:a1dcced71be6c:191abf84e5905b05:e8564e4004372491 -- skipping

with the "error 122" messages appearing after an hour or so.

Would these 2 errors be the "2" in the CKSUM column?

I haven't tried yet whether this has automagically fixed / unlinked these
blocks, but if it didn't, how would I do that? How can I see whether
these blocks are "important", i.e. whether they are required for access
to much data?

Would running it on OpenIndiana or so instead of on FreeBSD make a
difference?

-Olaf.

-- 
Pipe rene = new PipePicture();  assert(Not rene.GetType().Equals(Pipe));
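P.S. If I read the zdb output correctly, the <42, 0, 3, 0> tuple is
<objset, object, level, blockid>, and object 0 together with "DMU dnode"
would mean the dnode (metadata) object of objset 42. To find out which
dataset objset 42 is, I plan to try something along these lines (guessing
a bit at the exact options needed for an exported pool):

fourquid.0:/tmp$ sudo zdb -e -p /dev -d tank

and then look for the dataset whose ID is 42 in the listing.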
2012-02-16 14:57, Olaf Seibert wrote:
> On Wed 15 Feb 2012 at 14:49:14 +0100, Olaf Seibert wrote:
>>         NAME                     STATE     READ WRITE CKSUM
>>         tank                     FAULTED      0     0     2
>>           raidz2-0               DEGRADED     0     0     8
>>             da0                  ONLINE       0     0     0
>>             da1                  ONLINE       0     0     0
>>             da2                  ONLINE       0     0     0
>>             da3                  ONLINE       0     0     0
>>             3758301462980058947  UNAVAIL      0     0     0  was /dev/da4
>>             da5                  ONLINE       0     0     0
>
> Current status: I've been running "zdb -bcsvL -e -L -p /dev tank", a
> magical command I found at
> http://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/.
> I apparently had to export the tank first.
>
> It has been running overnight now, and the only output so far was
>
> fourquid.0:/tmp$ sudo zdb -bcsvL -e -L -p /dev tank
>
> Traversing all blocks to verify checksums ...
> zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 0> DVA[0]=<0:508c6a90c00:3000> DVA[1]=<0:1813ba6c800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1c00P birth=244334305L/244334305P fill=18480533 cksum=2a43556fd2b:95a3245729a27:15e3e48f3c6a490e:70fa77061df61a76 -- skipping
> zdb_blkptr_cb: Got error 122 reading <42, 0, 3, 3> DVA[0]=<0:508c6aa2000:3000> DVA[1]=<0:1813ba72800:3000> [L3 DMU dnode] fletcher4 lzjb LE contiguous unique double size=4000L/1e00P birth=244334321L/244334321P fill=16777409 cksum=2ad6a555e8f:a1dcced71be6c:191abf84e5905b05:e8564e4004372491 -- skipping
>
> with the "error 122" messages appearing after an hour or so.
>
> Would these 2 errors be the "2" in the CKSUM column?
>
> I haven't tried yet whether this has automagically fixed / unlinked
> these blocks, but if it didn't, how would I do that?

ZDB so far is supposed to do only read-only checks, by directly accessing
the storage hardware. Whatever errors it finds are not propagated to the
kernel and do not get fixed, nor do they cause kernel panics if they are
unfixable by the current algorithms. And it doesn't use the ARC cache, so
zdb is often quite slow (fetching my pool's DDT into a text file for
grep-analysis took over a day).

Well, the way I've been sent off a number of times is "you can see it in
the source"; namely, go to src.illumos.org and search the freebsd-gate
for the error message text, the error number, or the function name
zdb_blkptr_cb(). From there you can try to figure out the logic that led
to the error.

I've only got error=50 reported so far, and apparently those were blocks
that unrecoverably mismatched their previously known checksums (CKSUM
error counts at the raidz and pool levels). I've also had more errors at
the raidz level (2) than at the pool level (1); I guess this means that
the logical block had two ditto copies in ZFS, and both were erroneous.
For the pool these were counted as a single block with two DVA addresses,
and for the raidz these were two separately stored and broken blocks. I
may be wrong in that interpretation, though.

> How can I see whether
> these blocks are "important", i.e. whether they are required for access
> to much data?

These are L3 blocks, so they stand above at least three layers of
indirect block pointers (in sets of up to 128 entries in each
intermediate block). Like this:

L3
  L2[0] ... L2[127]
    L1[0,0] ... L1[0,127]  ...  L1[127,0] ... L1[127,127]
      lots of L0[0,0,0] - L0[127,127,127] block pointers referencing your
      userdata (including redundant ditto copies)

Overall, if your data sizes require it, there can be up to L7 blocks in
the structure of each ZFS object, to ultimately address its zillions of
L0 blocks. Each LN block has a "fill" field which tells you how many L0
blocks it addresses in the end.
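As a rough upper bound, and assuming the default 16K indirect blocks
(128 block pointers per indirect level), one L3 pointer can stand above
at most

  128 * 128 * 128 = 2,097,152 L0 blocks

and since yours are blocks of the DMU dnode object, where (if I recall
the on-disk layout right) each 16K L0 block packs 32 dnodes of 512 bytes,
that works out to somewhere up to roughly 67 million dnodes, i.e.
files/objects, below a single L3 block.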
In your case fill=18480533 and fill=16777409 amount to quite a lot of
data.

> Would running it on OpenIndiana or so instead of on FreeBSD make a
> difference?

Not sure about that... I THINK most of the code should be the same, so it
probably depends on which platform you're more used to working on (and
recompiling some test kernels on, in particular). I haven't tried FreeBSD
in over a decade, so I can't help there :)

I'm still trying to punch my pool into a sane state, so I can say that
the current OpenIndiana code does not contain a miraculous recovery
wizard ;)

> -Olaf.

Good luck, really,
//Jim