I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks (Seagate Constellation) and the pool seems sick now. The pool has four raidz2 vdevs (8+2) where the first set of 10 disks were replaced a few months ago. I replaced two disks in the second set (c2t0d0, c3t0d0) a couple of weeks ago, but have been unable to get the third disk to finish replacing (c4t0d0).

I have tried the resilver for c4t0d0 four times now and the pool also comes up with checksum errors and a permanent error (<metadata>:<0x0>). The first resilver was from 'zpool replace', which came up with checksum errors. I cleared the errors which triggered the second resilver (same result). I then did a 'zpool scrub' which started the third resilver and also identified three permanent errors (the two additional were in files in snapshots which I then destroyed). I then did a 'zpool clear' and then another scrub which started the fourth resilver attempt. This last attempt identified another file with errors in a snapshot that I have now destroyed.

Any ideas how to get this disk finished being replaced without rebuilding the pool and restoring from backup? The pool is working, but is reporting as degraded and with checksum errors.

Here is what the pool currently looks like:

# zpool status -v pool2
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 33h9m with 4 errors on Thu Sep 16 00:28:14
config:

        NAME              STATE     READ WRITE CKSUM
        pool2             DEGRADED     0     0     8
          raidz2-0        ONLINE       0     0     0
            c0t4d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
            c2t4d0        ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c4t4d0        ONLINE       0     0     0
            c5t4d0        ONLINE       0     0     0
            c2t5d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
            c4t5d0        ONLINE       0     0     0
            c5t5d0        ONLINE       0     0     0
          raidz2-1        DEGRADED     0     0    14
            c0t5d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
            c2t1d0        ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c4t1d0        ONLINE       0     0     0
            c5t1d0        ONLINE       0     0     0
            c2t0d0        ONLINE       0     0     0
            c3t0d0        ONLINE       0     0     0
            replacing-8   DEGRADED     0     0     0
              c4t0d0s0/o  OFFLINE      0     0     0
              c4t0d0      ONLINE       0     0     0  268G resilvered
            c5t0d0        ONLINE       0     0     0
          raidz2-2        ONLINE       0     0     0
            c0t6d0        ONLINE       0     0     0
            c1t6d0        ONLINE       0     0     0
            c2t6d0        ONLINE       0     0     0
            c3t6d0        ONLINE       0     0     0
            c4t6d0        ONLINE       0     0     0
            c5t6d0        ONLINE       0     0     0
            c2t7d0        ONLINE       0     0     0
            c3t7d0        ONLINE       0     0     0
            c4t7d0        ONLINE       0     0     0
            c5t7d0        ONLINE       0     0     0
          raidz2-3        ONLINE       0     0     0
            c0t7d0        ONLINE       0     0     0
            c1t7d0        ONLINE       0     0     0
            c2t3d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0
            c4t3d0        ONLINE       0     0     0
            c5t3d0        ONLINE       0     0     0
            c2t2d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c4t2d0        ONLINE       0     0     0
            c5t2d0        ONLINE       0     0     0
        logs
          mirror-4        ONLINE       0     0     0
            c0t1d0s0      ONLINE       0     0     0
            c1t3d0s0      ONLINE       0     0     0
        cache
          c0t3d0s7        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <0x167a2>:<0x552ed>

(This second file was in a snapshot I destroyed after the resilver completed.)

# zpool list pool2
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH    ALTROOT
pool2  31.8T  13.8T  17.9T  43%  1.65x  DEGRADED  -

The slog is a mirror of two SLC SSDs and the L2ARC is an MLC SSD.

thanks,
Ben
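With a config listing this long, the rows that actually carry checksum errors are easy to miss. The following is a minimal sketch, not part of the original post, of a filter that prints only the rows with a non-zero CKSUM column. The function name `nonzero_cksum` is hypothetical, and it assumes the counters are plain integers as in the output above.

```shell
# Print "NAME CKSUM" for every config row of `zpool status` output
# (read from stdin) whose CKSUM column is non-zero.
# nonzero_cksum is a hypothetical helper name, not a ZFS command.
nonzero_cksum() {
    awk '
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes...]
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
        $5 ~ /^[0-9]+$/ && ($5 + 0) > 0 { print $1, $5 }
    '
}
```

It could be fed live output with `zpool status -v pool2 | nonzero_cksum`, or a saved copy of the status text. For the listing above it would report only `pool2` and `raidz2-1`, since no individual disk shows a checksum count.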
On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmiller at mail.eecis.udel.edu> wrote:
> [...]
>
> Any ideas how to get this disk finished being replaced without rebuilding
> the pool and restoring from backup? The pool is working, but is reporting
> as degraded and with checksum errors.
>
> [...]

Try running `zpool clear pool2` and see if it clears the errors. If not, you may have to detach `c4t0d0s0/o`.

I believe it's a bug that was fixed in recent builds.

-- 
Giovanni Tirloni
gtirloni at sysdroid.com
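The device to detach is the old, OFFLINE half of the `replacing-N` vdev. As a rough sketch (the helper name `stale_replacing_devs` is an assumption, and it relies on the two children being listed directly under the `replacing-N` row, as in the status output posted in this thread), the name can be pulled out of the status text like this:

```shell
# Print the OFFLINE child of each "replacing-N" vdev found in
# `zpool status` output read from stdin.
# stale_replacing_devs is a hypothetical helper name.
stale_replacing_devs() {
    awk '
        # A replacing vdev is followed by its two children (old, new).
        $1 ~ /^replacing-[0-9]+$/ { n = 2; next }
        n > 0 {
            if ($2 == "OFFLINE") print $1
            n--
        }
    '
}
```

Each name it prints is a candidate argument for `zpool detach pool2 <name>`. It is worth checking the name against the full status output by eye before detaching anything.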
On 09/20/10 10:45 AM, Giovanni Tirloni wrote:
> Try to run a `zpool clear pool2` and see if clears the errors. If not, you
> may have to detach `c4t0d0s0/o`.
>
> I believe it's a bug that was fixed in recent builds.

I had tried a clear a few times with no luck. I just did a detach and that did remove the old disk and has now triggered another resilver which hopefully works. I had tried a remove rather than a detach before, but that doesn't work on raidz2...

thanks,
Ben
On 09/21/10 09:16 AM, Ben Miller wrote:
> I had tried a clear a few times with no luck. I just did a detach and that
> did remove the old disk and has now triggered another resilver which
> hopefully works. I had tried a remove rather than a detach before, but that
> doesn't work on raidz2...

I made some progress. That resilver completed with 4 errors. I cleared those and still had the one error "<metadata>:<0x0>" so I started a scrub. The scrub restarted the resilver on c4t0d0 again though! There currently are no errors anyway, but the resilver will be running for the next day+. Is this another bug or will doing a scrub eventually lead to a scrub of the pool instead of the resilver?

Ben
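Since a scrub request can end up as yet another resilver, it may help to check what the pool is actually doing before and after issuing `zpool scrub`. This is only a sketch: the helper name `scrub_state` is hypothetical, the "resilver completed" wording is taken from the status output earlier in the thread, and the "resilver in progress" wording is an assumption about how this build reports a running resilver. It reports state only and does nothing to stop the resilver from restarting.

```shell
# Classify the scrub/resilver state from `zpool status` output on stdin.
# Prints one of: resilvering, resilver-done, other.
# scrub_state is a hypothetical helper name.
scrub_state() {
    awk '
        /resilver in progress/ { state = "resilvering" }
        /resilver completed/   { state = "resilver-done" }
        END { print (state == "" ? "other" : state) }
    '
}
```

A typical check would be `zpool status pool2 | scrub_state`, run a few minutes after requesting the scrub, to see whether a resilver was started instead.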
Ben Miller
2010-Sep-30 19:00 UTC
[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes
On 09/22/10 04:27 PM, Ben Miller wrote:
> I made some progress. That resilver completed with 4 errors. I cleared
> those and still had the one error "<metadata>:<0x0>" so I started a scrub.
> The scrub restarted the resilver on c4t0d0 again though! There currently
> are no errors anyway, but the resilver will be running for the next day+.
> Is this another bug or will doing a scrub eventually lead to a scrub of the
> pool instead of the resilver?
>
> Ben

Well, not much progress. The one permanent error "<metadata>:<0x0>" came back, and the disk keeps wanting to resilver when I try to do a scrub. Now after the last resilver I have more checksum errors on the pool, but not on any disks:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0    37
        ...
          raidz2-1  ONLINE       0     0    74

All other checksum totals are 0. So three problems:

1. How to get the disk to stop resilvering?

2. How do you get checksum errors on the pool, but no disk is identified? If I clear them and let the resilver go again more checksum errors appear. So how to get rid of these errors?

3. How to get rid of the metadata:0x0 error?

I'm currently destroying old snapshots (though that bug was fixed quite a while ago and I'm running b134). I can try unmounting filesystems and remounting next (all are currently mounted). I can also schedule a reboot for next week if anyone thinks that would help.

thanks,
Ben
Victor Latushkin
2010-Oct-01 14:17 UTC
[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes
On Sep 30, 2010, at 11:00 PM, Ben Miller wrote:
> Well not much progress. The one permanent error "<metadata>:<0x0>" came back. And the disk keeps wanting to resilver when trying to do a scrub. Now after the last resilver I have more checksum errors on the pool, but not on any disks:
>   NAME       STATE  READ WRITE CKSUM
>   pool2      ONLINE    0     0    37
>   ...
>   raidz2-1   ONLINE    0     0    74
>
> All other checksum totals are 0. So three problems:
> 1. How to get the disk to stop resilvering?

This is a known bug which is fixed in build 135:

6887372 DTLs not cleared after resilver if permanent errors present

> 2. How do you get checksum errors on the pool, but no disk is identified? If I clear them and let the resilver go again more checksum errors appear. So how to get rid of these errors?

It may not be possible to determine which disk(s) are responsible for the errors; in that case you'll see a 0 counter at the disk level and a non-zero counter at the raidz level. It may mean that there are more errors than your raidz can recover from, or that data was corrupted in RAM after checksumming but before writing... Check your FMA data for any signs of disk issues.

> 3. How to get rid of the metadata:0x0 error?
>
> I'm currently destroying old snapshots (though that bug was fixed quite awhile ago and I'm running b134). I can try unmounting filesystems and remounting next (all are currently mounted). I can also schedule a reboot for next week if anyone things that would help.

This is an error in the metadata, and the only way to get rid of it is to recreate your pool.

Regards
Victor
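For the "check your FMA data" step, the error log can be listed with `fmdump -e`, and `fmadm faulty` shows any diagnosed faults. The sketch below is an illustration rather than anything posted in the thread: it tallies error reports by class so that disk or transport ereports stand out from ZFS checksum ereports. The helper name `ereport_counts` is hypothetical, and it assumes each data row of the listing ends with the ereport class name.

```shell
# Count FMA error reports per class from `fmdump -e` style output on stdin.
# Assumes the last field of each data row is the ereport class.
# ereport_counts is a hypothetical helper name.
ereport_counts() {
    awk '
        $NF ~ /^ereport\./ { count[$NF]++ }
        END { for (c in count) print count[c], c }
    ' | sort -k2
}
```

Usage would be along the lines of `fmdump -e | ereport_counts`. A large number of I/O or transport class reports against one device would point at hardware, while checksum-only reports leave the RAM-corruption possibility mentioned above open.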