Scott L. Burson
2009-Jan-23 02:08 UTC
[zfs-discuss] Bug report: disk replacement confusion
This is in snv_86. I have a four-drive raidz pool. One of the drives died. I replaced it, but wasn't careful to put the new drive on the same controller port; one of the existing drives wound up on the port that had previously been used by the failed drive, and the new drive wound up on the port previously used by that drive.

I powered up and booted, and ZFS started a resilver automatically, but the pool status was confused. It looked like this, even after the resilver completed:

    NAME          STATE     READ WRITE CKSUM
    pool0         DEGRADED     0     0     0
      raidz1      DEGRADED     0     0     0
        c5t1d0    FAULTED      0     0     0  too many errors
        c5t1d0    ONLINE       0     0     0
        c5t3d0    ONLINE       0     0     0
        c5t0d0    ONLINE       0     0     0

Doing 'zpool clear' just changed the "too many errors" to "corrupted data".

I then tried 'zpool replace pool0 c5t1d0 c5t2d0' to see if that would straighten things out (hoping it wouldn't screw things up any further!). It started another resilver, during which the status looked like this:

    NAME            STATE     READ WRITE CKSUM
    pool0           DEGRADED     0     0     0
      raidz1        DEGRADED     0     0     0
        replacing   DEGRADED     0     0     0
          c5t1d0    FAULTED      0     0     0  corrupted data
          c5t2d0    ONLINE       0     0     0
        c5t1d0      ONLINE       0     0     0
        c5t3d0      ONLINE       0     0     0
        c5t0d0      ONLINE       0     0     0

Maybe this will work, but -- doesn't ZFS put unique IDs on the drives so it can track them in case they wind up on different ports? If so, seems like it needs to back-map that information to the device names when mounting. Or something :)

-- Scott
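As an aside on that last question: ZFS does store a unique GUID in each device's vdev label, and an export/import cycle makes it re-scan all devices and rebind pool members by GUID rather than by cached device path. A minimal way to inspect the labels and clean up stale path bindings, assuming the pool can be taken offline briefly (c5t1d0s0 here is just the slice holding the label on this particular system):

    zdb -l /dev/dsk/c5t1d0s0     # dump the vdev label; look for the 'guid' and 'path' fields
    zpool export pool0           # release the pool
    zpool import pool0           # re-scan devices; members are matched by GUID, not by old path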
On Thu, 22 Jan 2009, Scott L. Burson wrote:

> Maybe this will work, but -- doesn't ZFS put unique IDs on the drives so
> it can track them in case they wind up on different ports? If so, seems
> like it needs to back-map that information to the device names when
> mounting. Or something :)

Did your resilver complete successfully? I had a similar problem: the array was showing thousands of write errors against the missing drive, and the resilver of the new drive would basically reset after a few minutes. And you can't cancel a replacement for a nonexistent drive in this case. I ended up having to create a sparse device labeled with the proper guid; only then could I remove one of the devices and properly initiate a replacement. Similar to bug id #6782540...
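For anyone hitting the same wall, a rough outline of that workaround, with heavy caveats: mkfile -n creates a sparse file cheaply, but getting a vdev label carrying the phantom device's GUID onto that file is the hard part, and there is no standard one-liner for it; it apparently took manual label surgery in the case above. The size and path below are placeholders:

    zdb -l /dev/dsk/c5t0d0s0             # read a surviving disk's labels to find
                                         # the guid of the missing/phantom vdev
    mkfile -n 500g /var/tmp/fakedisk     # sparse file at least as large as the dead disk
    # ... write a label carrying the phantom vdev's guid onto /var/tmp/fakedisk ...
    zpool replace pool0 /var/tmp/fakedisk c5t2d0   # then swap the stand-in for the real disk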
Scott L. Burson
2009-Jan-23 05:10 UTC
[zfs-discuss] Bug report: disk replacement confusion
Well, the second resilver finished, and everything looks okay now. Doing one more scrub to be sure...

-- Scott
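For completeness, that verification step is just a scrub followed by a status check once it finishes:

    zpool scrub pool0
    zpool status -v pool0    # expect "errors: No known data errors" once the scrub completes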
Scott L. Burson
2009-Jan-23 18:56 UTC
[zfs-discuss] Bug report: disk replacement confusion
Yes, everything seems to be fine, but that was still scary, and the fix was not completely obvious. At the very least, I would suggest adding text such as the following to the page at http://www.sun.com/msg/ZFS-8000-FD :

    When physically replacing the failed device, it is best to use the same
    controller port, so that the new device will have the same device name
    as the failed one.

-- Scott
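With that advice followed, the replacement collapses to the simple one-device form of zpool replace, since the new disk inherits the old device name. A sketch using this thread's pool and device names (the offline step is unnecessary if the disk is already FAULTED):

    zpool offline pool0 c5t1d0    # take the failed disk out of service
    # physically swap the disk, reusing the same controller port
    zpool replace pool0 c5t1d0    # one-argument form: new disk, same device name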