David Smith
2006-Sep-10 05:28 UTC
[zfs-discuss] Corrupted LUN in RAIDZ group -- How to repair?
Background: We have a ZFS pool set up from LUNs presented by a SAN-connected StorageTek/Engenio Flexline 380 storage system. Just this past Friday the storage environment went down, causing the system to go down.

After looking at the storage environment, we had several volume groups which needed to be carefully put back together to prevent corruption. Unfortunately, one of the volume groups, and the volumes/LUNs coming from it, got corrupted. Since our ZFS pool is set up to take only one LUN from each volume group, we basically ended up with a single-disk loss in our RAIDZ group, so I believe we should be able to recover from this.

My question is how to replace this disk (LUN). The LUN itself is okay again, but the data on the LUN is not.

I have tried to do a zpool replace, but ZFS seems to know that the disk/LUN is the same device. Using -f (force) didn't work either. How does one replace a LUN with ZFS?

I'm currently running a scrub, but I don't know if that will help.

I first had only read errors on a LUN in the raidz group, but just tonight I noticed that I now have a checksum error on another LUN as well (see the zpool status output below).

Below is the zpool status -x output. Can anyone advise how to recover from this?

# zpool status -x
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 66.00% done, 10h45m to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        mypool                                      ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011730E000066C544C5EBB8d0  ONLINE       0     0     0
            c10t600A0B800011730E000066CA44C5EBEAd0  ONLINE       0     0     0
            c10t600A0B800011730E000066CF44C5EC1Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D444C5EC5Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D944C5ECA0d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C144C5ECDFd0  ONLINE       0     0     0
            c10t600A0B800011730E000066E244C5ED2Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C644C5ED87d0  ONLINE       0     0     0
            c10t600A0B800011730E000066EB44C5EDD8d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CB44C5EE29d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F444C5EE7Ed0  ONLINE       0     0     9
            c10t600A0B800011652E0000E5D044C5EEC9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FD44C5EF1Ad0  ONLINE      50     0     0
            c10t600A0B800011652E0000E5D544C5EF63d0  ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011652E0000E5B844C5EBCBd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BA44C5EBF5d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BC44C5EC2Dd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BE44C5EC6Bd0  ONLINE       0     0     0
            c10t600A0B800011730E000066DB44C5ECB4d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C344C5ECF9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E444C5ED5Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C844C5EDA1d0  ONLINE       0     0     0
            c10t600A0B800011730E000066ED44C5EDFAd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CD44C5EE47d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F644C5EE96d0  ONLINE       0     0     6
            c10t600A0B800011652E0000E5D244C5EEE7d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FF44C5EF32d0  ONLINE      70     0     0
            c10t600A0B800011652E0000E5D744C5EF7Fd0  ONLINE       0     0     0

This system is at Solaris 10, U2.

Thank you,

David
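For reference, the attempts described above would look roughly like this on Solaris 10 (a sketch only; the post does not say which LUN was tried, so the device name below is just one of the error-reporting LUNs from the status output, and with no second device argument zpool replace asks ZFS to rebuild onto the same LUN):

  # try to resilver the corrupted LUN in place
  # (device name picked from the output above as an example)
  zpool replace mypool c10t600A0B800011730E000066FD44C5EF1Ad0

  # the same, forcing past the usual "appears to be in use" check
  zpool replace -f mypool c10t600A0B800011730E000066FD44C5EF1Ad0

  # walk every allocated block and repair bad copies from raidz parity
  zpool scrub mypool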
James Dickens
2006-Sep-10 08:12 UTC
[zfs-discuss] Corrupted LUN in RAIDZ group -- How to repair?
On 9/10/06, David Smith <smith107 at llnl.gov> wrote:
> Background: We have a ZFS pool set up from LUNs presented by a SAN-connected
> StorageTek/Engenio Flexline 380 storage system. Just this past Friday the
> storage environment went down, causing the system to go down.
> [...]
> # zpool status -x
>   pool: mypool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.

Okay, ZFS noticed that there was an error, and it is now trying to fix it.

> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub in progress, 66.00% done, 10h45m to go

It's scrubbing the drives and repairing bad data; in 10 hours and 45 minutes
it should be done.

> config:
>
>         NAME                                        STATE     READ WRITE CKSUM
>         mypool                                      ONLINE       0     0     0
>           raidz                                     ONLINE       0     0     0

Since everything is still online and ready and there are no errors, no further
action should be required. After the scrub is done, should those messages
change, you can use zpool replace or come back here and ask for more help;
at this time there is nothing to worry about.
James Dickens
uadmin.blogspot.com
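A small sketch of the follow-up James is suggesting (watch the scrub, then look at the status again once it finishes). The fmdump line is an aside, not something mentioned in the thread, but it is where Solaris 10 FMA records the underlying device error reports:

  # watch the scrub line until it reports completion
  zpool status mypool | grep scrub

  # once done, show only the pools that still report problems
  zpool status -x

  # optional: dump the FMA error telemetry for device-level detail
  fmdump -eV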
David Smith
2006-Sep-10 14:58 UTC
[zfs-discuss] Re: Corrupted LUN in RAIDZ group -- How to repair?
James,

Thanks for the reply. It looks like the scrub has now completed. Should I now clear these warnings?

bash-3.00# zpool status -x
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Sun Sep 10 07:44:36 2006
config:

        NAME                                        STATE     READ WRITE CKSUM
        mypool                                      ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011730E000066C544C5EBB8d0  ONLINE       0     0     0
            c10t600A0B800011730E000066CA44C5EBEAd0  ONLINE       0     0     0
            c10t600A0B800011730E000066CF44C5EC1Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D444C5EC5Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D944C5ECA0d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C144C5ECDFd0  ONLINE       0     0     0
            c10t600A0B800011730E000066E244C5ED2Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C644C5ED87d0  ONLINE       0     0     0
            c10t600A0B800011730E000066EB44C5EDD8d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CB44C5EE29d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F444C5EE7Ed0  ONLINE       0     0    15
            c10t600A0B800011652E0000E5D044C5EEC9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FD44C5EF1Ad0  ONLINE      50     0     0
            c10t600A0B800011652E0000E5D544C5EF63d0  ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011652E0000E5B844C5EBCBd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BA44C5EBF5d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BC44C5EC2Dd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BE44C5EC6Bd0  ONLINE       0     0     0
            c10t600A0B800011730E000066DB44C5ECB4d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C344C5ECF9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E444C5ED5Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C844C5EDA1d0  ONLINE       0     0     0
            c10t600A0B800011730E000066ED44C5EDFAd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CD44C5EE47d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F644C5EE96d0  ONLINE       0     0    14
            c10t600A0B800011652E0000E5D244C5EEE7d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FF44C5EF32d0  ONLINE      70     0     0
            c10t600A0B800011652E0000E5D744C5EF7Fd0  ONLINE       0     0     0

David
Jeff Bonwick
2006-Sep-10 21:41 UTC
[zfs-discuss] Re: Corrupted LUN in RAIDZ group -- How to repair?
> It looks like the scrub has now completed. Should I now clear these warnings?

Yep. You survived the Unfortunate Event unscathed. You're golden.

Jeff
David Smith
2006-Sep-14 15:09 UTC
[zfs-discuss] Re: Re: Corrupted LUN in RAIDZ group -- How to repair?
I have run zpool scrub again, and I now see checksum errors again. Wouldn't the checksum errors have been fixed by the first zpool scrub?

Can anyone recommend what actions I should take at this point?

Thanks,

David
Bill Moore
2006-Sep-14 20:55 UTC
[zfs-discuss] Re: Re: Corrupted LUN in RAIDZ group -- How to repair?
On Thu, Sep 14, 2006 at 08:09:07AM -0700, David Smith wrote:
> I have run zpool scrub again, and I now see checksum errors again.
> Wouldn't the checksum errors have been fixed by the first zpool scrub?
>
> Can anyone recommend what actions I should take at this point?

After running the first scrub, did you run "zpool clear <pool>" to zero
out the error counts?  If not, you will still be seeing the error counts
from the first scrub.  Could you send the output of "zpool status -v"?


--Bill
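The point behind Bill's question is that the per-device READ/WRITE/CKSUM counters are cumulative until explicitly reset; a scrub by itself never zeroes them. A sketch of the sequence being suggested:

  zpool clear mypool        # zero the counters left over from the first scrub
  zpool scrub mypool        # re-read and verify every allocated block
  zpool status -v mypool    # any non-zero counts now came from the new scrub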
David W. Smith
2006-Sep-14 21:08 UTC
[zfs-discuss] Re: Re: Corrupted LUN in RAIDZ group -- How to repair?
On Thu, 2006-09-14 at 13:55 -0700, Bill Moore wrote:
> On Thu, Sep 14, 2006 at 08:09:07AM -0700, David Smith wrote:
> > I have run zpool scrub again, and I now see checksum errors again.
> > Wouldn't the checksum errors have been fixed by the first zpool scrub?
> >
> > Can anyone recommend what actions I should take at this point?
>
> After running the first scrub, did you run "zpool clear <pool>" to zero
> out the error counts?  If not, you will still be seeing the error counts
> from the first scrub.  Could you send the output of "zpool status -v"?
>
> --Bill

Bill,

Yes, I cleared the errors after the first scrub. Here is the output (pool name changed to protect the innocent):

bash-3.00# zpool status -x
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Thu Sep 14 11:30:18 2006
config:

        NAME                                        STATE     READ WRITE CKSUM
        mypool                                      ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011730E000066C544C5EBB8d0  ONLINE       0     0     0
            c10t600A0B800011730E000066CA44C5EBEAd0  ONLINE       0     0     0
            c10t600A0B800011730E000066CF44C5EC1Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D444C5EC5Cd0  ONLINE       0     0     0
            c10t600A0B800011730E000066D944C5ECA0d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C144C5ECDFd0  ONLINE       0     0     0
            c10t600A0B800011730E000066E244C5ED2Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C644C5ED87d0  ONLINE       0     0     0
            c10t600A0B800011730E000066EB44C5EDD8d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CB44C5EE29d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F444C5EE7Ed0  ONLINE       0     0    13
            c10t600A0B800011652E0000E5D044C5EEC9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FD44C5EF1Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5D544C5EF63d0  ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011652E0000E5B844C5EBCBd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BA44C5EBF5d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BC44C5EC2Dd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BE44C5EC6Bd0  ONLINE       0     0     0
            c10t600A0B800011730E000066DB44C5ECB4d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C344C5ECF9d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E444C5ED5Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C844C5EDA1d0  ONLINE       0     0     0
            c10t600A0B800011730E000066ED44C5EDFAd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CD44C5EE47d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F644C5EE96d0  ONLINE       0     0    16
            c10t600A0B800011652E0000E5D244C5EEE7d0  ONLINE       0     0     0
            c10t600A0B800011730E000066FF44C5EF32d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5D744C5EF7Fd0  ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011730E000066C844C5EBD8d0  ONLINE       0     0     0
            c10t600A0B800011730E000066CD44C5EC02d0  ONLINE       0     0     0
            c10t600A0B800011730E000066D244C5EC40d0  ONLINE       0     0     0
            c10t600A0B800011730E000066D744C5EC7Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C044C5ECC1d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E044C5ED0Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C544C5ED67d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E944C5EDB4d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CA44C5EE09d0  ONLINE       0     0     0
            c10t600A0B800011730E000066F244C5EE5Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CF44C5EEA7d0  ONLINE       0     0    13
            c10t600A0B800011730E000066FB44C5EEFAd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5D444C5EF3Fd0  ONLINE       0     0     0
            c10t600A0B800011730E0000670444C5EF92d0  ONLINE       0     0     0
          raidz                                     ONLINE       0     0     0
            c10t600A0B800011652E0000E5B944C5EBDDd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BB44C5EC0Dd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BD44C5EC4Bd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5BF44C5EC8Dd0  ONLINE       0     0     0
            c10t600A0B800011730E000066DD44C5ECD0d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C444C5ED19d0  ONLINE       0     0     0
            c10t600A0B800011730E000066E644C5ED7Ad0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5C944C5EDC7d0  ONLINE       0     0     0
            c10t600A0B800011730E000066EF44C5EE1Cd0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5CE44C5EE6Bd0  ONLINE       0     0     0
            c10t600A0B800011730E000066F844C5EEBAd0  ONLINE       0     0    16
            c10t600A0B800011652E0000E5D344C5EF07d0  ONLINE       0     0     0
            c10t600A0B800011730E0000670144C5EF52d0  ONLINE       0     0     0
            c10t600A0B800011652E0000E5D844C5EFA3d0  ONLINE       0     0     0

errors: No known data errors

Thanks,

David
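Since a few of the same LUNs keep accumulating checksum errors even after a clear and a fresh scrub, the likely next step (not shown in the thread) would be to move one of them onto a newly presented LUN. A sketch, where the replacement device name is only a placeholder:

  # replace a repeat offender with a fresh LUN (second device name is hypothetical)
  zpool replace mypool c10t600A0B800011730E000066F444C5EE7Ed0 c10t<new-LUN-WWN>d0

  # resilver progress then shows up in the status output
  zpool status mypool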