Mark Hennessy
2008-Apr-17 17:01 UTC
[CentOS] Question about RAID 5 array rebuild with mdadm
I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because two drives became unavailable. After adjusting the cables on several occasions and shutting down and restarting, I was able to see the drives again. This is when I snatched defeat from the jaws of victory. Please, someone with vast knowledge of how RAID 5 with mdadm works, tell me if I have any chance at all that this array will pull through with most or all of my data.

Background info about the machine
/dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sda2
/dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
/dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]

/dev/sdi and /dev/sdj were the drives that detached from the array and were marked as faulty.

I did the following things that in hindsight were probably VERY BAD

Step 1 (Misassign drives to wrong array):
I could probably have had things going again in a tenth of a second if I hadn't typed this:
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --manage --add /dev/md0 /dev/sdi

This clobbered the superblock and replaced it with that of /dev/md0, yes? Well, that's what mdadm --misc --examine /dev/sdi and sdj said anyhow.

Ok, so what next?
Step 2 (rebuild the array but make sure the params are right!):
I wipe out the superblocks on all of the drives in the array and rebuild with --assume-clean
for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj

Ok, now it says that the array is recovering and will take about 10 hours to rebuild.
/dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a spare that's rebuilding.
But now I scroll back in my history and see that oops, the chunk size is WRONG. Not only that, but I don't stop the array until the rebuild is at around 8%.

Ok, I stop the array and rebuild with
mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj

Now it says it's going to take another 10 hours to rebuild.

How likely are my data irretrievable/gone and at what step would it have happened if so?
Mark Hennessy
2008-Apr-17 17:06 UTC
[CentOS] Question about RAID 5 array rebuild with mdadm
Sorry about that, my previous e-mail had just '--chunk' toward the bottom. It should have been '--chunk=256'. Please see the quoted snippet for detail.

On Apr 17, 2008, at 1:01 PM, Mark Hennessy wrote:

> Ok, I stop the array and rebuild with
> mdadm --create /dev/md2 --assume-clean --level=5 --chunk=256 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
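[Editor's note] The chunk size in the --create command above matters because it decides which member disk holds each piece of the filesystem. The sketch below is only an illustration of that mapping, not mdadm's code. It assumes the left-symmetric RAID 5 layout and ignores superblocks and data offsets. The 64 KiB figure is an assumed example of a mismatched chunk size; the thread does not say which value the first --create actually used. The function name locate is hypothetical.

```python
def locate(logical_offset, chunk_size, n_disks):
    """Map a logical byte offset to (member index, byte offset on that member).

    Assumption: left-symmetric RAID 5 layout. Other layouts map differently.
    """
    data_disks = n_disks - 1
    chunk_no = logical_offset // chunk_size
    # Each stripe holds data_disks data chunks plus one parity chunk.
    stripe, data_idx = divmod(chunk_no, data_disks)
    # Parity rotates backwards from the last member, one step per stripe.
    parity_disk = data_disks - (stripe % n_disks)
    # Data chunks start on the member after parity and wrap around.
    member = (parity_disk + 1 + data_idx) % n_disks
    member_offset = stripe * chunk_size + logical_offset % chunk_size
    return member, member_offset


KIB = 1024
offset = 300 * KIB  # an arbitrary logical offset inside the array

# Same logical offset, two chunk sizes, eight members as in /dev/md2.
print("chunk 256 KiB:", locate(offset, 256 * KIB, 8))
print("chunk  64 KiB:", locate(offset, 64 * KIB, 8))
```

Under these assumptions the two calls name different member disks for the same logical offset, so an array created with the wrong chunk size presents the filesystem's blocks in the wrong order even when every disk's contents are untouched.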
Ross S. W. Walker
2008-Apr-17 17:50 UTC
[CentOS] Question about RAID 5 array rebuild with mdadm
Mark Hennessy wrote:
>
> I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because
> two drives became unavailable. After adjusting the cables on several
> occasions and shutting down and restarting, I was able to see the
> drives again. This is when I snatched defeat from the jaws of
> victory. Please, someone with vast knowledge of how RAID 5 with mdadm
> works, tell me if I have any chance at all that this array will pull
> through with most or all of my data.

It may be possible...

> Background info about the machine
> /dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sda2
> /dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
> /dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
>
> /dev/sdi and /dev/sdj were the drives that detached from the array and
> were marked as faulty.
>
> I did the following things that in hindsight were probably VERY BAD
>
> Step 1 (Misassign drives to wrong array):
> I could probably have had things going again in a tenth of a second if
> I hadn't typed this:
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --manage --add /dev/md0 /dev/sdi
>
> This clobbered the superblock and replaced it with that of /dev/md0, yes?
> well, that's what mdadm --misc --examine /dev/sdi and sdj said anyhow.

Hmm, not good, but we will mark this drive 'sdi' as bad.

> Ok, so what next?
> Step 2 (rebuild the array but make sure the params are right!):
> I wipe out the superblocks on all of the drives in the array and
> rebuild with --assume-clean
> for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
> mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj

Nooo, you need to make sure sdi is marked as 'bad' and offline. You are going to need to assemble the array degraded, then add sdi as a replacement and let it rebuild sdi off the parity.

> ok, now it says that the array is recovering and will take about 10
> hours to rebuild.
> /dev/sd[c-i] say that they are "active sync" and
> /dev/sdj says it's a
> spare that's rebuilding.
> But now I scroll back in my history and see that oops, the chunk size
> is WRONG. Not only that, but I don't stop the array until the rebuild
> is at around 8%

Well, now I think it's all messed up.

> Ok, I stop the array and rebuild with
> mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
>
> Now it says it's going to take another 10 hours to rebuild.

It's truly hosed now.

> How likely are my data irretrievable/gone and at what step would it
> have happened if so?

I hope you have backups, because you're going to need them.

If only you had posted to the list BEFORE you tried to recover it without knowing what to do.

-Ross
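[Editor's note] Ross's advice to "rebuild sdi off the parity" relies on how RAID 5 parity works: each stripe stores the XOR of its data chunks, so any one missing chunk can be recomputed from the others. The toy model below is a sketch of that arithmetic only. It is not mdadm's on-disk format, the helper names are hypothetical, and the 4-byte chunks are made-up sample data.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def make_stripe(data_chunks):
    """Return one stripe: the data chunks followed by their parity chunk."""
    return list(data_chunks) + [xor_chunks(data_chunks)]


def rebuild_member(stripe, missing):
    """Recompute the chunk at index `missing` from all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != missing]
    return xor_chunks(survivors)


# Toy stripe: 7 data chunks + 1 parity chunk, matching --raid-devices=8.
data = [bytes([17 * i, i, 255 - i, 3 * i]) for i in range(1, 8)]
stripe = make_stripe(data)

# Lose any single member and its chunk can be recomputed exactly.
for lost in range(len(stripe)):
    assert rebuild_member(stripe, lost) == stripe[lost]

# Lose two members and the survivors only yield the XOR of the two lost
# chunks, not either chunk on its own.
assert xor_chunks(stripe[:6]) == xor_chunks(stripe[6:])
print("single-member rebuild succeeded for all", len(stripe), "members")
```

This is why the array stopped when two drives dropped out at once, and why a degraded start with seven good members followed by adding one replacement is the normal recovery path.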