Alessandro Baggi wrote:
> Il 30/01/19 14:02, mark ha scritto:
>> On 01/30/19 03:45, Alessandro Baggi wrote:
>>> Il 29/01/19 20:42, mark ha scritto:
>>>> Alessandro Baggi wrote:
>>>>> Il 29/01/19 18:47, mark ha scritto:
>>>>>> Alessandro Baggi wrote:
>>>>>>> Il 29/01/19 15:03, mark ha scritto:
>>>>>>>
>>>>>>>> I've no idea what happened, but the box I was working on last
>>>>>>>> week has a *second* bad drive. Actually, I'm starting to wonder
>>>>>>>> about that particular hot-swap bay.
>>>>>>>>
>>>>>>>> Anyway, mdadm --detail shows /dev/sdb1 removed. I've added
>>>>>>>> /dev/sdi1... but see both /dev/sdh1 and /dev/sdi1 as spare, and
>>>>>>>> have yet to find a reliable way to make either one active.
>>>>>>>>
>>>>>>>> Actually, I would have expected the Linux RAID to replace a
>>>>>>>> failed one with a spare....
>>>>>>>
>>>>>>> Can you report your raid configuration, like raid level and raid
>>>>>>> devices, and the current status from /proc/mdstat?
>>>>>>
>>>>>> Well, nope. I got to the point of rebooting the system (xfs had the
>>>>>> RAID volume, and wouldn't let go); I also commented out the RAID
>>>>>> volume.
>>>>>>
>>>>>> It's RAID 5, and /dev/sdb *also* appears to have died. If I do
>>>>>>    mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1
>>>>>> mdadm: looking for devices for /dev/md0
>>>>>> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.
>>>>>> mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot -1.
>>>>>> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.
>>>>>> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.
>>>>>> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 4.
>>>>>> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot -1.
>>>>>> mdadm: no uptodate device for slot 1 of /dev/md0
>>>>>> mdadm: added /dev/sde1 to /dev/md0 as 2
>>>>>> mdadm: added /dev/sdf1 to /dev/md0 as 3
>>>>>> mdadm: added /dev/sdg1 to /dev/md0 as 4
>>>>>> mdadm: no uptodate device for slot 5 of /dev/md0
>>>>>> mdadm: added /dev/sdd1 to /dev/md0 as -1
>>>>>> mdadm: added /dev/sdh1 to /dev/md0 as -1
>>>>>> mdadm: added /dev/sdc1 to /dev/md0 as 0
>>>>>> mdadm: /dev/md0 assembled from 4 drives and 2 spares - not enough
>>>>>> to start the array.
>>>>>>
>>>>>> --examine shows me /dev/sdd1 and /dev/sdh1, but that both are
>>>>>> spares.
>>>>>
>>>>> Hi Mark,
>>>>> please post the result from
>>>>>
>>>>> cat /sys/block/md0/md/sync_action
>>>>
>>>> There is none. There is no /dev/md0. mdadm refuses, saying that it's
>>>> lost too many drives.
>>>>
>>>>      mark
>>>
>>> I suppose that your config is 5 drives and 1 spare, with 1 drive
>>> failed. It's strange that your spare was not used for resync. Then you
>>> added a new drive, but it does not start because it marks the new disk
>>> as spare and you have a raid5 with 4 devices and 2 spares.
>>>
>>> First, I hope that you have a backup of all your data; don't run some
>>> exotic command before backing it up. If you can't back up your data,
>>> it's a problem.
>>
>> This is at work. We have automated nightly backups, and I do offline
>> backups of the backups every two weeks.
>>
>>> Have you tried to remove the last added device sdi1, restart the raid,
>>> and force it to start a resync?
>>
>> The thing is, it had one? two? spares when /dev/sdb1 started dying, and
>> it didn't use them.
>>
>>> Have you tried to remove these 2 devices and re-add only the device
>>> that will be useful for resync? Maybe you can set 5 devices for your
>>> raid and not 6; if it works (after resync) you can add your spare
>>> device, growing your raid set.
>>
>> I tried, and that's when I lost it (again), and it refuses to
>> assemble/start the RAID: "not enough devices".
>>
>>> Reading on Google, many users use --zero-superblock before re-adding
>>> the device.
>>
>> I can take one out and re-add, but I think I'm going to have to
>> recreate the RAID again, and again restore from backup.
>>
>>> Other users reassemble the raid using --assume-clean, but I don't know
>>> what effect it will produce.
>
> Hope that someone gives you better help for this.
>
> Update here if you got the solution.

Not that I'm into American football, but I seem to have pulled off what I
understand is called a hail-mary: *without* zeroing the superblocks, I did
a create with all six good drives, excluding /dev/sdb1, and explicitly told
it one spare.

And the array is there, complete with data, with *one* spare, five good
drives, and it's currently rebuilding the spare.

The last resort worked, though we'll see how long.

mark
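For anyone in the same spot: before a last-resort --create over existing
members, it is worth recording what each surviving member's superblock says
about its slot and event count, since a recreate only preserves data if the
level, device order, chunk size, and metadata version match the original
array. A minimal sketch, assuming v1.x metadata and the device names used
in this thread (this is not a transcript of Mark's actual session):

  # Record the metadata of every surviving member before touching anything.
  mdadm --examine /dev/sd[cdefgh]1 > /root/md0-examine-before.txt

  # The fields that matter for reconstructing the --create invocation:
  grep -E 'Array UUID|Device Role|Array State|Events|Chunk Size' \
      /root/md0-examine-before.txt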
Il 30/01/19 16:33, mark ha scritto:
> Not that I'm into American football, but I seem to have pulled off what I
> understand is called a hail-mary: *without* zeroing the superblocks, I
> did a create with all six good drives, excluding /dev/sdb1, and
> explicitly told it one spare.
>
> And the array is there, complete with data, with *one* spare, five good
> drives, and it's currently rebuilding the spare.
>
> The last resort worked, though we'll see how long.
>
> mark

So you have recreated the array without the faulty device?
Alessandro Baggi wrote:
> So you have recreated the array without the faulty device?

Yep.

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=6 /dev/sd[cdefgh]1

It's currently at 2.2% recovered for the extra drive.

mark
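For anyone following along, the rebuild and the resulting layout can be
watched with the usual tools; a short sketch (the mdadm.conf path below is
the customary CentOS location, and regenerating it only matters if the
recreated array ended up with a new UUID):

  # Watch recovery progress and speed.
  cat /proc/mdstat

  # Full view of member states, the rebuilding spare, and the array UUID.
  mdadm --detail /dev/md0

  # Once the array is healthy, record it so it assembles by name at boot.
  mdadm --detail --scan >> /etc/mdadm.conf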