I got the email that a drive in my 4-drive RAID10 setup failed. What are my
options?

Drives are WD1000FYPS (Western Digital 1 TB 3.5" SATA).

mdadm.conf:

# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md/root level=raid10 num-devices=4 UUID=942f512e:2db8dc6c:71667abc:daf408c3

/proc/mdstat:

Personalities : [raid10]
md127 : active raid10 sdf1[2](F) sdg1[3] sde1[1] sdd1[0]
      1949480960 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      bitmap: 15/15 pages [60KB], 65536KB chunk

smartctl reports this for sdf:

197 Current_Pending_Sector  0x0012  200  200  000  Old_age  Always   -  1
198 Offline_Uncorrectable   0x0010  200  200  000  Old_age  Offline  -  6

So it's got 6 bad blocks, 1 pending for remapping.

Can I clear the error and rebuild? (It's not clear what commands would do
that.) Or should I buy a replacement drive? I'm considering a WDS100T1R0A
(2.5" 1TB red drive), which Amazon has for $135, plus the 3.5" adapter.

The system serves primarily as a home mail server (it fetchmails from an
outside VPS serving as my domain's MX) and archival file server.
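For reference, the output above comes from the usual places, and a fuller
picture of the degraded array can be had the same way (a quick sketch,
assuming the device names shown above):

    cat /proc/mdstat            # array status, as pasted above
    mdadm --detail /dev/md127   # per-member state from md's point of view
    smartctl -a /dev/sdf        # full SMART report for the suspect drive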
> I got the email that a drive in my 4-drive RAID10 setup failed. What are
> my options?
>
> Drives are WD1000FYPS (Western Digital 1 TB 3.5" SATA).
>
> mdadm.conf:
>
> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md/root level=raid10 num-devices=4 UUID=942f512e:2db8dc6c:71667abc:daf408c3
>
> /proc/mdstat:
>
> Personalities : [raid10]
> md127 : active raid10 sdf1[2](F) sdg1[3] sde1[1] sdd1[0]
>       1949480960 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
>       bitmap: 15/15 pages [60KB], 65536KB chunk
>
> smartctl reports this for sdf:
>
> 197 Current_Pending_Sector  0x0012  200  200  000  Old_age  Always   -  1
> 198 Offline_Uncorrectable   0x0010  200  200  000  Old_age  Offline  -  6
>
> So it's got 6 bad blocks, 1 pending for remapping.
>
> Can I clear the error and rebuild? (It's not clear what commands would do
> that.) Or should I buy a replacement drive? I'm considering a WDS100T1R0A
> (2.5" 1TB red drive), which Amazon has for $135, plus the 3.5" adapter.
>
> The system serves primarily as a home mail server (it fetchmails from an
> outside VPS serving as my domain's MX) and archival file server.

Hi,

mdadm --remove /dev/md127 /dev/sdf1

and then the same with --add should hot-remove and re-add the device.

If it rebuilds fine it may again work for a long time.

Simon
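Spelled out, the sequence is roughly this (assuming /dev/md127 and /dev/sdf1
as shown in the mdstat output above):

    # mdstat already shows the member as (F), so this may just confirm it
    mdadm --fail /dev/md127 /dev/sdf1

    # hot-remove the failed member from the array
    mdadm --remove /dev/md127 /dev/sdf1

    # add it back; with the write-intent bitmap the array has, md can treat
    # this as a re-add and only resync the blocks that changed
    mdadm --add /dev/md127 /dev/sdf1

    # watch the rebuild
    cat /proc/mdstat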
--On Friday, September 18, 2020 10:53 PM +0200 Simon Matter
<simon.matter at invoca.ch> wrote:

> mdadm --remove /dev/md127 /dev/sdf1
>
> and then the same with --add should hot-remove and re-add the device.
>
> If it rebuilds fine it may again work for a long time.

Thanks. That reminds me: If I need to replace it, is there some easy way to
figure out which drive bay is sdf? It's an old Supermicro rack chassis with
6 drive bays. Perhaps a way to blink the drive light?
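A couple of possibilities, depending on what the hardware supports (the
ledctl route needs an SGPIO/SES-capable controller and backplane, which an
older chassis may not have; the serial-number route always works if the
trays are labeled):

    # print the model and serial number to match against the tray label
    smartctl -i /dev/sdf

    # or without smartmontools
    udevadm info --query=property --name=/dev/sdf | grep ID_SERIAL

    # if the enclosure supports it, blink the locate LED (ledmon package)
    ledctl locate=/dev/sdf
    ledctl locate_off=/dev/sdf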
On Fri, Sep 18, 2020 at 3:20 PM Kenneth Porter <shiva at sewingwitch.com> wrote:

> Thanks. That reminds me: If I need to replace it, is there some easy way
> to figure out which drive bay is sdf? It's an old Supermicro rack chassis
> with 6 drive bays. Perhaps a way to blink the drive light?

It's easy enough with dd. Be sure it's the drive you want to find, then put

dd if=/dev/sdf of=/dev/null

into a shell, but don't run it. Look at the drives, hit enter, and watch for
which one lights up. Then ^C while watching, to be sure the light turns off
exactly when you hit it.
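A small variation on the same trick, if you'd rather not rely on hitting ^C
at just the right moment (timeout is part of coreutils):

    # read from the suspect drive for 30 seconds, then stop automatically;
    # the bay whose activity LED stays busy for those 30 seconds is sdf
    timeout 30 dd if=/dev/sdf of=/dev/null bs=1M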
--On Friday, September 18, 2020 10:53 PM +0200 Simon Matter
<simon.matter at invoca.ch> wrote:

> mdadm --remove /dev/md127 /dev/sdf1
>
> and then the same with --add should hot-remove and re-add the device.
>
> If it rebuilds fine it may again work for a long time.

This worked like a charm. When I added it back, it told me it was
"re-adding" the drive, so it recognized the drive I'd just removed. I
checked /proc/mdstat and it showed rebuilding. It took about 90 minutes to
finish and is now running fine.
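For the record, the rebuild and the drive can be kept an eye on with
standard tooling (same device names as above assumed):

    # live rebuild progress, with percentage and ETA
    watch -n 5 cat /proc/mdstat

    # once it finishes, confirm all four members are active and in sync
    mdadm --detail /dev/md127

    # and periodically re-check the flaky drive's pending/uncorrectable counts
    smartctl -A /dev/sdf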