I got the email that a drive in my 4-drive RAID10 setup failed. What are my
options?
Drives are WD1000FYPS (Western Digital 1 TB 3.5" SATA).
mdadm.conf:
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md/root level=raid10 num-devices=4
UUID=942f512e:2db8dc6c:71667abc:daf408c3
/proc/mdstat:
Personalities : [raid10]
md127 : active raid10 sdf1[2](F) sdg1[3] sde1[1] sdd1[0]
1949480960 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
bitmap: 15/15 pages [60KB], 65536KB chunk
smartctl reports this for sdf:
197 Current_Pending_Sector   0x0012   200   200   000    Old_age   Always    -       1
198 Offline_Uncorrectable    0x0010   200   200   000    Old_age   Offline   -       6
So it's got 6 uncorrectable sectors and 1 sector pending remapping.
Can I clear the error and rebuild? (It's not clear what commands would do
that.) Or should I buy a replacement drive? I'm considering a WDS100T1R0A
(2.5" 1 TB WD Red drive), which Amazon has for $135, plus a 3.5"
adapter.
The system serves primarily as a home mail server (it fetchmails from an
outside VPS serving as my domain's MX) and archival file server.
> Can I clear the error and rebuild? (It's not clear what commands would
> do that.) Or should I buy a replacement drive?

Hi,

mdadm --remove /dev/md127 /dev/sdf1

and then the same with --add should hot-remove and re-add the device.

If it rebuilds fine it may again work for a long time.

Simon
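Spelled out with the array and member names from the /proc/mdstat output
above, the sequence looks roughly like this (a sketch; the --manage long
form is just the equivalent of Simon's one-liners):

# The kernel has already marked sdf1 failed ((F) in /proc/mdstat), so it
# can be pulled from the array directly.
mdadm --manage /dev/md127 --remove /dev/sdf1

# Adding the same partition back re-uses its existing superblock; mdadm
# will report a "re-add", and the write-intent bitmap may shorten the
# resync.
mdadm --manage /dev/md127 --add /dev/sdf1

# Watch the rebuild until the array is back to [UUUU].
cat /proc/mdstat
mdadm --detail /dev/md127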
--On Friday, September 18, 2020 10:53 PM +0200 Simon Matter
<simon.matter at invoca.ch> wrote:

> mdadm --remove /dev/md127 /dev/sdf1
>
> and then the same with --add should hot-remove and re-add the device.
>
> If it rebuilds fine it may again work for a long time.

Thanks. That reminds me: if I need to replace it, is there some easy way to
figure out which drive bay is sdf? It's an old Supermicro rack chassis with
6 drive bays. Perhaps a way to blink the drive light?
On Fri, Sep 18, 2020 at 3:20 PM Kenneth Porter <shiva at sewingwitch.com> wrote:

> Thanks. That reminds me: if I need to replace it, is there some easy way
> to figure out which drive bay is sdf? It's an old Supermicro rack chassis
> with 6 drive bays. Perhaps a way to blink the drive light?

It's easy enough with dd. Be sure it's the drive you want to find, then put

dd if=/dev/sdf of=/dev/null

into a shell, but don't run it. Look at the drives, hit Enter, and watch for
which one lights up. Then ^C while watching, to be sure the light turns off
exactly when you hit it.
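In command form the trick is simply this (assuming the suspect disk is still
visible as /dev/sdf and readable enough to generate activity):

# Continuous reads keep that bay's activity LED lit.
dd if=/dev/sdf of=/dev/null bs=1M
# Watch the chassis for the LED that goes solid, then press Ctrl-C and
# check that the same LED goes idle again.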
--On Friday, September 18, 2020 10:53 PM +0200 Simon Matter
<simon.matter at invoca.ch> wrote:

> mdadm --remove /dev/md127 /dev/sdf1
>
> and then the same with --add should hot-remove and re-add the device.
>
> If it rebuilds fine it may again work for a long time.

This worked like a charm. When I added it back, it told me it was
"re-adding" the drive, so it recognized the drive I'd just removed. I
checked /proc/mdstat and it showed rebuilding. It took about 90 minutes to
finish and is now running fine.
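To double-check the result afterwards, something like this should do (same
device names as above; the attribute names match the smartctl output
earlier in the thread):

# The array should show all four members active again ([UUUU]).
cat /proc/mdstat
mdadm --detail /dev/md127

# Keep watching the suspect drive's sector counts; if they keep climbing,
# replacing the drive is still the safer call.
smartctl -A /dev/sdf | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'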