Benjamin Smith
2018-Dec-05 19:55 UTC
[CentOS] // RESEND // 7.6: Software RAID1 fails the only meaningful test
(Resend: message didn't show, was my original message too big? Posted one of the output files to a website to see)

The point of RAID1 is to allow for continued uptime in a failure scenario. When I assemble servers with RAID1, I set up two HDDs to mirror each other, and test by booting from each drive individually to verify that it works. For the OS partitions, I use simple partitions and ext4 so it's as simple as possible.

Using the CentOS 7.6 installer (v 1810) I cannot get this test to pass in any way, with or without LVM. Using an older installer (v 1611), it works fine and I am able to boot from either drive, but as soon as I do a yum update it fails.

I think this may be related to, or the same as, the issue reported in "LVM failure after CentOS 7.6 upgrade", since that also involves booting from a degraded RAID1 array.

This is a terrible bug.

See below for some (hopefully) useful output while in recovery mode after a failed boot.

### output of fdisk -l

Disk /dev/sda: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c1fd0

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048   629409791   314703872   fd  Linux raid autodetect
/dev/sda2   *   629409792   839256063   104923136   fd  Linux raid autodetect
/dev/sda3       839256064   944179199    52461568   fd  Linux raid autodetect
/dev/sda4       944179200   976773119    16296960    5  Extended
/dev/sda5       944181248   975654911    15736832   fd  Linux raid autodetect

### output of cat /proc/mdstat

Personalities :
md126 : inactive sda5[0](S)
      15727616 blocks super 1.2

md127 : inactive sda2[0](S)
      104856576 blocks super 1.2

unused devices: <none>

### content of rdsosreport.txt

It's big; see
http://chico.benjamindsmith.com/rdsosreport.txt
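For anyone trying to reproduce this, the mdstat output above can be checked mechanically. What follows is only a minimal sketch, not part of the original report: the function name `inactive_arrays` and the assumption that arrays are named `mdNNN` are illustrative choices, and the sample text is the /proc/mdstat output quoted above.

```python
import re

# Sample taken from the /proc/mdstat output posted in this thread.
SAMPLE = """\
Personalities :
md126 : inactive sda5[0](S)
      15727616 blocks super 1.2
md127 : inactive sda2[0](S)
      104856576 blocks super 1.2
unused devices: <none>
"""


def inactive_arrays(mdstat_text):
    """Return {array_name: [member, ...]} for arrays listed as inactive.

    Assumption: array names look like md0, md126, ... (hypothetical
    simplification; named arrays would need a wider pattern).
    """
    result = {}
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:\s*inactive\s+(.*)$", line.strip())
        if m:
            # Members look like "sda5[0](S)"; keep only the device name.
            members = re.findall(r"(\w+)\[\d+\](?:\(\w\))?", m.group(2))
            result[m.group(1)] = members
    return result


if __name__ == "__main__":
    for name, members in inactive_arrays(SAMPLE).items():
        print(name, "is inactive with members:", ", ".join(members))
```

On a live system the text would come from reading /proc/mdstat instead of the embedded sample.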
Gordon Messmer
2018-Dec-06 04:07 UTC
[CentOS] // RESEND // 7.6: Software RAID1 fails the only meaningful test
On 12/5/18 11:55 AM, Benjamin Smith wrote:
> The point of RAID1 is to allow for continued uptime in a failure scenario.
> When I assemble servers with RAID1, I set up two HDDs to mirror each other,
> and test by booting from each drive individually to verify that it works. For
> the OS partitions, I use simple partitions and ext4 so it's as simple as
> possible.

I used my test system to test RAID failures. It has a two-disk RAID1 mirror. I pulled one drive, waited for the kernel to acknowledge the missing drive, and then rebooted. The system started up normally with just one disk (which was originally sdb).

> ### content of rdsosreport.txt
> It's big; see
> http://chico.benjamindsmith.com/rdsosreport.txt
>
> [    0.000000] localhost kernel: Command line:
> BOOT_IMAGE=/boot/vmlinuz-0-rescue-4456807582104f8ab12eb6411a80b31a
> root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4 ro crashkernel=auto
> rd.md.uuid=ea5fede7:dc339c3b:81817fc4:aba0bd89
> rd.md.uuid=7a90faed:4e5a2b50:9baa8249:21a6c3da rhgb quiet

> /dev/sda1: UUID="6f900f10-d951-2ae7-712c-a5710d8d7316"
> UUID_SUB="541c8849-58bd-8309-96fd-b45faf0d40bb" LABEL="localhost:home"
> TYPE="linux_raid_member"
> /dev/sda2: UUID="ea5fede7-dc33-9c3b-8181-7fc4aba0bd89"
> UUID_SUB="f127cce4-82f6-fa86-6bc5-2c6b8e3f8e7a" LABEL="localhost:root"
> TYPE="linux_raid_member"
> /dev/sda3: UUID="39f40c01-b62c-8434-741d-38ee40c227f9"
> UUID_SUB="18319e88-67c4-94da-e55f-204c37528ece" LABEL="localhost:var"
> TYPE="linux_raid_member"
> /dev/sda5: UUID="7a90faed-4e5a-2b50-9baa-824921a6c3da"
> UUID_SUB="40034140-1c7f-96c9-d4bd-4f4e25577173" LABEL="localhost:swap"
> TYPE="linux_raid_member"

The thing that stands out as odd, to me, is that your kernel command line includes "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4" but that UUID doesn't appear anywhere in the blkid output. It should, as far as I know.

Your root filesystem is in a RAID1 device that includes sda2 as a member. Its UUID is listed as an rd.md.uuid option on the command line, so it should be assembled (incomplete) during boot. But I think your kernel command line should include "root=UUID=f127cce4-82f6-fa86-6bc5-2c6b8e3f8e7a" and not "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4"
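The comparison between the kernel command line and the blkid output can be done mechanically. This is a rough sketch under stated assumptions, not a diagnosis: `norm` and `check_cmdline` are hypothetical helper names, and the sample data is abridged from the rdsosreport excerpt quoted in this thread. One caveat worth keeping in mind: a filesystem UUID that lives inside an md array would normally only show up in blkid once that array is assembled and running, so a missing root UUID while the array is inactive may be a symptom rather than the cause.

```python
import re

# Abridged from the rdsosreport.txt excerpt quoted in this thread.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-0-rescue-4456807582104f8ab12eb6411a80b31a "
    "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4 ro crashkernel=auto "
    "rd.md.uuid=ea5fede7:dc339c3b:81817fc4:aba0bd89 "
    "rd.md.uuid=7a90faed:4e5a2b50:9baa8249:21a6c3da rhgb quiet"
)

BLKID = """\
/dev/sda2: UUID="ea5fede7-dc33-9c3b-8181-7fc4aba0bd89" UUID_SUB="f127cce4-82f6-fa86-6bc5-2c6b8e3f8e7a" LABEL="localhost:root" TYPE="linux_raid_member"
/dev/sda5: UUID="7a90faed-4e5a-2b50-9baa-824921a6c3da" UUID_SUB="40034140-1c7f-96c9-d4bd-4f4e25577173" LABEL="localhost:swap" TYPE="linux_raid_member"
"""


def norm(uuid):
    """Strip '-' and ':' separators so md-style and blkid-style UUIDs compare."""
    return re.sub(r"[-:]", "", uuid).lower()


def check_cmdline(cmdline, blkid_text):
    """Map each root=UUID= / rd.md.uuid= token to whether blkid lists that UUID.

    Only the UUID="..." field is considered; UUID_SUB values are ignored.
    """
    present = {norm(u) for u in re.findall(r'\bUUID="([^"]+)"', blkid_text)}
    report = {}
    for token in cmdline.split():
        for prefix in ("root=UUID=", "rd.md.uuid="):
            if token.startswith(prefix):
                report[token] = norm(token[len(prefix):]) in present
    return report


if __name__ == "__main__":
    for token, found in check_cmdline(CMDLINE, BLKID).items():
        print(("found   " if found else "MISSING ") + token)
```

With the data from this thread, both rd.md.uuid values match a RAID member listed by blkid, while the root= UUID is not listed.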
Benjamin Smith
2018-Dec-08 00:14 UTC
[CentOS] // RESEND // 7.6: Software RAID1 fails the only meaningful test
On Wednesday, December 5, 2018 8:07:02 PM PST Gordon Messmer wrote:
> I used my test system to test RAID failures. It has a two-disk RAID1
> mirror. I pulled one drive, waited for the kernel to acknowledge the
> missing drive, and then rebooted. The system started up normally with
> just one disk (which was originally sdb).

My procedure was to shut down with the system "whole" (both drives working). Then, while it was dark, I removed either disk and started up the server. Regardless of which drive I tried to boot from, the failure was consistent.

> The thing that stands out as odd, to me, is that your kernel command
> line includes "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4" but that
> UUID doesn't appear anywhere in the blkid output. It should, as far as
> I know.

Except that UUID exists when both drives are present. And under an earlier CentOS version, it booted fine on either drive singly with the above procedure, before doing a yum update.

And to clarify my procedure:

1) Set up system with 7.3, RAID1 bare partitions.
2) Wait for mdstat sync to finish.
3) Shut down system.
4) Remove either drive.
5) System boots fine.
6) Resync drives.
7) yum update -y to 7.6
8) Shut down system.
9) Remove either drive.
10) bad putty tat.

> Your root filesystem is in a RAID1 device that includes sda2 as a
> member. Its UUID is listed as an rd.md.uuid option on the command line
> so it should be assembled (incomplete) during boot. But I think your
> kernel command line should include
> "root=UUID=f127cce4-82f6-fa86-6bc5-2c6b8e3f8e7a" and not
> "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4"

Unfortunately, I have used this same system for other tests and no longer have these UUIDs to test further. However, I can reproduce the problem to test further as soon as I have something to test.

I'm going to see if using EXT4 as the file system has any effect.