thr3ads.net - CentOS - [CentOS] Race condition with mdadm at boot [still mystifying] [Mar 2011]

Chuck Munro

2011-Mar-11 03:25 UTC

[CentOS] Race condition with mdadm at boot [still mystifying]

This is a bit long-winded, but I wanted to share some info ....

Regarding my earlier message about a possible race condition with mdadm, 
I have been doing all sorts of poking around with the boot process. 
Thanks to a tip from Steven Yellin at Stanford, I found where to add a 
delay in the rc.sysinit script, which invokes mdadm to assemble the arrays.

Unfortunately it didn't help, so it likely wasn't a race condition after all.

However, on close examination of dmesg, I found something very 
interesting.  There were missing 'bind<sd??>' statements for one or the
other hot spare drive (or sometimes both).  These drives are connected 
to the last PHYs in each SATA controller ... in other words they are the 
last devices probed by the driver for a particular controller.  It would 
appear that the drivers are bailing out before managing to enumerate all 
of the partitions on the last drive in a group, and missing partitions 
occur quite randomly.
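
For anyone wanting to repeat this check without reading dmesg by eye, here is a minimal sketch in Python. It assumes the md messages look like "md: bind<sdb1>" and that you supply your own list of expected array members; the sample text and the device names below are hypothetical, not taken from the machine described above.

```python
import re

# Matches md's bind messages, e.g. "md: bind<sdb1>" (assumed format).
BIND_RE = re.compile(r"md: bind<([^>]+)>")


def bound_devices(dmesg_text):
    """Return the set of device names that appear in md bind messages."""
    return set(BIND_RE.findall(dmesg_text))


def missing_binds(dmesg_text, expected):
    """Return the expected members that have no bind message, sorted."""
    return sorted(set(expected) - bound_devices(dmesg_text))


if __name__ == "__main__":
    # Hypothetical sample; real input would be the saved output of dmesg.
    sample = "md: bind<sda1>\nmd: bind<sdb1>\nmd: bind<sda2>\n"
    expected = ["sda1", "sda2", "sdb1", "sdb2"]  # assumed member list
    print(missing_binds(sample, expected))
```

Saving the dmesg output from several boots and running each through missing_binds would show whether the missing partitions really do move around at random.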

So it may or may not be a timing issue between the WD Caviar Black 
drives and both the LSI and Marvell SAS/SATA controller chips.

So, I replaced the two drives (SATA-300) with two faster drives 
(SATA-600) on the off chance they might respond fast enough before the 
drivers move on to other duties.  That didn't help either.

Each group of arrays uses completely different drivers (mptsas and sata_mv) but 
both exhibit the same problem, so I'm mystified as to where the real 
issue lies.  Anyone care to offer suggestions?

Chuck

Les Mikesell

2011-Mar-11 04:36 UTC

On 3/10/11 9:25 PM, Chuck Munro wrote:
> However, on close examination of dmesg, I found something very
> interesting.  There were missing 'bind<sd??>' statements for one or the
> other hot spare drive (or sometimes both).  These drives are connected
> to the last PHYs in each SATA controller ... in other words they are the
> last devices probed by the driver for a particular controller.  It would
> appear that the drivers are bailing out before managing to enumerate all
> of the partitions on the last drive in a group, and missing partitions
> occur quite randomly.
>
> So it may or may not be a timing issue between the WD Caviar Black
> drives and both the LSI and Marvell SAS/SATA controller chips.
I've seen some weirdness in powering up 6 or more SATA drives but never 
completely pinned down whether it was the controller, drive cage, or particular 
drives causing the problem.  But I think my symptom was completely failing to 
detect some drives when certain combinations of disks were installed although 
each would work individually.  Do you have any options about whether they power 
up immediately or wait until accessed?

-- 
   Les Mikesell
    lesmikesell at gmail.com

Chuck Munro

2011-Mar-12 05:28 UTC

On 03/11/2011 09:00 AM, Les Mikesell wrote:
> On 3/10/11 9:25 PM, Chuck Munro wrote:
>> However, on close examination of dmesg, I found something very
>> interesting.  There were missing 'bind<sd??>' statements for one or the
>> other hot spare drive (or sometimes both).  These drives are connected
>> to the last PHYs in each SATA controller ... in other words they are the
>> last devices probed by the driver for a particular controller.  It would
>> appear that the drivers are bailing out before managing to enumerate all
>> of the partitions on the last drive in a group, and missing partitions
>> occur quite randomly.
>>
>> So it may or may not be a timing issue between the WD Caviar Black
>> drives and both the LSI and Marvell SAS/SATA controller chips.
>
> I've seen some weirdness in powering up 6 or more SATA drives but never
> completely pinned down whether it was the controller, drive cage, or
> particular drives causing the problem.  But I think my symptom was
> completely failing to detect some drives when certain combinations of
> disks were installed although each would work individually.  Do you have
> any options about whether they power up immediately or wait until accessed?

That's a good question, one I have experimented with.  I don't have any 
choice as to when the drives are spun up (only on bootup), but I did try 
a controller card which pre-spun and checked the identification of the 
drives before handing off to the BIOS for bootup.  That didn't help.

On the particular Supermicro motherboard I'm using, there is a very long 
delay (10 or 15 sec) between power-on and initiation of visible BIOS 
activity, so all disk drives have ample time to spin up and stabilize. 
The drives' SMART data shows that the average spin-up time is well 
within the BIOS startup delay.  Each drive activity indicator shows that 
they are always probed by the kernel's scsi scan process.

I have since tried a couple of other tricks I found by Googling around 
... setting the kernel parameters 'rootdelay=xx' and 
'scsi_mod.scan=sync'.  These had no effect on the problem.  For some 
unfathomable reason, the last drives in each group of drives have one or 
more random partitions missing, with no 'bind' statement in dmesg. 
Other partitions on those drives are bound normally.  This has been 
tested with at least two known-good replacement drives, with the same 
random results.  On two occasions today, everything worked perfectly, 
but that was unusual.
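
One way to narrow this down is to check whether a missing partition was never enumerated by the kernel at all, or was enumerated but never bound by md. The sketch below compares a list of expected partitions against the contents of /proc/partitions. It assumes the usual four-column layout (major, minor, #blocks, name); the sample text and device names are hypothetical.

```python
def enumerated_names(proc_partitions_text):
    """Return the set of device names listed in /proc/partitions text."""
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields and start with a numeric major number;
        # this skips the header row and the blank line that follows it.
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def not_enumerated(proc_partitions_text, expected):
    """Return the expected partitions the kernel did not list, sorted."""
    return sorted(set(expected) - enumerated_names(proc_partitions_text))


if __name__ == "__main__":
    # Hypothetical sample; real input would be read from /proc/partitions.
    sample = (
        "major minor  #blocks  name\n"
        "\n"
        "   8        0  976762584 sda\n"
        "   8        1     204800 sda1\n"
    )
    print(not_enumerated(sample, ["sda1", "sda2"]))
```

If a partition is present in /proc/partitions but has no 'bind' line in dmesg, the drivers did enumerate it and the problem lies later, at array assembly.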

A friend of mine suggested an ugly hack - connect two 'dummy' unused old
SATA drives to the last port of each controller (I'm using only 6 of 8 
on each).  I wonder if one of those $15 IDE-to-SATA converters would do 
the job (without a drive attached)?  Foolish thought  :-/

Chuck
