m.roth at 5-cent.us
2015-Jul-10 16:47 UTC
[CentOS] OT, hardware: HP smart array drive issue
Hi. Anyone working with these things? I've got a drive in "predictive failure" in a RAID5. Now here's the thing: there was an issue yesterday when I got in, and I wound up power cycling the RAID. On the first boot of the attached server, it said the controller had a failure and a drive had failed, and it wouldn't continue booting. When I gave it the three-finger salute, this time on the way up, during POST, it noted the controller issue... but the thing came up, looking like it did a couple of days ago.

Trying to prevent this from happening again, I've decided to replace the drive that's in predictive failure. The array has a hot spare. I tried to remove the drive using hpacucli, but it refuses with "operation not permitted", and there doesn't *seem* to be a "mark as failed" command. *Do* I just yank the drive?

mark
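(For anyone following along, this is roughly how I've been poking at it — the controller slot and the port:box:bay drive address below are just examples; substitute whatever your own `show config` output reports:)

```shell
# Dump the full controller configuration, which flags any drives in
# "Predictive Failure" state (controller slot numbers are examples).
hpacucli ctrl all show config

# Show the status of every physical drive on the controller in slot 0.
hpacucli ctrl slot=0 pd all show status

# Full detail on one suspect drive; the address 1I:1:3 is a placeholder,
# use the real address from the output above.
hpacucli ctrl slot=0 pd 1I:1:3 show detail

# Blink the drive's locate LED so the right disk gets pulled from the cage.
hpacucli ctrl slot=0 pd 1I:1:3 modify led=on
```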
On Jul 10, 2015, at 10:47, m.roth at 5-cent.us wrote:

> Trying to prevent this from happening again, I've decided to replace the
> drive that's in predictive failure. The array has a hot spare. I tried to
> remove, using hpacucli, it refuses "operation not permitted", and there
> doesn't *seem* to be a "mark as failed" command. *Do* I just yank the
> drive?

Hi Mark,

I've never had any problem just pulling and replacing drives on HP hardware with the hardware RAID controllers (even the icky cheap one that came out around the DL360/380 Gen 8 timeframe, that isn't really hardware RAID and needs closed drivers in Linux). That said, I also *test it*, long before putting anything important on them...

From past experience with HP stuff, it usually won't move the data over to the hot spare (especially if it's a "Global" hot spare and not specific to that array) until an actual failure occurs. "Predictive failure" isn't considered a failure in HP's world. I don't think there is any setting to tell the controller to move to the hot spare if there's a "predictive failure".

I've also had disks that triggered a "predictive failure" under heavy load that were simply popped out and back in, and the controller rebuilt them, and the drive never did it again for *years*. The "predictive failure" error rate is pretty low.

That last one is more a question of policy than anything. How much do you trust it? At one employer the game was to pop out and back in any drive that showed "predictive failure" on HP systems (Dell stuff we handled differently at the time; it was less prone to false alarms, so to speak), and if they did it again "soonish", we'd call for the replacement disk. That's how often the HP controllers did it. In a rather large farm of HP stuff, I popped and replaced an HP drive a week, whenever I happened by the data center.

As for the question of whether you should be able to do it safely or not...
if a hardware RAID controller won't let me yank a physical drive out and shove another one in and rebuild itself back to whatever level of redundancy was defined by me as "nominal" for that system, I don't want it anyway. Look at it this way: if the disk had a catastrophic electronics failure while installed in the array, the array should handle it. Yanking it out is technically nicer than some of the failure modes that can affect the busses on the backplane with shorted electronics. (GRIN)

Just sharing my thoughts... your call. :-) YMMV.

We had a service contract at that place, and a new disk was always just a phone call away at no additional cost. Even with that level of service, we always did the "re-seat it once" thing. We'd log it, and if anyone else saw that same disk flashing the next time they were at the data center (we just looked at the logged ones before doing the "re-seat"), they'd make the phone call and the service company would drop a drive off a few hours later.

--
Nate Duehr
denverpilot at me.com
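P.S. If you want to see what the controller thinks the spare covers before yanking anything, something like this has worked on the Smart Array generations I've dealt with (controller slot, array letter, and drive addresses are examples — adjust to your own config):

```shell
# Show each array with its member drives and assigned spares; a "Global"
# spare shows up under every array it covers.
hpacucli ctrl slot=0 array all show detail

# Assign a drive as a dedicated spare to array A (address is an example).
hpacucli ctrl slot=0 array A add spares=1I:1:4

# After pulling and re-seating (or replacing) a drive, watch the rebuild
# progress here until the drive goes from "Rebuilding" back to "OK".
hpacucli ctrl slot=0 pd all show status
```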