m.roth at 5-cent.us
2015-Jul-10 17:49 UTC
[CentOS] OT, hardware: HP smart array drive issue
Jason Warr wrote:
> On July 10, 2015 11:47:09 AM CDT, m.roth at 5-cent.us wrote:
>> Hi. Anyone working with these things? I've got a drive in "predictive
>> failure" in a RAID5. Now here's the thing: there was an issue
>> yesterday when I got in, and I wound up power cycling the RAID;
>> first boot of attached server had issues, and said the controller
>> had a failure, and a drive had failed, and wouldn't continue
>> booting; when I gave it the three-finger salute, this time on the
>> way up, during POST, it noted the controller issue... but the
>> thing came up, looking like it did a couple of days ago.
>>
>> Trying to prevent this from happening again, I've decided to replace
>> the drive that's in predictive failure. The array has a hot spare.
>> I tried to remove, using hpacucli, it refuses "operation not
>> permitted", and there doesn't *seem* to be a "mark as failed"
>> command. *Do* I just yank the drive?
>>
> Yep, just yank it. It should start auto rebuilding on the spare.
>
> If you didn't have a spare you would pull the suspect drive and replace it
> with one of equal or greater capacity and it would auto rebuild as well.
>
> I have a bunch of them at home and have been using them at work for years.

Thanks for your quick reply, Jason. I'm used to LSI/MegaRAID/PERCs, where
you have to fail it, first. Oddity: I had the drive out for more than five
minutes while getting it out of the sled, putting the new one in, oh, and
dusting out the slot (gotta do that for all of them, next maintenance
window), but after I put in the replacement, and used hpacucli to check,
to my surprise it was rebuilding with the replacement, *not* with the
spare.

        mark
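For anyone wanting to script the "used hpacucli to check" step, here is a
minimal Python sketch. It assumes the status text comes from something like
`hpacucli ctrl slot=0 pd all show status` (the slot number is an assumption)
and that each drive is reported on one line as `physicaldrive <id> (...): <status>`.
The sample text, drive IDs and sizes below are made up for illustration and
are not output from the array in this thread; the function names are
hypothetical.

```python
import re

# Assumed line format (illustrative, check against your own controller):
#   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 300 GB): Predictive Failure
DRIVE_LINE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\((.*)\):\s*(.+?)\s*$")


def parse_drive_status(text):
    """Return a list of (drive_id, status) tuples found in the status text."""
    drives = []
    for line in text.splitlines():
        match = DRIVE_LINE.match(line)
        if match:
            drives.append((match.group(1), match.group(3)))
    return drives


def drives_needing_attention(text):
    """Return only the drives whose reported status is something other than OK."""
    return [(drive, status) for drive, status in parse_drive_status(text)
            if status != "OK"]


# Made-up sample text in the assumed format, not real output.
SAMPLE = """
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 300 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 300 GB): Predictive Failure
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 300 GB, spare): OK
"""

if __name__ == "__main__":
    for drive, status in drives_needing_attention(SAMPLE):
        print(f"{drive}: {status}")
```

The parser only reads text, so it can be fed saved output or the result of
running the command through `subprocess`; it does not change anything on the
controller.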
On 7/10/2015 12:49 PM, m.roth at 5-cent.us wrote:
> Thanks for your quick reply, Jason. I'm used to LSI/MegaRAID/PERCs, where
> you have to fail it, first. Oddity: I had the drive out for more than five
> minutes while getting it out of the sled, putting the new one in, oh, and
> dusting out the slot (gotta do that for all of them, next maintenance
> window), but after I put in the replacement, and used hpacucli to check,
> to my surprise it was rebuilding with the replacement, *not* with the
> spare.
>
> mark

It has been a while since I have used a spare, but what might have happened
is that the spare went back to being a spare when the real drive was
replaced. It seems to me that is the default behavior, as a spare can be
attached to more than one RAID group. That way it keeps your physical drive
placement consistent.
On 07/10/2015 10:49 AM, m.roth at 5-cent.us wrote:
> Thanks for your quick reply, Jason. I'm used to LSI/MegaRAID/PERCs, where
> you have to fail it, first. Oddity: I had the drive out for more than five
> minutes while getting it out of the sled, putting the new one in, oh, and
> dusting out the slot (gotta do that for all of them, next maintenance
> window), but after I put in the replacement, and used hpacucli to check,
> to my surprise it was rebuilding with the replacement, *not* with the
> spare.
>

HP's RAID controllers appear to have some logic whereby, if the rebuild to
the spare disk has not yet reached 50% when you insert the replacement, the
controller abandons the rebuild to the spare and rebuilds to the replacement
instead. I don't have any documentation to prove it, but I have observed it
numerous times.

Thomas
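The rule Thomas describes can be written down as a tiny Python sketch. This
is only a model of his observation, not documented HP behavior; the function
name, its parameters and the exact 50% cutoff as a parameter are all
hypothetical.

```python
def rebuild_target(spare_progress_pct, replacement_inserted, threshold_pct=50.0):
    """Model of the observed (undocumented) rule for where the rebuild goes.

    spare_progress_pct   -- how far the rebuild onto the spare has progressed
    replacement_inserted -- whether a replacement drive is now in the bay
    threshold_pct        -- assumed cutoff, taken from the observation above
    """
    if replacement_inserted and spare_progress_pct < threshold_pct:
        # Observed: the rebuild to the spare is abandoned and restarted
        # on the replacement drive.
        return "replacement"
    # Otherwise the rebuild onto the spare carries on.
    return "spare"


if __name__ == "__main__":
    # Roughly mark's case: replacement inserted a few minutes into the rebuild.
    print(rebuild_target(spare_progress_pct=5.0, replacement_inserted=True))
```

Under this model, a replacement inserted only a few minutes into the rebuild
would become the rebuild target, which matches what mark saw.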