thr3ads.net - CentOS - [CentOS] CentOS 7 and Areca ARC-1883I SAS controller: JBOD or not to JBOD? [Jan 2017]


Valeri Galtsev

2017-Jan-20 23:38 UTC

[CentOS] CentOS 7 and Areca ARC-1883I SAS controller: JBOD or not to JBOD?

On Fri, January 20, 2017 5:16 pm, Joseph L. Casale wrote:
>> This is why before configuring and installing everything you may want to
>> attach drives one at a time, and upon boot take a note which physical
>> drive number the controller has for that drive, and definitely label it
>> so you will know which drive to pull when drive failure is reported.
>
> Sorry Valeri, that only works if you're the only guy in the org.
Well, this is true, I'm the only sysadmin working for two departments
here...
>
> In reality, you cannot and should not rely on this given how easily it can
> change and more than likely someone won't update it.
>
> Would you walk up to a production unit in a degraded state and simply pull
> out a drive and risk a production issue? I wouldn't...
I routinely do: I just hot-remove the failed drive from a running production
system and replace it with a good drive (note what I said about my job
above, though). None of our users ever notices. When I do it, the only
chance I usually take is of making a degraded RAID6 (with one drive failed)
even more degraded, so that it is no longer fault tolerant, though still
online with all data on it. But even that chance is slim, given that I take
all precautions when I am initially setting up the box.
>
> You need to assert the position of the drive and prepare it in the array
> controller for removal, then swap, scan, add to virtual disk then initiate
> rebuild.
Hm, I'm not certain what process you describe. Most of my controllers are
3ware and LSI; I just pull the failed drive (and I know the failed physical
drive number), put a good one in its place, and the rebuild starts right
away. I have a couple of Areca ones (I love them too!); I don't remember if
I have to initiate the rebuild manually. (I'm lucky in using good drives -
very careful in choosing good ones ;-).
>
> Not to mention if it's a busy system, confirm that the IO load from the
> rebuild is not having an impact on the application. You may need to lower
> the rate.
Indeed, in the 3ware configuration there is a choice of several grades of
rebuild vs IO priority; I usually choose slower rebuild - faster IO. If I
have only one drive failing on me during a year in a given rack, there is
almost zero chance of a second drive failing for quite some time (we had a
heated discussion about it once, and I still stand by my opinion that drive
failures are independent events). So my degraded RAID-6 can keep running
and even stay redundant ("single redundant", akin to RAID-5) for the
period of the rebuild, even if that takes quite long.
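
As a rough illustration of the independence argument, the sketch below estimates the chance that at least one of the remaining drives fails while the rebuild is running. The 2% annual failure rate, the 11 remaining drives and the 2-day rebuild are made-up illustration values, not figures from this thread, and the constant-rate model is an assumption; correlated failures (same batch, same shelf) would push the real number higher.

```shell
#!/bin/sh
# Chance that at least one remaining drive fails during a rebuild,
# ASSUMING independent failures at a constant rate.
# All inputs below are hypothetical illustration values.
second_failure_probability() {
    afr="$1"          # annual failure rate per drive, e.g. 0.02 for 2%
    remaining="$2"    # drives still in the array during the rebuild
    days="$3"         # rebuild duration in days
    awk -v afr="$afr" -v n="$remaining" -v d="$days" 'BEGIN {
        # One drive survives the window with probability (1 - afr)^(d/365);
        # all n drives survive with that value raised to the power n.
        p = 1 - exp(n * (d / 365) * log(1 - afr))
        printf "%.6f\n", p
    }'
}

second_failure_probability 0.02 11 2
```

With these inputs the result is on the order of 0.1%, which is why a slow rebuild looks acceptable as long as the independence assumption holds.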

Valeri
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++

Cameron Smith

2017-Jan-21 01:00 UTC


Hi Valeri,


Before you pull a drive you should check to make sure that doing so
won't kill the whole array.

MegaCli can help you prevent a storage disaster and can give you more
insight into your RAID and the status of the virtual disks and the disks
that make up each array.

MegaCli will let you see the health and status of each drive. Does it have
media errors, is it in predictive failure mode, what firmware version does
it have etc. MegaCli will also let you see the status of the enclosure, the
adapter and the virtual disks (logical disks).

Before you pull a drive it's a good idea to properly prepare it for removal
after confirming that it's OK to remove it.

Here are a few commands:

OFFLINE A DISK
MegaCli -PDOffline -PhysDrv[32:0] -a0

MARK A DISK AS MISSING
MegaCli -pdmarkmissing -physdrv[32:0] -a0

MARK A DISK AS PREPARED FOR REMOVAL
MegaCli -pdprprmv -physdrv[32:0] -a0

Here are some easy overview commands that I run when first looking at the
storage on a system:
MegaCli -AdpAllInfo -aAll |grep -A 8 "Device Present";
MegaCli -PDList -aALL |grep "Firmware state";
MegaCli -PDList -aALL |grep "Media Error Count";
MegaCli -PDList -aALL |grep "Predictive Failure Count";
MegaCli -PDList -aALL |grep "Inquiry Data";
MegaCli -PDList -aALL |grep "Device Firmware Level";
MegaCli -PDList -aALL |grep "Drive has flagged";
MegaCli -PDList -aALL |grep Temperature;


I also leverage MegaCli from bash scripts on my older Dell 11Gen that I run
in cron.hourly that check the health status of my arrays and email me if
there is an issue.
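
A minimal sketch of what such an hourly check could look like. It is not the actual script described above: the function name `check_pdlist`, the choice of fields to alert on and the `root` mail recipient are assumptions for illustration. The filter reads `MegaCli -PDList -aALL` style output on stdin, so it can be tried against a saved listing first.

```shell
#!/bin/bash
# Print one line per drive that looks unhealthy; print nothing if all is well.
check_pdlist() {
    awk -F': *' '
        /^Enclosure Device ID/      { enc = $2 }
        /^Slot Number/              { slot = $2 }
        /^Media Error Count/        { if ($2 + 0 > 0) print "[" enc ":" slot "] media errors: " $2 }
        /^Predictive Failure Count/ { if ($2 + 0 > 0) print "[" enc ":" slot "] predictive failures: " $2 }
        # Anything that is neither an online member nor a hot spare is reported.
        /^Firmware state/           { if ($2 !~ /^Online/ && $2 !~ /^Hotspare/) print "[" enc ":" slot "] state: " $2 }
    '
}

# Only runs where MegaCli is installed; the recipient is a placeholder.
if command -v MegaCli >/dev/null 2>&1; then
    report=$(MegaCli -PDList -aALL | check_pdlist)
    if [ -n "$report" ]; then
        echo "$report" | mail -s "RAID warning on $(hostname)" root
    fi
fi
```

Dropped into cron.hourly, a script like this stays silent while every drive is online and error-free.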



Cameron Smith
Technical Operations Manager
Network Redux, LLC
Cell:   503-926-4928


Valeri Galtsev

2017-Jan-21 01:16 UTC


On Fri, January 20, 2017 7:00 pm, Cameron Smith wrote:
> Hi Valeri,
>
>
> Before you pull a drive you should check to make sure that doing so
> won't kill the whole array.
Wow! What did I say to make you treat me as an ultimate idiot!? ;-) All my
comments, at least in my own reading, were about things you need to do to
make sure that when you hot-unplug a bad drive, it is indeed the failed
drive you are replacing.

Valeri


Keith Keller

2017-Jan-21 06:16 UTC


On 2017-01-20, Valeri Galtsev <galtsev at kicp.uchicago.edu> wrote:
>
> Hm, not certain what process you describe. Most of my controllers are
> 3ware and LSI, I just pull failed drive (and I know failed physical drive
> number), put good in its place and rebuild starts right away.
I know for sure that LSI's storcli utility supports an identify
operation, which (if the hardware all cooperates) causes the drive's
light to blink.  I'm fairly sure I've used this feature on 3ware
controllers as well.  I use this even when I'm sure of the failed drive
number and am the only sysadmin for these systems, because I don't even
trust my own memory.  :)
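
For reference, a small sketch of the identify step with storcli. Controller 0, enclosure 32 and slot 0 are placeholder values, and the wrapper function `locate_drive` with its `DRY_RUN` switch is a hypothetical convenience, not part of storcli. By default it only prints the command so the address can be double-checked first.

```shell
#!/bin/sh
# Blink (or stop blinking) the locate LED of one drive before pulling it.
locate_drive() {
    action="$1"; ctrl="$2"; enc="$3"; slot="$4"   # action: start | stop
    cmd="storcli /c${ctrl}/e${enc}/s${slot} ${action} locate"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"        # default: only show what would be run
    else
        $cmd               # DRY_RUN=0: actually talk to the controller
    fi
}

locate_drive start 0 32 0
```

Running it again with `stop` in place of `start` turns the LED off once the swap is done.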

This is one reason I prefer RAID6 over RAID5: if you have one failed
drive in your array, and you pull the wrong one, your RAID5 is now gone,
but your RAID6 is still functional.  The odds are with you in a RAID10
but you could get unlucky.  (Not that you want to rebuild two drives at
the same time but it's still better than losing the array.)
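
The comparison above can be put into a few lines. This is only a sketch under stated assumptions: one drive has already failed, the wrong drive is picked uniformly at random from the remaining ones, and the RAID10 is built from two-way mirrors. The function name and the 8-drive example are made up for illustration.

```shell
#!/bin/sh
# One drive has already failed; what happens if a second, healthy drive is pulled?
wrong_pull_outcome() {
    level="$1"; drives="$2"
    case "$level" in
        raid5)  echo "lost" ;;                          # tolerates 1 missing drive, now 2 are missing
        raid6)  echo "degraded, no redundancy left" ;;  # tolerates 2 missing drives
        raid10) # lost only if the pulled drive is the mirror partner of the failed one
            awk -v n="$drives" 'BEGIN { printf "lost with probability %.3f\n", 1 / (n - 1) }' ;;
        *)      echo "unknown level" ;;
    esac
}

wrong_pull_outcome raid10 8
```

For an 8-drive RAID10 only one of the seven remaining drives is the fatal one, so the odds are good but not zero, while RAID6 survives any single wrong pull.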

--keith

-- 
kkeller at wombat.san-francisco.ca.us

Valeri Galtsev

2017-Jan-21 17:02 UTC


On Sat, January 21, 2017 12:16 am, Keith Keller wrote:
> On 2017-01-20, Valeri Galtsev <galtsev at kicp.uchicago.edu> wrote:
>>
>> Hm, not certain what process you describe. Most of my controllers are
>> 3ware and LSI, I just pull failed drive (and I know failed physical
>> drive number), put good in its place and rebuild starts right away.
>
> I know for sure that LSI's storcli utility supports an identify
> operation, which (if the hardware all cooperates) causes the drive's
> light to blink.  I'm fairly sure I've used this feature on 3ware
> controllers as well.  I use this even when I'm sure of the failed drive
> number and am the only sysadmin for these systems, because I don't even
> trust my own memory.  :)
Yes, that's my attitude exactly. If the controller is connected to the
backplane correctly, a drive that has failed (in the controller's opinion)
will have a different LED lit up (a different color, or an extra LED,
depending on the backplane). So, just by looking at the box you know which
drive to pull. But exactly like you, when rolling a box out into production
I make sure I know which drive has which physical drive number in the
controller's book. Then I know from the controller which drive failed and
which drive's LED to expect to be lit when I have my hands on the box, and
if it is not what I expect, there will be a long investigation into why
before I do anything. Luckily, it has never happened that way. Still, as
you do, I prefer RAID-6, because even the improbable can happen. Even if
RAID10 can give you more speed (which is questionable with a controller
cache), I prefer reliability (yes, RAID60 is there too, but it is too
wasteful for the simple things we do).

Valeri

