Hi,

This is not exactly ZFS specific, but this still seems like a fruitful
place to ask.

It occurred to me today that hot spares could sit in standby (spun down)
until needed (I know ATA can do this; I'm supposing SCSI does too, but I
haven't looked at a spec recently). Does anybody do this? Or does
everybody do this already?

Does the bathtub curve (the chance of early-life failure) imply that hot
spares should be burned in, instead of sitting there doing nothing from
new? Just like a data disk, it seems to me you'd want to know if a hot
spare fails while waiting to be swapped in. Do they get tested
periodically?

--Toby
You could easily do this in Solaris today by using power.conf(4): just
have it spin down any drives that have been idle for a day or more. The
periodic testing part would be an interesting project to kick off.

--Bill
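A minimal power.conf(4) sketch of this approach (the device path below is
only a placeholder; use the physical path of your own spare from
/devices, and note that exact syntax can vary between Solaris releases):

    # /etc/power.conf
    # Spin the hot spare down after ~24 hours of idle time.
    device-thresholds   /pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@4,0   24h
    autopm              enable

After editing the file, run pmconfig(1M) so the power daemon picks up the
new thresholds.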
Toby Thain wrote:
> It occurred to me today that hot spares could sit in standby (spun
> down) until needed (I know ATA can do this; I'm supposing SCSI does
> too, but I haven't looked at a spec recently). Does anybody do this?
> Or does everybody do this already?

"luxadm stop" will work for many SCSI and FC JBODs. If your drive doesn't
support it, it won't hurt anything; it will just report "Unsupported" --
not very user friendly, IMHO.

I think it is a good idea, with one potential gotcha: it can take 30
seconds or more for a drive to spin up. By default, the sd and ssd
timeouts are long enough that a pending I/O will not notice that the
drive took a while to spin up. However, if you have shortened those
defaults, as is sometimes done to meet high-availability requirements,
then you probably shouldn't do this.

> Does the bathtub curve (the chance of early-life failure) imply that
> hot spares should be burned in, instead of sitting there doing nothing
> from new?

Good question. If you consider that mechanical wear-out is what
ultimately causes many failure modes, then the argument can be made that
a spun-down disk should last longer. The problem is that there are
failure modes which are triggered by a spin-up. I've never seen field
data showing the difference between the two. I spin mine down because
they are too loud and consume more electricity, and electricity is
expensive in Southern California.

> Just like a data disk, it seems to me you'd want to know if a hot spare
> fails while waiting to be swapped in. Do they get tested periodically?

Another good question. AFAIK, they are not accessed until needed. Note:
they will be queried at boot, which will cause a spin-up. I use a cron
job to spin mine down in the late evening.

 -- richard
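As a concrete sketch of the cron-driven spin-down Richard describes (the
device name and schedule here are only examples, not his actual setup):

    # root crontab entry: spin the spare down every night at 23:00.
    0 23 * * * /usr/sbin/luxadm stop /dev/rdsk/c3t5d0s2

    # To spin it back up manually (a pending I/O will also trigger this,
    # it just takes ~30 seconds):
    #   /usr/sbin/luxadm start /dev/rdsk/c3t5d0s2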
On Mon, 29 Jan 2007, Toby Thain wrote:
> It occurred to me today that hot spares could sit in standby (spun
> down) until needed (I know ATA can do this; I'm supposing SCSI does
> too, but I haven't looked at a spec recently). Does anybody do this?
> Or does everybody do this already?

I don't work with enough disk storage systems to know what the industry
norm is. But there are 3 broad categories of disk drive spares:

a) Cold Spare. A spare where the power is not connected until it is
   required. [1]

b) Warm Spare. A spare that is active but placed into a low-power mode,
   or into a "low mechanical wear and tear" mode. In the case of a disk
   drive, the controller board is active but the HDA (Head Disk Assembly)
   is inactive (platters stationary, heads unloaded [if the heads are
   physically unloaded]); it has power applied and can be made "hot" by a
   command over its data/command (bus) connection. The supervisory
   hardware/software/firmware "knows" how long it *should* take the drive
   to go from warm to hot.

c) Hot Spare. A spare that is spun up and ready to accept
   read/write/position (etc.) requests.

> Does the bathtub curve (the chance of early-life failure) imply that
> hot spares should be burned in, instead of sitting there doing nothing
> from new? Just like a data disk, it seems to me you'd want to know if a
> hot spare fails while waiting to be swapped in. Do they get tested
> periodically?

The ideal scenario, as you already allude to, would be for the disk
subsystem to initially configure the drive as a hot spare and send it
periodic "test" events for, say, the first 48 hours. This would get it
past the first segment of the "bathtub" reliability curve, often referred
to as the "infant mortality" phase. After that, (ideally) it would be
placed into "warm standby" mode and periodically tested (once a month?).
If saving power were the highest priority, then the ideal situation would
be one where the disk subsystem could apply/remove power to the spare and
move it between warm and cold on command.

One "trick" with disk subsystems like ZFS, which have yet to have the FMA
type functionality added and which (today) provide for hot spares only,
is to initially configure a pool with one (hot) spare, then add a 2nd hot
spare (a brand new device), say, 12 months later, and another spare 12
months after that (see the zpool sketch following this message). What you
are trying to achieve with this strategy is to avoid the scenario whereby
mechanical systems, like disk drives, tend to "wear out" within the same
general, relatively short timeframe.

One (obvious) issue with this strategy is that it may be impossible to
purchase the same disk drive 12 and 24 months later. However, it's always
possible to purchase a larger disk drive and simply accept that the extra
space provided by the newer drive will be wasted.

[1] The most common example is a disk drive mounted on a carrier but not
seated within the disk drive enclosure. Simply push it in when required.

Off topic: to go off on a tangent, the same strategy applies to a UPS
(Uninterruptible Power Supply), as per the following timeline:

year 0: purchase the UPS and one battery cabinet
year 1: purchase and attach an additional battery cabinet
year 2: purchase and attach an additional battery cabinet
year 3: purchase and attach an additional battery cabinet
year 4: purchase and attach an additional battery cabinet and remove the
        oldest battery cabinet
year 5 ... N: repeat year 4's scenario until it's time to replace the UPS.

The advantage of this scheme is that you can budget a *fixed* recurring
cost for the UPS, and your management understands that the cost recurs --
so that, when the power fails, your UPS will have working batteries!!

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
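A minimal zpool sketch of the staggered-spare trick described above
(device names are hypothetical):

    # Year 0: create the pool with a single hot spare.
    zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 spare c0t4d0

    # Roughly 12 months later: add a second, newly purchased spare
    # (possibly a larger model; the extra capacity simply goes unused).
    zpool add tank spare c1t0d0

    # And again at ~24 months.
    zpool add tank spare c2t0d0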
On 29-Jan-07, at 9:04 PM, Al Hopper wrote:
> I don't work with enough disk storage systems to know what the industry
> norm is. But there are 3 broad categories of disk drive spares:
>
> a) Cold Spare. A spare where the power is not connected until it is
>    required. [1]
>
> b) Warm Spare. A spare that is active but placed into a low-power
>    mode. ...
>
> c) Hot Spare. A spare that is spun up and ready to accept
>    read/write/position (etc.) requests.

Hi Al,

Thanks for reminding me of the distinction. It seems very few
installations would actually require (c)?

> The ideal scenario, as you already allude to, would be for the disk
> subsystem to initially configure the drive as a hot spare and send it
> periodic "test" events for, say, the first 48 hours.

For some reason that's a little shorter than I had in mind, but I take
your word that that's enough burn-in for semiconductors, motors, servos,
etc.

> This would get it past the first segment of the "bathtub" reliability
> curve ...
>
> If saving power were the highest priority, then the ideal situation
> would be one where the disk subsystem could apply/remove power to the
> spare and move it between warm and cold on command.

I am surmising that it would also considerably increase the spare's
useful lifespan versus "hot" and spinning.

> However, it's always possible to purchase a larger disk drive

...which is not guaranteed to be compatible with your storage
subsystem...!

--Toby
Hi Guys,

I seem to remember the MAID (Massive Array of Idle Disks) guys ran into a
problem I think they called static friction, where idle drives would fail
on spin-up after being idle for a long time:

http://www.eweek.com/article2/0,1895,1941205,00.asp

Would that apply here?

Best Regards,
Jason
On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote:
> I seem to remember the MAID (Massive Array of Idle Disks) guys ran into
> a problem I think they called static friction, where idle drives would
> fail on spin-up after being idle for a long time:
>
> http://www.eweek.com/article2/0,1895,1941205,00.asp

You'd think that probably wouldn't happen to a spare drive that was spun
up from time to time. In fact this problem would be (mitigated and/or)
caught by the periodic health check I suggested.

--T
On Jan 29, 2007, at 20:27, Toby Thain wrote:
> On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote:
>> I seem to remember the MAID (Massive Array of Idle Disks) guys ran
>> into a problem I think they called static friction, where idle drives
>> would fail on spin-up after being idle for a long time:
>
> You'd think that probably wouldn't happen to a spare drive that was
> spun up from time to time. In fact this problem would be (mitigated
> and/or) caught by the periodic health check I suggested.

What about a rotating spare?

When setting up a pool, a lot of people would (say) balance things around
buses and controllers to minimize single points of failure, and a
rotating spare could disrupt this organization, but would it be useful at
all?
On 1/30/07, David Magda <dmagda at ee.ryerson.ca> wrote:
> What about a rotating spare?
>
> When setting up a pool, a lot of people would (say) balance things
> around buses and controllers to minimize single points of failure, and
> a rotating spare could disrupt this organization, but would it be
> useful at all?

The costs involved in "rotating" spares, in terms of IOPS reduction, may
not be worth it.

-- 
Just me,
Wire ...
Random thoughts:

If we were to use some intelligence in the design, we could perhaps have
a monitor that profiles the workload on the system (a pool, for example)
over a [week|month|whatever] and selects a point in time, based on
history, at which it would expect the disks to be quiet, and can
'pre-build' the spare with the contents of the disk it's about to swap
out. At the point of switch-over, it could be pretty much
instantaneous... It could also bail out if the system actually started to
get genuinely busy...

That might actually be quite cool. Though, if all disks are rotated, we
end up with a whole bunch of disks that are evenly worn out again, which
is just what we are really trying to avoid! ;)

Nathan.
Hi Toby,

You're right. The healthcheck would definitely find any issues. I
misinterpreted your comment to that effect as a question and didn't quite
latch on.

A zpool MAID mode with that healthcheck might also be interesting on
something like a Thumper for pure-archival, D2D backup work. It would
dramatically cut down on the power. What do y'all think?

Best Regards,
Jason
David Magda wrote:
> What about a rotating spare?
>
> When setting up a pool, a lot of people would (say) balance things
> around buses and controllers to minimize single points of failure, and
> a rotating spare could disrupt this organization, but would it be
> useful at all?

Functionally, that sounds a lot like raidz2! "Hey, I can take a
double-drive failure now! And I don't even need to rebuild! Just like
having a hot spare with raid5, but without the rebuild time!"

Though I can see a "raidz sub N" being useful -- "just tell ZFS how many
parity drives you want, and we'll take care of the rest."

-Luke
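For comparison, a sketch of the two layouts being contrasted here, with
hypothetical device names. The raidz2 pool tolerates two simultaneous
failures; the raidz-plus-spare pool must finish resilvering onto the
spare before it can absorb a second failure:

    # Single parity plus a hot spare that has to resilver in:
    zpool create tank raidz  c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 spare c0t5d0

    # Double parity; a second failure during the first failure's
    # resilver window does not lose data:
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0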
On Mon, Jan 29, 2007 at 09:37:57PM -0500, David Magda wrote:
> What about a rotating spare?
>
> When setting up a pool, a lot of people would (say) balance things
> around buses and controllers to minimize single points of failure, and
> a rotating spare could disrupt this organization, but would it be
> useful at all?

Agami Systems has the concept of "Enterprise Sparing", where the hot
spare is distributed amongst the data drives in the array. When a failure
occurs, the rebuild proceeds in parallel across _all_ drives in the
array:

http://www.issidata.com/specs/agami/enterprise-classreliability.pdf

-- 
albert chin (china at thewrittenword.com)
On Jan 30, 2007, at 09:52, Luke Scharf wrote:
> "Hey, I can take a double-drive failure now! And I don't even need to
> rebuild! Just like having a hot spare with raid5, but without the
> rebuild time!"

Theoretically you want to rebuild as soon as possible, because running in
degraded mode (even with dual parity) increases your chances of data loss
(even though the probabilities involved may seem remote).

Case in point: recently at work we had a drive fail in a server with a
5+1 RAID5 configuration. We replaced it, and about 2-3 weeks later a
separate drive failed. Even with dual parity, if we hadn't replaced and
rebuilt things we would now be cutting it close.

I understand all the math involved with RAID 5/6 and failure rates, but
it's wise to remember that even if the probabilities are small they
aren't zero. :)
> I understand all the math involved with RAID 5/6 and failure rates, but
> it's wise to remember that even if the probabilities are small they
> aren't zero. :)

And after 3-5 years of continuous operation, you better decommission the
whole thing or you will have many disk failures.

Casper
David Magda wrote:
> Theoretically you want to rebuild as soon as possible, because running
> in degraded mode (even with dual parity) increases your chances of data
> loss (even though the probabilities involved may seem remote). Case in
> point: recently at work we had a drive fail in a server with a 5+1
> RAID5 configuration. We replaced it, and about 2-3 weeks later a
> separate drive failed. Even with dual parity, if we hadn't replaced and
> rebuilt things we would now be cutting it close.

I did misspeak -- with raidz2, I still do have to replace the failed
drive ASAP!

However, with raidz2 you don't have to wait hours for the rebuild to
complete before a second drive can fail; with a hot spare, the first and
second failures (provided that the failures occur on the array drives
rather than on the spare) must happen several hours apart. With raidz2 on
the same hardware, the two failures can happen at the same time and the
array can still be rebuilt.

But I guess the utility of the hot spare depends a lot on the number of
drives available and on the layout. In my case, most of the hardware I
have is Apple XRaid units and, when using the hardware RAID inside the
unit, the hot spare must be in the same half of the box as the failed
drive -- in these small, constrained RAIDs, raidz2 would be much better
than raidz plus a spare because of the rebuild time. With Thumper+ZFS or
something like that, though, the spare could be anywhere, and I think I'd
like having a few hot/warm spares on the machine that could be zinged
into service if an array member fails.

-Luke
On Wed, 31 Jan 2007, Casper.Dik at Sun.COM wrote:
> > I understand all the math involved with RAID 5/6 and failure rates,
> > but it's wise to remember that even if the probabilities are small
> > they aren't zero. :)

Agreed. Another thing I've seen is that if you have an A/C (air
conditioning) "event" in the data center or lab, you will usually see a
cluster of failures over the next 2 to 3 weeks. Effectively, all your
disk drives have been thermally stressed and are likely to exhibit a
spike in failure rates in the near term.

Often, in a larger environment, the facilities personnel don't understand
the correlation between an A/C event and disk drive failure rates. And
major A/C upgrade work is often scheduled over a (long) weekend when most
of the technical talent won't be present. After the work is completed,
everyone is told that it "went very well" because the organization does
not "do bad news", and then you lose two drives in a RAID5 array ....

> And after 3-5 years of continuous operation, you better decommission
> the whole thing or you will have many disk failures.

Agreed. We took an 11-disk FC hardware RAID box offline recently because
all the drives were 5 years old. It's tough to hit those power-off
switches and scrap working disk drives, but much better than the business
disruption and professional embarrassment caused by data loss. And much
better to be in control of, and experience, *scheduled* downtime.

BTW: don't forget that if you plan to continue to use the disk enclosure
hardware, you need to replace _all_ the fans first.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Richard Elling wrote:
> Good question. If you consider that mechanical wear-out is what
> ultimately causes many failure modes, then the argument can be made
> that a spun-down disk should last longer. The problem is that there are
> failure modes which are triggered by a spin-up. I've never seen field
> data showing the difference between the two.

Often, the spare is up and running, but for whatever reason you'll have a
bad block on it and you'll die during the reconstruct. Periodically
checking the spare means reading from and writing to it over time in
order to make sure it's still OK. (You take the spare out of the trunk,
you look at it, you check the tire pressure, etc.) The issue I see coming
down the road is that we'll get into a "Golden Gate paint job" situation
where it takes so long to check the spare that we'll just keep the
process going constantly. Not as much wear and tear as real I/O, but the
spare will still be up and running the entire time and you won't be able
to spin it down.
> Often, the spare is up and running, but for whatever reason you'll have
> a bad block on it and you'll die during the reconstruct.

Shouldn't SCSI/ATA block sparing handle this? Reconstruction should be
purely a matter of writing, so "bit rot" shouldn't be an issue; or are
there cases I'm not thinking of? (Yes, I know there are a limited number
of spare blocks, but I wouldn't expect a spare which is turned off to
develop severe media problems... am I wrong?)
Torrey McMahon wrote:
> Often, the spare is up and running, but for whatever reason you'll have
> a bad block on it and you'll die during the reconstruct. Periodically
> checking the spare means reading from and writing to it over time in
> order to make sure it's still OK.

In my experience, checking the spare tire leads to getting a flat and
needing the spare about a week later :-) It has happened to me twice in
the past few years... I suspect a conspiracy... :-)

Back to the topic, I'd believe that some combination of hot, warm, and
cold spares would be optimal.

Anton B. Rang wrote:
> Shouldn't SCSI/ATA block sparing handle this? Reconstruction should be
> purely a matter of writing, so "bit rot" shouldn't be an issue; or are
> there cases I'm not thinking of? (Yes, I know there are a limited
> number of spare blocks, but I wouldn't expect a spare which is turned
> off to develop severe media problems... am I wrong?)

In the disk, at the disk block level, there is fairly substantial ECC.
Yet we still see data loss. There are many mechanisms at work here. One
that we have studied in some detail is superparamagnetic decay -- the
medium wants to decay to a lower-energy state, losing information in the
process. One way to "prevent" this is to rewrite the data -- basically
resetting the decay clock. The study we did on this says that rewriting
your data once per year is reasonable. Note that ZFS is COW, and
scrubbing is currently a read operation which will only write when data
needs to be reconstructed.

I look at this as: rewrite-style scrubbing is preventative; read-and-verify
style scrubbing is prescriptive. Either is better than neither.

In short, use spares and scrub.
 -- richard
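A minimal sketch of the "use spares and scrub" advice, assuming a pool
named tank; the monthly schedule is only an example:

    # root crontab entry: start a read-and-verify scrub at 02:00 on the
    # 1st of every month.
    0 2 1 * * /usr/sbin/zpool scrub tank

    # Check progress and results afterwards:
    #   zpool status -v tank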
Richard Elling wrote:
> I look at this as: rewrite-style scrubbing is preventative;
> read-and-verify style scrubbing is prescriptive. Either is better than
> neither.
>
> In short, use spares and scrub.

I see another purpose for rewrite-style scrubbing: it would be an enabler
for disk eviction. First you mark the disk you want to evict as
read-only, then start a rewrite scrub. When done, your disk is free of
data and can be taken out.

-- 
Henk Langeveld <henk at hlangeveld.nl>