All,

I have a test environment with 4 internal disks and the RAIDZ option.

Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
handles things well without data errors?

Options considered:
1. suddenly pulling a disk out
2. using zpool offline

I think both of these have issues in simulating a sudden failure.

thanks
sundeep
On Mon, November 23, 2009 11:44, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
> handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline

3. Use dd to a raw device to corrupt small random parts of the disks
supporting the zpool.

> I think both these have issues in simulating a sudden failure

Probably now it's "all three" :-(.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
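A minimal sketch of that dd approach, assuming a scratch pool named
testpool with a member at c1t2d0s0 (both are placeholders, so double-check
the target device before writing to it):

    # scribble ~1 MB of random data somewhere in the middle of one raidz member
    dd if=/dev/urandom of=/dev/rdsk/c1t2d0s0 bs=512 count=2048 seek=200000 conv=notrunc

    # force ZFS to read everything back and report what it repaired
    zpool scrub testpool
    zpool status -v testpool

If raidz does its job, zpool status should show checksum errors against
that one device while the pool itself reports no data errors.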
On Mon, Nov 23 at 9:44, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure

What is more sudden than yanking a live disk out when the system is
under load? Maybe cut the non-wall end off a 120V power cord, and touch
the exposed wiring to your drive's circuit board? (don't try this at
home without a good fire suppression system and eye protection)

If you're really looking to shave a few milliseconds off the time to
fail, maybe build a digital toggle switch of some kind that will short
the Tx+/- pair inside your SATA data cable?

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
I would try using hdadm or cfgadm to specifically offline devices out
from under ZFS. I have done that previously with cfgadm for systems I
cannot physically access.

You can also use file-backed storage to create your raidz, and then
move, delete, or overwrite the files to simulate issues.

Shawn

On Nov 23, 2009, at 1:32 PM, David Dyer-Bennet wrote:
>
> On Mon, November 23, 2009 11:44, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
>> handles things well without data errors
>>
>> Options considered
>> 1. suddenly pulling a disk out
>> 2. using zpool offline
>
> 3. Use dd to a raw device to corrupt small random parts of the disks
> supporting the zpool.
>
>> I think both these have issues in simulating a sudden failure
>
> Probably now it's "all three" :-(.
>
> --
> David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
> Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
> Photos: http://dd-b.net/photography/gallery/
> Dragaera: http://dragaera.info

--
Shawn Ferry
shawn.ferry at sun.com
571.291.4898
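As a sketch of the file-backed approach (file names, sizes, and the pool
name below are made up for illustration):

    # build a throwaway raidz pool on four 256 MB backing files
    mkfile 256m /var/tmp/vdev0 /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3
    zpool create testpool raidz /var/tmp/vdev0 /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3

    # "fail" one member by scribbling over its backing file, then check the pool
    dd if=/dev/urandom of=/var/tmp/vdev2 bs=1024k count=64 conv=notrunc
    zpool scrub testpool
    zpool status -v testpool

Nothing here touches a real disk, so it is safe to repeat until you
trust the behaviour.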
Kjetil Torgrim Homme
2009-Nov-23 19:03 UTC
[zfs-discuss] zfs-raidz - simulate disk failure
sundeep dhall <sundeep.dhall at sun.com> writes:

> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure

why not take a look at what HP's test department is doing and fire a
round through the disk with a rifle?

oh, I guess that won't be a *simulation*.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On Nov 23, 2009, at 9:44 AM, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors

First, list the failure modes you expect to see. Second, simulate them.

> Options considered
> 1. suddenly pulling a disk out

Is this a failure mode you expect to see?

> 2. using zpool offline

This isn't a failure mode.

> I think both these have issues in simulating a sudden failure

Some other ideas:

Depending on your hardware, you can turn off power to the disk from the
command line using luxadm.

You can partition the drive (using format) and set the slice size to
zero. This will cause all attempts to access the slice to fail.

You can intentionally corrupt the disk by overwriting.

You can use DTrace to inject faults. IMHO, not touching the hardware is
a good policy when testing.

Finally, you can take a look at ztest and see how it simulates
failures. In fact, ztest may be what you really want to use.
 -- richard
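For reference, a rough ztest invocation; the flags shown are from memory
of the ztest usage message on OpenSolaris builds, so run ztest with no
arguments first to confirm them on your system:

    # hammer a throwaway pool built on files under /var/tmp for ~5 minutes,
    # letting ztest do its own fault injection along the way
    mkdir -p /var/tmp/ztest
    ztest -V -T 300 -f /var/tmp/ztest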
On Mon, November 23, 2009 12:42, Eric D. Mudama wrote:
> On Mon, Nov 23 at 9:44, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
>> handles things well without data errors
>>
>> Options considered
>> 1. suddenly pulling a disk out
>> 2. using zpool offline
>>
>> I think both these have issues in simulating a sudden failure
>
> What is more sudden than yanking a live disk out when the system is
> under load? Maybe cut the non-wall end off a 120V power cord, and
> touch the exposed wiring to your drive's circuit board?

From a testing point of view, the safe assumption is that pulling a
drive from hot-swap is not necessarily the same thing as that drive
failing while mounted.

> (don't try this at home without a good fire suppression system and eye
> protection)

Use insulated pliers, too.

> If you're really looking to shave a few milliseconds off the time to
> fail, maybe build a digital toggle switch of some kind that will short
> the Tx+/- pair inside your SATA data cable?

Or modify a driver to do fault insertion.

No, I wouldn't bother with that level of testing when qualifying
products for my own purchase.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Nov 23, 2009, at 11:41 AM, Richard Elling wrote:
> On Nov 23, 2009, at 9:44 AM, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
>> raidz handles things well without data errors

NB, the comments section of zinject.c is an interesting read.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zinject/
 -- richard
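zinject itself can stand in for a misbehaving disk without touching the
hardware. A rough example, with the pool and device names as placeholders
(zinject is an unsupported test tool, so check its usage message on your
build):

    # make reads from one raidz member return EIO
    zinject -d c1t2d0 -e io -T read testpool

    # list the active handlers, and clear them all when done
    zinject
    zinject -c all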
Those are great, but they're about testing the zfs software. There's a
small amount of overlap, in that these injections include trying to
simulate the hoped-for system response (e.g., EIO) to various physical
scenarios, so it's worth looking at for scenario suggestions.

However, for most of us, we generally rely on Sun's (generally
acknowledged as excellent) testing of the software stack.

I suspect the OP is more interested in verifying, on his own hardware,
that physical events and problems will be connected to the software
fault injection test scenarios. The rest of us running on random
commodity hardware have largely the same interest, because Sun hasn't
qualified the hardware parts of the stack as well. We've taken on that
responsibility ourselves (both individually, and as a community by
sharing findings).

For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination
  notice the problem and result in the appropriate error from the
  driver upwards?
* How quickly does it notice? Do I have to wait for some long timeout
  or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to
  recover/react, or is there some kind of follow-on failure (bus
  hangs/resets, etc) that will have wider impact?

Yanking disk controller and/or power cables is an easy and obvious
test. Testing scenarios that involve things like disk firmware
behaviour in response to bad reads is harder - though apparently
yelling at them might be worthwhile :-)

Finding ways to dial up the load on your PSU (or drop voltage/limit
current to a specific device with an inline filter) might be an idea,
since overloaded power supplies seem to be implicated in various
people's reports of trouble. Finding ways to generate EMF or "cosmic
rays" to induce other kinds of failure is left as an exercise.
On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:
> Those are great, but they're about testing the zfs software. There's a
> small amount of overlap, in that these injections include trying to
> simulate the hoped-for system response (e.g., EIO) to various physical
> scenarios, so it's worth looking at for scenario suggestions.
>
> However, for most of us, we generally rely on Sun's (generally
> acknowledged as excellent) testing of the software stack.
>
> I suspect the OP is more interested in verifying, on his own hardware,
> that physical events and problems will be connected to the software
> fault injection test scenarios. The rest of us running on random
> commodity hardware have largely the same interest, because Sun hasn't
> qualified the hardware parts of the stack as well. We've taken on that
> responsibility ourselves (both individually, and as a community by
> sharing findings).

Agree 110%.

> For example, for the various kinds of failures that might happen:
> * Does my particular drive/controller/chipset/bios/etc combination
>   notice the problem and result in the appropriate error from the
>   driver upwards?
> * How quickly does it notice? Do I have to wait for some long timeout
>   or other retry cycle, and is that a problem for my usage?
> * Does the rest of the system keep working to allow zfs to
>   recover/react, or is there some kind of follow-on failure (bus
>   hangs/resets, etc) that will have wider impact?
>
> Yanking disk controller and/or power cables is an easy and obvious
> test. Testing scenarios that involve things like disk firmware
> behaviour in response to bad reads is harder - though apparently
> yelling at them might be worthwhile :-)

The problem is that yanking a disk tests the failure mode of yanking a
disk. If this is the sort of failure you expect to see, then perhaps
you should look at a mechanical solution. If you wish to test the
failure modes you are likely to see, then you need a more sophisticated
test rig that will emulate a device and inject the sorts of faults you
expect.

> Finding ways to dial up the load on your PSU (or drop voltage/limit
> current to a specific device with an inline filter) might be an idea,
> since overloaded power supplies seem to be implicated in various
> people's reports of trouble. Finding ways to generate EMF or "cosmic
> rays" to induce other kinds of failure is left as an exercise.

Many parts of the stack have software fault injection capabilities.
Whether you do this with something like zinject or the wansimulator,
the principle is the same. For example, you could easily add
wansimulator to an iSCSI rig to inject packet corruption in the
network. You can also roll your own with DTrace, which allows you to
change the return values of any function.

The COMSTAR project has a test suite that could be leveraged, but it
does not appear to be explicitly designed to perform system tests. I'm
reasonably confident that the driver teams have test code, too, but I
would also expect them to be oriented towards unit testing. A quick
search will turn up many fault injection software programs geared
towards unit testing. Finally, there are companies that provide
system-level test services.
 -- richard
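One related DTrace sketch, using the destructive chill() action to delay
every I/O to a single device, which mimics a slow or unresponsive drive
rather than a hard failure but exercises much of the same error-handling
path. The statname sd3 is a placeholder (take the real one from
iostat -xn):

    # requires destructive actions (-w); delays each I/O to sd3 by ~100 ms
    dtrace -wn 'io:::start /args[1]->dev_statname == "sd3"/ { chill(100000000); }'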
>> [verify on real hardware and share results]
> Agree 110%.

Good :)

> > Yanking disk controller and/or power cables is an
> > easy and obvious test.

> The problem is that yanking a disk tests the failure
> mode of yanking a disk.

Yes, but the point is that it's a cheap and easy test, so you might as
well do it -- just beware of what it does, and most importantly does
not, tell you. It's a valid scenario to test regardless: you want to be
sure that you can yank a disk to replace it, without a bus hang or
other hotplug problem on your hardware.

> > Testing scenarios that involve things like
> > disk firmware behaviour in response to
> > bad reads is harder -

> If you wish to test the failure modes you
> are likely to see, then you need a more
> sophisticated test rig that will emulate
> a device and inject the sorts of faults
> you expect.

This is one reason I like to keep faulty disks! :)
On Nov 25, 2009, at 4:43 PM, Daniel Carosone wrote:
>>> [verify on real hardware and share results]
>> Agree 110%.
>
> Good :)
>
>>> Yanking disk controller and/or power cables is an
>>> easy and obvious test.
>
>> The problem is that yanking a disk tests the failure
>> mode of yanking a disk.
>
> Yes, but the point is that it's a cheap and easy test, so you might as
> well do it -- just beware of what it does, and most importantly does
> not, tell you. It's a valid scenario to test regardless: you want to be
> sure that you can yank a disk to replace it, without a bus hang or
> other hotplug problem on your hardware.

The next problem is that although a spec might say that hot-plugging
works, that doesn't mean the implementers support it. To wit, there are
well known SATA controllers that do not support hot plug. So what good
is the test if the hardware/firmware is known to not support it?

Speaking practically, do you evaluate your chipset and disks for
hotplug support before you buy?

>>> Testing scenarios that involve things like
>>> disk firmware behaviour in response to
>>> bad reads is harder -
>
>> If you wish to test the failure modes you
>> are likely to see, then you need a more
>> sophisticated test rig that will emulate
>> a device and inject the sorts of faults
>> you expect.
>
> This is one reason I like to keep faulty disks! :)

Me too. I still have a SATA drive that breaks POST for every mobo I've
come across. Wanna try hot plug with it? :-)
 -- richard
On Wed, Nov 25 at 16:43, Daniel Carosone wrote:
>> The problem is that yanking a disk tests the failure
>> mode of yanking a disk.
>
> Yes, but the point is that it's a cheap and easy test, so you might
> as well do it -- just beware of what it does, and most importantly
> does not, tell you. It's a valid scenario to test regardless: you
> want to be sure that you can yank a disk to replace it, without a
> bus hang or other hotplug problem on your hardware.

Agreed. It's also a very effective way of preventing your drive from
responding to commands, to test how the system behaves when a drive
stops responding. Some significant percentage of device failures will
look similar.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
> Speaking practically, do you evaluate your chipset
> and disks for hotplug support before you buy?

Yes, if someone else has shared their test results previously.
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> although a spec might say that hot-plugging works, that
    re> doesn't mean the implementers support it.

hotplug means you can plug in a device after boot and use it. That's
not the same thing as being able to unplug a device after boot.

Yes, both features are often broken, which is why it's worth testing!
The first one's more optional than the second. The word 'hotplug'
refers to systems that have both, not systems that have either, so I
think referring to drivers and chips that freeze all the SATA channels
when there's a problem with one channel, or that take fifteen minutes
to time out commands issued to a port that's become empty, as "missing
hotplug" is too generous.

The way you tried to use the word is worse than generous, because you
seem to argue "don't test hot removal of devices because that requires
hotplug support, which is optional and which you don't really need." I
disagree: even if you agree you will never use the unplugged port again
until reboot, not even for the same disk, these systems are still
broken and make zpool redundancy useless for preventing freezes and
panics. It's worth testing.
On Nov 26, 2009, at 12:33 AM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
> re> although a spec might say that hot-plugging works, that
> re> doesn't mean the implementers support it.
>
> hotplug means you can plug in a device after boot and use it. That's
> not the same thing as being able to unplug a device after boot.
>
> Yes, both features are often broken, which is why it's worth testing!
> The first one's more optional than the second. The word 'hotplug'
> refers to systems that have both, not systems that have either, so I
> think referring to drivers and chips that freeze all the SATA channels
> when there's a problem with one channel, or that take fifteen minutes
> to time out commands issued to a port that's become empty, as "missing
> hotplug" is too generous.
>
> The way you tried to use the word is worse than generous, because you
> seem to argue "don't test hot removal of devices because that requires
> hotplug support, which is optional and which you don't really need." I
> disagree: even if you agree you will never use the unplugged port again
> until reboot, not even for the same disk, these systems are still
> broken and make zpool redundancy useless for preventing freezes and
> panics. It's worth testing.

In general, I agree. However, if the device is specified by its
supplier as "not supporting" hotplug, then why waste time testing?
 -- richard