All,

I have a test environment with 4 internal disks and the RAIDZ option.

Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
handles things well without data errors?

Options considered:
1. suddenly pulling a disk out
2. using zpool offline

I think both of these have issues in simulating a sudden failure.

thanks
sundeep
On Mon, November 23, 2009 11:44, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
> handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline

3. Use dd to a raw device to corrupt small random parts of the disks
supporting the zpool.

> I think both these have issues in simulating a sudden failure

Probably now it's "all three" :-(.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
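A minimal sketch of that dd approach, assuming a scratch pool named
testpool with a member at c1t2d0s0 (both are placeholders, so double-check
the target device before writing to it):

    # scribble ~1 MB of random data somewhere in the middle of one raidz member
    dd if=/dev/urandom of=/dev/rdsk/c1t2d0s0 bs=512 count=2048 seek=200000 conv=notrunc

    # force ZFS to read everything back and report what it repaired
    zpool scrub testpool
    zpool status -v testpool

If raidz does its job, zpool status should show checksum errors against
that one device while the pool itself reports no data errors.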
On Mon, Nov 23 at 9:44, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure

What is more sudden than yanking a live disk out when the system is
under load? Maybe cut the non-wall end off a 120V power cord, and touch
the exposed wiring to your drive's circuit board? (don't try this at
home without a good fire suppression system and eye protection)

If you're really looking to shave a few milliseconds off the time to
fail, maybe build a digital toggle switch of some kind that will short
the Tx+/- pair inside your SATA data cable?

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
I would try using hdadm or cfgadm to specifically offline devices out
from under ZFS. I have done that previously with cfgadm for systems I
cannot physically access.

You can also use file-backed storage to create your raidz, and then
move, delete, or overwrite the files to simulate issues.

Shawn

On Nov 23, 2009, at 1:32 PM, David Dyer-Bennet wrote:
>
> On Mon, November 23, 2009 11:44, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
>> handles things well without data errors
>>
>> Options considered
>> 1. suddenly pulling a disk out
>> 2. using zpool offline
>
> 3. Use dd to a raw device to corrupt small random parts of the disks
> supporting the zpool.
>
>> I think both these have issues in simulating a sudden failure
>
> Probably now it's "all three" :-(.
>
> --
> David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
> Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
> Photos: http://dd-b.net/photography/gallery/
> Dragaera: http://dragaera.info

--
Shawn Ferry
shawn.ferry at sun.com
571.291.4898
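As a sketch of the file-backed approach (file names, sizes, and the pool
name below are made up for illustration):

    # build a throwaway raidz pool on four 256 MB backing files
    mkfile 256m /var/tmp/vdev0 /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3
    zpool create testpool raidz /var/tmp/vdev0 /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3

    # "fail" one member by scribbling over its backing file, then check the pool
    dd if=/dev/urandom of=/var/tmp/vdev2 bs=1024k count=64 conv=notrunc
    zpool scrub testpool
    zpool status -v testpool

Nothing here touches a real disk, so it is safe to repeat until you
trust the behaviour.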
Kjetil Torgrim Homme
2009-Nov-23 19:03 UTC
[zfs-discuss] zfs-raidz - simulate disk failure
sundeep dhall <sundeep.dhall at sun.com> writes:

> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure

why not take a look at what HP's test department is doing and fire a
round through the disk with a rifle?

oh, I guess that won't be a *simulation*.
--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On Nov 23, 2009, at 9:44 AM, sundeep dhall wrote:
> All,
>
> I have a test environment with 4 internal disks and the RAIDZ option.
>
> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors

First, list the failure modes you expect to see. Second, simulate them.

> Options considered
> 1. suddenly pulling a disk out

Is this a failure mode you expect to see?

> 2. using zpool offline

This isn't a failure mode.

> I think both these have issues in simulating a sudden failure

Some other ideas:

Depending on your hardware, you can turn off power to the disk from the
command line using luxadm.

You can partition the drive (using format) and set the slice size to
zero. This will cause all attempts to access the slice to fail.

You can intentionally corrupt the disk by overwriting.

You can use DTrace to inject faults. IMHO, not touching the hardware is
a good policy when testing.

Finally, you can take a look at ztest and see how it simulates
failures. In fact, ztest may be what you really want to use.
 -- richard
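For reference, a rough ztest invocation; the flags shown are from memory
of the ztest usage message on OpenSolaris builds, so run ztest with no
arguments first to confirm them on your system:

    # hammer a throwaway pool built on files under /var/tmp for ~5 minutes,
    # letting ztest do its own fault injection along the way
    mkdir -p /var/tmp/ztest
    ztest -V -T 300 -f /var/tmp/ztest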
On Mon, November 23, 2009 12:42, Eric D. Mudama wrote:
> On Mon, Nov 23 at 9:44, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs / raidz
>> handles things well without data errors
>>
>> Options considered
>> 1. suddenly pulling a disk out
>> 2. using zpool offline
>>
>> I think both these have issues in simulating a sudden failure
>
> What is more sudden than yanking a live disk out when the system is
> under load? Maybe cut the non-wall end off a 120V power cord, and
> touch the exposed wiring to your drive's circuit board?

From a testing point of view, the safe assumption is that pulling a
drive from hot-swap is not necessarily the same thing as that drive
failing while mounted.

> (don't try this at home without a good fire suppression system and eye
> protection)

Use insulated pliers, too.

> If you're really looking to shave a few milliseconds off the time to
> fail, maybe build a digital toggle switch of some kind that will short
> the Tx+/- pair inside your SATA data cable?

Or modify a driver to do fault insertion.

No, I wouldn't bother with that level of testing when qualifying
products for my own purchase.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Nov 23, 2009, at 11:41 AM, Richard Elling wrote:
> On Nov 23, 2009, at 9:44 AM, sundeep dhall wrote:
>> All,
>>
>> I have a test environment with 4 internal disks and the RAIDZ option.
>>
>> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
>> raidz handles things well without data errors

NB, the comments section of zinject.c is an interesting read.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zinject/
 -- richard
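zinject itself can stand in for a misbehaving disk without touching the
hardware. A rough example, with the pool and device names as placeholders
(zinject is an unsupported test tool, so check its usage message on your
build):

    # make reads from one raidz member return EIO
    zinject -d c1t2d0 -e io -T read testpool

    # list the active handlers, and clear them all when done
    zinject
    zinject -c all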
Those are great, but they're about testing the zfs software. There's a
small amount of overlap, in that these injections include trying to
simulate the hoped-for system response (e.g., EIO) to various physical
scenarios, so it's worth looking at for scenario suggestions.

However, for most of us, we generally rely on Sun's (generally
acknowledged as excellent) testing of the software stack.

I suspect the OP is more interested in verifying, on his own hardware,
that physical events and problems will be connected to the software
fault injection test scenarios. The rest of us running on random
commodity hardware have largely the same interest, because Sun hasn't
qualified the hardware parts of the stack as well. We've taken on that
responsibility ourselves (both individually, and as a community by
sharing findings).

For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination
  notice the problem and result in the appropriate error from the
  driver upwards?
* How quickly does it notice? Do I have to wait for some long timeout
  or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to
  recover/react, or is there some kind of follow-on failure (bus
  hangs/resets, etc) that will have wider impact?

Yanking disk controller and/or power cables is an easy and obvious
test. Testing scenarios that involve things like disk firmware
behaviour in response to bad reads is harder - though apparently
yelling at them might be worthwhile :-)

Finding ways to dial up the load on your PSU (or drop voltage/limit
current to a specific device with an inline filter) might be an idea,
since overloaded power supplies seem to be implicated in various
people's reports of trouble. Finding ways to generate EMF or "cosmic
rays" to induce other kinds of failure is left as an exercise.
On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:
> Those are great, but they're about testing the zfs software. There's a
> small amount of overlap, in that these injections include trying to
> simulate the hoped-for system response (e.g., EIO) to various physical
> scenarios, so it's worth looking at for scenario suggestions.
>
> However, for most of us, we generally rely on Sun's (generally
> acknowledged as excellent) testing of the software stack.
>
> I suspect the OP is more interested in verifying, on his own hardware,
> that physical events and problems will be connected to the software
> fault injection test scenarios. The rest of us running on random
> commodity hardware have largely the same interest, because Sun hasn't
> qualified the hardware parts of the stack as well. We've taken on that
> responsibility ourselves (both individually, and as a community by
> sharing findings).

Agree 110%.

> For example, for the various kinds of failures that might happen:
> * Does my particular drive/controller/chipset/bios/etc combination
>   notice the problem and result in the appropriate error from the
>   driver upwards?
> * How quickly does it notice? Do I have to wait for some long timeout
>   or other retry cycle, and is that a problem for my usage?
> * Does the rest of the system keep working to allow zfs to
>   recover/react, or is there some kind of follow-on failure (bus
>   hangs/resets, etc) that will have wider impact?
>
> Yanking disk controller and/or power cables is an easy and obvious
> test. Testing scenarios that involve things like disk firmware
> behaviour in response to bad reads is harder - though apparently
> yelling at them might be worthwhile :-)

The problem is that yanking a disk tests the failure mode of yanking a
disk. If this is the sort of failure you expect to see, then perhaps
you should look at a mechanical solution. If you wish to test the
failure modes you are likely to see, then you need a more sophisticated
test rig that will emulate a device and inject the sorts of faults you
expect.

> Finding ways to dial up the load on your PSU (or drop voltage/limit
> current to a specific device with an inline filter) might be an idea,
> since overloaded power supplies seem to be implicated in various
> people's reports of trouble. Finding ways to generate EMF or "cosmic
> rays" to induce other kinds of failure is left as an exercise.

Many parts of the stack have software fault injection capabilities.
Whether you do this with something like zinject or the wansimulator,
the principle is the same. For example, you could easily add
wansimulator to an iSCSI rig to inject packet corruption in the
network. You can also roll your own with DTrace, which allows you to
change the return values of any function.

The COMSTAR project has a test suite that could be leveraged, but it
does not appear to be explicitly designed to perform system tests. I'm
reasonably confident that the driver teams have test code, too, but I
would also expect them to be oriented towards unit testing. A quick
search will turn up many fault injection software programs geared
towards unit testing. Finally, there are companies that provide
system-level test services.
 -- richard
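One related DTrace sketch, using the destructive chill() action to delay
every I/O to a single device, which mimics a slow or unresponsive drive
rather than a hard failure but exercises much of the same error-handling
path. The statname sd3 is a placeholder (take the real one from
iostat -xn):

    # requires destructive actions (-w); delays each I/O to sd3 by ~100 ms
    dtrace -wn 'io:::start /args[1]->dev_statname == "sd3"/ { chill(100000000); }'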
>> [verify on real hardware and share results]
> Agree 110%.

Good :)

> > Yanking disk controller and/or power cables is an
> > easy and obvious test.

> The problem is that yanking a disk tests the failure
> mode of yanking a disk.

Yes, but the point is that it's a cheap and easy test, so you might as
well do it -- just beware of what it does, and most importantly does
not, tell you. It's a valid scenario to test regardless: you want to be
sure that you can yank a disk to replace it, without a bus hang or
other hotplug problem on your hardware.

> > Testing scenarios that involve things like
> > disk firmware behaviour in response to
> > bad reads is harder -

> If you wish to test the failure modes you
> are likely to see, then you need a more
> sophisticated test rig that will emulate
> a device and inject the sorts of faults
> you expect.

This is one reason I like to keep faulty disks! :)
On Nov 25, 2009, at 4:43 PM, Daniel Carosone wrote:
>>> [verify on real hardware and share results]
>> Agree 110%.
>
> Good :)
>
>>> Yanking disk controller and/or power cables is an
>>> easy and obvious test.
>
>> The problem is that yanking a disk tests the failure
>> mode of yanking a disk.
>
> Yes, but the point is that it's a cheap and easy test, so you might as
> well do it -- just beware of what it does, and most importantly does
> not, tell you. It's a valid scenario to test regardless: you want to be
> sure that you can yank a disk to replace it, without a bus hang or
> other hotplug problem on your hardware.

The next problem is that although a spec might say that hot-plugging
works, that doesn't mean the implementers support it. To wit, there are
well known SATA controllers that do not support hot plug. So what good
is the test if the hardware/firmware is known to not support it?

Speaking practically, do you evaluate your chipset and disks for
hotplug support before you buy?

>>> Testing scenarios that involve things like
>>> disk firmware behaviour in response to
>>> bad reads is harder -
>
>> If you wish to test the failure modes you
>> are likely to see, then you need a more
>> sophisticated test rig that will emulate
>> a device and inject the sorts of faults
>> you expect.
>
> This is one reason I like to keep faulty disks! :)

Me too. I still have a SATA drive that breaks POST for every mobo I've
come across. Wanna try hot plug with it? :-)
 -- richard
On Wed, Nov 25 at 16:43, Daniel Carosone wrote:
>> The problem is that yanking a disk tests the failure
>> mode of yanking a disk.
>
> Yes, but the point is that it's a cheap and easy test, so you might
> as well do it -- just beware of what it does, and most importantly
> does not, tell you. It's a valid scenario to test regardless: you
> want to be sure that you can yank a disk to replace it, without a
> bus hang or other hotplug problem on your hardware.

Agreed. It's also a very effective way of preventing your drive from
responding to commands, to test how the system behaves when a drive
stops responding. Some significant percentage of device failures will
look similar.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
> Speaking practically, do you evaluate your chipset
> and disks for hotplug support before you buy?

Yes, if someone else has shared their test results previously.
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> although a spec might say that hot-plugging works, that
    re> doesn't mean the implementers support it.

hotplug means you can plug in a device after boot and use it. That's
not the same thing as being able to unplug a device after boot.

Yes, both features are often broken, which is why it's worth testing!
The first one's more optional than the second. The word 'hotplug'
refers to systems that have both, not systems that have either, so I
think referring to drivers and chips that freeze all the SATA channels
when there's a problem with one channel, or that take fifteen minutes
to time out commands issued to a port that's become empty, as "missing
hotplug" is too generous.

The way you tried to use the word is worse than generous, because you
seem to argue "don't test hot removal of devices because that requires
hotplug support, which is optional and which you don't really need." I
disagree: even if you agree you will never use the unplugged port again
until reboot, not even for the same disk, these systems are still
broken and make zpool redundancy useless for preventing freezes and
panics. It's worth testing.
On Nov 26, 2009, at 12:33 AM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
> re> although a spec might say that hot-plugging works, that
> re> doesn't mean the implementers support it.
>
> hotplug means you can plug in a device after boot and use it. That's
> not the same thing as being able to unplug a device after boot.
>
> Yes, both features are often broken, which is why it's worth testing!
> The first one's more optional than the second. The word 'hotplug'
> refers to systems that have both, not systems that have either, so I
> think referring to drivers and chips that freeze all the SATA channels
> when there's a problem with one channel, or that take fifteen minutes
> to time out commands issued to a port that's become empty, as "missing
> hotplug" is too generous.
>
> The way you tried to use the word is worse than generous, because you
> seem to argue "don't test hot removal of devices because that requires
> hotplug support, which is optional and which you don't really need." I
> disagree: even if you agree you will never use the unplugged port again
> until reboot, not even for the same disk, these systems are still
> broken and make zpool redundancy useless for preventing freezes and
> panics. It's worth testing.

In general, I agree. However, if the device is specified by its
supplier as "not supporting" hotplug, then why waste time testing?
 -- richard