I watched both the YouTube video http://www.youtube.com/watch?v=CN6iDzesEs0 and the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit".

In the first one it is obvious that the app stops working when they smash the drives; they have to physically detach the drive before the array reconstruction begins. I'm not the only one who noticed it; from the comments on YouTube:

"It appears that ZFS didn't recover after each drive failure until he unplugged the failed drive? Or was it coincidence that he unplugged the drive just as ZFS started recovering?"
Reply:
"Yep. its a bug in solaris. But if you try and tell a sun person that, they get really pissy."

In the second video the focus is on the drive when the guy smashes it; I don't see any reason why they would not let you see the app while he smashed the drive. The focus comes back to the running app right after he detached the hard drive.
--
This message posted from opensolaris.org
On 23-Nov-08, at 12:21 PM, Scara Maccai wrote:

> I watched both the youtube video
> http://www.youtube.com/watch?v=CN6iDzesEs0
> and the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit".
>
> In the first one it is obvious that the app stops working when they
> smash the drives; they have to physically detach the drive before
> the array reconstruction begins.
> I'm not the only one who noticed it; from the comments on YouTube:
>
> "It appears that ZFS didn't recover after each drive failure until
> he unplugged the failed drive? Or was it coincidence that he
> unplugged the drive just as ZFS started recovering?"
> Reply:
> "Yep. its a bug in solaris. But if you try and tell a sun person
> that, they get really pissy."

Why would it be assumed to be a bug in Solaris? Seems more likely on balance to be a problem in the error reporting path or a controller/firmware weakness.

I'm pretty sure the first 2 versions of this demo I saw were executed perfectly - and in a packed auditorium (Moscow? and Russians are the toughest crowd). No smoke, no mirrors.

--T

> In the second video the focus is on the drive when the guy smashes
> it; I don't see any reason why they would not let you see the app
> while he smashed the drive.
> The focus comes back to the running app right after he detached the
> hard drive.
> Why would it be assumed to be a bug in Solaris? Seems more likely on
> balance to be a problem in the error reporting path or a controller/
> firmware weakness.

Weird: they would use a controller/firmware that doesn't work? Bad call...

> I'm pretty sure the first 2 versions of this demo I saw were executed
> perfectly - and in a packed auditorium (Moscow? and Russians are the
> toughest crowd). No smoke, no mirrors.

I still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...
--
This message posted from opensolaris.org
On 24-Nov-08, at 10:40 AM, Scara Maccai wrote:

>> Why would it be assumed to be a bug in Solaris? Seems more likely on
>> balance to be a problem in the error reporting path or a controller/
>> firmware weakness.
>
> Weird: they would use a controller/firmware that doesn't work? Bad
> call...

Seems to me a sledgehammer would produce fairly random failure modes. How would you pre-test?!

--T

>> I'm pretty sure the first 2 versions of this demo I saw were executed
>> perfectly - and in a packed auditorium (Moscow? and Russians are the
>> toughest crowd). No smoke, no mirrors.
>
> I still don't understand why even the one on http://www.opensolaris.com/,
> "ZFS - A Smashing Hit", doesn't show the app running at the moment the
> HD is smashed... weird...
On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote:
> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...

ZFS is primarily about protecting your data: correctness, at the expense of everything else if necessary. It happens to be very fast under most circumstances, but if a disk vanishes as if a sledgehammer hit it, ZFS will wait on the device driver to decide it's dead. Device drivers are generally the same way, choosing correctness over speed. Thus, ZFS can take a while to notice that a disk is gone and do something about it---but in the meantime, it won't make any promises it can't keep.

This is to be regarded as a Good Thing. If a disk fails and ZFS throws away all of my data as a result, I'm not going to be happy; if a disk fails and ZFS takes 30 seconds to notice, I'm still happy with that.

That said, there have been several threads about wanting configurable device timeouts handled at the ZFS level rather than the device driver level. Perhaps this will be implemented at some point... but in the meantime I prefer correctness to availability.

Will
Will Murnane wrote:
> On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote:
>> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...

Sorry this is OT, but is it just me, or does it only seem proper to have Gallagher do this? ;)

./C
> if a disk vanishes as if a sledgehammer hit it, ZFS will wait on the
> device driver to decide it's dead.

OK, I see it.

> That said, there have been several threads about wanting configurable
> device timeouts handled at the ZFS level rather than the device driver
> level.

Uh, so I can configure timeouts at the device level? I didn't know that.
--
This message posted from opensolaris.org
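(For what it's worth, the timeouts being discussed here live in the disk drivers rather than in ZFS. A rough sketch of where one such knob sits, assuming the Solaris sd target driver; the name and default vary by driver and release, so treat the value below as purely illustrative:

  # /etc/system (illustrative only; takes effect after a reboot)
  # per-command timeout used by the sd driver, in seconds (default 60)
  set sd:sd_io_time = 10

The retry count the driver applies on top of that timeout is a separate driver variable, not something ZFS controls, which is why a pool can sit through several retry cycles before an error ever reaches it.)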
"C. Bergstr?m" wrote:> Will Murnane wrote: > > On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote: > > > >> Still don''t understand why even the one on > http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn''t > show the app running in the moment the HD is smashed... weird... > >> > Sorry this is OT, but is it just me or does is only seem > proper to have > Gallagher do this? ;)Absolutely not. Under no circumstances should you attempt to create a striped ZFS pool on a watermelon, nor on any other type of epigynous berry. If you try, you will certainly rind up with a mess, if not a core dump. And let me tell you, that''s the pits. --Joe
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:tt> Why would it be assumed to be a bug in Solaris? Seems more tt> likely on balance to be a problem in the error reporting path tt> or a controller/ firmware weakness. It''s not really an assumption. It''s been discussed in here a lot, and we know why it''s happening. It''s just a case of ``it''s a feature not a bug'''' combined with ``somebody else''s problem.'''' The error-reporting path you mention is inside Solaris, so I have a little trouble decoding your statement. I wish drives had a failure-aware QoS with a split queue for aggressive-retry cdb''s and deadline cdb''s. This would make the B_FAILFAST primitive the Solaris developers seem to believe in actually mean something. Solaris is supposed to have a B_FAILFAST option for block I/O that ZFS could start using to capture vdev-level knowledge like ``don''t try too hard to read this block from one device, because we can get it faster by asking another device.'''' In the real world B_FAILFAST is IMO quite silly and, not exactly useless but at best deceptive to the higher-layer developer, because even IF the drive could be told to fail faster than 30 seconds by some future fancier sd driver, there would still be some fail-slow cdbs hitting the drive, and the two can''t be parallelized. Sending a fail-slow cdb to a drive freezes the drive for up to 30 seconds * <n>, where <n> is the multiplier of some cargo-cult state machine built into the host adapter driver involving ``bus resets'''' and other such stuff. All the B_FAILFAST cdbs queued behind the fail-slow may as well forget the flag becasue the drive''s busy with the slow cdb. If you have a very few of these retryable cdbs peppered into your transaction stream, which are expected to take 10 - 100ms each but actually take one or two MINUTES each, the drive will be so slow it''d be more expressive to mark it dead. What will probably happen in $REALITY is, the sysadmin will declare his machine ``frozen without a panic message'''' and reboot it, losing any write-cached data which, if not for this idiocy, could have been committed to other drives in a redundant vdev, as well as rebooting the rest of the system unrelated to this stuck zpool. However, it''s inappropriate for a driver to actually report ``drive dead'''' in this scenario, because the drive is NOT dead. The drive-failure-statistic papers posted in here say that drives usually fail with a bunch of contiguous or clumped-together unreadable sectors. You can still get most of the data off them with dd_rescue or ''dd if=baddrive of=gooddrive bs=512 conv=noerror,sync'', if you wait about a week. About four hours of that week is spent copying data and the rest spent aggressively ``retrying''''. An instantanious manual command, ``I insist this drive is failed. Mark it failed instantly, without leaving me stuck in bogus state machines for two minutes or two hours,'''' would be a huge improvement, but I think graceful automatic behavior is not too much to wish for because this isn''t some strange circumstance. This is *the way drives usually fail*. SCSI drives have all kinds of retry-tuning in the ``mode pages'''' in a standardized format. Even 5.25" 10MB SCSI drives had these pages. One of NetApp''s papers said they don''t even let their SCSI/FC drives do their own bad-block reallocation. They do all that in host software. so there are a lot of secret tuning knobs, and they''re AIUI largely standardized across manufacturers and through the years. 
ATA drives, AIUI, don't have the pages, but some WD gamer drives have some goofy DOS RAID-tuner tool. But even what SCSI drives offer isn't enough to provide the architecture ZFS seems to dream of.

What's really needed to provide ZFS developers' expectations of B_FAILFAST is QoS inside the drive firmware. Drives need to have split queues, with an aggressive-retry queue and a deadline-service queue. While retrying a stuck cdb in the aggressive queue, they must keep servicing the deadline queue. I've never heard of anything like this existing in a real drive. I think it's beyond the programming skill of an electrical engineer, and it may be too constraining for them because drives seem to do spastic head-seeks and sometimes partway spin themselves down and back up during a retry cycle.

ZFS still seems to have this taxonomic-arcania view of drives that they are "failing operations" or the drive itself is "failed". It belongs to the driver's realm to decide whether it's the whole drive or just the "operation" which is failing, because that's how the square peg fits snugly into its square hole. One of the NetApp papers mentions they have proprietary statistical heuristics for when to ignore a drive for a little while and use redundant drives instead, and when to fail a drive and call autosupport. And they log drive behavior really explicitly and unambiguously, separate from "controller" failure, which is why they're able to write the paper at all. I'm in favour of heuristics, but most of the ZFS developers seem to think the issue lies with every driver in Solaris being not up to its promised standards. I still think the ZFS approach is wrong and the NetApp approach right.

* I think if SATA is to be supported, then the fantasy that drives can be configured to return failure early should be cast off forever.

* I don't think ZFS will match the availability behavior of NetApp or even of Areca/PERC/RAID-on-a-card until it includes vdev-level handling of slow devices. This means vdev-level timers inside ZFS, above the block driver level, driving error-recovery decisions.

* I think a pool read/write that takes longer than other drives in a redundant vdev, or longer than other cdb's took on the same drive, should be re-dispatched to fetch redundant data. I think this should happen with really tight tolerance and should be stateful, such that a mirror could have a remote iSCSI component and a local component, and only the local component would be used for reads.

* If a drive is taking 30 seconds to perform every cdb, but is still present and the driver refuses to mark it bad, ZFS needs to be able to mark it bad on its own, so that it no longer blocks synchronous writes, and so hot-spare replacement can start to get the pool back up to policy's redundancy expectation. If we're designing systems with multiple controllers to avoid a "single point of failure" then it's not okay to punt and say, well, this isn't our problem because we're waiting patiently on the controller to do something sane. The short-term decisions require vdev-level knowledge which doesn't exist inside the driver, but arguably marking drives failed does not require vdev-level knowledge and could be done in the driver rather than ZFS. I still think this is wrong.
Based on our experience so far with controller drivers, they aren't very good, and controller chips are rather short-lived so they're never going to be very good, and the drivers are often proprietary so the work has to be redone inside ZFS just to have a bit of software freedom again. A practical modern storage design is robust against bugs in the controller driver, bugs exercised by combinations of drive firmware and controllers or by doing strange things with cables.

If this won't go inside ZFS, then people will reasonably want some pseudodevice like an SVM soft partition or a multipath layer to protect them from failing controller drivers. They might want a way to manually, and instantly, without waiting on stupid state machines, mark the device failed, crack ZFS and the controller driver apart so they're not locked in some deadly embrace of failure that requires rebooting. If we agree there is a need for multipath to a single device, why can we not agree that we expect protection from failures of a controller or its driver even when we don't have multipath but have laid out our vdevs with enough redundancy to tolerate controller failure? In practice, I think drives that become really slow instead of failing outright are the real problem, but bringing in multipath and controller redundancy shows what is to my view the taxonomic hypocrisy of wanting to keep this out of ZFS.

* Management commands like export, status, detach, offline, replace must either (a) never block waiting for I/O: use kernel state only, and do disk writes asynchronously, reporting failure through inspection commands that the user polls like 'zpool status'. This world is possible---we don't expect the mirror to be in sync before 'zpool attach' returns, though we could. Or (b) sleep interruptably, and include a more drastic version that doesn't block, so normally you type 'zpool offline' and when the prompt returns without error, you know that all your labels are updated. But if you don't get a prompt back, you can ^C and 'zpool offline -f'. Not being able to get rid of a drive without access to the drive you want to get rid of is as ridiculous as "keyboard not found. Press F1 to continue." Even square-peg square-hole taxonomists ought to agree on this one.

And I don't like getting "no valid replicas" errors in situations that ZFS will tolerate if you force it by rebooting or by hot-unplugging the device---there should be clear delineation of which pool states are acceptable and which are not, and I should be able to explore all the acceptable states by moving the pool through them manually. If I can't 'zpool offline' a device, and _instantly_ if I insist on it, then the pool should not mount at boot without that device. I shouldn't have to involve rebooting in my testing, or else it feels like fisher price wallpapered crap. I sometimes run my dishwasher with the door open for a half second when I become suspicious of it. The sky doesn't fall. But these days it seems like people believe any interlock anywhere, even a preposterous invented one, is as sacred as the one on a microwave or a UV oven.

Oh, and when possible ZFS should not forget its knowledge of inconsistencies across a reboot, and should for example continue interrupted resilvers like SVM did.
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:

>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>
> tt> Why would it be assumed to be a bug in Solaris? Seems more
> tt> likely on balance to be a problem in the error reporting path
> tt> or a controller/ firmware weakness.
>
> It's not really an assumption. It's been discussed in here a lot, and
> we know why it's happening. It's just a case of "it's a feature not
> a bug" combined with "somebody else's problem."
>
> The error-reporting path you mention is inside Solaris, so I have a
> little trouble decoding your statement.

Not all of it is!

I don't see how anyone could confidently correlate "behaviour after sledgehammer impact" with a specific fault in Solaris, without doing a lot more investigation than "watching a YouTube video". Perhaps this has already been narrowed down to a specific root cause within Solaris - I just didn't see enough data in the OP's post to indicate that.

But I bow to your far more extensive experience...

--Toby
Toby Thain wrote:
> On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
>> The error-reporting path you mention is inside Solaris, so I have a
>> little trouble decoding your statement.
>
> Not all of it is!
>
> I don't see how anyone could confidently correlate "behaviour after
> sledgehammer impact" with a specific fault in Solaris, without doing
> a lot more investigation than "watching a YouTube video". Perhaps
> this has already been narrowed down to a specific root cause within
> Solaris - I just didn't see enough data in the OP's post to indicate
> that.

We could add strain sensors to disk drives which, when the strain was suddenly too great, would register an ASC/ASCQ 75/00 "DEVICE WAS HIT BY A HAMMER", and then we could add the e-report to sd and register an "io-hammer-event" FMA diagnosis engine which would be registered with ZFS to offline the device :-)

But seriously, it really does depend on the failure mode of the device, and I'm not sure people have studied the hammer case very closely. In the worst case, the device would be selectable, but not responding to data requests, which would lead through the device retry logic and can take minutes. If the (USB) device simply disappeared, it would be indistinguishable from a hot-plug event and that logic would take over, which results in a faster diagnosis. I suppose it will depend on the device and your aim.
-- richard
> In the worst case, the device would be selectable, but not responding
> to data requests, which would lead through the device retry logic and
> can take minutes.

That's what I didn't know: that a driver could take minutes (hours???) to decide that a device is not working anymore.

Now comes another question: how can one assume that a drive failure won't take one hour to be acknowledged by the driver? That is: what good is a failover strategy if it takes one hour to start? I'm grateful that the system doesn't write until it knows what is going on, but it shouldn't take that long.
--
This message posted from opensolaris.org
Scara Maccai wrote:
>> In the worst case, the device would be selectable, but not responding
>> to data requests, which would lead through the device retry logic and
>> can take minutes.
>
> That's what I didn't know: that a driver could take minutes (hours???)
> to decide that a device is not working anymore.

For Solaris, sd driver, there are, by default, 60 second timeouts with 5 retries. For the ssd driver, 3 retries. But sometimes additional tests are made to try to verify that the disk is really not working properly, which will cause more of these. Again, it depends on the failure mode.

> Now comes another question: how can one assume that a drive failure
> won't take one hour to be acknowledged by the driver? That is: what
> good is a failover strategy if it takes one hour to start? I'm grateful
> that the system doesn't write until it knows what is going on, but it
> shouldn't take that long.

AFAIK, there are no cases where the timeouts would result in an hour delay before making a decision. Usually, the policy is made in advance, as in the zpool failmode property.
-- richard
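(For readers who haven't met it, the failmode property Richard mentions is a per-pool setting that governs what happens to I/O once the pool has lost access to a device and has no redundancy left to fall back on. A quick illustration, assuming a pool named "tank"; see zpool(1M) on your release for the exact semantics:

  # show the current setting; the default is "wait"
  zpool get failmode tank

  # "continue" returns EIO to new writes but still allows reads from
  # healthy devices; "panic" crash-dumps the host, useful for clusters
  zpool set failmode=continue tank

Note that failmode only applies after the driver has given up on the device; it does not shorten the driver-level timeouts and retries described above.)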
But that's exactly the problem Richard: AFAIK.

Can you state that absolutely, categorically, there is no failure mode out there (caused by hardware faults, or bad drivers) that will lock a drive up for hours? You can't, obviously, which is why we keep saying that ZFS should have this kind of timeout feature.

For once I agree with Miles, I think he's written a really good writeup of the problem here. My simple view on it would be this:

Drives are only aware of themselves as an individual entity. Their job is to save & restore data to themselves, and drivers are written to minimise any chance of data loss. So when a drive starts to fail, it makes complete sense for the driver and hardware to be very, very thorough about trying to read or write that data, and to only fail as a last resort.

I'm not at all surprised that drives take 30 seconds to time out, nor that they could slow a pool for hours. That's their job. They know nothing else about the storage, they just have to do their level best to do as they're told, and will only fail if they absolutely can't store the data.

The raid controller on the other hand (NetApp / ZFS, etc) knows all about the pool. It knows if you have half a dozen good drives online, it knows if there are hot spares available, and it *should* also know how quickly the drives under its care usually respond to requests.

ZFS is perfectly placed to spot when a drive is starting to fail, and to take the appropriate action to safeguard your data. It has far more information available than a single drive ever will, and should be designed accordingly.

Expecting the firmware and drivers of individual drives to control the failure modes of your redundant pool is just crazy imo. You're throwing away some of the biggest benefits of using multiple drives in the first place.
--
This message posted from opensolaris.org
I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action.

We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
> But that's exactly the problem Richard: AFAIK.
>
> Can you state that absolutely, categorically, there is no failure mode
> out there (caused by hardware faults, or bad drivers) that will lock a
> drive up for hours? You can't, obviously, which is why we keep saying
> that ZFS should have this kind of timeout feature.
> ...
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think to my idea of ZFS supporting a "waiting for a response" status for disks as an interim solution that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead, Jim" answer, and I agree that long term the FMA approach will pay dividends. But I still feel this is a good short term solution, and one that would also complement your long term plans.

My justification for this is that it seems to me that you can split disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with FMA doing the extra work to fault drives), so it's just the second that needs immediate attention, and for the life of me I can't think of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum time ZFS will wait for a response from a device before putting it in a "waiting" status. The second would be optional and is the maximum time ZFS will wait before faulting a device (at which point it's replaced by a hot spare).

The reason I think this will work well with the FMA work is that you can implement this now and have a real improvement in ZFS availability. Then, as the other work starts bringing better modeling for drive timeouts, the parameters can be either removed, or set automatically by ZFS.

Long term I guess there's also the potential to remove the second setting if you felt FMA etc. ever got reliable enough, but personally I would always want to have the final fail delay set. I'd maybe set it to a long value such as 1-2 minutes to give FMA etc. a fair chance to find the fault. But I'd be much happier knowing that the system will *always* be able to replace a faulty device within a minute or two, no matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is still vital. The idea is purely to let ZFS keep the pool active by removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single disk, and I would imagine that FMA also has the same limitation - it's only going to be looking at a single item and trying to determine whether it's faulty or not. Because of that, FMA is going to be designed to be very careful to avoid false positives, and will likely take its time to reach an answer in some situations. ZFS however has the benefit of knowing more about the pool, and in the vast majority of situations it should be possible for ZFS to read or write from other devices while it's waiting for an 'official' result from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:
> I think we (the ZFS team) all generally agree with you. The current
> nevada code is much better at handling device failures than it was
> just a few months ago.
> ...
PS. I think this also gives you a chance at making the whole problem much simpler. Instead of the hard question of "is this faulty", you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag wouldn't be better as "possibly faulty". That way you could use it with checksum errors too, possibly with settings as simple as "errors per minute" or "error percentage". As with the timeouts, you could have it off by default (or provide sensible defaults), and let administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty regardless of the type of fault, and take immediate proactive action to safeguard data (generally long before the device is actually faulted). A device triggering any of these flags would be enough for ZFS to start reading from (or writing to) other devices first, and should you get multiple failures, or problems on a non-redundant pool, you always just revert back to ZFS's current behaviour.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:
> I think we (the ZFS team) all generally agree with you. The current
> nevada code is much better at handling device failures than it was
> just a few months ago.
> ...
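(To make the proposal above concrete: none of these properties exist in ZFS today. The sketch below simply writes out the hypothetical knobs Ross lists as if they were ordinary pool properties, purely as illustration of how they might be set:

  # hypothetical syntax; these are NOT real zpool properties
  zpool set zfs-auto-device-timeout=5s tank
  zpool set zfs-auto-device-checksum-fail-limit-epm=20 tank
  zpool set zfs-auto-device-fail-delay=120s tank

Whether such tunables belong in ZFS or in FMA is exactly what the rest of the thread goes on to argue about.)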
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok

I think you're missing "won't write".

There's clearly a difference between "get data from a different copy", which you can fix by retrying against a different part of the redundant data, and writing data: the data which can't be written must be kept until the drive is faulted.

Casper
No, I count that as "doesn't return data ok", but my post wasn't very clear at all on that. Even for a write, the disk will return something to indicate that the action has completed, so that can also be covered by just those two scenarios, and right now ZFS can lock the whole pool up if it's waiting for that response.

My idea is simply to allow the pool to continue operation while waiting for the drive to fault, even if that's a faulty write. It just means that the rest of the operations (reads and writes) can keep working for the minute (or three) it takes for FMA and the rest of the chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of the pool, with just the outstanding writes for the drive left waiting. Then as soon as the device is faulted, the hot spare can kick in, and the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in this. For dual parity pools there's a massive benefit as your pool stays available, and your data is still well protected.

Ross

On Tue, Nov 25, 2008 at 10:44 AM, <Casper.Dik at sun.com> wrote:
>> My justification for this is that it seems to me that you can split
>> disk behavior into two states:
>> - returns data ok
>> - doesn't return data ok
>
> I think you're missing "won't write".
>
> There's clearly a difference between "get data from a different copy",
> which you can fix by retrying against a different part of the redundant
> data, and writing data: the data which can't be written must be kept
> until the drive is faulted.
>
> Casper
> My idea is simply to allow the pool to continue operation while
> waiting for the drive to fault, even if that's a faulty write. It
> just means that the rest of the operations (reads and writes) can keep
> working for the minute (or three) it takes for FMA and the rest of the
> chain to flag a device as faulty.

Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a single disk (that's only around 110 MB/s of sustained writes).

Casper
Hmm, true. The idea doesn't work so well if you have a lot of writes, so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log on the rest of the pool? Or temporarily stored somewhere else in the pool? Would it be an option to allow up to a certain amount of writes to be cached in this way while waiting for FMA, and only suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all writes to the log? As a further enhancement, might it be possible to commit writes to the working drives, and just leave the writes for the bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need the administrator to specify the behavior, as I can see several options depending on the raid level and that pool's priorities for data availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare comes online
slog - writes are cached to the slog device until full, then the pool reverts to default behavior (could this be the default with slog devices present?)
pool - writes are cached to the pool itself, up to a set maximum, and are written to the device or spare as soon as possible. This assumes a single parity pool with the other devices available. If the upper limit is reached, or another device goes faulty, the pool reverts to default behaviour.

Storing directly to the rest of the pool would probably want to be off by default on single parity pools, but I would imagine that it could be on by default on dual parity pools.

Would that be enough to allow writes to continue in most circumstances while the pool waits for FMA?

Ross

On Tue, Nov 25, 2008 at 10:55 AM, <Casper.Dik at sun.com> wrote:
>> My idea is simply to allow the pool to continue operation while
>> waiting for the drive to fault, even if that's a faulty write.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.
>
> Casper
On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

> Hey Jeff,
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?
> ...
>
> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay
>
> The first sets whether to use this feature, and configures the maximum
> time ZFS will wait for a response from a device before putting it in a
> "waiting" status.

The shortcomings of timeouts have been discussed on this list before. How do you tell the difference between a drive that is dead and a path that is just highly loaded? I seem to recall the argument strongly made in the past that making decisions based on a timeout alone can provoke various undesirable cascade effects.

> The second would be optional and is the maximum
> time ZFS will wait before faulting a device (at which point it's
> replaced by a hot spare).
>
> The reason I think this will work well with the FMA work is that you
> can implement this now and have a real improvement in ZFS
> availability. Then, as the other work starts bringing better modeling
> for drive timeouts, the parameters can be either removed, or set
> automatically by ZFS.
> ... it should be possible for ZFS to read or
> write from other devices while it's waiting for an 'official' result
> from any one faulty component.

Sounds good - devil, meet details, etc.

--Toby
> The shortcomings of timeouts have been discussed on this list before.
> How do you tell the difference between a drive that is dead and a path
> that is just highly loaded?

A path that is dead is either returning bad data, or isn't returning anything. A highly loaded path is by definition reading & writing lots of data. I think you're assuming that these are file level timeouts, when this would actually need to be much lower level.

> Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many of which will need looking at by somebody a lot more technical than myself. Despite that I still think this is a discussion worth having. So far I don't think I've seen any situation where this would make things worse than they are now, and I can think of plenty of cases where it would be a huge improvement.

Of course, it also probably means a huge amount of work to implement. I'm just hoping that it's not prohibitively difficult, and that the ZFS team see the benefits as being worth it.
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts. Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that bypassed the timeout. We could have waited it out,
> but 60 seconds is an eternity on stage.

I'm sorry, I didn't mean to sound offensive. Anyway, I think people should know that their drives can stall the system for minutes, "despite" ZFS. I mean: there is a lot written about how great ZFS is at recovery when a drive fails, but there's nothing regarding this problem. I know now it's not ZFS's fault; but I wonder how many people set up their drives with ZFS assuming that "as soon as something goes bad, ZFS will fix it".

Is there any way to test these cases other than smashing the drive with a hammer? Having a failover policy where the failover can't be tested sounds scary...
--
This message posted from opensolaris.org
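(One way to experiment without sacrificing hardware is a throwaway pool built on file-backed vdevs; it won't reproduce every firmware failure mode, but it does let you watch how a pool reacts to a device going away and coming back. A rough sketch, assuming a scratch directory with enough free space:

  # build a small mirrored test pool on file-backed vdevs
  mkfile 128m /var/tmp/d1 /var/tmp/d2
  zpool create testpool mirror /var/tmp/d1 /var/tmp/d2

  # take one side away, watch the pool degrade, then bring it back
  zpool offline testpool /var/tmp/d1
  zpool status testpool
  zpool online testpool /var/tmp/d1

  # clean up
  zpool destroy testpool

For the "drive goes silent" case discussed in this thread, pulling the power or data cable on a disk you don't care about is still the closest approximation, since offlining is an orderly administrative event rather than a timeout.)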
Ross Smith wrote:
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok
>
> And for the state where it's not returning data, you can again split
> that in two:
> - returns wrong data
> - doesn't return data

The state in discussion in this thread is "the I/O requested by ZFS hasn't finished after 60, 120, 180, 3600, etc. seconds". The pool is waiting (for device timeouts) to distinguish between the first two states. More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably impatient.

For the first state, the data is either correct (verified by the ZFS checksums, or ESUCCESS on write) or incorrect and retried.

> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:
zpool-device-timeout
zpool-device-timeout-fail-delay
with specific per-VDEV overrides possible:
vdev-device-timeout and vdev-device-fail-delay

This would allow but not require slower VDEVs to be tuned specifically for that case without hindering the default pool behavior on the local fast disks. Specifically, consider where I'm using mirrored VDEVs with one half over iSCSI, and want the iSCSI retry logic to still apply. Writes that failed while the iSCSI link is down would have to be resilvered, but at least reads would switch to the local devices faster.

Set them to the default magic "0" value to have the system use the current behavior of relying on the device drivers to report failures. Set to a number (in ms probably) and the pool would consider an I/O that takes longer than that as "returns invalid data".

When the FMA work discussed below is done, these could be augmented by the pool's "best heuristic guess" as to what the proper timeouts should be, which could be saved in (kstat?) vdev-device-autotimeout. If you set the timeout to the magic "-1" value, the pool would use vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take a long time to be given a deadline (now + (vdev-device-timeout ?: (zpool-device-timeout ?: forever)))* and to consider the I/O complete with whatever data has returned after that deadline: if that's a bunch of 0's in a read, it would have a bad checksum; a partially-completed write would have to be committed somewhere else.

Unfortunately, I'm not enough of a programmer to implement this.

--Joe

* with the -1 magic, it would be a slightly more complicated calculation.
On Tue, Nov 25, 2008 at 11:55:17AM +0100, Casper.Dik at Sun.COM wrote:
>> My idea is simply to allow the pool to continue operation while
>> waiting for the drive to fault, even if that's a faulty write. It
>> just means that the rest of the operations (reads and writes) can keep
>> working for the minute (or three) it takes for FMA and the rest of the
>> chain to flag a device as faulty.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.

If we're talking isolated, or even clumped-but-relatively-few bad sectors, then having a short timeout for writes and remapping should be possible to do without running out of memory to cache those writes. But...

...writes to bad sectors will happen when txgs flush, and depending on how bad sector remapping is done (say, by picking a new block address and changing the blkptrs that referred to the old one) that might mean redoing large chunks of the txg in the next one, which might mean that fsync() could be delayed an additional 5 seconds or so.

And even if that's not the case, writes to mirrors are supposed to be synchronous, so one would think that bad block remapping should be synchronous also, thus there must be a delay on writes to bad blocks no matter what -- though that delay could be tuned to be no more than a few seconds.

That points to a possibly decent heuristic on writes: vdev-level timeouts that result in bad block remapping, but if the queue of outstanding bad block remappings grows too large -> treat the disk as faulted and degrade the pool. Sounds simple, but it needs to be combined at a higher layer with information from other vdevs. Unplugging a whole jbod shouldn't necessarily fault all the vdevs on it -- perhaps it should cause pool operation to pause until the jbod is plugged back in... which should then cause those outstanding bad block remappings to be rolled back since they weren't bad blocks after all. That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out accidentally. What should be the failure mode of a jbod disappearing due to a pulled cable (or power supply failure)? A pause in operation (hangs)? Or faulting of all affected vdevs, and if you're mirrored across different jbods, incurring the need to re-silver later, with degraded operation for hours on end? I bet answers will vary. The best answer is to provide enough redundancy (multiple power supplies, multi-pathing, ...) to make such situations less likely, but that's not a complete answer.

Nico
--
On Tue, 25 Nov 2008, Ross Smith wrote:
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?

A stable and sane system never comes with "two brains". It is wrong to put this sort of logic into ZFS when ZFS is already depending on FMA to make the decisions and Solaris already has an infrastructure to handle faults. The more appropriate solution is that this feature should be in FMA.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Scara Maccai wrote:
> Anyway, I think people should know that their drives can stall the
> system for minutes, "despite" ZFS. ...
> Is there any way to test these cases other than smashing the drive
> with a hammer? Having a failover policy where the failover can't be
> tested sounds scary...

It is with this idea in mind that I wrote part of Chapter 1 of the book Designing Enterprise Solutions with Sun Cluster 3.0. For convenience, I also published chapter 1 as a Sun BluePrint Online article.
http://www.sun.com/blueprints/1101/clstrcomplex.pdf

False positives are very expensive in highly available systems, so we really do want to avoid them. One thing that we can do, and I've already (again[1]) started down the path to document, is to show where and how the various (common) timeouts are in the system. Once you know how sd, cmdk, dbus, and friends work you can make better decisions on where to look when the behaviour is not as you expect. But this is a very tedious path because there are so many different failure modes and real-world devices can react ambiguously when they fail.

[1] we developed a method to benchmark cluster dependability. The description of the benchmark was published in several papers, but is now available in the new IEEE book on Dependability Benchmarking. This is really the first book of its kind and the first steps toward making dependability benchmarks more mainstream. Anyway, the work done for that effort included methods to improve failure detection and handling, so we have a detailed understanding of those things for SPARC, in lab form. Expanding that work to cover the random-device-bought-at-Frys will be a substantial undertaking. Co-conspirators welcome.
-- richard
I disagree Bob, I think this is a very different function to that which
FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to handle
device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware. You might as well say that FMA should deal
with that too, since it's responsible for all hardware failures.

The role of ZFS is to manage the pool; availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 25 Nov 2008, Ross Smith wrote:
>>
>> Good to hear there's work going on to address this.
>>
>> What did you guys think to my idea of ZFS supporting a "waiting for a
>> response" status for disks as an interim solution that allows the pool
>> to continue operation while it's waiting for FMA or the driver to
>> fault the drive?
>
> A stable and sane system never comes with "two brains". It is wrong to put
> this sort of logic into ZFS when ZFS is already depending on FMA to make the
> decisions and Solaris already has an infrastructure to handle faults. The
> more appropriate solution is that this feature should be in FMA.
>
> Bob
> ======================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
On Tue, 25 Nov 2008, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliability of the pool while still using FMA to
> handle device failures?

If FMA does not currently have knowledge of the redundancy model but
needs it to make well-informed decisions, then it should be updated to
incorporate this information. FMA sees all the hardware in the system,
including devices used for UFS and other types of filesystems, and even
tape devices. It is able to see hardware at a much more detailed level
than ZFS does. ZFS only sees an abstracted view of the hardware. If an
HBA or part of the backplane fails, FMA should be able to determine the
failing area (at least as far out as it can see based on available
paths), whereas all ZFS knows is that it is having difficulty getting
there from here.

> The flip side of the argument is that ZFS already checks the data
> returned by the hardware. You might as well say that FMA should deal
> with that too since it's responsible for all hardware failures.

If bad data is returned, then I assume that there is a peg to FMA's
error statistics counters.

> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.

Too much complexity tends to clog up the works and keep other areas of
ZFS from being enhanced expediently. ZFS would soon become a chunk of
source code that no mortal could understand, and as such it would be
put under "maintenance" with no more hope of moving forward and no
ability to address new requirements.

A rational system really does not want to have multiple brains.
Otherwise some parts of the system will think that the device is fine
while other parts believe that it has failed. None of us want to deal
with an insane system like that.

There is also the matter of fault isolation. If a drive can not be
reached, is it because the drive failed, or because an HBA supporting
multiple drives failed, or a cable got pulled? This sort of information
is extremely important for large, reliable systems.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
It's hard to tell exactly what you are asking for, but this sounds
similar to how ZFS already works. If ZFS decides that a device is
pathologically broken (as evidenced by vdev_probe() failure), it knows
that FMA will come back and diagnose the drive as faulty (because we
generate a probe_failure ereport). So ZFS pre-emptively short-circuits
all I/O and treats the drive as faulted, even though the diagnosis
hasn't come back yet. We can only do this for errors that have a 1:1
correspondence with faults.

- Eric

On Tue, Nov 25, 2008 at 04:10:13PM +0000, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliability of the pool while still using FMA to
> handle device failures?
>
> The flip side of the argument is that ZFS already checks the data
> returned by the hardware. You might as well say that FMA should deal
> with that too since it's responsible for all hardware failures.
>
> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.
>
>
> On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
> > On Tue, 25 Nov 2008, Ross Smith wrote:
> >>
> >> Good to hear there's work going on to address this.
> >>
> >> What did you guys think to my idea of ZFS supporting a "waiting for a
> >> response" status for disks as an interim solution that allows the pool
> >> to continue operation while it's waiting for FMA or the driver to
> >> fault the drive?
> >
> > A stable and sane system never comes with "two brains". It is wrong to put
> > this sort of logic into ZFS when ZFS is already depending on FMA to make the
> > decisions and Solaris already has an infrastructure to handle faults. The
> > more appropriate solution is that this feature should be in FMA.
> >
> > Bob
> > ======================================
> > Bob Friesenhahn
> > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> >
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
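A toy illustration of the behaviour Eric describes -- probe the device,
post an ereport for FMA, and short-circuit further I/O immediately when
the probe itself fails -- is sketched below. It is not the actual ZFS
code path; all names (toy_vdev_t, toy_probe(), toy_post_probe_failure())
are invented, and the real logic lives in the kernel around vdev_probe().

/*
 * Illustrative only -- not ZFS source.  When the probe itself fails,
 * post the ereport for FMA but fault the device right away instead of
 * waiting for the diagnosis to come back.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    bool faulted;
} toy_vdev_t;

/* Stand-in for vdev_probe(): try to read and write the device labels. */
static bool
toy_probe(toy_vdev_t *vd)
{
    (void)vd;
    return false;              /* pretend the label I/O failed outright */
}

/* Stand-in for posting a probe-failure ereport for FMA to diagnose. */
static void
toy_post_probe_failure(toy_vdev_t *vd)
{
    printf("probe failure ereport posted for %s\n", vd->name);
}

static void
toy_handle_io_error(toy_vdev_t *vd)
{
    if (!toy_probe(vd)) {
        /*
         * The error corresponds 1:1 with a fault, so short-circuit
         * further I/O now; FMA's diagnosis will only confirm it.
         */
        toy_post_probe_failure(vd);
        vd->faulted = true;
    }
    /*
     * Otherwise keep the device open and let FMA decide whether the
     * accumulated error telemetry amounts to a fault.
     */
}

int
main(void)
{
    toy_vdev_t vd = { "c1t0d0", false };   /* placeholder device name */
    toy_handle_io_error(&vd);
    printf("%s faulted: %s\n", vd.name, vd.faulted ? "yes" : "no");
    return 0;
}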
>>>>> "rs" == Ross Smith <myxiplx at googlemail.com> writes:
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

rs> I disagree Bob, I think this is a very different function to
rs> that which FMA provides.

I see two problems.

(1) FMA doesn't seem to work very well, and was used as an excuse to
    keep proper exception handling out of ZFS for a couple of years, so
    I'm sort of... skeptical whenever it's brought up as a panacea.

(2) The FMA model of collecting telemetry, taking it into user space,
    chin-strokingly contemplating it for a while, then decreeing a
    diagnosis, is actually a rather limited one. I can think of two
    kinds of limit:

    (a) You're diagnosing the pool FMA is running on. FMA is on the
        root pool, but the root pool won't unfreeze until FMA diagnoses
        it. In practice it's much worse, because problems in one pool's
        devices can freeze all of ZFS, even other pools. Or if the
        system is NFS-rooted and also exporting ZFS filesystems over
        NFS, maybe all of NFS freezes? Problems like that, knocking out
        FMA. Diagnosis in kernel is harder to knock out.

    (b) Calls are sleeping uninterruptibly in the path that returns
        events to FMA. ``Call down into the controller driver, wait for
        it to return success or failure, then count the event and call
        back to FMA as appropriate. If something's borked, FMA will
        eventually return a diagnosis.'' This plan is useless if the
        controller just freezes. FMA never sees anything. You are
        analyzing faults, yes, but you can only do it with hindsight.
        When do you do the FMA callback? To implement this timeout,
        you'd have to do a callback before and after each I/O, which is
        obviously too expensive.

Likewise, when FMA returns the diagnosis, are you prepared to act on
it? Or are you busy right now, and you're going to act on it just as
soon as that controller returns success or failure? You can't abstract
the notion of time out of your diagnosis. Trying to compartmentalize it
interferes with working it into low-level event loops in a way that's
sometimes needed. It's not a matter of where things taxonomically
belong, where it feels clean to put some functionality in your
compartmentalized layered tower. Certain things just aren't achievable
from certain places.

nw> If we're talking isolated, or even clumped-but-relatively-few
nw> bad sectors, then having a short timeout for writes and
nw> remapping should be possible

I'm not sure I understand the state machine for the remapping plan,
but... I think your idea is: try to write to some spot on the disk. If
it takes too long, cancel the write, and try writing somewhere else
instead. Then do bad-block remapping: fix up all the pointers for the
new location, mark the spot that took too long as poisonous, all that.

I don't think it'll work.

First, you can't cancel the write. Once you dispatch a write that
hangs, you've locked up, at a minimum, the drive trying to write. You
don't get the option of remapping and writing elsewhere, because the
drive has stopped listening to you. Likely, you've also locked up the
bus (if the drive's on PATA or SCSI), or maybe the whole controller.
(This is, IMHO, the best reason for laying out a RAID to survive a
controller failure -- interaction with a bad drive could freeze a whole
controller.)

Even if you could cancel the write, when do you cancel it?
If you can learn your drive and controller so well that you convince
them to ignore you for 10 seconds instead of two minutes when they hit
a block they can't write, you've got approximately the same problem,
because you don't know where the poison sectors are. You'll probably
hit another one. Even a ten-second write means the drive's performance
is shot by almost three orders of magnitude -- it's not workable.

Finally, this approach interferes with diagnosis. The drives have their
own retry state machine. If you start muddling all this ad-hoc stuff on
top of it, you can't tell the difference between drive failures,
cabling problems, and controller failures. You end up with normal
thermal-recalibration events being treated as some kind of ``spurious
late read'' and inventing all these strange unexplained failure terms,
which makes it impossible to write a paper like the NetApp or Google
papers on UNCs we used to cite in here all the time, because your
failure statistics no longer correspond to a single layer of the
storage stack and can't be compared to others' statistics.

Also, remember that we suspect and wish to tolerate drives that operate
many standard deviations outside their specification, even when they're
not broken or suspect or about to break. There are two reasons. First,
we think they might do it. Second, otherwise you can't collect
performance statistics you can compare with others'. That's why the
added failure handling I suggested is only to ignore drives -- either
for a little while, or permanently. Merely ignoring a drive, without
telling the drive you're ignoring it, doesn't interfere with collecting
statistics from it.

The two queues inside the drive (retryable and deadline) would let you
do this bad-block remapping, but no drive implements it, and it's
probably impossible to implement because of the sorts of things drives
do while ``retrying''. I described the drive-QoS idea to explain why
this B_FAILFAST-ish plan of supervising the drive's recovery behavior,
or any plan involving ``cancelling'' CDBs, is never going to work.

Here is one variant of this remapping plan I think could work, which
somewhat preserves the existing storage-stack layering (a rough sketch
of it appears after this message):

 * Add a timeout to B_FAILFAST CDBs above the controller driver -- a
   short one, like a couple of seconds.

 * When a drive is busy on a non-B_FAILFAST transaction for longer than
   the B_FAILFAST timeout, walk through the CDB queue and instantly
   fail all the B_FAILFAST transactions, without even sending them to
   the drive.

 * When a drive blows a B_FAILFAST timeout, admit no more B_FAILFAST
   transactions until it successfully completes a non-B_FAILFAST
   transaction. If the drive is marked timeout-blown, and no
   transactions are queued for it, wait 60 seconds and then make up a
   fake transaction for it, like ``read one sector in the middle of the
   disk.''

I like the vdev-layer ideas better than the block-layer ideas, though.

nw> What should be the failure mode of a jbod disappearing due to
nw> a pulled cable (or power supply failure)? A pause in
nw> operation (hangs)? Or faulting of all affected vdevs, and if
nw> you're mirrored across different jbods, incurring the need to
nw> re-silver later, with degraded operation for hours on end?

The resilvering should only include things written during the outage,
so the degraded operation will last some time proportional to the
outage. Resilvering is already supposed to work this way. The argument,
I think, will be over the idea of auto-onlining things.
My opinion: if you are dealing with failure by deciding to return
success to fsync() with fewer copies of the data written, then getting
back to normal should require either a spare rebuild or manually
issuing 'zpool clear'. Certain kinds of rocking behavior -- like
changes to the mirror round-robin, or delaying writes of non-fsync()
data -- are okay, but rocking back and forth between redundancy states
automatically during normal operation is probably unacceptable.

The counter-opinion, I suppose, might be that we get more MTTDL by
writing as quickly as possible to as many places as possible, so
auto-onlining is good. But I don't think so.
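The B_FAILFAST variant sketched in the preceding message boils down to
a small per-device state machine. A rough illustration follows; every
name, field, and timeout in it (toy_disk_t, failfast_admit(), and so
on) is invented, and this is not how sd/ssd or any other Solaris driver
actually implements B_FAILFAST.

/*
 * Rough sketch of the proposed B_FAILFAST variant: a short timeout for
 * failfast CDBs above the controller driver, instant failure of queued
 * failfast CDBs while the drive is stuck, and no failfast admission
 * until a normal command completes again.  Invented names throughout.
 */
#include <stdbool.h>
#include <stdio.h>

#define FAILFAST_TIMEOUT_MS  2000   /* "a couple of seconds" */
#define PROBE_RETRY_MS      60000   /* idle re-probe interval */

typedef struct {
    bool timeout_blown;            /* drive blew a failfast timeout */
    unsigned busy_ms;              /* how long the current CDB has been out */
} toy_disk_t;

/* Should this failfast request be sent to the drive at all? */
static bool
failfast_admit(const toy_disk_t *d)
{
    if (d->timeout_blown)
        return false;              /* fail instantly, don't even queue it */
    if (d->busy_ms > FAILFAST_TIMEOUT_MS)
        return false;              /* drive stuck on a normal CDB: fail now */
    return true;
}

/* A non-failfast command completed successfully: trust the drive again. */
static void
normal_cdb_completed(toy_disk_t *d)
{
    d->timeout_blown = false;
    d->busy_ms = 0;
}

/* If the drive is marked timeout-blown and has been idle long enough,
 * issue a fake one-sector read to see whether it has recovered. */
static bool
should_probe(const toy_disk_t *d, unsigned idle_ms)
{
    return d->timeout_blown && idle_ms >= PROBE_RETRY_MS;
}

int
main(void)
{
    toy_disk_t d = { true, 0 };    /* drive has already blown a timeout */

    printf("admit failfast read? %s\n", failfast_admit(&d) ? "yes" : "no");
    printf("probe after 60s idle? %s\n", should_probe(&d, 60000) ? "yes" : "no");

    normal_cdb_completed(&d);      /* a normal CDB finally succeeded */
    printf("admit failfast read now? %s\n", failfast_admit(&d) ? "yes" : "no");
    return 0;
}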
On Wed, 26 Nov 2008, Miles Nordin wrote:
>
> (2) The FMA model of collecting telemetry, taking it into user space,
>     chin-strokingly contemplating it for a while, then decreeing a
>     diagnosis, is actually a rather limited one. I can think of two
>     kinds of limit:
>
>     (a) You're diagnosing the pool FMA is running on. FMA is on the
>         root pool, but the root pool won't unfreeze until FMA
>         diagnoses it.

I did not have time to read most of your lengthy thesis, but I agree
that FMA is useless if the motherboard catches fire.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
> (2) The FMA model of collecting telemetry, taking it into user space,
>     chin-strokingly contemplating it for a while, then decreeing a
>     diagnosis, is actually a rather limited one. I can think of two
>     kinds of limit:

As mentioned previously, this is not an accurate description of what's
going on. FMA allows diagnosis to happen at the detector when the
telemetry is conclusive and cross-domain or predictive analysis isn't
required. This is exactly what ZFS does on recent Nevada builds. If a
drive is pathologically broken (i.e. a reopen fails, or reads and
writes to the label fail), it will *immediately* fail the drive and not
wait for any further diagnosis from FMA. For drives that randomly fail
I/Os or take a long time, but otherwise respond to basic requests, ZFS
is often in no better position to perform a diagnosis in the kernel.
And as of build 101, ZFS behaves much better in these circumstances by
not aggressively retrying commands before exhausting all other options.

Are you running your experiments on build 101 or later? And what
experiments are you running? Drawing conclusions from previous
experience or reports is basically pointless given the amount of change
that has occurred recently (Jeff's putback wasn't nicknamed "SPA 3.0"
for nothing). While there are no doubt more rough edges, we have
incorporated much of the previous feedback into new behavior that
should provide a much improved experience.

- Eric

P.S. I'm also not sure that B_FAILFAST behaves in the way you think it
does. My reading of sd.c seems to imply that much of what you suggest
is actually how it currently behaves, but you should probably bring up
the issue on storage-discuss, where you will find more experts in this
area.

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

es> Are you running your experiments on build 101 or later?

No. Aside from that quick one for copies=2, I'm pretty bad about
running well-designed experiments, and I do have old builds. I need to
buy more hardware.

It's hard to know how to get the most stable system. I bet it'll be a
year before this b101 stuff makes it into stable Solaris, yet the
bleeding-edge improvements are all stability-related, so for mostly-ZFS
jobs maybe it's better to run SXCE than sol10 in production. I suppose
I should be happy about that, since it means more people will have some
source. :)

es> P.S. I'm also not sure that B_FAILFAST behaves in the way you
es> think it does. My reading of sd.c seems to imply that much of
es> what you suggest is actually how it currently behaves,

Yeah, I got a private email referring me to the spec for
PSARC/2002/126, which already included both pieces I hoped for (killing
queued CDBs, and statefully tracking each device as failed/good), so I
take back what I said about B_FAILFAST being useless -- it should be
able to help the ZFS availability problems we've seen.

The PSARC case says B_FAILFAST is implemented in the ``disk driver'',
which, AIUI, is above the controller, just as I hoped, but there is
more than one ``disk driver'', so the B_FAILFAST stuff is not factored
out into one spot the way a vdev-level system would be, but rather
punted downwards and copy-pasted into sd, ssd, dad, ..., so whatever
experience you get with it isn't necessarily portable to disks with a
different kind of attachment.

I still think the vdev-layer logic could make better decisions by using
more than the 1 bit of information per device, but maybe 1-bit
B_FAILFAST is enough to make me accept the shortfall as an
arguable-feature rather than a unanimous-bug. Also, if it can fix my
(1) and (2) with FMA, then maybe the gap between B_FAILFAST and real
NetApp-like drive diagnosis can be closed partly in userspace the way
developers seem to want.

The problems this doesn't cover are write-related:

 * What should we do about implicit and explicit fsync()s where all the
   data is already on stable storage, but not with full redundancy --
   one device won't finish writing? I think there should not be
   transparent recovery from this, though maybe others disagree. But
   pool-level failmode doesn't settle the issue:

   (a) _When_ will you take the failure action (if failmode != wait)?
       The property says *what* to do, not *when* to do it.

   (b) There isn't any vdev-level failure, only device-level, so it's
       not appropriate to consult the failmode property in the first
       place -- the situation is different. The question is, do we keep
       trying, or do we transition the device to FAULTED and the vdev
       to DEGRADED so that fsync()s can proceed without that device and
       hot-spare resilver kicks in?

   (c) Inside the time interval between when the device starts writing
       slowly and when you take the (b) action, how well can you
       isolate the failure? For example, can you ensure that read-only
       access remains instantaneous, even though atime updates involve
       writing, even though these 5-second txg flushes are blocked, and
       even though the admin might (gasp!) type 'zpool status' -- or
       even a label-writing command like 'zpool attach'? Or will one of
       those three things cause a pool-wide or ZFS-wide hang that
       blocks read access which could theoretically work?

 * Commands like zpool attach, detach, replace, offline, export:

   (a) should not be uninterruptibly hangable.
   (b) Problems in one pool should not spill over into another.

   (c) And finally, they should be forcible even when they can't write
       everything they'd like to, so that rebooting isn't a necessary
       move in certain kinds of failure-recovery pool gymnastics.

I expect there's some quiet work on this in b101 also -- at least
someone said 'zpool status' isn't supposed to hang anymore? So I'll
have to try it out, but B_FAILFAST isn't enough to settle the whole
issue, even modulo the marginal performance improvement that more
ambitiously wacky schemes might promise us.