Hi all,

I'd like to raise two points about ZFS that I think are a must before even
trying to use it in production:

1) ZFS must stop forcing kernel panics!
As you know, ZFS causes a kernel panic when a corrupted zpool is found, when
it's unable to reach a device, and so on. We need it to just fail with an
error message; please stop crashing the kernel.

2) We need a way to recover a corrupted ZFS pool by throwing away the last
incomplete transactions. Please give us "zfsck" :)

Waiting for comments,
gino

This message posted from opensolaris.org
There was some discussion of the "always panic for fatal pool failures"
issue in April 2006, but I haven't seen whether an actual RFE was generated.

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/017276.html
On Tue, Apr 10, 2007 at 12:48:49AM -0700, Gino wrote:
> Hi All
>
> I'd like to raise two points about ZFS that I think are a must before
> even trying to use it in production:
>
> 1) ZFS must stop forcing kernel panics!
> As you know, ZFS causes a kernel panic when a corrupted zpool is found
> or if it's unable to reach a device and so on...
> We need to have it just fail with an error message, but please stop
> crashing the kernel.

This is:

6322646 ZFS should gracefully handle all devices failing (when writing)

Which is being worked on. Using a redundant configuration prevents this
from happening.

> 2) We need a way to recover a corrupted ZFS pool, trashing the last
> incomplete transactions.
> Please give us "zfsck" :)

Please see the ZFS FAQ at:

http://www.opensolaris.org/os/community/zfs/faq/#whynofsck

Writing such a tool is effectively impossible. For the one known
corruption bug we've encountered (and since fixed), we provided the
'zfs_recover' /etc/system switch, but it only works for this particular
bug. Without understanding the underlying pathology it's impossible to
"fix" a ZFS pool.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
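(For anyone searching the archives later: the switch Eric refers to is set
in /etc/system. Whether it applies, and exactly what it relaxes, depends on
your build, so verify it against your kernel version first. A sketch:)

```
* /etc/system fragment (comment lines in this file start with '*').
* Enables ZFS recovery mode at the next boot; only relevant to the one
* known, already-fixed corruption bug mentioned above.
set zfs:zfs_recover = 1
```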
>> please stop crashing the kernel.
>
> This is:
>
> 6322646 ZFS should gracefully handle all devices failing (when writing)

That's only one cause of panics.

At least two of gino's panics appear due to corrupted space maps, for
instance. I think there may also still be a case where a failure to read
metadata during a transaction commit leads to a panic, too. Maybe that
one's been fixed, or maybe it will be handled by the above bug.

Maybe someone needs to file a bug/RFE to remove all panics from ZFS, at
least in non-debug builds? The QFS approach is to panic when an
inconsistency is found on debug builds, but to return an appropriate error
code on release builds, which seems reasonable.

I/O errors, of course, should never lead to a panic. I think we [you] fixed
all of those cases in UFS, and QFS, long ago.

Anton
> Without understanding the underlying pathology it's impossible to "fix"
> a ZFS pool.

Sorry, but I have to disagree with this.

The goal of fsck is not to bring a file system into the state it "should"
be in had no errors occurred. The goal, rather, is to bring a file system
to a self-consistent state. Ideally, data should be recoverable when it's
believed to be good (ZFS has a big advantage here, since the checksums can
be used to validate block pointers).

The ZFS on-disk data structure is basically a tree. zfsck could fairly
easily walk the tree and ensure that, for instance, pools are at the top
level; space maps match allocated blocks; block pointers from multiple
files don't overlap; file lengths match their allocation; ACLs are not
corrupted; compressed data is not damaged; directories are in the proper
format; etc.

This might be impractical for a large file system, of course. It might be
easier to have a 'zscavenge' that would recover data, where possible, from
a corrupted file system. But there should be at least one of these. Losing
a whole pool due to the corruption of a couple of blocks of metadata is a
Bad Thing.
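To make the idea concrete, here is a toy sketch in Python of the kind of
walk a "zfsck" could do. The block layout, field names, and space-map model
are invented for illustration; real ZFS on-disk structures (blkptr_t, space
maps, the MOS) are far more involved.

```python
import hashlib

# Toy model: a "block" is (stored_checksum, payload, list of child ids).
# All names here are illustrative, not actual ZFS on-disk structures.
def checksum(data):
    return hashlib.sha256(data).hexdigest()

def check_tree(blocks, root_id, allocated):
    """Walk the block tree from root_id, verifying each block's checksum,
    then cross-check the reachable set against the space map (allocated)."""
    errors = []
    reachable = set()
    stack = [root_id]
    while stack:
        bid = stack.pop()
        if bid in reachable:
            errors.append(f"block {bid}: referenced twice")
            continue
        reachable.add(bid)
        blk = blocks.get(bid)
        if blk is None:
            errors.append(f"block {bid}: missing")
            continue
        stored_sum, payload, children = blk
        if checksum(payload) != stored_sum:
            errors.append(f"block {bid}: bad checksum")
        stack.extend(children)
    # Space-map consistency: allocated blocks must equal reachable ones.
    for bid in sorted(allocated - reachable):
        errors.append(f"block {bid}: allocated but unreferenced (leak)")
    for bid in sorted(reachable - allocated):
        errors.append(f"block {bid}: referenced but not allocated")
    return errors
```

Even this toy shows both sides of the argument: the checks themselves are
mechanical, but deciding how to repair a block that fails them is exactly
where the "underlying pathology" question comes back.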
On Tue, Apr 10, 2007 at 09:43:39PM -0700, Anton B. Rang wrote:
>
> That's only one cause of panics.
>
> At least two of gino's panics appear due to corrupted space maps, for
> instance. I think there may also still be a case where a failure to
> read metadata during a transaction commit leads to a panic, too. Maybe
> that one's been fixed, or maybe it will be handled by the above bug.

The space map bugs should have been fixed as part of:

6458218 assertion failed: ss == NULL

Which went into Nevada build 60. There are several different pathologies
that can result from this bug, and I don't know if the panics are from
before or after this fix. I hope folks from the ZFS team are
investigating, but I can't speak for everyone.

> Maybe someone needs to file a bug/RFE to remove all panics from ZFS,
> at least in non-debug builds? The QFS approach is to panic when
> inconsistency is found on debug builds, but return an appropriate
> error code on release builds, which seems reasonable.

In order to do this we need to fix 6322646 first, which addresses the
issue of 'backing out' of a transaction once we're down in the ZIO layer
discovering these problems. It doesn't matter whether it's due to an I/O
error or a space map inconsistency if we can't propagate the error.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> 6322646 ZFS should gracefully handle all devices failing (when writing)
>
> Which is being worked on. Using a redundant configuration prevents this
> from happening.

What do you mean by "redundant"? All our servers have 2 or 4 HBAs, 2 or 4
FC switches, and storage arrays with redundant controllers. We used only
RAID10 zpools, but we still had them corrupted.

> http://www.opensolaris.org/os/community/zfs/faq/#whynofsck
>
> Writing such a tool is effectively impossible. For the one known
> corruption bug we've encountered (and since fixed), we provided the
> 'zfs_recover' /etc/system switch, but it only works for this
> particular bug. Without understanding the underlying pathology it's
> impossible to "fix" a ZFS pool.

I think this is a very important drawback of ZFS.
> > 1) ZFS must stop forcing kernel panics!
> > As you know, ZFS causes a kernel panic when a corrupted zpool is
> > found or if it's unable to reach a device and so on...
> > We need to have it just fail with an error message, but please stop
> > crashing the kernel.
>
> This is:
>
> 6322646 ZFS should gracefully handle all devices failing (when writing)

With S10U3 we are still getting kernel panics when trying to import a
corrupted zpool (RAID10).

gino
> On Tue, Apr 10, 2007 at 09:43:39PM -0700, Anton B. Rang wrote:
> >
> > At least two of gino's panics appear due to corrupted space maps,
> > for instance. I think there may also still be a case where a failure
> > to read metadata during a transaction commit leads to a panic, too.
> > Maybe that one's been fixed, or maybe it will be handled by the
> > above bug.
>
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix.

If it can help you: we are able to corrupt a zpool on snv_60 by doing the
following a few times:

- create a raid10 zpool (dual-path LUNs)
- put a heavy write load on that zpool
- disable FC ports on both of the FC switches

Each time we get a kernel panic, probably because of 6322646, and
sometimes we get a corrupted zpool.

gino
Gino writes:
> > 6322646 ZFS should gracefully handle all devices failing (when writing)
> >
> > Which is being worked on. Using a redundant configuration prevents
> > this from happening.
>
> What do you mean by "redundant"? All our servers have 2 or 4 HBAs,
> 2 or 4 FC switches, and storage arrays with redundant controllers.
> We used only RAID10 zpools, but we still had them corrupted.

"Redundant" from the viewpoint of ZFS, so either a ZFS mirror or raid-z.
The point of the bug is to better handle failures on devices in
non-redundant pools. For redundant pools, ZFS is able to self-heal
problems as they arise. If you maintain redundancy only at the storage
level, then it's harder for ZFS to deal with problems. We should still
behave better than we do now, thus 6322646.

Can you post your zpool status output?

-r
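To illustrate the distinction Roch is drawing (device names are made up;
this is a sketch of the two configurations, not a recommendation):

```
# ZFS-managed redundancy: ZFS holds both copies, so it can detect a bad
# block via its checksum and self-heal it from the other side.
zpool create tank mirror c1t0d0 c2t0d0

# Array-managed redundancy only: the SAN mirrors internally, but ZFS
# sees a single device and has no second copy to repair from.
zpool create tank c3t0d0
```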
Anton B. Rang wrote:
> This might be impractical for a large file system, of course. It might
> be easier to have a 'zscavenge' that would recover data, where
> possible, from a corrupted file system. But there should be at least
> one of these. Losing a whole pool due to the corruption of a couple of
> blocks of metadata is a Bad Thing.

That could be handy for any number of data-transport,
borked-system-recovery, and even some forensic-like tasks:

zscavenge badpool | zfs recv

So, suppose a user has a few hundred GB of data that they'd like copied
directly onto our fileserver. They bring me a ZFS-formatted external USB
drive. Instead of mounting it and messing with their data, I zscavenge
/dev/usbdevice, and then write that into a pool that I'm comfortable
messing with.

It works just as well for someone trying to recover a truly borked
system. One could recover the data without making any changes whatsoever
to the drive, so that I can put everything back the way I found it; in
case I can't fix it, the next person who tries to fix it has a clean
slate.

-Luke
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix. I hope folks from the ZFS
> team are investigating, but I can't speak for everyone.

This week we'll start one of our "burn tests" on snv60. I'll let you
know. For the moment we are able to panic ZFS very easily. We'll see if
we are able to corrupt a zpool again.

> In order to do this we need to fix 6322646 first, which addresses the
> issue of 'backing out' of a transaction once we're down in the ZIO
> layer discovering these problems. It doesn't matter whether it's due
> to an I/O error or a space map inconsistency if we can't propagate
> the error.

Any ETA for 6322646 and 6417779?

6417779 ZFS: I/O failure (write on ...)
6322646 ZFS should gracefully handle all devices failing (when writing)

Isn't there a way to increase the "timeout" so that Solaris just hangs
when a LUN is not available, and have it retry the I/O more times?

gino
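On the timeout question: the Solaris disk drivers do expose per-command
timeout tunables, though raising them only postpones the failure; it
cannot make an unreachable LUN writable. As a sketch (tunable names and
defaults should be verified against your driver and release before use):

```
* /etc/system fragment; values are in seconds.
* Raise the sd driver's per-command timeout (default 60s):
set sd:sd_io_time = 120
* For FC disks attached through the ssd driver, the analogous tunable:
set ssd:ssd_io_time = 120
```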
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix. I hope folks from the ZFS
> team are investigating, but I can't speak for everyone.
>
> In order to do this we need to fix 6322646 first, which addresses the
> issue of 'backing out' of a transaction once we're down in the ZIO
> layer discovering these problems.

Now this is scary. Judging from the descriptions, it is possible that we
might lose data in ZFS, and/or end up with a corrupted zpool that panics
the kernel, if during a write operation ZFS loses its connection to the
underlying hardware (for example on a power failure)?

I've rarely seen that since we got UFS w/ logging in Solaris 7 or so.
And even with UFS, there is always fsck, allowing us to bring the system
back to a consistent state for recovering from a previous backup.
Is ZFS really supposed to be more reliable than UFS w/ logging, for
example, in a single-disk, root file system scenario?
On Mon, Apr 23, 2007 at 08:49:35AM -0700, Ivan Wang wrote:
>
> Now this is scary. Judging from the descriptions, it is possible that
> we might lose data in ZFS, and/or end up with a corrupted zpool that
> panics the kernel, if during a write operation ZFS loses its
> connection to the underlying hardware (for example a power failure)?

No, you will not lose data. There is a chance we will panic if you fail
a write in an unreplicated pool, but you will not lose data as a result.

> I've rarely seen that since we got UFS w/ logging in Solaris 7 or so.
> And even with UFS, there is always fsck, allowing us to bring the
> system back to a consistent state for recovering from a previous
> backup.
>
> Is ZFS really supposed to be more reliable than UFS w/ logging, for
> example, in a single-disk, root file system scenario?

Yes. The failure to cope with a failed write in an unreplicated pool
affects the availability of the system (because we panic), but not the
underlying reliability of the filesystem.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> > Is ZFS really supposed to be more reliable than UFS w/ logging, for
> > example, in a single-disk, root file system scenario?
>
> Yes. The failure to cope with a failed write in an unreplicated pool
> affects the availability of the system (because we panic), but not
> the underlying reliability of the filesystem.

Eric,
we had 5 corrupted zpools (on different servers and different SANs)!
With Solaris up to S10U3 and Nevada up to snv59, we are able to corrupt
a zpool easily just by disconnecting one or more of its LUNs a few times
under high I/O load.

We are testing snv60 now.

gino
On Mon, Apr 23, 2007 at 09:38:47AM -0700, Gino wrote:
>
> we had 5 corrupted zpools (on different servers and different SANs)!
> With Solaris up to S10U3 and Nevada up to snv59, we are able to
> corrupt a zpool easily just by disconnecting one or more of its LUNs
> a few times under high I/O load.
>
> We are testing snv60 now.

As I've mentioned before, I believe you were tripping over the space map
bug (6458218), which was fixed in build 60 and will appear in S10u4. Let
us know if you are able to reproduce the problem on build 60 or later.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> As I've mentioned before, I believe you were tripping over the space
> map bug (6458218), which was fixed in build 60 and will appear in
> S10u4. Let us know if you are able to reproduce the problem on build
> 60 or later.

Eric,
we've done our first test with snv60. We moved over 40TB of data between
4 zpools, and in the meantime we've taken about 100000 snapshots and
forced 50 panics by disabling ports on the FC switches. None of the
pools have been corrupted!

Also, we found snv60 MUCH more stable than S10U3.

gino
Hello Gino,

Wednesday, April 11, 2007, 10:43:17 AM, you wrote:

>> The space map bugs should have been fixed as part of:
>>
>> 6458218 assertion failed: ss == NULL
>>
>> Which went into Nevada build 60. There are several different
>> pathologies that can result from this bug, and I don't know if the
>> panics are from before or after this fix.

G> If it can help you: we are able to corrupt a zpool on snv_60 by doing
G> the following a few times:
G> - create a raid10 zpool (dual-path LUNs)
G> - put a heavy write load on that zpool
G> - disable FC ports on both of the FC switches
G> Each time we get a kernel panic, probably because of 6322646, and
G> sometimes we get a corrupted zpool.

Is this still the case? Has the pool-corruption problem been addressed
and hopefully solved?

--
Best regards,
Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
Hello Robert,

now we are using snv60 and snv67, moving many TB of data every day, and
no corruption problems any more.

Unfortunately the following problems force us to stay with UFS for our
production servers:

6417779 ZFS: I/O failure (write on ...)
6322646 ZFS should gracefully handle all devices failing (when writing)

Also, we found that our backup servers using ZFS, after 40-60 days of
uptime, start to show system CPU time > 50%, often with one or two CPUs
at 100%. After a reboot, system CPU time goes back to 7-11%.

gino
Hello Gino,

Monday, August 13, 2007, 8:51:18 AM, you wrote:

G> Hello Robert,
G> now we are using snv60 and snv67, moving many TB of data every day,
G> and no corruption problems any more.

Good, thanks for the info.

G> Unfortunately the following problems force us to stay with UFS for
G> our production servers:
G> 6417779 ZFS: I/O failure (write on ...)
G> 6322646 ZFS should gracefully handle all devices failing (when writing)

So are you mounting UFS with onerr=lock?

--
Best regards,
Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
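A small aside for anyone trying this: the UFS mount option is spelled
'onerror' (values: panic, lock, umount; panic is the default). For
example, in /etc/vfstab (device and mount point are illustrative):

```
#device             device to fsck      mount    FS   fsck  mount    mount
#to mount                               point    type pass  at boot  options
/dev/dsk/c0t0d0s5   /dev/rdsk/c0t0d0s5  /export  ufs  2     yes      logging,onerror=lock
```

With onerror=lock, an internal UFS inconsistency locks the filesystem
(applications get errors) instead of panicking the machine, which is
exactly the behavior this thread asks of ZFS.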