Should I file an RFE for this addition to ZFS?  The concept would be to run ZFS on a file server, exporting storage to an application server where ZFS also runs on top of that storage.  All storage management would take place on the file server, where the physical disks reside.  The application server would still perform end-to-end error checking but would notify the file server when it detected an error.

This configuration has several advantages over the current recommendations.  One recommendation is to export raw disks from the file server, but some storage devices, including (I assume) Sun's 7000 series, are unable to do this.  Another is to build two RAID devices on the file server and mirror them with ZFS on the application server.  That is also sub-optimal: it doubles the space requirement and still does not take full advantage of ZFS error checking.  Splitting the responsibilities works around these problems.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Gary Mills wrote:
> Should I file an RFE for this addition to ZFS?  The concept would be
> to run ZFS on a file server, exporting storage to an application
> server where ZFS also runs on top of that storage.  All storage
> management would take place on the file server, where the physical
> disks reside.  The application server would still perform end-to-end
> error checking but would notify the file server when it detected an
> error.

Currently, this is done as a retry.  But retries can suffer from cached badness.

> There are several advantages to this configuration.  One current
> recommendation is to export raw disks from the file server.  Some
> storage devices, including I assume Sun's 7000 series, are unable to
> do this.  Another is to build two RAID devices on the file server and
> to mirror them with ZFS on the application server.  This is also
> sub-optimal as it doubles the space requirement and still does not
> take full advantage of ZFS error checking.  Splitting the
> responsibilities works around these problems.

I'm not convinced, but here is how you can change my mind.

1. Determine which faults you are trying to recover from.
2. Prioritize these faults based on their observability, impact, and rate.
3. For each fault, can it be solved using currently implemented means?
   Is there a way to improve recovery (likely)?

The list that falls out of the bottom of this evaluation process should provide bounded, well-defined problems to solve.  If the solution requires additions to protocols, or even a new protocol, then that work would need to be started ASAP, because it can take years to implement.  Currently, most protocols use retries as a basis.  Few have anything more sophisticated.
 -- richard
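For step 2, the observability and rate of checksum faults can already be estimated with the existing ZFS and FMA tooling.  A rough sketch (these are standard OpenSolaris commands, but exact output formats vary by build):

    # Per-vdev READ/WRITE/CKSUM error counters for every imported pool
    zpool status -v

    # Raw FMA error telemetry; ZFS checksum failures are logged as
    # ereport.fs.zfs.checksum events, and their timestamps give a rate
    fmdump -e | grep -c ereport.fs.zfs.checksum

    # Event and fault rates per fault-management module
    fmstat

Counting events over a known interval gives the rate; comparing the counters before and after an application-level failure gives a rough sense of observability.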
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:

    gm> ZFS on a file server, exporting storage to an application
    gm> server where ZFS also runs on top of that storage.  All
    gm> storage management would take place on the file server, where
    gm> the physical disks reside.  The application server would still
    gm> perform end-to-end error checking but would notify the file
    gm> server when it detected an error.

I think the Lustre group wants, or was directed, to arrange for ZFS to become a supported backing store.  Since Lustre might have less interoperability baggage than NFS, SMB, or iSCSI, maybe you could convince them to extend the ZFS-checksum protection domain out to the Lustre client.

I don't really know what they are doing.  It might end up without quite the level of elegance of a ZFS checksum tree, since there will be multiple ZFSes beneath Lustre, but adding the idea of a "protection domain" to their deliberations might make Lustre-ZFS meaningfully better.
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:

    gm> ZFS on a file server, exporting storage to an application
    gm> server where ZFS also runs on top of that storage.  All
    gm> storage management would take place on the file server, where
    gm> the physical disks reside.  The application server would still
    gm> perform end-to-end error checking but would notify the file
    gm> server when it detected an error.

> I think the Lustre group wants, or was directed, to arrange for ZFS
> to become a supported backing store.  Since Lustre might have less
> interoperability baggage than NFS, SMB, or iSCSI, maybe you could
> convince them to extend the ZFS-checksum protection domain out to
> the Lustre client.  I don't really know what they are doing.  It
> might end up without quite the level of elegance of a ZFS checksum
> tree, since there will be multiple ZFSes beneath Lustre, but adding
> the idea of a "protection domain" to their deliberations might make
> Lustre-ZFS meaningfully better.

Two points...

[a] There is a standard for such end-to-end data integrity, i.e. T10 DIF, and many vendors seem to be moving that way.  For a high-level overview see -> http://www.enterprisestorageforum.com/continuity/news/article.php/3672651

[b] The Lustre team, I believe, is looking at porting the DMU **not** the entire ZFS stack.  There are still license issues, i.e. CDDL vs. GPL.  How that will be handled hasn't been discussed openly, as far as I know.
>>>>> "j" == jkait <jkaitsch at gmail.com> writes:

    j> [b] The Lustre team, I believe, is looking at porting the DMU
    j> **not** the entire zfs stack.

wow.  That's even more awesome.  In that case, since they are more or less making their own filesystem, maybe it will be natural to validate checksums on the clients.

    j> http://www.enterprisestorageforum.com/continuity/news/article.php/3672651

meh, wake me when it's over.

Another thing which interests me, in light of recent discussion, is checksums which can be broken if write barriers are violated.  It's forever impossible to tell if your data is "up-to-date" with just a checksum, because it will be valid tomorrow if it's valid today, but you can tell if a bag of checksums match with each other, and perhaps be warned if the filesystem has recovered to some new and seemingly-valid state through which, were it respecting fsync() barriers, it could never have passed before the data loss.

With this feature, instead of just flagging the insides of files as invalid, ZFS could put seals on whole datasets, and we would see these checksum seals broken if we disabled the ZIL.  It could become meaningful to put a seal on a hierarchy of datasets, which would be broken if you mounted a tree of snapshots of those datasets which were not taken atomically.  This also becomes more meaningful with filesystems like HAMMER that have infinite snapshots, where you may want metadata checksums to seal the filesystem's history, a history which could be broken if drives write checksum-sized blocks but write them in the wrong order.

I don't see how raw storage can do anything but put checksums on block-sized chunks, which is useful for data in flight but not that useful to store.  The stored checksum can prove "this exact block was once handed to me, and I was once told to write it to this LBA on this LUN."  So what?  Yes, I agree that happened, but it might have been two years ago.  That doesn't mean the block is what belongs there _right now_.  I could have overwritten that block 100 times since then.  You need a metadata hierarchy to know that.

What the SCSI extensions could do is extend the checksums that all the big storage vendors are already doing over the FC/iSCSI SAN, and thus stop ZFS advocates from pointing at weak TCP checksums, ancient routers, and SAN bitflip gremlins when pools with single-LUN vdevs become corrupt.  The storage vendor pitch about helping to _find_ the corruption problems---I buy that one.  ZFS is notoriously poor at that job.  But I don't think the SCSI extension is helpful for extending the halo of the on-disk protection domain through the filesystem and above it, past a network filesystem.  They can't do that by adding SCSI commands.  It's simply irrelevant to the task, unless SCSI is going to become its own non-POSIX filesystem with snapshots and a virtual clock, which it had better not.  Lustre could do it, though, especially if they are building their own filesystem from zpool pieces right above the transactional layer, not just using ZFS as a POSIX backing store.
On Thu, Feb 19, 2009 at 6:18 AM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> Should I file an RFE for this addition to ZFS?  The concept would be
> to run ZFS on a file server, exporting storage to an application
> server where ZFS also runs on top of that storage.  All storage
> management would take place on the file server, where the physical
> disks reside.  The application server would still perform end-to-end
> error checking but would notify the file server when it detected an
> error.

You could accomplish most of this by creating an iSCSI volume on the storage server, then using ZFS with no redundancy on the application server.

You'll have two layers of checksums, one on the storage server's zpool and a second on the application server's filesystem.  The application server won't be able to notify the storage server that it's detected a bad checksum, other than through retries, but you can write a user-space monitor that watches for ZFS checksum errors and sends notification to the storage server.

To poke a hole in your idea: What if the app server does find an error?  What's the storage server to do at that point?  Provided that the storage server's zpool already has redundancy, the data written to disk should already be exactly what was received from the client.  If you want the ability to recover from errors on the app server, you should use a redundant zpool - either a mirror or a raidz.

If you're concerned about data corruption in transit, then it sounds like something akin to T10 DIF (which others mentioned) would fit the bill.  You could also tunnel the traffic over a transport layer such as TLS or SSH that provides a measure of validation.  Latency should be fun to deal with, however.

-B

-- 
Brandon High : bhigh at freaks.com
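A minimal sketch of that setup, assuming the older shareiscsi property for brevity (the pool, volume, and address names here are made up; COMSTAR builds would use sbdadm/stmfadm/itadm instead):

    # On the storage server: carve a zvol out of the redundant pool
    # and export it over iSCSI
    zfs create -V 500G tank/appserv-vol
    zfs set shareiscsi=on tank/appserv-vol

    # On the application server: discover the target and build a
    # single-device pool on top of it (checksums, but no self-healing)
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.1.10:3260
    zpool create appdata c2t1d0    # device name is illustrative; use the one iscsiadm exposes

The application-server pool still detects corruption through its own checksums; it just cannot repair it, which is the gap the proposed RFE is about.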
On Thu, Feb 19, 2009 at 09:59:01AM -0800, Richard Elling wrote:
> Gary Mills wrote:
> > Should I file an RFE for this addition to ZFS?  The concept would be
> > to run ZFS on a file server, exporting storage to an application
> > server where ZFS also runs on top of that storage.  All storage
> > management would take place on the file server, where the physical
> > disks reside.  The application server would still perform end-to-end
> > error checking but would notify the file server when it detected an
> > error.
>
> Currently, this is done as a retry.  But retries can suffer from cached
> badness.

So, ZFS on the application server would retry the read from the storage server.  This would be the same as it does from a physical disk, I presume.  However, if the checksum failure persisted, it would declare an error.  That's where the RFE comes in, because it would then notify the file server to utilize its redundant data source.  Perhaps this could be done as part of the retry, using existing protocols.

> > There are several advantages to this configuration.  One current
> > recommendation is to export raw disks from the file server.  Some
> > storage devices, including I assume Sun's 7000 series, are unable to
> > do this.  Another is to build two RAID devices on the file server and
> > to mirror them with ZFS on the application server.  This is also
> > sub-optimal as it doubles the space requirement and still does not
> > take full advantage of ZFS error checking.  Splitting the
> > responsibilities works around these problems.
>
> I'm not convinced, but here is how you can change my mind.
>
> 1. Determine which faults you are trying to recover from.

I don't think this has been clearly identified, except that they are "those faults that are only detected by end-to-end checksums".

> 2. Prioritize these faults based on their observability, impact,
>    and rate.

Perhaps the project should be to extend end-to-end checksums in situations that don't have end-to-end redundancy.  Redundancy at the storage layer would be required, of course.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On 2/20/2009 9:33 AM, Gary Mills wrote:
> On Thu, Feb 19, 2009 at 09:59:01AM -0800, Richard Elling wrote:
>> Gary Mills wrote:
>>> Should I file an RFE for this addition to ZFS?  The concept would be
>>> to run ZFS on a file server, exporting storage to an application
>>> server where ZFS also runs on top of that storage.  All storage
>>> management would take place on the file server, where the physical
>>> disks reside.  The application server would still perform end-to-end
>>> error checking but would notify the file server when it detected an
>>> error.
>>
>> Currently, this is done as a retry.  But retries can suffer from cached
>> badness.
>
> So, ZFS on the application server would retry the read from the
> storage server.  This would be the same as it does from a physical
> disk, I presume.  However, if the checksum failure persisted, it
> would declare an error.  That's where the RFE comes in, because it
> would then notify the file server to utilize its redundant data
> source.  Perhaps this could be done as part of the retry, using
> existing protocols.

I'm no expert, but I think not only would this be taken care of by the retry, but if the error is being introduced by any hardware or software on the storage server's end, then the storage server will already be checking its checksums.  The main place new errors could be introduced is after the data has left ZFS's control: heading out the network interface, across the wires, and into the application server.  While it's not impossible for the same error to creep in on every retry, I think that would be rarer than a different error each time, and the retries would have a very good chance of eventually getting good copies of every block.

Even if the application server could notify the storage server of the problem, there isn't anything more the storage server could do.  If there was a problem that its redundancy could fix, its checksums would have identified it, and it would have fixed it even before the data was sent to the application server.

>>> There are several advantages to this configuration.  One current
>>> recommendation is to export raw disks from the file server.  Some
>>> storage devices, including I assume Sun's 7000 series, are unable to
>>> do this.  Another is to build two RAID devices on the file server and
>>> to mirror them with ZFS on the application server.  This is also
>>> sub-optimal as it doubles the space requirement and still does not
>>> take full advantage of ZFS error checking.  Splitting the
>>> responsibilities works around these problems.
>>
>> I'm not convinced, but here is how you can change my mind.
>>
>> 1. Determine which faults you are trying to recover from.
>
> I don't think this has been clearly identified, except that they are
> "those faults that are only detected by end-to-end checksums".

Adding ZFS on the app server will add a new set of checksums for the data's journey over the wire and back again.  Nothing will be checking those checksums on the storage server to see whether corruption happened to writes on the way there (which might be a place for improvement, but I'm not sure how that could even be done), but those same checksums will be sent back to the app server on a read, so the app server will be able to detect the problem then.  Of course, if the corruption happened while sending the write, then no amount of retries will help.  Only ZFS redundancy on the app server can (currently) help with that.

-Kyle

>> 2. Prioritize these faults based on their observability, impact,
>>    and rate.
>
> Perhaps the project should be to extend end-to-end checksums in
> situations that don't have end-to-end redundancy.  Redundancy at the
> storage layer would be required, of course.
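Kyle's point that the storage server has already repaired anything its own redundancy can repair is easy to confirm on the storage side, since a scrub forces a full re-read and rewrites any block that fails its checksum from a good copy.  A minimal sketch (the pool name "tank" is made up):

    # Re-read every allocated block; repairs come from the pool's own
    # redundancy (mirror or raidz) as failures are found
    zpool scrub tank

    # The CKSUM column and the "scrub completed ... with 0 errors" line
    # show what was detected and repaired on the storage side
    zpool status -v tank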
On Thu, Feb 19, 2009 at 12:36:22PM -0800, Brandon High wrote:
> On Thu, Feb 19, 2009 at 6:18 AM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> > Should I file an RFE for this addition to ZFS?  The concept would be
> > to run ZFS on a file server, exporting storage to an application
> > server where ZFS also runs on top of that storage.  All storage
> > management would take place on the file server, where the physical
> > disks reside.  The application server would still perform end-to-end
> > error checking but would notify the file server when it detected an
> > error.
>
> You could accomplish most of this by creating an iSCSI volume on the
> storage server, then using ZFS with no redundancy on the application
> server.

That's what I'd like to do, and what we do now.  The RFE is to take advantage of the end-to-end checksums in ZFS in spite of having no redundancy on the application server.  Having all of the disk management in one place is a great benefit.

> You'll have two layers of checksums, one on the storage server's
> zpool and a second on the application server's filesystem.  The
> application server won't be able to notify the storage server that
> it's detected a bad checksum, other than through retries, but you can
> write a user-space monitor that watches for ZFS checksum errors and
> sends notification to the storage server.

The RFE is to enable the two instances of ZFS to exchange information about checksum failures.

> To poke a hole in your idea: What if the app server does find an
> error?  What's the storage server to do at that point?  Provided that
> the storage server's zpool already has redundancy, the data written to
> disk should already be exactly what was received from the client.  If
> you want the ability to recover from errors on the app server, you
> should use a redundant zpool - either a mirror or a raidz.

Yes, if the two instances of ZFS disagree, we have a problem that needs to be resolved: they need to cooperate in this endeavour.

> If you're concerned about data corruption in transit, then it sounds
> like something akin to T10 DIF (which others mentioned) would fit the
> bill.  You could also tunnel the traffic over a transport layer such as
> TLS or SSH that provides a measure of validation.  Latency should be
> fun to deal with, however.

I'm mainly concerned that ZFS on the application server will detect a checksum error and then be unable to preserve the data.  iSCSI already has TCP checksums, and I assume that FC-AL does as well.  Using more reliable checksums has no benefit if ZFS will still detect end-to-end checksum errors.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
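Until such an RFE exists, the closest approximation to that cooperation is the user-space monitor Brandon described: watch the application server's pool for checksum errors and ask the storage server to re-verify its redundant copies.  A rough sketch, reusing the pool names from the earlier example and assuming a passwordless ssh trust to a host called "storagesrv" (all of these names are assumptions for illustration):

    #!/bin/sh
    # Poll the application server's pool and, on checksum errors, trigger
    # a scrub of the backing pool on the storage server.
    APPPOOL=appdata
    BACKPOOL=tank
    STORAGESRV=storagesrv

    while true; do
        # Pool-level counters appear on the pool's own row of
        # `zpool status` output: NAME STATE READ WRITE CKSUM
        cksum=`zpool status $APPPOOL | awk -v p=$APPPOOL '$1 == p {print $5}'`
        if [ -n "$cksum" ] && [ "$cksum" != "0" ]; then
            logger -p daemon.warning "ZFS checksum errors on $APPPOOL; asking $STORAGESRV to scrub"
            ssh root@$STORAGESRV zpool scrub $BACKPOOL
            # Clear the counters so the same errors are not reported again
            zpool clear $APPPOOL
        fi
        sleep 300
    done

This only re-verifies the storage server's copies; it cannot hand the application server a corrected block, which is exactly the gap the proposed RFE would close.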