Should I file an RFE for this addition to ZFS?  The concept would be to run ZFS on a file server, exporting storage to an application server where ZFS also runs on top of that storage.  All storage management would take place on the file server, where the physical disks reside.  The application server would still perform end-to-end error checking but would notify the file server when it detected an error.

This configuration has several advantages over the current recommendations.  One recommendation is to export raw disks from the file server, but some storage devices, including (I assume) Sun's 7000 series, are unable to do this.  Another is to build two RAID devices on the file server and mirror them with ZFS on the application server.  That is also sub-optimal: it doubles the space requirement and still does not take full advantage of ZFS error checking.  Splitting the responsibilities works around these problems.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Gary Mills wrote:
> Should I file an RFE for this addition to ZFS?  The concept would be
> to run ZFS on a file server, exporting storage to an application
> server where ZFS also runs on top of that storage.  All storage
> management would take place on the file server, where the physical
> disks reside.  The application server would still perform end-to-end
> error checking but would notify the file server when it detected an
> error.

Currently, this is done as a retry.  But retries can suffer from cached badness.

> There are several advantages to this configuration.  One current
> recommendation is to export raw disks from the file server.  Some
> storage devices, including I assume Sun's 7000 series, are unable to
> do this.  Another is to build two RAID devices on the file server and
> to mirror them with ZFS on the application server.  This is also
> sub-optimal as it doubles the space requirement and still does not
> take full advantage of ZFS error checking.  Splitting the
> responsibilities works around these problems.

I'm not convinced, but here is how you can change my mind.

1. Determine which faults you are trying to recover from.
2. Prioritize these faults based on their observability, impact, and rate.
3. For each fault, can it be solved using currently implemented means?
   Is there a way to improve recovery (likely)?

The list that falls out of the bottom of this evaluation process should provide bounded, well-defined problems to solve.  If the solution requires additions to protocols, or even a new protocol, then that work would need to be started ASAP, because it can take years to implement.  Currently, most protocols use retries as a basis.  Few have anything more sophisticated.
 -- richard
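For step 2, the observability and rate of checksum faults can already be estimated with the existing ZFS and FMA tooling.  A rough sketch (these are standard OpenSolaris commands, but exact output formats vary by build):

    # Per-vdev READ/WRITE/CKSUM error counters for every imported pool
    zpool status -v

    # Raw FMA error telemetry; ZFS checksum failures are logged as
    # ereport.fs.zfs.checksum events, and their timestamps give a rate
    fmdump -e | grep -c ereport.fs.zfs.checksum

    # Event and fault rates per fault-management module
    fmstat

Counting events over a known interval gives the rate; comparing the counters before and after an application-level failure gives a rough sense of observability.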
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:

    gm> ZFS on a file server, exporting storage to an application
    gm> server where ZFS also runs on top of that storage.  All
    gm> storage management would take place on the file server, where
    gm> the physical disks reside.  The application server would still
    gm> perform end-to-end error checking but would notify the file
    gm> server when it detected an error.

I think the Lustre group wants, or was directed, to arrange for ZFS to become a supported backing store.  Since Lustre might have less interoperability baggage than NFS, SMB, or iSCSI, maybe you could convince them to extend the ZFS-checksum protection domain out to the Lustre client.

I don't really know what they are doing.  It might end up without quite the level of elegance of a ZFS checksum tree, since there will be multiple ZFSes beneath Lustre, but adding the idea of a "protection domain" to their deliberations might make Lustre-ZFS meaningfully better.
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:

    gm> ZFS on a file server, exporting storage to an application
    gm> server where ZFS also runs on top of that storage.  All
    gm> storage management would take place on the file server, where
    gm> the physical disks reside.  The application server would still
    gm> perform end-to-end error checking but would notify the file
    gm> server when it detected an error.

> I think the Lustre group wants, or was directed, to arrange for ZFS
> to become a supported backing store.  Since Lustre might have less
> interoperability baggage than NFS, SMB, or iSCSI, maybe you could
> convince them to extend the ZFS-checksum protection domain out to
> the Lustre client.  I don't really know what they are doing.  It
> might end up without quite the level of elegance of a ZFS checksum
> tree, since there will be multiple ZFSes beneath Lustre, but adding
> the idea of a "protection domain" to their deliberations might make
> Lustre-ZFS meaningfully better.

Two points...

[a] There is a standard for such end-to-end data integrity, i.e. T10 DIF, and many vendors seem to be moving that way.  For a high-level overview see -> http://www.enterprisestorageforum.com/continuity/news/article.php/3672651

[b] The Lustre team, I believe, is looking at porting the DMU **not** the entire ZFS stack.  There are still license issues, i.e. CDDL vs. GPL.  How that will be handled hasn't been discussed openly, as far as I know.
>>>>> "j" == jkait <jkaitsch at gmail.com> writes:

    j> [b] The Lustre team, I believe, is looking at porting the DMU
    j> **not** the entire zfs stack.

wow.  That's even more awesome.  In that case, since they are more or less making their own filesystem, maybe it will be natural to validate checksums on the clients.

    j> http://www.enterprisestorageforum.com/continuity/news/article.php/3672651

meh, wake me when it's over.

Another thing which interests me, in light of recent discussion, is checksums which can be broken if write barriers are violated.  It's forever impossible to tell if your data is "up-to-date" with just a checksum, because it will be valid tomorrow if it's valid today, but you can tell if a bag of checksums match with each other, and perhaps be warned if the filesystem has recovered to some new and seemingly-valid state through which, were it respecting fsync() barriers, it could never have passed before the data loss.

With this feature, instead of just flagging the insides of files as invalid, ZFS could put seals on whole datasets, and we would see these checksum seals broken if we disabled the ZIL.  It could become meaningful to put a seal on a hierarchy of datasets, which would be broken if you mounted a tree of snapshots of those datasets which were not taken atomically.  This also becomes more meaningful with filesystems like HAMMER that have infinite snapshots, where you may want metadata checksums to seal the filesystem's history, a history which could be broken if drives write checksum-sized blocks but write them in the wrong order.

I don't see how raw storage can do anything but put checksums on block-sized chunks, which is useful for data in flight but not that useful to store.  The stored checksum can prove "this exact block was once handed to me, and I was once told to write it to this LBA on this LUN."  So what?  Yes, I agree that happened, but it might have been two years ago.  That doesn't mean the block is what belongs there _right now_.  I could have overwritten that block 100 times since then.  You need a metadata hierarchy to know that.

What the SCSI extensions could do is extend the checksums that all the big storage vendors are already doing over the FC/iSCSI SAN, and thus stop ZFS advocates from pointing at weak TCP checksums, ancient routers, and SAN bitflip gremlins when pools with single-LUN vdevs become corrupt.  The storage vendor pitch about helping to _find_ the corruption problems---I buy that one.  ZFS is notoriously poor at that job.  But I don't think the SCSI extension is helpful for extending the halo of the on-disk protection domain through the filesystem and above it, past a network filesystem.  They can't do that by adding SCSI commands.  It's simply irrelevant to the task, unless SCSI is going to become its own non-POSIX filesystem with snapshots and a virtual clock, which it had better not.  Lustre could do it, though, especially if they are building their own filesystem from zpool pieces right above the transactional layer, not just using ZFS as a POSIX backing store.
On Thu, Feb 19, 2009 at 6:18 AM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> Should I file an RFE for this addition to ZFS?  The concept would be
> to run ZFS on a file server, exporting storage to an application
> server where ZFS also runs on top of that storage.  All storage
> management would take place on the file server, where the physical
> disks reside.  The application server would still perform end-to-end
> error checking but would notify the file server when it detected an
> error.

You could accomplish most of this by creating an iSCSI volume on the storage server, then using ZFS with no redundancy on the application server.

You'll have two layers of checksums, one on the storage server's zpool and a second on the application server's filesystem.  The application server won't be able to notify the storage server that it's detected a bad checksum, other than through retries, but you can write a user-space monitor that watches for ZFS checksum errors and sends notification to the storage server.

To poke a hole in your idea: What if the app server does find an error?  What's the storage server to do at that point?  Provided that the storage server's zpool already has redundancy, the data written to disk should already be exactly what was received from the client.  If you want the ability to recover from errors on the app server, you should use a redundant zpool - either a mirror or a raidz.

If you're concerned about data corruption in transit, then it sounds like something akin to T10 DIF (which others mentioned) would fit the bill.  You could also tunnel the traffic over a transport layer such as TLS or SSH that provides a measure of validation.  Latency should be fun to deal with, however.

-B

-- 
Brandon High : bhigh at freaks.com
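A minimal sketch of that setup, assuming the older shareiscsi property for brevity (the pool, volume, and address names here are made up; COMSTAR builds would use sbdadm/stmfadm/itadm instead):

    # On the storage server: carve a zvol out of the redundant pool
    # and export it over iSCSI
    zfs create -V 500G tank/appserv-vol
    zfs set shareiscsi=on tank/appserv-vol

    # On the application server: discover the target and build a
    # single-device pool on top of it (checksums, but no self-healing)
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.1.10:3260
    zpool create appdata c2t1d0    # device name is illustrative; use the one iscsiadm exposes

The application-server pool still detects corruption through its own checksums; it just cannot repair it, which is the gap the proposed RFE is about.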
On Thu, Feb 19, 2009 at 09:59:01AM -0800, Richard Elling wrote:
> Gary Mills wrote:
> > Should I file an RFE for this addition to ZFS?  The concept would be
> > to run ZFS on a file server, exporting storage to an application
> > server where ZFS also runs on top of that storage.  All storage
> > management would take place on the file server, where the physical
> > disks reside.  The application server would still perform end-to-end
> > error checking but would notify the file server when it detected an
> > error.
>
> Currently, this is done as a retry.  But retries can suffer from cached
> badness.

So, ZFS on the application server would retry the read from the storage server.  This would be the same as it does from a physical disk, I presume.  However, if the checksum failure persisted, it would declare an error.  That's where the RFE comes in, because it would then notify the file server to utilize its redundant data source.  Perhaps this could be done as part of the retry, using existing protocols.

> > There are several advantages to this configuration.  One current
> > recommendation is to export raw disks from the file server.  Some
> > storage devices, including I assume Sun's 7000 series, are unable to
> > do this.  Another is to build two RAID devices on the file server and
> > to mirror them with ZFS on the application server.  This is also
> > sub-optimal as it doubles the space requirement and still does not
> > take full advantage of ZFS error checking.  Splitting the
> > responsibilities works around these problems.
>
> I'm not convinced, but here is how you can change my mind.
>
> 1. Determine which faults you are trying to recover from.

I don't think this has been clearly identified, except that they are "those faults that are only detected by end-to-end checksums".

> 2. Prioritize these faults based on their observability, impact,
>    and rate.

Perhaps the project should be to extend end-to-end checksums in situations that don't have end-to-end redundancy.  Redundancy at the storage layer would be required, of course.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On 2/20/2009 9:33 AM, Gary Mills wrote:
> On Thu, Feb 19, 2009 at 09:59:01AM -0800, Richard Elling wrote:
>> Gary Mills wrote:
>>> Should I file an RFE for this addition to ZFS?  The concept would be
>>> to run ZFS on a file server, exporting storage to an application
>>> server where ZFS also runs on top of that storage.  All storage
>>> management would take place on the file server, where the physical
>>> disks reside.  The application server would still perform end-to-end
>>> error checking but would notify the file server when it detected an
>>> error.
>>
>> Currently, this is done as a retry.  But retries can suffer from cached
>> badness.
>
> So, ZFS on the application server would retry the read from the
> storage server.  This would be the same as it does from a physical
> disk, I presume.  However, if the checksum failure persisted, it
> would declare an error.  That's where the RFE comes in, because it
> would then notify the file server to utilize its redundant data
> source.  Perhaps this could be done as part of the retry, using
> existing protocols.

I'm no expert, but I think not only would this be taken care of by the retry, but if the error is being introduced by any hardware or software on the storage server's end, then the storage server will already be checking its checksums.  The main place new errors could be introduced is after the data has left ZFS's control: heading out the network interface, across the wires, and into the application server.  While it's not impossible for the same error to creep in on every retry, I think that would be rarer than a different error each time, and the retries would have a very good chance of eventually getting good copies of every block.

Even if the application server could notify the storage server of the problem, there isn't anything more the storage server could do.  If there was a problem that its redundancy could fix, its checksums would have identified it, and it would have fixed it even before the data was sent to the application server.

>>> There are several advantages to this configuration.  One current
>>> recommendation is to export raw disks from the file server.  Some
>>> storage devices, including I assume Sun's 7000 series, are unable to
>>> do this.  Another is to build two RAID devices on the file server and
>>> to mirror them with ZFS on the application server.  This is also
>>> sub-optimal as it doubles the space requirement and still does not
>>> take full advantage of ZFS error checking.  Splitting the
>>> responsibilities works around these problems.
>>
>> I'm not convinced, but here is how you can change my mind.
>>
>> 1. Determine which faults you are trying to recover from.
>
> I don't think this has been clearly identified, except that they are
> "those faults that are only detected by end-to-end checksums".

Adding ZFS on the app server will add a new set of checksums for the data's journey over the wire and back again.  Nothing will be checking those checksums on the storage server to see whether corruption happened to writes on the way there (which might be a place for improvement, but I'm not sure how that could even be done), but those same checksums will be sent back to the app server on a read, so the app server will be able to detect the problem then.  Of course, if the corruption happened while sending the write, then no amount of retries will help.  Only ZFS redundancy on the app server can (currently) help with that.

-Kyle

>> 2. Prioritize these faults based on their observability, impact,
>>    and rate.
>
> Perhaps the project should be to extend end-to-end checksums in
> situations that don't have end-to-end redundancy.  Redundancy at the
> storage layer would be required, of course.
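Kyle's point that the storage server has already repaired anything its own redundancy can repair is easy to confirm on the storage side, since a scrub forces a full re-read and rewrites any block that fails its checksum from a good copy.  A minimal sketch (the pool name "tank" is made up):

    # Re-read every allocated block; repairs come from the pool's own
    # redundancy (mirror or raidz) as failures are found
    zpool scrub tank

    # The CKSUM column and the "scrub completed ... with 0 errors" line
    # show what was detected and repaired on the storage side
    zpool status -v tank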
On Thu, Feb 19, 2009 at 12:36:22PM -0800, Brandon High wrote:
> On Thu, Feb 19, 2009 at 6:18 AM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> > Should I file an RFE for this addition to ZFS?  The concept would be
> > to run ZFS on a file server, exporting storage to an application
> > server where ZFS also runs on top of that storage.  All storage
> > management would take place on the file server, where the physical
> > disks reside.  The application server would still perform end-to-end
> > error checking but would notify the file server when it detected an
> > error.
>
> You could accomplish most of this by creating an iSCSI volume on the
> storage server, then using ZFS with no redundancy on the application
> server.

That's what I'd like to do, and what we do now.  The RFE is to take advantage of the end-to-end checksums in ZFS in spite of having no redundancy on the application server.  Having all of the disk management in one place is a great benefit.

> You'll have two layers of checksums, one on the storage server's
> zpool and a second on the application server's filesystem.  The
> application server won't be able to notify the storage server that
> it's detected a bad checksum, other than through retries, but you can
> write a user-space monitor that watches for ZFS checksum errors and
> sends notification to the storage server.

The RFE is to enable the two instances of ZFS to exchange information about checksum failures.

> To poke a hole in your idea: What if the app server does find an
> error?  What's the storage server to do at that point?  Provided that
> the storage server's zpool already has redundancy, the data written to
> disk should already be exactly what was received from the client.  If
> you want the ability to recover from errors on the app server, you
> should use a redundant zpool - either a mirror or a raidz.

Yes, if the two instances of ZFS disagree, we have a problem that needs to be resolved: they need to cooperate in this endeavour.

> If you're concerned about data corruption in transit, then it sounds
> like something akin to T10 DIF (which others mentioned) would fit the
> bill.  You could also tunnel the traffic over a transport layer such as
> TLS or SSH that provides a measure of validation.  Latency should be
> fun to deal with, however.

I'm mainly concerned that ZFS on the application server will detect a checksum error and then be unable to preserve the data.  iSCSI already has TCP checksums, and I assume that FC-AL does as well.  Using more reliable checksums has no benefit if ZFS will still detect end-to-end checksum errors.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
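Until such an RFE exists, the closest approximation to that cooperation is the user-space monitor Brandon described: watch the application server's pool for checksum errors and ask the storage server to re-verify its redundant copies.  A rough sketch, reusing the pool names from the earlier example and assuming a passwordless ssh trust to a host called "storagesrv" (all of these names are assumptions for illustration):

    #!/bin/sh
    # Poll the application server's pool and, on checksum errors, trigger
    # a scrub of the backing pool on the storage server.
    APPPOOL=appdata
    BACKPOOL=tank
    STORAGESRV=storagesrv

    while true; do
        # Pool-level counters appear on the pool's own row of
        # `zpool status` output: NAME STATE READ WRITE CKSUM
        cksum=`zpool status $APPPOOL | awk -v p=$APPPOOL '$1 == p {print $5}'`
        if [ -n "$cksum" ] && [ "$cksum" != "0" ]; then
            logger -p daemon.warning "ZFS checksum errors on $APPPOOL; asking $STORAGESRV to scrub"
            ssh root@$STORAGESRV zpool scrub $BACKPOOL
            # Clear the counters so the same errors are not reported again
            zpool clear $APPPOOL
        fi
        sleep 300
    done

This only re-verifies the storage server's copies; it cannot hand the application server a corrected block, which is exactly the gap the proposed RFE would close.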