Large sites that have centralized their data with a SAN typically have a
storage device export block-oriented storage to a server, with a fibre
channel or iSCSI connection between the two.  The server sees this as a
single virtual disk.  On the storage device, the blocks of data may be
spread across many physical disks.  The storage device looks after
redundancy and management of the physical disks.  It may even phone home
when a disk fails and needs to be replaced.  The storage device provides
reliability and integrity for the blocks of data that it serves, and does
this well.

On the server, a variety of filesystems can be created on this virtual
disk.  UFS is most common, but ZFS has a number of advantages over UFS.
Two of these are dynamic space management and snapshots.  There are also
a number of objections to employing ZFS in this manner.  "ZFS cannot
correct errors" and "you will lose all of your data" are two of the
alarming ones.  Isn't ZFS supposed to ensure that data written to the
disk are always correct?  What's the real problem here?

This is a split-responsibility configuration where the storage device is
responsible for integrity of the storage and ZFS is responsible for
integrity of the filesystem.  How can it be made to behave in a reliable
manner?  Can ZFS be better than UFS in this configuration?  Is a
different form of communication between the two components necessary in
this case?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2008-Dec-10 19:08 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, 10 Dec 2008, Gary Mills wrote:

> This is a split responsibility configuration where the storage device
> is responsible for integrity of the storage and ZFS is responsible for
> integrity of the filesystem.  How can it be made to behave in a
> reliable manner?  Can ZFS be better than UFS in this configuration?
> Is a different form of communication between the two components
> necessary in this case?

The issue is really that the SAN device's error detection and correction
is not as robust as what is used by ZFS.  The vast majority of SAN
devices do not do 100% data error detection.  ZFS is in a position to
detect errors that the SAN devices cannot detect.

I doubt that ZFS is any more likely to lose your data than UFS is, but
ZFS is vastly more likely to detect a problem with the data that your
SAN device is returning.

For my own situation, I configured my SAN array as a fibre channel
"JBOD" and ZFS handles all the data integrity issues associated with the
disks.  After 10 months I have yet to encounter any issue and
performance is excellent.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
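A minimal sketch of the fibre-channel-JBOD arrangement Bob describes,
assuming the array simply exports each physical disk as its own LUN; the
pool name and device names below are hypothetical:

   # Hypothetical LUNs, one per physical disk exported by the array.
   zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

   # ZFS now owns both the checksums and the redundancy, so a block that
   # fails verification can be rebuilt from parity and rewritten.
   zpool status tank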
On Wed, Dec 10, 2008 at 18:46, Gary Mills <mills at cc.umanitoba.ca> wrote:

> The storage device provides reliability and integrity for the blocks of
> data that it serves, and does this well.

But not well enough.  Even if the storage does a perfect job keeping its
bits correct on disk, there are a lot of steps between the array and the
CPU.  If one of those steps is faulty, data could be silently corrupted.
ZFS checks data all the way from the CPU to the disk and back to the
CPU, and the storage array fundamentally cannot do that.

> a number of objections to employing ZFS in this manner.
> "ZFS cannot correct errors" and "you will lose all of your data"
> are two of the alarming ones.  Isn't ZFS supposed to ensure that data
> written to the disk are always correct?  What's the real problem here?

These problems are caused by not letting ZFS handle a level of
redundancy.  If you export raid-0 (or raid-5, or single-disk) LUNs from
the array and mirror them on the host side, this will solve the problem.
It does mean that your array is doing extra work that's not getting
used; I don't see any way around that.  It also means that you need
twice the bandwidth to the storage array; I don't see any way around
that either.  ZFS really loves high bandwidth, which gives the advantage
to direct-connected storage: SAS arrays, and so forth.

> the storage device is responsible for integrity of the storage and ZFS
> is responsible for integrity of the filesystem.

Turning off checksumming on the ZFS side may "solve" the problem.

Will
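A sketch of the host-side mirroring Will suggests, assuming the array
exports two single-disk (or RAID-5) LUNs; device and pool names are
illustrative only:

   # Mirror two array LUNs on the host, giving ZFS a second copy it can
   # use to repair any block that fails its checksum.
   zpool create tank mirror c3t0d0 c3t1d0

   # Check the layout and the health of both sides of the mirror.
   zpool status tank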
Nicolas Williams
2008-Dec-10 19:30 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, Dec 10, 2008 at 12:46:40PM -0600, Gary Mills wrote:

> On the server, a variety of filesystems can be created on this virtual
> disk.  UFS is most common, but ZFS has a number of advantages over
> UFS.  Two of these are dynamic space management and snapshots.  There
> are also a number of objections to employing ZFS in this manner.
> "ZFS cannot correct errors" and "you will lose all of your data"
> are two of the alarming ones.  Isn't ZFS supposed to ensure that data
> written to the disk are always correct?  What's the real problem here?

ZFS has very strong error detection built in, and for mirrored and
RAID-Z pools it can recover from errors automatically as long as there's
a mirror left, or enough disks left in RAID-Z, to complete the recovery.
ZFS can also store multiple copies of data and metadata even in
non-mirrored/non-RAID-Z pools.  ZFS always leaves the filesystem in a
consistent state, provided the drives aren't lying.

Whoever is making those objections is misinformed.

> This is a split responsibility configuration where the storage device
> is responsible for integrity of the storage and ZFS is responsible for
> integrity of the filesystem.  How can it be made to behave in a
> reliable manner?  Can ZFS be better than UFS in this configuration?

It does.  It is.

> Is a different form of communication between the two components
> necessary in this case?

No.  Note that you'll generally be better off using RAID-Z than HW
RAID-5.

Nico
-- 
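To illustrate the multiple-copies point: on a pool with no mirror and no
RAID-Z, the copies property asks ZFS to keep extra copies of each block,
which lets it repair isolated bad blocks (though not the loss of the
whole LUN).  The pool and filesystem names here are made up:

   # Store two copies of every data block in this filesystem; pool
   # metadata is already kept redundantly by default.
   zfs set copies=2 tank/home
   zfs get copies tank/home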
I agree completely with your assessment of the problems, Gary.  When ZFS
can't correct your data you do seem to be at high risk of losing it,
although some people have been able to recover with the help of a couple
of helpful souls on this forum.

I can think of one scenario where you might be able to turn this
configuration to your advantage, though.  If you have two SANs for
redundancy, it would be possible to link each to your ZFS server and
create a ZFS mirror across them.  That gives you the best of both worlds
(while potentially avoiding SAN remote-mirroring licences, which tend to
be expensive).  You could also potentially mirror to a local disk,
although I suspect that would have a noticeable impact on performance in
most situations.

Failing that, as others have suggested, export multiple LUNs from your
SAN and create a ZFS raid array or mirror.
-- 
This message posted from opensolaris.org
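A sketch of the two-SAN mirror described above, assuming each array
exports two LUNs and the controller numbers distinguish the arrays; all
names are hypothetical:

   # c4* LUNs come from one array, c5* LUNs from the other, so each
   # mirror vdev has one side on each SAN.
   zpool create tank mirror c4t0d0 c5t0d0 mirror c4t1d0 c5t1d0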
Nicolas Williams
2008-Dec-10 20:26 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, Dec 10, 2008 at 01:30:30PM -0600, Nicolas Williams wrote:

> On Wed, Dec 10, 2008 at 12:46:40PM -0600, Gary Mills wrote:
> > On the server, a variety of filesystems can be created on this virtual
> > disk.  UFS is most common, but ZFS has a number of advantages over
> > UFS.  Two of these are dynamic space management and snapshots.  There
> > are also a number of objections to employing ZFS in this manner.
> > "ZFS cannot correct errors" and "you will lose all of your data"
> > are two of the alarming ones.  Isn't ZFS supposed to ensure that data
> > written to the disk are always correct?  What's the real problem here?
>
> ZFS has very strong error detection built in, and for mirrored and
> RAID-Z pools it can recover from errors automatically as long as there's
> a mirror left, or enough disks left in RAID-Z, to complete the recovery.

Oh, but I get it: all the redundancy here would be in the SAN, and the
ZFS pools would have no mirrors, no RAID-Z.

As I said:

> Note that you'll generally be better off using RAID-Z than HW RAID-5.

Precisely because ZFS can reconstruct the correct data if it's
responsible for redundancy.

But note that the setup you describe puts ZFS in no worse a situation
than any other filesystem.
Nicolas Williams wrote:

> On Wed, Dec 10, 2008 at 01:30:30PM -0600, Nicolas Williams wrote:
>
>> ZFS has very strong error detection built in, and for mirrored and
>> RAID-Z pools it can recover from errors automatically as long as there's
>> a mirror left, or enough disks left in RAID-Z, to complete the recovery.
>
> Oh, but I get it: all the redundancy here would be in the SAN, and the
> ZFS pools would have no mirrors, no RAID-Z.
>
> As I said:
>
>> Note that you'll generally be better off using RAID-Z than HW RAID-5.
>
> Precisely because ZFS can reconstruct the correct data if it's
> responsible for redundancy.
>
> But note that the setup you describe puts ZFS in no worse a situation
> than any other filesystem.

Well, actually, it does.  ZFS is susceptible to a class of failure modes
I classify as "kill the canary" types.  ZFS will detect errors and
complain about them, which results in people blaming ZFS (the canary).
If you follow this forum, you'll see a "kill the canary" post about
every month or so.

By default, ZFS implements the policy that uncorrectable but important
failures may cause it to do an armadillo impression (staying with the
animal theme ;-), whereas some other file systems, like UFS, will
blissfully ignore them -- putting data at risk.  Occasionally, arguments
arise over whether this is the best default policy, though most folks
seem to agree that data corruption is a bad thing.  Later versions of
ZFS, particularly that available in Solaris 10 10/08 and all OpenSolaris
releases, allow system admins to have better control over these
policies.
 -- richard
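One concrete example of the policy control Richard refers to is the
failmode pool property available from Solaris 10 10/08 onward, which
governs how a pool reacts to catastrophic I/O failure (the pool name
here is assumed):

   # 'wait' (the default) blocks I/O until the device returns; 'continue'
   # returns EIO to new write requests instead; 'panic' halts the host.
   zpool set failmode=continue tank
   zpool get failmode tank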
Nicolas Williams
2008-Dec-10 21:11 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, Dec 10, 2008 at 12:58:48PM -0800, Richard Elling wrote:

> Nicolas Williams wrote:
> > But note that the setup you describe puts ZFS in no worse a situation
> > than any other filesystem.
>
> Well, actually, it does.  ZFS is susceptible to a class of failure modes
> I classify as "kill the canary" types.  ZFS will detect errors and
> complain about them, which results in people blaming ZFS (the canary).
> If you follow this forum, you'll see a "kill the canary" post about
> every month or so.
>
> By default, ZFS implements the policy that uncorrectable but important
> failures may cause it to do an armadillo impression (staying with the
> animal theme ;-), whereas some other file systems, like UFS, will
> blissfully ignore them -- putting data at risk.  Occasionally, arguments
> arise over whether this is the best default policy, though most folks
> seem to agree that data corruption is a bad thing.  Later versions of
> ZFS, particularly that available in Solaris 10 10/08 and all OpenSolaris
> releases, allow system admins to have better control over these
> policies.

I've seen many of those threads.  ZFS won't put your data at risk, but
users are accustomed to UFS (and others) doing so, and they tend to
prefer that to ZFS panics.  It's not that ZFS puts your data at risk in
this scenario; it puts your operations at risk, which for many is
actually much worse than risking their data.

Here's hoping for the end of HW RAID.

Nico
-- 
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes: >>>>> "wm" == Will Murnane <will.murnane at gmail.com> writes:nw> ZFS has very strong error detection built-in, nw> ZFS can also store multiple copies of data and metadata even nw> in non-mirrored/non-RAID-Z pools. nw> Whoever is making those objections is misinformed. The objection, to review, is that people are losing entire ZFS pools on SAN''s more often than UFS pools on the same SAN. This is experience. One might start trying to infer the reason, from the manual recovery workarounds that have worked: using an older ueberblock. wm> Turning off checksumming on the ZFS side may ``solve'''' the wm> problem. That wasn''t the successful answer for people who lost pools and then recovered them. Based on my limited understanding I don''t think it would help a pool that was recovered by using an older ueberblock. Also to pick a nit, AIUI certain checksums on the metadata can''t be disabled because they''re used in place of write-barriered commit sectors. I might be wrong though. nw> ZFS always leaves the filesystem in a consistent state, nw> provided the drives aren''t lying. ZFS needs to give similar reliability performance to competing filesystems while running on the drives and SANs that exist now. Alternatively, if you want to draw a line in the sand on the ``blame the device'''' position, the problems causing lost pools have to be actually tracked down and definitively blamed on misimplemented devices, and we need to develop a procedure to identify and disqualify the misimplemented devices. When we follow the qualification procedure before loading data into the pool, you''re no longer allowed to blame devices with hindsight after the pool''s lost by pointing at self-exhonerating error messages or telling stories about theoretical capabilities of the on-disk format. We also need to develop a list of broken devices so we can avoid buying them, and the list needs not to be a secret list rumored to contain ``drives from major vendors'''' for fear of vendors retaliating by repealing discounts or whatever. I kind of prefer this approach, but the sloppier approach of working around the problem (``working around'''' meaing automatically, safely, somewhat-quickly, hopefully not silently, recovering from often-seen kinds of corruption without rigorously identifying their root causes, just like fsck does on other filesystems) is probably easier to implement. Other filesystems like ext3 and XFS on Linux have gone through the same process of figuring out why corruption was happening and working around it through changing the way they write, sending drives STOP_UNIT commands before ACPI powerdown, the rumored ``about to lose power'''' interrupt on SGI that makes Irix cancel DMA, and mostly adding special cases to fsck, and so on. I think the obstructionist recitations of on-disk-format feature lists explaining why this ``shouldn''t be happening'''' reduce confidence in ZFS. They don''t improve it. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081210/9783e4a3/attachment.bin>
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> ZFS will detect errors and complain about them, which results re> in people blaming ZFS (the canary). this is some really sketchy spin. Sometimes you will say ZFS stores multiple copies of metadata, so even on an unredundant pool a few unreadable sectors usually affects only a few files, _and_ ZFS will tell you which files they are. This is the appropriate response to corruption: notice it, while saving as much uncorrupt data as possible. Both pieces are advertised as ZFS features but you stress the second by saying how redundant metdata is. Later you say ZFS is functioning as a canary when it notices corruption and informs you by throwing away your whole pool. Yes, I can pedantically see how sometimes it''d be better to lose a whole pool than have files within it silently corrupted, but if you''re comfortable living with that, it should be presented just-so to potential users, not hidden inside this canary spin. Let''s go with the canary analogy. Living with the current behavior is, at best (assuming you buy the device-blaming explanations which I don''t), more like loading up the whole mine with strategically-placed explosives and connecting them to poison-gas detectors. If there''s any posion gas, they destroy the entire mine. Sure, all the workers die, but (a) they MIGHT have died anyway from the poison gas, (b) it''s for the best because no one will mistakenly wander into the remaining pile of rubble and be harmed by the gas that was in there, (c) you need to have mine-collapse insurance anyway. I don''t think the SAN corruption problems are adequately explained, and even if the party line that they''re caused by mysterious bit-flipping gremlins in DRAM or over FC circuits, throwing out the whole pool isn''t an acceptable kind of warning. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081210/2862f33f/attachment.bin>
Bob Friesenhahn
2008-Dec-10 22:32 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, 10 Dec 2008, Miles Nordin wrote:

> The objection, to review, is that people are losing entire ZFS pools
> on SANs more often than UFS filesystems on the same SANs.  This is
> experience.

It sounds like you have access to a source of information that the rest
of us don't have access to.  Perhaps it is a secret university study
which is not yet published.  Can you please share this source of
information so that we may all analyze it and draw our own conclusions?

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> It sounds like you have access to a source of information that the
> rest of us don't have access to.

I think if you read the archives of this mailing list, and compare it to
the discussions on the other Solaris mailing lists re UFS, it's a
reasonable conclusion.
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2008-Dec-11 16:53 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Wed, 10 Dec 2008, Anton B. Rang wrote:

>> It sounds like you have access to a source of information that the
>> rest of us don't have access to.
>
> I think if you read the archives of this mailing list, and compare
> it to the discussions on the other Solaris mailing lists re UFS,
> it's a reasonable conclusion.

I don't think drawing conclusions based on observing the zfs nerve
center is a scientific approach, for these reasons:

 * UFS is expected to fail.
 * ZFS is expected to never fail.
 * UFS has a small maximum volume size.
 * ZFS allows building massive storage pools into the hundreds of
   terabytes and beyond.
 * UFS has only rudimentary error checking.
 * ZFS has exotic error checking.

So basically UFS volumes are small (most are 100GB or less), and when
they fail (and someone actually notices) it is not worth mentioning,
since they were expected to eventually fail and they can easily be
restored from backup.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Robert Milkowski
2008-Dec-11 17:28 UTC
[zfs-discuss] Split responsibility for data with ZFS
Hello Anton,

Thursday, December 11, 2008, 4:17:15 AM, you wrote:

>> It sounds like you have access to a source of information that the
>> rest of us don't have access to.

ABR> I think if you read the archives of this mailing list, and
ABR> compare it to the discussions on the other Solaris mailing lists
ABR> re UFS, it's a reasonable conclusion.

Well, by following that logic one might deduce that there are many more
ZFS installations than UFS, because people are talking more about ZFS
than UFS these days.  But hey, I doubt it is actually true.

ZFS is very active when it comes to development, and its development is
happening pretty much in public.  That's why you see many more problems
reported with ZFS, but I would argue that it is mostly perception, if
anything.

-- 
Best regards,
 Robert Milkowski                 mailto:milek at task.gda.pl
                                  http://milek.blogspot.com
On 11-Dec-08, at 12:28 PM, Robert Milkowski wrote:

> Hello Anton,
>
> Thursday, December 11, 2008, 4:17:15 AM, you wrote:
>
>>> It sounds like you have access to a source of information that the
>>> rest of us don't have access to.
>
> ABR> I think if you read the archives of this mailing list, and
> ABR> compare it to the discussions on the other Solaris mailing lists
> ABR> re UFS, it's a reasonable conclusion.
>
> Well, by following that logic one might deduce that there are many more
> ZFS installations than UFS, because people are talking more about ZFS
> than UFS these days.  But hey, I doubt it is actually true.
>
> ZFS is very active when it comes to development, and its development is
> happening pretty much in public.

And that perceived (or real) immaturity attracts blame (warranted or
not).

I think we have to assume Anton was joking -- otherwise his measure is
uselessly unscientific.

--Toby

> That's why you see many more problems reported with ZFS, but I would
> argue that it is mostly perception, if anything.
>
> -- 
> Best regards,
>  Robert Milkowski                 mailto:milek at task.gda.pl
>                                   http://milek.blogspot.com
On Wed, Dec 10, 2008 at 12:58:48PM -0800, Richard Elling wrote:

> Nicolas Williams wrote:
> > But note that the setup you describe puts ZFS in no worse a situation
> > than any other filesystem.
>
> Well, actually, it does.  ZFS is susceptible to a class of failure modes
> I classify as "kill the canary" types.  ZFS will detect errors and
> complain about them, which results in people blaming ZFS (the canary).
> If you follow this forum, you'll see a "kill the canary" post about
> every month or so.
>
> By default, ZFS implements the policy that uncorrectable but important
> failures may cause it to do an armadillo impression (staying with the
> animal theme ;-), whereas some other file systems, like UFS, will
> blissfully ignore them -- putting data at risk.  Occasionally, arguments
> arise over whether this is the best default policy, though most folks
> seem to agree that data corruption is a bad thing.  Later versions of
> ZFS, particularly that available in Solaris 10 10/08 and all OpenSolaris
> releases, allow system admins to have better control over these
> policies.

Yes, that's what I was getting at.  Without redundancy at the ZFS level,
ZFS can report errors but not correct them.  Of course, with a reliable
SAN and storage device, those errors will never happen.  Certainly,
vendors of these products will claim that they have extremely high
standards of data integrity.  Data corruption is the worst nightmare of
storage designers, after all.  It rarely happens, although I have seen
it on one occasion in a high-quality storage device.

The split responsibility model is quite appealing.  I'd like to see ZFS
address this model.  Is there not a way that ZFS could delegate
responsibility for both error detection and correction to the storage
device, at least one more sophisticated than a physical disk?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Gary Mills wrote:

> The split responsibility model is quite appealing.  I'd like to see ZFS
> address this model.  Is there not a way that ZFS could delegate
> responsibility for both error detection and correction to the storage
> device, at least one more sophisticated than a physical disk?

Surely that removes one of ZFS's greatest features: end-to-end
checksums.  All you'd end up with is yet another volume manager.

No matter how good your SAN is, it won't spot a flaky cable or bad RAM.

-- 
Ian.
Bob Friesenhahn
2008-Dec-12 04:41 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Thu, 11 Dec 2008, Gary Mills wrote:

> The split responsibility model is quite appealing.  I'd like to see ZFS
> address this model.  Is there not a way that ZFS could delegate
> responsibility for both error detection and correction to the storage
> device, at least one more sophisticated than a physical disk?

Why is split responsibility appealing?  In almost any complex system,
whether it be government or computing, split responsibility results in
indecision and confusion.  Hierarchical decision making based on common
rules is another matter entirely.  Unfortunately, SAN equipment is still
based on technology developed in the early '80s and simply tries to
behave like a more reliable disk drive, rather than as a participating
intelligent component in a system which may detect, tolerate, and
spontaneously correct any faults.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gary Mills wrote:

> On Wed, Dec 10, 2008 at 12:58:48PM -0800, Richard Elling wrote:
>
>> By default, ZFS implements the policy that uncorrectable but important
>> failures may cause it to do an armadillo impression (staying with the
>> animal theme ;-), whereas some other file systems, like UFS, will
>> blissfully ignore them -- putting data at risk.  Later versions of
>> ZFS, particularly that available in Solaris 10 10/08 and all OpenSolaris
>> releases, allow system admins to have better control over these
>> policies.
>
> Yes, that's what I was getting at.  Without redundancy at the ZFS
> level, ZFS can report errors but not correct them.  Of course, with a
> reliable SAN and storage device, those errors will never happen.

"Those errors will never happen" are famous last words.  If you search
the archives here, you will find stories of bad cables, SAN switches
with downrev firmware, HBAs, and RAM problems which were detected by
ZFS.

> Certainly, vendors of these products will claim that they have
> extremely high standards of data integrity.  Data corruption is the
> worst nightmare of storage designers, after all.  It rarely happens,
> although I have seen it on one occasion in a high-quality storage
> device.
>
> The split responsibility model is quite appealing.  I'd like to see ZFS
> address this model.  Is there not a way that ZFS could delegate
> responsibility for both error detection and correction to the storage
> device, at least one more sophisticated than a physical disk?

I'm not really sure what you mean by "split responsibility model."  I
think you will find that previous designs have more (blind?) trust in
the underlying infrastructure.  ZFS is designed to trust, but verify.
 -- richard
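In the trust-but-verify spirit, one practical habit on such a
configuration (sketched with a hypothetical pool name) is a periodic
scrub, which reads and verifies every block and reports any corruption
the SAN path has introduced:

   # Walk the whole pool and verify every checksum; on an unredundant
   # pool this detects, but cannot repair, corruption.
   zpool scrub tank

   # Review the results, including any files with permanent errors.
   zpool status -v tank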
Nicolas Williams
2008-Dec-12 05:56 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Thu, Dec 11, 2008 at 09:54:36PM -0800, Richard Elling wrote:

> I'm not really sure what you mean by "split responsibility model."  I
> think you will find that previous designs have more (blind?) trust in
> the underlying infrastructure.  ZFS is designed to trust, but verify.

I think he means ZFS with HW RAID (even if it's SW RAID; if it's behind
the SAN, then it's as if it were HW RAID from ZFS's point of view).

Nico
-- 
On Thu, Dec 11, 2008 at 10:41:26PM -0600, Bob Friesenhahn wrote:

> On Thu, 11 Dec 2008, Gary Mills wrote:
> > The split responsibility model is quite appealing.  I'd like to see ZFS
> > address this model.  Is there not a way that ZFS could delegate
> > responsibility for both error detection and correction to the storage
> > device, at least one more sophisticated than a physical disk?
>
> Why is split responsibility appealing?  In almost any complex system,
> whether it be government or computing, split responsibility results in
> indecision and confusion.  Hierarchical decision making based on common
> rules is another matter entirely.

Now this becomes semantics.  There still has to be a hierarchy, but it's
split into areas of responsibility.  In the case of ZFS over SAN
storage, the area boundary now is the SAN cable.

> Unfortunately, SAN equipment is still based on technology developed in
> the early '80s and simply tries to behave like a more reliable disk
> drive, rather than as a participating intelligent component in a system
> which may detect, tolerate, and spontaneously correct any faults.

That's exactly what I'm asking.  How can ZFS and SAN equipment be
improved so that they cooperate to make the whole system more reliable?
Converting the SAN storage into a JBOD is not a valid solution.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On Fri, Dec 12, 2008 at 04:30:51PM +1300, Ian Collins wrote:

> Gary Mills wrote:
> > The split responsibility model is quite appealing.  I'd like to see ZFS
> > address this model.  Is there not a way that ZFS could delegate
> > responsibility for both error detection and correction to the storage
> > device, at least one more sophisticated than a physical disk?
>
> Surely that removes one of ZFS's greatest features: end-to-end
> checksums.  All you'd end up with is yet another volume manager.
>
> No matter how good your SAN is, it won't spot a flaky cable or bad RAM.

Of course it will.  There's an error-checking protocol that runs over
the SAN cable.  Memory will detect errors as well.  There's error
checking, or checking and correction, every step of the way.  Better
integration of all of this error checking could be an improvement,
though.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
It really comes down to how much you trust the SAN and transport
technology.  If you're happy that you've got a good SAN, and you have a
transport that guarantees the integrity of the data, then there's no
reason ZFS shouldn't be reliable.

Personally I'd be happier once some of the recovery tools that have been
discussed here are made available, but then I don't often play with
high-end kit.  With good-quality kit I suspect your risk of data loss is
pretty low.
-- 
This message posted from opensolaris.org
Nicolas Williams
2008-Dec-12 20:09 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 01:52:54PM -0600, Gary Mills wrote:

> On Fri, Dec 12, 2008 at 04:30:51PM +1300, Ian Collins wrote:
> > No matter how good your SAN is, it won't spot a flaky cable or bad RAM.
>
> Of course it will.  There's an error-checking protocol that runs over
> the SAN cable.  Memory will detect errors as well.  There's error
> checking, or checking and correction, every step of the way.  Better
> integration of all of this error checking could be an improvement,
> though.

If you can fully trust the SAN, then there's no reason not to run ZFS on
top of it with no ZFS mirrors and no RAID-Z.  Yet at the same time we
see posters worried about ZFS failure modes in the face of corrupted
data.

Which is it: do you trust the SAN, yes or no?  If you do, then you're
saying that you trust your filesystems not to have any failure modes
upon SAN data corruption, because you trust SAN data corruption to be
impossible.

Nico
-- 
Gary Mills wrote:

> On Thu, Dec 11, 2008 at 10:41:26PM -0600, Bob Friesenhahn wrote:
>
>> Why is split responsibility appealing?  In almost any complex system,
>> whether it be government or computing, split responsibility results in
>> indecision and confusion.  Hierarchical decision making based on common
>> rules is another matter entirely.
>
> Now this becomes semantics.  There still has to be a hierarchy, but it's
> split into areas of responsibility.  In the case of ZFS over SAN
> storage, the area boundary now is the SAN cable.

I think I see where you are coming from.  Suppose we make an operational
definition that says a SAN is a transport for block-level data.  Then...

>> Unfortunately, SAN equipment is still based on technology developed in
>> the early '80s and simply tries to behave like a more reliable disk
>> drive, rather than as a participating intelligent component in a system
>> which may detect, tolerate, and spontaneously correct any faults.
>
> That's exactly what I'm asking.  How can ZFS and SAN equipment be
> improved so that they cooperate to make the whole system more reliable?
> Converting the SAN storage into a JBOD is not a valid solution.

ZFS only knows about block devices.  It really doesn't care if that
block device is an IDE disk, USB disk, or something on the SAN.  If you
want ZFS to be able to repair damage that it detects, then ZFS needs to
manage the data redundancy.  If you don't care that ZFS may not be able
to repair damage, then don't configure ZFS with redundancy.  It really
is that simple.

The stack looks something like:

   application
     ---- read(), write(), mmap(), etc. ----
   ZFS
     ---- read(), write(), ioctl(), etc. ----
   block device

Ideally, applications would manage their data integrity, but developers
tend to let file systems or block-level systems manage data integrity.

>> No matter how good your SAN is, it won't spot a flaky cable or bad RAM.
>
> Of course it will.  There's an error-checking protocol that runs over
> the SAN cable.  Memory will detect errors as well.  There's error
> checking, or checking and correction, every step of the way.  Better
> integration of all of this error checking could be an improvement,
> though.

However, there are a number of failure modes which cannot be detected by
such things.  By implementing more end-to-end checking, you can see when
your SAN switch firmware stuffs nulls into your data stream or your disk
reads the data from the wrong sector (for example).  No matter how much
reliability is built into each step of the way, you must trust the
subsystem at each step, and anecdotally, there are many subsystems which
cannot be trusted: disks, arrays, switches, HBAs, memory, etc.

You will find similar end-to-end design elsewhere, particularly in the
security field.
 -- richard