Hi All,

It's been a while since I touched ZFS. Is the below still the case with ZFS and a hardware RAID array? Do we still need to provide two LUNs from the hardware RAID and then ZFS-mirror those two LUNs?

http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid

Thanks,
Shawn
Shawn Joy wrote:
> Hi All,
>
> It's been a while since I touched ZFS. Is the below still the case with ZFS
> and a hardware RAID array? Do we still need to provide two LUNs from the
> hardware RAID and then ZFS-mirror those two LUNs?
>
> http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid
>
Need, no. Should, yes.

The last two points on that page are key:

"Overall, ZFS functions as designed with SAN-attached devices, but if you expose simpler devices to ZFS, you can better leverage all available features.

In summary, if you use ZFS with SAN-attached devices, you can take advantage of the self-healing features of ZFS by configuring redundancy in your ZFS storage pools even though redundancy is available at a lower hardware level."

If you don't give ZFS any redundancy, you risk losing your pool if there is data corruption.

--
Ian.
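For illustration, a minimal sketch of the two configurations being discussed; the pool name "tank" and the device names (c2t0d0, c2t1d0) are placeholders, not devices from this thread:

    # Option 1: a single SAN LUN -- ZFS detects corruption via checksums,
    # but has no second copy to repair from.
    zpool create tank c2t0d0

    # Option 2 (what the FAQ recommends): two LUNs from the hardware RAID,
    # mirrored by ZFS, so checksum errors on one side can be healed from
    # the other.
    zpool create tank mirror c2t0d0 c2t1d0

    # Check layout and health.
    zpool status tank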
> If you don't give ZFS any redundancy, you risk losing your pool if
> there is data corruption.

Is this the same risk for data corruption as UFS on hardware-based LUNs?

If we present one LUN to ZFS and choose not to ZFS mirror or do a raidz pool of that LUN, is ZFS able to handle disk or RAID controller failures on the hardware array?

Does ZFS handle intermittent controller outages on the RAID controllers the same as UFS would?

Thanks,
Shawn

Ian Collins wrote:
> Shawn Joy wrote:
>> Hi All,
>> It's been a while since I touched ZFS. Is the below still the case
>> with ZFS and a hardware RAID array? Do we still need to provide two
>> LUNs from the hardware RAID and then ZFS-mirror those two LUNs?
>>
>> http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid
>>
> Need, no. Should, yes.
>
> The last two points on that page are key:
>
> "Overall, ZFS functions as designed with SAN-attached devices, but if
> you expose simpler devices to ZFS, you can better leverage all
> available features.
>
> In summary, if you use ZFS with SAN-attached devices, you can take
> advantage of the self-healing features of ZFS by configuring
> redundancy in your ZFS storage pools even though redundancy is
> available at a lower hardware level."
>
> If you don't give ZFS any redundancy, you risk losing your pool if
> there is data corruption.
Shawn Joy wrote:
> Ian Collins wrote:
>> [...]
>> If you don't give ZFS any redundancy, you risk losing your pool if
>> there is data corruption.
>
> Is this the same risk for data corruption as UFS on hardware-based LUNs?
>
Not really. UFS wouldn't notice; ZFS would, and the single-device pool would enter a faulted state.

> If we present one LUN to ZFS and choose not to ZFS mirror or do a
> raidz pool of that LUN, is ZFS able to handle disk or RAID controller
> failures on the hardware array?
>
I guess the only answer is "it depends". A LUN is in effect just another drive, so if the failure is managed by the SAN, ZFS wouldn't know.

> Does ZFS handle intermittent controller outages on the RAID
> controllers the same as UFS would?
>
I haven't used ZFS with a SAN device, but pulling a drive causes ZFS to mark it unavailable and the pool degraded.

--
Ian.
ZFS no longer has the issue where loss of a single device (even intermittently) causes pool corruption. That's been fixed.

That is, there used to be an issue in this scenario:

(1) zpool constructed from a single LUN on a SAN device
(2) SAN experiences a temporary outage, while the ZFS host remains running.
(3) zpool is permanently corrupted, even if no I/O occurred during the outage

This is fixed (around b101, IIRC).

However, ZFS remains much more sensitive to loss of the underlying LUN than UFS, and has a tendency to mark such a LUN as defective during any such SAN outage. It's much more recoverable nowadays, though. Just to be clear, this occasionally occurs when something such as a SAN switch dies, or there is a temporary hiccup in the SAN infrastructure, causing some small (i.e. < 1 minute) loss of connectivity to the underlying LUN.

RAIDZ and mirrored zpools are still the preferred method of arranging things in ZFS, even with hardware RAID backing the underlying LUN (whether the LUN is from a SAN or a local HBA doesn't matter).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Shawn Joy wrote:
>> If you don't give ZFS any redundancy, you risk losing your pool if
>> there is data corruption.
>
> Is this the same risk for data corruption as UFS on hardware-based LUNs?
>
It's a tradeoff. ZFS has more issues with loss of connectivity to the underlying LUN than UFS, while UFS has issues with an inability to detect silent data corruption due to faulty HW writing/reading. I can't quantify the actual occurrence of either to any specific number, so I can't say the risk is more or less.

> If we present one LUN to ZFS and choose not to ZFS mirror or do a
> raidz pool of that LUN, is ZFS able to handle disk or RAID controller
> failures on the hardware array?
>
No. But neither is /any/ other filesystem. At best, if you lose a (non-redundant) hardware RAID controller, you may be able to recover the volume upon replacement of the RAID controller. Maybe. It depends on the failure mode of the RAID controller. If the LUN itself is lost (i.e. a single-disk LUN where the disk goes bad, or a HW-RAID volume where failures exceed redundancy), no filesystem in the universe will help you. I don't see any real difference between UFS and ZFS in these cases.

> Does ZFS handle intermittent controller outages on the RAID
> controllers the same as UFS would?
>
No. This is an issue with ZFS, as I noted in a previous post. Intermittent outages on the SAN have a tendency to cause ZFS to mark the LUN as "failed" - remember that ZFS acts both as a file system and as a software volume manager. Right now, I'm not aware of ways to make ZFS more resilient to a "flaky" SAN infrastructure. Since UFS has no awareness of a LUN's reliability (that would be in SVM), it will happily try to use a device whose underlying LUN has gone away, eventually reporting an inability to complete the relevant transaction to the calling software.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Richard Elling
2009-Oct-10 16:27 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
On Oct 10, 2009, at 8:19 AM, Erik Trimble wrote:
> [...]
>> Does ZFS handle intermittent controller outages on the RAID
>> controllers the same as UFS would?
>>
> No. This is an issue with ZFS, as I noted in a previous post.
> Intermittent outages on the SAN have a tendency to cause ZFS to mark
> the LUN as "failed" - remember that ZFS acts both as a file system
> and as a software volume manager. [...]

By default, ZFS won't see a SAN outage less than 3 minutes long. It will patiently wait for the [s]sd driver to time out... which takes 3 minutes. UFS is the same.

OTOH, if the "outage" is not a complete outage, then error messages will be handled according to the error received. If a sudden batch of I/O failures is received (IIRC the default threshold is 10 in 10 minutes), then the vdev can be marked as degraded. You can see this in the FMA logs.

NB: definitions of the pool states, including "degraded", are in the zpool(1m) man page.

-- richard
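For reference, a minimal sketch of where to look for this on a Solaris host; the pool name "tank" is a placeholder, and the commands are the standard FMA and ZFS tools:

    # Error telemetry received by FMA (I/O retries, timeouts, etc.).
    fmdump -eV

    # Faults actually diagnosed by FMA (e.g. a vdev taken out of service).
    fmadm faulty

    # Pool and vdev states as defined in zpool(1m): ONLINE, DEGRADED, FAULTED...
    zpool status -x tank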
On Oct 10, 2009, at 01:26, Erik Trimble wrote:
> That is, there used to be an issue in this scenario:
>
> (1) zpool constructed from a single LUN on a SAN device
> (2) SAN experiences a temporary outage, while the ZFS host remains running.
> (3) zpool is permanently corrupted, even if no I/O occurred during the outage
>
> This is fixed. (around b101, IIRC)

Was this fix ever back-ported to Solaris 10?
Victor Latushkin
2009-Oct-10 19:00 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
Erik Trimble wrote:
> ZFS no longer has the issue where loss of a single device (even
> intermittently) causes pool corruption. That's been fixed.

Erik, it does not help at all when you talk about some issue being fixed and do not provide the corresponding CR number. It does not allow an interested observer to go and have a look at what exactly that issue was and how it's been fixed, and it does not allow tracking its presence or absence in other releases.

So could you please provide the CR number for the issue you are talking about?

> That is, there used to be an issue in this scenario:
>
> (1) zpool constructed from a single LUN on a SAN device
> (2) SAN experiences a temporary outage, while the ZFS host remains running.
> (3) zpool is permanently corrupted, even if no I/O occurred during the outage
>
> This is fixed. (around b101, IIRC)

You see - you cannot tell exactly when it was fixed yourself. Besides, in the scenario you describe above, a whole lot can be hidden in "SAN experiences temporary outage". It can be as simple as the wrong fiber cable being unplugged, or as complex as a storage array failing, rebooting, and losing its entire cache content as a result.

In the former case I do not see how it could badly affect a ZFS pool. It may cause a panic, if 'failmode' is set to panic (or the software release is too old and does not support this property), and it may require administrator intervention to do 'zpool clear'.

In the latter case the consequences can really be bad - the pool may be corrupted and unopenable. There are several examples of this in the archives, as well as success stories of recovery.

And there's a recovery project to provide support for pool recovery from these corruptions.

> However, ZFS remains much more sensitive to loss of the underlying
> LUN than UFS, and has a tendency to mark such a LUN as defective
> during any such SAN outage. It's much more recoverable nowadays,
> though. Just to be clear, this occasionally occurs when something such
> as a SAN switch dies, or there is a temporary hiccup in the SAN
> infrastructure, causing some small (i.e. < 1 minute) loss of
> connectivity to the underlying LUN.

Again, SANs are very complex structures, and a perceived small loss of connectivity may in reality be a very complex event with difficult-to-predict consequences.

With non-COW filesystems (like UFS) it is indeed less likely that you experience the consequences of a small outage immediately (though they can still manifest themselves much, much later).

ZFS tends to uncover the presence of the consequences much earlier (immediately?). But that does not immediately mean there's an issue with ZFS. There may be an issue somewhere within the SAN infrastructure which was only unavailable for less than a minute.

> RAIDZ and mirrored zpools are still the preferred method of arranging
> things in ZFS, even with hardware RAID backing the underlying LUN
> (whether the LUN is from a SAN or a local HBA doesn't matter).

Fully support this - without redundancy at the ZFS level there's no such benefit as self-healing...

regards,
victor
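For reference, the 'failmode' property mentioned above can be inspected and set per pool; a minimal sketch, assuming a pool named "tank":

    # Show the current failure-mode policy: wait, continue, or panic.
    zpool get failmode tank

    # "continue" returns EIO to new write requests instead of blocking or
    # panicking when the pool's devices become unavailable.
    zpool set failmode=continue tank

    # After connectivity returns, clear the device error state manually.
    zpool clear tank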
Victor Latushkin wrote:
> Erik Trimble wrote:
>> ZFS no longer has the issue where loss of a single device (even
>> intermittently) causes pool corruption. That's been fixed.
>
> Erik, it does not help at all when you talk about some issue being
> fixed and do not provide the corresponding CR number. It does not
> allow an interested observer to go and have a look at what exactly that
> issue was and how it's been fixed, and it does not allow tracking its
> presence or absence in other releases.
>
> So could you please provide the CR number for the issue you are talking about?
>
I went back and dug through some of my email, and the issue showed up as CR 6565042. That was fixed in b77 and s10 update 6.

I'm looking for the related issues of timeout failures and kernel panic before import for missing zpools.

> [...]
>
> ZFS tends to uncover the presence of the consequences much earlier
> (immediately?). But that does not immediately mean there's an issue
> with ZFS. There may be an issue somewhere within the SAN infrastructure
> which was only unavailable for less than a minute.
>
I'm not saying that it's ZFS's fault. I'm saying that ZFS is more sensitive to SAN issues than UFS.

As Richard pointed out earlier, it's unlikely that very small hiccups will have an impact - generally, timeout stops have to be hit in the various underlying drivers.

>> RAIDZ and mirrored zpools are still the preferred method of arranging
>> things in ZFS, even with hardware RAID backing the underlying LUN
>> (whether the LUN is from a SAN or a local HBA doesn't matter).
>
> Fully support this - without redundancy at the ZFS level there's no
> such benefit as self-healing...
>
> regards,
> victor

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
> I went back and dug through some of my email, and the issue showed up as
> CR 6565042.
>
> That was fixed in b77 and s10 update 6.

I looked at this CR; forgive me, but I am not a ZFS engineer. Can you explain, in simple terms, how ZFS now reacts to this? If it does not panic, how does it ensure data is safe?

Also, I just want to ensure everyone is on the same page here. There seem to be some mixed messages in this thread about how sensitive ZFS is to SAN issues.

Do we all agree that creating a zpool out of one device in a SAN environment is not recommended? One should always construct a ZFS mirror or raidz device out of SAN-attached devices, as posted in the ZFS FAQ?
Bob Friesenhahn
2009-Oct-12 00:42 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
On Sun, 11 Oct 2009, Shawn Joy wrote:
>
> Do we all agree that creating a zpool out of one device in a SAN
> environment is not recommended? One should always construct a ZFS
> mirror or raidz device out of SAN-attached devices, as posted in the
> ZFS FAQ?

No, not everyone agrees. Not even the ZFS inventors. As with most things, it is not a black/white issue and there are plenty of valid reasons to put ZFS on a big-LUN SAN device. It does not necessarily end badly.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> I went back and dug through some of my email, and the issue showed up as
> CR 6565042.
>
> That was fixed in b77 and s10 update 6.
>
> I looked at this CR; forgive me, but I am not a ZFS engineer. Can you
> explain, in simple terms, how ZFS now reacts to this? If it does not
> panic, how does it ensure data is safe?

I found some conflicting information.

Infodoc 211349, Solaris[TM] ZFS & Write Failure:

"ZFS will handle the drive failures gracefully as part of the BUG 6322646 fix in the case of non-redundant configurations by degrading the pool instead of initiating a system panic with the help of Solaris[TM] FMA framework."

From Richard's post above:

"NB: definitions of the pool states, including "degraded", are in the zpool(1m) man page.
-- richard"

From the zpool man page located below:

http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?l=en&a=view&q=zpool

"Device Failure and Recovery

ZFS supports a rich set of mechanisms for handling device failure and data corruption. All metadata and data is checksummed, and ZFS automatically repairs bad data from a good copy when corruption is detected.

In order to take advantage of these features, a pool must make use of some form of redundancy, using either mirrored or raidz groups. While ZFS supports running in a non-redundant configuration, where each root vdev is simply a disk or file, this is strongly discouraged. A single case of bit corruption can render some or all of your data unavailable.

A pool's health status is described by one of three states: online, degraded, or faulted. An online pool has all devices operating normally. A degraded pool is one in which one or more devices have failed, but the data is still available due to a redundant configuration. A faulted pool has corrupted metadata, or one or more faulted devices, and insufficient replicas to continue functioning.

The health of the top-level vdev, such as mirror or raidz device, is potentially impacted by the state of its associated vdevs, or component devices. A top-level vdev or component device is in one of the following states:"

So from the zpool man page it seems that it is not possible to put a single-device zpool in a degraded state. Is this correct, or does the fix in bugs 6565042 and 6322646 change this behavior?

> Also, I just want to ensure everyone is on the same page here. There seem to be
> some mixed messages in this thread about how sensitive ZFS is to SAN issues.
>
> Do we all agree that creating a zpool out of one device in a SAN environment is
> not recommended? One should always construct a ZFS mirror or raidz device out
> of SAN-attached devices, as posted in the ZFS FAQ?

The zpool man page seems to agree with this. Is this correct?
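For reference, a quick way to see the health states the man page describes; the pool name "tank" is a placeholder:

    # One-line health per pool: ONLINE, DEGRADED, or FAULTED.
    zpool list -H -o name,health

    # Per-vdev detail, including READ/WRITE/CKSUM error counters.
    zpool status -v tank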
Bob Friesenhahn
2009-Oct-12 01:31 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
On Sun, 11 Oct 2009, Shawn Joy wrote:
>
> So from the zpool man page it seems that it is not possible to put a
> single-device zpool in a degraded state. Is this correct, or does the
> fix in bugs 6565042 and 6322646 change this behavior?

It is true that it is not possible to use the pool if the device that it is based on is inaccessible or missing. If the device totally fails or scrambles its data, then the pool is permanently lost.

>> Also, I just want to ensure everyone is on the same page here. There
>> seem to be some mixed messages in this thread about how sensitive
>> ZFS is to SAN issues.
>>
>> Do we all agree that creating a zpool out of one device in a SAN
>> environment is not recommended? One should always construct a ZFS
>> mirror or raidz device out of SAN-attached devices, as posted in
>> the ZFS FAQ?
>
> The zpool man page seems to agree with this. Is this correct?

In life there are many things that we "should do" (but often don't). There are always trade-offs. If you need your pool to be able to operate with a device missing, then the pool needs to have sufficient redundancy to keep working. If you want your pool to survive if a disk gets crushed by a wayward fork lift, then you need to have redundant storage so that the data continues to be available.

If the devices are on a SAN and you want to be able to continue operating while there is a SAN failure, then you need to have redundant SAN switches, redundant paths, and redundant storage devices, preferably in a different chassis.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
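A sketch of the last point, assuming one LUN is presented from each of two separate arrays over separate controllers; the pool name and device names (c4t0d0 on array A, c5t0d0 on array B) are placeholders:

    # Mirror a LUN from array A against a LUN from array B, so the pool
    # can survive the loss of either array or either path.
    zpool create tank mirror c4t0d0 c5t0d0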
>>>>> "sj" == Shawn Joy <shawn.joy at sun.com> writes:

    sj> Can you explain, in simple terms, how ZFS now reacts
    sj> to this?

I can't. :)

I think Victor's long message made a lot of sense. The failure modes with a SAN are not simple. At least there is the difference of whether the target's write buffer was lost after a transient failure or not, and the current storage stack assumes it's never lost.

IMHO, SANs are in general broken by design because their software stacks don't deal predictably with common network failure modes (like the target rebooting but the initiator staying up). The standard that would qualify to me as "deal predictably" would be what NFS provides:

 * Writes are double-cached on client and server, so the client can
   replay them if the server crashes.

   To my limited knowledge, no SAN stack does this. Expensive SANs can
   limit the amount of data at risk with NVRAM, but it seems like there
   would always be a little bit of data in flight. A cost-conscious
   Solaris iSCSI target will put a quite large amount of data at risk
   between sync-cache commands. This is okay, just as it's okay for NFS
   servers, but only if all the initiators reboot whenever the target
   reboots.

   Doing the client-side part of the double-caching is a little tricky
   because I think you really want to do it pretty high in the storage
   stack, maybe in ZFS rather than in the initiator, or else you will be
   triple-caching a TXG (twice on the client, once on the server), which
   can be pretty big. This means introducing the idea that a sync-cache
   command can fail, and that when it does, none/some/all of the writes
   between the last sync-cache that succeeded and the current one that
   failed may have been silently lost, even if those write commands were
   ack'd as successful when they were issued.

 * The best current practice for NFS mounts is 'hard,intr', meaning:
   retry forever if there is a failure. If you want to stop retrying,
   whatever app was doing the writing gets killed. This rule means any
   database file that got "intr'd" will be crash-consistent.

   The SAN equivalent of 'intr' would be force-unmounting the filesystem
   (and force-unmounting implies either killing processes with open files
   or giving persistent errors to any open filehandles). I'm pretty sure
   no SAN stack does this intentionally whenever it's needed---rather it
   just sort of happens sometimes, depending on how errors percolate
   upwards through various nested cargo-cult timeouts.

   I guess it would be easy to add to a first order---just make SAN
   targets stay down forever after they bounce, until ZFS marks them
   offline. The tricky part is the complaints you get after: "how do I
   add this target back without rebooting?", "do I really have to
   resilver? It's happening daily so I'm basically always resilvering.",
   "we are going down twice a day because of harmless SAN glitches that
   we never noticed before---is this really necessary?" I think I
   remember some post that made it sound like people were afraid to touch
   any of the storage exception handling because no one knows what cases
   are really captured by the many levels of timeouts and retries.

In short, to me it sounds like the retry state machines of SAN initiators are broken by design, across the board. They make the same assumption they did for local storage: the only time data in a target's write buffer will get lost is during a crash-reboot. This is wrong not only for SANs but also for hot-pluggable drives, which can have power sags that get wrongly treated the same way as CRC errors on the data cable.

It's possible to get it right, like NFS is right, but instead the popular fix with most people is to leave the storage stack broken and make ZFS more resilient to this type of corruption, like other filesystems are, because resilience is good, and people are always twitchy and frightened and not expecting strictly consistent behavior around their SANs anyway, so the problem is rare.

So far SAN targets have been proprietary, so vendors are free to conceal this problem with protocol tweaks, expensive NVRAMs, and undefended or fuzzed advice given through their support channels to their paranoid, accepting sysadmins. Whatever free and open targets behaved differently were assumed to be "immature." Hopefully now that SANs are opening up, this SAN write hole will finally get plugged somehow, maybe with one of the two * points above; and if we were to pick the second *, then we'd probably need some notion of a "target boot cookie" so we only take the 'intr'-like force-unmount path in the cases where it's really needed.

    sj> Do we all agree that creating a zpool out of one device in a
    sj> SAN environment is not recommended.

This is still a good question. The stock response is "ZFS needs to manage at least one layer of <blah blah>", but this problem (SAN target reboots while the initiator does not) isn't unexplained storage chaos or cosmic bitflip gremlins.

Does anyone know whether how much zpool redundancy helps with this type of event has changed before/after b77?
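For comparison, the NFS behaviour described above corresponds to the standard Solaris mount options; the server name and paths below are placeholders:

    # "hard" retries forever across a server outage; "intr" allows the
    # blocked application to be killed, leaving its files crash-consistent.
    mount -F nfs -o hard,intr server:/export/data /mnt/data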
> In life there are many things that we "should do" (but often don't).
> There are always trade-offs. If you need your pool to be able to
> operate with a device missing, then the pool needs to have sufficient
> redundancy to keep working. If you want your pool to survive if a
> disk gets crushed by a wayward fork lift, then you need to have
> redundant storage so that the data continues to be available.
>
> If the devices are on a SAN and you want to be able to continue
> operating while there is a SAN failure, then you need to have
> redundant SAN switches, redundant paths, and redundant storage
> devices, preferably in a different chassis.

Yes, of course. This is part of normal SAN design.

The ZFS file system is what is different here. If an HBA, fibre cable, or redundant controller fails, or firmware issues occur on an array's redundant controller, then SSTM (MPxIO) will see the issue and try to fail things over to the other controller. Of course, this reaction at the SSTM level takes time. UFS simply allows this to happen. It is my understanding that ZFS can have issues with this, hence the reason why a ZFS mirror or raidz device is required.

I am still not clear how the above-mentioned bugs change the behavior of ZFS, and whether they change the recommendations of the zpool man page.

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
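For reference, a minimal sketch of checking what MPxIO (scsi_vhci) currently sees for the SAN paths, using the standard Solaris multipath tools:

    # List multipathed logical units and their operational path counts.
    mpathadm list lu

    # Show the mapping between non-MPxIO and MPxIO (STMS) device names.
    stmsboot -L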
Bob Friesenhahn
2009-Oct-13 14:21 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
On Tue, 13 Oct 2009, Shawn Joy wrote:
> The ZFS file system is what is different here. If an HBA, fibre
> cable, or redundant controller fails, or firmware issues occur on an
> array's redundant controller, then SSTM (MPxIO) will see the issue and
> try to fail things over to the other controller. Of course, this
> reaction at the SSTM level takes time. UFS simply allows this to
> happen. It is my understanding that ZFS can have issues with this,
> hence the reason why a ZFS mirror or raidz device is required.

ZFS does not seem so different from UFS when it comes to a SAN. ZFS depends on the underlying device drivers to detect and report problems. UFS does the same. MPxIO's response will also depend on the underlying device drivers.

My own reliability concerns regarding a "SAN" are due to the big-LUN that SAN hardware usually emulates and not due to communications in the "SAN". A big-LUN is comprised of multiple disk drives. If the SAN storage array has an error, then it is possible that the data on one of these disk drives will be incorrect, and it will be hidden somewhere in that big LUN. The data could be old data rather than just being "corrupted". Without redundancy, ZFS will detect this corruption but will be unable to repair it. The difference from UFS is that UFS might not even notice the corruption, or fsck will just paper it over. UFS filesystems are usually much smaller than ZFS pools.

There are also performance concerns when using a big-LUN, because ZFS won't be able to intelligently schedule I/O for multiple drives, so performance is reduced.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
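A sketch of how that detection surfaces in practice, assuming a pool named "tank" built on one big LUN:

    # Walk every block in the pool and verify checksums against the LUN.
    zpool scrub tank

    # On a non-redundant pool, corrupted blocks show up as CKSUM errors and
    # a list of affected files, but they cannot be repaired automatically.
    zpool status -v tank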
Andrew Gabriel
2009-Oct-13 14:49 UTC
[zfs-discuss] Does ZFS work with SAN-attached devices?
Bob Friesenhahn wrote:
> My own reliability concerns regarding a "SAN" are due to the big-LUN
> that SAN hardware usually emulates and not due to communications in the
> "SAN". A big-LUN is comprised of multiple disk drives. If the SAN
> storage array has an error, then it is possible that the data on one of
> these disk drives will be incorrect, and it will be hidden somewhere in
> that big LUN. The data could be old data rather than just being
> "corrupted". Without redundancy, ZFS will detect this corruption but
> will be unable to repair it. The difference from UFS is that UFS might
> not even notice the corruption, or fsck will just paper it over. UFS
> filesystems are usually much smaller than ZFS pools.
>
> There are also performance concerns when using a big-LUN, because ZFS
> won't be able to intelligently schedule I/O for multiple drives, so
> performance is reduced.

Also, ZFS does things like putting the ZIL data (when not on a dedicated device) at the outer edge of disks, that being faster. When you have a LUN which doesn't map onto the standard performance profile of a disk, this optimisation is lost.

I give talks on ZFS to Enterprise customers, and this area is something I cover. Where possible, give ZFS visibility of redundancy, and as many LUNs as you can. However, we have to recognise that this isn't always possible. In many enterprises, storage is managed by teams separate from the server teams (this is a legal requirement in some industry sectors in some countries, typically finance), often with very little cooperation between teams, indeed even rivalry. If we said ZFS _had_ to handle lots of LUNs and the data redundancy, it would never get through many data centre doors, so we do have to work in this environment.

Even where customers can't make use of some of the features such as self-healing of data corruption, I/O scheduling, etc., because of their company storage infrastructure limitations, there's still a ton of other goodness in there, with the ease of creating filesystems, snapshots, etc., and we will at least let them know when their multi-million dollar storage system silently drops a bit, which such systems tend to do far more often than most customers realise.

--
Andrew
> Also, ZFS does things like putting the ZIL data (when not on a dedicated
> device) at the outer edge of disks, that being faster.

No, ZFS does not do that. It will chain the intent log from blocks allocated from the same metaslabs that the pool is allocating from. This actually works out well, because there isn't a large seek back to the beginning of the device. When the pool gets near full there will be noticeable slowness - but then all filesystems' performance suffers when searching for space.

When the log is on a separate device it uses the same allocation scheme, but those blocks will tend to be allocated at the outer edge of the disk. They only exist for a short time before getting freed, so the same blocks get re-used.

Neil.
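For reference, a separate log device is added to an existing pool like this; the pool name "tank" and device c4t0d0 are placeholders (typically a low-latency device such as an SSD or NVRAM-backed LUN):

    # Dedicate a separate intent-log (slog) device to the pool.
    zpool add tank log c4t0d0

    # The log vdev appears under a "logs" section in the status output.
    zpool status tank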
Just to put closure to this discussion about how CRs 6565042 and 6322646 change how ZFS functions in the below scenario:

> ZFS no longer has the issue where loss of a single device (even
> intermittently) causes pool corruption. That's been fixed.
>
> That is, there used to be an issue in this scenario:
>
> (1) zpool constructed from a single LUN on a SAN device
> (2) SAN experiences a temporary outage, while the ZFS host remains running.
> (3) zpool is permanently corrupted, even if no I/O occurred during the outage
>
> This is fixed. (around b101, IIRC)
>
> I went back and dug through some of my email, and the issue showed up as
> CR 6565042.
>
> That was fixed in b77 and s10 update 6.

After doing further research, and speaking with the CR engineers, the CR changes seem to be included in an overall fix for ZFS panic situations. The zpool can still go into a degraded or faulted state, which will require manual intervention by the user.

This fix was discussed above in information from infodoc 211349, Solaris[TM] ZFS & Write Failure:

"ZFS will handle the drive failures gracefully as part of the BUG 6322646 fix in the case of non-redundant configurations by degrading the pool instead of initiating a system panic with the help of Solaris[TM] FMA framework."
>>>>> "sj" == Shawn Joy <shawn.joy at sun.com> writes:

    sj> "ZFS will handle the drive failures gracefully as part of the
    sj> BUG 6322646 fix in the case of non-redundant configurations by
    sj> degrading the pool instead of initiating a system panic with
    sj> the help of Solaris[TM] FMA

The problem was not system panics. It was lost pools.
Prior to this fix, ZFS would panic the system in order to avoid data corruption and loss of the zpool. Now the pool goes into a degraded or faulted state, and one can "try" the zpool clear command to correct the issue. If this does not succeed, a reboot is required.
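A minimal sketch of that manual intervention, assuming a pool named "tank":

    # After the SAN outage ends, see which pools were marked unhealthy.
    zpool status -x

    # Clear the device error counts and attempt to resume normal operation.
    zpool clear tank

    # If the pool is still FAULTED after the clear, a reboot may be
    # required, as noted above.
    zpool status tank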