The paragraph below is from the ZFS admin guide:

Traditional Volume Management

As described in "ZFS Pooled Storage" on page 18, ZFS eliminates the need for
a separate volume manager. ZFS operates on raw devices, so it is possible to
create a storage pool comprised of logical volumes, either software or
hardware. This configuration is not recommended, as ZFS works best when it
uses raw physical devices. Using logical volumes might sacrifice performance,
reliability, or both, and should be avoided.

Does this mean EMC/Hitachi and other SAN-provisioned storage (RAID LUNs) is
not suitable storage for ZFS? Please clarify, thanks.
Hello tester,

Tuesday, April 17, 2007, 11:09:34 AM, you wrote:

t> The paragraph below is from the ZFS admin guide:
t>
t> Traditional Volume Management
t>
t> As described in "ZFS Pooled Storage" on page 18, ZFS eliminates the need
t> for a separate volume manager. ZFS operates on raw devices, so it is
t> possible to create a storage pool comprised of logical volumes, either
t> software or hardware. This configuration is not recommended, as ZFS works
t> best when it uses raw physical devices. Using logical volumes might
t> sacrifice performance, reliability, or both, and should be avoided.
t>
t> Does this mean EMC/Hitachi and other SAN-provisioned storage (RAID
t> LUNs) is not suitable storage for ZFS? Please clarify

It's about *software* volume managers. It means that it's not generally
recommended to create a RAID volume using VxVM or SVM and then put ZFS on
top of it. It will work; it's just not recommended. In some situations it
would actually make sense (like creating RAID-5 in SVM or VxVM and putting
ZFS on top of it if you need much better random read I/O throughput).

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
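[To make the distinction concrete, here is a minimal sketch; the device and
metadevice names are hypothetical. It contrasts the recommended layout, where
ZFS gets the raw disks and provides the redundancy itself, with the
discouraged one, where ZFS sits on top of an SVM RAID-5 metadevice.]

  # Recommended: hand ZFS the raw disks and let it provide the redundancy.
  zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

  # Works, but discouraged by the admin guide: build a RAID-5 metadevice
  # in SVM and layer a ZFS pool on top of it.
  metainit d10 -r c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0
  zpool create tank /dev/md/dsk/d10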
Well, no; his quote did say "software or hardware". The theory is apparently
that ZFS can do better at detecting (and, with redundancy, correcting) errors
if it's dealing with raw hardware, or as nearly so as possible. Most SANs
_can_ hand out raw LUNs as well as RAID LUNs; the folks that run them are
just not used to doing it.

Another issue that may come up with SANs and/or hardware RAID: supposedly,
storage systems with large non-volatile caches will tend to have poor
performance with ZFS, because ZFS issues cache flush commands as part of
committing every transaction group; this is worse if the filesystem is also
being used for NFS service. Most such hardware can be configured to ignore
cache flushing commands, which is safe as long as the cache is non-volatile.

The above is simply my understanding of what I've read; I could be way off
base, of course.
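[As an aside, there is also a host-side workaround for the cache-flush issue:
later Solaris builds include a zfs_nocacheflush tunable that stops ZFS from
issuing the flushes at all. This is a blunt instrument, and it is only safe
if every device in every pool on the host has a non-volatile cache. A hedged
sketch, assuming a build that ships the tunable:]

  # /etc/system -- only safe when ALL pool devices have non-volatile
  # (battery-backed) write caches; affects every pool on the host.
  set zfs:zfs_nocacheflush = 1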
Richard L. Hamilton writes:

> Well, no; his quote did say "software or hardware". The theory is apparently
> that ZFS can do better at detecting (and, with redundancy, correcting) errors
> if it's dealing with raw hardware, or as nearly so as possible. Most SANs
> _can_ hand out raw LUNs as well as RAID LUNs; the folks that run them are
> just not used to doing it.
>
> Another issue that may come up with SANs and/or hardware RAID: supposedly,
> storage systems with large non-volatile caches will tend to have poor
> performance with ZFS, because ZFS issues cache flush commands as part of
> committing every transaction group; this is worse if the filesystem is also
> being used for NFS service. Most such hardware can be configured to ignore
> cache flushing commands, which is safe as long as the cache is non-volatile.
>
> The above is simply my understanding of what I've read; I could be way off
> base, of course.

Sounds good to me. The first point is easy to understand. If you rely on ZFS
for data reconstruction, carve virtual LUNs out of your storage, and mirror
those LUNs in ZFS, then it's possible that both copies of a mirrored block
end up on a single physical device.

Performance-wise, the ZFS I/O scheduler might interact in interesting ways
with the one in the storage, but I don't know if this has been studied in
depth.

-r
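[A hedged illustration of the pitfall Roch describes; the LUN names are
hypothetical. If both sides of a mirror are virtual LUNs carved from the same
array, the array may place them on the same physical spindles; taking each
side from a different array or disk group avoids that.]

  # Risky: both LUNs come from the same array/disk group, so one physical
  # failure can take out both halves of the mirror.
  zpool create tank mirror c2t0d0 c2t1d0

  # Better: mirror LUNs known to live on separate physical disks, e.g.
  # LUNs from two different arrays.
  zpool create tank mirror c2t0d0 c3t0d0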
Hello Richard,

Wednesday, April 18, 2007, 7:35:24 AM, you wrote:

RLH> Well, no; his quote did say "software or hardware".

Right, I missed that.

RLH> The theory is apparently that ZFS can do better at detecting (and with
RLH> redundancy, correcting) errors if it's dealing with raw hardware, or as
RLH> nearly so as possible. Most SANs _can_ hand out raw LUNs as well as
RLH> RAID LUNs, the folks that run them are just not used to doing it.

Detecting errors in ZFS is the same regardless of where redundancy is done:
by default, checksums are checked for each block in every pool configuration.
Correction is another story; basically you need to create a redundant pool
(the exceptions are metadata, and, with the introduction of ditto blocks for
user data, user data to some extent).

Now, when it comes to HW RAID, I wouldn't actually recommend doing RAID in
ZFS, at least not always and not now. The first reason is the lacking hot
spare support in ZFS right now. In many scenarios when one disk goes wild,
ZFS won't really notice and your hot spare won't kick in, and you end up
doing silly things while your pool is not serving data. As I understand it,
the hot spare problem is being worked on.

Then, if you want RAID-5 (yes, people want RAID-5), ZFS could give you much
lower performance for some workloads, and for such workloads HW RAID-5 (or
even SVM or VxVM RAID-5) with ZFS on top as a file system (optionally with
dynamic striping between HW RAID-5 LUNs) actually makes sense. If you need
RAID-10 and you need a lot of bandwidth for sequential writes (or even
non-sequential writes, in the case of ZFS), doing it in software will halve
your actual performance in most cases, as you have to move twice as much
data when doing software RAID.

Also, exposing disks as LUNs is not that well tested on arrays, simply
because it's been used much less. For example, I had a problem with an EMC
CX3-40 when each disk was exposed as a LUN and a RAID-10 ZFS pool was
created on them: when I pulled out a disk, the array didn't catch it at the
LUN level and neither did the host. I/Os were queuing up; I waited for about
an hour (this was a test system) and nothing happened. Of course the hot
spare didn't kick in. SVM, VxVM or HW hot spares just work. ZFS hot spare
support is far behind right now. It has better potential, but it just
doesn't work properly in many failure scenarios.

RLH> Another issue that may come up with SANs and/or hardware RAID:
RLH> supposedly, storage systems with large non-volatile caches will tend to
RLH> have poor performance with ZFS, because ZFS issues cache flush commands
RLH> as part of committing every transaction group; this is worse if the
RLH> filesystem is also being used for NFS service. Most such hardware can be
RLH> configured to ignore cache flushing commands, which is safe as long as
RLH> the cache is non-volatile.

In most arrays (if not in all), exposing a physical disk as a LUN won't solve
the above, so it's not a choice between doing RAID in HW or in ZFS.

As always, if you really care about performance and availability you have to
know what you are doing. And while ZFS does some miracles in some
environments, it actually makes sense to do RAID in HW or another volume
manager and use ZFS solely as a file system.

When doing RAID in HW and using ZFS only as a file system, I would recommend
always exposing three LUNs or more (or at least two) and then doing dynamic
striping on the ZFS side. Of course those LUNs should be on different
physical disks. That way you should have better protection for your metadata.
-- 
Best regards,
Robert Milkowski               mailto:rmilkowski at task.gda.pl
                               http://milek.blogspot.com
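[A minimal sketch of the layout Robert recommends; the LUN names are
hypothetical. Several hardware RAID LUNs, ideally on different physical
disks, are presented to ZFS as a plain dynamic stripe, so ZFS acts only as
the file system while the array handles the redundancy.]

  # Three hardware RAID LUNs from the array, dynamically striped by ZFS.
  # With three or more top-level vdevs, ZFS can place its ditto copies of
  # metadata on different LUNs, which is the extra metadata protection
  # mentioned above.
  zpool create tank c2t0d0 c2t1d0 c2t2d0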
Is cache flushing part of the SCSI protocol? If not, how does ZFS become
aware of the array-specific command?

Thanks
Yes, it is:

SYNCHRONIZE_CACHE(10)  opcode 0x35
SYNCHRONIZE_CACHE(16)  opcode 0x91

-- leon
On Apr 18, 2007, at 2:35 AM, Richard L. Hamilton wrote:

> Well, no; his quote did say "software or hardware". The theory is apparently
> that ZFS can do better at detecting (and, with redundancy, correcting) errors
> if it's dealing with raw hardware, or as nearly so as possible. Most SANs
> _can_ hand out raw LUNs as well as RAID LUNs; the folks that run them are
> just not used to doing it.
>
> Another issue that may come up with SANs and/or hardware RAID: supposedly,
> storage systems with large non-volatile caches will tend to have poor
> performance with ZFS, because ZFS issues cache flush commands as part of
> committing every transaction group; this is worse if the filesystem is also
> being used for NFS service. Most such hardware can be configured to ignore
> cache flushing commands, which is safe as long as the cache is non-volatile.

The non-volatile cache issues are being covered by:

6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to
SBC-2 devices
PSARC 2007/053

The PSARC case has been approved, and Grant is finishing up the code changes.

eric
Hello eric,

Wednesday, April 18, 2007, 10:53:59 PM, you wrote:

ek> The non-volatile cache issues are being covered by:
ek>
ek> 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE
ek> CACHE to SBC-2 devices
ek> PSARC 2007/053
ek>
ek> The PSARC case has been approved, and Grant is finishing up the code
ek> changes.

Has an analysis of the most common storage systems been done on how they
treat the SYNC_NV bit and whether any additional tweaking is needed? Would
such an analysis be publicly available?

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
> Has an analysis of the most common storage systems been done on how they
> treat the SYNC_NV bit and whether any additional tweaking is needed? Would
> such an analysis be publicly available?

I am not aware of any analysis and would love to see it done (I'm sure any
vendors who are lurking on this list and support SYNC_NV would surely want
to speak up now).

Because not every vendor supports SYNC_NV, our solution is to first see if
SYNC_NV is supported, and if not, provide a config file (as a short-term
necessity) in which you can hardcode certain products to act as if they
support SYNC_NV (in which case we would not send a flush of the cache). If
the SYNC_NV bit is not supported and the config file is not updated for the
device, then we do what we do today.

But if anyone knows for certain whether a particular device supports
SYNC_NV, please post...

eric
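[Purely as an illustration of the kind of config-file override eric
describes: the mechanism had not shipped at the time of this thread, so the
property name and the vendor/product strings below are hypothetical. An
entry in sd.conf might look something like this.]

  # /kernel/drv/sd.conf (hypothetical sketch)
  # Tell the sd driver to treat the matching devices' write cache as
  # non-volatile, so cache flushes can be handled accordingly.
  sd-config-list = "VENDOR  PRODUCT-ID", "cache-nonvolatile:true";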
Hello eric,

Friday, April 20, 2007, 3:36:20 PM, you wrote:

ek> I am not aware of any analysis and would love to see it done (I'm sure
ek> any vendors who are lurking on this list and support SYNC_NV would
ek> surely want to speak up now).
ek>
ek> Because not every vendor supports SYNC_NV, our solution is to first see
ek> if SYNC_NV is supported, and if not, provide a config file (as a
ek> short-term necessity) in which you can hardcode certain products to act
ek> as if they support SYNC_NV (in which case we would not send a flush of
ek> the cache). If the SYNC_NV bit is not supported and the config file is
ek> not updated for the device, then we do what we do today.
ek>
ek> But if anyone knows for certain whether a particular device supports
ek> SYNC_NV, please post...

Why a config file and not a property of the pool? Ahhhh.... a pool can have
disks from different arrays :)

A useful thing would be to be able to keep that config file in the pool, so
if one exports/imports it on a different server... you get the idea.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com