Hello zfs-discuss,

It looks like when ZFS issues write cache flush commands, the se3510 actually honors them. I do not have a spare se3510 right now to be 100% sure, but comparing an nfs/zfs server on a se3510 to another nfs/ufs server on a se3510 with "Periodic Cache Flush Time" set to disabled (or a long interval), I can see that cache utilization on nfs/ufs stays at about 48%, while on nfs/zfs it hardly reaches 20% and drops to 0 every few seconds (I guess every txg_time).

nfs/zfs also has worse performance than nfs/ufs.

Does anybody know how to tell the se3510 not to honor write cache flush commands?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
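[One rough way to confirm that the flushes really reach the array is to count SYNCHRONIZE CACHE requests leaving the sd driver with DTrace. This is only a sketch; it assumes the driver's internal routine is named sd_send_scsi_SYNCHRONIZE_CACHE, as in the OpenSolaris source.

    # count SYNCHRONIZE CACHE requests sent by sd over a 10 second window
    dtrace -n 'fbt:sd:sd_send_scsi_SYNCHRONIZE_CACHE:entry { @flushes = count(); } tick-10s { exit(0); }'

A count that spikes roughly every txg interval on the nfs/zfs host, and stays near zero on the nfs/ufs host, would match the cache-utilization numbers above.]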
Hello Robert,

Tuesday, February 6, 2007, 12:55:19 PM, you wrote:

RM> Hello zfs-discuss,
RM>
RM> It looks like when zfs issues write cache flush commands se3510
RM> actually honors it. I do not have right now spare se3510 to be 100%
RM> sure but comparing nfs/zfs server with se3510 to another nfs/ufs
RM> server with se3510 with "Periodic Cache Flush Time" set to disable
RM> or so longer time I can see that cache utilization on nfs/ufs stays
RM> about 48% while on nfs/zfs it's hardly reaches 20% and every few
RM> seconds goes down to 0 (I guess every txg_time).
RM>
RM> nfs/zfs also has worse performance than nfs/ufs.
RM>
RM> Does anybody know how to tell se3510 not to honor write cache flush
RM> commands?

Is there an mdb hack to stop ZFS from issuing write cache flushes? That could be better than disabling it on the array itself, since fewer SCSI commands would be sent.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On Feb 6, 2007, at 06:55, Robert Milkowski wrote:

> Hello zfs-discuss,
>
> It looks like when zfs issues write cache flush commands se3510
> actually honors it. I do not have right now spare se3510 to be 100%
> sure but comparing nfs/zfs server with se3510 to another nfs/ufs
> server with se3510 with "Periodic Cache Flush Time" set to disable
> or so longer time I can see that cache utilization on nfs/ufs stays
> about 48% while on nfs/zfs it's hardly reaches 20% and every few
> seconds goes down to 0 (I guess every txg_time).
>
> nfs/zfs also has worse performance than nfs/ufs.
>
> Does anybody know how to tell se3510 not to honor write cache flush
> commands?

I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array to flush the cache. Gauging from the number of calls that ZFS makes to this vs UFS (fsck, lockfs, mount?), I think you'll see the performance difference, particularly when you hit an NFS COMMIT. (If you don't use vdevs you may see another difference in ZFS, as the only place you'll hit it is on the ZIL.)

btw - you may already know, but you'll also fall back to write-through on the cache if your battery charge drops, and we also recommend setting write-through when you only have a single controller, since a power event could result in data loss. Of course, there's a big performance difference between write-back and write-through cache.

---
.je
Hello Jonathan,

Tuesday, February 6, 2007, 5:00:07 PM, you wrote:

JE> On Feb 6, 2007, at 06:55, Robert Milkowski wrote:

>> Hello zfs-discuss,
>>
>> It looks like when zfs issues write cache flush commands se3510
>> actually honors it. I do not have right now spare se3510 to be 100%
>> sure but comparing nfs/zfs server with se3510 to another nfs/ufs
>> server with se3510 with "Periodic Cache Flush Time" set to disable
>> or so longer time I can see that cache utilization on nfs/ufs stays
>> about 48% while on nfs/zfs it's hardly reaches 20% and every few
>> seconds goes down to 0 (I guess every txg_time).
>>
>> nfs/zfs also has worse performance than nfs/ufs.
>>
>> Does anybody know how to tell se3510 not to honor write cache flush
>> commands?

JE> I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
JE> to flush the cache. Gauging from the amount of calls that zfs makes to
JE> this vs ufs (fsck, lockfs, mount?) - i think you'll see the
JE> performance diff, particularly when you hit an NFS COMMIT.
JE> (If you don't use vdevs you may see another difference in zfs as the
JE> only place you'll hit is on the zil)

IMHO it definitely shouldn't, actually. The array has two controllers and the write cache is mirrored. Also, this is not the only host using that array. You can actually gain a lot of performance, especially with an nfs/zfs setup (lots of synchronous ops), I guess.

IIRC Bill posted here some time ago saying the problem with write cache on the arrays is being worked on.

JE> btw - you may already know, but you'll also fall to write-through on
JE> the cache if your battery charge drops and we also recommend setting
JE> to write-through when you only have a single controller since a power
JE> event could result in data loss. Of course there's a big performance
JE> difference between write-back and write-through cache

I can understand reduced performance with one controller (however, I can force write-back on the se3510 even with one controller - depending on the situation we can argue whether that would be wise or not), but as long as the write cache is protected I definitely do not want ZFS to issue DKIOCFLUSHWRITECACHE.

Ideally ZFS could figure this out itself for recognized arrays. But I would like to be able to set it myself on a per-pool basis anyway (there are always specific situations).

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
> IIRC Bill posted here some time ago saying the problem with write cache
> on the arrays is being worked on.

Yep, the bug is:

6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices

We have a case going through PSARC that will make things work correctly with regard to flushing the write cache and non-volatile caches.

The tricky part is getting vendors to actually support the SYNC_NV bit. If your favorite vendor/array doesn't support it, feel free to give them a call...

eric
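[For reference, SYNC_NV is a single bit in the SYNCHRONIZE CACHE (10) CDB defined by SBC-2. Below is a minimal C sketch of how such a CDB would be built; the macro names are local to the example, not taken from any particular system header.

    #include <stdint.h>
    #include <string.h>

    /*
     * Illustrative SYNCHRONIZE CACHE (10) CDB per SBC-2.
     * SYNC_NV is bit 2 of byte 1: when set, the target may leave the
     * data in non-volatile cache; when clear, it must flush all the
     * way to the medium.  An LBA and block count of zero mean "the
     * entire logical unit".
     */
    #define SYNC_CACHE_10_OP  0x35
    #define SYNC_NV_BIT       0x04

    static void
    build_sync_cache_cdb(uint8_t cdb[10], int nv_ok)
    {
            (void) memset(cdb, 0, 10);
            cdb[0] = SYNC_CACHE_10_OP;
            if (nv_ok)
                    cdb[1] |= SYNC_NV_BIT;  /* "stable storage is enough" */
            /* cdb[2..5] = LBA, cdb[7..8] = block count: zero covers the LUN */
    }

The whole point of the bug above is that an array with protected NVRAM can treat a SYNC_NV flush as a no-op instead of destaging everything to disk.]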
On Feb 6, 2007, at 11:46, Robert Milkowski wrote:

>>> Does anybody know how to tell se3510 not to honor write cache flush
>>> commands?
>
> JE> I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
> JE> to flush the cache. Gauging from the amount of calls that zfs makes to
> JE> this vs ufs (fsck, lockfs, mount?)

correction .. UFS uses _FIOFFS, which is a file ioctl, not a device ioctl - which makes sense given the difference in models. Hence UFS doesn't care whether the device write cache is turned on or off, as it only makes dkio calls for geometry, info and such. You can poke through the code to see what other dkio ioctls are being made by ZFS .. I believe the big difference is due to the design of a closer tie between the underlying devices and the file system. The DKIOCFLUSH PSARC is here:

http://www.opensolaris.org/os/community/arc/caselog/2004/652/spec/

However, I'm not sure if the 3510 maintains a distinction between the entire array cache and the cache for a single LUN/device .. we'd have to dig up one of the firmware engineers for a more definitive answer. Point well taken on shared storage if we're flushing an array cache here :)

---
.je
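[To make the distinction concrete, the device-level flush ZFS relies on is the dkio ioctl below. This is a minimal user-level sketch only; the device path is an example, and some drivers may require the device to be opened with write access.

    #include <sys/types.h>
    #include <sys/dkio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>
    #include <unistd.h>

    /* Ask the device (or the LUN behind it) to flush its write cache. */
    int
    main(void)
    {
            int fd = open("/dev/rdsk/c0t0d0s2", O_RDONLY);

            if (fd < 0) {
                    perror("open");
                    return (1);
            }
            /* NULL argument: synchronous flush of the device write cache */
            if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) < 0)
                    perror("DKIOCFLUSHWRITECACHE");
            (void) close(fd);
            return (0);
    }

UFS's _FIOFFS, by contrast, is issued against a file descriptor on the file system, not against the block device, which is why UFS behaves the same whether the device write cache is on or off.]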
Hello eric,

Tuesday, February 6, 2007, 5:55:23 PM, you wrote:

>> IIRC Bill posted here some time ago saying the problem with write cache
>> on the arrays is being worked on.

ek> Yep, the bug is:
ek> 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE
ek> to SBC-2 devices

Thanks. I see a workaround there (I saw it earlier, but it doesn't apply to the 3510), and I have a question - setting zil_disable to 1 won't actually completely disable cache flushing, right? (The cache would still be flushed every time a txg completes?)

ek> We have a case going through PSARC that will make things work
ek> correctly with regards to flushing the write cache and non-volatile
ek> caches.

There's actually a tunable to disable cache flushes: zfs_nocacheflush, and in older code (like S10U3) it's zil_noflush.

Hmmmm...

ek> The tricky part is getting vendors to actually support SYNC_NV bit.
ek> If your favorite vendor/array doesn't support it, feel free to
ek> give them a call...

Is there any work being done to ensure/check that all arrays Sun sells do support it?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
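[For completeness, flipping such a kernel variable is typically done as shown below, assuming the variable names mentioned above. As the next reply points out, disabling cache flushes on a volatile cache risks pool corruption, so this is a sketch, not a recommendation.

    # live, on the running kernel via mdb (not persistent across reboot):
    echo zfs_nocacheflush/W0t1 | mdb -kw

    # or persistently via /etc/system (takes effect after a reboot):
    set zfs:zfs_nocacheflush = 1

    # older bits such as S10U3 use the zil_noflush variable instead:
    set zfs:zil_noflush = 1
]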
Robert Milkowski wrote, on 02/06/07 11:43:

> Hello eric,
>
> Tuesday, February 6, 2007, 5:55:23 PM, you wrote:
>
>>> IIRC Bill posted here some time ago saying the problem with write cache
>>> on the arrays is being worked on.
>
> ek> Yep, the bug is:
> ek> 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE
> ek> to SBC-2 devices
>
> Thanks. I see a workaround there (I saw it earlier but it doesn't
> apply to 3510) and I have a question - setting zil_disable to 1
> won't actually completely disable cache flushing, right? (still every
> txg group completes cache would be flushed)??
>
> ek> We have a case going through PSARC that will make things work
> ek> correctly with regards to flushing the write cache and non-volatile
> ek> caches.
>
> There's actually a tunable to disable cache flushes:
> zfs_nocacheflush and in older code (like S10U3) it's zil_noflush.

Yes, but we didn't want to publicise this internal switch. (I would not call it a tunable.) We (or at least I) are regretting publicising zil_disable, but using zfs_nocacheflush is worse. If the device is volatile then we can get pool corruption: an uberblock could get written before all of its tree.

Note, zfs_nocacheflush and zil_noflush are not the same. Setting zil_noflush stopped ZIL flushes of the write cache, whereas zfs_nocacheflush will additionally stop flushing for txgs.

> Hmmmm...
>
> ek> The tricky part is getting vendors to actually support SYNC_NV bit.
> ek> If your favorite vendor/array doesn't support it, feel free to
> ek> give them a call...
>
> Is there any work being done to ensure/check that all arrays Sun sells
> do support it?
On Feb 6, 2007, at 10:43 AM, Robert Milkowski wrote:

> Hello eric,
>
> Tuesday, February 6, 2007, 5:55:23 PM, you wrote:
>
>>> IIRC Bill posted here some time ago saying the problem with write cache
>>> on the arrays is being worked on.
>
> ek> Yep, the bug is:
> ek> 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE
> ek> to SBC-2 devices
>
> Thanks. I see a workaround there (I saw it earlier but it doesn't
> apply to 3510) and I have a question - setting zil_disable to 1
> won't actually completely disable cache flushing, right? (still every
> txg group completes cache would be flushed)??

Right, zil_disable is orthogonal to flushing the write cache. With or without the ZIL, ZFS is consistent on disk. Disabling the flushing of the write cache (as Neil just mentioned) *can cause corruption*.

> ek> The tricky part is getting vendors to actually support SYNC_NV bit.
> ek> If your favorite vendor/array doesn't support it, feel free to
> ek> give them a call...
>
> Is there any work being done to ensure/check that all arrays Sun sells
> do support it?

We're working on it from our end, but I'm sure we won't cover all 3rd party vendors...

eric
Robert Milkowski writes:

> Hello Jonathan,
>
> Tuesday, February 6, 2007, 5:00:07 PM, you wrote:
>
> JE> On Feb 6, 2007, at 06:55, Robert Milkowski wrote:
>
> >> Hello zfs-discuss,
> >>
> >> It looks like when zfs issues write cache flush commands se3510
> >> actually honors it. I do not have right now spare se3510 to be 100%
> >> sure but comparing nfs/zfs server with se3510 to another nfs/ufs
> >> server with se3510 with "Periodic Cache Flush Time" set to disable
> >> or so longer time I can see that cache utilization on nfs/ufs stays
> >> about 48% while on nfs/zfs it's hardly reaches 20% and every few
> >> seconds goes down to 0 (I guess every txg_time).
> >>
> >> nfs/zfs also has worse performance than nfs/ufs.
> >>
> >> Does anybody know how to tell se3510 not to honor write cache flush
> >> commands?
>
> JE> I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
> JE> to flush the cache. Gauging from the amount of calls that zfs makes to
> JE> this vs ufs (fsck, lockfs, mount?) - i think you'll see the
> JE> performance diff, particularly when you hit an NFS COMMIT.
> JE> (If you don't use vdevs you may see another difference in zfs as the
> JE> only place you'll hit is on the zil)
>
> IMHO it definitely shouldn't actually. The array has two controllers
> and write cache is mirrored. Also this is not the only host using that
> array. You can actually win much of a performance, especially with
> nfs/zfs setup (lot of synchronous ops) I guess.

Again, it's a question of semantics. The intent of ZFS is to say "put the bits on stable storage". The controller then decides whether the NVRAM qualifies as stable storage (it is dual ported, batteries are up) and can ignore the request. If the battery charge gets low, then the array needs to honor the flush request. There is no way for ZFS to adjust to the array's battery charge. So I'd argue that DKIOCFLUSHWRITECACHE is misnamed.

The work going on is to allow DKIOCFLUSHWRITECACHE to be qualified, to mean either "flush to rust" (which I guess won't be used by ZFS) or "flush to stable storage"; if the NVRAM is considered stable enough by the array, then it will turn the request into a noop.

-r