hi folks... I've just been exposed to zfs directly, since I'm trying it out on "a certain 48-drive box with 4 cpus" :-)

I read in the archives the recent "hard drive write cache" thread, in which someone at Sun made the claim that zfs takes advantage of the disk write cache, selectively enabling it and disabling it.

However, that does not seem to be at all true on the system I am testing on (or if it does, it isn't doing it in any kind of effective way).

SunOS test-t[xxxxxx](ahem) 5.11 snv_33 i86pc i386 i86pc

On the following RAIDZ pool:

# zpool status rzpool
  pool: rzpool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rzpool       ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c1t4d0   ONLINE       0     0     0
            c1t5d0   ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t4d0  ONLINE       0     0     0
            c10t5d0  ONLINE       0     0     0

Write performance for large files appears to top out at around 15-20MB/sec, according to zpool iostat.

However, when I manually enable write cache on all the drives involved, performance for the pathological case of

    dd if=/dev/zero of=/rzpool/testfile bs=128k

jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very disappointed to see that was not sustained ;-) )

This kind of performance differential also shows up with "real" load: doing a tar | tar copy of large video files over NFS to the filesystem.

As a comparison, a single disk's dd write performance is around 6MB/sec with no cache, and 30MB/sec with write cache enabled. So the 40-50MB/sec result is kind of disappointing with a **10** disk pool.

Comments?
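The "manually enable write cache" step above is typically done through format's expert mode; a rough sketch of that sequence (not necessarily exactly what was used here, the disk still has to be selected first, and whether the cache menu shows up at all depends on the driver):

    # format -e
    ... (select one of the pool disks, e.g. c0t4d0) ...
    format> cache
    cache> write_cache
    write_cache> enable
    write_cache> quit
    cache> quit
    format> quit

Repeated for each drive in the pool.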
On Fri, Jun 02, 2006 at 12:42:53PM -0700, Philip Brown wrote:
> hi folks...
> I've just been exposed to zfs directly, since I'm trying it out on
> "a certain 48-drive box with 4 cpus" :-)
>
> I read in the archives the recent "hard drive write cache"
> thread, in which someone at Sun made the claim that zfs takes advantage of
> the disk write cache, selectively enabling it and disabling it.
>
> However, that does not seem to be at all true on the system I am testing
> on (or if it does, it isn't doing it in any kind of effective way).
>
> SunOS test-t[xxxxxx](ahem) 5.11 snv_33 i86pc i386 i86pc

That's because you are using really old bits. Upgrade to at least build 38 and everything should work as advertised.

--Bill
Philip Brown writes:
> [...]
>
> Write performance for large files appears to top out at around 15-20MB/sec,
> according to zpool iostat.
>
> However, when I manually enable write cache on all the drives involved,
> performance for the pathological case of
>
>     dd if=/dev/zero of=/rzpool/testfile bs=128k
>
> jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very
> disappointed to see that was not sustained ;-) )

Yes it is; see "Sequential writing is jumping"; should not be too hard to fix though.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6415647

> This kind of performance differential also shows up with "real" load:
> doing a tar | tar copy of large video files over NFS to the filesystem.
>
> As a comparison, a single disk's dd write performance is around 6MB/sec with
> no cache, and 30MB/sec with write cache enabled.
>
> So the 40-50MB/sec result is kind of disappointing with a **10** disk pool.

I don't think RAID-Z is your problem in the above, but if the performance of random read is important, do check this:
http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to

-r
I previously wrote about my scepticism on the claims that zfs selectively enables and disables write cache, to improve throughput over the usual solaris defaults prior to this point.

I posted my observations that this did not seem to be happening in any meaningful way, for my zfs, on build nv33.

I was told, "oh you just need the more modern drivers".

Well, I'm now running S10u2, with
SUNWzfsr 11.10.0,REV=2006.05.18.01.46

I don't see much of a difference. By default, iostat shows the disks grinding along at 10MB/sec during the transfer. However, if I manually enable write_cache on the drives (SATA drives, FWIW), the drive throughput zips up to 30MB/sec during the transfer.

Test case:

# zpool status philpool
  pool: philpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        philpool    ONLINE       0     0     0
          c5t1d0    ONLINE       0     0     0
          c5t4d0    ONLINE       0     0     0
          c5t5d0    ONLINE       0     0     0

# dd if=/dev/zero of=/philpool/testfile bs=256k count=10000

# [run iostat]

The wall clock time for the i/o to quiesce is as expected. Without write cache manually enabled, it takes 3 times as long to finish as with it enabled (1:30 vs 30sec).

[Approximately a 2 gig file is generated. A side note of interest to me is that in both cases the dd returns to the user relatively quickly, but the write goes on for quite a long time in the background... without apparently reserving 2 gigabytes of extra kernel memory, according to swap -s.]
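For anyone repeating the "[run iostat]" step: one way to watch both the per-disk rate and the background drain after dd returns is something like the following (the interval here is arbitrary):

    # ptime dd if=/dev/zero of=/philpool/testfile bs=256k count=10000
    # zpool iostat -v philpool 5
    # iostat -xnz 5

The dd's own wall clock only covers getting the data into memory; the two iostat views are what show when the pool actually quiesces, so keep them running until the write columns drop back to zero.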
> I previously wrote about my scepticism on the claims that zfs selectively
> enables and disables write cache, to improve throughput over the usual
> solaris defaults prior to this point.

I have snv_38 here, with a zpool thus:

bash-3.1# zpool status
  pool: zfs0
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Jun 11 16:17:24 2006
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0

errors: No known data errors

Regardless of what abuse I throw at this, I never seem to see anything happen that indicates that cache is being "toggled" on or off. Furthermore, these are all Sun 36G disks.

> I posted my observations that this did not seem to be happening in any
> meaningful way, for my zfs, on build nv33.
>
> I was told, "oh you just need the more modern drivers".
>
> Well, I'm now running S10u2, with
> SUNWzfsr 11.10.0,REV=2006.05.18.01.46

It's possible that the feature you seek is in snv somewhere and not in that S10 wos, but I am guessing. We would need to look at the changelogs to see where that feature was incorporated in the ZFS bits. Better yet... use the source, Luke!

Dennis
Roch Bourbonnais - Performance Engineering wrote (2006-Jun-15 10:23 UTC), "[zfs-discuss] Re: disk write cache, redux":
I'm puzzled by 2 things.

Naively I'd think a write_cache should not help a throughput test, since the cache should fill up, after which you should still be throttled by the physical drain rate. You clearly show that it helps; anyone know why/how a cache helps throughput?

And the second thing... a quick search, this seems relevant:

    Bug ID: 6397876
    Synopsis: sata drives need default write cache controlled via property
    Integrated in Build: snv_38
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6397876

May have missed U2 though. Sorry about that...

-r
On Jun 15, 2006, at 06:23, Roch Bourbonnais - Performance Engineering wrote:

> Naively I'd think a write_cache should not help a throughput test, since
> the cache should fill up, after which you should still be throttled by the
> physical drain rate. You clearly show that it helps; anyone know why/how a
> cache helps throughput?

7200 RPM disks are typically IOP bound, so the write cache (which can be up to 16MB on some drives) should be able to buffer enough IO to deliver more efficiently on each IOP and also reduce head seek. Not sure which vendors implement write-through when the cache fills, or how detailed the drive cache algos on SATA can go...

Take a look at PSARC 2004/652:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/

.je
Just was on the phone with Andy Bowers. He cleared up that our SATA device drivers need some work: we basically do not have the necessary I/O concurrency at this stage. So the write_cache is actually a good substitute for tagged queuing.

So that explains why we get more throughput _on SATA_ drives from the write_cache; and I guess the other bug explains why ZFS is still not able to benefit from it.

-r
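A quick way to see the concurrency point on your own setup, if you want to check: the actv column of iostat shows how many commands are actually outstanding at each device. With no command queuing and the write cache off you would expect it to hover around 1 per disk, while a queuing-capable path can push it well above that. Something like (interval arbitrary):

    # iostat -xnz 5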
The write cache decouples the actual write to disk from the data transfer from the host. For a streaming operation, this means that the disk can typically stream data onto tracks with almost no latency (because the cache can aggregate multiple I/O operations into full tracks, which can be written without waiting for the right sector to come around).

Disks could do this with "write cache disabled" if they actually used their write cache anyway and simply didn't acknowledge the write immediately, but it appears they don't, perhaps because this would add extra latency compared to actually waiting for the disk to get to the right place and transferring the data (well, one track's worth at least), or perhaps because SATA disks are normally used with the write cache enabled and hence that's the optimized code path.
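Rough numbers behind the "waiting for the right sector to come around" cost, assuming a 7200 RPM drive and no command queuing (back-of-envelope only):

    one revolution at 7200 RPM = 60 s / 7200 = ~8.3 ms
    if each un-cached write ends up waiting out most of a revolution,
    the drive is capped at roughly 1000 / 8.3 = ~120 writes/sec
    at the 128k-256k writes used in the tests above, that is only
    about 15-30 MB/sec per spindle, best case, and the smaller
    per-disk chunks of a raidz stripe push it lower still

which lines up, very roughly, with the uncached numbers reported earlier in the thread.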
Roch Bourbonnais - Performance Engineering wrote:

> Naively I'd think a write_cache should not help a throughput test, since
> the cache should fill up, after which you should still be throttled by the
> physical drain rate. You clearly show that it helps; anyone know why/how a
> cache helps throughput?
>
> And the second thing... a quick search, this seems relevant:
>
>     Bug ID: 6397876
>     Synopsis: sata drives need default write cache controlled via property
>     Integrated in Build: snv_38
>     http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6397876

Well, that just says that the cache settings don't stick across reboot. That doesn't seem to have any bearing on whether zfs toggles write cache or not.

From the sound of things, it sounds like what was previously written on this list was incorrect: I now infer that zfs does NOT do any "smart toggling" of write cache enable/disable on drives that it uses (although it may or may not do some "flush cache" calls at appropriate moments).
Check here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157

-r
Roch wrote:
> Check here:
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157

distilled version:

    vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
    /*...*/
            /*
             * If we own the whole disk, try to enable disk write caching.
             * We ignore errors because it's OK if we can't do it.
             */

Which to me implies, "when a disk pool is mounted/created, enable write cache" (and presumably leave it on indefinitely).

The interesting thing is, dtrace with

    fbt::ldi_ioctl:entry
    {
        printf("ldi_ioctl called with %x\n", args[1]);
    }

says that some kind of ldi_ioctl IS called when I create a test zpool with these sata disks. Specific ioctls called would seem to be:

    x422
    x425
    x42a

and I believe DKIOCSETWCE is x425.

HOWEVER... checking with format -e on those disks says that write cache is NOT ENABLED after this happens.

And interestingly, if I augment the dtrace with

    fbt::sata_set_cache_mode:entry,
    fbt::sata_init_write_cache_mode:entry
    {
        printf("%s called\n", probefunc);
    }

the sata-specific set-cache routines are NOT getting called, according to dtrace anyways....

?
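For anyone who wants to repeat the experiment, here is the same pair of probes rolled into one command line (a sketch; -Z keeps dtrace from bailing out if the sata module's probes don't exist on a given system, and the hex decoding in the comments is my reading of sys/dkio.h):

    # dtrace -Zq \
        -n 'fbt::ldi_ioctl:entry
            {
                /* x422 = DKIOCFLUSHWRITECACHE, x424 = DKIOCGETWCE,
                   x425 = DKIOCSETWCE, x42a = DKIOCGMEDIAINFO */
                printf("ldi_ioctl cmd 0x%x\n", args[1]);
            }' \
        -n 'fbt::sata_set_cache_mode:entry,
            fbt::sata_init_write_cache_mode:entry
            {
                printf("%s called\n", probefunc);
            }'

Leave that running in one window and do the zpool create in another.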
I've got a pretty dumb question regarding SATA and write cache. I don't see options in 'format -e' on SATA drives for checking/setting write cache. I've seen the options for the SCSI driver, but not SATA.

I'd like to help on the SATA write cache enable/disable problem, if I can. What am I missing?

Thanks!

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
I don't believe ZFS toggles write cache on disks on the fly. Rather, write caching is enabled on disks which support this functionality. Then, at appropriate points in the code, an ioctl is called to flush the cache, thereby providing the appropriate data guarantees.

However, this by no means addresses your primary issue, viz. the performance using ZFS on your system is bad.

-Sanjay
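One way to see the "flush at appropriate points" half of this on a live system (a sketch only; x422 is DKIOCFLUSHWRITECACHE in my reading of sys/dkio.h, and this catches flushes from anything going through LDI, not just ZFS): run a fsync-heavy or NFS workload against the pool with something like the following going, then Ctrl-C to see the count:

    # dtrace -n 'fbt::ldi_ioctl:entry
        /args[1] == 0x422/
        {
            /* count write-cache flush ioctls until Ctrl-C */
            @flushes["DKIOCFLUSHWRITECACHE"] = count();
        }'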