hi folks... I've just been exposed to zfs directly, since I'm trying it out on "a certain 48-drive box with 4 cpus" :-)

I read in the archives the recent "hard drive write cache" thread, in which someone at Sun made the claim that zfs takes advantage of the disk write cache, selectively enabling it and disabling it.

However, that does not seem to be at all true on the system I am testing on (or if it does, it isn't doing it in any kind of effective way).

SunOS test-t[xxxxxx](ahem) 5.11 snv_33 i86pc i386 i86pc

On the following RAIDZ pool:

# zpool status rzpool
  pool: rzpool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rzpool       ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c1t4d0   ONLINE       0     0     0
            c1t5d0   ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t4d0  ONLINE       0     0     0
            c10t5d0  ONLINE       0     0     0

Write performance for large files appears to top out at around 15-20MB/sec, according to zpool iostat.

However, when I manually enable write cache on all the drives involved, performance for the pathological case of

    dd if=/dev/zero of=/rzpool/testfile bs=128k

jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very disappointed to see that was not sustained ;-) )

This kind of performance differential also shows up with "real" load: doing a tar | tar copy of large video files over NFS to the filesystem.

As a comparison, a single disk's dd write performance is around 6MB/sec with no cache, and 30MB/sec with write cache enabled. So the 40-50MB/sec result is kind of disappointing with a **10** disk pool.

Comments?
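The "manually enable write cache" step above is typically done through format's expert mode; a rough sketch of that sequence (not necessarily exactly what was used here, the disk still has to be selected first, and whether the cache menu shows up at all depends on the driver):

    # format -e
    ... (select one of the pool disks, e.g. c0t4d0) ...
    format> cache
    cache> write_cache
    write_cache> enable
    write_cache> quit
    cache> quit
    format> quit

Repeated for each drive in the pool.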
On Fri, Jun 02, 2006 at 12:42:53PM -0700, Philip Brown wrote:
> hi folks...
> I've just been exposed to zfs directly, since I'm trying it out on
> "a certain 48-drive box with 4 cpus" :-)
>
> I read in the archives the recent "hard drive write cache"
> thread, in which someone at Sun made the claim that zfs takes advantage of
> the disk write cache, selectively enabling it and disabling it.
>
> However, that does not seem to be at all true on the system I am testing
> on (or if it does, it isn't doing it in any kind of effective way).
>
> SunOS test-t[xxxxxx](ahem) 5.11 snv_33 i86pc i386 i86pc

That's because you are using really old bits. Upgrade to at least build 38 and everything should work as advertised.

--Bill
Philip Brown writes:
> [...]
>
> Write performance for large files appears to top out at around 15-20MB/sec,
> according to zpool iostat.
>
> However, when I manually enable write cache on all the drives involved,
> performance for the pathological case of
>
>     dd if=/dev/zero of=/rzpool/testfile bs=128k
>
> jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very
> disappointed to see that was not sustained ;-) )

Yes it is; see "Sequential writing is jumping"; should not be too hard to fix though.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6415647

> This kind of performance differential also shows up with "real" load:
> doing a tar | tar copy of large video files over NFS to the filesystem.
>
> As a comparison, a single disk's dd write performance is around 6MB/sec with
> no cache, and 30MB/sec with write cache enabled.
>
> So the 40-50MB/sec result is kind of disappointing with a **10** disk pool.

I don't think RAID-Z is your problem in the above, but if the performance of random read is important, do check this:
http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to

-r
I previously wrote about my scepticism on the claims that zfs selectively enables and disables write cache, to improve throughput over the usual solaris defaults prior to this point.

I posted my observations that this did not seem to be happening in any meaningful way, for my zfs, on build nv33.

I was told, "oh you just need the more modern drivers".

Well, I'm now running S10u2, with
SUNWzfsr 11.10.0,REV=2006.05.18.01.46

I don't see much of a difference. By default, iostat shows the disks grinding along at 10MB/sec during the transfer. However, if I manually enable write_cache on the drives (SATA drives, FWIW), the drive throughput zips up to 30MB/sec during the transfer.

Test case:

# zpool status philpool
  pool: philpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        philpool    ONLINE       0     0     0
          c5t1d0    ONLINE       0     0     0
          c5t4d0    ONLINE       0     0     0
          c5t5d0    ONLINE       0     0     0

# dd if=/dev/zero of=/philpool/testfile bs=256k count=10000

# [run iostat]

The wall clock time for the i/o to quiesce is as expected. Without write cache manually enabled, it takes 3 times as long to finish as with it enabled (1:30 vs 30sec).

[Approximately a 2 gig file is generated. A side note of interest to me is that in both cases the dd returns to the user relatively quickly, but the write goes on for quite a long time in the background... without apparently reserving 2 gigabytes of extra kernel memory, according to swap -s.]
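For anyone repeating the "[run iostat]" step: one way to watch both the per-disk rate and the background drain after dd returns is something like the following (the interval here is arbitrary):

    # ptime dd if=/dev/zero of=/philpool/testfile bs=256k count=10000
    # zpool iostat -v philpool 5
    # iostat -xnz 5

The dd's own wall clock only covers getting the data into memory; the two iostat views are what show when the pool actually quiesces, so keep them running until the write columns drop back to zero.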
> I previously wrote about my scepticism on the claims that zfs selectively
> enables and disables write cache, to improve throughput over the usual
> solaris defaults prior to this point.

I have snv_38 here, with a zpool thus:

bash-3.1# zpool status
  pool: zfs0
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Jun 11 16:17:24 2006
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0

errors: No known data errors

Regardless of what abuse I throw at this, I never seem to see anything happen that indicates that cache is being "toggled" on or off. Furthermore, these are all Sun 36G disks.

> I posted my observations that this did not seem to be happening in any
> meaningful way, for my zfs, on build nv33.
>
> I was told, "oh you just need the more modern drivers".
>
> Well, I'm now running S10u2, with
> SUNWzfsr 11.10.0,REV=2006.05.18.01.46

It's possible that the feature you seek is in snv somewhere and not in that S10 wos, but I am guessing. We would need to look at the changelogs to see where that feature was incorporated in the ZFS bits. Better yet... use the source, Luke!

Dennis
Roch Bourbonnais - Performance Engineering wrote (2006-Jun-15 10:23 UTC), "[zfs-discuss] Re: disk write cache, redux":
I'm puzzled by 2 things.

Naively I'd think a write_cache should not help a throughput test, since the cache should fill up, after which you should still be throttled by the physical drain rate. You clearly show that it helps; anyone know why/how a cache helps throughput?

And the second thing... a quick search, this seems relevant:

    Bug ID: 6397876
    Synopsis: sata drives need default write cache controlled via property
    Integrated in Build: snv_38
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6397876

May have missed U2 though. Sorry about that...

-r
On Jun 15, 2006, at 06:23, Roch Bourbonnais - Performance Engineering wrote:

> Naively I'd think a write_cache should not help a throughput test, since
> the cache should fill up, after which you should still be throttled by the
> physical drain rate. You clearly show that it helps; anyone know why/how a
> cache helps throughput?

7200 RPM disks are typically IOP bound, so the write cache (which can be up to 16MB on some drives) should be able to buffer enough IO to deliver more efficiently on each IOP and also reduce head seek. Not sure which vendors implement write-through when the cache fills, or how detailed the drive cache algos on SATA can go...

Take a look at PSARC 2004/652:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/

.je
Just was on the phone with Andy Bowers. He cleared up that our SATA device drivers need some work: we basically do not have the necessary I/O concurrency at this stage. So the write_cache is actually a good substitute for tagged queuing.

So that explains why we get more throughput _on SATA_ drives from the write_cache; and I guess the other bug explains why ZFS is still not able to benefit from it.

-r
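A quick way to see the concurrency point on your own setup, if you want to check: the actv column of iostat shows how many commands are actually outstanding at each device. With no command queuing and the write cache off you would expect it to hover around 1 per disk, while a queuing-capable path can push it well above that. Something like (interval arbitrary):

    # iostat -xnz 5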
The write cache decouples the actual write to disk from the data transfer from the host. For a streaming operation, this means that the disk can typically stream data onto tracks with almost no latency (because the cache can aggregate multiple I/O operations into full tracks, which can be written without waiting for the right sector to come around).

Disks could do this with "write cache disabled" if they actually used their write cache anyway and simply didn't acknowledge the write immediately, but it appears they don't, perhaps because this would add extra latency compared to actually waiting for the disk to get to the right place and transferring the data (well, one track's worth at least), or perhaps because SATA disks are normally used with the write cache enabled and hence that's the optimized code path.
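Rough numbers behind the "waiting for the right sector to come around" cost, assuming a 7200 RPM drive and no command queuing (back-of-envelope only):

    one revolution at 7200 RPM = 60 s / 7200 = ~8.3 ms
    if each un-cached write ends up waiting out most of a revolution,
    the drive is capped at roughly 1000 / 8.3 = ~120 writes/sec
    at the 128k-256k writes used in the tests above, that is only
    about 15-30 MB/sec per spindle, best case, and the smaller
    per-disk chunks of a raidz stripe push it lower still

which lines up, very roughly, with the uncached numbers reported earlier in the thread.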
Roch Bourbonnais - Performance Engineering wrote:

> Naively I'd think a write_cache should not help a throughput test, since
> the cache should fill up, after which you should still be throttled by the
> physical drain rate. You clearly show that it helps; anyone know why/how a
> cache helps throughput?
>
> And the second thing... a quick search, this seems relevant:
>
>     Bug ID: 6397876
>     Synopsis: sata drives need default write cache controlled via property
>     Integrated in Build: snv_38
>     http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6397876

Well, that just says that the cache settings don't stick across reboot. That doesn't seem to have any bearing on whether zfs toggles write cache or not.

From the sound of things, it sounds like what was previously written on this list was incorrect: I now infer that zfs does NOT do any "smart toggling" of write cache enable/disable on drives that it uses (although it may or may not do some "flush cache" calls at appropriate moments).
Check here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157

-r
Roch wrote:
> Check here:
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157

distilled version:

    vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
    /*...*/
            /*
             * If we own the whole disk, try to enable disk write caching.
             * We ignore errors because it's OK if we can't do it.
             */

Which to me implies, "when a disk pool is mounted/created, enable write cache" (and presumably leave it on indefinitely).

The interesting thing is, dtrace with

    fbt::ldi_ioctl:entry
    {
        printf("ldi_ioctl called with %x\n", args[1]);
    }

says that some kind of ldi_ioctl IS called when I create a test zpool with these sata disks. Specific ioctls called would seem to be:

    x422
    x425
    x42a

and I believe DKIOCSETWCE is x425.

HOWEVER... checking with format -e on those disks says that write cache is NOT ENABLED after this happens.

And interestingly, if I augment the dtrace with

    fbt::sata_set_cache_mode:entry,
    fbt::sata_init_write_cache_mode:entry
    {
        printf("%s called\n", probefunc);
    }

the sata-specific set-cache routines are NOT getting called, according to dtrace anyways....

?
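For anyone who wants to repeat the experiment, here is the same pair of probes rolled into one command line (a sketch; -Z keeps dtrace from bailing out if the sata module's probes don't exist on a given system, and the hex decoding in the comments is my reading of sys/dkio.h):

    # dtrace -Zq \
        -n 'fbt::ldi_ioctl:entry
            {
                /* x422 = DKIOCFLUSHWRITECACHE, x424 = DKIOCGETWCE,
                   x425 = DKIOCSETWCE, x42a = DKIOCGMEDIAINFO */
                printf("ldi_ioctl cmd 0x%x\n", args[1]);
            }' \
        -n 'fbt::sata_set_cache_mode:entry,
            fbt::sata_init_write_cache_mode:entry
            {
                printf("%s called\n", probefunc);
            }'

Leave that running in one window and do the zpool create in another.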
I've got a pretty dumb question regarding SATA and write cache. I don't see options in 'format -e' on SATA drives for checking/setting write cache. I've seen the options for the SCSI driver, but not SATA.

I'd like to help on the SATA write cache enable/disable problem, if I can. What am I missing?

Thanks!

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
I don't believe ZFS toggles write cache on disks on the fly. Rather, write caching is enabled on disks which support this functionality. Then, at appropriate points in the code, an ioctl is called to flush the cache, thereby providing the appropriate data guarantees.

However, this by no means addresses your primary issue, viz. the performance using ZFS on your system is bad.

-Sanjay
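One way to see the "flush at appropriate points" half of this on a live system (a sketch only; x422 is DKIOCFLUSHWRITECACHE in my reading of sys/dkio.h, and this catches flushes from anything going through LDI, not just ZFS): run a fsync-heavy or NFS workload against the pool with something like the following going, then Ctrl-C to see the count:

    # dtrace -n 'fbt::ldi_ioctl:entry
        /args[1] == 0x422/
        {
            /* count write-cache flush ioctls until Ctrl-C */
            @flushes["DKIOCFLUSHWRITECACHE"] = count();
        }'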