Can someone please describe to me the actual underlying I/O operations
which occur when a 128K block of data is written to a storage pool
configured as shown below (with default ZFS block sizes)? I am
particularly interested in the degree of "striping" across mirrors which
occurs. This would be for Solaris 10 U4.

        NAME                                       STATE     READ WRITE CKSUM
        Sun_2540                                   ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096A47B4559Ed0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA047B4529Bd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096E47B456DAd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA447B4544Fd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096147B451BEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA847B45605d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096647B453CEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AAC47B45739d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000097347B457D4d0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB047B457ADd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B800039C9B500000A9C47B4522Dd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB447B4595Fd0  ONLINE       0     0     0

Thanks,

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> Can someone please describe to me the actual underlying I/O operations
> which occur when a 128K block of data is written to a storage pool
> configured as shown below (with default ZFS block sizes)? I am
> particularly interested in the degree of "striping" across mirrors
> which occurs. This would be for Solaris 10 U4.

My observation is that each metaslab is, by default, 1 MByte in size.
Space on each top-level vdev is allocated in metaslabs, and ZFS tries to
fill a top-level vdev's metaslab before moving on to another one. So you
should see eight 128-kByte allocations per top-level vdev before
allocation moves on to the next top-level vdev.

That said, the actual iops are sent in parallel, so it is not unusual to
see many, most, or all of the top-level vdevs concurrently busy.

Does this match your experience?
 -- richard
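One way to watch this allocation pattern for yourself (a rough sketch,
using the pool name from the question; the 5-second interval and the
DTrace one-liner are just illustrative) is to sample per-vdev activity
while the write load runs and to histogram the physical I/O sizes issued
to each disk:

    # per-vdev and per-disk bandwidth, sampled every 5 seconds
    zpool iostat -v Sun_2540 5

    # distribution of physical I/O sizes, per device
    dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

If whole 128K blocks are being handed to each mirror in turn, the size
histogram should cluster around 128K rather than at smaller "stripe"
fragments.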
On Sat, 15 Mar 2008, Richard Elling wrote:
> My observation is that each metaslab is, by default, 1 MByte in size.
> [...]
> That said, the actual iops are sent in parallel, so it is not unusual
> to see many, most, or all of the top-level vdevs concurrently busy.
>
> Does this match your experience?

I do see that all the devices are quite evenly busy. There is no doubt
that the load balancing is quite good. The main question is whether there
is any actual "striping" going on (breaking the data into smaller
chunks), or whether the algorithm is simply load balancing. Striping
trades IOPS for bandwidth.

Using my application, I did some tests today. The application was used to
do balanced read/write of about 500GB of data in some tens of thousands
of reasonably large files. The application sequentially reads a file,
then sequentially writes a file. Several copies (2-6) of the application
were run at once for concurrency. What I noticed is that with hardly any
CPU being used, the read+write bandwidth seemed to be bottlenecked at
about 280MB/second, with 'zpool iostat' showing very balanced I/O between
the reads and the writes.

The system I set up is performing quite a bit differently than I
anticipated. The I/O is bottlenecked, and I find that my application can
do significant processing of the data without significantly increasing
the application run time. So CPU time is almost free.

If I was to assign a smaller block size for the filesystem, would that
provide more of the benefits of striping, or would it be detrimental to
performance due to the number of I/Os?

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> I do see that all the devices are quite evenly busy. There is no
> doubt that the load balancing is quite good. The main question is
> whether there is any actual "striping" going on (breaking the data
> into smaller chunks), or whether the algorithm is simply load
> balancing. Striping trades IOPS for bandwidth.

There's no striping across vdevs going on. It's simple load balancing,
i.e. blocks are spread semi-randomly across the vdevs. I say "semi"
because apparently the average bandwidth load of the vdevs influences the
outcome.

-mg
Bob Friesenhahn wrote:
> On Sat, 15 Mar 2008, Richard Elling wrote:
>> [...]
>> Does this match your experience?
>
> I do see that all the devices are quite evenly busy. There is no
> doubt that the load balancing is quite good. The main question is
> whether there is any actual "striping" going on (breaking the data
> into smaller chunks), or whether the algorithm is simply load
> balancing. Striping trades IOPS for bandwidth.

By my definition of striping, yes, it is going on. But there are
different ways to spread the data. The way that writes are handled, ZFS
rewards devices which can provide good sequential write bandwidth, like
disks. Reads are another story: they read from where the data is, which
in turn depends on the conditions at write time.

The other behaviour you may see is that reads and writes are coalesced,
when possible. At the device level you may see your smaller blocks being
coalesced into larger iops.

> Using my application, I did some tests today. The application was
> used to do balanced read/write of about 500GB of data in some tens of
> thousands of reasonably large files. The application sequentially
> reads a file, then sequentially writes a file. Several copies (2-6)
> of the application were run at once for concurrency. What I noticed
> is that with hardly any CPU being used, the read+write bandwidth
> seemed to be bottlenecked at about 280MB/second, with 'zpool iostat'
> showing very balanced I/O between the reads and the writes.

But where is the bottleneck? iostat will show bottlenecks in the physical
disks and channels. vmstat or mpstat will show the bottlenecks in cpus.
To see if the app is the bottleneck will require some analysis of the app
itself. Is it spending its time blocked on I/O?

> The system I set up is performing quite a bit differently than I
> anticipated. The I/O is bottlenecked, and I find that my application
> can do significant processing of the data without significantly
> increasing the application run time. So CPU time is almost free.
>
> If I was to assign a smaller block size for the filesystem, would that
> provide more of the benefits of striping, or would it be detrimental
> to performance due to the number of I/Os?

I would not expect to see much difference, but the proof is in the
pudding. Let us know what you find.
 -- richard
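For reference, a minimal triage along the lines Richard suggests (the
5-second interval is only an example) might look like:

    iostat -xnz 5    # physical disks/channels: look for %b near 100 or large wait queues
    mpstat 5         # per-CPU load: look for usr+sys near 100 on any CPU
    vmstat 5         # paging and run-queue pressure

If the disks show only moderate %b and the CPUs are mostly idle, the
limit is more likely the per-spindle I/O pattern than raw channel or CPU
capacity.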
On Sun, 16 Mar 2008, Richard Elling wrote:
> But where is the bottleneck? iostat will show bottlenecks in the
> physical disks and channels. vmstat or mpstat will show the
> bottlenecks in cpus. To see if the app is the bottleneck will
> require some analysis of the app itself. Is it spending its time
> blocked on I/O?

The application is spending almost all the time blocked on I/O. I see
that the number of device writes per second seems pretty high. The
application is doing I/O in 128K blocks. How many IOPS does a modern
300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity
depends on whether the access is random or sequential. At the application
level, the access is completely sequential, but ZFS is likely doing some
extra seeks.

iostat output (atime=off):

                     extended device statistics
    device   r/s    w/s   Mr/s   Mw/s  wait  actv  svc_t  %w  %b
    sd0      0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
    sd1      0.0    0.0    0.0    0.0   0.0   0.0    2.8   0   0
    sd2      0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
    sd10    80.4  170.7   10.0   19.9   0.0   9.2   36.5   0  54
    sd11    82.1  170.2   10.2   20.0   0.0  13.3   52.9   0  71
    sd12    79.3  168.3    9.9   20.0   0.0  13.1   53.1   0  69
    sd13    80.6  173.0   10.0   19.9   0.0   9.3   36.7   0  56
    sd14    80.9  167.8   10.1   20.0   0.0  13.4   53.8   0  70
    sd15    77.7  168.7    9.7   19.9   0.0   9.1   37.1   0  52
    sd16    77.3  170.6    9.6   20.0   0.0  13.3   53.7   0  70
    sd17    76.4  168.2    9.5   20.0   0.0   9.1   37.2   0  52
    sd18    76.7  172.2    9.5   19.9   0.0  13.5   54.2   0  70
    sd19    83.8  173.2   10.4   20.0   0.0  13.7   53.4   0  74
    sd20    73.3  174.3    9.1   20.0   0.0   9.1   36.9   0  56
    sd21    75.3  170.2    9.4   20.0   0.0  13.2   53.9   0  69
    nfs1     0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0

% mpstat

    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
      0  288   1  189  1018  413  815   26  102   88    0  3046    3   3   0  94
      1  185   1  180   634    1  830   43  111   74    0  3117    3   2   0  94
      2  284   1  183   521    6  617   27   98   67    0  4954    4   3   0  93
      3  176   1  239   748  353  555   25   76   62    0  3933    4   3   0  93

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
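As a rough ballpark (typical vendor figures, not numbers from this
thread), a 15K RPM SAS drive works out to something like:

    rotational latency (average)  = 0.5 / (15000/60) rev/s  ~= 2.0 ms
    average seek (15K SAS)        ~= 3.5 ms
    random IOPS per drive         ~= 1 / (2.0 ms + 3.5 ms)  ~= 170-200
    sequential 128K transfers     limited by media rate (~80-120 MB/s),
                                  i.e. roughly 600-900 IOPS of 128K each

The iostat sample above shows roughly 250 ops/s per disk at about 30
MB/s, which sits between the purely random and purely sequential regimes
and is consistent with mostly sequential transfers interrupted by short
seeks.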
Hi Bob ... as Richard has mentioned, allocation to vdevs is done in a
fixed-sized chunk (Richard specs 1MB, but I remember a 512KB number from
the original spec; this is not very important), and the allocation
algorithm is basically doing load balancing. For your non-raid pool, this
chunk size will stay fixed regardless of the block size you choose when
creating the file system or the IO unit size your application(s) use.
(The stripe size can dynamically change in a raidz pool, but not in your
non-raid pool.)

Measuring bandwidth for your application load is tricky with ZFS, since
there are many hidden IO operations (besides the ones that your
application is requesting) that ZFS must perform. If you collect iostats
on bytes transferred to the hard drives and compare those numbers to the
amount of data your application(s) transferred, you can find potentially
large differences. The differences in these scenarios are largely driven
by the IO size your application(s) use. For example, when I run the
following tests, here are my observations:

- using a dual Xeon server with a QLogic FC 2G interface
- using a pool with 5 10Krpm FC 146GB drives
- sequentially writing 4 15GB previously written files in one file
  system in the pool (this file system is using the 128KB block size),
  with a separate thread writing each file concurrently, for a total of
  60GB written

    block size   written   actual disk IO   observed BW (MB/s)   %CPU
    ----------   -------   --------------   ------------------   ----
    4KB          60GB      227.3GB          34.2                 20.4
    32KB         60GB      216.5GB          36.1                 13.9
    128KB        60GB      63.6GB           69.6                 31.0

You can see that a small application IO size causes much meta-data-based
IO (more than 3 times the actual application IO requirements), while the
128KB application writes induce only marginally more disk IO than the
application actually uses. The BW numbers here are for just the
application data, but when you consider all the IO from the disks over
the test times, the physical BW is obviously greater in all cases. All my
drives were uniformly busy in these tests, but the small application IO
sizes forced much more total IO against the drives.

In your case the application IO rate would be even further degraded due
to the mirror configuration. The extra load of reading and writing
meta-data (including ditto blocks) and mirror devices conspires to reduce
the application IO rate, even though the disk device IO rates may be
quite good. File system block size reduction only exacerbates the problem
by requiring more meta-data to support the same quantity of application
data, and for sequential IO this is a loser. In any case, for a non-raid
pool, the allocation chunk size per drive (the stripe size) is not
influenced by the file system block size. When application IO sizes get
small, the overhead in ZFS goes up dramatically.

regards, Bill

> The application is spending almost all the time blocked on I/O. I see
> that the number of device writes per second seems pretty high. The
> application is doing I/O in 128K blocks. How many IOPS does a modern
> 300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity
> depends on whether the access is random or sequential. At the
> application level, the access is completely sequential, but ZFS is
> likely doing some extra seeks.
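A simplified, single-threaded version of one of these runs could be
scripted roughly like this (the pool path, file name, and the use of dd
in place of the real test program are all assumptions, not Bill's actual
setup):

    # capture physical disk traffic while the rewrite is in flight
    iostat -xn 5 > /tmp/iostat_4k.log &

    # sequentially rewrite an existing 15GB file using 4K application writes;
    # conv=notrunc keeps the file in place, so this is an overwrite rather
    # than a new-file write
    dd if=/dev/zero of=/pool/fs/testfile bs=4k count=3932160 conv=notrunc

    kill %1

Comparing the bytes dd reports against the write totals in the iostat log
gives the application-vs-physical ratio shown in the table above.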
On Wed, 19 Mar 2008, Bill Moloney wrote:
> When application IO sizes get small, the overhead in ZFS goes
> up dramatically.

Thanks for the feedback. However, from what I have observed, it is not
the full story at all. On my own system, when a new file is written, the
write block size does not make a significant difference to the write
speed. Similarly, read block size does not make a significant difference
to the sequential read speed. I do see a large difference in rates when
an existing file is updated sequentially. There is a many-orders-of-
magnitude difference for random I/O type updates.

I think that there are some rather obvious reasons for the difference
between writing a new file and updating an existing file. When writing a
new file, the system can buffer up to a disk block's worth of data prior
to issuing a disk I/O, or it can immediately write what it has, and since
the write is sequential, it does not need to re-read prior to writing
(though there may be more metadata I/Os). For the case of updating part
of a disk block, there needs to be a read prior to the write if the block
is not cached in RAM.

If the system is short on RAM, it may be that ZFS issues many more write
I/Os than if it has a lot of RAM.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> On my own system, when a new file is written, the write block size
> does not make a significant difference to the write speed

Yes, I've observed the same result ... when a new file is being written
sequentially, the file data and newly constructed meta-data can be built
in cache and written in large sequential chunks periodically, without the
need to read in existing meta-data and/or data. It seems that data and
meta-data that are newly constructed in cache for sequential operations
will persist in cache effectively, and the application IO size is a much
less sensitive parameter. Monitoring disks with iostat in these cases
shows the disk IO to be only marginally greater than the application IO.
This is why I specified that the write tests described in my previous
post were to existing files.

The overhead of doing small sequential writes to an existing object is so
much greater than writing to a new object that it begs for some
reasonable explanation. The only one that I've been able to assemble
through various experiments is that data/meta-data for existing objects
is not retained effectively in cache if ZFS detects that such an object
is being sequentially written. This forces constant re-reading of the
data/meta-data associated with such an object, causing a huge increase in
device IO traffic that does not seem to accompany the writing of a brand
new object. The size of RAM seems to make little difference in this case.

As small sequential writes accumulate in the 5-second cache, the chain of
meta-data leading to a newly constructed data block may see only one
pointer (of the 128 in the final set) changing to point to that block,
but all the meta-data from the uberblock to the target must be rewritten
on the 5-second flush. Of course this is not much different from what's
happening in the newly-created-object scenario, so it must be the
behavior that follows this flush that's different. It seems to me that
after this flush, some or all of the data/meta-data that will be affected
next is re-read, even though much of what's needed for subsequent
operations should already be in cache.

My experience with large-RAM systems and with the use of SSDs as ZFS
cache devices has convinced me that data/meta-data associated with
sequential write operations to existing objects (and ZFS seems very good
at detecting this association) does not get retained in cache very
effectively. You can see this very clearly if you look at the IO to a
cache device (ZFS allows you to easily attach a device to a pool as a
cache device, which acts as a sort of L2-type cache for RAM). When I do
random IO operations to existing objects, I see a large amount of IO to
my cache device as RAM fills and ZFS pushes cached information (that
would otherwise be evicted) to the SSD cache device. If I repeat the
random IO test over the same total file space, I see improved performance
as I get occasional hits from the RAM cache and the SSD cache. As this
extended cache hierarchy warms up with each test run, my results continue
to improve.

If I run sequential write operations to existing objects, however, I see
very little activity to my SSD cache, and virtually no change in
performance when I immediately run the same test again. It seems that ZFS
is still in need of some fine-tuning for small sequential write
operations to existing objects.

regards, Bill
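For anyone wanting to reproduce the cache-device observation, the
mechanics are roughly as follows (the pool and device names are
hypothetical, and cache vdevs require a ZFS release newer than
Solaris 10 U4):

    # attach an SSD as an L2-style cache device
    zpool add tank cache c5t0d0

    # watch traffic to the cache device alongside the data vdevs
    zpool iostat -v tank 5

Under a random-read workload the cache device line should show steady
write traffic as the ARC spills into it; under the sequential-rewrite
workload described above it stays nearly idle.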
Hello Bob,

Wednesday, March 19, 2008, 11:23:58 PM, you wrote:

BF> [...] For the case of updating part of a disk block, there needs to
BF> be a read prior to the write if the block is not cached in RAM.

Possibly when you created the file, zfs used 128KB blocks. Then if you
randomly update that file, the question is: what is the average update
size? If it's below 128KB (and not aligned), you will basically have to
read the old 128KB block first and then write it to a new location.

In such a scenario (like Oracle databases on zfs), BEFORE you create the
files, set the recordsize property to something smaller, ideally matching
your average update size. In the case of Oracle, matching db_block_size
should give you the best results most of the time.

-- 
Best regards,
 Robert Milkowski                       mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
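A minimal sketch of that setup (the dataset name and the 8K value are
illustrative; recordsize only affects blocks written after it is set, so
it has to be in place before the data files are created):

    zfs create tank/oradata
    zfs set recordsize=8k tank/oradata    # match an 8K Oracle db_block_size
    zfs get recordsize tank/oradata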
> Similarly, read block size does not make a
> significant difference to the sequential read speed.

Last time I did a simple bench using dd, supplying the record size as the
blocksize (instead of no blocksize parameter) bumped the mirror pool
speed from 90MB/s to 130MB/s.

-mg
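Presumably the comparison was something along these lines (the file path
is hypothetical; 128k matches the default ZFS recordsize):

    dd if=/tank/fs/bigfile of=/dev/null            # default 512-byte reads
    dd if=/tank/fs/bigfile of=/dev/null bs=128k    # reads matched to the recordsize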
On Thu, 20 Mar 2008, Mario Goebbels wrote:
>> Similarly, read block size does not make a
>> significant difference to the sequential read speed.
>
> Last time I did a simple bench using dd, supplying the record size as
> the blocksize (instead of no blocksize parameter) bumped the mirror
> pool speed from 90MB/s to 130MB/s.

Indeed. However, as an interesting twist to things, in my own benchmark
runs I see two behaviors. When the file size is smaller than the amount
of RAM the ARC can reasonably grow to, the write block size does make a
clear difference. When the file size is larger than RAM, the write block
size no longer makes much difference, and sometimes larger block sizes
actually go slower.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Mario Goebbels wrote:
>>> Similarly, read block size does not make a
>>> significant difference to the sequential read speed.
>>
>> Last time I did a simple bench using dd, supplying the record size as
>> the blocksize (instead of no blocksize parameter) bumped the mirror
>> pool speed from 90MB/s to 130MB/s.
>
> Indeed. However, as an interesting twist to things, in my own
> benchmark runs I see two behaviors. When the file size is smaller
> than the amount of RAM the ARC can reasonably grow to, the write block
> size does make a clear difference. When the file size is larger than
> RAM, the write block size no longer makes much difference, and
> sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the
ARC can be less than optimal IMHO

---
.je
On Thu, 20 Mar 2008, Jonathan Edwards wrote:
> in that case .. try fixing the ARC size .. the dynamic resizing on the
> ARC can be less than optimal IMHO

Is a 16GB ARC size not considered to be enough? ;-)

I was only describing the behavior that I observed. It seems to me that
when large files are written very quickly, once the file becomes bigger
than the ARC, what is contained in the ARC is mostly stale and does not
help much any more. If the file is smaller than the ARC, then there is
likely to be more useful caching.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> Is a 16GB ARC size not considered to be enough? ;-)
>
> I was only describing the behavior that I observed. It seems to me
> that when large files are written very quickly, once the file becomes
> bigger than the ARC, what is contained in the ARC is mostly stale and
> does not help much any more. If the file is smaller than the ARC, then
> there is likely to be more useful caching.

That's the problem of grouping writes, since they're going to be buffered
in the ARC. Writing a file larger than the ARC is akin to holding a
powerful firehose on it. :)

I guess what'd help your case would be an option that specifies the
minimum of ARC memory dedicated to read caching only.

-mg
On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Jonathan Edwards wrote:
>> in that case .. try fixing the ARC size .. the dynamic resizing on
>> the ARC can be less than optimal IMHO
>
> Is a 16GB ARC size not considered to be enough? ;-)
>
> I was only describing the behavior that I observed. It seems to me
> that when large files are written very quickly, once the file becomes
> bigger than the ARC, what is contained in the ARC is mostly stale and
> does not help much any more. If the file is smaller than the ARC, then
> there is likely to be more useful caching.

sure, I got that - it's not the size of the arc in this case, since
caching is going to be a lost cause .. but explicitly setting a
zfs_arc_max should result in fewer calls to arc_shrink() when you hit
memory pressure, with the application's page buffer competing with the
arc

in other words, as soon as the arc is 50% full of dirty pages (8GB) it'll
start evicting pages .. you can't avoid that .. but what you can avoid is
the additional weight of constantly growing and shrinking the cache as it
tries to keep up with your constantly changing blocks in a large file

---
.je
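For Solaris 10, the usual way to pin the ARC is a zfs_arc_max entry in
/etc/system, which takes effect at the next reboot (the 8GB value below
is only an illustration, not a recommendation from this thread):

    * /etc/system: cap the ARC at 8 GB (0x200000000 bytes)
    set zfs:zfs_arc_max = 0x200000000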