Hi Richard,

How's the ranch? ;-)

>> This is most likely a naive question on my part. If recordsize is
>> set to 4k (or a multiple of 4k), will ZFS ever write a record that
>> is less than 4k or not a multiple of 4k?
>
> Yes. The recordsize is the upper limit for a file record.
>
>> This includes metadata.
>
> Yes. Metadata is compressed and seems to usually be one block.
>
>> Does compression have any effect on this?
>
> Yes. 4KB is the minimum size that can be compressed for regular data.
>
> NB. Physical writes may be larger because they are coalesced. But
> if you are worried about recordsize, then you are implicitly worried
> about reads.

The question behind the question is: given the really bad things that
can happen performance-wise with writes that are not 4k aligned when
using flash devices, is there any way to ensure that any and all
writes from ZFS are 4k aligned?
On Dec 16, 2009, at 7:35 AM, Bill Sprouse wrote:

> Hi Richard,
>
> How's the ranch? ;-)

Good. Sunny, warm, turning green... perfect for the holidays :-)

>>> This is most likely a naive question on my part. If recordsize is
>>> set to 4k (or a multiple of 4k), will ZFS ever write a record that
>>> is less than 4k or not a multiple of 4k?
>>
>> Yes. The recordsize is the upper limit for a file record.
>>
>>> This includes metadata.
>>
>> Yes. Metadata is compressed and seems to usually be one block.
>>
>>> Does compression have any effect on this?
>>
>> Yes. 4KB is the minimum size that can be compressed for regular data.
>>
>> NB. Physical writes may be larger because they are coalesced. But
>> if you are worried about recordsize, then you are implicitly worried
>> about reads.
>
> The question behind the question is, given the really bad things
> that can happen performance-wise with writes that are not 4k aligned
> when using flash devices, is there any way to ensure that any and
> all writes from ZFS are 4k aligned?

The short answer is no, not all writes will be 4 KB aligned. As to how
this affects "flash devices," it depends on the device -- very few seem
to be built the same way. A quick dtrace script would show how writes
are aligned to the partition boundaries, but the partition alignment is
left as an exercise for the implementer.
 -- richard

-------------- next part --------------
A non-text attachment was scrubbed...
Name: aligned.d
Type: application/octet-stream
Size: 265 bytes
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091216/834f4cf8/attachment.obj>
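[Editor's note: the aligned.d script survives only as a scrubbed attachment. The idea it implements -- counting I/Os whose byte offset falls on a 4 KB boundary versus those that don't -- can be sketched in Python. This is a hypothetical illustration of the counting logic, not Richard's actual D script, and the sample offsets are made up, not real trace data.]

```python
# Sketch of the aligned/nonaligned tally a script like aligned.d
# might produce. Offsets below are illustrative only.

BLOCK = 4096  # 4 KB alignment boundary in bytes


def classify(offsets):
    """Return (aligned, nonaligned) counts for a list of byte offsets."""
    aligned = sum(1 for off in offsets if off % BLOCK == 0)
    return aligned, len(offsets) - aligned


if __name__ == "__main__":
    sample = [0, 512, 4096, 8192, 12800, 16384, 20480, 3584]
    a, n = classify(sample)
    print(f"aligned={a}")      # offsets on a 4096-byte boundary
    print(f"nonaligned={n}")   # everything else
```

A real tracer would feed it the `offset` from each block-layer I/O rather than a hard-coded list.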
On Wed, 16 Dec 2009, Richard Elling wrote:

> the same way. A quick dtrace script would show how writes are
> aligned to the partition boundaries, but the partition alignment is
> left as an exercise for the implementer.

With 128K reads and writes, not very much apparent alignment in my
pool's writes:

% /usr/sbin/dtrace -Cs aligned.d
Press ^C when done sampling
^C
aligned=13014
nonaligned=71464

% iopattern
%RAN %SEQ COUNT    MIN    MAX    AVG     KR     KW
  30   70   592   2560 131072 130744  75586      0
  30   70   617  65536 131072 130753  78784      0
  30   70   624  65536 131072 130966  79808      0
  12   88  2948    512 131072 125161  36224 324105
   7   93  6200   3584 131072 130075      0 918510
  27   73  1969    512 131072 111426  49216 165040
  27   73   633  65536 131072 130657  80768      0
  27   73   618  65536 131072 130859  78976      0
  25   75   600  65536 131072 130744  76608      0
  23   77   606  65536 131072 130963  77504      0
  25   75   521  65536 131072 130694  66496      0
   7   93  6149   3584 131072 129810    256 779241
  13   87  4193    512 131072 124308  10291 498719
  26   74   579   2560 131072 130850  73986      0
  29   71   609  65536 131072 130533  77632      0
  25   75   591  65536 131072 130961  75584      0
  25   75   648  65536 131072 130768  82752      0
  25   75   603  65536 131072 130963  77120      0
  11   89  3278   2048 131072 127439  33280 374677
   4   96  6219   3584 131072 129941      0 789167

The percentage of writes that are sequential while reads are
essentially blocked is quite impressive.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 16, 2009, at 6:54 PM, Bob Friesenhahn wrote:

> On Wed, 16 Dec 2009, Richard Elling wrote:
>
>> the same way. A quick dtrace script would show how writes are
>> aligned to the partition boundaries, but the partition alignment is
>> left as an exercise for the implementer.
>
> With 128K reads and writes, not very much apparent alignment in my
> pool's writes:
>
> % /usr/sbin/dtrace -Cs aligned.d
> Press ^C when done sampling
> ^C
> aligned=13014
> nonaligned=71464

I just threw that together, and it doesn't do anything clever like
identify sequential writes. Is there an actual problem that we can
solve by looking at the alignment? If so, maybe we can do better...

> % iopattern

I modified iopattern so you can separate reads from writes. I find
that seeing them mixed is of little use, and very confusing :-)
http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
 -- richard
On Wed, Dec 16 at 7:35, Bill Sprouse wrote:

> The question behind the question is, given the really bad things that
> can happen performance-wise with writes that are not 4k aligned when
> using flash devices, is there any way to ensure that any and all
> writes from ZFS are 4k aligned?

Some flash devices can handle this better than others, often by
several orders of magnitude. Not all devices are so affected, as you
imply.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Thu, Dec 17, 2009 at 09:14, Eric D. Mudama <edmudama at bounceswoosh.org> wrote:

> On Wed, Dec 16 at 7:35, Bill Sprouse wrote:
>
>> The question behind the question is, given the really bad things
>> that can happen performance-wise with writes that are not 4k aligned
>> when using flash devices, is there any way to ensure that any and
>> all writes from ZFS are 4k aligned?
>
> Some flash devices can handle this better than others, often several
> orders of magnitude better. Not all devices (as you imply) are
> so affected.

Is there - somewhere - a list of flash devices, with some (perhaps
subjective) indication of how they handle issues like this?

--
-Me
> On Wed, Dec 16 at 7:35, Bill Sprouse wrote:
>
>> The question behind the question is, given the really bad things
>> that can happen performance-wise with writes that are not 4k aligned
>> when using flash devices, is there any way to ensure that any and
>> all writes from ZFS are 4k aligned?
>
> Some flash devices can handle this better than others, often several
> orders of magnitude better. Not all devices (as you imply) are
> so affected.

As a specific example of two devices with dramatically different
performance for sub-4k transfers, has anyone done any ZFS benchmarks
between the X25-E and the F20 that they can share?

I am particularly interested in zvol performance with a blocksize of
16k and highly compressible data (~10x).

I am going to run some comparison tests but would appreciate any
initial input on what to look out for or how to tune ZFS to get the
most out of the F20. It might be helpful, e.g., if there were
somewhere in the software stack where I could tell part of the system
to lie and treat the F20 as a 4k device?

Thanks.
--
This message posted from opensolaris.org
On Dec 17, 2009, at 9:04 PM, stuart anderson wrote:

> As a specific example of two devices with dramatically different
> performance for sub-4k transfers, has anyone done any ZFS benchmarks
> between the X25-E and the F20 that they can share?
>
> I am particularly interested in zvol performance with a blocksize of
> 16k and highly compressible data (~10x).

16 KB recordsize? That seems a little unusual, what is the application?

> I am going to run some comparison tests but would appreciate any
> initial input on what to look out for or how to tune ZFS to get the
> most out of the F20.

AFAICT, no tuning should be required. It is quite fast.

> It might be helpful, e.g., if there were somewhere in the software
> stack where I could tell part of the system to lie and treat the F20
> as a 4k device?

The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
buffer will hide 4KB write effects.

OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't take
much armchair analysis to come to the conclusion that the F20 is
likely to win that IOPS battle :-)
 -- richard
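[Editor's note: the "armchair analysis" is simple arithmetic. A small sketch using only the rated figures quoted in the thread -- real-world results will of course differ from vendor ratings:]

```python
# Rated random 4 KB write IOPS, as quoted in the thread.
F20_IOPS = 84_000    # Sun F20 (rated)
X25E_IOPS = 3_300    # Intel X25-E (rated)

ratio = F20_IOPS / X25E_IOPS

# Rough best-case 4 KiB write throughput implied by those rates.
f20_mib_s = F20_IOPS * 4096 / 2**20
x25e_mib_s = X25E_IOPS * 4096 / 2**20

print(f"IOPS ratio (F20 : X25-E): {ratio:.1f}x")
print(f"F20 ~{f20_mib_s:.0f} MiB/s vs X25-E ~{x25e_mib_s:.0f} MiB/s")
```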
On Dec 17, 2009, at 9:21 PM, Richard Elling wrote:

> On Dec 17, 2009, at 9:04 PM, stuart anderson wrote:
>>
>> I am particularly interested in zvol performance with a blocksize
>> of 16k and highly compressible data (~10x).
>
> 16 KB recordsize? That seems a little unusual, what is the
> application?

SAM-QFS metadata, whose fundamental disk allocation unit (DAU) size
for metadata is 16kB.

>> I am going to run some comparison tests but would appreciate any
>> initial input on what to look out for or how to tune ZFS to get the
>> most out of the F20.
>
> AFAICT, no tuning should be required. It is quite fast.
>
>> It might be helpful, e.g., if there were somewhere in the software
>> stack where I could tell part of the system to lie and treat the
>> F20 as a 4k device?
>
> The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
> buffer will hide 4KB write effects.

Not from some direct vdbench comparison results I have seen. My main
concern here has to do with ZFS compression, which I need for my
application, breaking up the transfer sizes the F20 sees into smaller
than 4KB writes, where there is a critical performance difference. I
also suspect/hope that SAM-QFS is telling ZFS to aggressively
flush/commit any metadata updates to stable storage, which probably
aggravates the problem, though I have not tested this yet.

> OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't
> take much armchair analysis to come to the conclusion that the F20
> is likely to win that IOPS battle :-)

Though to be fair, you should probably compare a single F20 DOM to an
X25-E, or four X25-Es to a full F20 -- and of course my systems don't
run from an armchair :)

Thanks.
--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
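[Editor's note: the compression concern can be put in numbers. A 16 KB zvol block compressed ~10x shrinks to well under 4 KB. The sketch below illustrates the arithmetic only; the 512-byte sector rounding is an assumption about the device's sector size (ashift), not something stated in the thread:]

```python
# Illustrative arithmetic for a 16 KB zvol block compressed ~10x.
RECORD = 16 * 1024          # 16 KB zvol blocksize from the thread
COMPRESSION_RATIO = 10      # "highly compressible data (~10x)"
SECTOR = 512                # assumed 512 B sector (ashift=9)

compressed = RECORD // COMPRESSION_RATIO        # ~1.6 KB payload
on_disk = -(-compressed // SECTOR) * SECTOR     # round up to sector size

print(f"compressed payload: {compressed} B, on-disk size: {on_disk} B")
print(f"multiple of 4 KB?   {on_disk % 4096 == 0}")
```

Under these assumptions the physical write is around 2 KB, so it cannot be a multiple of the flash device's 4 KB page -- which is exactly the sub-4k write pattern Stuart is worried about.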
On Dec 18, 2009, at 9:40 AM, Stuart Anderson wrote:

> On Dec 17, 2009, at 9:21 PM, Richard Elling wrote:
>
>> 16 KB recordsize? That seems a little unusual, what is the
>> application?
>
> SAM-QFS metadata, whose fundamental disk allocation unit (DAU) size
> for metadata is 16kB.

Ah, ok. That explains it. I'm not sure there are a lot of people doing
this. Most folks don't know QFS exists or is open source.

>> The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
>> buffer will hide 4KB write effects.
>
> Not from some direct vdbench comparison results I have seen. My main
> concern here has to do with ZFS compression, which I need for my
> application, breaking up the transfer sizes the F20 sees into
> smaller than 4KB writes, where there is a critical performance
> difference. I also suspect/hope that SAM-QFS is telling ZFS to
> aggressively flush/commit any metadata updates to stable storage,
> which probably aggravates the problem, though I have not tested this
> yet.

ZFS will coalesce writes, regardless of the recordsize. However, this
is not the case for writes to the ZIL (for obvious reasons). Measure
the ZIL activity to see how that workload looks.
If you don't see ZIL activity, then you should see (mostly) larger
I/Os when the txg commits.

>> OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't
>> take much armchair analysis to come to the conclusion that the F20
>> is likely to win that IOPS battle :-)
>
> Though to be fair, you should probably compare a single F20 DOM to
> an X25-E, or four X25-Es to a full F20 -- and of course my systems
> don't run from an armchair :)

...or 1,000 1 TB SATA disks... :-)
 -- richard