Hi Richard,

How's the ranch? ;-)

>> This is most likely a naive question on my part. If recordsize is
>> set to 4k (or a multiple of 4k), will ZFS ever write a record that
>> is less than 4k or not a multiple of 4k?
>
> Yes. The recordsize is the upper limit for a file record.
>
>> This includes metadata.
>
> Yes. Metadata is compressed and seems to usually be one block.
>
>> Does compression have any effect on this?
>
> Yes. 4KB is the minimum size that can be compressed for regular data.
>
> NB. Physical writes may be larger because they are coalesced. But
> if you are worried about recordsize, then you are implicitly worried
> about reads.

The question behind the question is: given the really bad things that
can happen performance-wise with writes that are not 4k aligned when
using flash devices, is there any way to ensure that any and all
writes from ZFS are 4k aligned?
On Dec 16, 2009, at 7:35 AM, Bill Sprouse wrote:

> Hi Richard,
>
> How's the ranch? ;-)

Good. Sunny, warm, turning green... perfect for the holidays :-)

>>> This is most likely a naive question on my part. If recordsize is
>>> set to 4k (or a multiple of 4k), will ZFS ever write a record that
>>> is less than 4k or not a multiple of 4k?
>>
>> Yes. The recordsize is the upper limit for a file record.
>>
>>> This includes metadata.
>>
>> Yes. Metadata is compressed and seems to usually be one block.
>>
>>> Does compression have any effect on this?
>>
>> Yes. 4KB is the minimum size that can be compressed for regular data.
>>
>> NB. Physical writes may be larger because they are coalesced. But
>> if you are worried about recordsize, then you are implicitly worried
>> about reads.
>
> The question behind the question is, given the really bad things
> that can happen performance-wise with writes that are not 4k aligned
> when using flash devices, is there any way to ensure that any and
> all writes from ZFS are 4k aligned?

The short answer is no, not all writes will be 4 KB aligned. As to how
this affects "flash devices," it depends on the device -- very few seem
to be built the same way. A quick dtrace script would show how writes
are aligned to the partition boundaries, but the partition alignment is
left as an exercise for the implementer.
 -- richard

-------------- next part --------------
A non-text attachment was scrubbed...
Name: aligned.d
Type: application/octet-stream
Size: 265 bytes
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091216/834f4cf8/attachment.obj>
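[Editor's note: the aligned.d script survives only as a scrubbed attachment. The idea it implements -- counting I/Os whose byte offset falls on a 4 KB boundary versus those that don't -- can be sketched in Python. This is a hypothetical illustration of the counting logic, not Richard's actual D script, and the sample offsets are made up, not real trace data.]

```python
# Sketch of the aligned/nonaligned tally a script like aligned.d
# might produce. Offsets below are illustrative only.

BLOCK = 4096  # 4 KB alignment boundary in bytes


def classify(offsets):
    """Return (aligned, nonaligned) counts for a list of byte offsets."""
    aligned = sum(1 for off in offsets if off % BLOCK == 0)
    return aligned, len(offsets) - aligned


if __name__ == "__main__":
    sample = [0, 512, 4096, 8192, 12800, 16384, 20480, 3584]
    a, n = classify(sample)
    print(f"aligned={a}")      # offsets on a 4096-byte boundary
    print(f"nonaligned={n}")   # everything else
```

A real tracer would feed it the `offset` from each block-layer I/O rather than a hard-coded list.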
On Wed, 16 Dec 2009, Richard Elling wrote:

> the same way. A quick dtrace script would show how writes are
> aligned to the partition boundaries, but the partition alignment is
> left as an exercise for the implementer.

With 128K reads and writes, not very much apparent alignment in my
pool's writes:

% /usr/sbin/dtrace -Cs aligned.d
Press ^C when done sampling
^C
aligned=13014
nonaligned=71464

% iopattern
%RAN %SEQ COUNT    MIN    MAX    AVG     KR     KW
  30   70   592   2560 131072 130744  75586      0
  30   70   617  65536 131072 130753  78784      0
  30   70   624  65536 131072 130966  79808      0
  12   88  2948    512 131072 125161  36224 324105
   7   93  6200   3584 131072 130075      0 918510
  27   73  1969    512 131072 111426  49216 165040
  27   73   633  65536 131072 130657  80768      0
  27   73   618  65536 131072 130859  78976      0
  25   75   600  65536 131072 130744  76608      0
  23   77   606  65536 131072 130963  77504      0
  25   75   521  65536 131072 130694  66496      0
   7   93  6149   3584 131072 129810    256 779241
  13   87  4193    512 131072 124308  10291 498719
  26   74   579   2560 131072 130850  73986      0
  29   71   609  65536 131072 130533  77632      0
  25   75   591  65536 131072 130961  75584      0
  25   75   648  65536 131072 130768  82752      0
  25   75   603  65536 131072 130963  77120      0
  11   89  3278   2048 131072 127439  33280 374677
   4   96  6219   3584 131072 129941      0 789167

The percentage of writes that are sequential while reads are
essentially blocked is quite impressive.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 16, 2009, at 6:54 PM, Bob Friesenhahn wrote:

> On Wed, 16 Dec 2009, Richard Elling wrote:
>
>> the same way. A quick dtrace script would show how writes are
>> aligned to the partition boundaries, but the partition alignment is
>> left as an exercise for the implementer.
>
> With 128K reads and writes, not very much apparent alignment in my
> pool's writes:
>
> % /usr/sbin/dtrace -Cs aligned.d
> Press ^C when done sampling
> ^C
> aligned=13014
> nonaligned=71464

I just threw that together, and it doesn't do anything clever like
identify sequential writes. Is there an actual problem that we can
solve by looking at the alignment? If so, maybe we can do better...

> % iopattern

I modified iopattern so you can separate reads from writes. I find
that seeing them mixed is of little use, and very confusing :-)
http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
 -- richard
On Wed, Dec 16 at 7:35, Bill Sprouse wrote:

> The question behind the question is, given the really bad things that
> can happen performance-wise with writes that are not 4k aligned when
> using flash devices, is there any way to ensure that any and all
> writes from ZFS are 4k aligned?

Some flash devices can handle this better than others, often by
several orders of magnitude. Not all devices are so affected, as you
imply.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Thu, Dec 17, 2009 at 09:14, Eric D. Mudama <edmudama at bounceswoosh.org> wrote:

> On Wed, Dec 16 at 7:35, Bill Sprouse wrote:
>
>> The question behind the question is, given the really bad things
>> that can happen performance-wise with writes that are not 4k aligned
>> when using flash devices, is there any way to ensure that any and
>> all writes from ZFS are 4k aligned?
>
> Some flash devices can handle this better than others, often several
> orders of magnitude better. Not all devices (as you imply) are
> so affected.

Is there - somewhere - a list of flash devices, with some (perhaps
subjective) indication of how they handle issues like this?

--
-Me
> On Wed, Dec 16 at 7:35, Bill Sprouse wrote:
>
>> The question behind the question is, given the really bad things
>> that can happen performance-wise with writes that are not 4k aligned
>> when using flash devices, is there any way to ensure that any and
>> all writes from ZFS are 4k aligned?
>
> Some flash devices can handle this better than others, often several
> orders of magnitude better. Not all devices (as you imply) are
> so affected.

As a specific example of two devices with dramatically different
performance for sub-4k transfers, has anyone done any ZFS benchmarks
between the X25-E and the F20 that they can share?

I am particularly interested in zvol performance with a blocksize of
16k and highly compressible data (~10x).

I am going to run some comparison tests but would appreciate any
initial input on what to look out for or how to tune ZFS to get the
most out of the F20. It might be helpful, e.g., if there were
somewhere in the software stack where I could tell part of the system
to lie and treat the F20 as a 4k device?

Thanks.
--
This message posted from opensolaris.org
On Dec 17, 2009, at 9:04 PM, stuart anderson wrote:

> As a specific example of two devices with dramatically different
> performance for sub-4k transfers, has anyone done any ZFS benchmarks
> between the X25-E and the F20 that they can share?
>
> I am particularly interested in zvol performance with a blocksize of
> 16k and highly compressible data (~10x).

16 KB recordsize? That seems a little unusual, what is the application?

> I am going to run some comparison tests but would appreciate any
> initial input on what to look out for or how to tune ZFS to get the
> most out of the F20.

AFAICT, no tuning should be required. It is quite fast.

> It might be helpful, e.g., if there were somewhere in the software
> stack where I could tell part of the system to lie and treat the F20
> as a 4k device?

The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
buffer will hide 4KB write effects.

OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't take
much armchair analysis to come to the conclusion that the F20 is
likely to win that IOPS battle :-)
 -- richard
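[Editor's note: the "armchair analysis" is simple arithmetic. A small sketch using only the rated figures quoted in the thread -- real-world results will of course differ from vendor ratings:]

```python
# Rated random 4 KB write IOPS, as quoted in the thread.
F20_IOPS = 84_000    # Sun F20 (rated)
X25E_IOPS = 3_300    # Intel X25-E (rated)

ratio = F20_IOPS / X25E_IOPS

# Rough best-case 4 KiB write throughput implied by those rates.
f20_mib_s = F20_IOPS * 4096 / 2**20
x25e_mib_s = X25E_IOPS * 4096 / 2**20

print(f"IOPS ratio (F20 : X25-E): {ratio:.1f}x")
print(f"F20 ~{f20_mib_s:.0f} MiB/s vs X25-E ~{x25e_mib_s:.0f} MiB/s")
```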
On Dec 17, 2009, at 9:21 PM, Richard Elling wrote:

> On Dec 17, 2009, at 9:04 PM, stuart anderson wrote:
>>
>> I am particularly interested in zvol performance with a blocksize
>> of 16k and highly compressible data (~10x).
>
> 16 KB recordsize? That seems a little unusual, what is the
> application?

SAM-QFS metadata, whose fundamental disk allocation unit (DAU) size
for metadata is 16kB.

>> I am going to run some comparison tests but would appreciate any
>> initial input on what to look out for or how to tune ZFS to get the
>> most out of the F20.
>
> AFAICT, no tuning should be required. It is quite fast.
>
>> It might be helpful, e.g., if there were somewhere in the software
>> stack where I could tell part of the system to lie and treat the
>> F20 as a 4k device?
>
> The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
> buffer will hide 4KB write effects.

Not from some direct vdbench comparison results I have seen. My main
concern here has to do with ZFS compression, which I need for my
application, breaking up the transfer sizes the F20 sees into smaller
than 4KB writes, where there is a critical performance difference. I
also suspect/hope that SAM-QFS is telling ZFS to aggressively
flush/commit any metadata updates to stable storage, which probably
aggravates the problem, though I have not tested this yet.

> OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't
> take much armchair analysis to come to the conclusion that the F20
> is likely to win that IOPS battle :-)

Though to be fair, you should probably compare a single F20 DOM to an
X25-E, or four X25-Es to a full F20 -- and of course my systems don't
run from an armchair :)

Thanks.
--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
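[Editor's note: the compression concern can be put in numbers. A 16 KB zvol block compressed ~10x shrinks to well under 4 KB. The sketch below illustrates the arithmetic only; the 512-byte sector rounding is an assumption about the device's sector size (ashift), not something stated in the thread:]

```python
# Illustrative arithmetic for a 16 KB zvol block compressed ~10x.
RECORD = 16 * 1024          # 16 KB zvol blocksize from the thread
COMPRESSION_RATIO = 10      # "highly compressible data (~10x)"
SECTOR = 512                # assumed 512 B sector (ashift=9)

compressed = RECORD // COMPRESSION_RATIO        # ~1.6 KB payload
on_disk = -(-compressed // SECTOR) * SECTOR     # round up to sector size

print(f"compressed payload: {compressed} B, on-disk size: {on_disk} B")
print(f"multiple of 4 KB?   {on_disk % 4096 == 0}")
```

Under these assumptions the physical write is around 2 KB, so it cannot be a multiple of the flash device's 4 KB page -- which is exactly the sub-4k write pattern Stuart is worried about.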
On Dec 18, 2009, at 9:40 AM, Stuart Anderson wrote:

> On Dec 17, 2009, at 9:21 PM, Richard Elling wrote:
>
>> 16 KB recordsize? That seems a little unusual, what is the
>> application?
>
> SAM-QFS metadata, whose fundamental disk allocation unit (DAU) size
> for metadata is 16kB.

Ah, ok. That explains it. I'm not sure there are a lot of people doing
this. Most folks don't know QFS exists or is open source.

>> The F20 is rated at 84,000 random 4KB write IOPS. The DRAM write
>> buffer will hide 4KB write effects.
>
> Not from some direct vdbench comparison results I have seen. My main
> concern here has to do with ZFS compression, which I need for my
> application, breaking up the transfer sizes the F20 sees into
> smaller than 4KB writes, where there is a critical performance
> difference. I also suspect/hope that SAM-QFS is telling ZFS to
> aggressively flush/commit any metadata updates to stable storage,
> which probably aggravates the problem, though I have not tested this
> yet.

ZFS will coalesce writes, regardless of the recordsize. However, this
is not the case for writes to the ZIL (for obvious reasons). Measure
the ZIL activity to see how that workload looks.
If you don't see ZIL activity, then you should see (mostly) larger
I/Os when the txg commits.

>> OTOH, the X-25E is rated at 3,300 random 4KB writes. It shouldn't
>> take much armchair analysis to come to the conclusion that the F20
>> is likely to win that IOPS battle :-)
>
> Though to be fair, you should probably compare a single F20 DOM to
> an X25-E, or four X25-Es to a full F20 -- and of course my systems
> don't run from an armchair :)

...or 1,000 1 TB SATA disks... :-)
 -- richard