I am using ZFS as the backing store for an iSCSI target running a virtual machine, and I am looking at using an 8K block size on the ZFS volume.

Looking at the COMSTAR iSCSI settings, there is also a blk size configuration, which defaults to 512 bytes. That would make me believe that all of the IO will be broken down into 512-byte chunks, which seems very inefficient. It seems this value should match the file system allocation/cluster size in the VM, maybe 4K if you are using an NTFS file system.

Does anyone have any input on this?

Thanks,

Geoff
----- "Brandon High" <bhigh at freaks.com> skrev:> On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote: > > I am looking at using 8K block size on the zfs volume. > > 8k is the default for zvols.So with a 1TB zbol with default blocksize, dedup is done on 8k blocks? If so, some 32 gigs of memory (or l2arc) will be required per terabyte for the DDT, which is quite a lot... Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 roy at karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et element?rt imperativ for alle pedagoger ? unng? eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer p? norsk.
On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> I am looking at using 8K block size on the zfs volume.

8k is the default for zvols.

> I was looking at the comstar iscsi settings and there is also a blk size
> configuration, which defaults to 512 bytes. That would make me believe that
> all of the IO will be broken down into 512 bytes which seems very
> inefficient.

I haven't done any tuning on my comstar volumes, and they're using 8k blocks. The setting is in the dataset's volblocksize parameter.

> It seems this value should match the file system allocation/cluster size in
> the VM, maybe 4K if you are using an ntfs file system.

You'll have more overhead using smaller volblocksize values, and get worse compression (since compression is done on the block). If you have dedup enabled, you'll create more entries in the DDT, which can have pretty disastrous consequences on write performance.

Ensuring that your VM is block-aligned to 4k (or the guest OS's block size) boundaries will help performance and dedup as well.

-B

--
Brandon High : bhigh at freaks.com
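A quick way to check (or pin down) the volblocksize on a zvol; the pool and dataset names here are just placeholders:

$ zfs get volblocksize tank/vm01                   # shows 8K unless it was overridden at creation
$ pfexec zfs create -V 100G -b 8k tank/vm02        # volblocksize can only be set when the zvol is created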
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 9:55 AM
>
>On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> I am looking at using 8K block size on the zfs volume.
>
>8k is the default for zvols.

You are right, I didn't look at that property, and instead I was focused on the record size property.

>> I was looking at the comstar iscsi settings and there is also a blk
>> size configuration, which defaults to 512 bytes. That would make me
>> believe that all of the IO will be broken down into 512 bytes which
>> seems very inefficient.
>
>I haven't done any tuning on my comstar volumes, and they're using 8k blocks.
>The setting is in the dataset's volblocksize parameter.

When I look at stmfadm list-lu -v, it shows me a block size of "512". I am running NexentaCore 3.0 (b134+). I wonder if the default size has changed with different versions.

>> It seems this value should match the file system allocation/cluster
>> size in the VM, maybe 4K if you are using an ntfs file system.
>
>You'll have more overhead using smaller volblocksize values, and get worse
>compression (since compression is done on the block). If you have dedup
>enabled, you'll create more entries in the DDT which can have pretty
>disastrous consequences on write performance.
>
>Ensuring that your VM is block-aligned to 4k (or the guest OS's block
>size) boundaries will help performance and dedup as well.

This is where I am probably the most confused and need to get things straightened out in my mind. I thought dedup and compression were done at the record level.

As long as you are using a multiple of the file system block size, then alignment shouldn't be a problem with iscsi-based zvols. When using a zvol, comstar stores the metadata in a zvol object instead of in the first part of the volume.

As Roy pointed out, you have to be careful with the record size because of the DDT and L2ARC lists consuming lots of RAM.

But it seems you have four things to look at: file system block size -> iscsi blk size -> zvol block size -> zvol record size.

What is the relationship between iscsi blk size and zvol block size?

What is the relationship between zvol block size and zvol record size?

Thanks,

Geoff
On Mon, May 10, 2010 at 1:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> You are right, I didn't look at that property, and instead I was focused on
> the record size property.

zvols don't have a recordsize - that's a property of filesystem datasets, not volumes.

> When I look at stmfadm list-lu -v, it shows me a block size of "512". I am
> running NexentaCore 3.0 (b134+). I wonder if the default size has changed
> with different versions.

I see what you're referring to: the iscsi block size, which is what the LUN reports to the initiator as its block size, vs. the block size written to disk. Remember that up until very recently, most drives used 512-byte blocks. Most OSes expect a 512b block and make certain assumptions based on that, which is probably why it's the default.

>> Ensuring that your VM is block-aligned to 4k (or the guest OS's block
>> size) boundaries will help performance and dedup as well.
>
> This is where I am probably the most confused and need to get things
> straightened out in my mind. I thought dedup and compression were done at
> the record level.

It's at the record level for filesystems, block level for zvols.

> As long as you are using a multiple of the file system block size, then
> alignment shouldn't be a problem with iscsi-based zvols. When using a zvol,
> comstar stores the metadata in a zvol object instead of in the first part of
> the volume.

There can be an "off by one" error which will cause small writes to span blocks. If the data is not block-aligned, then a 4k write causes two read/modify/writes (on zfs, two blocks have to be read then written and block pointers updated), whereas an aligned write will not require the existing data to be read. This assumes that the zvol block size = VM fs block size = 4k. In the case where the zvol block size is a multiple of the VM fs block size (eg 4k VM fs, 8k zvol), then writing one fs block will always require a read for an aligned filesystem, but could require two for an unaligned fs if the VM fs block spans two zvol blocks.

There's been a lot of discussion about this lately with the introduction of WD's 4k sector drives, since they have a 512b sector emulation mode.

> What is the relationship between iscsi blk size and zvol block size?

There is none. The iscsi block size is what the target LUN reports to initiators. volblocksize is what size chunks are written to the pool.

> What is the relationship between zvol block size and zvol record size?

They are never both present on a dataset. volblocksize is only for volumes, recordsize is only for filesystems. Both control the size of the unit of data written to the pool. This unit of data is what the checksum is calculated on, and what compression and dedup are performed on.

-B

--
Brandon High : bhigh at freaks.com
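To see the two numbers side by side on a live system (exact output formatting varies by build; the volume name is a placeholder):

$ zfs get -H -o value volblocksize tank/vm01   # size of the chunks ZFS actually writes, 8K by default
$ stmfadm list-lu -v                           # the 512 shown here is only what the LUN reports to initiators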
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 3:12 PM
>
>On Mon, May 10, 2010 at 1:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> You are right, I didn't look at that property, and instead I was
>> focused on the record size property.
>
>zvols don't have a recordsize - that's a property of filesystem datasets, not
>volumes.

Awesome, that makes things a lot clearer now :)

>> When I look at stmfadm list-lu -v, it shows me a block size of "512". I am
>> running NexentaCore 3.0 (b134+). I wonder if the default size has changed
>> with different versions.
>
>I see what you're referring to: the iscsi block size, which is what the LUN
>reports to the initiator as its block size, vs. the block size written to disk.

So in essence this is the disk "sector" size, which again makes sense. Are people actually changing this value?

>> As long as you are using a multiple of the file system block size,
>> then alignment shouldn't be a problem with iscsi-based zvols. When
>> using a zvol, comstar stores the metadata in a zvol object instead of
>> in the first part of the volume.
>
>There can be an "off by one" error which will cause small writes to span
>blocks. If the data is not block-aligned, then a 4k write causes two
>read/modify/writes (on zfs, two blocks have to be read then written and block
>pointers updated), whereas an aligned write will not require the existing
>data to be read. This assumes that the zvol block size = VM fs block size =
>4k. In the case where the zvol block size is a multiple of the VM fs block
>size (eg 4k VM fs, 8k zvol), then writing one fs block will always require a
>read for an aligned filesystem, but could require two for an unaligned fs if
>the VM fs block spans two zvol blocks.
>
>There's been a lot of discussion about this lately with the introduction of
>WD's 4k sector drives, since they have a 512b sector emulation mode.

Doesn't this alignment have more to do with aligning writes to the stripe/segment size of a traditional storage array? The articles I am reading suggest creating a small unused partition to take up the space up to sector 127 (assuming a 128-sector segment), then creating the real partition from the 128th sector going forward. I am not sure how this would happen with zfs.

Thanks for clearing up my misconceptions.

Geoff
On Mon, May 10, 2010 at 3:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> Doesn't this alignment have more to do with aligning writes to the
> stripe/segment size of a traditional storage array?

It is a lot like a stripe / segment size. If you want to think of it in those terms, you've got a segment of 512b (the iscsi block size) and a width of 16, giving you an 8k stripe size. Any write that is less than 8k will require a RMW cycle, and any write in multiples of 8k will do "full stripe" writes. If the write doesn't start on an 8k boundary, you risk having writes span multiple underlying zvol blocks.

There's an explanation of WD's "Advanced Format" at Anandtech that describes the problem with 4k physical sectors, here: http://www.anandtech.com/show/2888. Instead of sector, think zvol block though.

When using a zvol, you've essentially got $volblocksize-sized physical sectors, but the initiator sees the 512b block size that the LUN is reporting. If you don't block-align, you risk having a write straddle two zfs blocks. There may be some benefit to using a 4k volblocksize, but you'll use more time and space on block checksums, etc. in your zpool. I think 8k is a reasonable trade-off.

> The articles I am reading suggest creating a small unused partition to take
> up the space up to sector 127 (assuming a 128-sector segment), then creating
> the real partition from the 128th sector going forward. I am not sure how
> this would happen with zfs.

If you're using the whole disk with zfs, you don't need to worry about it. If you're using fdisk partitions or slices, you need to be a little more careful.

I made an attempt to 4k block-align the SSD that I'm using for a slog / L2ARC, which in theory should line up better with the device's erase boundary. While not really pertinent to this discussion, it gives some idea of how to do it.

You want the filesystem to start at a point where ( $offset * $sector_size * $sectors_per_cylinder ) % 4096 = 0. For most LBA drives, you've got 16065 sectors/cylinder and 512b sectors, giving 8 as the smallest offset that will align:

( 8 * 512 * 16065 ) % 4096 = 0

First you have to look at fdisk (on an SMI-labeled disk) and realize that you're going to lose the first cylinder to the MBR. When you then create slices in format, it'll report one cylinder less than fdisk did, so remember to account for that in your offset.

For an iscsi LUN used by a VM, you should align its filesystem on a zvol block boundary. Windows Vista and Server 2008 use 240 heads & 63 sectors/track, so they are already 8k block-aligned. Linux, Solaris, and BSD also let you specify the geometry used by fdisk, but I wasn't comfortable doing it with Solaris since you have to create a geometry file first.

For my 30GB OCZ Vertex:

bhigh at basestar:~$ pfexec fdisk -W - /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 default fdisk table
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*    3892 cylinders
[..]
* Id    Act  Bhead  Bsect  Bcyl  Ehead  Esect  Ecyl   Rsect   Numsect
  191   128  0      1      1     254    63     1023   16065   62508915

bhigh at basestar:~$ pfexec prtvtoc /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*    3891 cylinders
*    3889 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*            0    112455    112454
*     62428590     48195  62476784
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00      112455   2056320   2168774
       1      4    01     2168775  60243750  62412524
       2      5    01           0  62508915  62508914
       8      1    01           0     16065     16064

-B

--
Brandon High : bhigh at freaks.com
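As a quick check of the alignment arithmetic above, using bash arithmetic and the geometry from that fdisk output:

$ echo $(( 8 * 512 * 16065 % 4096 ))    # 0, so a slice starting at cylinder 8 lands on a 4k boundary
0
$ echo $(( 1 * 512 * 16065 % 4096 ))    # 512, so a slice starting at cylinder 1 does not
512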
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 5:56 PM
>
>On Mon, May 10, 2010 at 3:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> Doesn't this alignment have more to do with aligning writes to the
>> stripe/segment size of a traditional storage array?
>
>It is a lot like a stripe / segment size. If you want to think of it in those
>terms, you've got a segment of 512b (the iscsi block size) and a width of 16,
>giving you an 8k stripe size. Any write that is less than 8k will require a
>RMW cycle, and any write in multiples of 8k will do "full stripe" writes. If
>the write doesn't start on an 8k boundary, you risk having writes span
>multiple underlying zvol blocks.
>
>When using a zvol, you've essentially got $volblocksize-sized physical
>sectors, but the initiator sees the 512b block size that the LUN is
>reporting. If you don't block-align, you risk having a write straddle two zfs
>blocks. There may be some benefit to using a 4k volblocksize, but you'll use
>more time and space on block checksums, etc. in your zpool. I think 8k is a
>reasonable trade-off.
>
>If you're using the whole disk with zfs, you don't need to worry about it.
>If you're using fdisk partitions or slices, you need to be a little more
>careful.

So... as long as you use whole disks and set the volblocksize to a multiple of the virtual machine's file system allocation size, you don't have to worry about alignment/optimization with ZFS.

Thanks again!!

Geoff
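For the archives, a minimal sketch of that recipe (pool, disk, and volume names are placeholders; the 8k volblocksize is only spelled out for clarity since it is already the default, and COMSTAR target/view configuration is omitted):

$ pfexec zpool create tank mirror c1t2d0 c1t3d0       # whole disks, so no partition alignment to worry about
$ pfexec zfs create -V 100G -b 8k tank/vm01           # zvol block size is fixed at creation time
$ pfexec stmfadm create-lu /dev/zvol/rdsk/tank/vm01   # export via COMSTAR; the LUN will still report 512b blocks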