I am using ZFS as the backing store for an iSCSI target running a virtual machine, and I am looking at using an 8K block size on the ZFS volume.

Looking at the COMSTAR iSCSI settings, there is also a blk size configuration, which defaults to 512 bytes. That would make me believe that all of the IO will be broken down into 512-byte chunks, which seems very inefficient. It seems this value should match the file system allocation/cluster size in the VM, maybe 4K if you are using an NTFS file system.

Does anyone have any input on this?

Thanks,

Geoff
----- "Brandon High" <bhigh at freaks.com> skrev:> On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote: > > I am looking at using 8K block size on the zfs volume. > > 8k is the default for zvols.So with a 1TB zbol with default blocksize, dedup is done on 8k blocks? If so, some 32 gigs of memory (or l2arc) will be required per terabyte for the DDT, which is quite a lot... Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 roy at karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et element?rt imperativ for alle pedagoger ? unng? eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer p? norsk.
On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> I am looking at using 8K block size on the zfs volume.

8k is the default for zvols.

> I was looking at the comstar iscsi settings and there is also a blk size
> configuration, which defaults to 512 bytes. That would make me believe that
> all of the IO will be broken down into 512 bytes which seems very
> inefficient.

I haven't done any tuning on my comstar volumes, and they're using 8k blocks. The setting is in the dataset's volblocksize parameter.

> It seems this value should match the file system allocation/cluster size in
> the VM, maybe 4K if you are using an ntfs file system.

You'll have more overhead using smaller volblocksize values, and get worse compression (since compression is done on the block). If you have dedup enabled, you'll create more entries in the DDT, which can have pretty disastrous consequences on write performance.

Ensuring that your VM is block-aligned to 4k (or the guest OS's block size) boundaries will help performance and dedup as well.

-B

--
Brandon High : bhigh at freaks.com
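A quick way to check (or pin down) the volblocksize on a zvol; the pool and dataset names here are just placeholders:

$ zfs get volblocksize tank/vm01                   # shows 8K unless it was overridden at creation
$ pfexec zfs create -V 100G -b 8k tank/vm02        # volblocksize can only be set when the zvol is created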
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 9:55 AM
>
>On Sun, May 9, 2010 at 9:42 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> I am looking at using 8K block size on the zfs volume.
>
>8k is the default for zvols.

You are right, I didn't look at that property, and instead I was focused on the record size property.

>> I was looking at the comstar iscsi settings and there is also a blk
>> size configuration, which defaults to 512 bytes. That would make me
>> believe that all of the IO will be broken down into 512 bytes which
>> seems very inefficient.
>
>I haven't done any tuning on my comstar volumes, and they're using 8k blocks.
>The setting is in the dataset's volblocksize parameter.

When I look at stmfadm list-lu -v, it shows me a block size of "512". I am running NexentaCore 3.0 (b134+). I wonder if the default size has changed with different versions.

>> It seems this value should match the file system allocation/cluster
>> size in the VM, maybe 4K if you are using an ntfs file system.
>
>You'll have more overhead using smaller volblocksize values, and get worse
>compression (since compression is done on the block). If you have dedup
>enabled, you'll create more entries in the DDT which can have pretty
>disastrous consequences on write performance.
>
>Ensuring that your VM is block-aligned to 4k (or the guest OS's block
>size) boundaries will help performance and dedup as well.

This is where I am probably the most confused and need to get things straightened out in my mind. I thought dedup and compression were done at the record level.

As long as you are using a multiple of the file system block size, then alignment shouldn't be a problem with iscsi-based zvols. When using a zvol, comstar stores the metadata in a zvol object instead of in the first part of the volume.

As Roy pointed out, you have to be careful with the record size because of the DDT and L2ARC lists consuming lots of RAM.

But it seems you have four things to look at: file system block size -> iscsi blk size -> zvol block size -> zvol record size.

What is the relationship between iscsi blk size and zvol block size?

What is the relationship between zvol block size and zvol record size?

Thanks,

Geoff
On Mon, May 10, 2010 at 1:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> You are right, I didn't look at that property, and instead I was focused on
> the record size property.

zvols don't have a recordsize - that's a property of filesystem datasets, not volumes.

> When I look at stmfadm list-lu -v, it shows me a block size of "512". I am
> running NexentaCore 3.0 (b134+). I wonder if the default size has changed
> with different versions.

I see what you're referring to: the iscsi block size, which is what the LUN reports to the initiator as its block size, vs. the block size written to disk. Remember that up until very recently, most drives used 512-byte blocks. Most OSes expect a 512b block and make certain assumptions based on that, which is probably why it's the default.

>> Ensuring that your VM is block-aligned to 4k (or the guest OS's block
>> size) boundaries will help performance and dedup as well.
>
> This is where I am probably the most confused and need to get things
> straightened out in my mind. I thought dedup and compression were done at
> the record level.

It's at the record level for filesystems, block level for zvols.

> As long as you are using a multiple of the file system block size, then
> alignment shouldn't be a problem with iscsi-based zvols. When using a zvol,
> comstar stores the metadata in a zvol object instead of in the first part of
> the volume.

There can be an "off by one" error which will cause small writes to span blocks. If the data is not block-aligned, then a 4k write causes two read/modify/writes (on zfs, two blocks have to be read then written and block pointers updated), whereas an aligned write will not require the existing data to be read. This assumes that the zvol block size = VM fs block size = 4k. In the case where the zvol block size is a multiple of the VM fs block size (eg 4k VM fs, 8k zvol), then writing one fs block will always require a read for an aligned filesystem, but could require two for an unaligned fs if the VM fs block spans two zvol blocks.

There's been a lot of discussion about this lately with the introduction of WD's 4k sector drives, since they have a 512b sector emulation mode.

> What is the relationship between iscsi blk size and zvol block size?

There is none. The iscsi block size is what the target LUN reports to initiators. volblocksize is what size chunks are written to the pool.

> What is the relationship between zvol block size and zvol record size?

They are never both present on a dataset. volblocksize is only for volumes, recordsize is only for filesystems. Both control the size of the unit of data written to the pool. This unit of data is what the checksum is calculated on, and what compression and dedup are performed on.

-B

--
Brandon High : bhigh at freaks.com
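To see the two numbers side by side on a live system (exact output formatting varies by build; the volume name is a placeholder):

$ zfs get -H -o value volblocksize tank/vm01   # size of the chunks ZFS actually writes, 8K by default
$ stmfadm list-lu -v                           # the 512 shown here is only what the LUN reports to initiators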
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 3:12 PM
>
>On Mon, May 10, 2010 at 1:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> You are right, I didn't look at that property, and instead I was
>> focused on the record size property.
>
>zvols don't have a recordsize - that's a property of filesystem datasets, not
>volumes.

Awesome, that makes things a lot clearer now :)

>> When I look at stmfadm list-lu -v, it shows me a block size of "512". I am
>> running NexentaCore 3.0 (b134+). I wonder if the default size has changed
>> with different versions.
>
>I see what you're referring to: the iscsi block size, which is what the LUN
>reports to the initiator as its block size, vs. the block size written to disk.

So in essence this is the disk "sector" size, which again makes sense. Are people actually changing this value?

>> As long as you are using a multiple of the file system block size,
>> then alignment shouldn't be a problem with iscsi-based zvols. When
>> using a zvol, comstar stores the metadata in a zvol object instead of
>> in the first part of the volume.
>
>There can be an "off by one" error which will cause small writes to span
>blocks. If the data is not block-aligned, then a 4k write causes two
>read/modify/writes (on zfs, two blocks have to be read then written and block
>pointers updated), whereas an aligned write will not require the existing
>data to be read. This assumes that the zvol block size = VM fs block size =
>4k. In the case where the zvol block size is a multiple of the VM fs block
>size (eg 4k VM fs, 8k zvol), then writing one fs block will always require a
>read for an aligned filesystem, but could require two for an unaligned fs if
>the VM fs block spans two zvol blocks.
>
>There's been a lot of discussion about this lately with the introduction of
>WD's 4k sector drives, since they have a 512b sector emulation mode.

Doesn't this alignment have more to do with aligning writes to the stripe/segment size of a traditional storage array? The articles I am reading suggest creating a small unused partition to take up the space up to sector 127 (assuming a 128-sector segment), then creating the real partition from the 128th sector going forward. I am not sure how this would happen with zfs.

Thanks for clearing up my misconceptions.

Geoff
On Mon, May 10, 2010 at 3:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
> Doesn't this alignment have more to do with aligning writes to the
> stripe/segment size of a traditional storage array?

It is a lot like a stripe / segment size. If you want to think of it in those terms, you've got a segment of 512b (the iscsi block size) and a width of 16, giving you an 8k stripe size. Any write that is less than 8k will require a RMW cycle, and any write in multiples of 8k will do "full stripe" writes. If the write doesn't start on an 8k boundary, you risk having writes span multiple underlying zvol blocks.

There's an explanation of WD's "Advanced Format" at Anandtech that describes the problem with 4k physical sectors, here: http://www.anandtech.com/show/2888. Instead of sector, think zvol block though.

When using a zvol, you've essentially got $volblocksize-sized physical sectors, but the initiator sees the 512b block size that the LUN is reporting. If you don't block-align, you risk having a write straddle two zfs blocks. There may be some benefit to using a 4k volblocksize, but you'll use more time and space on block checksums, etc. in your zpool. I think 8k is a reasonable trade-off.

> The articles I am reading suggest creating a small unused partition to take
> up the space up to sector 127 (assuming a 128-sector segment), then creating
> the real partition from the 128th sector going forward. I am not sure how
> this would happen with zfs.

If you're using the whole disk with zfs, you don't need to worry about it. If you're using fdisk partitions or slices, you need to be a little more careful.

I made an attempt to 4k block-align the SSD that I'm using for a slog / L2ARC, which in theory should line up better with the device's erase boundary. While not really pertinent to this discussion, it gives some idea of how to do it.

You want the filesystem to start at a point where ( $offset * $sector_size * $sectors_per_cylinder ) % 4096 = 0. For most LBA drives, you've got 16065 sectors/cylinder and 512b sectors, giving 8 as the smallest offset that will align:

( 8 * 512 * 16065 ) % 4096 = 0

First you have to look at fdisk (on an SMI-labeled disk) and realize that you're going to lose the first cylinder to the MBR. When you then create slices in format, it'll report one cylinder less than fdisk did, so remember to account for that in your offset.

For an iscsi LUN used by a VM, you should align its filesystem on a zvol block boundary. Windows Vista and Server 2008 use 240 heads & 63 sectors/track, so they are already 8k block-aligned. Linux, Solaris, and BSD also let you specify the geometry used by fdisk, but I wasn't comfortable doing it with Solaris since you have to create a geometry file first.

For my 30GB OCZ Vertex:

bhigh at basestar:~$ pfexec fdisk -W - /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 default fdisk table
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*    3892 cylinders
[..]
* Id    Act  Bhead  Bsect  Bcyl  Ehead  Esect  Ecyl   Rsect   Numsect
  191   128  0      1      1     254    63     1023   16065   62508915

bhigh at basestar:~$ pfexec prtvtoc /dev/rdsk/c1t0d0p0
* /dev/rdsk/c1t0d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*    3891 cylinders
*    3889 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*            0    112455    112454
*     62428590     48195  62476784
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00      112455   2056320   2168774
       1      4    01     2168775  60243750  62412524
       2      5    01           0  62508915  62508914
       8      1    01           0     16065     16064

-B

--
Brandon High : bhigh at freaks.com
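As a quick check of the alignment arithmetic above, using bash arithmetic and the geometry from that fdisk output:

$ echo $(( 8 * 512 * 16065 % 4096 ))    # 0, so a slice starting at cylinder 8 lands on a 4k boundary
0
$ echo $(( 1 * 512 * 16065 % 4096 ))    # 512, so a slice starting at cylinder 1 does not
512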
>-----Original Message-----
>From: Brandon High [mailto:bhigh at freaks.com]
>Sent: Monday, May 10, 2010 5:56 PM
>
>On Mon, May 10, 2010 at 3:53 PM, Geoff Nordli <geoffn at gnaa.net> wrote:
>> Doesn't this alignment have more to do with aligning writes to the
>> stripe/segment size of a traditional storage array?
>
>It is a lot like a stripe / segment size. If you want to think of it in those
>terms, you've got a segment of 512b (the iscsi block size) and a width of 16,
>giving you an 8k stripe size. Any write that is less than 8k will require a
>RMW cycle, and any write in multiples of 8k will do "full stripe" writes. If
>the write doesn't start on an 8k boundary, you risk having writes span
>multiple underlying zvol blocks.
>
>When using a zvol, you've essentially got $volblocksize-sized physical
>sectors, but the initiator sees the 512b block size that the LUN is
>reporting. If you don't block-align, you risk having a write straddle two zfs
>blocks. There may be some benefit to using a 4k volblocksize, but you'll use
>more time and space on block checksums, etc. in your zpool. I think 8k is a
>reasonable trade-off.
>
>If you're using the whole disk with zfs, you don't need to worry about it.
>If you're using fdisk partitions or slices, you need to be a little more
>careful.

So... as long as you use whole disks and set the volblocksize to a multiple of the virtual machine's file system allocation size, you don't have to worry about alignment/optimization with ZFS.

Thanks again!!

Geoff
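For the archives, a minimal sketch of that recipe (pool, disk, and volume names are placeholders; the 8k volblocksize is only spelled out for clarity since it is already the default, and COMSTAR target/view configuration is omitted):

$ pfexec zpool create tank mirror c1t2d0 c1t3d0       # whole disks, so no partition alignment to worry about
$ pfexec zfs create -V 100G -b 8k tank/vm01           # zvol block size is fixed at creation time
$ pfexec stmfadm create-lu /dev/zvol/rdsk/tank/vm01   # export via COMSTAR; the LUN will still report 512b blocks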