Daniel Carosone
2011-Oct-04 23:14 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
I sent a zvol from host a, to host b, twice. Host b has two pools,
one ashift=9, one ashift=12. I sent the zvol to each of the pools on
b. The original source pool is ashift=9, and an old revision (2009_06,
because it's still running xen).

I sent it twice, because something strange happened on the first send,
to the ashift=12 pool: "zfs list -o space" showed figures at least
twice those on the source, maybe roughly 2.5 times.

I suspected this might be related to ashift, so I tried the second send
to the ashift=9 pool; those received snapshots line up with the same
space consumption as the source.

What is going on? Is there really that much metadata overhead? How
many metadata blocks are needed for each 8k vol block, and is each
really holding only 512 bytes of metadata in a 4k allocation? Can
they not be packed appropriately for the ashift?

Longer term, if zfs were to pack metadata into full blocks by ashift,
is it likely that this could be introduced via a zpool upgrade, with
space recovered as the metadata is rewritten - or would the pool need
to be recreated? Or is there some other solution in the works?

--
Dan.
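For rough scale, a back-of-the-envelope sketch of the round-up in
question (a Python sketch under assumed on-disk constants: 128-byte
block pointers, 16K-logical L1 indirect blocks, ditto copies=2 for
metadata, each L1 block compressing to ~512 bytes, and the 200G
VOLSIZE reported later in the thread; it covers one level of
indirection only, and the figures are illustrative, not measured from
these pools):

    GiB = 1024 ** 3

    volsize      = 200 * GiB               # VOLSIZE of the zvol
    volblocksize = 8 * 1024                # 8k vol blocks
    bp_size      = 128                     # one ZFS block pointer
    bps_per_l1   = (16 * 1024) // bp_size  # 128 pointers per L1 block

    data_blocks = volsize // volblocksize        # pointers needed
    l1_blocks   = -(-data_blocks // bps_per_l1)  # ceil: ~205K L1 blocks

    for ashift in (9, 12):
        unit  = 1 << ashift
        psize = 512                        # assumed compressed size
        asize = -(-psize // unit) * unit   # rounded up to alloc unit
        total = l1_blocks * asize * 2      # x2 ditto copies
        print("ashift=%-2d: ~%6.1f MiB in L1 indirect blocks"
              % (ashift, total / 2.0 ** 20))

On these assumptions the L1 indirect blocks alone inflate 8x between
ashift=9 and ashift=12, though one level of indirection is far too
small to account for the reported difference by itself; snapshot churn
and further levels of the tree multiply it.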
Richard Elling
2011-Oct-05 04:28 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:

> I sent a zvol from host a, to host b, twice. Host b has two pools,
> one ashift=9, one ashift=12. I sent the zvol to each of the pools on
> b. The original source pool is ashift=9, and an old revision (2009_06,
> because it's still running xen).

:-)

> I sent it twice, because something strange happened on the first send,
> to the ashift=12 pool: "zfs list -o space" showed figures at least
> twice those on the source, maybe roughly 2.5 times.

Can you share the output? "15% of nothin' is nothin'!" - Jimmy Buffett

> I suspected this might be related to ashift, so I tried the second send
> to the ashift=9 pool; those received snapshots line up with the same
> space consumption as the source.
>
> What is going on? Is there really that much metadata overhead? How
> many metadata blocks are needed for each 8k vol block, and is each
> really holding only 512 bytes of metadata in a 4k allocation? Can
> they not be packed appropriately for the ashift?

It doesn't matter how small the metadata compresses: the minimum size
you can write is 4KB.

> Longer term, if zfs were to pack metadata into full blocks by ashift,
> is it likely that this could be introduced via a zpool upgrade, with
> space recovered as the metadata is rewritten - or would the pool need
> to be recreated? Or is there some other solution in the works?

I think we'd need to see the exact layout of the internal data. This
can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
take this offline and report back?
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
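To make the round-up concrete, a minimal sketch (asize here is just a
stand-in name, not a ZFS API):

    def asize(psize, ashift):
        """Allocated size: post-compression size rounded up to 1 << ashift."""
        unit = 1 << ashift
        return (psize + unit - 1) // unit * unit

    print(asize(300, 9), asize(300, 12))   # 512 vs 4096 for 300 bytes

A record that compresses to a few hundred bytes therefore occupies 8x
more space on an ashift=12 pool than on an ashift=9 pool.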
Daniel Carosone
2011-Oct-08 03:25 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>
>> I sent it twice, because something strange happened on the first send,
>> to the ashift=12 pool: "zfs list -o space" showed figures at least
>> twice those on the source, maybe roughly 2.5 times.
>
> Can you share the output?

Source machine, zpool v14, snv_111b:

NAME          AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
int/iscsi_01  99.2G  237G     37.9G    199G              0          0     200G

Destination machine, zpool v31, snv_151b:

NAME           AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
geek/iscsi_01  3.64T  550G     88.4G    461G              0          0     200G
uext/iscsi_01  1.73T  245G     39.2G    206G              0          0     200G

geek is the ashift=12 pool, obviously. I'm assuming the smaller
difference for uext is due to other layout differences between the
pool versions.

>> What is going on? Is there really that much metadata overhead? How
>> many metadata blocks are needed for each 8k vol block, and is each
>> really holding only 512 bytes of metadata in a 4k allocation? Can
>> they not be packed appropriately for the ashift?
>
> It doesn't matter how small the metadata compresses: the minimum size
> you can write is 4KB.

This isn't about whether the metadata compresses; it's about whether
ZFS is smart enough to use all the space in a 4k block for metadata,
rather than assuming it can fit at best 512 bytes, regardless of
ashift. By packing, I meant packing the blocks full rather than
leaving them mostly empty and wasted (nothing to do with compression).

> I think we'd need to see the exact layout of the internal data. This
> can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
> take this offline and report back?

Happy to - what other details / output would you like?

--
Dan.
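Dan's packing question in numbers - a minimal sketch with a
hypothetical record count, assuming 512-byte metadata records on an
ashift=12 pool:

    records  = 8192                      # hypothetical record count
    unpacked = records * 4096            # one 4K allocation per record
    packed   = -(-records // 8) * 4096   # eight records per 4K block
    print("unpacked %d MiB, packed %d MiB (%dx)"
          % (unpacked >> 20, packed >> 20, unpacked // packed))

Packed full, the same records would take one eighth of the space.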
Richard Elling
2011-Oct-08 14:04 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
[exposed organs below?]

On Oct 7, 2011, at 8:25 PM, Daniel Carosone wrote:

> On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
>> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>>
>>> I sent it twice, because something strange happened on the first send,
>>> to the ashift=12 pool: "zfs list -o space" showed figures at least
>>> twice those on the source, maybe roughly 2.5 times.
>>
>> Can you share the output?
>
> Source machine, zpool v14, snv_111b:
>
> NAME          AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
> int/iscsi_01  99.2G  237G     37.9G    199G              0          0     200G
>
> Destination machine, zpool v31, snv_151b:
>
> NAME           AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
> geek/iscsi_01  3.64T  550G     88.4G    461G              0          0     200G
> uext/iscsi_01  1.73T  245G     39.2G    206G              0          0     200G
>
> geek is the ashift=12 pool, obviously. I'm assuming the smaller
> difference for uext is due to other layout differences between the
> pool versions.
>
>>> What is going on? Is there really that much metadata overhead? How
>>> many metadata blocks are needed for each 8k vol block, and is each
>>> really holding only 512 bytes of metadata in a 4k allocation? Can
>>> they not be packed appropriately for the ashift?
>>
>> It doesn't matter how small the metadata compresses: the minimum size
>> you can write is 4KB.
>
> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

The answer is: it depends. Let's look for more clues first...

>> I think we'd need to see the exact layout of the internal data. This
>> can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
>> take this offline and report back?
>
> Happy to - what other details / output would you like?

This is easier to do offline, but while we're here? [assuming a
Solaris-derived OS with mdb]

0. Scrub the pool, so that the block usage stats are loaded.

1. Find the address of the pool's spa structure, for example:

   # echo ::spa | mdb -k
   ADDR                 STATE NAME
   ffffff01c647d580    ACTIVE stuff
   ffffff01c52b1040    ACTIVE syspool

2. Look at the block usage stats, for example:

   # echo ffffff01c52b1040::zfs_blkstats | mdb -k
   Dittoed blocks on same vdev: 4541

   Blocks  LSIZE   PSIZE   ASIZE    avg    comp  %Total  Type
        1    16K      1K   3.00K  3.00K   16.00    0.00  object directory
        3  1.50K   1.50K   4.50K  1.50K    1.00    0.00  object array
      163  19.8M   1.46M   4.39M  27.6K   13.52    0.28  bpobj
      336  1.79M    724K   2.12M  6.46K    2.53    0.13  SPA space map
   ...

3. Compare the block usage stats for the various pools.

Block counts are obvious.
LSIZE  = logical size
PSIZE  = physical size, after compression
ASIZE  = allocated size: how much disk space is used (including raidz
         parity and extra copies)
avg    = average allocated size per block
comp   = compression ratio (LSIZE:PSIZE)
%Total = percent of total allocated space

It should be obvious that ashift = 9 for the above example.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
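For step 3, a rough helper for comparing the ASIZE column per block
type across pools (a sketch that assumes the exact column layout shown
above; the size-suffix parsing is approximate, and blkstats_asize.py
is a made-up name):

    import re
    import sys

    SCALE = {'': 1, 'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}

    def to_bytes(s):
        """Parse mdb-style sizes like '4.39M' (powers of 1024)."""
        m = re.match(r'([\d.]+)([KMGT]?)$', s)
        return float(m.group(1)) * SCALE[m.group(2)]

    for line in sys.stdin:
        cols = line.split()
        if len(cols) >= 8 and cols[0].isdigit():   # data rows only
            print('%12.2f MiB  %s'
                  % (to_bytes(cols[3]) / 2.0**20, ' '.join(cols[7:])))

Run it once per pool, e.g.

   # echo ADDR::zfs_blkstats | mdb -k | python blkstats_asize.py

and diff the outputs to see which block types account for the extra
allocation on the ashift=12 pool.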
Bob Friesenhahn
2011-Oct-10 18:24 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Sat, 8 Oct 2011, Daniel Carosone wrote:

> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

It would seem like quite a problem/challenge to stuff unrelated
metadata into the same metadata block. The COW algorithm would become
much more complex and slow, since many otherwise-unrelated references
to the same block would need to be updated whenever that block is
updated (copied).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
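Bob's fan-in point as a toy model (illustrative only - nothing like
real ZFS code): pack eight unrelated entries into one block, and a COW
of that block dirties every branch that references it.

    def cow_rewrite(block, parents, dirtied):
        """Relocating a block forces every block referencing it to move."""
        if block in dirtied:
            return
        dirtied.add(block)
        for parent in parents.get(block, ()):
            cow_rewrite(parent, parents, dirtied)

    # One packed block referenced by 8 unrelated branches, 2 levels deep:
    parents = {"packed": ["branch%d" % i for i in range(8)]}
    for i in range(8):
        parents["branch%d" % i] = ["top%d" % i]
        parents["top%d" % i] = ["uberblock"]

    dirtied = set()
    cow_rewrite("packed", parents, dirtied)
    print(len(dirtied), "blocks rewritten for one small change")  # 18

With an unshared block, the same change would dirty only one block per
level (4 here).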
Jim Klimov
2011-Oct-10 18:26 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
2011-10-08 7:25, Daniel Carosone wrote:
> On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
>> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>>
>>> What is going on? Is there really that much metadata overhead? How
>>> many metadata blocks are needed for each 8k vol block, and is each
>>> really holding only 512 bytes of metadata in a 4k allocation? Can
>>> they not be packed appropriately for the ashift?
>>
>> It doesn't matter how small the metadata compresses: the minimum size
>> you can write is 4KB.
>
> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

Compression or packing won't cut it, I think. At least, that's why I
abandoned my first suggested solution in that bug tracker and proposed
another.

Basically, my first idea was borrowed from the ATM protocol, where
fixed-size (small) "cells" make up "frames" - whole units sent over
the wire with a common header, checksum, etc. Likewise, I proposed
that 4kb on-disk blocks (ashift=12) be regarded as being made up of 8
(or more) 512-byte "cells", each holding a portion of metadata. A
major downside to such a solution would be the incompatibility it
introduces with other ZFS implementations, in terms of on-disk data
and its interpretation by code.

Thus I proposed the second idea: a code-only solution to optimize
performance (force user-configured minimal data block sizes and
physical alignments) where metadata blocks remain 512 bytes because
the pool is formally ashift=9 - and the on-disk data stays compatible
with other pools and OSes running ZFS.

As far as I understand, each 512-byte block (as on an ashift=9 pool)
was already too big for a single "quantum" of metadata (which
apparently ranges around 200-300 bytes, according to "zdb -DD"). For
performance reasons, at least, each metadata block is addressed as an
individual block in the ZFS tree of blocks (roughly: rooted at the
uberblock, branching at metadata blocks, leafing at data blocks).
Upon every change of data (and TXG sync), the whole branch of metadata
blocks leading up to the uberblock has to be updated, and these blocks
are written anew into empty (as-yet-unassigned) space on the pool,
thanks to ZFS COW never overwriting live data.

On one hand, it does not seem like a problem to coalesce writes of 8
metadata blocks into 4kb portions - in code only - so that new
4kb-sector-aware ZFS code would perform well on newer HDDs and waste
less space than it does now. On the other hand, I do not know how long
the tree branches are; perhaps any change of a data block produces
enough metadata changes to fill up a 4kb block, or several, or a large
portion of one - so in practice there may be no need to wait for a
chance to accumulate several metadata blocks...

Thus I think my second solution is viable.

//Jim
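Jim's code-only coalescing idea as a toy sketch (illustration only,
nothing like the real ZIO pipeline; CoalescingWriter is a made-up
name): buffer the 512-byte metadata writes of a formally-ashift=9 pool
and issue them in 4K physical-sector units.

    class CoalescingWriter:
        """Accumulate 512-byte records; flush 4K-sector-sized writes."""
        def __init__(self, phys=4096):
            self.phys = phys
            self.pending = bytearray()
            self.issued = []               # stands in for disk writes

        def write(self, record):
            self.pending += record
            while len(self.pending) >= self.phys:
                self.issued.append(bytes(self.pending[:self.phys]))
                del self.pending[:self.phys]

        def flush(self):                   # pad the tail to a full sector
            if self.pending:
                pad = -len(self.pending) % self.phys
                self.issued.append(bytes(self.pending) + b"\0" * pad)
                self.pending = bytearray()

    w = CoalescingWriter()
    for i in range(10):                    # ten 512-byte metadata blocks
        w.write(bytes([i]) * 512)
    w.flush()
    print(len(w.issued), "physical writes instead of 10")   # prints 2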
Bob Friesenhahn
2011-Oct-10 18:48 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Mon, 10 Oct 2011, Jim Klimov wrote:
>
> Thus I proposed the second idea: a code-only solution to optimize
> performance (force user-configured minimal data block sizes and
> physical alignments) where metadata blocks remain 512 bytes because
> the pool is formally ashift=9 - and the on-disk data stays compatible
> with other pools and OSes running ZFS.

The problem with this approach is that it results in the same write
amplification that ashift=12 is trying to avoid. If the underlying
hardware writes in 4K sectors, it will still be writing 4K sectors no
matter what ZFS asks for. Storage space is saved, but much or most of
the performance benefit of ashift=12 goes away - or the problem
becomes even worse, due to concurrent updates being coalesced at the
hardware level. File data is usually much larger than 4K, so most of
the performance problems caused by 4K sectors must be due to writing
metadata.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
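Bob's read-modify-write point in numbers (a sketch; the op counts are
an idealized device-level cost, ignoring caching and coalescing):

    SECTOR = 4096                      # drive's physical sector size

    def device_ops(write_size, aligned):
        """Idealized ops: unaligned writes cost a read-modify-write."""
        sectors = -(-write_size // SECTOR)
        if aligned and write_size % SECTOR == 0:
            return sectors             # plain sector writes
        return 2 * sectors             # read, merge, write back

    print(device_ops(512, aligned=False))   # 2: one ashift=9 metadata write
    print(device_ops(4096, aligned=True))   # 1: the same write at ashift=12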