Daniel Carosone
2011-Oct-04 23:14 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
I sent a zvol from host a, to host b, twice. Host b has two pools,
one ashift=9, one ashift=12. I sent the zvol to each of the pools on
b. The original source pool is ashift=9, and an old revision (2009_06,
because it's still running xen).

I sent it twice, because something strange happened on the first send,
to the ashift=12 pool: "zfs list -o space" showed figures at least
twice those on the source, maybe roughly 2.5 times.

I suspected this might be related to ashift, so I tried the second send
to the ashift=9 pool; those received snapshots line up with the same
space consumption as the source.

What is going on? Is there really that much metadata overhead? How
many metadata blocks are needed for each 8k vol block, and is each
really holding only 512 bytes of metadata in a 4k allocation? Can
they not be packed appropriately for the ashift?

Longer term, if zfs were to pack metadata into full blocks by ashift,
is it likely that this could be introduced via a zpool upgrade, with
space recovered as the metadata is rewritten - or would the pool need
to be recreated? Or is there some other solution in the works?

--
Dan.
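For rough scale, a back-of-the-envelope sketch of the round-up in
question (a Python sketch under assumed on-disk constants: 128-byte
block pointers, 16K-logical L1 indirect blocks, ditto copies=2 for
metadata, each L1 block compressing to ~512 bytes, and the 200G
VOLSIZE reported later in the thread; it covers one level of
indirection only, and the figures are illustrative, not measured from
these pools):

    GiB = 1024 ** 3

    volsize      = 200 * GiB               # VOLSIZE of the zvol
    volblocksize = 8 * 1024                # 8k vol blocks
    bp_size      = 128                     # one ZFS block pointer
    bps_per_l1   = (16 * 1024) // bp_size  # 128 pointers per L1 block

    data_blocks = volsize // volblocksize        # pointers needed
    l1_blocks   = -(-data_blocks // bps_per_l1)  # ceil: ~205K L1 blocks

    for ashift in (9, 12):
        unit  = 1 << ashift
        psize = 512                        # assumed compressed size
        asize = -(-psize // unit) * unit   # rounded up to alloc unit
        total = l1_blocks * asize * 2      # x2 ditto copies
        print("ashift=%-2d: ~%6.1f MiB in L1 indirect blocks"
              % (ashift, total / 2.0 ** 20))

On these assumptions the L1 indirect blocks alone inflate 8x between
ashift=9 and ashift=12, though one level of indirection is far too
small to account for the reported difference by itself; snapshot churn
and further levels of the tree multiply it.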
Richard Elling
2011-Oct-05 04:28 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:

> I sent a zvol from host a, to host b, twice. Host b has two pools,
> one ashift=9, one ashift=12. I sent the zvol to each of the pools on
> b. The original source pool is ashift=9, and an old revision (2009_06,
> because it's still running xen).

:-)

> I sent it twice, because something strange happened on the first send,
> to the ashift=12 pool: "zfs list -o space" showed figures at least
> twice those on the source, maybe roughly 2.5 times.

Can you share the output? "15% of nothin' is nothin'!" - Jimmy Buffett

> I suspected this might be related to ashift, so I tried the second send
> to the ashift=9 pool; those received snapshots line up with the same
> space consumption as the source.
>
> What is going on? Is there really that much metadata overhead? How
> many metadata blocks are needed for each 8k vol block, and is each
> really holding only 512 bytes of metadata in a 4k allocation? Can
> they not be packed appropriately for the ashift?

It doesn't matter how small the metadata compresses: the minimum size
you can write is 4KB.

> Longer term, if zfs were to pack metadata into full blocks by ashift,
> is it likely that this could be introduced via a zpool upgrade, with
> space recovered as the metadata is rewritten - or would the pool need
> to be recreated? Or is there some other solution in the works?

I think we'd need to see the exact layout of the internal data. This
can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
take this offline and report back?
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
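To make the round-up concrete, a minimal sketch (asize here is just a
stand-in name, not a ZFS API):

    def asize(psize, ashift):
        """Allocated size: post-compression size rounded up to 1 << ashift."""
        unit = 1 << ashift
        return (psize + unit - 1) // unit * unit

    print(asize(300, 9), asize(300, 12))   # 512 vs 4096 for 300 bytes

A record that compresses to a few hundred bytes therefore occupies 8x
more space on an ashift=12 pool than on an ashift=9 pool.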
Daniel Carosone
2011-Oct-08 03:25 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>
>> I sent it twice, because something strange happened on the first send,
>> to the ashift=12 pool: "zfs list -o space" showed figures at least
>> twice those on the source, maybe roughly 2.5 times.
>
> Can you share the output?

Source machine, zpool v14, snv_111b:

NAME          AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
int/iscsi_01  99.2G  237G     37.9G    199G              0          0     200G

Destination machine, zpool v31, snv_151b:

NAME           AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
geek/iscsi_01  3.64T  550G     88.4G    461G              0          0     200G
uext/iscsi_01  1.73T  245G     39.2G    206G              0          0     200G

geek is the ashift=12 pool, obviously. I'm assuming the smaller
difference for uext is due to other layout differences between the
pool versions.

>> What is going on? Is there really that much metadata overhead? How
>> many metadata blocks are needed for each 8k vol block, and is each
>> really holding only 512 bytes of metadata in a 4k allocation? Can
>> they not be packed appropriately for the ashift?
>
> It doesn't matter how small the metadata compresses: the minimum size
> you can write is 4KB.

This isn't about whether the metadata compresses; it's about whether
ZFS is smart enough to use all the space in a 4k block for metadata,
rather than assuming it can fit at best 512 bytes, regardless of
ashift. By packing, I meant packing the blocks full rather than
leaving them mostly empty and wasted (nothing to do with compression).

> I think we'd need to see the exact layout of the internal data. This
> can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
> take this offline and report back?

Happy to - what other details / output would you like?

--
Dan.
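Dan's packing question in numbers - a minimal sketch with a
hypothetical record count, assuming 512-byte metadata records on an
ashift=12 pool:

    records  = 8192                      # hypothetical record count
    unpacked = records * 4096            # one 4K allocation per record
    packed   = -(-records // 8) * 4096   # eight records per 4K block
    print("unpacked %d MiB, packed %d MiB (%dx)"
          % (unpacked >> 20, packed >> 20, unpacked // packed))

Packed full, the same records would take one eighth of the space.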
Richard Elling
2011-Oct-08 14:04 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
[exposed organs below?]

On Oct 7, 2011, at 8:25 PM, Daniel Carosone wrote:

> On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
>> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>>
>>> I sent it twice, because something strange happened on the first send,
>>> to the ashift=12 pool: "zfs list -o space" showed figures at least
>>> twice those on the source, maybe roughly 2.5 times.
>>
>> Can you share the output?
>
> Source machine, zpool v14, snv_111b:
>
> NAME          AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
> int/iscsi_01  99.2G  237G     37.9G    199G              0          0     200G
>
> Destination machine, zpool v31, snv_151b:
>
> NAME           AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
> geek/iscsi_01  3.64T  550G     88.4G    461G              0          0     200G
> uext/iscsi_01  1.73T  245G     39.2G    206G              0          0     200G
>
> geek is the ashift=12 pool, obviously. I'm assuming the smaller
> difference for uext is due to other layout differences between the
> pool versions.
>
>>> What is going on? Is there really that much metadata overhead? How
>>> many metadata blocks are needed for each 8k vol block, and is each
>>> really holding only 512 bytes of metadata in a 4k allocation? Can
>>> they not be packed appropriately for the ashift?
>>
>> It doesn't matter how small the metadata compresses: the minimum size
>> you can write is 4KB.
>
> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

The answer is: it depends. Let's look for more clues first...

>> I think we'd need to see the exact layout of the internal data. This
>> can be achieved with the zfs_blkstats macro in mdb. Perhaps we can
>> take this offline and report back?
>
> Happy to - what other details / output would you like?

This is easier to do offline, but while we're here? [assuming a
Solaris-derived OS with mdb]

0. Scrub the pool, so that the block usage stats are loaded.

1. Find the address of the pool's spa structure, for example:

   # echo ::spa | mdb -k
   ADDR                 STATE NAME
   ffffff01c647d580    ACTIVE stuff
   ffffff01c52b1040    ACTIVE syspool

2. Look at the block usage stats, for example:

   # echo ffffff01c52b1040::zfs_blkstats | mdb -k
   Dittoed blocks on same vdev: 4541

   Blocks  LSIZE   PSIZE   ASIZE    avg    comp  %Total  Type
        1    16K      1K   3.00K  3.00K   16.00    0.00  object directory
        3  1.50K   1.50K   4.50K  1.50K    1.00    0.00  object array
      163  19.8M   1.46M   4.39M  27.6K   13.52    0.28  bpobj
      336  1.79M    724K   2.12M  6.46K    2.53    0.13  SPA space map
   ...

3. Compare the block usage stats for the various pools.

Block counts are obvious.
LSIZE  = logical size
PSIZE  = physical size, after compression
ASIZE  = allocated size: how much disk space is used (including raidz
         parity and extra copies)
avg    = average allocated size per block
comp   = compression ratio (LSIZE:PSIZE)
%Total = percent of total allocated space

It should be obvious that ashift = 9 for the above example.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
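For step 3, a rough helper for comparing the ASIZE column per block
type across pools (a sketch that assumes the exact column layout shown
above; the size-suffix parsing is approximate, and blkstats_asize.py
is a made-up name):

    import re
    import sys

    SCALE = {'': 1, 'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}

    def to_bytes(s):
        """Parse mdb-style sizes like '4.39M' (powers of 1024)."""
        m = re.match(r'([\d.]+)([KMGT]?)$', s)
        return float(m.group(1)) * SCALE[m.group(2)]

    for line in sys.stdin:
        cols = line.split()
        if len(cols) >= 8 and cols[0].isdigit():   # data rows only
            print('%12.2f MiB  %s'
                  % (to_bytes(cols[3]) / 2.0**20, ' '.join(cols[7:])))

Run it once per pool, e.g.

   # echo ADDR::zfs_blkstats | mdb -k | python blkstats_asize.py

and diff the outputs to see which block types account for the extra
allocation on the ashift=12 pool.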
Bob Friesenhahn
2011-Oct-10 18:24 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Sat, 8 Oct 2011, Daniel Carosone wrote:

> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

It would seem like quite a problem/challenge to stuff unrelated
metadata into the same metadata block. The COW algorithm would become
much more complex and slow, since many otherwise-unrelated references
to the same block would need to be updated whenever that block is
updated (copied).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
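Bob's fan-in point as a toy model (illustrative only - nothing like
real ZFS code): pack eight unrelated entries into one block, and a COW
of that block dirties every branch that references it.

    def cow_rewrite(block, parents, dirtied):
        """Relocating a block forces every block referencing it to move."""
        if block in dirtied:
            return
        dirtied.add(block)
        for parent in parents.get(block, ()):
            cow_rewrite(parent, parents, dirtied)

    # One packed block referenced by 8 unrelated branches, 2 levels deep:
    parents = {"packed": ["branch%d" % i for i in range(8)]}
    for i in range(8):
        parents["branch%d" % i] = ["top%d" % i]
        parents["top%d" % i] = ["uberblock"]

    dirtied = set()
    cow_rewrite("packed", parents, dirtied)
    print(len(dirtied), "blocks rewritten for one small change")  # 18

With an unshared block, the same change would dirty only one block per
level (4 here).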
Jim Klimov
2011-Oct-10 18:26 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
2011-10-08 7:25, Daniel Carosone wrote:
> On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
>> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
>>
>>> What is going on? Is there really that much metadata overhead? How
>>> many metadata blocks are needed for each 8k vol block, and is each
>>> really holding only 512 bytes of metadata in a 4k allocation? Can
>>> they not be packed appropriately for the ashift?
>>
>> It doesn't matter how small the metadata compresses: the minimum size
>> you can write is 4KB.
>
> This isn't about whether the metadata compresses; it's about whether
> ZFS is smart enough to use all the space in a 4k block for metadata,
> rather than assuming it can fit at best 512 bytes, regardless of
> ashift. By packing, I meant packing the blocks full rather than
> leaving them mostly empty and wasted (nothing to do with compression).

Compression or packing won't cut it, I think. At least, that's why I
abandoned my first suggested solution in that bug tracker and proposed
another.

Basically, my first idea was borrowed from the ATM protocol, where
fixed-size (small) "cells" make up "frames" - whole units sent over
the wire with a common header, checksum, etc. Likewise, I proposed
that 4kb on-disk blocks (ashift=12) be regarded as being made up of 8
(or more) 512-byte "cells", each holding a portion of metadata. A
major downside to such a solution would be the incompatibility it
introduces with other ZFS implementations, in terms of on-disk data
and its interpretation by code.

Thus I proposed the second idea: a code-only solution to optimize
performance (force user-configured minimal data block sizes and
physical alignments) where metadata blocks remain 512 bytes because
the pool is formally ashift=9 - and the on-disk data stays compatible
with other pools and OSes running ZFS.

As far as I understand, each 512-byte block (as on an ashift=9 pool)
was already too big for a single "quantum" of metadata (which
apparently ranges around 200-300 bytes, according to "zdb -DD"). For
performance reasons, at least, each metadata block is addressed as an
individual block in the ZFS tree of blocks (roughly: rooted at the
uberblock, branching at metadata blocks, leafing at data blocks).
Upon every change of data (and TXG sync), the whole branch of metadata
blocks leading up to the uberblock has to be updated, and these blocks
are written anew into empty (as-yet-unassigned) space on the pool,
thanks to ZFS COW never overwriting live data.

On one hand, it does not seem like a problem to coalesce writes of 8
metadata blocks into 4kb portions - in code only - so that new
4kb-sector-aware ZFS code would perform well on newer HDDs and waste
less space than it does now. On the other hand, I do not know how long
the tree branches are; perhaps any change of a data block produces
enough metadata changes to fill up a 4kb block, or several, or a large
portion of one - so in practice there may be no need to wait for a
chance to accumulate several metadata blocks...

Thus I think my second solution is viable.

//Jim
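Jim's code-only coalescing idea as a toy sketch (illustration only,
nothing like the real ZIO pipeline; CoalescingWriter is a made-up
name): buffer the 512-byte metadata writes of a formally-ashift=9 pool
and issue them in 4K physical-sector units.

    class CoalescingWriter:
        """Accumulate 512-byte records; flush 4K-sector-sized writes."""
        def __init__(self, phys=4096):
            self.phys = phys
            self.pending = bytearray()
            self.issued = []               # stands in for disk writes

        def write(self, record):
            self.pending += record
            while len(self.pending) >= self.phys:
                self.issued.append(bytes(self.pending[:self.phys]))
                del self.pending[:self.phys]

        def flush(self):                   # pad the tail to a full sector
            if self.pending:
                pad = -len(self.pending) % self.phys
                self.issued.append(bytes(self.pending) + b"\0" * pad)
                self.pending = bytearray()

    w = CoalescingWriter()
    for i in range(10):                    # ten 512-byte metadata blocks
        w.write(bytes([i]) * 512)
    w.flush()
    print(len(w.issued), "physical writes instead of 10")   # prints 2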
Bob Friesenhahn
2011-Oct-10 18:48 UTC
[zfs-discuss] zvol space consumption vs ashift, metadata packing
On Mon, 10 Oct 2011, Jim Klimov wrote:
>
> Thus I proposed the second idea: a code-only solution to optimize
> performance (force user-configured minimal data block sizes and
> physical alignments) where metadata blocks remain 512 bytes because
> the pool is formally ashift=9 - and the on-disk data stays compatible
> with other pools and OSes running ZFS.

The problem with this approach is that it results in the same write
amplification that ashift=12 is trying to avoid. If the underlying
hardware writes in 4K sectors, it will still be writing 4K sectors no
matter what ZFS asks for. Storage space is saved, but much or most of
the performance benefit of ashift=12 goes away - or the problem
becomes even worse, due to concurrent updates being coalesced at the
hardware level. File data is usually much larger than 4K, so most of
the performance problems caused by 4K sectors must be due to writing
metadata.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
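Bob's read-modify-write point in numbers (a sketch; the op counts are
an idealized device-level cost, ignoring caching and coalescing):

    SECTOR = 4096                      # drive's physical sector size

    def device_ops(write_size, aligned):
        """Idealized ops: unaligned writes cost a read-modify-write."""
        sectors = -(-write_size // SECTOR)
        if aligned and write_size % SECTOR == 0:
            return sectors             # plain sector writes
        return 2 * sectors             # read, merge, write back

    print(device_ops(512, aligned=False))   # 2: one ashift=9 metadata write
    print(device_ops(4096, aligned=True))   # 1: the same write at ashift=12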