Short question:

I'm curious as to how ZFS manages space (free and used) and how its usage
interacts with thin provisioning provided by HDS arrays.  Is there any
effort to minimize the number of provisioned disk blocks that get writes so
as to not negate any space benefits that thin provisioning may give?

Background & more detailed questions:

In Jeff Bonwick's blog[1], he talks about free space management and
metaslabs.  Of particular interest is the statement: "ZFS divides the space
on each virtual device into a few hundred regions called metaslabs."

1. http://blogs.sun.com/bonwick/entry/space_maps

In Hu Yoshida's (CTO, Hitachi Data Systems) blog[2] there is a discussion
of thin provisioning at the enterprise array level.  Of particular interest
is the statement: "Dynamic Provisioning is not a panacea for all our
storage woes.  There are applications that do a hard format or write across
the volume when they do an allocation and that would negate the value of
thin provisioning."  In another entry[3] he goes on to say: "Capacity is
allocated to 'thin' volumes from this pool in units of 42 MB pages...."

2. http://blogs.hds.com/hu/2007/05/dynamic_or_thin_provisioning.html
3. http://blogs.hds.com/hu/2007/05/thin_and_wide_.html

This says that any time that a 42 MB region gets one sector written, 42 MB
of storage is permanently[4] allocated to the virtual LUN.

4. Until the LUN is destroyed, that is.

I know that ZFS does not do a write across all of the disk as part of
formatting.  Does it, however, drop some sort of metaslab data structures
on each of those "few hundred regions"?

When space is allocated, does it make an attempt to spread the allocations
across all of the metaslabs, or does it more or less fill up one metaslab
before moving to the next?

As data is deleted, do the freed blocks get reused before never used
blocks?

Is there any collaboration between the storage vendors and ZFS developers
to allow the file system to tell the storage array "this range of blocks is
unused" so that the array can reclaim the space?  I could see this as
useful when doing re-writes of data (e.g. crypto rekey) to concentrate data
that had become scattered into contiguous space.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
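To put rough numbers on that 42 MB granularity, here is a small standalone
C sketch.  It is illustrative only: the 100 GB LUN size and the write
offsets are invented for the example, and only the 42 MB page size comes
from the HDS blog cited above.  It counts how much physical capacity a
handful of scattered 4 KB writes would pin on such an array:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE (42ULL * 1024 * 1024)            /* HDS DP page: 42 MB */
#define LUN_SIZE  (100ULL * 1024 * 1024 * 1024)    /* hypothetical 100 GB LUN */

int
main(void)
{
        /* Ten scattered writes (byte offsets); each is only ~4 KB of data. */
        unsigned long long writes[] = {
                0, 1ULL << 30, 5ULL << 30, 10ULL << 30, 20ULL << 30,
                40ULL << 30, 60ULL << 30, 70ULL << 30, 80ULL << 30, 99ULL << 30
        };
        unsigned long long npages = LUN_SIZE / PAGE_SIZE + 1;
        bool *allocated = calloc(npages, sizeof (bool));
        unsigned long long used = 0;

        if (allocated == NULL)
                return (1);

        for (size_t i = 0; i < sizeof (writes) / sizeof (writes[0]); i++) {
                unsigned long long page = writes[i] / PAGE_SIZE;
                if (!allocated[page]) {
                        /* First touch of a 42 MB page allocates all of it. */
                        allocated[page] = true;
                        used += PAGE_SIZE;
                }
        }

        /* Ten 4 KB writes (40 KB of data) pin 10 * 42 MB = 420 MB. */
        printf("pages allocated: %llu (%llu MB)\n",
            used / PAGE_SIZE, used / (1024 * 1024));
        free(allocated);
        return (0);
}

Ten writes carrying about 40 KB of data end up holding 420 MB of pool
capacity on the array, which is exactly the amplification the question is
trying to avoid.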
Victor Latushkin
2007-Sep-14 18:08 UTC
[zfs-discuss] space allocation vs. thin provisioning
Mike Gerdts wrote:
> Short question:

Not so short really :-)  Answers to some questions are inline.  I think
others will correct me if I'm wrong.

> I'm curious as to how ZFS manages space (free and used) and how
> its usage interacts with thin provisioning provided by HDS
> arrays.  Is there any effort to minimize the number of provisioned
> disk blocks that get writes so as to not negate any space
> benefits that thin provisioning may give?
>
> Background & more detailed questions:
>
> In Jeff Bonwick's blog[1], he talks about free space management
> and metaslabs.  Of particular interest is the statement: "ZFS
> divides the space on each virtual device into a few hundred
> regions called metaslabs."
>
> 1. http://blogs.sun.com/bonwick/entry/space_maps
>
> In Hu Yoshida's (CTO, Hitachi Data Systems) blog[2] there is a
> discussion of thin provisioning at the enterprise array level.
> Of particular interest is the statement: "Dynamic Provisioning is
> not a panacea for all our storage woes.  There are applications
> that do a hard format or write across the volume when they do an
> allocation and that would negate the value of thin provisioning."
> In another entry[3] he goes on to say: "Capacity is allocated to
> 'thin' volumes from this pool in units of 42 MB pages...."
>
> 2. http://blogs.hds.com/hu/2007/05/dynamic_or_thin_provisioning.html
> 3. http://blogs.hds.com/hu/2007/05/thin_and_wide_.html
>
> This says that any time that a 42 MB region gets one sector
> written, 42 MB of storage is permanently[4] allocated to the
> virtual LUN.
>
> 4. Until the LUN is destroyed, that is.
>
> I know that ZFS does not do a write across all of the disk as
> part of formatting.  Does it, however, drop some sort of metaslab
> data structures on each of those "few hundred regions"?

No, it does not need to format the disk in any way, because metadata such
as space map information is kept in DMU objects, which do not differ in
nature from other DMU objects and may be stored anywhere.

> When space is allocated, does it make an attempt to spread the
> allocations across all of the metaslabs, or does it more or less
> fill up one metaslab before moving to the next?

The answer to this question is here:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation

In short, outer metaslabs (lower DVAs) are assigned higher weight, and
previously used metaslabs also get a weight boost.

> As data is deleted, do the freed blocks get reused before never
> used blocks?

It depends.  The current implementation of the space map allocator moves
the cursor within a metaslab only when data is allocated; it does not touch
it when data is freed.  From that point of view the answer is "no".  But
when we reach the end of the space map, we start again from its beginning,
so we do get a chance to allocate previously freed space, so the answer is
also "yes".

> Is there any collaboration between the storage vendors and ZFS
> developers to allow the file system to tell the storage array
> "this range of blocks is unused" so that the array can reclaim
> the space?  I could see this as useful when doing re-writes of
> data (e.g. crypto rekey) to concentrate data that had become
> scattered into contiguous space.

I think there is currently no such mechanism, but that does not mean it
cannot be developed.

Hth,
Victor
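A simplified sketch of the selection logic described above may help.  This
is not the actual ZFS metaslab code; the names are invented and the boost
factor is arbitrary.  It only illustrates the idea that lower-offset (outer)
metaslabs get a higher base weight and metaslabs that have already been
allocated from get an extra boost:

#include <stdbool.h>
#include <stdint.h>

typedef struct metaslab {
        uint64_t ms_start;   /* offset of the metaslab on the vdev */
        bool     ms_used;    /* has anything been allocated here before? */
} metaslab_t;

/*
 * Weight a metaslab: lower offsets (outer tracks, lower DVAs) get a higher
 * base weight, and metaslabs we have already allocated from get a boost so
 * writes tend to stay concentrated instead of touching new regions.
 */
static uint64_t
metaslab_weight(const metaslab_t *ms, uint64_t vdev_size)
{
        uint64_t weight = vdev_size - ms->ms_start;

        if (ms->ms_used)
                weight *= 2;    /* the boost factor here is arbitrary */

        return (weight);
}

/* Pick the highest-weighted metaslab of the few hundred on a vdev. */
static metaslab_t *
metaslab_pick(metaslab_t *msv, int count, uint64_t vdev_size)
{
        metaslab_t *best = NULL;
        uint64_t best_weight = 0;

        for (int i = 0; i < count; i++) {
                uint64_t w = metaslab_weight(&msv[i], vdev_size);
                if (w > best_weight) {
                        best_weight = w;
                        best = &msv[i];
                }
        }
        return (best);
}

The concentration effect of the boost is what matters for thin
provisioning: staying inside already-used metaslabs avoids touching pages
the array has not yet allocated.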
Mike Gerdts wrote:
> I'm curious as to how ZFS manages space (free and used) and how
> its usage interacts with thin provisioning provided by HDS
> arrays.  Is there any effort to minimize the number of provisioned
> disk blocks that get writes so as to not negate any space
> benefits that thin provisioning may give?

I was trying to compose an email asking almost the exact same question, but
in the context of array-based replication.  They're similar in the sense
that you're asking about using already-written space rather than going off
into virgin sectors of the disks (in my case, in the hope that the previous
write is still waiting to be replicated and thus can be replaced by the
current data).

> Background & more detailed questions:
>
> In Jeff Bonwick's blog[1], he talks about free space management
> and metaslabs.  Of particular interest is the statement: "ZFS
> divides the space on each virtual device into a few hundred
> regions called metaslabs."
>
> 1. http://blogs.sun.com/bonwick/entry/space_maps

I wish I'd seen this blog while I was composing my question... it answers
some of my questions about how things work (plus Jeff's
zfs_block_allocation entry actually moots most of my comments, since
they've already been implemented).

(snip)

> As data is deleted, do the freed blocks get reused before never
> used blocks?

I didn't see any code where this would happen.  I would really love to see
a zpool setting where I can specify the reuse algorithm.  (For example:
zpool set block_reuse_policy=mru or =dense or =broad or =low)

MRU (most recently used) in the hopes that the storage replication hasn't
yet committed the previous write to the other side of the WAN

DENSE (reuse any previously-written space) in the thin-provisioning case

BROAD (venture off into new space when possible) for media that has rewrite
cycle limitations (flash drives), to spread the writes over as much of the
media as possible

LOW (prioritize low-block# space) would provide optimal rotational latency
for random I/O in the future and might be a special case of the above.  The
corresponding HIGH would improve sequential I/O.  (Implementation is left
as an exercise to the reader ;)

> Is there any collaboration between the storage vendors and ZFS
> developers to allow the file system to tell the storage array
> "this range of blocks is unused" so that the array can reclaim
> the space?  I could see this as useful when doing re-writes of
> data (e.g. crypto rekey) to concentrate data that had become
> scattered into contiguous space.

Deallocating storage space is something that nobody seems to be good at:
ever tried to shrink a filesystem?  Or a ZFS pool?  Or a SAN RAID group?

--Joe
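To make the proposal above a bit more concrete, here is a rough C sketch of
how such a policy knob might bias a weight-based allocator.  Neither the
block_reuse_policy property nor any of these names exist in ZFS; the struct
is a minimal stand-in and the weights are arbitrary:

#include <stdbool.h>
#include <stdint.h>

typedef struct metaslab {            /* minimal stand-in, not the real struct */
        uint64_t ms_start;           /* offset of the metaslab on the vdev */
        bool     ms_used;            /* has anything been written here before? */
        uint64_t ms_last_free_txg;   /* txg in which space was last freed here */
} metaslab_t;

typedef enum block_reuse_policy {
        REUSE_MRU,    /* favor the most recently freed space */
        REUSE_DENSE,  /* favor any previously written space */
        REUSE_BROAD,  /* favor never-written space (wear leveling) */
        REUSE_LOW     /* favor low block numbers */
} block_reuse_policy_t;

static uint64_t
policy_weight(block_reuse_policy_t policy, const metaslab_t *ms,
    uint64_t vdev_size)
{
        switch (policy) {
        case REUSE_MRU:
                /* metaslabs where space was freed most recently win */
                return (ms->ms_last_free_txg);
        case REUSE_DENSE:
                /* anything already written beats untouched regions */
                return (ms->ms_used ? vdev_size : 1);
        case REUSE_BROAD:
                /* untouched regions win, spreading wear across the media */
                return (ms->ms_used ? 1 : vdev_size);
        case REUSE_LOW:
        default:
                /* lower block numbers (outer tracks) win */
                return (vdev_size - ms->ms_start);
        }
}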
On 9/14/07, Moore, Joe <jmoore at ugs.com> wrote:
> I was trying to compose an email asking almost the exact same question,
> but in the context of array-based replication.  They're similar in the
> sense that you're asking about using already-written space rather than
> going off into virgin sectors of the disks (in my case, in the hope that
> the previous write is still waiting to be replicated and thus can be
> replaced by the current data)

At one point, I thought this was how data replication should happen too.
However, unless you have two consecutive writes to the same space,
coalescing the writes could make it so that the data (generically,
including fs metadata) on the replication target may be corrupt.  Generally
speaking, you need to have in-order writes to ensure that you maintain
"crash consistent" data integrity in the event of various failure modes.

Of course, I can see how writes could be batched, coalesced, and applied in
a journaled manner such that each batch fully applies or is rolled back on
the target.  I haven't heard of this being done.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
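A minimal sketch of that batch-and-journal idea follows, assuming invented
structures and function names (no real replication product's interface is
shown).  Each batch is persisted to a journal and fsync'd before being
applied to the target, so the target is crash consistent at batch
boundaries even though writes inside a batch were coalesced:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct repl_write {
        uint64_t    rw_offset;    /* byte offset on the target LUN */
        uint32_t    rw_len;       /* length of the write */
        const void *rw_data;
} repl_write_t;

/*
 * Apply one batch of (possibly coalesced) writes to the target.  The whole
 * batch is recorded in a journal and fsync'd first, then applied to the
 * target device; a crash during the apply phase is recovered by replaying
 * the journal, and truncating the journal marks the batch committed.
 */
static int
repl_apply_batch(int journal_fd, int target_fd,
    const repl_write_t *writes, int count)
{
        /* Phase 1: record the entire batch in the journal. */
        for (int i = 0; i < count; i++) {
                if (write(journal_fd, &writes[i].rw_offset,
                    sizeof (writes[i].rw_offset)) < 0 ||
                    write(journal_fd, &writes[i].rw_len,
                    sizeof (writes[i].rw_len)) < 0 ||
                    write(journal_fd, writes[i].rw_data, writes[i].rw_len) < 0)
                        return (-1);
        }
        if (fsync(journal_fd) != 0)
                return (-1);

        /* Phase 2: apply the batch to the target.  Order within the batch
         * no longer matters for consistency, because an interrupted apply
         * is always re-run from the journal. */
        for (int i = 0; i < count; i++) {
                if (pwrite(target_fd, writes[i].rw_data, writes[i].rw_len,
                    (off_t)writes[i].rw_offset) < 0)
                        return (-1);
        }
        if (fsync(target_fd) != 0)
                return (-1);

        /* Phase 3: an empty journal means the batch is fully committed. */
        return (ftruncate(journal_fd, 0));
}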