Victor Latushkin
2007-Sep-15 16:55 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
I'm proposing a new project for the ZFS community - Block Selection Policy
and Space Map Enhancements.
The space map [1] is a very efficient data structure for keeping track of free
space in the metaslabs, but there is at least one area for improvement -
the block selection algorithm that operates on it could be better (see [2]).
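(As background for readers new to this code: on disk a space map is essentially
a time-ordered log of allocations and frees, condensed when it grows too large,
and it is replayed into an in-core tree of free segments when a metaslab is
activated. The following is a rough, purely illustrative sketch of the two
views - the names sm_entry_t and seg_t are made up here and are not the actual
declarations from space_map.h.)

#include <stdint.h>

/*
 * On disk: an append-only log of allocation and free records,
 * condensed when the log grows larger than it is worth.
 */
typedef enum { SM_ALLOC, SM_FREE } sm_type_t;

typedef struct sm_entry {
	sm_type_t	sme_type;	/* was the range allocated or freed? */
	uint64_t	sme_offset;	/* start of the range */
	uint64_t	sme_length;	/* length of the range */
} sm_entry_t;

/*
 * In core: the log replayed into a set of free [start, end) segments,
 * kept sorted by offset (an AVL tree in the kernel).
 */
typedef struct seg {
	uint64_t	s_start;	/* inclusive */
	uint64_t	s_end;		/* exclusive */
	struct seg	*s_next;
} seg_t;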
I propose a community project to address the following:
* define requirements for a block selection policy for various workloads
* improve the current space map / metaslab implementation to allow for
efficient implementation of multiple block selection policies
* develop a collection of block selection policies optimized for
various workloads and requirements
You can find some background, motivation, and requirements in the writeup
below.
With best regards,
Victor
Background:
==========
The current block selection algorithm, as implemented in metaslab_ff_alloc(),
has caused some pain recently (see [3-9,10] for examples), and as a result
bug 6495013 "Loops and recursion in metaslab_ff_alloc can kill performance,
even on a pool with lots of free data" [11] was filed. Investigation of
this bug identified a race condition in the metaslab selection code (see
metaslab_group_alloc() for details); fixing it provided some relief but
did not solve the problem completely (see [4,10] for example).
Current Limitations:
===================
The synopsis of bug 6495013 suggests that it is the loops and recursion in
metaslab_ff_alloc() that may kill performance. Indeed, the loops and
recursion in metaslab_ff_alloc() make it an O(N log N) algorithm in the
worst case, where N is the number of segments in the space map. Free space
fragmentation and the alignment requirements of metaslab_ff_alloc() can
cause this to surface even on a pool with lots of free space.
For example, let's imagine a pool consisting of only one 256k metaslab
with two allocated 512-byte blocks - one at the beginning, another at
the end. The space map for this metaslab will contain only one free
segment, [512, 261632). An attempt to allocate a 128k block from this
space map will fail, even though we definitely have 255k of contiguous
free space.
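To see why, here is a deliberately simplified, self-contained sketch of
first-fit allocation in the spirit of metaslab_ff_alloc(), run against this
example. The names ff_alloc and seg_t are made up for illustration; the real
function additionally keeps per-alignment cursors in sm_ppd and walks an AVL
tree rather than a list.

#include <stdint.h>
#include <stdio.h>

typedef struct seg {
	uint64_t	s_start;	/* first free byte (inclusive) */
	uint64_t	s_end;		/* last free byte + 1 (exclusive) */
	struct seg	*s_next;	/* segments kept in offset order */
} seg_t;

/*
 * First-fit sketch: walk the free segments in offset order, rounding
 * each candidate offset up to the natural alignment of the requested
 * size.  In the worst case every segment is visited.  Returns an
 * offset, or UINT64_MAX on failure (when the ganging fallback below
 * kicks in).
 */
static uint64_t
ff_alloc(seg_t *head, uint64_t size)
{
	uint64_t align = size & -size;		/* natural alignment */

	for (seg_t *s = head; s != NULL; s = s->s_next) {
		uint64_t off = (s->s_start + align - 1) & ~(align - 1);
		if (off + size <= s->s_end)
			return (off);
	}
	return (UINT64_MAX);
}

int
main(void)
{
	/* the example above: one free segment [512, 261632), i.e. 255k */
	seg_t free_seg = { 512, 261632, NULL };

	printf("128k allocation: %s\n",
	    ff_alloc(&free_seg, 128 * 1024) == UINT64_MAX ? "fails" : "ok");
	printf(" 64k allocation: %s\n",
	    ff_alloc(&free_seg, 64 * 1024) == UINT64_MAX ? "fails" : "ok");
	return (0);
}

The only 128k-aligned offset inside [512, 261632) is 131072, and
131072 + 131072 = 262144 > 261632, so the 128k attempt fails even though a
64k attempt succeeds.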
Gang blocks come to the rescue here (see [13]). We start trying to
allocate smaller block sizes - 64k, 32k and so on - until we have
allocated enough smaller blocks to satisfy the 128k request. In this
case, depending on the position of the 64k-aligned and 512b-aligned
cursors (see the use of sm_ppd in metaslab_ff_alloc()), we may get
several variants of smaller block locations (GH is the gang header,
GM1 and GM2 are gang members, A indicates allocated blocks, and F free
space in the space map):
         GH       GM1       GM2
A:[0-512)[512-1k) [64k-128k)[128k-192k) [255.5k-256k)
F:               [1k-64k)              [192k-255.5k)
In this case we effectively allocate a 128k block that is not aligned on a
128k boundary, with the additional overhead of a gang header.
         GH       GM2       GM1
A:[0-512)[512-1k) [64k-128k)[128k-192k) [255.5k-256k)
F:               [1k-64k)              [192k-255.5k)
In this case the halves of our 128k block are swapped.
In the case where the 512b-aligned cursor points to offset 64k and we allocate
the GH starting at that offset, we'll have to allocate 3 gang members - one
64k and two 32k - fragmenting free space further.
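The fallback itself amounts to retrying with progressively smaller sizes.
A rough, hypothetical sketch of that behaviour follows; alloc_block() is a
toy stand-in for metaslab_alloc(), and alloc_or_gang(), GANG_HDR_SIZE and the
other names are made up here. The real logic lives in
zio_write_allocate_gang_members() [13] and also has to build the gang header
and handle multiple copies.

#include <stdint.h>
#include <stdio.h>

#define	GANG_HDR_SIZE	512	/* a gang header is one 512-byte block */
#define	ALLOC_FAIL	UINT64_MAX

/*
 * Toy stand-in for metaslab_alloc(): pretend anything larger than 64k
 * fails, mimicking the fragmented example above, and hand out offsets
 * from a simple bump allocator just so the sketch runs.
 */
static uint64_t next_off = 1024;

static uint64_t
alloc_block(uint64_t size)
{
	if (size > 64 * 1024)
		return (ALLOC_FAIL);
	uint64_t off = next_off;
	next_off += size;
	return (off);
}

/*
 * When a 'size'-byte allocation fails, allocate a gang header and then
 * cover the data with progressively smaller members (size/2, size/4, ...).
 * Offsets of everything allocated are appended to offs[] (assumed big
 * enough for the sketch).
 */
static int
alloc_or_gang(uint64_t size, uint64_t *offs, int *noffs)
{
	uint64_t off = alloc_block(size);

	if (off != ALLOC_FAIL) {
		offs[(*noffs)++] = off;
		return (0);
	}

	if ((off = alloc_block(GANG_HDR_SIZE)) == ALLOC_FAIL)
		return (-1);
	offs[(*noffs)++] = off;		/* the gang header itself */

	for (uint64_t left = size, try = size / 2; left != 0; ) {
		if (try > left)
			try = left;
		if ((off = alloc_block(try)) != ALLOC_FAIL) {
			offs[(*noffs)++] = off;
			left -= try;
		} else if ((try /= 2) < GANG_HDR_SIZE) {
			return (-1);	/* cannot gang any further */
		}
	}
	return (0);
}

int
main(void)
{
	uint64_t offs[16];
	int n = 0;

	if (alloc_or_gang(128 * 1024, offs, &n) == 0)
		printf("allocated %d blocks (1 header + %d members)\n",
		    n, n - 1);
	return (0);
}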
Another option would be to walk through the space map and take as much of
each free space segment as needed to satisfy an allocation (a sketch of this
follows the example below). With our example it would look like this:
         GH      GM1
A:[0-512)[512-1k)[1k-129k) [255.5k-256k)
F:                        [129k-255.5k)
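A minimal sketch of that walk (gather_extents() and extent_t are made-up
names, and for simplicity it ignores the gang header the example above still
has to allocate); with our example it produces a single extent:

#include <stdint.h>
#include <stdio.h>

typedef struct seg {
	uint64_t	s_start;	/* inclusive */
	uint64_t	s_end;		/* exclusive */
	struct seg	*s_next;	/* segments in offset order */
} seg_t;

typedef struct extent {
	uint64_t	e_off;
	uint64_t	e_len;
} extent_t;

/*
 * Walk the space map in offset order and take as much of each free
 * segment as needed; the resulting extents would become gang members
 * if there is more than one.  Returns the number of extents, or -1 if
 * the map does not hold enough free space (or ext[] is too small).
 */
static int
gather_extents(seg_t *head, uint64_t size, extent_t *ext, int max)
{
	int n = 0;

	for (seg_t *s = head; s != NULL && size != 0; s = s->s_next) {
		uint64_t len = s->s_end - s->s_start;
		if (len > size)
			len = size;
		if (n == max)
			return (-1);
		ext[n].e_off = s->s_start;
		ext[n].e_len = len;
		n++;
		size -= len;
	}
	return (size == 0 ? n : -1);
}

int
main(void)
{
	/* the example above: a single 255k free segment */
	seg_t seg = { 512, 261632, NULL };
	extent_t ext[4];
	int n = gather_extents(&seg, 128 * 1024, ext, 4);

	printf("%d extent(s), first at offset %llu, length %llu\n", n,
	    (unsigned long long)ext[0].e_off,
	    (unsigned long long)ext[0].e_len);
	return (0);
}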
But do we really need the overhead of a gang header in this case? It looks
like we can definitely do better, and it is a bit early to "stop looking
and start ganging" (see [12]). It may be better to look smarter instead.
The potential number of free space segments in a space map grows with the
size of the space map. Since the number of metaslabs per vdev is somewhat
fixed (see [14]), larger vdevs have larger space maps and a larger potential
number of free space segments in them, and thus higher chances of worst-case
behaviour. This may be mitigated by increasing the number of metaslabs per
vdev, but that brings other difficulties.
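A rough back-of-the-envelope illustration of that scaling - the per-vdev
metaslab count of 200 below is an assumption made for the sake of the
example; the real computation is the vdev_ms_shift calculation in
vdev_init() [14]:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* assumed target, for illustration only; see vdev_init() [14] */
	const uint64_t metaslabs_per_vdev = 200;
	const uint64_t min_block = 512;		/* smallest allocatable unit */
	uint64_t vdev_sizes[] = { 100ULL << 30, 1ULL << 40, 10ULL << 40 };

	for (int i = 0; i < 3; i++) {
		uint64_t ms_size = vdev_sizes[i] / metaslabs_per_vdev;
		/* worst case: free space alternating with 512-byte blocks */
		uint64_t max_segs = ms_size / (2 * min_block);

		printf("vdev %5llu GB -> metaslab %6llu MB, "
		    "up to %llu free segments\n",
		    (unsigned long long)(vdev_sizes[i] >> 30),
		    (unsigned long long)(ms_size >> 20),
		    (unsigned long long)max_segs);
	}
	return (0);
}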
Requirement:
===========
Alternative implementations of the block allocation functions and/or the
space map data structure could make the allocation procedure a better space
map citizen with an O(log N) bound, and could provide additional information
such as the size of the largest free space segment in the space map. This
may translate into more compact space maps, improvements in metaslab
selection, more timely and efficient defragmentation, and other benefits for
the performance and predictability of ZFS.
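As one illustration of what such an implementation could look like (a sketch
of one possible direction only; best_fit(), seg_size_compare() and seg_t are
made-up names, and a sorted array stands in for what would more naturally be
a second AVL tree keyed by (size, offset) alongside the existing
offset-ordered one):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct seg {
	uint64_t	s_start;	/* inclusive */
	uint64_t	s_end;		/* exclusive */
} seg_t;

/* order segments by length, then by offset, for the size index */
static int
seg_size_compare(const void *a, const void *b)
{
	const seg_t *sa = a, *sb = b;
	uint64_t la = sa->s_end - sa->s_start;
	uint64_t lb = sb->s_end - sb->s_start;

	if (la != lb)
		return (la < lb ? -1 : 1);
	return (sa->s_start < sb->s_start ? -1 :
	    sa->s_start > sb->s_start ? 1 : 0);
}

/*
 * Best fit over a size-ordered index: binary search for the smallest
 * segment that can hold 'size'.  With a balanced tree keyed the same
 * way, this query and "largest free segment" are both O(log N).
 */
static seg_t *
best_fit(seg_t *by_size, size_t n, uint64_t size)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = (lo + hi) / 2;
		if (by_size[mid].s_end - by_size[mid].s_start < size)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo < n ? &by_size[lo] : NULL);
}

int
main(void)
{
	/* the example above: one 255k hole in the 256k metaslab */
	seg_t segs[] = { { 512, 261632 } };
	size_t n = 1;

	qsort(segs, n, sizeof (seg_t), seg_size_compare);
	seg_t *s = best_fit(segs, n, 128 * 1024);

	/* with one segment it is trivially also the largest (segs[n - 1]) */
	printf("largest free segment: %llu bytes; 128k best fit: %s\n",
	    (unsigned long long)(segs[n - 1].s_end - segs[n - 1].s_start),
	    s != NULL ? "found" : "not found");
	return (0);
}

With such a size-ordered index, both "smallest segment that fits" and
"largest free segment" become cheap lookups, which is exactly the extra
information mentioned above.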
Motivation:
==========
Better block selection policies will provide better and/or more
predictable performance with empty, full, or fragmented storage pools, will
help current ZFS features such as snapshots to be used more effectively,
and will also make ZFS friendlier to different storage devices such as
flash cards, thin-provisioned storage, etc.
Links:
=====
[1] http://blogs.sun.com/bonwick/entry/space_maps
[2] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
[3] nfsd threads hang in ZFS
http://www.opensolaris.org/jive/thread.jspa?messageID=43565
[4] Snapshot impact on performance
http://www.opensolaris.org/jive/thread.jspa?messageID=64379
[5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
http://www.opensolaris.org/jive/thread.jspa?messageID=92885
[6] zpool export consumes whole CPU and takes more than 30 minutes to
complete
http://www.opensolaris.org/jive/thread.jspa?messageID=93159
[7] Best Practises => Keep Pool Below 80%?
http://www.opensolaris.org/jive/thread.jspa?messageID=93210
[8] ZFS performance problems - solved
http://www.opensolaris.org/jive/thread.jspa?messageID=102762
[9] 120473-05
http://www.opensolaris.org/jive/thread.jspa?messageID=109078
[10] ZFS performance and memory consumption
http://www.opensolaris.org/jive/thread.jspa?messageID=135184
[10a] ZFS pool fragmentation
http://www.opensolaris.org/jive/thread.jspa?messageID=136349
[11] 6495013 Loops and recursion in metaslab_ff_alloc can kill
performance, even on a pool with lots of free data
http://bugs.opensolaris.org/view_bug.do?bug_id=6495013
[12] 6596237 Stop looking and start ganging
http://bugs.opensolaris.org/view_bug.do?bug_id=6596237
[13] zio_write_allocate_gang_members()
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248
[14] vdev_ms_shift calculation in vdev_init()
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113
eric kustarz
2007-Sep-20 17:54 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
On Sep 15, 2007, at 12:55 PM, Victor Latushkin wrote:

> I'm proposing a new project for the ZFS community - Block Selection
> Policy and Space Map Enhancements.

+1. I wonder if some of this could look into a dynamic policy. For
example, a policy that switches when the pool becomes "too full".

eric