Victor Latushkin
2007-Sep-15 16:55 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
I'm proposing a new project for the ZFS community - Block Selection Policy and
Space Map Enhancements.

The space map [1] is a very efficient data structure for keeping track of free
space in the metaslabs, but there is at least one area of improvement - the
space map block selection algorithm, which could be better (see [2]).

I propose a community project to address the following:

* define requirements for a block selection policy for various workloads

* improve the current space map / metaslab implementation to allow for
  efficient implementation of multiple block selection policies

* develop a collection of block selection policies optimized for various
  workloads and requirements

You may find some background, motivation and requirements in the writeup below.

With the best regards,
Victor


Background:
===========
The current block selection algorithm, as implemented in metaslab_ff_alloc(),
has caused some pain recently (see [3-9,10] for examples), and as a result bug
6495013 "Loops and recursion in metaslab_ff_alloc can kill performance, even
on a pool with lots of free data" [11] was filed. Investigation of this bug
identified a race condition in the metaslab selection code (see
metaslab_group_alloc() for details), and fixing it indeed provided some relief
but did not solve the problem completely (see [4,10] for examples).


Current Limitations:
====================
The synopsis of bug 6495013 suggests that it is the loops and recursion in
metaslab_ff_alloc() which may kill performance. Indeed, the loops and
recursion in metaslab_ff_alloc() make it an O(N log N) algorithm in the worst
case, where N is the number of segments in the space map. Free space
fragmentation and the alignment requirements of metaslab_ff_alloc() may cause
this to surface even on a pool with lots of free space.

For example, let's imagine a pool consisting of only one 256k metaslab with
two allocated 512-byte blocks - one at the beginning and one at the end. The
space map for this metaslab will contain only one free segment [512, 261632).
An attempt to allocate a 128k block from this space map will fail, because
metaslab_ff_alloc() only tries offsets aligned to the requested block size,
even though we have 255k of contiguous free space (a small sketch below
illustrates this).

Gang blocks come to the rescue here (see [13]). We start trying to allocate
smaller block sizes - 64k, 32k and so on - until we have allocated enough
smaller blocks to satisfy the 128k allocation. In this case, depending on the
position of the 64k-aligned and 512b-aligned cursors (see the use of sm_ppd in
metaslab_ff_alloc()), we may get several variants of smaller block placement
(GH is the gang header, GM1 and GM2 are gang members, A lists the allocated
segments, and F the free segments in the space map):

  A: [0-512)  [512-1k) GH  [64k-128k) GM1  [128k-192k) GM2  [255.5k-256k)
  F: [1k-64k)  [192k-255.5k)

In this case we effectively allocate a 128k block that is not aligned on a
128k boundary, with the additional overhead of a gang header.

  A: [0-512)  [512-1k) GH  [64k-128k) GM2  [128k-192k) GM1  [255.5k-256k)
  F: [1k-64k)  [192k-255.5k)

In this case the halves of our 128k block are swapped.

In the case where the 512b-aligned cursor points to offset 64k and we allocate
the GH starting at that offset, we'll have to allocate 3 gang members - one
64k and two 32k - fragmenting free space further.

Another option would be to walk through the space map and take as much of each
free space segment as needed to satisfy the allocation. With our example it
would look like this:

  A: [0-512)  [512-1k) GH  [1k-129k) GM1  [255.5k-256k)
  F: [129k-255.5k)

But do we really need the overhead of a gang header in this case? It looks
like we can definitely do better, and it is a bit early to "stop looking and
start ganging" (see [12]).
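To make the alignment effect in the example above concrete, here is a small
self-contained sketch (a simplified model with invented names, not the actual
metaslab_ff_alloc() code) of a first-fit allocator that only tries offsets
aligned to the requested block size:

/*
 * Simplified model of size-aligned first-fit allocation.  It reproduces
 * the example above: a 256k metaslab with 512 bytes allocated at each end
 * leaves one free segment [512, 261632), yet a 128k request fails because
 * the only 128k-aligned candidate, offset 131072, would run past 261632.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct seg { uint64_t start, end; } seg_t;	/* [start, end) */

/* Round x up to the next multiple of align (align is a power of two). */
static uint64_t
roundup_pow2(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/*
 * First-fit over a list of free segments, allocating only at offsets
 * aligned to 'size'.  Returns the chosen offset or UINT64_MAX on failure.
 */
static uint64_t
ff_alloc(seg_t *segs, int nsegs, uint64_t size)
{
	for (int i = 0; i < nsegs; i++) {
		uint64_t off = roundup_pow2(segs[i].start, size);
		if (off + size <= segs[i].end)
			return (off);
	}
	return (UINT64_MAX);
}

int
main(void)
{
	/* One 256k metaslab, 512 bytes allocated at each end. */
	seg_t freemap[] = { { 512, 256 * 1024 - 512 } };

	uint64_t off128 = ff_alloc(freemap, 1, 128 * 1024);
	uint64_t off64  = ff_alloc(freemap, 1, 64 * 1024);

	printf("128k request: %s\n",
	    off128 == UINT64_MAX ? "fails despite ~255k contiguous free" : "ok");
	printf(" 64k request: succeeds at offset %llu\n",
	    (unsigned long long)off64);
	return (0);
}

Run as-is, it reports that the 128k request fails while a 64k request succeeds
at offset 64k - exactly the situation that forces us into ganging.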
It may be better to look smarter instead.

The potential number of free space segments in a space map grows with the size
of the space map. Since the number of metaslabs per vdev is more or less fixed
(see [14]), larger vdevs have larger space maps and a larger potential number
of free space segments in them, and thus a higher chance of worst-case
behaviour. This may be mitigated by increasing the number of metaslabs per
vdev, but that brings other difficulties.


Requirement:
============
Alternative implementations of the block allocation functions and/or the space
map data structure could make the allocation procedure a better space map
citizen with an O(log N) bound, and could provide additional information such
as the size of the largest free space segment in the space map (a small sketch
at the end of this message illustrates how that could help metaslab
selection). This may translate into more compact space maps, improvements in
metaslab selection, more timely and efficient defragmentation, and other
benefits for the performance and predictability of ZFS.


Motivation:
===========
Better block selection policies will provide better and/or more predictable
performance with empty/full/fragmented storage pools, will help to use current
ZFS features like snapshots more effectively, and will also make ZFS
friendlier to different storage devices such as flash cards, thin-provisioned
storage, etc.


Links:
======

[1] http://blogs.sun.com/bonwick/entry/space_maps

[2] http://blogs.sun.com/bonwick/entry/zfs_block_allocation

[3] nfsd threads hang in ZFS
    http://www.opensolaris.org/jive/thread.jspa?messageID=43565

[4] Snapshot impact on performance
    http://www.opensolaris.org/jive/thread.jspa?messageID=64379

[5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
    http://www.opensolaris.org/jive/thread.jspa?messageID=92885

[6] zpool export consumes whole CPU and takes more than 30 minutes to complete
    http://www.opensolaris.org/jive/thread.jspa?messageID=93159

[7] Best Practises => Keep Pool Below 80%?
    http://www.opensolaris.org/jive/thread.jspa?messageID=93210

[8] ZFS performance problems - solved
    http://www.opensolaris.org/jive/thread.jspa?messageID=102762

[9] 120473-05
    http://www.opensolaris.org/jive/thread.jspa?messageID=109078

[10] ZFS performance and memory consumption
     http://www.opensolaris.org/jive/thread.jspa?messageID=135184
     ZFS pool fragmentation
     http://www.opensolaris.org/jive/thread.jspa?messageID=136349

[11] 6495013 Loops and recursion in metaslab_ff_alloc can kill performance,
     even on a pool with lots of free data
     http://bugs.opensolaris.org/view_bug.do?bug_id=6495013

[12] 6596237 Stop looking and start ganging
     http://bugs.opensolaris.org/view_bug.do?bug_id=6596237

[13] zio_write_allocate_gang_members()
     http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248

[14] vdev_ms_shift calculation in vdev_init()
     http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113
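To illustrate the metaslab-selection benefit mentioned in the Requirement
section, here is a rough sketch (the structures and names are hypothetical,
not the existing ZFS interfaces): if each space map cached the size of its
largest free segment, metaslabs that cannot possibly satisfy a request could
be skipped without walking their segment lists and without falling back to
gang blocks.

/*
 * Hypothetical sketch, not the existing ZFS code: each space map caches
 * the size of its largest free segment so metaslab selection can rule out
 * metaslabs that cannot satisfy a request of a given size.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct space_map {
	uint64_t	sm_free;	/* total free space */
	uint64_t	sm_maxfree;	/* largest single free segment */
} space_map_t;

/*
 * Pick a metaslab for a 'size'-byte allocation.  Total free space alone is
 * not enough: a fragmented metaslab may have plenty of free space but no
 * single segment large enough for the request.
 */
static int
metaslab_pick(space_map_t *ms, int nms, uint64_t size)
{
	for (int i = 0; i < nms; i++) {
		if (ms[i].sm_maxfree >= size)
			return (i);	/* fits without ganging */
	}
	return (-1);			/* only now consider gang blocks */
}

int
main(void)
{
	space_map_t ms[2] = {
		{ 200 * 1024, 8 * 1024 },	/* lots free, badly fragmented */
		{ 140 * 1024, 130 * 1024 },	/* less free, one big segment */
	};

	printf("128k allocation goes to metaslab %d\n",
	    metaslab_pick(ms, 2, 128 * 1024));
	return (0);
}

The same cached value could also feed decisions about when a metaslab is
fragmented enough to be worth defragmenting, as mentioned above.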
eric kustarz
2007-Sep-20 17:54 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
On Sep 15, 2007, at 12:55 PM, Victor Latushkin wrote:

> I'm proposing a new project for the ZFS community - Block Selection
> Policy and Space Map Enhancements.

+1.

I wonder if some of this could look into a dynamic policy. For example, a
policy that switches when the pool becomes "too full".

eric