Victor Latushkin
2007-Sep-15 16:55 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
I'm proposing a new project for the ZFS community - Block Selection Policy and
Space Map Enhancements.

The space map [1] is a very efficient data structure for keeping track of free
space in the metaslabs, but there is at least one area of improvement - the
space map block selection algorithm, which could be better (see [2]).

I propose a community project to address the following:

* define requirements for a block selection policy for various workloads

* improve the current space map / metaslab implementation to allow for
  efficient implementation of multiple block selection policies

* develop a collection of block selection policies optimized for various
  workloads and requirements

You may find some background, motivation and requirements in the writeup below.

With the best regards,
Victor


Background:
===========
The current block selection algorithm, as implemented in metaslab_ff_alloc(),
has caused some pain recently (see [3-9,10] for examples), and as a result bug
6495013 "Loops and recursion in metaslab_ff_alloc can kill performance, even
on a pool with lots of free data" [11] was filed. Investigation of this bug
identified a race condition in the metaslab selection code (see
metaslab_group_alloc() for details), and fixing it indeed provided some relief
but did not solve the problem completely (see [4,10] for examples).


Current Limitations:
====================
The synopsis of bug 6495013 suggests that it is the loops and recursion in
metaslab_ff_alloc() which may kill performance. Indeed, the loops and
recursion in metaslab_ff_alloc() make it an O(N log N) algorithm in the worst
case, where N is the number of segments in the space map. Free space
fragmentation and the alignment requirements of metaslab_ff_alloc() may cause
this to surface even on a pool with lots of free space.

For example, let's imagine a pool consisting of only one 256k metaslab with
two allocated 512-byte blocks - one at the beginning and one at the end. The
space map for this metaslab will contain only one free segment [512, 261632).
An attempt to allocate a 128k block from this space map will fail, because
metaslab_ff_alloc() only tries offsets aligned to the requested block size,
even though we have 255k of contiguous free space (a small sketch below
illustrates this).

Gang blocks come to the rescue here (see [13]). We start trying to allocate
smaller block sizes - 64k, 32k and so on - until we have allocated enough
smaller blocks to satisfy the 128k allocation. In this case, depending on the
position of the 64k-aligned and 512b-aligned cursors (see the use of sm_ppd in
metaslab_ff_alloc()), we may get several variants of smaller block placement
(GH is the gang header, GM1 and GM2 are gang members, A lists the allocated
segments, and F the free segments in the space map):

  A: [0-512)  [512-1k) GH  [64k-128k) GM1  [128k-192k) GM2  [255.5k-256k)
  F: [1k-64k)  [192k-255.5k)

In this case we effectively allocate a 128k block that is not aligned on a
128k boundary, with the additional overhead of a gang header.

  A: [0-512)  [512-1k) GH  [64k-128k) GM2  [128k-192k) GM1  [255.5k-256k)
  F: [1k-64k)  [192k-255.5k)

In this case the halves of our 128k block are swapped.

In the case where the 512b-aligned cursor points to offset 64k and we allocate
the GH starting at that offset, we'll have to allocate 3 gang members - one
64k and two 32k - fragmenting free space further.

Another option would be to walk through the space map and take as much of each
free space segment as needed to satisfy the allocation. With our example it
would look like this:

  A: [0-512)  [512-1k) GH  [1k-129k) GM1  [255.5k-256k)
  F: [129k-255.5k)

But do we really need the overhead of a gang header in this case? It looks
like we can definitely do better, and it is a bit early to "stop looking and
start ganging" (see [12]).
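To make the alignment effect in the example above concrete, here is a small
self-contained sketch (a simplified model with invented names, not the actual
metaslab_ff_alloc() code) of a first-fit allocator that only tries offsets
aligned to the requested block size:

/*
 * Simplified model of size-aligned first-fit allocation.  It reproduces
 * the example above: a 256k metaslab with 512 bytes allocated at each end
 * leaves one free segment [512, 261632), yet a 128k request fails because
 * the only 128k-aligned candidate, offset 131072, would run past 261632.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct seg { uint64_t start, end; } seg_t;	/* [start, end) */

/* Round x up to the next multiple of align (align is a power of two). */
static uint64_t
roundup_pow2(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/*
 * First-fit over a list of free segments, allocating only at offsets
 * aligned to 'size'.  Returns the chosen offset or UINT64_MAX on failure.
 */
static uint64_t
ff_alloc(seg_t *segs, int nsegs, uint64_t size)
{
	for (int i = 0; i < nsegs; i++) {
		uint64_t off = roundup_pow2(segs[i].start, size);
		if (off + size <= segs[i].end)
			return (off);
	}
	return (UINT64_MAX);
}

int
main(void)
{
	/* One 256k metaslab, 512 bytes allocated at each end. */
	seg_t freemap[] = { { 512, 256 * 1024 - 512 } };

	uint64_t off128 = ff_alloc(freemap, 1, 128 * 1024);
	uint64_t off64  = ff_alloc(freemap, 1, 64 * 1024);

	printf("128k request: %s\n",
	    off128 == UINT64_MAX ? "fails despite ~255k contiguous free" : "ok");
	printf(" 64k request: succeeds at offset %llu\n",
	    (unsigned long long)off64);
	return (0);
}

Run as-is, it reports that the 128k request fails while a 64k request succeeds
at offset 64k - exactly the situation that forces us into ganging.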
It may be better to look smarter instead.

The potential number of free space segments in a space map grows with the size
of the space map. Since the number of metaslabs per vdev is more or less fixed
(see [14]), larger vdevs have larger space maps and a larger potential number
of free space segments in them, and thus a higher chance of worst-case
behaviour. This may be mitigated by increasing the number of metaslabs per
vdev, but that brings other difficulties.


Requirement:
============
Alternative implementations of the block allocation functions and/or the space
map data structure could make the allocation procedure a better space map
citizen with an O(log N) bound, and could provide additional information such
as the size of the largest free space segment in the space map (a small sketch
at the end of this message illustrates how that could help metaslab
selection). This may translate into more compact space maps, improvements in
metaslab selection, more timely and efficient defragmentation, and other
benefits for the performance and predictability of ZFS.


Motivation:
===========
Better block selection policies will provide better and/or more predictable
performance with empty/full/fragmented storage pools, will help to use current
ZFS features like snapshots more effectively, and will also make ZFS
friendlier to different storage devices such as flash cards, thin-provisioned
storage, etc.


Links:
======

[1] http://blogs.sun.com/bonwick/entry/space_maps

[2] http://blogs.sun.com/bonwick/entry/zfs_block_allocation

[3] nfsd threads hang in ZFS
    http://www.opensolaris.org/jive/thread.jspa?messageID=43565

[4] Snapshot impact on performance
    http://www.opensolaris.org/jive/thread.jspa?messageID=64379

[5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
    http://www.opensolaris.org/jive/thread.jspa?messageID=92885

[6] zpool export consumes whole CPU and takes more than 30 minutes to complete
    http://www.opensolaris.org/jive/thread.jspa?messageID=93159

[7] Best Practises => Keep Pool Below 80%?
    http://www.opensolaris.org/jive/thread.jspa?messageID=93210

[8] ZFS performance problems - solved
    http://www.opensolaris.org/jive/thread.jspa?messageID=102762

[9] 120473-05
    http://www.opensolaris.org/jive/thread.jspa?messageID=109078

[10] ZFS performance and memory consumption
     http://www.opensolaris.org/jive/thread.jspa?messageID=135184
     ZFS pool fragmentation
     http://www.opensolaris.org/jive/thread.jspa?messageID=136349

[11] 6495013 Loops and recursion in metaslab_ff_alloc can kill performance,
     even on a pool with lots of free data
     http://bugs.opensolaris.org/view_bug.do?bug_id=6495013

[12] 6596237 Stop looking and start ganging
     http://bugs.opensolaris.org/view_bug.do?bug_id=6596237

[13] zio_write_allocate_gang_members()
     http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248

[14] vdev_ms_shift calculation in vdev_init()
     http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113
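To illustrate the metaslab-selection benefit mentioned in the Requirement
section, here is a rough sketch (the structures and names are hypothetical,
not the existing ZFS interfaces): if each space map cached the size of its
largest free segment, metaslabs that cannot possibly satisfy a request could
be skipped without walking their segment lists and without falling back to
gang blocks.

/*
 * Hypothetical sketch, not the existing ZFS code: each space map caches
 * the size of its largest free segment so metaslab selection can rule out
 * metaslabs that cannot satisfy a request of a given size.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct space_map {
	uint64_t	sm_free;	/* total free space */
	uint64_t	sm_maxfree;	/* largest single free segment */
} space_map_t;

/*
 * Pick a metaslab for a 'size'-byte allocation.  Total free space alone is
 * not enough: a fragmented metaslab may have plenty of free space but no
 * single segment large enough for the request.
 */
static int
metaslab_pick(space_map_t *ms, int nms, uint64_t size)
{
	for (int i = 0; i < nms; i++) {
		if (ms[i].sm_maxfree >= size)
			return (i);	/* fits without ganging */
	}
	return (-1);			/* only now consider gang blocks */
}

int
main(void)
{
	space_map_t ms[2] = {
		{ 200 * 1024, 8 * 1024 },	/* lots free, badly fragmented */
		{ 140 * 1024, 130 * 1024 },	/* less free, one big segment */
	};

	printf("128k allocation goes to metaslab %d\n",
	    metaslab_pick(ms, 2, 128 * 1024));
	return (0);
}

The same cached value could also feed decisions about when a metaslab is
fragmented enough to be worth defragmenting, as mentioned above.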
eric kustarz
2007-Sep-20 17:54 UTC
[zfs-discuss] Project proposal: Block selection policy and space map enhancements
On Sep 15, 2007, at 12:55 PM, Victor Latushkin wrote:

> I'm proposing a new project for the ZFS community - Block Selection
> Policy and Space Map Enhancements.

+1.

I wonder if some of this could look into a dynamic policy. For example, a
policy that switches when the pool becomes "too full".

eric