Tony Galway
2007-May-07 16:49 UTC
[zfs-discuss] Zpool, RaidZ & how it spreads its disk load?
Greetings learned ZFS geeks & gurus,

Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I see. I've created an eight (8) disk raidz pool from a Sun 3510 fibre array, giving me a 465G volume.

# zpool create tp raidz c4t600 ... 8 disks worth of zpool
# zfs create tp/pool
# zfs set recordsize=8k tp/pool
# zfs set mountpoint=/pool tp/pool

I then create a 100G data file by sequentially writing 64k blocks to the test data file. When I then issue

# zpool iostat -v tp 10

I see the following strange behaviour: up to 16 iterations (i.e. 160 seconds) of the following, where there are writes to only 2 of the 8 disks:

                                              capacity     operations    bandwidth
pool                                        used  avail   read  write   read  write
--------------------------------------     -----  -----  -----  -----  -----  -----
testpool                                    29.7G   514G      0  2.76K      0  22.1M
  raidz1                                    29.7G   514G      0  2.76K      0  22.1M
    c4t600C0FF0000000000A74531B659C5C00d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A74533F3CF1AD00d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A74534C5560FB00d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A74535E50E5A400d0s6    -      -      0  1.38K      0  2.76M
    c4t600C0FF0000000000A74537C1C061500d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A745343B08C4B00d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A745379CB90B600d0s6    -      -      0      0      0      0
    c4t600C0FF0000000000A74530237AA9300d0s6    -      -      0  1.38K      0  2.76M
--------------------------------------     -----  -----  -----  -----  -----  -----

During these periods my data file does not grow in size, but then I see writes to all of the disks, like the following:

                                              capacity     operations    bandwidth
pool                                        used  avail   read  write   read  write
--------------------------------------     -----  -----  -----  -----  -----  -----
testpool                                    64.0G   480G      0  1.45K      0  11.6M
  raidz1                                    64.0G   480G      0  1.45K      0  11.6M
    c4t600C0FF0000000000A74531B659C5C00d0s6    -      -      0    246      0  8.22M
    c4t600C0FF0000000000A74533F3CF1AD00d0s6    -      -      0    220      0  8.23M
    c4t600C0FF0000000000A74534C5560FB00d0s6    -      -      0    254      0  8.20M
    c4t600C0FF0000000000A74535E50E5A400d0s6    -      -      0    740      0  1.45M
    c4t600C0FF0000000000A74537C1C061500d0s6    -      -      0    299      0  8.21M
    c4t600C0FF0000000000A745343B08C4B00d0s6    -      -      0    284      0  8.21M
    c4t600C0FF0000000000A745379CB90B600d0s6    -      -      0    266      0  8.22M
    c4t600C0FF0000000000A74530237AA9300d0s6    -      -      0    740      0  1.45M
--------------------------------------     -----  -----  -----  -----  -----  -----

and my data file increases in size. Notice also that the two disks that were being written to before carry a load consistent with the previous example.

For background, the server and the storage are dedicated solely to this testing, and no other applications are being run at this time. I thought that RaidZ would spread its load across all disks somewhat evenly. Can someone explain this result? I can reproduce it consistently.

Thanks
-Tony
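Tony doesn't say which tool writes the 100G file; a minimal C sketch of an equivalent sequential 64k-block writer might look like the following. The file path, total size constant, and buffer contents are assumptions; only the 64k write size comes from the post.

/*
 * Hypothetical stand-in for the unnamed load generator: appends 64 KB
 * blocks sequentially to a file on the pool.  Path and total size are
 * illustrative, not taken from the post.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t blksz = 64 * 1024;                      /* 64k application writes */
	const long long total = 100LL * 1024 * 1024 * 1024;  /* ~100G test file */
	char *buf = malloc(blksz);
	int fd = open("/pool/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (buf == NULL || fd < 0) {
		perror("setup");
		return (1);
	}
	memset(buf, 'x', blksz);

	for (long long written = 0; written < total; written += blksz) {
		if (write(fd, buf, blksz) != (ssize_t)blksz) {
			perror("write");
			return (1);
		}
	}
	close(fd);
	free(buf);
	return (0);
}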
Mario Goebbels
2007-May-07 17:08 UTC
[zfs-discuss] Re: Zpool, RaidZ & how it spreads its disk load?
Something I was wondering about myself. What does the raidz toplevel (pseudo?) device do? Does it just indicate to the SPA, or whatever module is responsible, that it should additionally generate parity? The thing I'd like to know is whether variable block sizes, dynamic striping et al. still apply to a single RAID-Z device, too.

Thanks!
-mg
Chris Csanady
2007-May-07 18:26 UTC
[zfs-discuss] Zpool, RaidZ & how it spreads its disk load?
On 5/7/07, Tony Galway <tony.galway at sun.com> wrote:
> Greetings learned ZFS geeks & gurus,
>
> Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see.
> I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.
> # zpool create tp raidz c4t600 ... 8 disks worth of zpool
> # zfs create tp/pool
> # zfs set recordsize=8k tp/pool
> # zfs set mountpoint=/pool tp/pool

This is a known problem, and is an interaction between the alignment requirements imposed by RAID-Z and the small recordsize you have chosen. You may effectively avoid it in most situations by choosing a RAID-Z stripe width of 2^n+1. For a fixed record size, this will work perfectly well.

Even so, there will still be cases where small files will cause problems for RAID-Z. While it does not affect many people right now, I think it will become a more serious issue when disks move to 4k sectors.

I think the reason for the alignment constraint was to ensure that the stranded space was accounted for; otherwise it would cause problems as the pool fills up. (Consider a 3 device RAID-Z, where only one data sector and one parity sector are written; the third sector in that stripe is essentially dead space.)

Would it be possible (or worthwhile) to make the allocator aware of this dead space, rather than imposing the alignment requirements? Something like a concept of tentatively allocated space in the allocator, which would be managed based on the requirements of the vdev. Using such a mechanism, it could coalesce the space if possible for allocations. Of course, it would also have to convert the misaligned bits back into tentatively allocated space when blocks are freed.

While I expect this may require changes which would not easily be backward compatible, the alignment on RAID-Z has always felt a bit wrong. While the more severe effects can be addressed by also writing out the dead space, that will not address the uneven placement of data and parity across the stripes.

Any thoughts?

Chris
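The roundup Chris describes can be sketched in a few lines of C. This is modeled on the allocated-size logic in the RAID-Z vdev (vdev_raidz_asize() in vdev_raidz.c), but it is not the actual ZFS code; the 8k-record figures it prints are only meant to illustrate why a 2^n+1 width avoids the padding for power-of-two record sizes.

/*
 * Sketch of the RAID-Z allocated-size arithmetic (modeled on the logic
 * of vdev_raidz_asize(); illustration only, not the ZFS source).
 */
#include <stdio.h>

static void
raidz_breakdown(unsigned long long psize, unsigned cols, unsigned nparity,
    unsigned ashift)
{
	unsigned long long data = ((psize - 1) >> ashift) + 1;   /* data sectors */
	/* one parity sector per (cols - nparity)-wide data row, rounded up */
	unsigned long long parity =
	    nparity * ((data + cols - nparity - 1) / (cols - nparity));
	unsigned long long total = data + parity;
	/* the alignment rule: the total must be a multiple of (nparity + 1) */
	unsigned long long rounded =
	    ((total + nparity) / (nparity + 1)) * (nparity + 1);

	printf("%u-wide raidz%u, %llu-byte block: %llu data + %llu parity "
	    "+ %llu padding sectors\n",
	    cols, nparity, psize, data, parity, rounded - total);
}

int
main(void)
{
	/* 8k records on 512-byte sectors (ashift = 9), single parity */
	raidz_breakdown(8192, 8, 1, 9);   /* 16 + 3 + 1 padding sector */
	raidz_breakdown(8192, 5, 1, 9);   /* 16 + 4 + 0 (5 = 2^2 + 1)  */
	return (0);
}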
Mario Goebbels
2007-May-07 18:43 UTC
[zfs-discuss] Re: Zpool, RaidZ & how it spreads its disk load?
What are these alignment requirements? I would have thought that at the lowest level parity stripes would be allocated traditionally, while the remaining usable space is treated like a JBOD at the level above, and is thus not subject to any constraints (apart from when getting close to the parity stripe boundaries).

-mg
James Blackburn
2007-May-07 19:45 UTC
[zfs-discuss] Zpool, RaidZ & how it spreads its disk load?
On 5/7/07, Chris Csanady <csanady at gmail.com> wrote:
> On 5/7/07, Tony Galway <tony.galway at sun.com> wrote:
> > Greetings learned ZFS geeks & gurus,
> >
> > Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see.
> > I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.
> > # zpool create tp raidz c4t600 ... 8 disks worth of zpool
> > # zfs create tp/pool
> > # zfs set recordsize=8k tp/pool
> > # zfs set mountpoint=/pool tp/pool
>
> This is a known problem, and is an interaction between the alignment
> requirements imposed by RAID-Z and the small recordsize you have
> chosen. You may effectively avoid it in most situations by choosing a
> RAID-Z stripe width of 2^n+1. For a fixed record size, this will work
> perfectly well.

Well, an alignment issue may account for the second iostat output, but not the first. I'd suspect that in the first case the I/O being seen is the syncing of the transaction group and associated block pointers to the RAID (though I could be very wrong on this).

Also, I'm not entirely sure about your formula (how can you choose a stripe width that's not a power of 2?). For an 8 disk single-parity RAID, data is going to be written to 7 disks and parity to 1. If each disk block is 512 bytes, then 128 disk blocks will be written for each 64k filesystem block. This will require 18 rows (and a bit of the 19th) on the 7 data disks. Therefore we have a requirement for 128 blocks of data + 19 blocks of parity = 147 blocks. Now if we take the alignment requirement into account, it says that the number of blocks written must be a multiple of (nparity + 1). So 148 blocks will be written.

148 % 8 = 4

This means that on each successive 64k write the 'extra' roundup block will alternate between one disk and another 4 disks apart (which happens to be just what we see). (A small worked check of this arithmetic appears at the end of this message.)

> Even so, there will still be cases where small files will cause
> problems for RAID-Z. While it does not affect many people right now,
> I think it will become a more serious issue when disks move to 4k
> sectors.

True. But when disks move to 4k sectors they will be on the order of terabytes in size. It would probably be more pain than it's worth to try to pack these efficiently. (And it's very likely that your filesystem and per-file block size will be at least 4k.)

> I think the reason for the alignment constraint was to ensure that the
> stranded space was accounted for, otherwise it would cause problems as
> the pool fills up. (Consider a 3 device RAID-Z, where only one data
> sector and one parity sector are written; the third sector in that
> stripe is essentially dead space.)

Indeed. As Adam explained here:

http://www.opensolaris.org/jive/thread.jspa?threadID=26115&tstart=0

it specifically pertains to what happens if you allow an odd number of disk blocks to be written: you then free that block and try to fill the space with 512-byte fs blocks -- and you get a single 512-byte hole that you can't fill.

> Would it be possible (or worthwhile) to make the allocator aware of
> this dead space, rather than imposing the alignment requirements?
> Something like a concept of tentatively allocated space in the
> allocator, which would be managed based on the requirements of the
> vdev. Using such a mechanism, it could coalesce the space if possible
> for allocations. Of course, it would also have to convert the
> misaligned bits back into tentatively allocated space when blocks are
> freed.

It would add complexity, and this roundup only occurs in the RAID-Z vdev. As the metaslab/space allocator doesn't have any idea about the on-disk layout, it wouldn't be able to say whether successive single free blocks in the space map are on the same or different disks -- and this would further add to the complexity of data/parity allocation within the RAID-Z vdev itself.

> While I expect this may require changes which would not easily be
> backward compatible, the alignment on RAID-Z has always felt a bit
> wrong. While the more severe effects can be addressed by also writing
> out the dead space, that will not address uneven placement of data and
> parity across the stripes.

I've also had issues with this (under a slightly different guise). I've implemented a rather naive raidz variant, based on the current implementation, which allows you to use all the disk space on an array of mismatched disks. What I've done is use the grid portion of the block pointer to specify a RAID 'version' number (of which you are currently allowed 255, 0 being reserved for the current layout). I've then organized it such that metaslab_init is specialised in the raidz vdev (a la vdev_raidz_asize()) and allocates the metaslab as before, but forces a new metaslab when a boundary is reached that would alter the number of disks in a stripe. This increases the number of metaslabs by O(number of disks). It also means that you need to do psize_to_asize slightly later in the metaslab allocation section (rather than once per vdev), and that things like raidz_asize() and map_alloc() have an additional lg(number_disks) overhead in computation.

The result is allocation that is computationally marginally more complex though, from benchmarking with dtrace, it's hardly noticeable compared to the overhead of malloc(); and you get a lot of disk space back (if you're a disk collector (read: poor student) like me :) ).

While this is by no means complete, it does give you some things for 'free' (as it were). For example, changing the number of parity disks from 1->2->1 should simply be a matter of creating a new grid version; and it has backwards compatibility with the original RAID-Z. Unfortunately it isn't all good news. Adding/replacing disks doesn't appear to be as easy, as this requires munging the space map. And then you have the problem (now we get to it...) of single-block unallocated space when the stripe width changes. This could probably be dealt with by passivating a metaslab as full if all that's still available is single blocks of contiguous free space... But yes, something to work on :).

Anyway, that turned out to be rather longer than I expected. If anyone has any wise words of advice, I'm all ears!

James
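As noted above, here is a small worked check of the 64k arithmetic. The roundup rule (total sectors must be a multiple of nparity + 1) and the disk counts come from the thread; the program itself is only an illustration.

/*
 * Worked check of the arithmetic above: one 64k write on an 8-disk
 * raidz1 with 512-byte sectors.
 */
#include <stdio.h>

int
main(void)
{
	const unsigned cols = 8;                  /* disks in the raidz1 vdev */
	const unsigned nparity = 1;
	const unsigned sector = 512;              /* bytes per disk block */
	const unsigned write_size = 64 * 1024;    /* 64k filesystem block */

	unsigned data = write_size / sector;                                    /* 128 */
	unsigned rows = (data + (cols - nparity) - 1) / (cols - nparity);       /* 19  */
	unsigned total = data + rows * nparity;                                 /* 147 */
	unsigned rounded = ((total + nparity) / (nparity + 1)) * (nparity + 1); /* 148 */

	printf("data=%u parity=%u total=%u rounded=%u leftover=%u\n",
	    data, rows * nparity, total, rounded, rounded % cols);
	/* prints: data=128 parity=19 total=147 rounded=148 leftover=4 */
	return (0);
}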