Edward Ned Harvey
2010-Oct-17  13:38 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
The default blocksize is 128K. If you are using mirrors, then each block on disk will be 128K whenever possible. But if you're using raidzN with a capacity of M disks (M disks useful capacity + N disks redundancy), then the block size on each individual disk will be 128K / M. Right?

This is one of the reasons the raidzN resilver code is inefficient. You end up waiting for the slowest seek time of any one disk in the vdev, and when that's done, the amount of data you were able to process was at most 128K. Rinse and repeat.

Would it not be wise, when creating raidzN vdevs, to increase the blocksize to 128K * M? Then the on-disk blocksize for each disk could be the same as the mirror on-disk blocksize of 128K. It still won't resilver as fast as a mirror, but the raidzN resilver would be accelerated by as much as M times. Right?

The only disadvantage that I know of would be wasted space. Every 4K file in a mirror can waste up to 124K of disk space, right? And in the above described scenario, every 4K file in the raidzN can waste up to 128K * M of disk space, right? Also, if you have a lot of these sparse 4K blocks, then the resilver time doesn't actually improve either, because you perform one seek, and regardless of whether you fetch 128K or 128K * M, you still paid one maximum seek time to fetch 4K of useful data.

Point is: If the goal is to reduce the number of on-disk slabs, and therefore reduce the number of seeks necessary to resilver, one thing you could do is increase the pool blocksize, right?

YMMV, and YM will depend on how you use your pool. Hopefully you're able to bias your usage in favor of large block writes.
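The arithmetic in the question above can be sketched as follows. This is only an illustration of the 128K / M claim as stated in the post: it ignores parity, padding, and sector rounding, and the function name `per_disk_chunk` is made up for this example.

```python
# Sketch of the "128K / M" arithmetic from the post above.
# Assumption: a block is split evenly over the M data disks; parity,
# padding, and sector rounding are deliberately ignored.

DEFAULT_BLOCKSIZE = 128 * 1024  # bytes, the default mentioned in the post


def per_disk_chunk(blocksize: int, data_disks: int) -> float:
    """Approximate number of bytes of one block that land on each data disk."""
    return blocksize / data_disks


for m in (1, 2, 4, 5, 8):
    default_chunk = per_disk_chunk(DEFAULT_BLOCKSIZE, m) / 1024
    scaled_chunk = per_disk_chunk(DEFAULT_BLOCKSIZE * m, m) / 1024
    print(f"M={m}: {default_chunk:.1f}K per disk at 128K, "
          f"{scaled_chunk:.1f}K per disk at 128K * M")
```

With the blocksize scaled to 128K * M, the per-disk share comes back to 128K for every M, which is the effect the proposal is after.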
Bob Friesenhahn
2010-Oct-17  16:04 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Sun, 17 Oct 2010, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then each
> block on disk will be 128K whenever possible. But if you're using
> raidzN with a capacity of M disks (M disks useful capacity + N disks
> redundancy) then the block size on each individual disk will be 128K
> / M. Right? This is one of the reasons the raidzN resilver code is
> inefficient. Since you end up waiting for the slowest seek time of
> any one disk in the vdev, and when that's done, the amount of data
> you were able to process was at most 128K. Rinse and repeat.

Your idea about what it means for "code" to be inefficient is clearly vastly different than my own. Regardless, the physical layout issues (impacting IOPS requirements) are a reality.

> Would it not be wise, when creating raidzN vdev's, to increase the
> blocksize to 128K * M? Then, the on-disk blocksize for each disk
> could be the same as the mirror on-disk blocksize of 128K. It still
> won't resilver as fast as a mirror, but the raidzN resilver would be
> accelerated by as much as M times. Right?

This might work for HPC applications with huge files and huge sequential streaming data rate requirements. It would be detrimental for the case of small files, or applications which issue many small writes, and particularly bad for many random synchronous writes.

> The only disadvantage that I know of would be wasted space. Every
> 4K file in a mirror can waste up to 124K of disk space, right? And
> in the above described scenario, every 4K file in the raidzN can
> waste up to 128K * M of disk space, right? Also, if you have a lot
> of these sparse 4K blocks, then the resilver time doesn't actually
> improve either. Because you perform one seek, and regardless if you
> fetch 128K or 128K*M, you still paid one maximum seek time to fetch
> 4K of useful data.

The tally of disadvantages is quite large. Note that zfs needs to write each zfs "block", and you are dramatically increasing the level of write amplification. Also, zfs needs to checksum each whole block, and the checksum adds to the latency. The risk of block corruption is increased. 128K is already quite large for a block.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Kyle McDonald
2010-Oct-17  17:26 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On 10/17/2010 9:38 AM, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then
> each block on disk will be 128K whenever possible. But if you're
> using raidzN with a capacity of M disks (M disks useful capacity +
> N disks redundancy) then the block size on each individual disk
> will be 128K / M. Right?

If I understand things correctly, I think this is why it is recommended that you pick an M that divides into 128K evenly. I believe powers of 2 are recommended.

I think increasing the block size to 128K * M would be overkill, but that idea does make me wonder: in cases where M can't be a power of 2, would it make sense to adjust the block size so that M still divides evenly?

If M were 4, then the data written to each drive would be 32K. So if you really wanted M to be 5 drives, is there an advantage to making the block size 160K, or if that's too big, how about 80K?

Likewise, if you really wanted M to be 3 drives, would adjusting the block size to 96K make sense?

-Kyle
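The divisibility question in this message comes down to simple arithmetic, sketched below. The 512-byte sector size and the helper name `splits_evenly` are assumptions made for the illustration. The sketch only checks whether a block divides into whole sectors per data disk; it says nothing about whether ZFS would accept a given block size as a setting.

```python
# Sketch: does a block of blocksize_k kilobytes split into a whole number
# of sectors on each of M data disks?
# Assumptions: 512-byte sectors, and parity/padding are ignored.

SECTOR = 512  # bytes, assumed


def splits_evenly(blocksize_k: int, data_disks: int, sector: int = SECTOR) -> bool:
    """True if every data disk receives the same whole number of sectors."""
    sectors = blocksize_k * 1024 // sector
    return sectors % data_disks == 0


# The combinations raised in the message above.
for blocksize_k, m in [(128, 4), (128, 5), (160, 5), (80, 5), (128, 3), (96, 3)]:
    verdict = "even" if splits_evenly(blocksize_k, m) else "uneven"
    print(f"{blocksize_k}K over M={m}: {verdict}")
```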
Richard Elling
2010-Oct-18  03:00 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Oct 17, 2010, at 6:38 AM, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then each block on disk will be 128K whenever possible. But if you're using raidzN with a capacity of M disks (M disks useful capacity + N disks redundancy) then the block size on each individual disk will be 128K / M. Right?

Yes, but it is worse for RAID-5, where you will likely have to do a read-modify-write (RMW) if your stripe size is not perfectly matched to the blocksize. This is the case where raidz shines over the alternatives.

> This is one of the reasons the raidzN resilver code is inefficient. Since you end up waiting for the slowest seek time of any one disk in the vdev, and when that's done, the amount of data you were able to process was at most 128K. Rinse and repeat.

How is this different than all other RAID implementations?

> Would it not be wise, when creating raidzN vdev's, to increase the blocksize to 128K * M? Then, the on-disk blocksize for each disk could be the same as the mirror on-disk blocksize of 128K. It still won't resilver as fast as a mirror, but the raidzN resilver would be accelerated by as much as M times. Right?

We had this discussion in 2007, IIRC. The bottom line was that if you have a fixed record size workload, then set the appropriate recordsize and it will make sense to adjust your raidz1 configuration to avoid gaps. For raidz2/3 or mixed record length workloads, it is not clear that matching the number of data/parity disks offers any advantage.

> The only disadvantage that I know of would be wasted space. Every 4K file in a mirror can waste up to 124K of disk space, right?

No. 4K files have a recordsize of 4K. This is why we refer to this case as a mixed record size workload. Remember, the recordsize parameter is a maximum limit, not a minimum limit.

> And in the above described scenario, every 4K file in the raidzN can waste up to 128K * M of disk space, right?

No.

> Also, if you have a lot of these sparse 4K blocks, then the resilver time doesn't actually improve either. Because you perform one seek, and regardless if you fetch 128K or 128K*M, you still paid one maximum seek time to fetch 4K of useful data.

Seek penalties are hard to predict or model. Modern drives have efficient algorithms and large buffer caches. It cannot be predicted whether the next read will be in the buffer cache already. Indeed, it is not even possible to predict the read order. The only sure-fire way to prevent seeks is to use SSDs.

> Point is: If the goal is to reduce the number of on-disk slabs, and therefore reduce the number of seeks necessary to resilver, one thing you could do is increase the pool blocksize, right?

Not the pool block size, the application's block size. Applications which make lots of itty bitty I/Os will tend to take more time to resilver. Applications that make lots of large I/Os will resilver faster.

> YMMV, and YM will depend on how you use your pool. Hopefully you're able to bias your usage in favor of large block writes.

Yep, it depends entirely on how you use the pool. As soon as you come up with a credible model to predict that, then we can optimize accordingly :-)

-- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
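The point that recordsize is a maximum rather than a minimum can be illustrated with a deliberately simplified model. Everything here is an assumption for illustration: the 512-byte rounding unit, the function name `logical_blocks`, and the omission of compression, metadata, and raidz parity. It is not a description of the actual allocator.

```python
import math

# Simplified model of "recordsize is a maximum, not a minimum".
# Assumptions: 512-byte rounding unit; compression, metadata, and raidz
# parity are ignored. Not the real ZFS allocator.

SECTOR = 512
RECORDSIZE = 128 * 1024


def logical_blocks(file_size: int, recordsize: int = RECORDSIZE,
                   sector: int = SECTOR) -> list[int]:
    """Return the sizes of the logical blocks used to hold a file."""
    if file_size <= 0:
        return []
    if file_size <= recordsize:
        # A small file gets one block sized to the file, not to recordsize.
        return [math.ceil(file_size / sector) * sector]
    # A larger file is stored as a run of recordsize-sized blocks.
    return [recordsize] * math.ceil(file_size / recordsize)


print(logical_blocks(4 * 1024))    # one 4K block, not one 128K block
print(logical_blocks(300 * 1024))  # three 128K blocks
```

Under this model a 4K file occupies a 4K block, so the "124K wasted per 4K file" case from the original question does not arise.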
Edward Ned Harvey
2010-Oct-18  14:13 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > This is one of the reasons the raidzN resilver code is inefficient.
> > Since you end up waiting for the slowest seek time of any one disk in
> > the vdev, and when that's done, the amount of data you were able to
> > process was at most 128K. Rinse and repeat.
>
> How is this different than all other RAID implementations?

Hardware raid has the disadvantage that it must resilver the whole disk regardless of how much of the disk is used. Hardware raid has the advantage that it will resilver sequentially, so despite the fact that it resilvers unused space, it is limited by sustainable throughput instead of random seek time. The resilver time for hardware raid is a constant regardless of what the OS has done with the disks over time (neglecting system usage during resilver).

If your ZFS vdev is significantly full, with data that was written, and snapshotted, and rewritten, and snapshots destroyed, etc etc etc ... typical usage for a system that has been in production for a while ... then the time to resilver the whole disk block-by-block will be lower than the time to resilver the used portions in order of allocation time. This is why sometimes the ZFS resilver time for a raidzN can be higher than the time to resilver a similar hardware raid, as evidenced by the frequent comments & complaints on this list about raidzN resilver time.

Let's crunch some really quick numbers here. Suppose a 6Gbit/sec sas/sata bus, with 6 disks in a raid-5. Each disk is 1TB, 1000G, and each disk is capable of sustaining 1 Gbit/sec sequential operations. These are typical measurements for systems I use. Then 1000G = 8000Gbit. It will take 8000 sec to resilver = 133min. So whenever people have resilver times longer than that ... it's because ZFS resilver code for raidzN is inefficient.
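The quick numbers above can be reproduced with a few lines. This is only the post's own back-of-the-envelope estimate, using the post's figures (1000G disk, 1 Gbit/sec sustained sequential rate); the function name is hypothetical.

```python
# Back-of-the-envelope sequential resilver estimate, using the figures
# from the post: a 1000G disk sustaining 1 Gbit/sec sequentially.

def sequential_resilver_seconds(disk_gbytes: float, rate_gbit_per_s: float) -> float:
    """Time to copy a whole disk front to back at a sustained rate."""
    disk_gbits = disk_gbytes * 8  # 1000G = 8000Gbit
    return disk_gbits / rate_gbit_per_s


seconds = sequential_resilver_seconds(1000, 1)
print(f"{seconds:.0f} sec = {seconds / 60:.0f} min")
```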
Bob Friesenhahn
2010-Oct-18  14:53 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Mon, 18 Oct 2010, Edward Ned Harvey wrote:

> sec to resilver = 133min. So whenever people have resilver times longer
> than that ... It's because ZFS resilver code for raidzN is inefficient.

You keep using the term "code" and using terms like "code is inefficient" when it seems that you are talking about something else entirely. As someone who authors "code", this is very confusing to me.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Marty Scholes
2010-Oct-18  20:33 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> Richard wrote:
> Yep, it depends entirely on how you use the pool. As soon as you
> come up with a credible model to predict that, then we can optimize
> accordingly :-)

You say that somewhat tongue-in-cheek, but Edward's right. If the resilver code progresses in slab/transaction-group/whatever-the-correct-term-is order, then a pool with any significant use will have the resilver code seeking all over the disk.

If instead resilver blindly moved in block number order, then it would have very little seek activity, and the effective throughput would be close to that of pure sequential i/o for both the new disk and the remaining disks in the vdev.

Would it make sense for scrub/resilver to be more aware of operating in disk order instead of zfs order?

--
This message posted from opensolaris.org
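The difference between the two traversal orders can be illustrated with a toy model. This is a sketch under loud assumptions: block offsets are drawn at random to stand in for an aged pool, head travel is measured as the plain distance between consecutive offsets, and drive caching and command reordering are ignored. It does not model how ZFS actually walks its metadata.

```python
import random

# Toy model: total head travel when visiting the same blocks in
# allocation order versus disk (offset) order.
# Assumptions: random offsets stand in for an aged, fragmented pool;
# drive caches and request reordering are ignored.


def total_head_travel(offsets: list[int]) -> int:
    """Sum of distances between consecutive offsets visited."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


rng = random.Random(0)  # fixed seed so the run is repeatable
allocation_order = [rng.randrange(10**9) for _ in range(10_000)]
disk_order = sorted(allocation_order)

print("allocation order travel:", total_head_travel(allocation_order))
print("disk order travel:      ", total_head_travel(disk_order))
```

In disk order the travel collapses to a single sweep from the lowest offset to the highest, which is the behavior the message above is asking for.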
Edward Ned Harvey
2010-Oct-18  21:32 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Marty Scholes
>
> Would it make sense for scrub/resilver to be more aware of operating in
> disk order instead of zfs order?

It would certainly make sense. As mentioned, even if you do the entire disk this way, including unused space, it is faster than making the poor little disks randomly seek all over the place for tiny little fragments that eventually add up to a significant portion of the whole disk.

The main question is: How difficult would it be to implement?
Edward Ned Harvey
2010-Oct-20  12:50 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: Edward Ned Harvey [mailto:shill at nedharvey.com]
>
> Let's crunch some really quick numbers here. Suppose a 6Gbit/sec
> sas/sata bus, with 6 disks in a raid-5. Each disk is 1TB, 1000G, and
> each disk is capable of sustaining 1 Gbit/sec sequential operations.
> These are typical measurements for systems I use. Then 1000G =
> 8000Gbit. It will take 8000 sec to resilver = 133min. So whenever
> people have resilver times longer than that ... It's because ZFS
> resilver code for raidzN is inefficient.

I hate to be the unfortunate one verifying my own point here, but: One of the above mentioned disks needed to be resilvered yesterday. (Actually a 2T disk.) It has now resilvered 1.12T in 18.5 hrs, and has 10.5 hrs remaining. This is a mirror. The problem would be several times worse if it were a raidz.

So I guess it's unfair to say "raidz is inefficient at resilvering." The truth is, ZFS in general is inefficient at resilvering, but the problem is several times worse on raidz than it is for mirrors. The more disks in the vdev, the worse the problem. The fewer vdevs in the pool, the worse the problem. So you're able to minimize the problem by using a bunch of mirrors instead of raidzN.

Although the problem exists on mirrors too, it's nothing so dramatic that I would destroy & recreate my pool because of it. People with raidzN often do.
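For comparison with the 1 Gbit/sec sequential figure quoted above, the reported progress can be turned into an effective rate. This sketch assumes "T" means 10^12 bytes and treats the reported 18.5 hrs as wall-clock time; the function name is made up for the example.

```python
# Effective resilver rate implied by the report above: 1.12T in 18.5 hrs.
# Assumption: 1T = 10**12 bytes.


def effective_rate_mb_s(bytes_done: float, hours: float) -> float:
    """Average throughput in MB/s (10**6 bytes per second)."""
    return bytes_done / (hours * 3600) / 1e6


observed = effective_rate_mb_s(1.12e12, 18.5)
sequential = 1e9 / 8 / 1e6  # 1 Gbit/sec expressed in MB/s

print(f"observed:   {observed:.1f} MB/s")
print(f"sequential: {sequential:.1f} MB/s")
print(f"ratio:      {sequential / observed:.1f}x")
```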
Trond Michelsen
2010-Oct-20  13:11 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Wed, Oct 20, 2010 at 2:50 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:

> One of the above mentioned disks needed to be resilvered yesterday.
> (Actually a 2T disk.) It has now resilvered 1.12T in 18.5 hrs, and has 10.5
> hrs remaining. This is a mirror. The problem would be several times worse
> if it were a raidz.

Is this one of those "Advanced format" drives (Western Digital EARS or Samsung F4), which emulates 512 byte sectors? Or is that only a problem with raidz anyway?

--
Trond Michelsen
Erik Trimble
2010-Oct-21  03:02 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Mon, 2010-10-18 at 17:32 -0400, Edward Ned Harvey wrote:

> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Marty Scholes
> >
> > Would it make sense for scrub/resilver to be more aware of operating in
> > disk order instead of zfs order?
>
> It would certainly make sense. As mentioned, even if you do the entire disk
> this way, including unused space, it is faster than making the poor little
> disks randomly seek all over the place for tiny little fragments that
> eventually add up to a significant portion of the whole disk.
>
> The main question is: How difficult would it be to implement?

Ideally, you want the best of both worlds: ZFS is currently *much* faster when doing partial resyncs (i.e. updating stale drives) by using the walk-the-metadata-tree method. However, it would be nice to have it recognize when a full disk rebuild is required, and switch to some form of a full disk sequential copy.

The problem with a full sequential copy is threefold, however:

(a) you (often) copy a whole lot of bits that aren't actually holding any valuable info

(b) it can get a little tricky distinguishing between the case of an interrupted full-disk resilver and a freshen-the-stale-drive resilver.

(c) you generally punt on any advantage of knowing how the pool is structured.

Frankly, if I could ever figure out when the mythical BP rewrite (or equivalent feature) will appear, I'd be able to implement a defragger (or, maybe, a "compactor" is a better term). Having a defrag util keep the zpool relatively compacted would seriously reduce the work in a resilver.

--
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)