Edward Ned Harvey
2010-Oct-17 13:38 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
The default blocksize is 128K. If you are using mirrors, then each block on disk will be 128K whenever possible. But if you're using raidzN with a capacity of M disks (M disks useful capacity + N disks redundancy), then the block size on each individual disk will be 128K / M. Right? This is one of the reasons the raidzN resilver code is inefficient: you end up waiting for the slowest seek time of any one disk in the vdev, and when that's done, the amount of data you were able to process was at most 128K. Rinse and repeat.

Would it not be wise, when creating raidzN vdevs, to increase the blocksize to 128K * M? Then the on-disk blocksize for each disk could be the same as the mirror on-disk blocksize of 128K. It still won't resilver as fast as a mirror, but the raidzN resilver would be accelerated by as much as M times. Right?

The only disadvantage that I know of would be wasted space. Every 4K file in a mirror can waste up to 124K of disk space, right? And in the above described scenario, every 4K file in the raidzN can waste up to 128K * M of disk space, right? Also, if you have a lot of these sparse 4K blocks, then the resilver time doesn't actually improve either, because you perform one seek, and regardless of whether you fetch 128K or 128K*M, you still paid one maximum seek time to fetch 4K of useful data.

Point is: if the goal is to reduce the number of on-disk slabs, and therefore reduce the number of seeks necessary to resilver, one thing you could do is increase the pool blocksize, right? YMMV, and YM will depend on how you use your pool. Hopefully you're able to bias your usage in favor of large block writes.
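
A rough back-of-the-envelope sketch of the arithmetic above (the 128K default and the even M-way split are the post's simplifying assumptions; the helper names are just for illustration):

    # Back-of-the-envelope model of the per-disk split described above.
    # Assumes the post's simplification: a full 128K record is spread
    # evenly across the M data disks of a raidzN vdev (parity ignored).

    RECORDSIZE = 128 * 1024  # bytes; the default ZFS recordsize

    def per_disk_chunk(recordsize, m_data_disks):
        """Approximate bytes each data disk receives for one full record."""
        return recordsize / m_data_disks

    def worst_case_waste(recordsize, filesize):
        """Worst-case slack if a small file occupied a full record
        (the 4K-file-in-a-128K-block scenario from the post)."""
        return max(recordsize - filesize, 0)

    for m in (3, 4, 5, 8):
        print(f"M={m}: ~{per_disk_chunk(RECORDSIZE, m) / 1024:.1f}K per disk per record")
    print(f"4K file in a full 128K block wastes up to "
          f"{worst_case_waste(RECORDSIZE, 4 * 1024) // 1024}K")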
Bob Friesenhahn
2010-Oct-17 16:04 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Sun, 17 Oct 2010, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then each
> block on disk will be 128K whenever possible. But if you're using
> raidzN with a capacity of M disks (M disks useful capacity + N disks
> redundancy) then the block size on each individual disk will be 128K
> / M. Right? This is one of the reasons the raidzN resilver code is
> inefficient. Since you end up waiting for the slowest seek time of
> any one disk in the vdev, and when that's done, the amount of data
> you were able to process was at most 128K. Rinse and repeat.

Your idea about what it means for "code" to be inefficient is clearly
vastly different than my own. Regardless, the physical layout issues
(impacting IOPS requirements) are a reality.

> Would it not be wise, when creating raidzN vdevs, to increase the
> blocksize to 128K * M? Then, the on-disk blocksize for each disk
> could be the same as the mirror on-disk blocksize of 128K. It still
> won't resilver as fast as a mirror, but the raidzN resilver would be
> accelerated by as much as M times. Right?

This might work for HPC applications with huge files and huge
sequential streaming data rate requirements. It would be detrimental
for the case of small files, or applications which issue many small
writes, and particularly bad for many random synchronous writes.

> The only disadvantage that I know of would be wasted space. Every
> 4K file in a mirror can waste up to 124K of disk space, right? And
> in the above described scenario, every 4K file in the raidzN can
> waste up to 128K * M of disk space, right? Also, if you have a lot
> of these sparse 4K blocks, then the resilver time doesn't actually
> improve either. Because you perform one seek, and regardless if you
> fetch 128K or 128K*M, you still paid one maximum seek time to fetch
> 4K of useful data.

The tally of disadvantages is quite large. Note that zfs needs to
write each zfs "block" and you are dramatically increasing the level
of write amplification. Also zfs needs to checksum each whole block
and the checksum adds to the latency. The risk of block corruption is
increased. 128K is already quite large for a block.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
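
A rough illustration of the write-amplification point above (illustrative numbers only, not measurements from the thread; it assumes a small write inside an existing large record forces the whole record to be rewritten under copy-on-write):

    # Rough illustration of the write-amplification concern above: with
    # copy-on-write, modifying a small amount of data inside a large
    # record means rewriting (and checksumming) the whole record.
    # Illustrative numbers only.

    def write_amplification(record_bytes, modified_bytes):
        """Bytes actually rewritten per byte the application changed."""
        return record_bytes / modified_bytes

    modified = 4 * 1024                      # a 4K application write
    for record_k in (128, 128 * 5):          # default vs the proposed 128K*M (M=5)
        amp = write_amplification(record_k * 1024, modified)
        print(f"{record_k}K record, 4K write: ~{amp:.0f}x write amplification")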
Kyle McDonald
2010-Oct-17 17:26 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On 10/17/2010 9:38 AM, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then
> each block on disk will be 128K whenever possible. But if you're
> using raidzN with a capacity of M disks (M disks useful capacity +
> N disks redundancy) then the block size on each individual disk
> will be 128K / M. Right?

If I understand things correctly, I think this is why it is recommended
that you pick an M that divides into 128K evenly. I believe powers of 2
are recommended.

I think increasing the block size to 128K*M would be overkill, but that
idea does make me wonder: in cases where M can't be a power of 2, would
it make sense to adjust the block size so that M still divides evenly?

If M were 4 then the data written to each drive would be 32K. So if you
really wanted M to be 5 drives, is there an advantage to making the block
size 160K, or if that's too big, how about 80K? Likewise, if you really
wanted M to be 3 drives, would adjusting the blocksize to 96K make sense?

-Kyle
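
A quick sketch of the even-division question above. This is only the arithmetic behind the suggestion; it says nothing about whether sizes like 160K or 96K are actually settable as a ZFS recordsize:

    # For a given record size and M data drives, does each drive get a
    # whole chunk?  Purely illustrative arithmetic.

    def chunk_per_drive(recordsize_k, m_data_drives):
        """Per-drive chunk in KB, or None if it doesn't divide evenly."""
        if recordsize_k % m_data_drives:
            return None
        return recordsize_k // m_data_drives

    for recordsize_k in (128, 160, 96, 80):
        for m in (3, 4, 5):
            chunk = chunk_per_drive(recordsize_k, m)
            status = f"{chunk}K per drive" if chunk is not None else "does not divide evenly"
            print(f"recordsize {recordsize_k}K across {m} data drives: {status}")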
Richard Elling
2010-Oct-18 03:00 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Oct 17, 2010, at 6:38 AM, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then each block on disk will be 128K whenever possible. But if you're using raidzN with a capacity of M disks (M disks useful capacity + N disks redundancy) then the block size on each individual disk will be 128K / M. Right?

Yes, but it is worse for RAID-5, where you will likely have to do a RMW if your stripe size is not perfectly matched to the blocksize. This is the case where raidz shines over the alternatives.

> This is one of the reasons the raidzN resilver code is inefficient. Since you end up waiting for the slowest seek time of any one disk in the vdev, and when that's done, the amount of data you were able to process was at most 128K. Rinse and repeat.

How is this different than all other RAID implementations?

> Would it not be wise, when creating raidzN vdevs, to increase the blocksize to 128K * M? Then, the on-disk blocksize for each disk could be the same as the mirror on-disk blocksize of 128K. It still won't resilver as fast as a mirror, but the raidzN resilver would be accelerated by as much as M times. Right?

We had this discussion in 2007, IIRC. The bottom line was that if you have a fixed record size workload, then set the appropriate recordsize and it will make sense to adjust your raidz1 configuration to avoid gaps. For raidz2/3 or mixed record length workloads, it is not clear that matching the number of data/parity disks offers any advantage.

> The only disadvantage that I know of would be wasted space. Every 4K file in a mirror can waste up to 124K of disk space, right?

No. 4K files have a recordsize of 4K. This is why we refer to this case as a mixed record size workload. Remember, the recordsize parameter is a maximum limit, not a minimum limit.

> And in the above described scenario, every 4K file in the raidzN can waste up to 128K * M of disk space, right?

No.

> Also, if you have a lot of these sparse 4K blocks, then the resilver time doesn't actually improve either. Because you perform one seek, and regardless if you fetch 128K or 128K*M, you still paid one maximum seek time to fetch 4K of useful data.

Seek penalties are hard to predict or model. Modern drives have efficient algorithms and large buffer caches. It cannot be predicted whether the next read will be in the buffer cache already. Indeed, it is not even possible to predict the read order. The only sure-fire way to prevent seeks is to use SSDs.

> Point is: If the goal is to reduce the number of on-disk slabs, and therefore reduce the number of seeks necessary to resilver, one thing you could do is increase the pool blocksize, right?

Not the pool block size, the application's block size. Applications which make lots of itty bitty I/Os will tend to take more time to resilver. Applications that make lots of large I/Os will resilver faster.

> YMMV, and YM will depend on how you use your pool. Hopefully you're able to bias your usage in favor of large block writes.

Yep, it depends entirely on how you use the pool. As soon as you come up with a credible model to predict that, then we can optimize accordingly :-)
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
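
A small illustration of the recordsize-is-a-maximum point above, under a simplified model (single-record files no larger than the recordsize, sizes rounded up to 512-byte sectors, metadata and parity ignored):

    # A file smaller than the recordsize is stored in a correspondingly
    # smaller record, so a 4K file does not occupy a full 128K block.
    # Simplified model only.

    SECTOR = 512

    def record_for_small_file(filesize, recordsize=128 * 1024):
        """Approximate record size used for a file no larger than recordsize."""
        sectors = -(-filesize // SECTOR)      # round up to whole sectors
        return min(sectors * SECTOR, recordsize)

    for size_k in (1, 4, 100, 128):
        record_k = record_for_small_file(size_k * 1024) // 1024
        print(f"{size_k}K file -> ~{record_k}K record")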
Edward Ned Harvey
2010-Oct-18 14:13 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > This is one of the reasons the raidzN resilver code is inefficient.
> > Since you end up waiting for the slowest seek time of any one disk in
> > the vdev, and when that's done, the amount of data you were able to
> > process was at most 128K. Rinse and repeat.
>
> How is this different than all other RAID implementations?

Hardware raid has the disadvantage that it must resilver the whole disk regardless of how much of the disk is used. Hardware raid has the advantage that it will resilver sequentially, so despite the fact that it resilvers unused space, it is limited by sustainable throughput instead of random seek time. The resilver time for hardware raid is a constant regardless of what the OS has done with the disks over time (neglecting system usage during resilver).

If your ZFS vdev is significantly full, with data that was written, and snapshotted, and rewritten, and snapshots destroyed, etc. ... typical usage for a system that has been in production for a while ... then the time to resilver the whole disk block-by-block will be lower than the time to resilver the used portions in order of allocation time. This is why the ZFS resilver time for a raidzN can sometimes be higher than the time to resilver a similar hardware raid, as evidenced by the frequent comments and complaints on this list about raidzN resilver time.

Let's crunch some really quick numbers here. Suppose a 6Gbit/sec sas/sata bus, with 6 disks in a raid-5. Each disk is 1TB, 1000G, and each disk is capable of sustaining 1 Gbit/sec sequential operations. These are typical measurements for systems I use. Then 1000G = 8000Gbit. It will take 8000 sec to resilver = 133min. So whenever people have resilver times longer than that ... it's because ZFS resilver code for raidzN is inefficient.
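
A quick sanity check of that arithmetic, using only the figures quoted in the paragraph above:

    # Sequential-resilver estimate: 1 TB (1000 GB) per disk,
    # ~1 Gbit/s sustained sequential throughput per disk.

    disk_size_gbit = 1000 * 8          # 1000 GB expressed in gigabits
    throughput_gbit_per_s = 1.0        # sustained sequential rate per disk

    resilver_s = disk_size_gbit / throughput_gbit_per_s
    print(f"{resilver_s:.0f} s ~= {resilver_s / 60:.0f} min")   # ~8000 s ~= 133 min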
Bob Friesenhahn
2010-Oct-18 14:53 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Mon, 18 Oct 2010, Edward Ned Harvey wrote:

> sec to resilver = 133min. So whenever people have resilver times longer
> than that ... It's because ZFS resilver code for raidzN is inefficient.

You keep using the term "code" and using terms like "code is
inefficient" when it seems that you are talking about something else
entirely. As someone who authors "code", this is very confusing to me.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Marty Scholes
2010-Oct-18 20:33 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> Richard wrote:
> Yep, it depends entirely on how you use the pool. As soon as you
> come up with a credible model to predict that, then we can optimize
> accordingly :-)

You say that somewhat tongue-in-cheek, but Edward's right. If the
resilver code progresses in slab/transaction-group/whatever-the-correct-term-is
order, then a pool with any significant use will have the resilver code
seeking all over the disk.

If instead, resilver blindly moved in block number order, then it would
have very little seek activity and the effective throughput would be close
to that of pure sequential I/O, for both the new disk and the remaining
disks in the vdev.

Would it make sense for scrub/resilver to be more aware of operating in
disk order instead of zfs order?
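
A toy model of the trade-off described above. All numbers here are made-up illustrative assumptions, not measurements from any pool:

    # Metadata-order resilver pays roughly one seek per block, while
    # disk-order resilver streams the whole device sequentially.

    AVG_SEEK_S = 0.008            # ~8 ms average seek + rotational latency
    SEQ_MBPS = 125.0              # ~1 Gbit/s sustained sequential throughput
    DISK_GB = 1000                # whole-disk capacity
    USED_GB = 600                 # allocated data on the failed disk
    AVG_BLOCK_KB = 32             # average per-disk chunk of a record

    blocks = USED_GB * 1024 * 1024 / AVG_BLOCK_KB
    seek_bound_hours = blocks * AVG_SEEK_S / 3600
    sequential_hours = DISK_GB * 1024 / SEQ_MBPS / 3600

    print(f"metadata-order (seek-bound): ~{seek_bound_hours:.0f} h")
    print(f"disk-order (sequential, whole disk): ~{sequential_hours:.1f} h")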
Edward Ned Harvey
2010-Oct-18 21:32 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Marty Scholes
>
> Would it make sense for scrub/resilver to be more aware of operating in
> disk order instead of zfs order?

It would certainly make sense. As mentioned, even if you do the entire
disk this way, including unused space, it is faster than making the poor
little disks randomly seek all over the place for tiny little fragments
that eventually add up to a significant portion of the whole disk.

The main question is: how difficult would it be to implement?
Edward Ned Harvey
2010-Oct-20 12:50 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
> From: Edward Ned Harvey [mailto:shill at nedharvey.com]
>
> Let's crunch some really quick numbers here. Suppose a 6Gbit/sec
> sas/sata bus, with 6 disks in a raid-5. Each disk is 1TB, 1000G, and
> each disk is capable of sustaining 1 Gbit/sec sequential operations.
> These are typical measurements for systems I use. Then 1000G =
> 8000Gbit. It will take 8000 sec to resilver = 133min. So whenever
> people have resilver times longer than that ... It's because ZFS
> resilver code for raidzN is inefficient.

I hate to be the unfortunate one verifying my own point here, but: one of
the above mentioned disks needed to be resilvered yesterday. (Actually a
2T disk.) It has now resilvered 1.12T in 18.5 hrs, and has 10.5 hrs
remaining. This is a mirror. The problem would be several times worse if
it were a raidz.

So I guess it's unfair to say "raidz is inefficient at resilvering." The
truth is, ZFS in general is inefficient at resilvering, but the problem is
several times worse on raidz than it is for mirrors. The more disks in the
vdev, the worse the problem. The fewer vdevs in the pool, the worse the
problem. So you're able to minimize the problem by using a bunch of
mirrors instead of raidzN.

Although the problem exists on mirrors too, it's nothing so dramatic that
I would destroy & recreate my pool because of it. People with raidzN often
do.
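
For comparison with the earlier 133-minute estimate, the observed numbers above work out roughly as follows (a rough calculation only; it assumes "1.12T" means 1.12 TiB and ignores any other pool activity during the resilver):

    # Observed resilver rate vs the sequential bound from the earlier post.

    resilvered_mib = 1.12 * 1024 * 1024      # 1.12 TiB in MiB
    elapsed_s = 18.5 * 3600                  # 18.5 hours

    observed_mib_s = resilvered_mib / elapsed_s
    sequential_mib_s = 125.0                 # ~1 Gbit/s from the earlier estimate

    print(f"observed:   ~{observed_mib_s:.1f} MiB/s")
    print(f"sequential: ~{sequential_mib_s:.0f} MiB/s "
          f"(~{sequential_mib_s / observed_mib_s:.0f}x faster)")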
Trond Michelsen
2010-Oct-20 13:11 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Wed, Oct 20, 2010 at 2:50 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:

> One of the above mentioned disks needed to be resilvered yesterday.
> (Actually a 2T disk.) It has now resilvered 1.12T in 18.5 hrs, and has
> 10.5 hrs remaining. This is a mirror. The problem would be several
> times worse if it were a raidz.

Is this one of those "Advanced format" drives (Western Digital EARS or
Samsung F4), which emulates 512 byte sectors? Or is that only a problem
with raidz anyway?

--
Trond Michelsen
Erik Trimble
2010-Oct-21 03:02 UTC
[zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
On Mon, 2010-10-18 at 17:32 -0400, Edward Ned Harvey wrote:

> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Marty Scholes
> >
> > Would it make sense for scrub/resilver to be more aware of operating in
> > disk order instead of zfs order?
>
> It would certainly make sense. As mentioned, even if you do the entire disk
> this way, including unused space, it is faster than making the poor little
> disks randomly seek all over the place for tiny little fragments that
> eventually add up to a significant portion of the whole disk.
>
> The main question is: How difficult would it be to implement?

Ideally, you want the best of both worlds: ZFS is currently *much* faster
when doing partial resyncs (i.e. updating stale drives) by using the
walk-the-metadata-tree method. However, it would be nice to have it
recognize when a full disk rebuild is required, and switch to some form of
a full disk sequential copy.

The problem with a full sequential copy is threefold, however:

(a) You (often) copy a whole lot of bits that aren't actually holding any
valuable info.

(b) It can get a little tricky distinguishing between the case of an
interrupted full-disk resilver and a freshen-the-stale-drive resilver.

(c) You generally punt on any advantage of knowing how the pool is
structured.

Frankly, if I could ever figure out when the mythical BP rewrite (or
equivalent feature) will appear, I'd be able to implement a defragger (or,
maybe, a "compactor" is a better term). Having a defrag util keep the
zpool relatively compacted would seriously reduce the work in a resilver.

--
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
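
A conceptual sketch of the "best of both worlds" decision described above. This is not actual ZFS code, and every name in it is hypothetical:

    # Hypothetical strategy selection between a metadata-walk resilver
    # (only touches live blocks) and a whole-disk sequential copy
    # (wastes reads on free space but never seeks).  Sketch only.

    def choose_resilver_strategy(disk_is_blank, seek_bound_estimate_s,
                                 sequential_estimate_s):
        if not disk_is_blank:
            # Stale-but-present disk: only the recent deltas are needed.
            return "metadata-walk"
        if sequential_estimate_s < seek_bound_estimate_s:
            # Full rebuild where streaming the whole device wins anyway.
            return "sequential-copy"
        return "metadata-walk"

    # e.g. a fresh replacement disk where streaming takes 2.3 h but a
    # seek-bound walk is estimated at 40+ h:
    print(choose_resilver_strategy(True, 40 * 3600, 2.3 * 3600))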