I know there was a thread about this a few months ago. However, with the costs of SSDs falling like they have, the idea of an Oracle X4270 M2/Cisco C210 M2/IBM x3650 M3 class of machine with a 13-drive RAIDZ2 zpool (1 hot spare) is really starting to sound alluring to me/us. Especially with something like the OCZ Deneva 2 drives (SandForce 2281 with a supercap), the SanDisk (Pliant) Lightning series, or perhaps the Hitachi SSD400Ms coming in at prices that aren't a whole lot more than 600GB 15k drives. (From an enterprise perspective, anyway.)

Systems with a similar load (OLTP) are frequently I/O bound - e.g. a server with a Sun 2540 FC array with 11x 300GB 15k SAS drives and 2x X25-Es for ZIL/L2ARC - so the extra bandwidth would be welcome.

Am I crazy for putting something like this into production using Solaris 10/11? On paper, it really seems ideal for our needs.

Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL, but it's so common in the NetApp world that I was surprised to read that. A 20-drive RAID-Z2 pool really wouldn't/couldn't recover (resilver) from a drive failure? That seems to fly in the face of the x4500 boxes from a few years ago.

matt
On Tue, 27 Sep 2011, Matt Banks wrote:

> Am I crazy for putting something like this into production using Solaris 10/11?
> On paper, it really seems ideal for our needs.

As long as the drive firmware operates correctly, I don't see a problem.

> Also, maybe I read it wrong, but why is it that (in the previous
> thread about hw raid and zpools) zpools with large numbers of
> physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL

There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes one "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
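(For readers following along: per-vdev behaviour like this is easy to watch on a live system. A minimal sketch; the pool name "tank" is just a placeholder for whatever your pool is called:)

    # Show I/O statistics broken out per vdev (and per leaf device),
    # refreshing every 5 seconds; compare operations/sec across vdevs.
    zpool iostat -v tank 5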
On Tue, Sep 27, 2011 at 1:21 PM, Matt Banks <mattbanks at gmail.com> wrote:

> Also, maybe I read it wrong, but why is it that (in the previous thread about
> hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+)
> were frowned upon? I know that ZFS != WAFL but it's so common in the
> NetApp world that I was surprised to read that. A 20-drive RAID-Z2 pool
> really wouldn't/couldn't recover (resilver) from a drive failure? That seems
> to fly in the face of the x4500 boxes from a few years ago.

There is a world of difference between a zpool with 20+ drives and a single vdev with 20+ drives. What has been frowned upon is a single vdev with more than about 8 drives.

I have a zpool with 120 drives: 22 vdevs, each a 5-drive raidz2, plus 10 hot spares. The only failures I had to resilver were before it went into production (and I had little data in it at the time), but I expect resilver times to be reasonable based on experience with other configurations I have had.

Keep in mind that random read I/O is proportional to the number of vdevs, NOT the number of drives. See
https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html
for the results of some of my testing.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
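(To make the vdev-versus-pool distinction concrete for the 13-drive box in the original post, here is a rough sketch of the two layouts being compared. Device names are hypothetical, and it assumes the "13-drive RAIDZ2 (1 hot spare)" means a 12-wide raidz2 plus a spare; you would of course pick one layout or the other:)

    # Layout 1: what the original post describes -- a single 12-disk raidz2
    # plus a hot spare.  One vdev, so one vdev's worth of random IOPS.
    zpool create bigtank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
               c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        spare  c1t12d0

    # Layout 2: the same 13 disks as two 6-disk raidz2 vdevs plus the spare.
    # Twice the vdevs, so roughly twice the random-read IOPS and much less
    # data per vdev to resilver, at the cost of two more disks of parity.
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        spare  c1t12d0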
On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:

> On Tue, 27 Sep 2011, Matt Banks wrote:
>
>> Also, maybe I read it wrong, but why is it that (in the previous
>> thread about hw raid and zpools) zpools with large numbers of
>> physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL
>
> There is no concern with a large number of physical drives in a pool.
> The primary concern is with the number of drives per vdev. Any
> variation in the latency of the drives hinders performance, and each
> I/O to a vdev consumes one "IOP" across all of the drives in the vdev
> (or stripe) when raidzN is used. Having more vdevs is better for
> consistent performance and more available IOPS.
>
> Bob

To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon is that IOPS for a RAIDZ vdev are pretty much O(1), regardless of how many disks are in the vdev. So the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (it scales as O(N)), but you're unlikely to get that for the vast majority of workloads.

Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as on a 30-drive RAIDZ. Since you're highly likely to store much more data on a larger vdev, the resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev.

This leads to the following situation: with 20 x 1TB drives, here are several possible configurations and their relative resilver times (relative, because without knowing the exact layout of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ:  15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the zpool. The above also applies to RAIDZ[23], as the additional parity disk doesn't materially affect resilver times in either direction (and yes, it's not really a "parity disk" - I'm just being sloppy).

The other main reason is that a larger number of drives in a single vdev means a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to the fact that your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ zpool has.

-Erik
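(A quick back-of-the-envelope check of the usable-capacity column above; a rough sketch only, assuming single-parity raidz, 1TB drives, and ignoring spares and metadata overhead:)

    # For 20 x 1TB drives split into equal-width single-parity raidz vdevs,
    # usable space is (width - 1) TB per vdev.
    for width in 4 5 10 20; do
        vdevs=$(( 20 / width ))
        usable=$(( vdevs * (width - 1) ))
        echo "${vdevs} x ${width}-disk raidz: ${usable} TB usable"
    done
    # Prints 15, 16, 18 and 19 TB, matching layouts (a) through (d).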
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Matt Banks
>
> Am I crazy for putting something like this into production using Solaris 10/11?
> On paper, it really seems ideal for our needs.

Do you have an objection to Solaris 10/11 for some reason? No, it's not crazy (and I wonder why you would ask).

> Also, maybe I read it wrong, but why is it that (in the previous thread about
> hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+)

Clarification that I know others have already added, but I'll reiterate: it's not the number of devices in a zpool that matters. It's the amount of data in the resilvering vdev, the number of devices inside the vdev, and your usage patterns (where, especially for a database server, the typical usage pattern is the worst-case usage pattern). Together these of course have a relation to the number of devices in the pool, but that's not what matters.

The problem basically applies to HDDs. By building your pool out of SSDs, this problem should be eliminated.

Here is the problem: assuming the data in the pool is evenly distributed amongst the vdevs, the more vdevs you have, the less data is in each one. If you make your pool out of a small number of large raidzN vdevs, then you're going to have relatively a lot of data in each vdev, and therefore a lot of data in the resilvering vdev.

When a vdev resilvers, it reads each slab of data in essentially time order, which is approximately random disk order, in order to reconstruct the data that must be written to the resilvering device. This creates two problems: (a) since each disk must fetch a piece of each slab, the random access time of the vdev as a whole is approximately the random access time of the slowest individual device, so the more devices in the vdev, the worse the IOPS for the vdev; and (b) the more data slabs in the vdev, the more iterations of random I/O operations must be completed.

In other words, during resilvers you're IOPS limited. If your pool is made of all SSDs, then problem (a) is basically nonexistent, since the random access times of all the devices are equal and essentially zero. Problem (b) isn't necessarily a problem... It's like, if somebody is giving you $1,000 for free every month and then they suddenly drop down to only $500, you complain about what you've lost. ;-) (See below.)

In a hardware RAID system, resilvering is done sequentially on all disks in the array. Depending on your specs, a typical time might be 2 hrs. All blocks are resilvered regardless of whether or not they're used. But in ZFS, only used blocks are resilvered. That means if your vdev is empty, your resilver completes instantly. Also, if your vdev is made of SSDs, then the random access times will be just like the sequential access times, and your worst case is still equal to the hardware RAID resilver time.

The only time there's a problem is when you have a vdev made of HDDs, there's a bunch of data in it, and it's scattered randomly (which typically happens due to the nature of COW and snapshot creation/deletion over time). So the HDDs thrash around spending all their time doing random access, with very little payload for each random op. In these cases, even HDD mirrors end up having resilver times that are several times longer than sequentially resilvering the whole disk, including unused blocks.
In this case, mirrors are the best-case scenario, because they have both (a) minimal data in each vdev, and (b) a minimal number of devices in the resilvering vdev. Even so, the mirror resilver time might be something like 12 hours, in my experience, instead of the 2 hrs that hardware would have needed to resilver the whole disk. But if you were using a big raidzN vdev made of a bunch of HDDs (let's say 21 disks in a raidz3), you might get resilver times that are a couple of orders of magnitude too long... like 20 days instead of 10 hours. At that level, you should assume your resilver will never complete.

So again: not a problem if you're making your pool out of SSDs.
On Wed, Sep 28, 2011 at 8:21 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> When a vdev resilvers, it will read each slab of data, in essentially time
> order, which is approximately random disk order, in order to reconstruct the
> data that must be written on the resilvering device. This creates two
> problems, (a) Since each disk must fetch a piece of each slab, the random
> access time of the vdev as a whole is approximately the random access time
> of the slowest individual device. So the more devices in the vdev, the
> worse the IOPS for the vdev... And (b) the more data slabs in the vdev, the
> more iterations of random IO operations must be completed.
>
> In other words, during resilvers, you're IOPS limited. If your pool is made
> of all SSDs, then problem (a) is basically nonexistent, since the random
> access time of all the devices are equal and essentially zero. Problem (b)
> isn't necessarily a problem... It's like, if somebody is giving you $1,000
> for free every month and then they suddenly drop down to only $500, you
> complain about what you've lost. ;-) (See below.)

If you regularly spend all of the given $1,000, then you're going to complain hard when it suddenly drops to $500.

> So again: Not a problem if you're making your pool out of SSDs.

Big problem if your system is already using most of the available IOPS during normal operation.

--
Fajar
On Tue, 27 Sep 2011, Edward Ned Harvey wrote:

> The problem basically applies to HDDs. By building your pool out of SSDs,
> this problem should be eliminated.

This is not completely true. SSDs will help significantly, but they will still suffer from the synchronized commit of a transaction group. SSDs don't suffer from seek time, but they do suffer from erase/write time, and many SSDs are capable of only a few thousand flushed writes per second. It is just a matter of degree. SSDs which do garbage collection during the write cycle could cause the whole vdev to temporarily hang until the last SSD has committed its write.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sep 27, 2011, at 6:30 PM, Fajar A. Nugraha wrote:

> On Wed, Sep 28, 2011 at 8:21 AM, Edward Ned Harvey
>> So again: Not a problem if you're making your pool out of SSDs.
>
> Big problem if your system is already using most of the available IOPS during normal operation.

Resilvers are throttled, so they should not impact normal operation.

On Sep 27, 2011, at 6:36 PM, Bob Friesenhahn wrote:

> On Tue, 27 Sep 2011, Edward Ned Harvey wrote:
>>
>> The problem basically applies to HDDs. By building your pool out of SSDs,
>> this problem should be eliminated.
>
> This is not completely true. SSDs will help significantly but they will still suffer from the synchronized commit of a transaction group. SSDs don't suffer from seek time, but they do suffer from erase/write time, and many SSDs are capable of only a few thousand flushed writes per second. It is just a matter of degree.

Also, the default settings for the resilver throttle are set for HDDs. For SSDs, it is a good idea to change the throttle to be more aggressive.

> SSDs which do garbage collection during the write cycle could cause the whole vdev to temporarily hang until the last SSD has committed its write.

I think this will be unlikely, especially for a resilver workload. Resilvers are done async, so the only time you will be waiting is for the return of the cache flush during txg commit. In the cases where I've measured cache flushes, they tend to complete faster on SSDs than on HDDs, but it might be worthwhile to characterize this so we know.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> Also, the default settings for the resilver throttle are set for HDDs. For SSDs,
> it is a good idea to change the throttle to be more aggressive.

You mean...
Be more aggressive, resilver faster?
Or be more aggressive, throttling the resilver?

What's the reasoning that makes you want to set it differently from an HDD?
On Sep 28, 2011, at 8:44 PM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> Also, the default settings for the resilver throttle are set for HDDs. For SSDs,
>> it is a good idea to change the throttle to be more aggressive.
>
> You mean...
> Be more aggressive, resilver faster?
> Or be more aggressive, throttling the resilver?
>
> What's the reasoning that makes you want to set it differently from an HDD?

I think he means resilver faster.

SSDs can be driven harder and have more IOPS, so we can hit them harder with less impact on overall performance. The reason we throttle at all is to avoid saturating the drive with resilver traffic, which would prevent regular operations from making progress. Generally, I believe resilver operations are not "bandwidth bound" in the sense of pure throughput, but are IOPS bound. As SSDs have no seek time, they can handle a lot more of these little operations than a regular hard disk.

- Garrett
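(For anyone who wants to experiment: on Solaris/illumos-derived kernels of this era the resilver throttle is exposed as kernel tunables. The names and default values below are my understanding of the common settings and may differ on your release, so treat them as assumptions and check your own kernel first. A rough sketch of inspecting and loosening the throttle on an all-SSD pool:)

    # Inspect the current resilver throttle settings (decimal output).
    echo "zfs_resilver_delay/D"       | mdb -k   # ticks each resilver I/O is delayed while the pool is busy (commonly 2)
    echo "zfs_resilver_min_time_ms/D" | mdb -k   # minimum ms spent resilvering per txg (commonly 3000)
    echo "zfs_scan_idle/D"            | mdb -k   # idle window, in ticks, after which the delay is skipped (commonly 50)

    # Make resilver more aggressive on the running system (reverts on reboot):
    echo "zfs_resilver_delay/W0t0" | mdb -kw

    # Or persistently, via /etc/system:
    #   set zfs:zfs_resilver_delay = 0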
On Thu, Sep 29, 2011 at 11:33 AM, Garrett D'Amore <garrett.damore at gmail.com> wrote:

> I think he means resilver faster.
>
> SSDs can be driven harder and have more IOPS, so we can hit them harder
> with less impact on overall performance. The reason we throttle at all
> is to avoid saturating the drive with resilver traffic, which would
> prevent regular operations from making progress. Generally, I believe
> resilver operations are not "bandwidth bound" in the sense of pure
> throughput, but are IOPS bound. As SSDs have no seek time, they can handle
> a lot more of these little operations than a regular hard disk.
>
> - Garrett

What's the throttling rate, if I may call it that?

--
Zaeem
2011-09-29 17:15, Zaeem Arshad wrote:

> On Thu, Sep 29, 2011 at 11:33 AM, Garrett D'Amore
> <garrett.damore at gmail.com> wrote:
>
>> I think he means resilver faster.
>>
>> SSDs can be driven harder and have more IOPS, so we can hit them
>> harder with less impact on overall performance. The reason we
>> throttle at all is to avoid saturating the drive with resilver
>> traffic, which would prevent regular operations from making
>> progress. Generally, I believe resilver operations are not
>> "bandwidth bound" in the sense of pure throughput, but are IOPS
>> bound. As SSDs have no seek time, they can handle a lot more of
>> these little operations than a regular hard disk.
>>
>> - Garrett
>
> What's the throttling rate, if I may call it that?

IIRC about 7 MB/s, and I guess it is hardcoded, since the value is so well known as to have been reported several times.

I think another rationale for SSD throttling was with L2ARC tasks - to reduce the probable effects of write overdriving on SSD hardware (less efficiency and more wear on SSD cells).

//Jim
On Oct 16, 2011, at 3:56 AM, Jim Klimov wrote:

> 2011-09-29 17:15, Zaeem Arshad wrote:
>>
>> What's the throttling rate, if I may call it that?
>
> IIRC about 7 MB/s, and I guess it is hardcoded, since the value is so well known as to have been reported several times.

No, the resilver throttling is based more on IOPS than on bandwidth.

> I think another rationale for SSD throttling was with L2ARC tasks - to reduce the probable effects of write overdriving on SSD hardware (less efficiency and more wear on SSD cells).

L2ARC fill rate is, by default in most distros, 16MB/sec until full, then 8MB/sec.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
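(The L2ARC fill rate mentioned above comes, as I understand it, from the l2arc_write_max and l2arc_write_boost kernel tunables, each commonly 8MB/s by default, with the boost applied only while the ARC is still warming up - hence "16MB/sec until full, then 8MB/sec". Exact defaults can vary by distro, so treat the figures as assumptions to verify. A minimal sketch of inspecting and raising them:)

    # Inspect the current L2ARC fill-rate tunables (bytes); /E prints a 64-bit value.
    echo "l2arc_write_max/E"   | mdb -k
    echo "l2arc_write_boost/E" | mdb -k

    # Example: allow up to 64MB per fill interval for a fast SSD cache device
    # (live change, reverts on reboot; 0x4000000 bytes = 64MB).
    echo "l2arc_write_max/Z 0x4000000"   | mdb -kw
    echo "l2arc_write_boost/Z 0x4000000" | mdb -kw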