Nomen Nescio
2011-Jun-15 20:33 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Has there been any change to the server hardware with respect to number of drives since ZFS has come out? Many of the servers around still have an even number of drives (2, 4) etc. and it seems far from optimal from a ZFS standpoint. All you can do is make one or two mirrors, or a 3 way mirror and a spare, right? Wouldn't it make sense to ship with an odd number of drives so you could at least RAIDZ? Or stop making provision for anything except 1 or two drives or no drives at all and require CD or netbooting and just expect everybody to be using NAS boxes? I am just a home server user, what do you guys who work on commercial accounts think? How are people using these servers?
Jim Klimov
2011-Jun-16 13:11 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
As recently discussed on this list, ZFS does not actually care very much about the number of drives in a raidzN set, so optimization is not about stripe alignment and such, but about the number of spindles, resilver times, the number of redundancy disks, etc.

In my setups with 4 identical drives in a server I typically make a 10-20GB rpool as a mirror of slices on a couple of the drives, a same-sized pool for swap on the other couple of drives, and this leaves me with 4 identically sized slices for a separate data pool. Depending on requirements we can do any layout: performance (raid10) vs. reliability (raidz2) vs. space (raidz1).

HTH,
//Jim

2011-06-16 0:33, Nomen Nescio wrote:
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right? Wouldn't it make sense to ship with an odd number of drives
> so you could at least RAIDZ? Or stop making provision for anything except 1
> or two drives or no drives at all and require CD or netbooting and just
> expect everybody to be using NAS boxes? I am just a home server user, what
> do you guys who work on commercial accounts think? How are people using
> these servers?
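A minimal sketch of the slice-based layout described above, assuming four identical drives with hypothetical device names c0t0d0-c0t3d0, each already partitioned so that s0 holds a small 10-20GB slice and s3 holds the remaining space (slice numbers and sizes are purely illustrative):

  # rpool: mirror of small slices on the first two drives
  # (normally created by the installer; shown here for completeness)
  zpool create rpool mirror c0t0d0s0 c0t1d0s0

  # swap pool: mirror of the same-sized slices on the other two drives
  zpool create swappool mirror c0t2d0s0 c0t3d0s0
  zfs create -V 8g swappool/swap
  swap -a /dev/zvol/dsk/swappool/swap

  # data pool from the four large slices; pick one layout:
  zpool create tank mirror c0t0d0s3 c0t1d0s3 mirror c0t2d0s3 c0t3d0s3   # performance (raid10)
  zpool create tank raidz2 c0t0d0s3 c0t1d0s3 c0t2d0s3 c0t3d0s3          # reliability
  zpool create tank raidz1 c0t0d0s3 c0t1d0s3 c0t2d0s3 c0t3d0s3          # space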
Bob Friesenhahn
2011-Jun-16 14:01 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
On Wed, 15 Jun 2011, Nomen Nescio wrote:
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right? Wouldn't it make sense to ship with an odd number of drives
> so you could at least RAIDZ? Or stop making provision for anything except 1

Yes, it all seems pretty silly. Using a small dedicated boot drive (maybe an SSD or CompactFlash) would make sense so that the main disks can all be used in one pool. FreeBSD apparently supports booting from raidz, so it would allow booting from a four-disk raidz pool. Unfortunately, Solaris does not support that.

Given a fixed number of drive bays, there may be value in keeping one drive bay completely unused (hot/cold spare, or empty). The reason is that it allows you to insert new drives in order to upgrade the drives in your pool, or to handle the case of a broken drive bay. Without the ability to insert a new drive, you need to compromise the safety of your pool in order to replace a drive or upgrade the drives to a larger size.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
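A short illustration of the point about a free bay, using a hypothetical pool "tank" and hypothetical device names: with an empty bay you can attach the replacement first and let ZFS resilver onto it while the old disk is still online, so redundancy is never reduced:

  # new disk inserted into the spare bay shows up as c0t4d0 (hypothetical)
  zpool replace tank c0t2d0 c0t4d0     # resilver onto the new disk while c0t2d0 stays in the pool
  zpool status tank                    # wait for the resilver to finish, then pull c0t2d0

Without a free bay you must physically remove c0t2d0 first, leaving the pool degraded for the entire resilver.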
Marty Scholes
2011-Jun-16 14:40 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right?

With four drives you could also make a RAIDZ3 set, allowing you to have the lowest usable space, poorest performance and worst resilver times possible.

Sorry, couldn't resist.
Eric Sproul
2011-Jun-16 14:59 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
On Wed, Jun 15, 2011 at 4:33 PM, Nomen Nescio <nobody at dizum.com> wrote:
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint.

With enterprise-level 2.5" drives hitting 1TB, I've decided to buy only 2.5"-based chassis, which typically provide 6-8 bays in a 1U form factor. That's more than enough to build an rpool mirror and a raidz1+spare, raidz2, or 3x-mirror pool for data. Having 8 bays is also a nice fit for the typical 8-port SAS HBA.

Eric
Edward Ned Harvey
2011-Jun-17 01:15 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nomen Nescio
>
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint.

I don't see the problem. Install the OS onto a mirrored partition, and configure all the remaining storage however you like - raid or mirror or whatever.

My personal preference, assuming 4 disks, since the OS is mostly reads and only a little bit of writes, is to create a 4-way mirrored 100G partition for the OS, and the remaining 900G of each disk (or whatever) becomes either a stripe of mirrors or raidz, as appropriate in your case, for the storage pool.
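A rough sketch of that layout, again with hypothetical device names and assuming each disk carries a ~100G slice (s0) and a large slice (s3):

  # 4-way mirrored root on the small slices (the installer normally builds the rpool)
  zpool create rpool mirror c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0

  # data pool on the large slices: either a stripe of mirrors...
  zpool create tank mirror c0t0d0s3 c0t1d0s3 mirror c0t2d0s3 c0t3d0s3
  # ...or a raidz
  zpool create tank raidz c0t0d0s3 c0t1d0s3 c0t2d0s3 c0t3d0s3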
Daniel Carosone
2011-Jun-17 02:26 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote:
> My personal preference, assuming 4 disks, since the OS is mostly reads and
> only a little bit of writes, is to create a 4-way mirrored 100G partition
> for the OS, and the remaining 900G of each disk (or whatever) becomes either
> a stripe of mirrors or raidz, as appropriate in your case, for the
> storage pool.

Is it still the case, as it once was, that allocating anything other than whole disks as vdevs forces NCQ / write cache off on the drive (either or both, forget which, guess write cache)?

If so, can this be forced back on somehow to regain performance when known to be safe?

I think the original assumption was that zfs-in-a-partition likely implied the disk was shared with ufs, rather than another async-safe pool.

--
Dan.
Edward Ned Harvey
2011-Jun-17 02:40 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Thursday, June 16, 2011 10:27 PM
>
> Is it still the case, as it once was, that allocating anything other
> than whole disks as vdevs forces NCQ / write cache off on the drive
> (either or both, forget which, guess write cache)?

I will only say that, regardless of whether or not that is or ever was true, I believe it's entirely irrelevant. Because your system performs read and write caching and buffering in RAM, the tiny little RAM on the disk can't possibly contribute anything.

When it comes to reads: the OS does readahead more intelligently than the disk could ever hope. Hardware readahead is useless.

When it comes to writes: categorize as either async or sync.

When it comes to async writes: the OS will buffer and optimize, and the applications have long since marched onward before the disk even sees the data. It's irrelevant how much time has elapsed before the disk finally commits to platter.

When it comes to sync writes: the write will not be completed, and the application will block, until all the buffers have been flushed - both RAM and disk buffer. So neither the RAM nor the disk buffer is able to help you.

It's like selling USB fobs labeled USB2 or USB3. If you look up or measure the actual performance of any one of these devices, they can't come anywhere near the bus speed... In fact, I recently paid $45 for a USB3 16G fob, which is finally able to achieve 380 Mbit. Oh, thank goodness I'm no longer constrained by that slow 480 Mbit bus... ;-) Even so, my new fob is painfully slow compared to a normal cheap-o USB2 hard disk. They just put these labels on there because it's a marketing requirement - something that formerly mattered one day, but people still use as a purchasing decider.
Daniel Carosone
2011-Jun-17 03:05 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Thursday, June 16, 2011 10:27 PM
> >
> > Is it still the case, as it once was, that allocating anything other
> > than whole disks as vdevs forces NCQ / write cache off on the drive
> > (either or both, forget which, guess write cache)?
>
> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant. Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

I disagree. It can vastly help improve the IOPS of the disk and keep the channel open for more transactions while one is in progress. Otherwise, the channel is idle, blocked on command completion, while the heads seek.

> When it comes to reads: The OS does readahead more intelligently than the
> disk could ever hope. Hardware readahead is useless.

Little argument here, although the disk is aware of physical geometry and may well read an entire track.

> When it comes to writes: Categorize as either async or sync.
>
> When it comes to async writes: The OS will buffer and optimize, and the
> applications have long since marched onward before the disk even sees the
> data. It's irrelevant how much time has elapsed before the disk finally
> commits to platter.

Irrelevant to the application in the short term, but not to the system. TXG closes have to wait for that, and applications have to wait for those to close so the next can open and accept new writes.

> When it comes to sync writes: The write will not be completed, and the
> application will block, until all the buffers have been flushed. Both ram
> and disk buffer. So neither the ram nor disk buffer is able to help you.

Yes. With write cache on in the drive, and especially with multiple outstanding commands, the async writes can all be streamed quickly to the disk. Then a cache sync can be issued before the sync/FUA writes to close the txg are done.

Without write cache, each async write (though deferred and perhaps coalesced) is synchronous to platters. This adds latency and decreases IOPS, impacting other operations (reads) as well. Please measure it; you will find this impact significant and perhaps even drastic for some quite realistic workloads.

All this before the disk write cache has any chance to provide additional benefit by seek optimisations - i.e., regardless of whether it is successful or not in doing so.

--
Dan.
Neil Perrin
2011-Jun-17 04:28 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On 06/16/11 20:26, Daniel Carosone wrote:
> On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote:
>> My personal preference, assuming 4 disks, since the OS is mostly reads and
>> only a little bit of writes, is to create a 4-way mirrored 100G partition
>> for the OS, and the remaining 900G of each disk (or whatever) becomes either
>> a stripe of mirrors or raidz, as appropriate in your case, for the
>> storage pool.
>
> Is it still the case, as it once was, that allocating anything other
> than whole disks as vdevs forces NCQ / write cache off on the drive
> (either or both, forget which, guess write cache)?

It was once the case that using a slice as a vdev forced the write cache off, but I just tried it and found it wasn't disabled - at least with the current source. In fact it looks like we no longer change the setting. You may want to experiment yourself on your ZFS version (see below for how to check).

> If so, can this be forced back on somehow to regain performance when
> known to be safe?

Yes: "format -e" -> select disk -> "cache" -> "write" -> "display"/"enable"/"disable"

> I think the original assumption was that zfs-in-a-partition likely
> implied the disk was shared with ufs, rather than another async-safe
> pool.

- Correct.

Neil.
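For reference, a session on a typical build looks roughly like the following; the exact prompt and menu wording varies between Solaris releases, so treat this as an illustration rather than a transcript, and the disk number is hypothetical:

  # format -e
  Specify disk (enter its number): 1
  format> cache
  cache> write_cache
  write_cache> display
  Write Cache is enabled
  write_cache> enable        (or "disable", then "quit" back out of each menu)

If the submenu names differ on your build, the "cache" -> "write" -> "display"/"enable"/"disable" path given above is the one to follow.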
Edward Ned Harvey
2011-Jun-17 11:06 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Thursday, June 16, 2011 10:27 PM
>
> Is it still the case, as it once was, that allocating anything other
> than whole disks as vdevs forces NCQ / write cache off on the drive
> (either or both, forget which, guess write cache)?

I will only say that, regardless of whether or not that is or ever was true, I believe it's entirely irrelevant. Because your system performs read and write caching and buffering in RAM, the tiny little RAM on the disk can't possibly contribute anything.

When it comes to reads: the OS does readahead more intelligently than the disk could ever hope. Hardware readahead is useless.

When it comes to writes: categorize as either async or sync.

When it comes to async writes: the OS will buffer and optimize, and the applications have long since marched onward before the disk even sees the data. It's irrelevant how much time has elapsed before the disk finally commits to platter.

When it comes to sync writes: the write will not be completed, and the application will block, until all the buffers have been flushed - both RAM and disk buffer. So neither the RAM nor the disk buffer is able to help you.

It's like selling USB fobs labeled USB2 or USB3. If you look up or measure the actual performance of any one of these devices, they can't come anywhere near the bus speed... In fact, I recently paid $45 for a USB3 16G fob, which is finally able to achieve 380 Mbit. Oh, thank goodness I'm no longer constrained by that slow 480 Mbit bus... ;-) Even so, my new fob is painfully slow compared to a normal cheap-o USB2 hard disk. They just put these labels on there because it's a marketing requirement - something that formerly mattered one day, but people still use as a purchasing decider.
Edward Ned Harvey
2011-Jun-17 11:41 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Thursday, June 16, 2011 11:05 PM
>
> the [sata] channel is idle, blocked on command completion, while
> the heads seek.

I'm interested in proving this point, because I believe it's false.

Just hand waving for the moment ... presenting the alternative viewpoint that I think is correct...

All drives, regardless of whether or not their disk cache or buffer is enabled, support PIO and DMA. This means that no matter the state of the cache or buffer, the bus will deliver information to/from the memory of the disk as fast as possible, the disk will optimize the visible workload to the best of its ability, and the disk will report back an interrupt when each operation is completed, out of order.

The difference between enabling or disabling the disk write buffer is this: if the write buffer is disabled, it still gets used temporarily ... but the disk doesn't signal "completed" until the buffer is flushed to platter. If the disk write buffer is enabled, the disk will immediately report "completed" as soon as it receives the data, before flushing to platter... And if your application happens to have issued the write in "sync" mode (or via fsync()), your OS will additionally issue the hardware sync command, and your application will block until the hardware sync has completed.

It would be stupid for a disk to hog the bus in an idle state.
Jim Klimov
2011-Jun-17 12:30 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-17 15:41, Edward Ned Harvey wrote:
>> From: Daniel Carosone [mailto:dan at geek.com.au]
>> Sent: Thursday, June 16, 2011 11:05 PM
>>
>> the [sata] channel is idle, blocked on command completion, while
>> the heads seek.
>
> I'm interested in proving this point. Because I believe it's false.
>
> Just hand waving for the moment ... Presenting the alternative viewpoint
> that I think is correct...

I'm also interested to hear the in-the-trenches specialists and architects on this point. However, the way it was explained to me a while ago, disk caches and higher interface speeds really matter in large arrays, where you have one (okay, 8) links from your controller to a backplane with dozens of disks, and the faster any one of these disks completes its bursty operation, the less latency is induced on the array as a whole.

So even if a spinning drive cannot sustain 6Gbps, its 64MB of cache quite happily can spit out (or read in) its bit of data, free the bus, and let the many other drives spit out theirs. I am not sure if this is relevant to, say, a motherboard controller where one chip processes 6-8 disks, but maybe there's something to it too...

//Jim
Jim Klimov
2011-Jun-17 12:55 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-17 15:06, Edward Ned Harvey wrote:
> When it comes to reads: The OS does readahead more intelligently than the
> disk could ever hope. Hardware readahead is useless.

Here's another (lame?) question to the experts, partly as a followup to my last post about large arrays and an essentially shared bus that should be freed ASAP: can the OS request a disk readahead (send a small command and release the bus) and then later poll the disk('s cache) for the readahead results? That is, it would not "hold the line" between sending a request and receiving the result.

Alternatively, does it work as a packeted protocol (so that in effect requests and responses do not "hold the line", but the controller must keep state - are these the command queues?), and so the ability to transfer packets faster and free the shared ether between disks, backplanes and controllers is critical per se?

Thanks,
//Jim

The more I know, the more I know how little I know ;)
Ross Walker
2011-Jun-18 01:48 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 17, 2011, at 7:06 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant. Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

You would be surprised.

The on-disk buffer is there so data is ready when the hard drive head lands; without it the drive's average rotational latency will trend higher due to missed landings, because the data wasn't in the buffer at the right time.

The read buffer is there to allow the disk to continuously read sectors whether the system bus is ready to transfer or not. Without it, sequential reads wouldn't last long enough to reach max throughput before they would have to pause because of bus contention, then suffer a rotation of latency, which would kill read performance.

Try disabling the on-board write or read cache and see how your sequential IO performs, and you'll see just how valuable those puny caches are.

-Ross
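One quick (and crude) way to see the effect Ross describes, assuming a scratch disk with the hypothetical name c0t4d0 that holds no data you care about: toggle the write cache with format -e as described earlier, then compare raw sequential rates on the character device so the filesystem cache stays out of the picture:

  # sequential read from the raw device
  dd if=/dev/rdsk/c0t4d0s0 of=/dev/null bs=1024k count=2048

  # sequential write to the raw device - DESTROYS DATA on that slice
  dd if=/dev/zero of=/dev/rdsk/c0t4d0s0 bs=1024k count=2048

Run each a few times and average; the difference with the on-disk cache disabled tends to show up most clearly on the write test.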
Richard Elling
2011-Jun-18 23:47 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:
> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>> From: Daniel Carosone [mailto:dan at geek.com.au]
>>> Sent: Thursday, June 16, 2011 10:27 PM
>>>
>>> Is it still the case, as it once was, that allocating anything other
>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>> (either or both, forget which, guess write cache)?
>>
>> I will only say, that regardless of whether or not that is or ever was true,
>> I believe it's entirely irrelevant. Because your system performs read and
>> write caching and buffering in ram, the tiny little ram on the disk can't
>> possibly contribute anything.
>
> I disagree. It can vastly help improve the IOPS of the disk and keep
> the channel open for more transactions while one is in progress.
> Otherwise, the channel is idle, blocked on command completion, while
> the heads seek.

Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs.

SSDs are another story; they scale much better in the response time and IOPS vs. queue depth analysis.

Has anyone else studied this?
-- richard
Andrew Gabriel
2011-Jun-19 13:04 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote:
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-( My data is leading me to want to restrict
> the queue depth to 1 or 2 for HDDs.

Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator-sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this.

This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. There are lots of options and compromises, generally weighing reduction in total seek time against longest response time.

The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end.

If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. There are lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time.

--
Andrew Gabriel
Edward Ned Harvey
2011-Jun-19 13:28 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Richard Elling [mailto:richard.elling at gmail.com]
> Sent: Saturday, June 18, 2011 7:47 PM
>
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-(

Could you clarify what you mean by that? I was planning, in the near future, to go run iozone on some system with, and without, the disk cache enabled according to format -e. If my hypothesis is right, it shouldn't significantly affect the IOPS, which seems to be corroborated by your message.

I was also planning to perform sequential throughput testing on two disks simultaneously, with and without the disk cache enabled. If one disk were actually able to hog the bus in an idle state, it would mean the total combined throughput with cache disabled would be equal to that of a single disk. (Which I highly doubt.)

> However the response time does [increase] :-(

This comment seems to indicate that the drive queues up a whole bunch of requests, and since the queue is large, each individual response time has become large. It's not that actual physical performance has degraded with the cache enabled, it's that the queue has become long. For async writes, you don't really care how long the queue is, but if you have a mixture of async writes and occasional sync writes... then the queue gets long, and when you sync, the sync operation will take a long time to complete. You might actually benefit by disabling the disk cache.

Richard, have I gotten the gist of what you're saying?

Incidentally, I have done extensive testing of enabling/disabling the HBA writeback cache. I found that as long as you have a dedicated log device for sync writes, your performance is significantly better with the HBA writeback disabled. Something on the order of 15% better.
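For what it's worth, a plausible iozone invocation for this kind of comparison (the file path is a placeholder; -e includes fsync/flush times in the measurement, and the file size needs to be well above installed RAM so the ARC does not hide the disks entirely):

  iozone -e -i 0 -i 1 -i 2 -r 128k -s 64g -f /tank/testfile

Run it once with the disk write cache enabled and once with it disabled via format -e, and compare the random read/write results in particular.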
Richard Elling
2011-Jun-19 15:03 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:richard.elling at gmail.com]
>> Sent: Saturday, June 18, 2011 7:47 PM
>>
>> Actually, all of the data I've gathered recently shows that the number of
>> IOPS does not significantly increase for HDDs running random workloads.
>> However the response time does :-(
>
> Could you clarify what you mean by that?

Yes. I've been looking at what the value of zfs_vdev_max_pending should be. The old value was 35 (a guess, but a really bad guess) and the new value is 10 (another guess, but a better guess). I observe that for a fast, modern HDD, over 1-10 threads (outstanding I/Os) the IOPS range from 309 to 333. But as we add threads, the average response time increases from 2.3ms to 137ms. Since the whole idea is to get lower response time, and we know disks are not simple queues so there is no direct IOPS-to-response-time relationship, maybe it is simply better to limit the number of outstanding I/Os. FWIW, I left disksort enabled (the default).

> I was planning, in the near
> future, to go run iozone on some system with, and without the disk cache
> enabled according to format -e. If my hypothesis is right, it shouldn't
> significantly affect the IOPS, which seems to be corroborated by your
> message.

iozone is a file system benchmark; it won't tell you much about IOPS at the disk level. Be aware of all of the caching that goes on there. I used a simple vdbench test on the raw device: vary I/O size from 512 bytes to 128KB, vary threads from 1 to 10, full stroke, 4KB random, read and write.

> I was also planning to perform sequential throughput testing on two disks
> simultaneously, with and without the disk cache enabled. If one disk is
> actually able to hog the bus in an idle state, it should mean the total
> combined throughput with cache disabled would be equal to a single disk.
> (Which I highly doubt.)
>
>> However the response time does [increase] :-(
>
> This comment seems to indicate that the drive queues up a whole bunch of
> requests, and since the queue is large, each individual response time has
> become large. It's not that physical actual performance has degraded with
> the cache enabled, it's that the queue has become long. For async writes,
> you don't really care how long the queue is, but if you have a mixture of
> async writes and occasional sync writes... Then the queue gets long, and
> when you sync, the sync operation will take a long time to complete. You
> might actually benefit by disabling the disk cache.
>
> Richard, have I gotten the gist of what you're saying?

I haven't formed an opinion yet, but I'm inclined towards wanting overall better latency.

> Incidentally, I have done extensive testing of enabling/disabling the HBA
> writeback cache. I found that as long as you have a dedicated log device
> for sync writes, your performance is significantly better by disabling the
> HBA writeback. Something on order of 15% better.

Yes, I recall these tests.
-- richard
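For anyone who wants to experiment with this tunable on their own system, the usual mechanisms on Solaris-derived kernels are shown below; the value 10 is only an example:

  # display the current value
  echo zfs_vdev_max_pending/D | mdb -k

  # change it on the running system (takes effect immediately, not persistent)
  echo zfs_vdev_max_pending/W0t10 | mdb -kw

  # make it persistent across reboots by adding this line to /etc/system
  set zfs:zfs_vdev_max_pending = 10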
Richard Elling
2011-Jun-19 15:06 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
> Richard Elling wrote:
>> Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs.
>
> Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS through reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this.

I agree. And disksort is in the mix, too.

> This is something I played with ~30 years ago, when the OS disk driver was responsible for the queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb.

...and disksort still survives... maybe we should kill it?

> There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. Best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time.

The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk. So it might make better sense for ZFS to keep the disk queue depth small for HDDs.
-- richard
Daniel Carosone
2011-Jun-20 03:52 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Fri, Jun 17, 2011 at 07:41:41AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Thursday, June 16, 2011 11:05 PM
> >
> > the [sata] channel is idle, blocked on command completion, while
> > the heads seek.
>
> I'm interested in proving this point. Because I believe it's false.
>
> Just hand waving for the moment ... Presenting the alternative viewpoint
> that I think is correct...
>
> All drives, regardless of whether or not their disk cache or buffer is
> enabled, support PIO and DMA. This means no matter the state of the cache
> or buffer, the bus will deliver information to/from the memory of the disk
> as fast as possible, and the disk will optimize the visible workload to the
> best of its ability, and the disk will report back an interrupt when each
> operation is completed out-of-order.

Yes, up to that last "out-of-order". Without NCQ, requests are in-order and wait for completion with the channel idle.

> It would be stupid for a disk to hog the bus in an idle state.

Yes, but remember that ATA was designed originally to be stupid (simple). The complexity has crept in over time. Understanding the history and development order is important here.

So, for older ATA disks, commands would transfer relatively quickly over the channel, which would then remain idle until a completion interrupt. Systems got faster. Write cache was added to make writes "complete" faster; read cache (with prefetch) was added in the hope of satisfying read requests faster and freeing up the channel. Systems got faster. NCQ was added (rather, TCQ was reinvented and crippled) to try and get better concurrency. NCQ supports only a few outstanding ops, in part because write cache was by then established practice (turning it off would adversely impact benchmarks, especially for software that couldn't take advantage of concurrency).

So, today with NCQ, writes are again essentially in-order (to cache) until the cache is full and requests start blocking. NCQ may offer some benefit to concurrent reads, but again of little value if the cache is full.

Furthermore, the disk controllers may not be doing such a great job when given concurrent requests anyway, as Richard mentions elsewhere. Will reply to those points a little later.

--
Dan.
Edward Ned Harvey
2011-Jun-20 12:48 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Richard Elling [mailto:richard.elling at gmail.com]
> Sent: Sunday, June 19, 2011 11:03 AM
>
> > I was planning, in the near
> > future, to go run iozone on some system with, and without the disk cache
> > enabled according to format -e. If my hypothesis is right, it shouldn't
> > significantly affect the IOPS, which seems to be corroborated by your
> > message.
>
> iozone is a file system benchmark, won't tell you much about IOPS at the
> disk level. Be aware of all of the caching that goes on there.

Yeah, that's the whole point. The basis of my argument was: due to the caching & buffering the system does in RAM, the disks' cache & buffer are not relevant. The conversation spawns from the premise of whole-disk versus partition-based pools, possibly toggling the disk cache to off. See the subject of this email. ;-)

Hopefully I'll have time to (dis)prove that conjecture this week.
Gary Mills
2011-Jun-20 13:31 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>> Sent: Saturday, June 18, 2011 7:47 PM
>>>
>>> Actually, all of the data I've gathered recently shows that the number of
>>> IOPS does not significantly increase for HDDs running random workloads.
>>> However the response time does :-(
>>
>> Could you clarify what you mean by that?
>
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast, modern
> HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to 333 IOPS.
> But as we add threads, the average response time increases from 2.3ms to 137ms.
> Since the whole idea is to get lower response time, and we know disks are not
> simple queues so there is no direct IOPS to response time relationship, maybe it
> is simply better to limit the number of outstanding I/Os.

How would this work for a storage device with an intelligent controller that provides only a few LUNs to the host, even though it contains a much larger number of disks? I would expect the controller to be more efficient with a large number of outstanding IOs because it could distribute those IOs across the disks. It would, of course, require a non-volatile cache to provide fast turnaround for writes.

--
-Gary Mills-        -Unix Group-        -Computer and Network Services-
Richard Elling
2011-Jun-20 15:21 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 20, 2011, at 6:31 AM, Gary Mills wrote:
> On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
>> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>>>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>>> Sent: Saturday, June 18, 2011 7:47 PM
>>>>
>>>> Actually, all of the data I've gathered recently shows that the number of
>>>> IOPS does not significantly increase for HDDs running random workloads.
>>>> However the response time does :-(
>>>
>>> Could you clarify what you mean by that?
>>
>> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
>> The old value was 35 (a guess, but a really bad guess) and the new value is
>> 10 (another guess, but a better guess). I observe that data from a fast, modern
>> HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to 333 IOPS.
>> But as we add threads, the average response time increases from 2.3ms to 137ms.
>> Since the whole idea is to get lower response time, and we know disks are not
>> simple queues so there is no direct IOPS to response time relationship, maybe it
>> is simply better to limit the number of outstanding I/Os.
>
> How would this work for a storage device with an intelligent
> controller that provides only a few LUNs to the host, even though it
> contains a much larger number of disks? I would expect the controller
> to be more efficient with a large number of outstanding IOs because it
> could distribute those IOs across the disks. It would, of course,
> require a non-volatile cache to provide fast turnaround for writes.

Yes, I've set it as high as 4,000 for a fast storage array. One size does not fit all. For normal operations, with a separate log and HDDs in the pool, I'm leaning towards 16 - except when resilvering or scrubbing, in which case 1 is better for HDDs.
-- richard
Richard Elling
2011-Jun-20 15:24 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
On Jun 15, 2011, at 1:33 PM, Nomen Nescio wrote:
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right? Wouldn't it make sense to ship with an odd number of drives
> so you could at least RAIDZ? Or stop making provision for anything except 1
> or two drives or no drives at all and require CD or netbooting and just
> expect everybody to be using NAS boxes? I am just a home server user, what
> do you guys who work on commercial accounts think? How are people using
> these servers?

I see 2 disks for boot and usually one or more 24-disk JBODs. A few 12-disk JBODs are still being sold, but I rarely see a single 12-disk JBOD. I'm also seeing a few SBBs that have 16 disks and boot from SATA DOMs. Anyone else?
-- richard
Andrew Gabriel
2011-Jun-20 15:37 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote:
> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>> Richard Elling wrote:
>>> Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs.
>>
>> Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS through reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this.
>
> I agree. And disksort is in the mix, too.

Oh, I'd never looked at that.

>> This is something I played with ~30 years ago, when the OS disk driver was responsible for the queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb.
>
> ...and disksort still survives... maybe we should kill it?

It looks like it's possibly slightly worse than the pathologically worst response time case I described below...

>> There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. Best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time.
>
> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk.

Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail where it plugs in to the whole framework).

> So it might make better sense for ZFS to keep the disk queue depth small for HDDs.
> -- richard

--
Andrew Gabriel
Garrett D'Amore
2011-Jun-20 15:43 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
For SSD we have code in illumos that disables disksort. Ultimately, we believe that the cost of disksort is in the noise for performance.

-- Garrett D'Amore

On Jun 20, 2011, at 8:38 AM, "Andrew Gabriel" <Andrew.Gabriel at oracle.com> wrote:
> Richard Elling wrote:
>> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>>> Richard Elling wrote:
>>>> Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs.
>>>
>>> Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS through reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this.
>>
>> I agree. And disksort is in the mix, too.
>
> Oh, I'd never looked at that.
>
>>> This is something I played with ~30 years ago, when the OS disk driver was responsible for the queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb.
>>
>> ...and disksort still survives... maybe we should kill it?
>
> It looks like it's possibly slightly worse than the pathologically worst response time case I described below...
>
>>> There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. Best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time.
>>
>> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk.
>
> Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail where it plugs in to the whole framework).
>
>> So it might make better sense for ZFS to keep the disk queue depth small for HDDs.
>> -- richard
>
> --
> Andrew Gabriel
Nomen Nescio
2011-Jun-21 00:00 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Hello Marty!

> With four drives you could also make a RAIDZ3 set, allowing you to have
> the lowest usable space, poorest performance and worst resilver times
> possible.

That's not funny. I was actually considering this :p But you have to admit, it would probably be somewhat reliable!
Daniel Carosone
2011-Jun-21 00:58 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast, modern
> HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to 333 IOPS.
> But as we add threads, the average response time increases from 2.3ms to 137ms.

Interesting. What happens to total throughput, since that's the expected tradeoff against latency here? I might guess that in your tests with a constant I/O size it's linear with IOPS - but I wonder if that remains so for larger I/O or with mixed sizes.

> Since the whole idea is to get lower response time, and we know disks are not
> simple queues so there is no direct IOPS to response time relationship, maybe it
> is simply better to limit the number of outstanding I/Os.

I also wonder if we're seeing a form of "bufferbloat" here in these latencies. As I wrote in another post yesterday, remember that you're not counting actual outstanding I/Os here, because the write I/Os are being acknowledged immediately and tracked internally. The disk may therefore be getting itself into a state where either the buffer/queue is effectively full, or the number of requests it is tracking internally becomes inefficient (as well as the head-thrashing).

Even before you get to that state and writes start slowing down too, your averages are skewed by write cache. All the writes are fast, while a longer queue exposes reads to contention with each other, as well as to a much wider window of writes. Can you look at the average response time for just the reads, even amongst a mixed r/w workload? Perhaps some alternate statistic than the average, too.

Can you repeat the tests with write cache disabled, so you're more accurately exposing the controller's actual workload and backlog? I hypothesise that this will avoid those latencies getting so ridiculously out of control, and potentially also show better (relative) results for higher concurrency counts. Alternately, it will show that your disk firmware really is horrible at managing concurrency even for small values :) Whether it shows better absolute results than a shorter queue + write cache is an entirely different question. The write cache will certainly make things faster in the common case, which is another way of saying that your lower-bound average latencies are artificially low and making the degradation look worse.

> > This comment seems to indicate that the drive queues up a whole bunch of
> > requests, and since the queue is large, each individual response time has
> > become large. It's not that physical actual performance has degraded with
> > the cache enabled, it's that the queue has become long. For async writes,
> > you don't really care how long the queue is, but if you have a mixture of
> > async writes and occasional sync writes... Then the queue gets long, and
> > when you sync, the sync operation will take a long time to complete. You
> > might actually benefit by disabling the disk cache.
> >
> > Richard, have I gotten the gist of what you're saying?
>
> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

And, in particular, better latency for the specific (read) requests that ZFS prioritises; these are often the ones that contribute most to a system feeling unresponsive. If this prioritisation is lost once passed to the disk, both because the disk doesn't have a priority mechanism and because it's contending with the deferred cost of previous writes, then you'll get better latency for the requests you care most about with a shorter queue.

--
Dan.
Dave U. Random
2011-Jun-21 07:30 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Hello Jim! I understood ZFS doesn't like slices but from your reply maybe I should reconsider. I have a few older servers with 4 bays x 73G. If I make a root mirror pool and swap on the other 2 as you suggest, then I would have about 63G x 4 left over.

If so then I am back to wondering what to do about 4 drives. Is raidz1 worthwhile in this scenario? That is less redundancy than a mirror and much less than a 3 way mirror, isn't it? Is it even possible to do raidz2 on 4 slices? Or would 2, 2 way mirrors be better? I don't understand what RAID10 is; is it simply a stripe of two mirrors? Or would it be best to do a 3 way mirror and a hot spare? I would like to be able to tolerate losing one drive without loss of integrity.

I will be doing new installs of Solaris 10. Is there an option in the installer for me to issue ZFS commands and set up pools, or do I need to format the disks before installing, and if so how do I do that? Thank you.
Bob Friesenhahn
2011-Jun-21 14:08 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, 19 Jun 2011, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess). I observe that data from a fast, modern

I am still using 5 here. :-)

> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

Most properly implemented systems are not running at maximum capacity, so decreased latency is definitely desirable so that applications obtain the best CPU usage and short-lived requests do not clog the system. Typical benchmark scenarios (max sustained or peak throughput) do not represent most real-world usage. The 60 or 80% solution (with assured reasonable response time) is definitely better than the 99% solution when it comes to user satisfaction.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tomas Ögren
2011-Jun-21 15:03 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
On 21 June, 2011 - Nomen Nescio sent me these 0,4K bytes:

> Hello Marty!
>
> > With four drives you could also make a RAIDZ3 set, allowing you to have
> > the lowest usable space, poorest performance and worst resilver times
> > possible.
>
> That's not funny. I was actually considering this :p

4-way mirror would be way more useful.

> But you have to admit, it would probably be somewhat reliable!

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Richard Elling
2011-Jun-21 16:04 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 21, 2011, at 8:18 AM, Garrett D'Amore wrote:
>> Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail where it plugs in to the whole framework).
>>
>>> So it might make better sense for ZFS to keep the disk queue depth small for HDDs.
>>> -- richard
>
> disksort is much further down than zio priorities... by the time disksort sees them they have already been sorted in priority order.

Yes, disksort is at sd. So ZFS schedules I/Os, disksort reorders them, and the drive reorders them again. To get the best advantage out of the ZFS priority ordering, I can make an argument to disable disksort and keep vdev_max_pending low to limit the reordering work done by the drive. I am not convinced that traditional benchmarks show the effects of ZFS priority ordering, though.
-- richard
Dave U. Random
2011-Jun-22 19:45 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Hello!

> I don't see the problem. Install the OS onto a mirrored partition, and
> configure all the remaining storage however you like - raid or mirror or
> whatever.

I didn't understand your point of view until I read the next paragraph.

> My personal preference, assuming 4 disks, since the OS is mostly reads and
> only a little bit of writes, is to create a 4-way mirrored 100G partition
> for the OS, and the remaining 900G of each disk (or whatever) becomes
> either a stripe of mirrors or raidz, as appropriate in your case, for the
> storage pool.

Oh, you are talking about 1T drives and my servers are all 4x73G! So it's a fairly big deal since I have little storage to waste and still want to be able to survive losing one drive. I should have given the numbers at the beginning, sorry. Given this meager storage do you have any suggestions? Thank you.
Edward Ned Harvey
2011-Jun-22 21:31 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave U.Random
>
> > My personal preference, assuming 4 disks, since the OS is mostly reads and
> > only a little bit of writes, is to create a 4-way mirrored 100G partition
> > for the OS, and the remaining 900G of each disk (or whatever) becomes
> > either a stripe of mirrors or raidz, as appropriate in your case, for the
> > storage pool.
>
> Oh, you are talking about 1T drives and my servers are all 4x73G! So it's a
> fairly big deal since I have little storage to waste and still want to be
> able to survive losing one drive.

Well ...
Slice all 4 drives into 13G and 60G.
Use a mirror of 13G for the rpool.
Use 4x 60G in some way (raidz, or stripe of mirrors) for tank
Use a mirror of 13G appended to tank

That would use all your space as efficiently as possible, while providing at
least one level of redundancy. The only sacrifice you're making is that you
get different performance characteristics from a raidz and a mirror, which
are both in the same pool. For example, you might decide the ideal
performance characteristics for your workload are to use raidz... or to use
mirrors... but your pool is a hybrid, so you can't achieve the ideal
performance characteristics no matter which type of data workload you have.
That is a very small sacrifice, considering the constraints you're up
against as initial conditions: "I have 4x 73G disks", "I want to survive a
single disk failure", "I don't want to waste any space", and "My boot pool
must be included."

The only conclusion you can draw from that is: First take it as a given
that you can't boot from a raidz volume. Given, you must have one mirror.
Then you raidz all the remaining space that's capable of being put into a
raidz... And what you have left is a pair of unused space, equal to the
size of your boot volume. You either waste that space, or you mirror it
and put it into your tank. It's really the only solution, without changing
your hardware or design constraints.
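A minimal sketch of that layout in zpool terms (slice and device names are assumptions: s0 = 13G and s3 = 60G on disks c1t0d0 through c1t3d0, with rpool already installed as a mirror of the first two s0 slices):

# zpool create tank raidz c1t0d0s3 c1t1d0s3 c1t2d0s3 c1t3d0s3
# zpool add -f tank mirror c1t2d0s0 c1t3d0s0

(zpool add warns about the mismatched replication level between the raidz and the appended mirror, hence the -f.)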
Nomen Nescio
2011-Jun-23 04:48 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Hello Bob! Thanks for the reply. I was thinking about going with a 3 way mirror and a hot spare. But I don't think I can upgrade to larger drives unless I do it all at once, is that correct?
Dave U. Random
2011-Jun-23 13:38 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Well ...
> Slice all 4 drives into 13G and 60G.
> Use a mirror of 13G for the rpool.
> Use 4x 60G in some way (raidz, or stripe of mirrors) for tank
> Use a mirror of 13G appended to tank

Hi Edward! Thanks for your post. I think I understand what you are saying
but I don't know how to actually do most of that. If I am going to make a
new install of Solaris 10 does it give me the option to slice and dice my
disks and to issue zpool commands? Until now I have only used Solaris on
Intel boxes and used both complete drives as a mirror.

Can you please tell me what are the steps to do your suggestion?

I imagine I can slice the drives in the installer and then set up a 4 way
root mirror (stupid but as you say not much choice) on the 13G section. Or
maybe one root mirror on two slices and then have 13G aux storage left to
mirror for something like /var/spool? What would you recommend? I didn't
understand what you suggested about appending a 13G mirror to tank. Would
that be something like RAID10 without actually being RAID10 so I could still
boot from it? How would the system use it?

In this setup that will install everything on the root mirror so I will
have to move things around later? Like /var and /usr or whatever I don't
want on the root mirror? And then I just make a RAID10 like Jim was saying
with the other 4x60 slices? How should I move mountpoints that aren't
separate ZFS filesystems?

> The only conclusion you can draw from that is: First take it as a given
> that you can't boot from a raidz volume. Given, you must have one mirror.

Thanks, I will keep it in mind.

> Then you raidz all the remaining space that's capable of being put into a
> raidz... And what you have left is a pair of unused space, equal to the
> size of your boot volume. You either waste that space, or you mirror it
> and put it into your tank.

So RAID10 sounds like the only reasonable choice since there are an even
number of slices, I mean is RAIDZ1 even possible with 4 slices?
Paul Kraus
2011-Jun-23 17:25 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
On Thu, Jun 23, 2011 at 12:48 AM, Nomen Nescio <nobody at dizum.com> wrote:

> Hello Bob! Thanks for the reply. I was thinking about going with a 3 way
> mirror and a hot spare. But I don't think I can upgrade to larger drives
> unless I do it all at once, is that correct?

Why keep one out as a Hot Spare ? If you have another zpool and the Hot Spare will be shared, that makes sense. If the drive is powered on and spinning, I don't see any downside to making it a 4-way mirror instead of 3-way + HS.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
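If the fourth drive is spinning anyway, attaching it as an extra mirror side is a single command (pool and device names here are placeholders):

# zpool attach tank c1t0d0 c1t3d0

Since attach adds another side to whichever mirror contains the named existing device, this turns the 3-way mirror into a 4-way one.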
Craig Cory
2011-Jun-23 17:47 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Paul Kraus wrote:

> On Thu, Jun 23, 2011 at 12:48 AM, Nomen Nescio <nobody at dizum.com> wrote:
>
>> Hello Bob! Thanks for the reply. I was thinking about going with a 3 way
>> mirror and a hot spare. But I don't think I can upgrade to larger drives
>> unless I do it all at once, is that correct?
>
> Why keep one out as a Hot Spare ? If you have another zpool and
> the Hot Spare will be shared, that makes sense. If the drive is
> powered on and spinning, I don't see any downside to making it a 4-way
> mirror instead of 3-way + HS.

Also, to move a mirrored pool to larger disks, you can replace the mirror members one at a time: replace one member with a larger disk, wait for the resilver to complete, then replace the next and resilver again.

Craig

--
Craig Cory
 Senior Instructor :: ExitCertified
 : Oracle/Sun Certified System Administrator
 : Oracle/Sun Certified Network Administrator
 : Oracle/Sun Certified Security Administrator
 : Symantec/Veritas Certified Instructor
 : RedHat Certified Systems Administrator

+-------------------------------------------------------------------------+
 ExitCertified :: Excellence in IT Certified Education

 Certified training with Oracle, Sun Microsystems, Apple, Symantec, IBM,
 Red Hat, MySQL, Hitachi Storage, SpringSource and VMWare.

 1.800.803.EXIT (3948) | www.ExitCertified.com
+-------------------------------------------------------------------------+
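A sketch of that one-at-a-time replacement (names are placeholders; let each resilver finish before starting the next replace):

# zpool replace tank c1t0d0 c2t0d0
# zpool status tank          (repeat until the resilver is done)
# zpool replace tank c1t1d0 c2t1d0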
Cindy Swearingen
2011-Jun-23 17:59 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
Hi Dave,

Consider the easiest configuration first and it will probably save you time and money in the long run, like this:

73g x 73g mirror (one large s0 on each disk) - rpool
73g x 73g mirror (use whole disks) - data pool

Then, get yourself two replacement disks, a good backup strategy, and we all sleep better.

Convert the complexity of some of the suggestions to time and money for replacement if something bad happens, and the formula would look like this:

time to configure x time to replace x replacement disks = $$ > cost of two replacement disks for two mirrored pools

A complex configuration of slices and a combination of raidZ and mirrored pools across the same disks will be difficult to administer, performance will be unknown, not to mention how much time it might take to replace a disk.

Use the simplicity of ZFS as it was intended is my advice and you will save time and money in the long run.

Cindy

On 06/23/11 07:38, Dave U. Random wrote:
> Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>
> wrote:
>
>> Well ...
>> Slice all 4 drives into 13G and 60G.
>> Use a mirror of 13G for the rpool.
>> Use 4x 60G in some way (raidz, or stripe of mirrors) for tank
>> Use a mirror of 13G appended to tank
>
> Hi Edward! Thanks for your post. I think I understand what you are saying
> but I don't know how to actually do most of that. If I am going to make a
> new install of Solaris 10 does it give me the option to slice and dice my
> disks and to issue zpool commands? Until now I have only used Solaris on
> Intel boxes and used both complete drives as a mirror.
>
> Can you please tell me what are the steps to do your suggestion?
>
> I imagine I can slice the drives in the installer and then set up a 4 way
> root mirror (stupid but as you say not much choice) on the 13G section. Or
> maybe one root mirror on two slices and then have 13G aux storage left to
> mirror for something like /var/spool? What would you recommend? I didn't
> understand what you suggested about appending a 13G mirror to tank. Would
> that be something like RAID10 without actually being RAID10 so I could still
> boot from it? How would the system use it?
>
> In this setup that will install everything on the root mirror so I will
> have to move things around later? Like /var and /usr or whatever I don't
> want on the root mirror? And then I just make a RAID10 like Jim was saying
> with the other 4x60 slices? How should I move mountpoints that aren't
> separate ZFS filesystems?
>
>> The only conclusion you can draw from that is: First take it as a given
>> that you can't boot from a raidz volume. Given, you must have one mirror.
>
> Thanks, I will keep it in mind.
>
>> Then you raidz all the remaining space that's capable of being put into a
>> raidz... And what you have left is a pair of unused space, equal to the
>> size of your boot volume. You either waste that space, or you mirror it
>> and put it into your tank.
>
> So RAID10 sounds like the only reasonable choice since there are an even
> number of slices, I mean is RAIDZ1 even possible with 4 slices?
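A minimal sketch of that layout (device names c1t0d0 through c1t3d0 are placeholders; the installer normally creates rpool on the first disk's s0, after which the matching s0 on the second disk is attached):

# zpool attach rpool c1t0d0s0 c1t1d0s0
# zpool create datapool mirror c1t2d0 c1t3d0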
Anonymous Remailer (austria)
2011-Jun-23 22:02 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> Hi Dave,

Hi Cindy.

> Consider the easiest configuration first and it will probably save
> you time and money in the long run, like this:
>
> 73g x 73g mirror (one large s0 on each disk) - rpool
> 73g x 73g mirror (use whole disks) - data pool
>
> Then, get yourself two replacement disks, a good backup strategy,
> and we all sleep better.

Oh, you're throwing in free replacement disks too?! This is great! :P

> A complex configuration of slices and a combination of raidZ and
> mirrored pools across the same disks will be difficult to administer,
> performance will be unknown, not to mention how much time it might take
> to replace a disk.

Yeah, that's a very good point. But if you guys will make ZFS filesystems span vdevs, then this could work even better! You're right about the complexity, but on the other hand the great thing about ZFS is not having to worry about planning mount point allocations, and with this scenario (I also have a few servers with 4x36) the planning issue raises its ugly head again. That's why I kind of like Edward's suggestion: even though it is complicated (for me), I still think it may be best given my goals. I like breathing room and not having to worry about a filesystem filling up; it's great not having to know exactly ahead of time how much I have to allocate for a filesystem, and instead letting the whole drive be used as needed.

> Use the simplicity of ZFS as it was intended is my advice and you
> will save time and money in the long run.

Thanks. I guess the answer is really using the small drives for root pools and then getting the biggest drives I can afford for the other bays. Thanks to everybody.
Edward Ned Harvey
2011-Jun-23 23:53 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nomen Nescio
>
> Hello Bob! Thanks for the reply. I was thinking about going with a 3 way
> mirror and a hot spare. But I don't think I can upgrade to larger drives
> unless I do it all at once, is that correct?

No point in doing a 3-way mirror and a hot spare. Just do a 4-way mirror.
Edward Ned Harvey
2011-Jun-24 01:45 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave U.Random
>
> If I am going to make a new install of Solaris 10 does it give me the
> option to slice and dice my disks and to issue zpool commands?

No way that I know of to install Solaris 10 into partitions. Solaris 11 does it. On Solaris 10, if you want to do this, you have to go through a bunch of extra hassle.
Jim Klimov
2011-Jun-27 08:44 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
----- Original Message -----
From: "Dave U. Random" <anonymous at anonymitaet-im-inter.net>
Date: Tuesday, June 21, 2011 18:32
Subject: Re: [zfs-discuss] Server with 4 drives, how to configure ZFS?
To: zfs-discuss at opensolaris.org

> Hello Jim! I understood ZFS doesn't like slices but from your reply maybe I
> should reconsider. I have a few older servers with 4 bays x 73G. If I make a
> root mirror pool and swap on the other 2 as you suggest, then I would have
> about 63G x 4 left over.

For the sake of completeness, I should mention that you can also create a fast and redundant 4-way mirrored root pool ;)

> If so then I am back to wondering what to do about 4 drives. Is raidz1
> worthwhile in this scenario? That is less redundancy than a mirror and much
> less than a 3 way mirror, isn't it? Is it even possible to do raidz2 on 4
> slices? Or would 2, 2 way mirrors be better? I don't understand what RAID10
> is, is it simply a stripe of two mirrors?

Yes, by that I meant a striping over two mirrors.

> Or would it be best to do a 3 way mirror and a hot spare? I would like to be
> able to tolerate losing one drive without loss of integrity.

Any of the scenarios above allows you to lose one drive and not lose data immediately. The rest is a compromise between performance, space and further redundancy:

* 3- or 4-way mirror: least useable space (25% of total disk capacity), most redundancy, highest read speeds for concurrent loads
* striping of mirrors (raid10): average useable space (50%), high read speeds for concurrent loads, can tolerate loss of up to 2 drives (slices) in a "good" scenario (if they are from different mirrors)
* raidz2: average useable space (50%), can tolerate loss of any 2 drives
* raidz1: max useable space (75%), can tolerate loss of any 1 drive

After all the discussions about performance recently on this forum, I would not try to guess which performance would be better in general - raidz1 or raidz2 (reads, writes, scrubs and resilvers seemingly all have different preferences toward disk layout), but with the generic workload we have (i.e. serving up zones with some development databases and J2SE app servers) this was not seen to matter much. So for us it was usually raidz2 for tolerance or raidz1 for space.

> I will be doing new installs of Solaris 10. Is there an option in the
> installer for me to issue ZFS commands and set up pools or do I need to
> format the disks before installing and if so how do I do that?

Unfortunately, I last installed Solaris 10u7 or so from scratch; the others were LiveUpgrades of existing systems and OpenSolaris machines, so I am not certain. From what I gather, the text installer is much more powerful than the graphical one, and its ZFS root setup might encompass creating a root pool in a slice of given size, and possibly mirroring it right away. Maybe you can do likewise in JumpStart, but we did not do that after all.

Anyhow, after you install a ZFS root of sufficient size (i.e. our minimalist Solaris 10 installs are often under 1-2Gb per boot environment; multiply that for storing different OEs like LiveUpgrade does, and for snapshot history), you can create a slice for the data pool component (s3 in our setups), and then clone the disk slice layout to the other 3 drives like this:

# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

(you might need to install the slice table spanning 100% of the drives with the fdisk command first).
Then you attach one of the slices to the ZFS root pool to make a mirror, if the installer did not do that:

# zpool attach rpool c1t0d0s0 c1t1d0s0

If you have several controllers (perhaps even on different PCI buses) you might want to pick a drive on a different controller than the first one in order to have fewer SPoFs, but make sure that the second controller is bootable from the BIOS. And make that drive bootable:

SPARC:
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0

x86/x86_64:
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0

For the two other drives you just create a new pool in the *s0 slices:

# zpool create swappool mirror c1t2d0s0 c1t3d0s0
# zfs create -V2g swappool/dump
# zfs create -V6g swappool/swap

Sizes are arbitrary here; they depend on your RAM sizing. You can later add swap from other pools, including the data pool. Dump device size can be "tested" by configuring dumpadm to use the new device - it would either refuse to use a device that is too small (then you recreate it bigger), or accept it. The installer would probably create dump and swap devices in your root pool; you may elect to destroy them since you have another swap device, at least. Make sure to update the /etc/vfstab file to reference the swap areas which your system should use further on.

After this is all completed, you can create a "data pool" in the s3 slices with your chosen geometry, i.e.:

# zpool create pool raidz2 c1t0d0s3 c1t1d0s3 c1t2d0s3 c1t3d0s3

In our setups this pool holds not only data, but also zone roots (each in a dedicated dataset), separately from the root pool. This allows each zone with its data (possibly in dedicated and delegated sub-datasets) to be a single unit of backup and migration. AFAIK this is not a Sun-supported configuration (they used to require that zone roots are kept with the root FS), but it works well, other than puzzling LiveUpgrade (this depends on versions a lot, though). Regarding the latter, we found that it is faster and less error-prone to detach the zones before LUing, then LU just the global zone (a clone of the current BE), and reattach the local zones in update mode. Maybe with recent LU versions you don't need trickery like that; I can't say now.

> Thank you.

//Jim Klimov
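For completeness, putting the new dump and swap zvols into service might look like this (paths follow the swappool example above):

# dumpadm -d /dev/zvol/dsk/swappool/dump
# swap -a /dev/zvol/dsk/swappool/swap

with a matching /etc/vfstab entry so the swap area returns after reboot:

/dev/zvol/dsk/swappool/swap  -  -  swap  -  no  -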
Jim Klimov
2011-Jun-27 09:11 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> In this setup that will install everything on the root mirror so I will
> have to move things around later? Like /var and /usr or whatever I don't
> want on the root mirror?

Actually, you do want /usr and much of /var on the root pool, they are integral parts of the "svc:/filesystem/local" needed to bring up your system to a useable state (regardless of whether the other pools are working or not).

Depending on the OS versions, you can do manual data migrations to separate datasets of the root pool, in order to keep some data common between OE's or to enforce different quotas or compression rules. For example, on SXCE and Solaris 10 (but not on oi_148a) we successfully splice out many filesystems in such a layout (the example below also illustrates multiple OEs):

# zfs list -o name,refer,quota,compressratio,canmount,mountpoint -t filesystem -r rpool
NAME                      REFER  QUOTA  RATIO  CANMOUNT  MOUNTPOINT
rpool                     7.92M   none  1.45x  on        /rpool
rpool/ROOT                  21K   none  1.38x  noauto    /rpool/ROOT
rpool/ROOT/snv_117         758M   none  1.00x  noauto    /
rpool/ROOT/snv_117/opt    27.1M   none  1.00x  noauto    /opt
rpool/ROOT/snv_117/usr     416M   none  1.00x  noauto    /usr
rpool/ROOT/snv_117/var     122M   none  1.00x  noauto    /var
rpool/ROOT/snv_129         930M   none  1.45x  noauto    /
rpool/ROOT/snv_129/opt     109M   none  2.70x  noauto    /opt
rpool/ROOT/snv_129/usr     509M   none  2.71x  noauto    /usr
rpool/ROOT/snv_129/var     288M   none  2.54x  noauto    /var
rpool/SHARED                18K   none  3.36x  noauto    legacy
rpool/SHARED/var            18K   none  3.36x  noauto    legacy
rpool/SHARED/var/adm      2.97M     5G  4.43x  noauto    legacy
rpool/SHARED/var/cores     118M     5G  3.44x  noauto    legacy
rpool/SHARED/var/crash    1.39G     5G  3.41x  noauto    legacy
rpool/SHARED/var/log       102M     5G  3.43x  noauto    legacy
rpool/SHARED/var/mail     66.4M   none  1.79x  noauto    legacy
rpool/SHARED/var/tmp        20K   none  1.00x  noauto    legacy
rpool/test                50.5K   none  1.00x  noauto    /rpool/test

Mounts of the /var/* components are done via /etc/vfstab lines like:

rpool/SHARED/var/adm    -  /var/adm    zfs  -  yes  -
rpool/SHARED/var/log    -  /var/log    zfs  -  yes  -
rpool/SHARED/var/mail   -  /var/mail   zfs  -  yes  -
rpool/SHARED/var/crash  -  /var/crash  zfs  -  yes  -
rpool/SHARED/var/cores  -  /var/cores  zfs  -  yes  -

while the system paths /usr, /var and /opt are mounted by SMF services directly.

> And then I just make a RAID10 like Jim was saying with the other 4x60
> slices? How should I move mountpoints that aren't separate ZFS filesystems?
>
> > The only conclusion you can draw from that is: First take it as a given
> > that you can't boot from a raidz volume. Given, you must have one mirror.
>
> Thanks, I will keep it in mind.
>
> > Then you raidz all the remaining space that's capable of being put into a
> > raidz... And what you have left is a pair of unused space, equal to the
> > size of your boot volume. You either waste that space, or you mirror it
> > and put it into your tank.

...or use it as swap space :)

> I didn't understand what you suggested about appending a 13G mirror to
> tank. Would that be something like RAID10 without actually being RAID10 so
> I could still boot from it? How would the system use it?

No, this would be an uneven striping over a raid10 (or raidzN) bank of 60Gb slices plus a 13Gb mirror. ZFS can do that too, although unbalanced pools are not recommended for performance reasons and have to be forced on the command line. And you can not boot from any pool other than a mirror or a single drive.
Rationale: a single BIOS device must be sufficient to boot the system and contain all the data needed to boot.

> So RAID10 sounds like the only reasonable choice since there are an even
> number of slices, I mean is RAIDZ1 even possible with 4 slices?

Yes, it is possible with any number of slices starting from 3.

--
+============================================================+
|                                                            |
| Jim Klimov                                                 |
| CTO, JSC COS&HT                                            |
|                                                            |
| +7-903-7705859 (cellular)          mailto:jimklimov at cos.ru |
|                        CC:admin at cos.ru,jimklimov at gmail.com |
+============================================================+
| ()  ascii ribbon campaign - against html mail               |
| /\                        - against microsoft attachments   |
+============================================================+
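As a sketch of how such shared legacy datasets come into being (names follow the rpool/SHARED layout shown above; the quota and compression values are only examples, and child datasets inherit the legacy mountpoint):

# zfs create -o mountpoint=legacy rpool/SHARED
# zfs create rpool/SHARED/var
# zfs create -o quota=5g -o compression=on rpool/SHARED/var/log

with the corresponding /etc/vfstab line:

rpool/SHARED/var/log  -  /var/log  zfs  -  yes  -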
Jim Klimov
2011-Jun-27 09:14 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> Hello Bob! Thanks for the reply. I was thinking about going with a 3 way
> mirror and a hot spare.

Keep in mind that you can have problems in Sol10u8 if you use a mirror+spare config for the root pool. This should be fixed in u9.

> But I don't think I can upgrade to larger drives unless I do it all at
> once, is that correct?

You can replace the drives one by one, but the pool will only expand once all the data drives have the new, bigger capacity.

//Jim
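A sketch of that expansion (pool and device names are placeholders; the autoexpand pool property exists on newer pool versions, and on older releases an export/import after the last replacement achieves the same):

# zpool set autoexpand=on tank
# zpool replace tank c1t0d0 c2t0d0
(wait for the resilver, then repeat for each remaining drive)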
Jim Klimov
2011-Jun-27 09:46 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
2011-06-19 3:47, Richard Elling wrote:

> On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:
>
>> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>>> From: Daniel Carosone [mailto:dan at geek.com.au]
>>>> Sent: Thursday, June 16, 2011 10:27 PM
>>>>
>>>> Is it still the case, as it once was, that allocating anything other
>>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>>> (either or both, forget which, guess write cache)?
>>>
>>> I will only say, that regardless of whether or not that is or ever was true,
>>> I believe it's entirely irrelevant. Because your system performs read and
>>> write caching and buffering in ram, the tiny little ram on the disk can't
>>> possibly contribute anything.
>>
>> I disagree. It can vastly help improve the IOPS of the disk and keep
>> the channel open for more transactions while one is in progress.
>> Otherwise, the channel is idle, blocked on command completion, while
>> the heads seek.
>
> Actually, all of the data I've gathered recently shows that the number of
> IOPS does not significantly increase for HDDs running random workloads.
> However the response time does :-( My data is leading me to want to restrict
> the queue depth to 1 or 2 for HDDs.
>
> SSDs are another story; they scale much better in the response time and
> IOPS vs queue depth analysis.

Now, is there going to be a tunable which would allow us to set queue depths per-device? Or are tunables so evil that you'd "rather poke your eye with a stick"? (C) Richard Elling ;)

--
+============================================================+
|                                                            |
| Jim Klimov                                                 |
| CTO, JSC COS&HT                                            |
|                                                            |
| +7-903-7705859 (cellular)          mailto:jimklimov at cos.ru |
|                         CC:admin at cos.ru,jimklimov at mail.ru |
+============================================================+
| ()  ascii ribbon campaign - against html mail               |
| /\                        - against microsoft attachments   |
+============================================================+
Nomen Nescio
2011-Jun-30 13:16 UTC
[zfs-discuss] Server with 4 drives, how to configure ZFS?
> Actually, you do want /usr and much of /var on the root pool, they
> are integral parts of the "svc:/filesystem/local" needed to bring up
> your system to a useable state (regardless of whether the other
> pools are working or not).

Ok. I have my feelings on that topic, but they may not be as relevant for ZFS. It may be because I tried to avoid single points of failure on other systems with techniques that don't map to ZFS or Solaris. I believe I can bring up several OSes without /usr or /var; although they complain, they will work. But I'll take your point here.

> Depending on the OS versions, you can do manual data migrations
> to separate datasets of the root pool, in order to keep some data
> common between OE's or to enforce different quotas or compression
> rules. For example, on SXCE and Solaris 10 (but not on oi_148a)
> we successfully splice out many filesystems in such a layout
> (the example below also illustrates multiple OEs):

Thanks, I have done similar things but I didn't know if they were "approved".

> And you can not boot from any pool other than a mirror or a
> single drive. Rationale: a single BIOS device must be sufficient
> to boot the system and contain all the data needed to boot.

Definitely an important fact here. Thanks for all the info!
Edward Ned Harvey
2011-Jul-02 13:27 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: Ross Walker [mailto:rswwalker at gmail.com]
> Sent: Friday, June 17, 2011 9:48 PM
>
> The on-disk buffer is there so data is ready when the hard drive head lands;
> without it the drive's average rotational latency will trend higher due to
> missed landings because the data wasn't in the buffer at the right time.
>
> The read buffer is to allow the disk to continuously read sectors whether the
> system bus is ready to transfer or not. Without it, sequential reads wouldn't
> last long enough to reach max throughput before they would have to pause
> because of bus contention and then suffer a rotational latency hit which
> would kill read performance.

And it turns out ... Ross is the winner. ;-) My hypothesis wasn't right, and whoever said a single disk would hog the bus in an idle state, that's also wrong.

Conclusion: Yes, it matters to enable the write_cache. But the reason it matters is to ensure the right data is present at the right time, NOT because of any idle bus blocking.

Here's the test: I tested writing to a bunch (4) of disks simultaneously at maximum throughput, with the write_cache enabled. This is on a 6Gbit bus; they all performed 1.0 Gbit/sec, which was precisely the mfgr spec. Then I disabled write_cache on all the disks and repeated the test. They all dropped to 750 Mbit/sec.

If the idle bus contention theory were correct, then the total bus speed would have been limited to the max throughput of a single disk (1 Gbit). But I was easily able to sustain 3 Gbit, thus disproving the idle bus contention. If the filesystem write buffer were making the disk write_cache irrelevant, as I conjectured, then the total throughput would have been the same regardless of whether the write_cache was enabled or disabled. Since performance dropped with write_cache disabled, that disproves my hypothesis.

No further testing was necessary. I'm not interested in how much performance difference there is, or under which specific conditions it occurs; I am only interested in the existence of a performance difference. So the conclusion is: yes, you want to enable your disk write cache (assuming all the data on your disk is managed by ZFS).
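For anyone who wants to reproduce this kind of comparison, a rough sketch (device names are placeholders; writing to raw devices destroys their contents, so only point this at scratch disks, and toggle write_cache between runs):

for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0; do
    # one sequential writer per disk, timed individually
    ptime dd if=/dev/zero of=/dev/rdsk/${d}s0 bs=1024k count=4096 &
done
wait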
Edward Ned Harvey
2011-Jul-02 13:39 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Conclusion: Yes, it matters to enable the write_cache.

Now, the question of whether or not it matters to use the whole disk versus partitioning, and how to enable the write_cache automatically on sliced disks:

I understand that different people have had different results based on which hardware and which OS rev they are using. So if this matters to you, you'll just need to check for yourself. But here is what I found:

On a Sun (Oracle) X4270 and Solaris 11 Express, the behavior I see is pretty much as the man page describes. If I create a pool using the whole disks, then the write_cache is enabled automatically. When I destroy a pool, the write_cache is returned to its previous (disabled) state. When I create a pool using slices of the disks, then the write_cache is not automatically enabled.

I would like to know: Is there some way to enable the write cache automatically on specific devices? I have a script that will enable the write_cache on all the devices (just a simple wrapper around format -f), and of course I can make it run at startup, or in a cron job, etc. But I'd like to know if there's a more native way to achieve that end result.

I have one really specific reason to care about automatically enabling the write_cache on sliced disks:

All the disks in the system are large disks (2T). The OS only needs a few G, so we install the OS into mirrored slices. The rest of the disk is sliced and added to the storage pool. The default behavior in this situation is to leave write_cache disabled on those first few disks.
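Such a wrapper might look roughly like this (a sketch only: the cache menu lives under format's expert mode, the exact sequence of menu commands should be verified interactively before scripting it, and the device name is a placeholder):

# cat > /tmp/wce <<EOF
cache
write_cache
enable
quit
quit
quit
EOF
# format -e -f /tmp/wce -d c0t2d0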
Richard Elling
2011-Jul-02 15:00 UTC
[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jul 2, 2011, at 6:39 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> Conclusion: Yes, it matters to enable the write_cache.
>
> Now, the question of whether or not it matters to use the whole disk versus
> partitioning, and how to enable the write_cache automatically on sliced
> disks:
>
> I understand that different people have had different results based on which
> hardware and which OS rev they are using. So if this matters to you, you'll
> just need to check for yourself. But here is what I found:
>
> On a Sun (Oracle) X4270 and Solaris 11 Express, the behavior I see is pretty
> much as the man page describes. If I create a pool using the whole disks,
> then the write_cache is enabled automatically. When I destroy a pool, the
> write_cache is returned to its previous (disabled) state. When I create a
> pool using slices of the disks, then the write_cache is not automatically
> enabled.

Yes, this is annoying. In NexentaStor, we have a property that manages the write cache policy on a per-device basis.

> I would like to know: Is there some way to enable the write cache
> automatically on specific devices? I have a script that will enable the
> write_cache on all the devices (just a simple wrapper around format -f), and
> of course I can make it run at startup, or in a cron job, etc. But I'd like
> to know if there's a more native way to achieve that end result.

I would say change the way it compiles, but you are stuck without source for Solaris.

> I have one really specific reason to care about automatically enabling the
> write_cache on sliced disks:
>
> All the disks in the system are large disks (2T). The OS only needs a few
> G, so we install the OS into mirrored slices. The rest of the disk is
> sliced and added to the storage pool. The default behavior in this
> situation is to leave write_cache disabled on those first few disks.

Sounds like a reasonable request to me. Maybe Oracle will accept an RFE?
 -- richard