MLR
2012-Mar-21 03:16 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
I read the "ZFS_Best_Practices_Guide" and "ZFS_Evil_Tuning_Guide", and have some questions: 1. Cache device for L2ARC Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn''t we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache? 2. ZFS dynamically strips along the top-most vdev''s and that "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each , this zpool would theoretically read/write at ~100MB/s max (how about real world average?)? If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x- 14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers? Why doesn''t ZFS try to dynamically strip inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn''t read from multiple drives at once when requesting data, or why a zpool wouldn''t make N number of requests to a vdev with N being the number of disks in that vdev)? Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If your using raidzN your probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity your putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if your running at 1/20 the "ideal" performance. Main Question: 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so: Main Application: Home NAS * Like to optimize max space with 20%(ideal) or 25% parity - would like ''decent'' reading performance - ''decent'' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper. My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid. * 16GB RAM * Open to using ZIL/L2ARC, but, left out for now: writing doesn''t occur much (~7GB a week, maybe a big burst every couple months), and don''t really read same data multiple times. What would be the best setup? I''m thinking one of the following: a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability) b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs) c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance) I am leaning towards "a." since I am thinking "raidz3"+"raidz2" should provide a little more reliability than 5 "raidz1"''s, but, worried that the real world read/write performance will be low (theoridical is ~200MB/s, and, since the 2nd vdev is 3x the size as the 1st, I am probably looking at more like 133MB/s?). The 12 disk array is also above the "9 disk group max" recommendation in the Best Practices guide, so not sure if this affects read performance (if it is just resilver time I am not as worried about it as long it isn''t like 3x longer)? I guess I''m hoping "a." 
really isn''t ~200MB/s hehe, if it is I''m leaning towards "b.", but, if so, all three are downgrades from my initial setup read performance wise -_-. Is someone able to correct my understanding if some of my numbers are off, or would someone have a better raidzN configuration I should consider? Thanks for any help.
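To make the parity/space arithmetic concrete, here is a rough back-of-the-envelope sketch in POSIX shell (sizes in GB, ignoring ZFS metadata overhead and TB/TiB rounding; usable space per raidz vdev is taken as (disks - parity) * disk size):

  # option a: 8x1.5TB raidz2 + 12x3TB raidz3
  echo $(( (8-2)*1500 + (12-3)*3000 ))      # 36000 GB usable
  # option b: 8x1.5TB raidz2 + 3 vdevs of 4x3TB raidz1
  echo $(( (8-2)*1500 + 3*(4-1)*3000 ))     # 36000 GB usable
  # option c: 2 vdevs of 4x1.5TB raidz1 + 3 vdevs of 4x3TB raidz1
  echo $(( 2*(4-1)*1500 + 3*(4-1)*3000 ))   # 36000 GB usable

All three layouts spend 5 of the 20 disks on parity (25%) and land on the same raw usable capacity; the differences discussed in the replies below are about performance and redundancy, not space.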
Jim Klimov
2012-Mar-21 11:56 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 7:16, MLR wrote:
> I read the "ZFS_Best_Practices_Guide" and "ZFS_Evil_Tuning_Guide", and have some questions:
>
> 1. Cache device for L2ARC
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

Basically, SSDs shine best in random IOs. For example, my (consumer-grade) 2TB disks in a home NAS yield up to 160MB/s in linear reads, but drop to about 3MB/s in random performance, occasionally bursting to 10-20MB/s for a short time. The ZFS COW-based data structure is quite fragmented, so there are many random seeks. Raw low-level performance gets hurt as a tradeoff for reliability, and SSDs, along with large RAM buffers, are ways to recover and boost the performance. There is an especially large amount of work with metadata when/if you use deduplication - tens of gigabytes of RAM are recommended for a decent-sized pool of a few TB.

> 2. ZFS dynamically stripes along the top-most vdevs and "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each, this zpool would theoretically read/write at ~100MB/s max (how about real world average?)? If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers?

I think your numbers are not right. They would make sense for a RAID0 of 14 drives, though. All correctly implemented synchronously-redundant schemes must wait for all storage devices to complete writing, so they are "not faster" than single devices during writes, and due to bus contention, etc. are often a bit slower overall. Reads, on the other hand, can be parallelised on RAIDzN as well as on RAID5/6 and can boost read performance more or less like striping.

As for "same level of redundancy", many people would point out that usual RAIDs don't have a method to know which part of the array is faulty (i.e. when one sector in a RAID stripe becomes corrupted, there is no way to certainly reconstruct correct data, and often no quick way to detect the corruption either). Many arrays depend on timestamps of the component disks to detect stale data, and can only recover well from full-disk failures.

> Why doesn't ZFS try to dynamically stripe inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn't read from multiple drives at once when requesting data, or why a zpool wouldn't make N number of requests to a vdev with N being the number of disks in that vdev)?

That it does, somewhat. In RAID terms you can think of a ZFS pool with several top-level vdevs, each made up from several leaf devices, as implementing RAID50 or RAID60, to contain lots of "blocks". There are "banks" (TLVDEVs) of disks in redundant arrays, and these have block data (and redundancy blocks) striped across sectors of different disks. A pool stripes (RAID0) userdata across several TLVDEVs by storing different blocks in different "banks". Loss of a whole TLVDEV is fatal, like in RAID50.

ZFS has a variable step though, so depending on block size, the block-stripe size within a TLVDEV can vary. For minimal-sized blocks on a raidz or raidz2 TLVDEV you'd have one or two redundancy sectors and a data sector, using two or three disks only. Other "same-numbered" sectors of other disks in the TLVDEV can be used by another such stripe. There are nice illustrations in the docs and blogs regarding the layout. Note that redundancy disks are not used during normal reads of uncorrupted data. However, I believe that there may be a slight benefit from ZFS for smaller blocks which are not using the whole raidzN array stripe, since parallel disks can be used to read parts of different blocks. But the random seeks involved in mechanical disks would probably make it unnoticeable, and there's probably a lot of randomness in the storage of small blocks.

> Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If you're using raidzN you're probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity you're putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if you're running at 1/20 the "ideal" performance.

There are several tradeoffs, and other people on the list can explain them better (and did in the past - search the archives). Mostly this regards resilver times (how many disks are used to rebuild another disk) and striping performance. There were also some calculations regarding, e.g., 10-disk sets: you can make two raidz1 arrays or one raidz2 array. They give you the same userspace sizes (8 data disks), but the latter is deemed a lot more reliable.

Basically, with mirroring you pay the most (2x-3x redundancy for each disk) and get the best performance as well as the best redundancy. With raidzN you get more useable space on the same disks at a greater hit to performance, but cheaper. For many home users that does not matter. Say, your camera's CF card can stream its photos at 10MB/s into your home storage box, so a sustained 10 or 50MB/s of writes suffices for you.

One thing to note though is that with larger drives you get longer times just to read in the whole drive trying to detect errors when scrubbing - and this is something your system should proactively do. This opens windows for multiple-drive errors, which can happen to become unrecoverable (i.e. several hits to the same block exceeding its redundancy level). With multi-TB disks it is recommended to have at least 3-disk redundancy via 3-4-way mirrors or raidz3, or in traditional systems "RAID7" or "RAID6.3" as some call it. Apparently, having 3 parity disks in a raidz3 array places some requirement on the minimal size of the array so that it becomes just reasonable (perhaps 8-10 disks overall).

> Main Question:
> 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so:
>
> Main Application: Home NAS
> * Like to optimize max space with 20% (ideal) or 25% parity - would like 'decent' reading performance
> - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.
> My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid.

10GigE is a theoretical 1250MB/s.
That might be achievable for writes with mirrored disks and/or good fast caching (in bursts, or if your working set fits in the cache), but seems unlikely with raidz sets. For reads, caching would likewise help; disk speeds would be good if you have written lots of data contiguously (so that the disks won't have to seek too much and can yield linear reads). I am not ready to conjure up some numbers out of thin air now, and hopefully someone else will reply to your main question in detail.

I assume your other hardware won't be a bottleneck? (PCI buses, disk controllers, RAM access, etc.)

> * 16GB RAM

Not so much for ZFS advanced features - don't try dedup ;) Also, remember that L2ARC indexing still needs some RAM to reference the cached blocks. The reference size is constant (about 200 bytes per block), but due to varying block size the ratio (GB of RAM => GB of L2ARC) can differ and depends on your usage. In particular, for dedup the ratio is very bad, about 2x (a dedup-table entry is about twice as large as the reference to it from the RAM ARC to L2ARC).

> * Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much (~7GB a week, maybe a big burst every couple months), and don't really read same data multiple times.

A dedicated fast and reliable (i.e. mirrored SSD or RAM drive) ZIL would help if you have synchronous writes - for example, compilation of large projects creating many files, especially over NFS. A ZIL is a rather specific investment, so it might not help you at all, and ideally it is a write-only device (read in only after crashes). So for SSDs you should expect a lot of wear, and orient towards a mirror of SLC devices. Or RAM disks. Or maybe small dedicated HDDs to offload the write-seeks from the main pool (that last idea is often argued for/against)...

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)
> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)
>
> I am leaning towards "a." since I am thinking "raidz3"+"raidz2" should provide a little more reliability than 5 "raidz1"s, but, worried that the real world read/write performance will be low (theoretical is ~200MB/s, and, since the 2nd vdev is 3x the size as the 1st, I am probably looking at more like 133MB/s?). The 12 disk array is also above the "9 disk group max" recommendation in the Best Practices guide, so not sure if this affects read performance (if it is just resilver time I am not as worried about it as long it isn't like 3x longer)?

One thing to note is that many people would not recommend using a "disbalanced" ZFS array - one expanded by adding a TLVDEV after many writes, or one consisting of differently-sized TLVDEVs. ZFS does a rather good job of trying to use available storage most efficiently, but it has often been reported that it hits some algorithmic bottleneck when one of the TLVDEVs is about 80-90% full (even if the others are new and empty). Blocks are balanced across TLVDEVs on write, so your old data is not magically redistributed until you explicitly rewrite it (i.e. zfs send or rsync into another dataset on this pool).

So I'd suggest that you keep your disks separate, with two pools made from the 1.5TB disks and from the 3TB disks, and use these pools for different tasks (i.e. a working set with relatively high turnaround and fragmentation, and WORM static data with little fragmentation and high read performance). Also this would allow you to more easily upgrade/replace the whole set of 1.5TB disks when the time comes.

Note that the two disk types can also have other different characteristics, most notably the native sector size (4KB vs. 512b). You might expose your pool to a hit in reliability and performance if you used the 4KB-sectored disks with emulated 512b sectors as 512b-sectored disks; however, you'd gain some more useable space in exchange. You don't have these negative hits when you use a native 512b disk as a 512b disk. It is likely that when you decide to replace the 1.5TB disks, all those available on the market will be 4KB-sectored, so in-place replacement of disks (replacing pool disks one by one and resilvering) would be a bad option IF your 1.5TB disks have native 512b sectors and you use them as such in the pool. If interested, read up more on "ashift=9 vs. ashift=12" issues in ZFS.

> I guess I'm hoping "a." really isn't ~200MB/s hehe, if it is I'm leaning towards "b.", but, if so, all three are downgrades from my initial setup read performance wise -_-.
>
> Is someone able to correct my understanding if some of my numbers are off, or would someone have a better raidzN configuration I should consider? Thanks for any help.

Again, I hope someone else will correctly suggest the setup for your numbers. I'm somewhat more successful with theory now ;(

HTH,
//Jim Klimov
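For illustration, Jim's two-pool suggestion might look roughly like this (a sketch only; the cNtNd0 device names are placeholders, and on 4KB-sector 3TB drives you would want to make sure the pool is created with ashift=12):

  # old 1.5TB drives as one pool, e.g. a single 8-disk raidz2
  zpool create oldtank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0

  # new 3TB drives as a separate pool, e.g. three 4-disk raidz1 top-level vdevs
  zpool create newtank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
                       raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0 \
                       raidz c2t8d0 c2t9d0 c2t10d0 c2t11d0

  zpool status newtank   # verify the vdev layout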
Paul Kraus
2012-Mar-21 12:41 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, Mar 21, 2012 at 7:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-21 7:16, MLR wrote:
> One thing to note is that many people would not recommend using a "disbalanced" ZFS array - one expanded by adding a TLVDEV after many writes, or one consisting of differently-sized TLVDEVs.
>
> ZFS does a rather good job of trying to use available storage most efficiently, but it was often reported that it hits some algorithmic bottleneck when one of the TLVDEVs is about 80-90% full (even if others are new and empty). Blocks are balanced across TLVDEVs on write, so your old data is not magically redistributed until you explicitly rewrite it (i.e. zfs send or rsync into another dataset on this pool).

I have been running ZFS in a mission critical application since zpool version 10 and have not seen any issues with some of the vdevs in a zpool full while others are virtually empty. We have been running commercial Solaris 10 releases. The configuration was that each business unit had a separate zpool consisting of mirrored pairs of 500 GB LUNs from SAN-based storage. Each zpool started with enough storage for that business unit. As each business unit filled their space, we added additional mirrored pairs of LUNs. So the smallest unit had one mirror vdev and the largest had 13 vdevs. In the case of the two largest (13 and 11 vdevs), most of the vdevs were well above 90% utilized and there were 2 or 3 almost empty vdevs. We never saw any reliability issues with this condition. In terms of performance, the storage was NOT our performance bottleneck, so I do not know if there were any performance issues with this situation.

> So I'd suggest that you keep your disks separate, with two pools made from 1.5TB disks and from 3TB disks, and use these pools for different tasks (i.e. a working set with relatively high turnaround and fragmentation, and WORM static data with little fragmentation and high read performance). Also this would allow you to more easily upgrade/replace the whole set of 1.5TB disks when the time comes.

I have never tried mixing drives of different size or performance characteristics in the same zpool or vdev, except as a temporary migration strategy. You already know that growing a RAIDz vdev is currently impossible, so with a RAIDz strategy your only option for growth is to add complete RAIDz vdevs, and you _want_ those to match in terms of performance or you will have unpredictable performance. For situations where you _might_ want to grow the data capacity in the future I recommend mirrors, but ... and Richard Elling posted hard data on this to the list a while back ... to get the reliability of RAIDz2 you need more than a 2-way mirror. In my mind, the larger the amount of data (and size of drives), the _more_ reliability you need.

We are no longer using the configuration described above. The current configuration is five JBOD chassis of 24 drives each. We have 22 vdevs, each a RAIDz2 consisting of one drive from each chassis, and 10 hot spares. Our priority was reliability followed by capacity and performance. If we could have, we would have just used 3- or 4-way mirrors, but we needed more capacity than that provided. I note that in pre-production testing we did have two of the five JBOD chassis go offline at once and did not lose _any_ data. The total pool size is about 40 TB.

We also have a redundant copy of the data on a remote system. That system only has two JBOD chassis and capacity is the priority. The zpool consists of two vdevs, each a RAIDz2 of 23 drives, and two hot spares. The performance is dreadful, but we _have_ the data in case of a real disaster.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
Jim Klimov
2012-Mar-21 12:55 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 16:41, Paul Kraus wrote:
> I have been running ZFS in a mission critical application since zpool version 10 and have not seen any issues with some of the vdevs in a zpool full while others are virtually empty. We have been running commercial Solaris 10 releases. The configuration was that each

Thanks for sharing some real-life data from larger deployments, as you often do. That's something I don't often have access to nowadays, with the liberty to tell :)

Nice to hear about the lack of degradation in this scenario you have; it was one proposed a few years back on the Sun Forums, I believe. Perhaps the problems come if you similarly expand raidz-based arrays by adding TLVDEVs, or with OpenSolaris's experimental features?.. I don't know, really :)

//Jim
Edward Ned Harvey
2012-Mar-21 13:28 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
>
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

You don't add l2arc because you care about MB/sec. You add it because you care about IOPS (read). Similarly, you don't add a dedicated log device for MB/sec. You add it for IOPS (sync write). Any pool - raidz, raidz2, mirror - will give you optimum *sequential* throughput. All the performance enhancements are for random IO. Mirrors outperform raidzN, but in either case, you get improvements by adding log & cache.

> Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool,

It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.

> the disks will go ~100MB/s each, this zpool would theoretically read/write at

No matter which configuration you choose, you can expect optimum throughput from all drives in sequential operations. Random IO is a different story.

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)

No. 12 in a single vdev is too much.

> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)

Not bad, but different size vdevs will perform differently (8 disks vs 4), so... See below.

> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

This would be your optimal configuration.
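As a sketch of what adding a cache device looks like, plus the RAM cost of indexing it using the ~200 bytes per block figure Jim mentioned earlier (device name and cache size are hypothetical):

  # add an SSD as L2ARC to an existing pool
  zpool add tank cache c3t0d0

  # rough ARC overhead for indexing a 100GB L2ARC holding 8KB blocks
  echo $(( 100 * 1024 * 1024 * 1024 / 8192 * 200 / 1024 / 1024 ))   # ~2500 MB of RAM

With large (128KB) blocks the same cache would need only on the order of 150MB of headers, so the ratio really does depend on the workload.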
Paul Kraus
2012-Mar-21 13:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Tue, Mar 20, 2012 at 11:16 PM, MLR <maillistreader1 at gmail.com> wrote:

> 1. Cache device for L2ARC
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier. An SSD will be as fast on random I/O as on sequential (compared to an HDD). An SSD will be as fast on small I/O as on large (once again, compared to an HDD). Due to its COW design, once a file is _changed_, ZFS no longer accesses it strictly sequentially. If the files are written once and never changed, then they _may_ be sequential on disk.

An important point to remember about the ARC / L2ARC is that it (they?) are ADAPTIVE. The amount of space used by the ARC will grow as ZFS reads data and shrink as other processes need memory. I also suspect that data eventually ages out of the ARC. The L2ARC is (mostly) just an extension of the ARC, except that it does not have to give up capacity as other processes need more memory.

> 2. ZFS dynamically stripes along the top-most vdevs and "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each,

Assuming the disks will do 100MB/sec. for your data :-)

> this zpool would theoretically read/write at ~100MB/s max (how about real world average?)?

Yes. In a RAIDz<n>, when a write is dispatched to the vdev _all_ the drives must complete the write before the write is complete. All the drives in the vdev are written to in parallel. This is (or should be) the case for _any_ RAID scheme, including RAID1 (mirroring). If a zpool has more than one vdev, then writes are distributed among the vdevs based on a number of factors (which others are _much_ more qualified to discuss). For ZFS, performance is proportional to the number of vdevs, NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.

> If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers? Why doesn't ZFS try to dynamically stripe inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn't read from multiple drives at once when requesting data, or why a zpool wouldn't make N number of requests to a vdev with N being the number of disks in that vdev)?

I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.

> Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If you're using raidzN you're probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity you're putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if you're running at 1/20 the "ideal" performance.

The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue, NOT a reliability one. I have also not been able to duplicate others' observations that 2^N drives per vdev is a magic number (4, 8, 16, etc.). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

> Main Question:
> 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so:

Is the enclosure just a JBOD? If it is not, can it present drives directly? If you cannot get at the drives individually, then the rest of the discussion is largely moot. You are buying 3TB drives; by definition you are NOT looking for performance or reliability but capacity. What is the uncorrectable error rate on these 3TB drives? What is the real random I/Ops capability of these 3TB drives? I am not trying to be mean here, but I would hate to see you put a ton of effort into this and then be disappointed with the result due to a poor choice of hardware.

> Main Application: Home NAS
> * Like to optimize max space with 20% (ideal) or 25% parity - would like 'decent' reading performance
> - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.

1,250MB/sec of random I/O (assuming small files) is very non-trivial to achieve and is way more than "decent"... On my home network I see 30MB/sec of large file traffic per client, and I rarely have more than one client doing lots of I/O at a time. How much space do you _need_, including reasonable growth?

> My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid.

How did you measure this?

> * 16GB RAM

What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM serving up 20TB of random small files. The ARC uses between 8 and 10 GB, with between 1 and 2 GB free. But our users are generally accessing less than 3 TB of data at a time.

> * Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much (~7GB a week, maybe a big burst every couple months), and don't really read same data multiple times.

A ZIL helps sync write performance (NFS). L2ARC gives you more ARC space, which helps all reads.

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)
> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

With the eight 1.5TB drives you can do:
1 x 8 (raidz<n>) == worst performance
2 x 4 (raidz<n>) == better performance; if raidz2, capacity is the same as mirrors but with better reliability
4 x 2 (mirror) == best performance

With the twelve 3TB drives you can do:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance

I agree with Jim that you should keep the 1.5TB and the 3TB drives in separate zpools. Although you _can_ partition the 3TB drives to look like two 1.5TB drives: group the first partition on each 3TB drive with the 1.5TB drives and use the second partitions as a second zpool. There are caveats with doing that, but it may fit your needs...

With 20 logical 1.5TB drives you can do:
1 x 20 (raidz<n>) == bad performance, don't do this :-)
2 x 10 (raidz<n>) == better
3 x 6 + 2 hot spares (raidz<n>)
4 x 5 (raidz<n>)
6 x 3 + 2 hot spares (mirror)
9 x 2 + 2 hot spares (mirror)

Plus another 12 logical 1.5TB drives:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance

If you have the time, set up each configuration and _measure_ the performance. If you can, load up a bunch of data (at least 33% full) and then trigger a scrub to see how long a resilver takes; see the command sketch below. Remember here that you are looking for _relative_ measures (unless you have a performance goal you need to hit).

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
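For the "measure it" step Paul suggests, a minimal sketch of the commands involved (the pool name is a placeholder):

  zpool scrub tank        # force a full read of all allocated data
  zpool status tank       # shows scrub/resilver progress and elapsed time
  zpool iostat -v tank 5  # per-vdev bandwidth and IOPS, sampled every 5 seconds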
Edward Ned Harvey
2012-Mar-21 13:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
>
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

If possible, spread your vdevs across 4 different controllers/busses. So if you lose any one controller/bus, you will only be degraded; the pool won't go offline.
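A hypothetical way to lay out option "c" in that spirit, with each 4-disk raidz vdev taking one disk from each of four controllers (c1-c4 and the target numbers are placeholders; the first two vdevs would be the 1.5TB disks, the last three the 3TB disks):

  zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 \
                    raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 \
                    raidz c1t2d0 c2t2d0 c3t2d0 c4t2d0 \
                    raidz c1t3d0 c2t3d0 c3t3d0 c4t3d0 \
                    raidz c1t4d0 c2t4d0 c3t4d0 c4t4d0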
Jim Klimov
2012-Mar-21 13:51 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 17:28, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
...
>> Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool,
>
> It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.

Yes, that's important to note also. While ZFS marketing initially stressed that, unlike traditional RAID systems, a "rebuild" of ZFS onto a spare/replacement disk only needs to copy referenced data and not the whole disk, it somehow fell off the picture that such a rebuild is a lot of random IO - because the data block tree must be read in as a tree walk, often with emphasis on block "age" (its birth TXG number). If your pool is reasonably full (and who runs it empty?) then this is indeed lots of random IO, and a blind full-disk copy would have gone orders of magnitude faster. The fewer disks that participate in this thrashing, the faster it will go (less data needed overall to reconstruct a disk's worth of sectors from redundancy data).

That's the way I understand the problem, anyway...

//Jim
Paul Kraus
2012-Mar-21 14:37 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, Mar 21, 2012 at 9:51 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-21 17:28, Edward Ned Harvey wrote:
>> It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.
>
> Yes, that's important to note also. While ZFS marketing initially stressed that unlike traditional RAID systems, a "rebuild" of ZFS onto a spare/replacement disk only needs to copy referenced data and not the whole disk, it somehow fell off the picture that such rebuild is a lot of random IO - because the data block tree must be read in as a tree walk, often with emphasis on block "age" (its birth TXG number). If your pool is reasonably full (and who runs it empty?) then this is indeed lots of random IO, and a blind full-disk copy would have gone orders of magnitude faster. The fewer disks that participate in this thrashing, the faster it will go (less data needed overall to reconstruct a disk's worth of sectors from redundancy data).

There are two different cases here... a resilver to reconstruct data from a failed drive, and a scrub to proactively find bad sectors.

The best case for the first (bad drive replacement) is a mirrored drive, in my experience. In that case only the data involved in the failure needs to be read and written. I am unclear how much of the data is read _from_other_vdevs_ in the case of a failure of a drive in a RAIDz<n> vdev. I have seen disk activity on non-failure-related vdevs during a drive replacement, which is why I am unsure in this case.

In the case of a "scrub", _all_ of the data in the zpool is read and the checksums checked. My 22 vdev zpool takes about 300 hours for this, while the 2 vdev zpool takes over 600 hours. Both have comparable amounts of data and snapshots. The 22 vdev zpool is on a production server with normal I/O activity; the 2 vdev one is only receiving zfs snapshots and doing no other I/O.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Marion Hakanson
2012-Mar-21 17:40 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
paul at kraus-haus.org said:
> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier.
> . . .
> For ZFS, performance is proportional to the number of vdevs NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.
> . . .
> I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.

In your first paragraph you make the important point that "performance" is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above, you go back to using "performance" in its ambiguous form. I assume that by "performance" you are mostly focusing on random-read performance...

My experience is that sequential read performance _does_ scale with the number of drives in each vdev. Both sequential and random write performance also scale in this manner (note that ZFS tends to save up small, random writes and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

paul at kraus-haus.org said:
> The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue NOT a reliability one. I have also not been able to duplicate others observations that 2^N drives per vdev is a magic number (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Again, the "performance issue" you describe above is for the random-read case, not sequential. If you rarely experience small-random-read workloads, then raidz* will perform just fine. We often see 2000 MBytes/sec sequential read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdevs (using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will run into the slow performance of the small, random read workload. This is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially of the 1TB+ size. That way if it takes 200 hours to resilver, you've still got a lot of redundancy in place.

Regards,

Marion
Jim Klimov
2012-Mar-21 18:26 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 21:40, Marion Hakanson wrote:
> Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

True, but if the stars align so nicely that all the sectors related to the block are read simultaneously in parallel from several drives of the top-level vdev, so that there is no (substantial) *latency* incurred by waiting between the first and last drives to complete the read request, then the *aggregate bandwidth* of the array is (should be) similar to the performance (bandwidth) of a stripe. This gain would probably be hidden by caches and averages, unless the stars align so nicely for many blocks in a row, such as a sequential uninterrupted read of a file written out sequentially - so that the component drives would stream it off the platter track by track in a row... Ah, what a wonderful world that would be! ;)

Also, after a sector is read by the disk and passed to the OS, it is supposedly cached until all sectors of the block arrive into the cache and the checksum matches. During this time the HDD is available to do other queued mechanical tasks. I am not sure which cache that might be: too early for the ARC - no block yet - and the vdev caches now drop non-metadata sectors. Perhaps it is just a variable buffer space in the instance of the reading routine, which tries to gather all the pieces of the block together and pass it to the reader (and into the ARC)...

//Jim
Richard Elling
2012-Mar-21 18:53 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
comments below...

On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:

> paul at kraus-haus.org said:
>> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier.
>> . . .
>> For ZFS, performance is proportional to the number of vdevs NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.

Actually, few people have sequential workloads. Many think they do, but I say prove it with iopattern.

>> . . .
>> I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.
>
> In your first paragraph you make the important point that "performance" is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above, you go back to using "performance" in its ambiguous form. I assume that by "performance" you are mostly focusing on random-read performance...
>
> My experience is that sequential read performance _does_ scale with the number of drives in each vdev. Both sequential and random write performance also scale in this manner (note that ZFS tends to save up small, random writes and flush them out in a sequential batch).

Yes. I wrote a small, random read performance model that considers the various caches. It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf
The spreadsheet shown in figure 3 is available for the asking (and it works on your iphone or ipad :-)

> Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
    average IOPS * ((D+P) / D)
where,
    D = number of data disks
    P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
    total disks per set = D + P
We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at the OpenStorage Summit last fall.

> paul at kraus-haus.org said:
>> The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue NOT a reliability one. I have also not been able to duplicate others observations that 2^N drives per vdev is a magic number (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Paul, I have a considerable amount of data that refutes your findings. Can we agree that YMMV and varies dramatically, depending on your workload?

> Again, the "performance issue" you describe above is for the random-read case, not sequential. If you rarely experience small-random-read workloads, then raidz* will perform just fine. We often see 2000 MBytes/sec sequential read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdevs (using 2TB drives).

Yes, this is relatively easy to see. I've seen 6GBytes/sec for large configs, but that begins to push the system boundaries in many ways.

> However, when a disk fails and must be resilvered, that's when you will run into the slow performance of the small, random read workload. This is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially of the 1TB+ size. That way if it takes 200 hours to resilver, you've still got a lot of redundancy in place.

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
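Plugging rough numbers into Richard's formula, assuming ~100 small random read IOPS for a single 7200rpm drive (an illustrative figure only):

  # worst case for one 8-disk raidz2 vdev (D=6, P=2)
  echo $(( 100 * (6 + 2) / 6 ))       # ~133 IOPS, i.e. a bit better than one disk
  # a pool of four such vdevs scales with the vdev count
  echo $(( 4 * 100 * (6 + 2) / 6 ))   # ~533 IOPS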
maillist reader
2012-Mar-21 19:36 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
Thank you all for the information, I believe it is much clearer to me. "Sequential reads" should scale with the number of disks in the entire zpool (regardless of the number of vdevs), and "random reads" will scale with just the number of vdevs (i.e. the idea I had before only applies to "random reads"), which I am much happier with. Everything on my system should be mostly sequential, as editing should not occur much (i.e. no virtual-machine type things); when things get changed it usually means deleting the old file and adding the "updated" file.

I read though that ZFS does not have a "defragmentation" tool - is this still the case? It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

> Is the enclosure just a JBOD? If it is not, can it present drives
&
> I assume your other hardware won't be a bottleneck? (PCI buses, disk controllers, RAM access, etc.)

A little more extra information about my system: it is a JBOD. The disks go to a SAS-2 expander (RES2SV240), and that has a single connection to a Tyan motherboard which has an LSI SAS 2008 chip controller built in. The CPU is an i3 with a DMI of 5 GT/s (DMI is new to me vs FSB). RAM is server-grade unbuffered ECC DDR3 1333 8GB sticks. It is a dedicated machine which will do nothing but serve files over the network. To my understanding the network or the disks themselves should be my bottleneck; the SAS-2 connection between the SAS expander and the mobo should be 24Gbit/s or 3GB/s (1.5GB/s if SAS-1), and 5 GT/s should provide ~20 GB/s max bandwidth for the 64-bit machine from what I read online.

I don't think this affects me, but I was also curious: does anyone know if the mobo <> SAS expander will still establish a SAS-2 connection (if they both support SAS-2) when the backplanes only support SAS-1 / SATA 3Gb/s? I never looked up the backplane part numbers in my Norco, but I think they support SATA 6Gb/s, so I assume they support SAS-2. But in essence SAS expander <> HDDs won't be over 3Gb/s per port, so as long as SAS expander <> mobo establishes its own SAS-2 connection regardless of what SAS expander <> HDDs do, then I don't even have to think about it. 1.5GB/s (SAS-1) is still above my optimal max anyway though. In essence, if the drives can provide it (and the network interface is ignored) I think the theoretical limitation is 3GB/s. I mentioned 1.25GB/s for the 10GigE as the max I am looking at, but I'd be happy with anywhere between 500MB/s and 1GB/s for sequential reads of large files (don't really care about any type of writes, and hopefully random reads do not happen often *will test with iopattern*).

> What is the uncorrectable error rate on these 3TB drives? What is the real random I/Ops capability of these 3TB drives?

I'm unsure of these myself; all the other parts have arrived, or are en route, but I have not actually bought the HDDs yet so I can still choose almost anything. It will probably be the cheapest consumer drives I can get though (probably "Seagate Barracuda Green ST3000DM001"s or "Hitachi 5K3000 3TB"s). The 1.5TBs I have in my old system are pretty much the same thing.

> How much space do you _need_, including reasonable growth?

My old system is 9.55TB and almost full, and I have about 3TB spread out elsewhere. This was set up about 5 years ago. With the 20 disk enclosure I'm thinking about 30TB usable space (but maybe only using 15 disks at first), and hoping it'll last for another 5 years.

> How did you measure this?

ATTO Benchmark is what I used on the local machine for the 500MB/s number. For small files (1kB-16kB) it is small (50MB-150MB); for the larger 256kB+ transfers it reads ~550MB/s. This is hardware RAID5 though. Over the 1Gbit network Windows 7 always gets up to ~100MB/s when writing/reading from the RAID5 share.

> What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM

The new ZFS system OS will probably be OpenIndiana with the v28 zpool. I have been looking at FreeNAS (FreeBSD) and am a little up in the air on which to choose.

Thank you all for the information. I will very likely create two zpools (one for the 1.5TB drives, and one for the 3TB drives). Initially I thought down the road if the pool ever fills up (probably like 5+ years) I would start swapping the 1.5TB drives with 3TB drives to let the small vdev "expand" after all were replaced, but I didn't realize there could potentially be problems from the sector size differences between 1.5TB drives (~5 years old) and 3TB+ drives (~5 years in the future).
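For the iopattern test mentioned above, a sketch assuming the DTraceToolkit is installed on the OpenIndiana box (path and options may differ by version):

  # report the random vs. sequential split of disk I/O every 10 seconds
  ./iopattern 10
  # columns include %RAN and %SEQ plus I/O counts, sizes and KB read/written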
Bob Friesenhahn
2012-Mar-21 19:59 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, 21 Mar 2012, maillist reader wrote:
> I read though that ZFS does not have a "defragmentation" tool, is this still the case? It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

ZFS does not usually suffer significantly from fragmentation. To be clear, "random reads" means random file access and the necessary head seeks. Any mechanical-based device will suffer from this, and it is not specific to zfs. Something which is specific to zfs is that within a vdev, a stripe must be read and written completely. Partial stripe reads are not supported, since a full read is necessary in order to validate the block checksum.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2012-Mar-22 10:03 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 22:53, Richard Elling wrote:
...
>> This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.
>
> It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
>     average IOPS * ((D+P) / D)
> where,
>     D = number of data disks
>     P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
>     total disks per set = D + P

I wrote in this thread that AFAIK for small blocks (i.e. 1-sector size worth of data) there would be P+1 sectors used to store the block, which is an even worse case, at least capacity-wise, as well as impacting fragmentation => seeks, but it might occasionally allow parallel reads of different objects (tasks running on disks not involved in storage of the one data sector and maybe its parities, when required).

Is there any truth to this picture?

Were there any research or tests regarding storage of many small files (1-sector sized or close to that) on different vdev layouts? I believe that such files would use a single-sector-sized set of indirect blocks (dittoed at least twice), so one single-sector sized file would use at least 9 sectors in raidz2.

Thanks :)

> We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at OpenStorage Summit last fall.

What about drives without (a good implementation of) NCQ/TCQ/whatever? Does ZFS in-kernel caching, queuing and sorting of pending requests provide a similar service? Is it controllable with the same switch?

Or, alternatively, is it a kernel-only feature which does not depend on hardware *CQ? Are there any benefits to disks with *CQ then? :)
Edward Ned Harvey
2012-Mar-22 16:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Paul Kraus
>
> There are two different cases here... resilver to reconstruct data from a failed drive and a scrub to proactively find bad sectors.
>
> The best case situation for the first case (bad drive replacement) is a mirrored drive in my experience. In that case only the data involved in the failure needs to be read and written. I am

During a resilver, all the data in the vdev must be read & reconstructed to write the new disk. Notice I said "vdev." If you have a pool made of a single vdev, then that means all the data in your pool. However, if you have a pool made of a million vdevs, then ~ one millionth of the pool must be read.

If you configured your pool using mirrors instead of raidz, then you have minimized the size of your vdevs and maximized the IOPS you're able to perform *per* vdev. So mirrors resilver many times faster than raidz, but still, mirrors in my experience resilver ~10x slower than blindly reading & writing the entire disk serially.

In my experience, a hardware raid resilver takes a couple or a few hours (divide total disk size by total sustainable throughput), while a zfs mirror resilver takes a dozen hours, or a day or two (lots of random IO). While raidz takes several days, if not multiple weeks, to resilver. Of course all this is variable and dependent on both your data and usage patterns.
Edward Ned Harvey
2012-Mar-22 16:42 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of maillist reader
>
> I read though that ZFS does not have a "defragmentation" tool, is this still the case?

True.

> It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

It is high on the todo list, and in fact a lot of other useful stuff is dependent on the same code, so when/if implemented, it will enable a lot of new features, where defrag is just one such new feature.

However, there's a very difficult decision regarding *what* you count as defragmentation (not to mention a lot of work to be done). The goal of defrag is to align data on disk serially so as to maximize the useful speed of the disks. Unfortunately, there are some really big competing demands, where data is read in different orders. For example, the traditional perception of defrag would align the disk blocks of individual files. Thus, when you later return to read those files sequentially, you would have maximum performance. But that's not the same order of data read as compared to scrub/resilver/zfs send. Scrub/resilver/zfs send operate in (at least approximately) temporal order. So if you defrag at the file level, you hurt the performance of scrub/resilver/send. If you defrag at the temporal pool level (which is the default position, the current behavior) you hurt the performance of file operations. Pick your poison.
Richard Elling
2012-Mar-22 16:52 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Mar 22, 2012, at 3:03 AM, Jim Klimov wrote:

> 2012-03-21 22:53, Richard Elling wrote:
> ...
>>> This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.
>>
>> It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
>>     average IOPS * ((D+P) / D)
>> where,
>>     D = number of data disks
>>     P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
>>     total disks per set = D + P
>
> I wrote in this thread that AFAIK for small blocks (i.e. 1-sector size worth of data) there would be P+1 sectors used to store the block, which is an even worse case at least capacity-wise, as well as impacting fragmentation => seeks, but might occasionally allow parallel reads of different objects (tasks running on disks not involved in storage of the one data sector and maybe its parities when required).
>
> Is there any truth to this picture?

Yes, but it is a rare case for 512b sectors. It could be more common for 4KB sector disks when ashift=12. However, in that case the performance increases to the equivalent of mirroring, so there are some benefits. FWIW, some people call this "RAID-1E".

> Were there any research or tests regarding storage of many small files (1-sector sized or close to that) on different vdev layouts?

It is not a common case, so why bother?

> I believe that such files would use a single-sector-sized set of indirect blocks (dittoed at least twice), so one single-sector sized file would use at least 9 sectors in raidz2.

No. You can't account for the metadata that way. Metadata space is not 1:1 with data space. Metadata tends to get written in 16KB chunks, compressed.

> Thanks :)
>
>> We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at OpenStorage Summit last fall.
>
> What about drives without (a good implementation of) NCQ/TCQ/whatever?

All HDDs I've tested suck. The form of the suckage is that the number of IOPS stays relatively constant, but the average latency increases dramatically. This makes sense, due to the way elevator algorithms work.

> Does ZFS in-kernel caching, queuing and sorting of pending requests provide a similar service? Is it controllable with the same switch?

There are many caches at play here, with many tunables. The analysis doesn't fit in an email.

> Or, alternatively, is it a kernel-only feature which does not depend on hardware *CQ? Are there any benefits to disks with *CQ then? :)

Yes, SSDs with NCQ work very well.
 -- richard

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
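For reference, a sketch of how zfs_vdev_max_pending was typically adjusted on Solaris-derived systems of that era (the value 10 is only an example; consult the Evil Tuning Guide before touching it):

  # change it on a live system:
  echo zfs_vdev_max_pending/W0t10 | mdb -kw
  # or persistently, by adding a line to /etc/system:
  set zfs:zfs_vdev_max_pending = 10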
Jim Klimov
2012-Mar-22 18:01 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-22 20:52, Richard Elling wrote:
> Yes, but it is a rare case for 512b sectors.
> It could be more common for 4KB sector disks when ashift=12.
...
>> Were there any research or tests regarding storage of many small
>> files (1-sector sized or close to that) on different vdev layouts?
>
> It is not a common case, so why bother?

I think that a certain Bob F. would disagree, especially when larger
native sectors and ashift=12 come into play. Namely, one scenario where
this is important is automated storage of thumbnails for websites, or of
similar small objects in vast amounts. I agree that hordes of 512b files
would be rare; 4KB-sized files (or a bit larger - 2-3 userdata sectors)
are a lot more probable ;)

>> I believe that such files would use a single-sector-sized set of
>> indirect blocks (dittoed at least twice), so one single-sector
>> sized file would use at least 9 sectors in raidz2.
>
> No. You can't account for the metadata that way. Metadata space is not
> 1:1 with data space. Metadata tends to get written in 16KB chunks,
> compressed.

I deliberately made an example of single-sector-sized files. The way I
understand it (maybe wrongly), the tree of indirect blocks (the dnode?)
for a file is stored separately from other similar objects. While
different L0 blkptr_t objects (BPs) "parented" by the same L1 object are
stored as a single block on disk (128 BPs of 128 bytes each = 16KB),
further redundanced and ditto-copied, I believe that L0 BPs from
different files are stored in separate blocks - as are L0 BPs parented
by different L1 BPs covering different byte ranges of the same file.
Likewise for further layers of L(N+1) pointers if the file is
sufficiently large (in the number of blocks used to write it).

The BP tree for a file is itself an object in a ZFS dataset,
individually referenced (as an inode number), and there is a pointer to
its root from the DMU dnode of the dataset.

If the above is true, then a single-block file should have a single L0
blkptr as its whole indirect tree of block pointers, and that L0 would
be stored in a dedicated block (not shared with other files' BPs),
inflated by ditto copies=2 and raidz/mirror redundancy.

Right/wrong?

Thanks,
//Jim
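As a sanity check on the single-sector case, here is a sketch of the raidz
allocation arithmetic as I understand it (roughly: data sectors, plus one
parity sector per parity device per row, rounded up to a multiple of
parity+1). Treat it as an illustration of the "P+1 sectors" point rather
than an authoritative accounting, and note it says nothing about the
metadata blocks discussed above:

import math

def raidz_allocated_sectors(data_sectors, total_disks, parity):
    # Data sectors + parity sectors per "row", padded up to a multiple of
    # (parity + 1). A sketch of my reading of the raidz allocator, not a
    # substitute for what ZFS actually writes (metadata excluded).
    data_disks = total_disks - parity
    parity_sectors = math.ceil(data_sectors / data_disks) * parity
    total = data_sectors + parity_sectors
    return math.ceil(total / (parity + 1)) * (parity + 1)

# One-sector file block on an 8-disk raidz2: 1 data + 2 parity = 3 (P+1).
print(raidz_allocated_sectors(1, 8, 2))    # -> 3
# A 32-sector block (128KB at ashift=9) on the same vdev:
print(raidz_allocated_sectors(32, 8, 2))   # -> 45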
Bob Friesenhahn
2012-Mar-22 20:33 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Thu, 22 Mar 2012, Jim Klimov wrote:
>
> I think that a certain Bob F. would disagree, especially
> when larger native sectors and ashift=12 come into play.
> Namely, one scenario where this is important is automated
> storage of thumbnails for websites, or some similar small
> objects in vast amounts.

I don't know about that Bob F., but this Bob F. just took a look and
noticed that thumbnail files for full-color images are typically 4KB or
a bit larger. Low-color thumbnails can be much smaller.

For a very large photo site, it would make sense to replicate just the
thumbnails across a number of front-end servers and put the larger files
on fewer storage servers, because the large files are requested much
less often and stream out better. This would mean that those front-end
"thumbnail" servers would primarily contain small files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Jeff Bacon
2012-Mar-24 23:33 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> 2012-03-21 16:41, Paul Kraus wrote:
>> I have been running ZFS in a mission critical application since
>> zpool version 10 and have not seen any issues with some of the vdevs
>> in a zpool full while others are virtually empty. We have been running
>> commercial Solaris 10 releases. The configuration was that each
>
> Thanks for sharing some real-life data from larger deployments,
> as you often did. That's something I don't often have access
> to nowadays, with a liberty to tell :)

Here's another datapoint, then:

I'm using sol10u9 and u10 on a number of Supermicro boxes, mostly X8DTH
boards with LSI 9211/9208 controllers and E5600 CPUs. The application is
NFS file service to a bunch of clients, and we also have an in-house
database application written in Java which implements a column-oriented
db in files. Just about all of it is raidz2, much of it running
gzip-compressed.

Since I can't find anything saying not to - other than some common
wisdom about not putting all your eggs in one basket, which I'm choosing
to reject in some cases - I just keep adding vdevs to the pool. We
started with 2TB Barracudas for dev/test/archive usage and Constellations
for prod, now 3TB drives, and have just added some of the new Pipeline
drives, with nothing of particular interest to note so far.

You can create a startlingly large pool this way:

ny-fs7(68)% zpool list
NAME   SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
srv    177T   114T  63.3T  64%  ONLINE  -

Most pools are smaller. This is an archive box that's also the guinea
pig: 12 vdevs of 7 drives in raidz2. The largest prod one is 130TB in
11 vdevs of 8 drives raidz2; I won't guess at the mix of 2TB and 3TB.
These are both sol10u9.

Another box has 150TB in 6 pools, raidz2/gzip using 2TB Constellations,
dual X5690s with 144GB RAM running 20-30 Java db workers. We do manage
to break this box on the odd occasion - there's a race condition in the
ZIO code where a buffer can be freed while the block buffer is in the
process of being "loaned" out to the compression code. However, it takes
600 zpool threads plus another 600-900 Java threads running at the same
time with a backlog of 80000 ZIOs in queue, so it's not the sort of
thing that anyone's likely to run across much. :) It's fixed in sol11, I
understand; however, our intended fix is to split the whole thing so
that the workload (which for various reasons needs to be on one box) is
moved to a 4-socket Westmere, and all of the data pools are served via
NFS from other boxes.

I did lose some data once, long ago, using LSI 1068-based controllers
on older kit, but pretty much I can attribute that to something between
me-being-stupid and the 1068s really not being especially friendly
towards the LSI expander chips in the older 3Gb/s SMC backplanes when
used for SATA-over-SAS tunneling. The current arrangements are pretty
solid otherwise.

The SATA-based boxes can be a little cranky when a drive toasts, of
course - they sit and hang for a while until they finally decide to
offline the drive. We take that as par for the course; for the
application in question (basically, storing huge amounts of data on the
odd occasion that someone has a need for it), it's not exactly a
showstopper.

I am curious as to whether there is any practical upper limit on the
number of vdevs, or how far one might push this kind of configuration
in terms of pool size - assuming a sufficient quantity of RAM, of
course... I'm sure I will need to split this up someday but for the
application there's just something hideously convenient about leaving
it all in one filesystem in one pool.

-bacon
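For readers trying to map drive counts onto numbers like the 177T above,
here is a very rough usable-capacity estimator. All the simplifications are
mine: identical drives, no accounting for metadata, allocation padding, pool
slop space, or the 2TB/3TB mix described in the post:

def raidz_usable_tib(vdevs, disks_per_vdev, parity, drive_tb):
    # Data capacity of identical raidz vdevs with parity removed,
    # converted from decimal-TB drives to TiB. Ballpark only: ignores
    # metadata, raidz padding and reserved slop space.
    tib_per_drive = drive_tb * 1e12 / 2**40
    return vdevs * (disks_per_vdev - parity) * tib_per_drive

# Hypothetical all-3TB version of a 12 x (7-disk raidz2) pool:
print("%.0f TiB usable (approx)" % raidz_usable_tib(12, 7, 2, 3.0))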
Richard Elling
2012-Mar-25 00:06 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
Thanks for sharing, Jeff! Comments below...

On Mar 24, 2012, at 4:33 PM, Jeff Bacon wrote:
> [...]
> I did lose some data once, long ago, using LSI 1068-based controllers
> on older kit, but pretty much I can attribute that to something between
> me-being-stupid and the 1068s really not being especially friendly
> towards the LSI expander chips in the older 3Gb/s SMC backplanes when
> used for SATA-over-SAS tunneling. The current arrangements are pretty
> solid otherwise.

In general, mixing SATA and SAS directly behind expanders (eg without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.

> The SATA-based boxes can be a little cranky when a drive toasts, of
> course - they sit and hang for a while until they finally decide to
> offline the drive. We take that as par for the course; for the
> application in question (basically, storing huge amounts of data on the
> odd occasion that someone has a need for it), it's not exactly a
> showstopper.
>
> I am curious as to whether there is any practical upper limit on the
> number of vdevs, or how far one might push this kind of configuration
> in terms of pool size - assuming a sufficient quantity of RAM, of
> course... I'm sure I will need to split this up someday but for the
> application there's just something hideously convenient about leaving
> it all in one filesystem in one pool.

I've run pools with > 100 top-level vdevs. It is not uncommon to see 40+
top-level vdevs.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Jeff Bacon
2012-Mar-25 13:26 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> In general, mixing SATA and SAS directly behind expanders (eg without
> SAS/SATA interposers) seems to be bad juju that an OS can't fix.

In general I'd agree. Just mixing them in the same box can be
problematic, I've noticed - though I think as much as anything that the
firmware on the 3G/s expanders just isn't as well-tuned as the firmware
on the 6G/s expanders, and for all I know there's a firmware update that
will make things better.

SSDs seem to be an exception, however. Several boxes have a mix of
Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
purposes on the expander with the Constellations, or in one case,
Cheetah 15ks. One box has SSDs and Cheetah 15ks/Constellations on the
same expander under massive loads - the aforementioned box suffering
from 80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
we were force-swapping processes out to disk as part of task management.
Meanwhile, the Java processes are doing batch import processing using
the Cheetahs as a staging area, so those two expanders are under
constant heavy load. Yes, that is as ugly as it sounds; don't ask, and
don't do this yourself. This is what happens when you develop a database
without clear specs and have to just throw hardware underneath it,
guessing all the way. But it gives you an idea of the load they were/are
under.)

The SSDs were chosen with an eye towards expander-friendliness, and
tested relatively extensively before use. YMMV of course, and this is
nowhere to skimp on a-data or Kingston; buy what Anand says to buy and
you seem to do very well.

I would say: never do it on LSI 3G/s expanders. Be careful with using
SATA spindles. Test the hell out of any SSD you use first. But you do
seem to be able to get away with the better consumer-class SATA SSDs.

(I realize that many here would say that if you are going to use SSDs
in an enterprise config, you shouldn't be messing with anything short
of Deneva or the SAS-based SSDs. I'd say there are simply a bunch of
caveats with the consumer MLC SSDs in such situations to consider, and
if you are very clear about them up front, then they can be just fine.

I suspect the real difficulty in these situations is in having a
management chain that is capable of both grokking the caveats up front
and remembering that they agreed to them when something does go wrong.
:) As in this case I am the management chain, it's not an issue. This
is of course not the usual case.)

-bacon
Richard Elling
2012-Mar-25 16:55 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Mar 25, 2012, at 6:26 AM, Jeff Bacon wrote:
>> In general, mixing SATA and SAS directly behind expanders (eg without
>> SAS/SATA interposers) seems to be bad juju that an OS can't fix.
>
> In general I'd agree. Just mixing them in the same box can be
> problematic, I've noticed - though I think as much as anything that the
> firmware on the 3G/s expanders just isn't as well-tuned as the firmware
> on the 6G/s expanders, and for all I know there's a firmware update
> that will make things better.

I haven't noticed a big difference in the expanders. Does anyone else
see issues with 6G expanders?

> SSDs seem to be an exception, however. Several boxes have a mix of
> Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
> purposes on the expander with the Constellations, or in one case,
> Cheetah 15ks. [...] But it gives you an idea of the load they were/are
> under.)

Sometime over beers we can trade war stories... many beers... :-)

> The SSDs were chosen with an eye towards expander-friendliness, and
> tested relatively extensively before use. YMMV of course, and this is
> nowhere to skimp on a-data or Kingston; buy what Anand says to buy and
> you seem to do very well.

Yes. Be aware that companies like Kingston rebadge drives from other,
reputable suppliers. And some reputable suppliers have less-than-perfect
models.

> I would say: never do it on LSI 3G/s expanders. Be careful with using
> SATA spindles. Test the hell out of any SSD you use first. But you do
> seem to be able to get away with the better consumer-class SATA SSDs.
>
> (I realize that many here would say that if you are going to use SSDs
> in an enterprise config, you shouldn't be messing with anything short
> of Deneva or the SAS-based SSDs. I'd say there are simply a bunch of
> caveats with the consumer MLC SSDs in such situations to consider, and
> if you are very clear about them up front, then they can be just fine.
>
> I suspect the real difficulty in these situations is in having a
> management chain that is capable of both grokking the caveats up front
> and remembering that they agreed to them when something does go wrong.
> :) As in this case I am the management chain, it's not an issue. This
> is of course not the usual case.)

We'd like to think that given the correct information, reasonable people
will make the best choice. And then there are PHBs.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422