According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?

I looked at a server earlier this week that was running FreeBSD 8.0 and had 2 x 1 TB SAS disks in a ZFS v13 mirror with a third identical disk as a spare. Large-file I/O throughput was OK but the mail jail it hosted had periods when it was very slow at accessing lots of small files. All three disks (the two in the ZFS mirror plus the spare) had been partitioned with gpart so that partition 1 was a 6 GB swap and partition 2 filled the rest of the disk and carried a 'freebsd-zfs' partition. It was these second partitions that were part of the mirror.

This doesn't sound like a very good idea to me, as surely disk seeks for swap and for ZFS file I/O are bound to clash, aren't they?

Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDZ pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares, but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD?

Andy
On Oct 11, 2012, at 4:47 PM, andy thomas wrote:

> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?

My understanding of the best practice was that Solaris, prior to ZFS, disabled the volatile disk cache. With ZFS, the disk cache is used, but after every transaction a cache-flush command is issued to ensure that the data made it to the platters. If you slice the disk, enabling the disk cache for the whole disk is dangerous because other file systems (meaning UFS) wouldn't do the cache-flush and there was a risk of data loss should the cache fail due to, say, a power outage. Can't speak to how BSD deals with the disk cache.

> I looked at a server earlier this week that was running FreeBSD 8.0 and had 2 x 1 TB SAS disks in a ZFS v13 mirror with a third identical disk as a spare. Large-file I/O throughput was OK but the mail jail it hosted had periods when it was very slow at accessing lots of small files. All three disks (the two in the ZFS mirror plus the spare) had been partitioned with gpart so that partition 1 was a 6 GB swap and partition 2 filled the rest of the disk and carried a 'freebsd-zfs' partition. It was these second partitions that were part of the mirror.
>
> This doesn't sound like a very good idea to me, as surely disk seeks for swap and for ZFS file I/O are bound to clash, aren't they?

It surely would make a slow, memory-starved swapping system even slower. :)

> Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDZ pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares, but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD?

That number was drives per vdev, not per pool.

-Phil
On Thu, Oct 11, 2012 at 2:47 PM, andy thomas <andy at time-domain.co.uk> wrote:

> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?

Solaris disabled the disk cache if the disk was partitioned, thus the recommendation to always use the entire disk with ZFS.

FreeBSD's GEOM architecture allows the disk cache to be enabled whether you use the full disk or partition it.

Personally, I find it nicer to use GPT partitions on the disk. That way, you can start the partition at 1 MB ("gpart add -b 2048" on 512B disks, or "gpart add -b 512" on 4K disks), leave a little wiggle-room at the end of the disk, and use GPT labels to identify the disk (using gpt/label-name for the device when adding to the pool).

> Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDZ pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares, but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD?

You can have multiple disks in a vdev. And you can have multiple vdevs in a pool. Thus, you can have hundreds of disks in a pool. :) Just split the disks up into multiple vdevs, where each vdev is under 9 disks. :)

For example, we have 25 disks in the following pool, but only 6 disks in each vdev (plus log/cache):

[root at alphadrive ~]# zpool list -v
NAME                       SIZE  ALLOC   FREE    CAP  DEDUP    HEALTH  ALTROOT
storage                   24.5T  20.7T  3.76T    84%  3.88x  DEGRADED  -
  raidz2                  8.12T  6.78T  1.34T      -
    gpt/disk-a1               -      -      -      -
    gpt/disk-a2               -      -      -      -
    gpt/disk-a3               -      -      -      -
    gpt/disk-a4               -      -      -      -
    gpt/disk-a5               -      -      -      -
    gpt/disk-a6               -      -      -      -
  raidz2                  5.44T  4.57T   888G      -
    gpt/disk-b1               -      -      -      -
    gpt/disk-b2               -      -      -      -
    gpt/disk-b3               -      -      -      -
    gpt/disk-b4               -      -      -      -
    gpt/disk-b5               -      -      -      -
    gpt/disk-b6               -      -      -      -
  raidz2                  5.44T  4.60T   863G      -
    gpt/disk-c1               -      -      -      -
    replacing                 -      -      -   932G
      6255083481182904200    -      -      -      -
      gpt/disk-c2             -      -      -      -
    gpt/disk-c3               -      -      -      -
    gpt/disk-c4               -      -      -      -
    gpt/disk-c5               -      -      -      -
    gpt/disk-c6               -      -      -      -
  raidz2                  5.45T  4.75T   720G      -
    gpt/disk-d1               -      -      -      -
    gpt/disk-d2               -      -      -      -
    gpt/disk-d3               -      -      -      -
    gpt/disk-d4               -      -      -      -
    gpt/disk-d5               -      -      -      -
    gpt/disk-d6               -      -      -      -
  gpt/log                 1.98G   460K  1.98G      -
cache                         -      -      -      -      -      -
  gpt/cache1              32.0G  32.0G     8M      -

-- 
Freddie Cash
fjwcash at gmail.com
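To make the 1 MB alignment and GPT-label approach above concrete, a minimal sketch (the device names, label names and sizes here are hypothetical, and a reasonably recent gpart with 512-byte-sector disks is assumed):

    # GPT scheme on a blank disk: swap starting at 1 MiB, the rest of the disk for ZFS
    gpart create -s gpt da0
    gpart add -b 2048 -s 12582912 -t freebsd-swap -l swap0 da0   # 6 GB of swap
    gpart add -t freebsd-zfs -l disk0 da0                        # remaining space for ZFS
    # repeat for da1 with labels swap1/disk1, then build the mirror on the labels
    zpool create tank mirror gpt/disk0 gpt/disk1

Referring to gpt/disk0 rather than da0p2 keeps the pool importable even if the da numbering changes after a controller or cabling change.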
On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom <phillip.wagstrom at gmail.com> wrote:

> On Oct 11, 2012, at 4:47 PM, andy thomas wrote:
>
>> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?
>
> My understanding of the best practice was that Solaris, prior to ZFS, disabled the volatile disk cache.

This is not quite correct. If you use the whole disk, ZFS will attempt to enable the write cache. To understand why, remember that UFS (and ext, by default) can die a horrible death (+fsck) if there is a power outage and cached data is not flushed to disk. So Sun shipped some disks with the write cache disabled by default. Non-Sun disks are most often shipped with the write cache enabled, and the most popular file systems (NTFS) properly issue cache-flush requests as needed (for the same reason ZFS issues cache-flush requests).

> With ZFS, the disk cache is used, but after every transaction a cache-flush command is issued to ensure that the data made it to the platters.

Write cache is flushed after uberblock updates and for ZIL writes. This is important for uberblock updates, so the uberblock doesn't point to a garbaged MOS. It is important for ZIL writes, because they must be guaranteed written to media before the ack.

 -- richard

> [...]

--
Richard.Elling at RichardElling.com
+1-760-896-4422
On 11/10/12 5:47 PM, andy thomas wrote:
> ...
> This doesn't sound like a very good idea to me, as surely disk seeks for
> swap and for ZFS file I/O are bound to clash, aren't they?

As Phil implied, if your system is swapping, you already have bigger problems.

--Toby
On Thu, 11 Oct 2012, Freddie Cash wrote:

> On Thu, Oct 11, 2012 at 2:47 PM, andy thomas <andy at time-domain.co.uk> wrote:
>> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?
>
> Solaris disabled the disk cache if the disk was partitioned, thus the
> recommendation to always use the entire disk with ZFS.
>
> FreeBSD's GEOM architecture allows the disk cache to be enabled
> whether you use the full disk or partition it.
>
> Personally, I find it nicer to use GPT partitions on the disk. That
> way, you can start the partition at 1 MB ("gpart add -b 2048" on 512B
> disks, or "gpart add -b 512" on 4K disks), leave a little wiggle-room
> at the end of the disk, and use GPT labels to identify the disk (using
> gpt/label-name for the device when adding to the pool).

This is apparently what had been done in this case:

   gpart add -b 34 -s 6000000 -t freebsd-swap da0
   gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1

   gpart show
   (stuff relating to a compact flash/SATA boot disk deleted)

   =>        34  1953525101  da0  GPT  (932G)
             34     6000000    1  freebsd-swap  (2.9G)
        6000034  1947525101    2  freebsd-zfs  (929G)

   =>        34  1953525101  da2  GPT  (932G)
             34     6000000    1  freebsd-swap  (2.9G)
        6000034  1947525101    2  freebsd-zfs  (929G)

   =>        34  1953525101  da1  GPT  (932G)
             34     6000000    1  freebsd-swap  (2.9G)
        6000034  1947525101    2  freebsd-zfs  (929G)

Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last year after it kept crashing with out-of-memory reports on the console screen), so I doubt the swap would actually be used very often.

Running bonnie++ on this pool comes up with some very good results for sequential disk writes, but the latency of over 43 seconds for block reads is terrible and is obviously impacting performance as a mail server, as shown here:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hsl-main.hsl.of 24G    63  67 80584  20 70568  17   314  98 554226  60 410.1  13
Latency             77140us   43145ms   28872ms     171ms     212ms     232ms
Version  1.96       ------Sequential Create------ --------Random Create--------
hsl-main.hsl.office -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 19261  93 +++++ +++ 18491  97 21542  92 +++++ +++ 20691  94
Latency             15399us     488us     226us   27733us     103us     138us

The other issue with this server is that it needs to be rebooted every 8-10 weeks, as disk I/O slows to a crawl over time and the server becomes unusable. After a reboot, it's fine again. I'm told ZFS v13 on FreeBSD 8.0 has a lot of problems, so I was planning to rebuild the server with FreeBSD 9.0 and ZFS v28, but I didn't want to make any basic design mistakes in doing this.

>> Another point about the Sun ZFS paper - it mentioned optimum performance
>> would be obtained with RAIDZ pools if the number of disks was between 3 and
>> 9. So I've always limited my pools to a maximum of 9 active disks plus
>> spares, but the other day someone here was talking of seeing hundreds of
>> disks in a single pool! So what is the current advice for ZFS in Solaris and
>> FreeBSD?
>
> You can have multiple disks in a vdev. And you can have multiple vdevs in
> a pool. Thus, you can have hundreds of disks in a pool. :) Just split the
> disks up into multiple vdevs, where each vdev is under 9 disks. :)
> For example, we have 25 disks in the following pool, but only 6 disks in
> each vdev (plus log/cache):
>
> [zpool list -v output snipped - shown in full in Freddie's message above]

Great, thanks for the explanation! I didn't realise you could have a sort of 'stacked pyramid' vdev/pool structure.

Andy
On Thu, 11 Oct 2012, Richard Elling wrote:

> On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom <phillip.wagstrom at gmail.com> wrote:
>
>> My understanding of the best practice was that Solaris, prior to ZFS, disabled the volatile disk cache.
>
> This is not quite correct. If you use the whole disk, ZFS will attempt to enable the write cache. To understand why, remember that UFS (and ext, by default) can die a horrible death (+fsck) if there is a power outage and cached data is not flushed to disk. So Sun shipped some disks with the write cache disabled by default. Non-Sun disks are most often shipped with the write cache enabled, and the most popular file systems (NTFS) properly issue cache-flush requests as needed (for the same reason ZFS issues cache-flush requests).

Out of interest, how do you enable the write cache on a disk? I recently replaced a failing Dell-branded disk on a Dell server with an HP-branded disk (both disks were the identical Seagate model) and, on running the EFI diagnostics just to check all was well, it reported the write cache was disabled on the new HP disk but enabled on the remaining Dell disks in the server. I couldn't see any way of enabling the cache from the EFI diags, so I left it as it was - probably not ideal.

>> With ZFS, the disk cache is used, but after every transaction a cache-flush command is issued to ensure that the data made it to the platters.
>
> Write cache is flushed after uberblock updates and for ZIL writes. This is important for uberblock updates, so the uberblock doesn't point to a garbaged MOS. It is important for ZIL writes, because they must be guaranteed written to media before the ack.

Thanks for the explanation, that all makes sense now.

Andy
---------------------------------
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
Fax: +44 (0)20 8372 2582
http://www.time-domain.co.uk
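On the question of how to enable a disk's write cache, a hedged sketch using standard OS tools (device names are examples, behaviour differs per controller and firmware, and disks sitting behind a RAID/HBA may expose this setting only in the controller's own configuration utility):

    # FreeBSD, SAS/SCSI disks: inspect or edit the caching mode page
    camcontrol modepage da0 -m 8       # look at the WCE (write cache enable) field
    camcontrol modepage da0 -m 8 -e    # opens an editor; set WCE to 1 to enable
    # FreeBSD, SATA disks: the ada driver has a write-cache tunable
    sysctl kern.cam.ada.write_cache
    # Solaris: format's expert mode has a cache menu for SCSI disks
    format -e                          # select the disk, then: cache -> write_cache -> enable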
2012-10-12 11:11, andy thomas wrote:
> Great, thanks for the explanation! I didn't realise you could have a
> sort of 'stacked pyramid' vdev/pool structure.

Well, you can - the layers are "pool" - "top-level VDEVs" - "leaf VDEVs", though on trivial pools like single-disk ones the layers kinda merge into one or two :) This should be described in the manpage in greater detail.

So the pool stripes over top-level VDEVs (TLVDEVs), roughly by round-robining whole logical blocks upon write, and then each TLVDEV, depending on its redundancy configuration, forms the sectors to be written onto its component leaf vdevs (low-level disks, partitions or slices, LUNs, files, etc.).

Since full-stripe writes are not required by ZFS, smaller blocks can consume fewer sectors than there are leaves (disks) in a TLVDEV, but this does not result in lost-space "holes" nor in read-modify-write cycles like on full-stripe RAID systems. If there is a free "hole" of contiguous logical addressing (roughly, striped across leaf vdevs within the TLVDEV) where the userdata sectors (after optional compression) plus the redundancy sectors fit, it will be used. I guess it is because of this contiguous addressing that a raidzN TLVDEV cannot (currently) change the number of component disks, and a pool cannot decrease the number of TLVDEVs.

If you add new TLVDEVs to an existing pool, the ZFS algorithms will try to put more load on the emptier TLVDEVs and balance the writes, although according to discussions this can still lead to imbalance and performance problems on particular installations.

In fact, you can (although it is not recommended, for balancing reasons) have TLVDEVs of mixed size (like in Freddie's example) and even of different structure (i.e. mixing raidz and mirrors or even single LUNs) by forcing the disk attachment. Note however that the loss of a TLVDEV kills your whole pool, so don't stripe important data over single disks/LUNs ;) And you don't have control over what gets written where, so you'd also get an averaged performance mix of raidz and mirrors, with unpredictable performance for any particular userdata block's storage.

HTH,
//Jim Klimov
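For illustration, a pool of that "stacked" shape - several six-disk raidz2 top-level vdevs plus log and cache devices - could be created and later grown like this (the pool name and gpt labels are placeholders borrowed from Freddie's listing):

    # three top-level raidz2 vdevs, a separate log device and a cache device
    zpool create storage \
      raidz2 gpt/disk-a1 gpt/disk-a2 gpt/disk-a3 gpt/disk-a4 gpt/disk-a5 gpt/disk-a6 \
      raidz2 gpt/disk-b1 gpt/disk-b2 gpt/disk-b3 gpt/disk-b4 gpt/disk-b5 gpt/disk-b6 \
      raidz2 gpt/disk-c1 gpt/disk-c2 gpt/disk-c3 gpt/disk-c4 gpt/disk-c5 gpt/disk-c6 \
      log gpt/log cache gpt/cache1
    # grow the pool later by adding another top-level vdev (it cannot be removed afterwards)
    zpool add storage raidz2 gpt/disk-d1 gpt/disk-d2 gpt/disk-d3 gpt/disk-d4 gpt/disk-d5 gpt/disk-d6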
On 2012-Oct-12 08:11:13 +0100, andy thomas <andy at time-domain.co.uk> wrote:
> This is apparently what had been done in this case:
>
>    gpart add -b 34 -s 6000000 -t freebsd-swap da0
>    gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1
>
>    gpart show

Assuming that you can be sure that you'll keep 512B-sector disks, that's OK, but I'd recommend that you align both the swap and ZFS partitions on at least 4 KiB boundaries for future-proofing (i.e. you can safely stick the same partition table onto a 4 KiB disk in future).

> Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last year after it kept crashing with out-of-memory reports on the console screen), so I doubt the swap would actually be used very often.

Having enough swap to hold a crash dump is useful. You might consider using gmirror for swap redundancy (though 3-way is overkill). (And I'd strongly recommend against swapping to a zvol or ZFS - FreeBSD has "issues" with that combination.)

> The other issue with this server is that it needs to be rebooted every 8-10 weeks, as disk I/O slows to a crawl over time and the server becomes unusable. After a reboot, it's fine again. I'm told ZFS v13 on FreeBSD 8.0 has a lot of problems

Yes, it does - and your symptoms match one of the problems. Does top(1) report lots of inactive and cache memory and very little free memory, and a high kstat.zfs.misc.arcstats.memory_throttle_count, once I/O starts slowing down?

> so I was planning to rebuild the server with FreeBSD 9.0 and ZFS v28, but I didn't want to make any basic design mistakes in doing this.

I'd suggest you test 9.1-RC2 (just released) with a view to using 9.1, rather than installing 9.0. Since your questions are FreeBSD-specific, you might prefer to ask on the freebsd-fs list.

-- 
Peter Jeremy
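A sketch of both suggestions - alignment and gmirror-backed swap (device names are examples, the gpart -a flag and size suffixes assume FreeBSD 9.x or later, and the loader.conf/fstab lines are the usual way to make it persistent):

    # align partitions to 1 MiB so the same table also suits 4K-sector disks
    gpart add -a 1m -s 6g -t freebsd-swap da0        # 6 GB swap partition
    gpart add -a 1m -t freebsd-zfs -l disk0 da0      # rest of the disk for ZFS
    # mirror the swap partitions of two disks with gmirror
    gmirror load
    gmirror label -v -b round-robin swap /dev/da0p1 /dev/da1p1
    swapon /dev/mirror/swap
    # persist across reboots:
    #   /boot/loader.conf:  geom_mirror_load="YES"
    #   /etc/fstab:         /dev/mirror/swap  none  swap  sw  0  0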
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-12 13:12 UTC
[zfs-discuss] ZFS best practice for FreeBSD?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of andy thomas
>
> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?

I'm not going to address the FreeBSD question. I know others have made some comments on the "best practice" on Solaris, but here goes:

There are two reasons for the "best practice" of not partitioning. And I disagree with them both.

First, by default, the on-disk write cache is disabled. But if you use the whole disk in a zpool, then ZFS enables the cache. If you partition a disk and use it only for zpools, then you might want to manually enable the cache yourself. This is a fairly straightforward scripting exercise. You may use this if you want (no warranty, etc.; it will probably destroy your system if you don't read, understand and rewrite it yourself before attempting to use it):
https://dl.dropbox.com/u/543241/dedup%20tests/cachecontrol/cachecontrol.zip
If you do that, you'll need to re-enable the cache once on each boot (or zfs mount).

The second reason is that when you "zpool import", it doesn't automatically check all the partitions of all the devices - it only scans devices. So if you are forced to move your disks to a new system, you try to import, you get an error message, you panic and destroy your disks. To overcome this problem, you just need to be good at remembering that the disks were partitioned - perhaps you should make a habit of partitioning *all* of your disks, so you'll *always* remember. On zpool import, you need to specify the partitions to scan for zpools. I believe this is the "zpool import -d" option.

And finally - there are at least a couple of solid reasons *in favor* of partitioning.

#1 It seems common, at least to me, that I'll build a server with, let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G, which means if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.

#2 A long time ago, there was a bug where you couldn't attach a mirror unless the two devices had precisely the same geometry. That was addressed in a bugfix a couple of years ago. (I had a failed SSD mirror, and Sun shipped me a new SSD with a different firmware rev, and the size of the replacement device was off by 1 block, so I couldn't replace the failed SSD.) After the bugfix, a mirror can be attached if there's a little bit of variation in the sizes of the two devices. But it's not quite enough - as recently as 2 weeks ago, I tried to attach two devices that were nominally the same, but couldn't because of a size difference. One of them was a local device, and the other was an iSCSI target. So I guess iSCSI must require a little bit of space, and that was enough to make the devices un-mirror-able without partitioning.
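A hedged illustration of the "zpool import -d" option mentioned above - the flag simply names a directory of device nodes to search, so point it at wherever the partition devices live (the paths below are examples):

    # Solaris: search the directory holding slice/partition device nodes
    zpool import -d /dev/dsk
    # FreeBSD: search by GPT label (or /dev/gptid, or plain /dev)
    zpool import -d /dev/gpt datapool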
On Fri, Oct 12, 2012 at 3:28 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> In fact, you can (although not recommended due to balancing reasons)
> have tlvdevs of mixed size (like in Freddie's example) and even of
> different structure (i.e. mixing raidz and mirrors or even single
> LUNs) by forcing the disk attachment.

My example shows 4 raidz2 vdevs, with each vdev having 6 disks, along with a log vdev and a cache vdev. Not sure where you're seeing an imbalance. Maybe it's because the pool is currently resilvering a drive, thus making it look like one of the vdevs has 7 drives?

My home file server ran with mixed vdevs for a while (a 2-IDE-disk mirror vdev with a 3-SATA-disk raidz1 vdev), as it was built using scrounged parts. But all my work file servers have matched vdevs.

-- 
Freddie Cash
fjwcash at gmail.com
On 10/13/12 02:12, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> There are at least a couple of solid reasons *in favor* of partitioning.
>
> #1 It seems common, at least to me, that I'll build a server with, let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G, which means if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.

How do you provision a spare in that situation?

--
Ian.
2012-10-12 19:34, Freddie Cash wrote:
> On Fri, Oct 12, 2012 at 3:28 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> In fact, you can (although not recommended due to balancing reasons)
>> have tlvdevs of mixed size (like in Freddie's example) and even of
>> different structure (i.e. mixing raidz and mirrors or even single
>> LUNs) by forcing the disk attachment.
>
> My example shows 4 raidz2 vdevs, with each vdev having 6 disks, along
> with a log vdev and a cache vdev. Not sure where you're seeing an
> imbalance. Maybe it's because the pool is currently resilvering a
> drive, thus making it look like one of the vdevs has 7 drives?

No, my comment was about this pool having an 8 TB TLVDEV and several 5.5 TB TLVDEVs - and that this kind of setup is quite valid for ZFS - and that while striping data across disks it can actually do better than round-robin, giving more data to the larger components. But more weight on one side is called imbalance ;)

Sorry if my using your example offended you somehow.

//Jim
2012-10-13 0:41, Ian Collins wrote:
> On 10/13/12 02:12, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>> There are at least a couple of solid reasons *in favor* of partitioning.
>>
>> #1 It seems common, at least to me, that I'll build a server with, let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G, which means if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.
>
> How do you provision a spare in that situation?

Technically, you can lay out the spare disks similarly and attach the partitions or slices as spares for the pools.

However, in the servers I've seen there were predominantly different layout designs (a sketch of layout 2 follows after this message):

1) Dedicated root disks/mirrors - small enough for rpool/swap tasks, nowadays perhaps SSDs or CF cards - especially if care was taken to use the rpool device mostly for reads and place all writes like swap and logs onto other pools.

2) For smaller machines with 2 or 4 disks, a partition (slice) is made for the rpool, sized about 10-20 GB, and the rest is for data-pool vdevs. In the case of 4-disk machines, the rpool can be a two-way mirror and the other couple of disks can host swap and/or dump in an SVM or ZFS mirror, for example. The data-pool components are identically sized and form a mirror, raid10 or a raidz1; rarely a raidz2 - which is assumed to have better resilience to loss of ANY two disks than a raid10, which is only resilient to loss of the CORRECT two disks (from different mirrors).

3) For today's computers with all disks being big, I'd also make a smallish rpool, a large data pool on separate disks, and use the extra space on the disks with the rpool for something else - be it swap in an SVM-mirrored partition, a scratch pool for incoming data or tests, etc.

Not that any of this is necessarily a best practice, but that's just the way I was taught to do it and am used to. Using pool components of the same size does also make replacements simpler ;)

//Jim
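A hedged, FreeBSD-flavoured sketch of layout (2) above - every device name, label and size is hypothetical, the same idea maps onto Solaris slices, and a real bootable root pool would also need a freebsd-boot partition and bootcode, which are omitted here:

    # identical layout on every disk: a small root partition plus a large data partition
    for d in da0 da1 da2 da3; do
        gpart create -s gpt $d
        gpart add -a 1m -s 16g -t freebsd-zfs -l root-$d $d   # ~16 GB for the root pool
        gpart add -a 1m -t freebsd-zfs -l data-$d $d          # remainder for the data pool
    done
    zpool create rpool mirror gpt/root-da0 gpt/root-da1
    zpool create data raidz1 gpt/data-da0 gpt/data-da1 gpt/data-da2 gpt/data-da3
    # a spare disk partitioned the same way can serve either pool
    zpool add data spare gpt/data-da4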
Ah, okay, that makes sense. I wasn't offended, just confused. :) Thanks for the clarification.

On Oct 13, 2012 2:01 AM, "Jim Klimov" <jimklimov at cos.ru> wrote:

> No, my comment was about this pool having an 8 TB TLVDEV and
> several 5.5 TB TLVDEVs - and that this kind of setup is quite
> valid for ZFS - and that while striping data across disks
> it can actually do better than round-robin, giving more data
> to the larger components. But more weight on one side is
> called imbalance ;)
>
> Sorry if my using your example offended you somehow.
>
> //Jim
On 10/13/12 22:13, Jim Klimov wrote:
>> How do you provision a spare in that situation?
>
> Technically, you can lay out the spare disks similarly and attach
> the partitions or slices as spares for the pools.

I probably didn't make myself clear, so I'll try again!

Assuming the intention is to get the most storage from your drives: if you add the remainder of the space on the drives you have partitioned for the root pool to the main pool, giving a mix of device sizes in the pool, how do you provision a spare? That's why I have never done this. I use whole drives everywhere and, as you mention further down, use the spare space in the root pool for scratch filesystems.

> However, in the servers I've seen there were predominantly different
> layout designs:
>
> 1) Dedicated root disks/mirrors - small enough for rpool/swap tasks, nowadays perhaps SSDs or CF cards - especially if care was taken to use the rpool device mostly for reads and place all writes like swap and logs onto other pools.
>
> 2) For smaller machines with 2 or 4 disks, a partition (slice) is made for the rpool, sized about 10-20 GB, and the rest is for data-pool vdevs. [...]
>
> 3) For today's computers with all disks being big, I'd also make a smallish rpool, a large data pool on separate disks, and use the extra space on the disks with the rpool for something else - be it swap in an SVM-mirrored partition, a scratch pool for incoming data or tests, etc.

Most of the systems I have built up this year are 2U boxes with 8 to 12 (2TB) drives. I expect these are very common at the moment. I use your third option, but I tend to just create a big rpool mirror and add a scratch filesystem rather than partitioning the drives.

--
Ian.
2012-10-14 1:56, Ian Collins wrote:
> On 10/13/12 22:13, Jim Klimov wrote:
>> Technically, you can lay out the spare disks similarly and attach
>> the partitions or slices as spares for the pools.
>
> I probably didn't make myself clear, so I'll try again!
>
> Assuming the intention is to get the most storage from your drives: if you add the remainder of the space on the drives you have partitioned for the root pool to the main pool, giving a mix of device sizes in the pool, how do you provision a spare?

Well, as long as the replacement device (drive, partition, slice) dedicated to a pool is at least as big as any of its devices, it can kick in as a hot or cold spare. So I guess you can have a pool with 2 x 1.90 TB + 8 x 2.0 TB devices (likely mirrored per-couple, or a wilder mix of mirror+raidzN), an L2ARC SSD and a 2 TB spare. If an rpool disk dies, the 2.0 TB space on the spare should be enough to replace the 1.90 TB component of the data pool. You might have a harder time replacing the rpool part, though.

Alternately, roughly following my approach #2, you can lay out all of your disks in the same manner, like 0.1 + 1.90 TB. Two or three of these can form an rpool mirror, a couple more can be swap, and the majority can form a raid10 or a raidzN on the known-faster sectors of the drives. You get a predictably faster pool for scratch, incoming data, database logs - whatever (as long as the disks are not heavily utilized all the time, and you *can* have the performance boost on a smaller pool with smaller mechanical seek travels and faster cylinders). In particular, the hot spare would be laid out the same way and can replace components of both types of pools.

> Most of the systems I have built up this year are 2U boxes with 8 to 12 (2TB) drives. I expect these are very common at the moment. I use your third option, but I tend to just create a big rpool mirror and add a scratch filesystem rather than partitioning the drives.

Consider me old-school, but I believe that writes to a filesystem (or to a pool, in the case of ZFS) are a source of risk for corruption during power glitches and such. They are also a cause of higher data fragmentation. When I have a chance, I prefer to keep apples with apples - a relatively static rpool which gets written to during package installations, updates, config changes and so on, and one or more data pools for more active data lifecycles.

The rpool is too critical a component (regarding loss of service during outages, such as the inability to boot and fix problems via remote access) to add risks of its corruption, especially now that we don't have failsafe boot options. As long as the system boots and you have ssh to fix the data pools, you've kicked your SLAs up a bit and reduced a few worries ;)

HTH, my 2c,
//Jim Klimov
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-14 13:41 UTC
[zfs-discuss] ZFS best practice for FreeBSD?
> From: Ian Collins [mailto:ian at ianshome.com]
>
> On 10/13/12 02:12, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>> There are at least a couple of solid reasons *in favor* of partitioning.
>>
>> #1 It seems common, at least to me, that I'll build a server with, let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G, which means if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.
>
> How do you provision a spare in that situation?

A solid point. I don't.

This doesn't mean you can't - it just means I don't.

If I'm not mistaken, if you have a pool with multiple different sizes of devices in the pool, you only need to add a spare of the larger size. If a smaller device fails, I believe the pool will use the larger spare device rather than not using a spare. So, if I'm not mistaken, you can add a spare to your pool in exactly the same way, regardless of having partitions or no partitions.

If I'm wrong - if the pool won't use the larger spare device in place of a smaller failed device (partition) - then you would likely need to add one spare for each different size of device used in your pool. In particular, this means:

Option 1: Given that you partition your first 2 disks, 80G for OS and 1.99T for data, you would likely want to partition *all* your disks the same, including the disk that's designated as a spare. Then you could add your spare 80G partition as a spare device, and your spare 1.99T partition as a spare device.

Option 2: Suppose you partition your first disks, and you don't want the hassle on all the rest. (This is my case.) Or you have physically different-sized devices - a pool that was originally made of 1T disks but has since been extended to include a bunch of 2T disks, or something like that. It's conceivable you would want to have a spare of each different size, which could in some cases mean you use two spares (one partitioned and one not) in a pool where you might otherwise have only one spare.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-14 13:51 UTC
[zfs-discuss] ZFS best practice for FreeBSD?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> A solid point. I don't.
>
> This doesn't mean you can't - it just means I don't.

This response was kind of long-winded. So here's a simpler version:

Suppose 6 disks in a system, each 2T: c0t0d0 through c0t5d0.

rpool is a mirror:
    mirror c0t0d0p1 c0t1d0p1

    c0t0d0p2 = 1.9T, unused (Extended, unused)
    c0t1d0p2 = 1.9T, unused (Extended, unused)

Now partition all the other disks the same. Create datapool:

    zpool create datapool \
        mirror c0t0d0p2 c0t1d0p2 \
        mirror c0t2d0p1 c0t3d0p1 \
        mirror c0t2d0p2 c0t3d0p2 \
        mirror c0t4d0p1 c0t5d0p1 \
        mirror c0t4d0p2 c0t5d0p2

Add a spare? A seventh disk, c0t6d0. Partition it, then:

    zpool add datapool spare c0t6d0p1 c0t6d0p2
2012-10-14 17:51, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>     zpool create datapool \
>         mirror c0t0d0p2 c0t1d0p2 \
>         mirror c0t2d0p1 c0t3d0p1 \
>         mirror c0t2d0p2 c0t3d0p2 \
>         mirror c0t4d0p1 c0t5d0p1 \
>         mirror c0t4d0p2 c0t5d0p2
>
> Add a spare? A seventh disk, c0t6d0. Partition it, then:
>
>     zpool add datapool spare c0t6d0p1 c0t6d0p2

Kind of like what I also proposed. Now let's just hope that when a spare is needed, the big spare is not used for the small partition first ;) (Now I do wonder whether ZFS cares to take the smallest sufficient spare to fix a vdev, if it has a choice of several?)

Also note that mirrors make it easier to fix mistakes like that - you can demote a mirror component into a single disk and rearrange components, while you won't be able to (easily) juggle leaf vdevs in a raidzN set...

//Jim