Dear internets, I've got an old SunFire X2100M2 with 6-8 GBytes ECC RAM, which I wanted to put into use with Linux, using the Linux VServer patch (an analogue to zones), and 2x 2 TByte nearline (WD RE4) drives. It occurred to me that the 1U case had enough space to add some SSDs (e.g. 2-4 80 GByte Intel SSDs), and the power supply should be able to take both the 2x SATA HDs as well as 2-4 SATA SSDs, though I would need to splice into existing power cables. I also have a few LSI adapters and an IBM M1015 (potentially reflashable to IT mode), so having enough ports is less of an issue (I'll probably use an LSI with 4x SAS/SATA for the 4x SSDs and keep the onboard SATA for the HDs, or use each for 2x SSD and 2x HD).

Now there are multiple configurations for this. Some use Linux (root fs on a RAID 10, /home on RAID 1), some zfs. Now zfs on Linux probably wouldn't do hybrid zfs pools (would it?), and it probably wouldn't be stable enough for production. Right?

Assuming I won't have to compromise CPU performance (it's an anemic Opteron 1210 1.8 GHz, dual core, after all, and it will probably run several tens of zones in production) or sacrifice data integrity, can I make e.g. an LSI SAS3442E directly do SSD caching (it says something about CacheCade, but I'm not sure it's an OS-side driver thing), as it is supposed to boost IOPS? Unlikely shot, but probably somebody here would know.

If not, should I go directly to OpenIndiana and use a hybrid pool? Should I use all 4x SATA SSDs and 2x SATA HDs to do a hybrid pool, or would this be overkill? The SSDs are Intel SSDSA2M080G2GC 80 GByte, so no speed demons either. However, they've seen some wear and tear and none of them has keeled over yet, so I think they'll be good for a few more years.

How would you lay out the pool with OpenIndiana in either case to maximize IOPS and minimize CPU load (assuming it's an issue)? I wouldn't mind trading 1/3rd to 1/2 of the CPU to zfs load, if I can get decent IOPS.

This is terribly specific, I know, but I figured somebody had tried something like that with an X2100 M2, it being a rather popular Sun (RIP) Solaris box at the time. Or not. Thanks muchly, in any case.

-- Eugen
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-27 12:12 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eugen Leitl
>
> can I make e.g. LSI SAS3442E
> directly do SSD caching (it says something about CacheCade,
> but I'm not sure it's an OS-side driver thing), as it
> is supposed to boost IOPS? Unlikely shot, but probably
> somebody here would know.

Depending on the type of work you will be doing, the best performance thing you could do is to disable the zil (zfs set sync=disabled) and use SSDs for cache. But don't go *crazy* adding SSDs for cache, because they still have some in-memory footprint. If you have 8G of RAM and 80G SSDs, maybe just use one of them for cache, and let the other 3 do absolutely nothing.

Better yet, put your OS on a mirrored pair of SSDs, then use a mirrored pair of HDDs for the storage pool, and one SSD for cache. Then you have one SSD unused, which you could optionally add as a dedicated log device to your storage pool. There are specific situations where it's OK or not OK to disable the zil - look around and ask here if you have any confusion about it.

Don't do redundancy in hardware. Let ZFS handle it.
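A minimal sketch of that layout, assuming the two spare SSDs show up as c0t2d0/c0t3d0 and the HDDs as c1t0d0/c1t1d0 - these device names are hypothetical, substitute whatever format(1M) reports on your box; the rpool mirror on the first two SSDs would normally be created by the installer:

    # data pool on the mirrored HDDs
    zpool create tank mirror c1t0d0 c1t1d0

    # one SSD as L2ARC (read cache), and optionally one as a dedicated log device
    zpool add tank cache c0t2d0
    zpool add tank log c0t3d0

    # only if your workload can tolerate losing the last few seconds of
    # synchronous writes on a crash or power loss (see the zil caveats above)
    zfs set sync=disabled tank

Note that with sync=disabled the log device never gets used, so the last two steps are really either/or.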
On Tue, Nov 27, 2012 at 12:12:43PM +0000, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eugen Leitl
> >
> > can I make e.g. LSI SAS3442E
> > directly do SSD caching (it says something about CacheCade,
> > but I'm not sure it's an OS-side driver thing), as it
> > is supposed to boost IOPS? Unlikely shot, but probably
> > somebody here would know.
>
> Depending on the type of work you will be doing, the best performance thing you could do is to disable the zil (zfs set sync=disabled) and use SSDs for cache. But don't go *crazy* adding SSDs for cache, because they still have some in-memory footprint. If you have 8G of RAM and 80G SSDs, maybe just use one of them for cache, and let the other 3 do absolutely nothing. Better yet, put your OS on a mirrored pair of SSDs, then use a mirrored pair of HDDs for the storage pool, and one SSD for cache. Then you have one SSD unused, which you could optionally add as a dedicated log device to your storage pool. There are specific situations where it's OK or not OK to disable the zil - look around and ask here if you have any confusion about it.
>
> Don't do redundancy in hardware. Let ZFS handle it.

Thanks. I'll try doing that, and see how it works out.
Performance-wise, I think you should go for mirrors/raid10, and separate the pools (i.e. rpool mirror on SSD and data mirror on HDDs). If you have 4 SSDs, you might mirror the other couple for zoneroots or some databases in datasets delegated into zones, for example.

Don't use dedup.

Carve out some space for L2ARC. As Ed noted, you might not want to dedicate much disk space due to remaining RAM pressure when using the cache; however, spreading the IO load between smaller cache partitions/slices on each SSD may help your IOPS on average.

Maybe go for compression. I really hope someone better versed in compression - like Saso - would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in terms of read-speeds from the pools. My HDD-based assumption is in general that the less data you read (or write) on platters - the better, and the spare CPU cycles can usually take the hit.

I'd spread out the different data types (i.e. WORM programs, WORM-append logs and random-IO application data) into various datasets with different settings, backed by different storage - since you have the luxury.

Many best practice documents (and the original Sol10/SXCE/LiveUpgrade requirements) place the zoneroots on the same rpool so they can be upgraded seamlessly as part of the OS image. However, you can also delegate ZFS datasets into zones and/or have lofs mounts from GZ to LZ (maybe needed for shared datasets like distros and homes - and faster/more robust than NFS from GZ to LZ).

For OS images (zoneroots) I'd use gzip-9 or better (likely lz4 when it gets integrated), same for logfile datasets, and lzjb, zle or none for the random-IO datasets. For structured things like databases I also research the block IO size and use that (at dataset creation time) to reduce extra work with ZFS COW during writes - at the expense of more metadata. A sketch of such a layout follows below.

You'll likely benefit from having OS images on SSDs, logs on HDDs (including logs from the GZ and LZ OSes, to reduce needless writes on the SSDs), and databases on SSDs. Things "depend" for other data types, and in general would be helped by L2ARC on the SSDs.

Also note that much of the default OS image is not really used (i.e. X11 on headless boxes), so you might want to do weird things with GZ or LZ rootfs data layouts - note that these might puzzle your beadm/liveupgrade software, so you'll have to do any upgrades with lots of manual labor :)

On a somewhat orthogonal route, I'd start with setting up a generic "dummy" zone, perhaps with much "unneeded" software, and zfs-cloning that to spawn application zones. This way you only pay the footprint price once, at least until you have to upgrade the LZ OSes - in that case it might be cheaper (in terms of storage at least) to upgrade the dummy, clone it again, and port the LZ's customizations (installed software) by finding the differences between the old dummy and the current zone state (zfs diff, rsync -cn, etc.).

In such upgrades you're really well served by storing volatile data in datasets separate from the zone OS root - you just reattach these datasets to the upgraded OS image and go on serving.

As a particular example of something often upgraded and taking considerable disk space per copy - I'd have the current JDK installed in the GZ: either simply lofs-mounted from GZ to LZs, or in a separate dataset, cloned and delegated into LZs (if JDK customizations are further needed by some - but not all - local zones, i.e. timezone updates, trusted CA certs, etc.).

HTH,
//Jim Klimov
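A sketch of what such a per-data-type split might look like - all pool and dataset names here (ssdpool, tank and their children, myzone) are made up for illustration:

    # zone roots on the SSD pool: write-once/read-many, so heavy compression pays off
    zfs create -o compression=gzip-9 ssdpool/zones

    # log datasets on the HDD pool, also well compressible, keeping writes off the SSDs
    zfs create -o compression=gzip-9 tank/logs

    # random-IO application data with cheap compression (or none)
    zfs create -o compression=lzjb tank/appdata

    # a database dataset matched to the DB block size (e.g. 8k), per the recordsize note above
    zfs create -o recordsize=8k -o compression=lzjb ssdpool/db

    # delegate a dataset into a zone so it can be managed from inside the zone
    zonecfg -z myzone "add dataset; set name=ssdpool/db; end"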
Now that I thought of it some more, a follow-up is due on my advice:

1) While the best practices do (did) dictate setting up zoneroots in the rpool, this is certainly not required - and I maintain lots of systems which store zones in separate data pools. This minimizes write-impact on the rpools and gives the fuzzy feeling of keeping the systems safer from unmountable or overfilled roots.

2) Whether LZs and GZs are in the same rpool for you, or you stack your "tens of" LZ roots in a separate pool, they do in fact offer a nice target for dedup - with an expected large dedup ratio which would outweigh both the overheads and IO lags (especially if it is on an SSD pool) and the inconveniences of my approach with cloned dummy zones - especially upgrades thereof. Just remember to use the same compression settings (or lack of compression) on all zoneroots, so that the zfs blocks for the OS image files would be identical and dedupable.

HTH,
//Jim Klimov
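For the record, and assuming a hypothetical dedicated pool named "zones", point 2) boils down to something like:

    # identical compression on all zoneroots, so identical file content yields
    # identical (and therefore dedupable) blocks
    zfs set compression=gzip-9 zones
    zfs set dedup=on zones

    # check the payoff once a few zones are installed; keep in mind the dedup
    # table wants RAM/L2ARC, which is scarce with 8G of memory
    zpool get dedupratio zones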
Fajar A. Nugraha
2012-Nov-27 13:37 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
On Tue, Nov 27, 2012 at 5:13 AM, Eugen Leitl <eugen at leitl.org> wrote:
> Now there are multiple configurations for this.
> Some use Linux (root fs on a RAID 10, /home on
> RAID 1), some zfs. Now zfs on Linux probably wouldn't
> do hybrid zfs pools (would it?)

Sure it does. You can even use the whole disk as zfs, with no additional partition required (not even for /boot).

> and it probably wouldn't
> be stable enough for production. Right?

Depends on how you define "stable", and what kind of in-house expertise you have. Some companies are selling (or plan to sell, as their product is in open beta stage) storage appliances powered by zfs on linux (search the ZoL list for details). So it's definitely stable enough for them.

-- Fajar
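A hedged example of what that looks like with ZFS on Linux, using whole disks - the /dev/disk/by-id names below are placeholders for the actual WD RE4 and Intel SSD device IDs:

    # mirrored pool on whole disks; ZoL partitions and labels them itself
    zpool create tank mirror /dev/disk/by-id/ata-HDD_SERIAL_1 /dev/disk/by-id/ata-HDD_SERIAL_2

    # add an SSD as L2ARC - the hybrid-pool part works the same as on illumos
    zpool add tank cache /dev/disk/by-id/ata-SSD_SERIAL_1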
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-28 14:51 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I really hope someone better versed in compression - like Saso -
> would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in
> terms of read-speeds from the pools. My HDD-based assumption is
> in general that the less data you read (or write) on platters -
> the better, and the spare CPU cycles can usually take the hit.

Oh, I can definitely field that one -

The lzjb compression (the default compression as long as you just turn compression on without specifying any other detail) is very fast compression, similar to lzo. It generally has no noticeable CPU overhead, but it saves you a lot of time and space for highly repetitive things like text files (source code) and sparse zero-filled files and stuff like that. I personally always enable this: "compression=on".

zlib (gzip) is more powerful, but *way* slower. Even the fastest level, gzip-1, uses enough CPU cycles that you probably will be CPU limited rather than IO limited. There are very few situations where this option is better than the default lzjb.

Some data (anything that's already compressed: zip, gz, etc., video files, jpgs, encrypted files, and so on) is totally uncompressible with these algorithms. If this is the type of data you store, you should not use compression.

Probably not worth mentioning, but what the heck. If you normally have uncompressible data and then one day you're going to do a lot of stuff that's compressible... (or vice versa)... The compression flag is only used during writes. Once it's written to the pool, compressed or uncompressed, it stays that way, even if you change the flag later.
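A quick illustration of that last point, using a hypothetical dataset tank/data - the flag only affects blocks written afterwards, so recompressing old data means rewriting it:

    # enable default (lzjb) compression - new writes only
    zfs set compression=on tank/data

    # compressratio only reflects blocks written since compression was enabled
    zfs get compression,compressratio tank/data

    # to recompress existing data, rewrite it, e.g. into a fresh dataset
    zfs create -o compression=on tank/data_new
    rsync -a /tank/data/ /tank/data_new/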
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I really hope someone better versed in compression - like Saso -
>> would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in
>> terms of read-speeds from the pools. My HDD-based assumption is
>> in general that the less data you read (or write) on platters -
>> the better, and the spare CPU cycles can usually take the hit.
>
> Oh, I can definitely field that one -
>
> The lzjb compression (the default compression as long as you just turn compression on without specifying any other detail) is very fast compression, similar to lzo. It generally has no noticeable CPU overhead, but it saves you a lot of time and space for highly repetitive things like text files (source code) and sparse zero-filled files and stuff like that. I personally always enable this: "compression=on".
>
> zlib (gzip) is more powerful, but *way* slower. Even the fastest level, gzip-1, uses enough CPU cycles that you probably will be CPU limited rather than IO limited.

I haven't seen that for a long time. When gzip compression was first introduced, it would cause writes on a Thumper to be CPU bound. It was all but unusable on that machine. Today, with better threading, I barely notice the overhead on the same box.

> There are very few situations where this option is better than the default lzjb.

That part I do agree with!

-- Ian.
> Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> There are very few situations where the (gzip) option is better than the
>> default lzjb.

Well, for the most part my question regarded the slowness (or lack thereof) of gzip DEcompression as compared to the lz* algorithms.

If there are files and data like the OS (LZ/GZ) image and program binaries, which are written once but read many times, I don't really care how expensive it is to write less data (and for an OI installation the difference between lzjb and gzip-9 compression of /usr can be around or over 100 MB) - as long as I keep less data on-disk and have fewer IOs to read in the OS during boot and work. Especially so, if - and this is the part I am not certain about - it is roughly as cheap to READ the gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).

//Jim
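One way to put a number on that on-disk difference - a rough sketch, with made-up dataset names under rpool - is to copy the same tree into two datasets and compare:

    # identical content, different compression
    zfs create -o compression=lzjb   rpool/test_lzjb
    zfs create -o compression=gzip-9 rpool/test_gzip9
    rsync -a /usr/ /rpool/test_lzjb/
    rsync -a /usr/ /rpool/test_gzip9/

    # compare achieved ratios and actual space used
    zfs get compressratio rpool/test_lzjb rpool/test_gzip9
    zfs list -o name,used,refer rpool/test_lzjb rpool/test_gzip9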
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-29 14:38 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> this is
> the part I am not certain about - it is roughly as cheap to READ the
> gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).

Nope. I know LZJB is not LZO, but I'm starting from a point of saying that LZO is specifically designed to be super-fast, low-memory for decompression (as claimed all over the LZO webpage, as well as Wikipedia, and supported by my own personal experience using lzop). So for a comparison to LZJB, see here:
http://denisy.dyndns.org/lzo_vs_lzjb/

LZJB is, at least according to these guys, even faster than LZO. So I'm confident concluding that lzjb (default) decompression is significantly faster than zlib (gzip) decompression.
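And if anyone wants to measure it on the actual box rather than trust benchmarks of the raw algorithms, a rough sketch (reusing the hypothetical test_lzjb/test_gzip9 datasets from the earlier example; note that the ARC will cache the data, so export/import the pool or reboot between runs for a cold-cache read):

    # time cold reads of the same tree from each dataset, and watch CPU load
    # in parallel (e.g. vmstat 1 in another terminal)
    ptime sh -c 'find /rpool/test_lzjb  -type f -exec cat {} + > /dev/null'
    ptime sh -c 'find /rpool/test_gzip9 -type f -exec cat {} + > /dev/null'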