Dear internets, I've got an old SunFire X2100M2 with 6-8 GBytes ECC RAM, which I wanted to put into use with Linux, using the Linux VServer patch (an analogue to zones), and 2x 2 TByte nearline (WD RE4) drives. It occurred to me that the 1U case had enough space to add some SSDs (e.g. 2-4 80 GByte Intel SSDs), and the power supply should be able to take both the 2x SATA HDs as well as 2-4 SATA SSDs, though I would need to splice into existing power cables. I also have a few LSI adapters and an IBM M1015 (potentially reflashable to IT mode), so having enough ports is less of an issue (I'll probably use an LSI with 4x SAS/SATA for the 4x SSDs and keep the onboard SATA for the HDs, or use each for 2x SSD and 2x HD).

Now there are multiple configurations for this. Some use Linux (root fs on a RAID 10, /home on RAID 1), some zfs. Now zfs on Linux probably wouldn't do hybrid zfs pools (would it?), and it probably wouldn't be stable enough for production. Right?

Assuming I won't have to compromise CPU performance (it's an anemic Opteron 1210 1.8 GHz, dual core, after all, and it will probably run several tens of zones in production) or sacrifice data integrity, can I make e.g. an LSI SAS3442E directly do SSD caching (it says something about CacheCade, but I'm not sure it's an OS-side driver thing), as it is supposed to boost IOPS? Unlikely shot, but probably somebody here would know.

If not, should I go directly to OpenIndiana and use a hybrid pool? Should I use all 4x SATA SSDs and 2x SATA HDs to do a hybrid pool, or would this be overkill? The SSDs are Intel SSDSA2M080G2GC 80 GByte, so no speed demons either. However, they've seen some wear and tear and none of them has keeled over yet, so I think they'll be good for a few more years.

How would you lay out the pool with OpenIndiana in either case to maximize IOPS and minimize CPU load (assuming it's an issue)? I wouldn't mind trading 1/3rd to 1/2 of the CPU to zfs load, if I can get decent IOPS.

This is terribly specific, I know, but I figured somebody had tried something like that with an X2100 M2, it being a rather popular Sun (RIP) Solaris box at the time. Or not. Thanks muchly, in any case.

-- Eugen
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-27 12:12 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eugen Leitl
>
> can I make e.g. LSI SAS3442E
> directly do SSD caching (it says something about CacheCade,
> but I'm not sure it's an OS-side driver thing), as it
> is supposed to boost IOPS? Unlikely shot, but probably
> somebody here would know.

Depending on the type of work you will be doing, the best performance thing you could do is to disable the zil (zfs set sync=disabled) and use SSDs for cache. But don't go *crazy* adding SSDs for cache, because they still have some in-memory footprint. If you have 8G of RAM and 80G SSDs, maybe just use one of them for cache, and let the other 3 do absolutely nothing.

Better yet, put your OS on a mirrored pair of SSDs, then use a mirrored pair of HDDs for the storage pool, and one SSD for cache. Then you have one SSD unused, which you could optionally add as a dedicated log device to your storage pool. There are specific situations where it's OK or not OK to disable the zil - look around and ask here if you have any confusion about it.

Don't do redundancy in hardware. Let ZFS handle it.
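A minimal sketch of that layout, assuming the two spare SSDs show up as c0t2d0/c0t3d0 and the HDDs as c1t0d0/c1t1d0 - these device names are hypothetical, substitute whatever format(1M) reports on your box; the rpool mirror on the first two SSDs would normally be created by the installer:

    # data pool on the mirrored HDDs
    zpool create tank mirror c1t0d0 c1t1d0

    # one SSD as L2ARC (read cache), and optionally one as a dedicated log device
    zpool add tank cache c0t2d0
    zpool add tank log c0t3d0

    # only if your workload can tolerate losing the last few seconds of
    # synchronous writes on a crash or power loss (see the zil caveats above)
    zfs set sync=disabled tank

Note that with sync=disabled the log device never gets used, so the last two steps are really either/or.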
On Tue, Nov 27, 2012 at 12:12:43PM +0000, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eugen Leitl
> >
> > can I make e.g. LSI SAS3442E
> > directly do SSD caching (it says something about CacheCade,
> > but I'm not sure it's an OS-side driver thing), as it
> > is supposed to boost IOPS? Unlikely shot, but probably
> > somebody here would know.
>
> Depending on the type of work you will be doing, the best performance thing you could do is to disable the zil (zfs set sync=disabled) and use SSDs for cache. But don't go *crazy* adding SSDs for cache, because they still have some in-memory footprint. If you have 8G of RAM and 80G SSDs, maybe just use one of them for cache, and let the other 3 do absolutely nothing. Better yet, put your OS on a mirrored pair of SSDs, then use a mirrored pair of HDDs for the storage pool, and one SSD for cache. Then you have one SSD unused, which you could optionally add as a dedicated log device to your storage pool. There are specific situations where it's OK or not OK to disable the zil - look around and ask here if you have any confusion about it.
>
> Don't do redundancy in hardware. Let ZFS handle it.

Thanks. I'll try doing that, and see how it works out.
Performance-wise, I think you should go for mirrors/raid10, and separate the pools (i.e. rpool mirror on SSD and data mirror on HDDs). If you have 4 SSDs, you might mirror the other couple for zoneroots or some databases in datasets delegated into zones, for example.

Don't use dedup.

Carve out some space for L2ARC. As Ed noted, you might not want to dedicate much disk space due to remaining RAM pressure when using the cache; however, spreading the IO load between smaller cache partitions/slices on each SSD may help your IOPS on average.

Maybe go for compression. I really hope someone better versed in compression - like Saso - would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in terms of read-speeds from the pools. My HDD-based assumption is in general that the less data you read (or write) on platters - the better, and the spare CPU cycles can usually take the hit.

I'd spread out the different data types (i.e. WORM programs, WORM-append logs and random-IO application data) into various datasets with different settings, backed by different storage - since you have the luxury.

Many best practice documents (and the original Sol10/SXCE/LiveUpgrade requirements) place the zoneroots on the same rpool so they can be upgraded seamlessly as part of the OS image. However, you can also delegate ZFS datasets into zones and/or have lofs mounts from GZ to LZ (maybe needed for shared datasets like distros and homes - and faster/more robust than NFS from GZ to LZ).

For OS images (zoneroots) I'd use gzip-9 or better (likely lz4 when it gets integrated), same for logfile datasets, and lzjb, zle or none for the random-IO datasets. For structured things like databases I also research the block IO size and use that (at dataset creation time) to reduce extra work with ZFS COW during writes - at the expense of more metadata. A sketch of such a layout follows below.

You'll likely benefit from having OS images on SSDs, logs on HDDs (including logs from the GZ and LZ OSes, to reduce needless writes on the SSDs), and databases on SSDs. Things "depend" for other data types, and in general would be helped by L2ARC on the SSDs.

Also note that much of the default OS image is not really used (i.e. X11 on headless boxes), so you might want to do weird things with GZ or LZ rootfs data layouts - note that these might puzzle your beadm/liveupgrade software, so you'll have to do any upgrades with lots of manual labor :)

On a somewhat orthogonal route, I'd start with setting up a generic "dummy" zone, perhaps with much "unneeded" software, and zfs-cloning that to spawn application zones. This way you only pay the footprint price once, at least until you have to upgrade the LZ OSes - in that case it might be cheaper (in terms of storage at least) to upgrade the dummy, clone it again, and port the LZ's customizations (installed software) by finding the differences between the old dummy and the current zone state (zfs diff, rsync -cn, etc.).

In such upgrades you're really well served by storing volatile data in datasets separate from the zone OS root - you just reattach these datasets to the upgraded OS image and go on serving.

As a particular example of something often upgraded and taking considerable disk space per copy - I'd have the current JDK installed in the GZ: either simply lofs-mounted from GZ to LZs, or in a separate dataset, cloned and delegated into LZs (if JDK customizations are further needed by some - but not all - local zones, i.e. timezone updates, trusted CA certs, etc.).

HTH,
//Jim Klimov
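A sketch of what such a per-data-type split might look like - all pool and dataset names here (ssdpool, tank and their children, myzone) are made up for illustration:

    # zone roots on the SSD pool: write-once/read-many, so heavy compression pays off
    zfs create -o compression=gzip-9 ssdpool/zones

    # log datasets on the HDD pool, also well compressible, keeping writes off the SSDs
    zfs create -o compression=gzip-9 tank/logs

    # random-IO application data with cheap compression (or none)
    zfs create -o compression=lzjb tank/appdata

    # a database dataset matched to the DB block size (e.g. 8k), per the recordsize note above
    zfs create -o recordsize=8k -o compression=lzjb ssdpool/db

    # delegate a dataset into a zone so it can be managed from inside the zone
    zonecfg -z myzone "add dataset; set name=ssdpool/db; end"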
Now that I thought of it some more, a follow-up is due on my advice:

1) While the best practices do (did) dictate setting up zoneroots in the rpool, this is certainly not required - and I maintain lots of systems which store zones in separate data pools. This minimizes write-impact on the rpools and gives the fuzzy feeling of keeping the systems safer from unmountable or overfilled roots.

2) Whether LZs and GZs are in the same rpool for you, or you stack your "tens of" LZ roots in a separate pool, they do in fact offer a nice target for dedup - with an expected large dedup ratio which would outweigh both the overheads and IO lags (especially if it is on an SSD pool) and the inconveniences of my approach with cloned dummy zones - especially upgrades thereof. Just remember to use the same compression settings (or lack of compression) on all zoneroots, so that the zfs blocks for the OS image files would be identical and dedupable.

HTH,
//Jim Klimov
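For the record, and assuming a hypothetical dedicated pool named "zones", point 2) boils down to something like:

    # identical compression on all zoneroots, so identical file content yields
    # identical (and therefore dedupable) blocks
    zfs set compression=gzip-9 zones
    zfs set dedup=on zones

    # check the payoff once a few zones are installed; keep in mind the dedup
    # table wants RAM/L2ARC, which is scarce with 8G of memory
    zpool get dedupratio zones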
Fajar A. Nugraha
2012-Nov-27 13:37 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
On Tue, Nov 27, 2012 at 5:13 AM, Eugen Leitl <eugen at leitl.org> wrote:
> Now there are multiple configurations for this.
> Some use Linux (root fs on a RAID 10, /home on
> RAID 1), some zfs. Now zfs on Linux probably wouldn't
> do hybrid zfs pools (would it?)

Sure it does. You can even use the whole disk as zfs, with no additional partition required (not even for /boot).

> and it probably wouldn't
> be stable enough for production. Right?

Depends on how you define "stable", and what kind of in-house expertise you have. Some companies are selling (or plan to sell, as their product is in open beta stage) storage appliances powered by zfs on linux (search the ZoL list for details). So it's definitely stable enough for them.

-- Fajar
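A hedged example of what that looks like with ZFS on Linux, using whole disks - the /dev/disk/by-id names below are placeholders for the actual WD RE4 and Intel SSD device IDs:

    # mirrored pool on whole disks; ZoL partitions and labels them itself
    zpool create tank mirror /dev/disk/by-id/ata-HDD_SERIAL_1 /dev/disk/by-id/ata-HDD_SERIAL_2

    # add an SSD as L2ARC - the hybrid-pool part works the same as on illumos
    zpool add tank cache /dev/disk/by-id/ata-SSD_SERIAL_1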
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-28 14:51 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I really hope someone better versed in compression - like Saso -
> would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in
> terms of read-speeds from the pools. My HDD-based assumption is
> in general that the less data you read (or write) on platters -
> the better, and the spare CPU cycles can usually take the hit.

Oh, I can definitely field that one -

The lzjb compression (the default compression as long as you just turn compression on without specifying any other detail) is very fast compression, similar to lzo. It generally has no noticeable CPU overhead, but it saves you a lot of time and space for highly repetitive things like text files (source code) and sparse zero-filled files and stuff like that. I personally always enable this: "compression=on".

zlib (gzip) is more powerful, but *way* slower. Even the fastest level, gzip-1, uses enough CPU cycles that you probably will be CPU limited rather than IO limited. There are very few situations where this option is better than the default lzjb.

Some data (anything that's already compressed: zip, gz, etc., video files, jpgs, encrypted files, and so on) is totally uncompressible with these algorithms. If this is the type of data you store, you should not use compression.

Probably not worth mentioning, but what the heck. If you normally have uncompressible data and then one day you're going to do a lot of stuff that's compressible... (or vice versa)... The compression flag is only used during writes. Once it's written to the pool, compressed or uncompressed, it stays that way, even if you change the flag later.
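A quick illustration of that last point, using a hypothetical dataset tank/data - the flag only affects blocks written afterwards, so recompressing old data means rewriting it:

    # enable default (lzjb) compression - new writes only
    zfs set compression=on tank/data

    # compressratio only reflects blocks written since compression was enabled
    zfs get compression,compressratio tank/data

    # to recompress existing data, rewrite it, e.g. into a fresh dataset
    zfs create -o compression=on tank/data_new
    rsync -a /tank/data/ /tank/data_new/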
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I really hope someone better versed in compression - like Saso -
>> would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in
>> terms of read-speeds from the pools. My HDD-based assumption is
>> in general that the less data you read (or write) on platters -
>> the better, and the spare CPU cycles can usually take the hit.
>
> Oh, I can definitely field that one -
>
> The lzjb compression (the default compression as long as you just turn compression on without specifying any other detail) is very fast compression, similar to lzo. It generally has no noticeable CPU overhead, but it saves you a lot of time and space for highly repetitive things like text files (source code) and sparse zero-filled files and stuff like that. I personally always enable this: "compression=on".
>
> zlib (gzip) is more powerful, but *way* slower. Even the fastest level, gzip-1, uses enough CPU cycles that you probably will be CPU limited rather than IO limited.

I haven't seen that for a long time. When gzip compression was first introduced, it would cause writes on a Thumper to be CPU bound. It was all but unusable on that machine. Today, with better threading, I barely notice the overhead on the same box.

> There are very few situations where this option is better than the default lzjb.

That part I do agree with!

-- Ian.
> Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> There are very few situations where the (gzip) option is better than the
>> default lzjb.

Well, for the most part my question regarded the slowness (or lack thereof) of gzip DEcompression as compared to the lz* algorithms.

If there are files and data like the OS (LZ/GZ) image and program binaries, which are written once but read many times, I don't really care how expensive it is to write less data (and for an OI installation the difference between lzjb and gzip-9 compression of /usr can be around or over 100 MB) - as long as I keep less data on-disk and have fewer IOs to read in the OS during boot and work. Especially so, if - and this is the part I am not certain about - it is roughly as cheap to READ the gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).

//Jim
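One way to put a number on that on-disk difference - a rough sketch, with made-up dataset names under rpool - is to copy the same tree into two datasets and compare:

    # identical content, different compression
    zfs create -o compression=lzjb   rpool/test_lzjb
    zfs create -o compression=gzip-9 rpool/test_gzip9
    rsync -a /usr/ /rpool/test_lzjb/
    rsync -a /usr/ /rpool/test_gzip9/

    # compare achieved ratios and actual space used
    zfs get compressratio rpool/test_lzjb rpool/test_gzip9
    zfs list -o name,used,refer rpool/test_lzjb rpool/test_gzip9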
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-29 14:38 UTC
[zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> this is
> the part I am not certain about - it is roughly as cheap to READ the
> gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).

Nope. I know LZJB is not LZO, but I'm starting from a point of saying that LZO is specifically designed to be super-fast, low-memory for decompression (as claimed all over the LZO webpage, as well as Wikipedia, and supported by my own personal experience using lzop). So for a comparison to LZJB, see here:
http://denisy.dyndns.org/lzo_vs_lzjb/

LZJB is, at least according to these guys, even faster than LZO. So I'm confident concluding that lzjb (default) decompression is significantly faster than zlib (gzip) decompression.
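And if anyone wants to measure it on the actual box rather than trust benchmarks of the raw algorithms, a rough sketch (reusing the hypothetical test_lzjb/test_gzip9 datasets from the earlier example; note that the ARC will cache the data, so export/import the pool or reboot between runs for a cold-cache read):

    # time cold reads of the same tree from each dataset, and watch CPU load
    # in parallel (e.g. vmstat 1 in another terminal)
    ptime sh -c 'find /rpool/test_lzjb  -type f -exec cat {} + > /dev/null'
    ptime sh -c 'find /rpool/test_gzip9 -type f -exec cat {} + > /dev/null'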