Karl Denninger
2014-Mar-27 11:52 UTC
kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
On 3/27/2014 4:11 AM, mikej wrote:
> I've been running the latest patch now on r263711 and want to give it a +1.
>
> No ZFS knobs set and I must go out of my way to have my system swap.
>
> I hope this patch gets a much wider review and can be put into the tree
> permanently.
>
> Karl, thanks for working on this.
>
> Regards,
>
> Michael Jung

No problem; I was being driven insane by the stalls and related bad behavior... and there's that old saw about complaining about something without proposing a fix for it (I've done it!) being "less than optimum", so.... :-)

Hopefully wider review (and, if the general consensus is similar to what I've seen here and what you're reporting as well, inclusion in the codebase) will come.

On my sandbox system I have to get truly abusive before I can get the system to swap now, but that load is synthetic, and we all know what sometimes happens when you try to extrapolate from synthetic loads to real production ones.

What really has my attention is the impact on systems running live production loads. It has entirely changed the character of those machines, working equally well for both pure ZFS machines and mixed UFS/ZFS systems. One of these systems that gets pounded on pretty hard and has a moderately large configuration (~10TB of storage, two quad-core Xeon processors and 24GB of RAM serving a combination of Samba users internally, a decently large Postgres installation supporting an externally-facing web forum and blog application, email and similar things) has been completely transformed from being "frequently challenged" by its workload to literally loafing 90%+ of the day. DBMS response times have seen their standard deviation drop by an order of magnitude, with best-case response times for one of the most common query sequences (~30 separate ops) down from ~180ms to ~140ms.

This particular machine has a separate pool for the system itself (root, usr and var), which was formerly UFS because it had to be in order to avoid the worst of the "stall" behavior. It also has two other pools on it: one for read-nearly-only data sets comprised of very large files that are almost archival in character, and a second that holds the system's "working set". The latter has a separate intent log; I had a cache SSD on it as well but recently dropped it, as with these changes it no longer produces a material improvement in performance. I'm frankly not sure the intent log is helping any more either, but I've yet to drop it and instrument the results -- it used to be *necessary* to avoid nasty problems during busy periods.

I now have that machine set up booting from ZFS with the system on a mirrored pool dedicated to system images, with lz4 *and* dedup on (for that filesystem's root), which allows me to clone it almost instantly, start a jail on the clone and then do a "buildworld buildkernel -j8" while only allocating storage to actual changes. The dedup ratio on that mirror set is 1.4x and lz4 is showing a net compression ratio of 2.01x. Even better, I cannot provoke misbehavior by doing this sort of thing during the middle of the day, where formerly that was just begging for trouble; the impact on user-perceptible performance during it is zero, although I can see the degradation (a modest increase in system latency) in the stats.
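[The clone-and-build workflow described above can be reproduced with stock FreeBSD tools. Below is a minimal sketch; the pool and dataset names (zroot/images/base), the jail name, and the mountpoint are assumptions for illustration, not details taken from the actual layout discussed in the thread.]

    # Snapshot the dedup+lz4 system image and clone it; only blocks that
    # later diverge from the origin consume new space.
    zfs snapshot zroot/images/base@pre-build
    zfs clone -o mountpoint=/images/build zroot/images/base@pre-build \
        zroot/images/build

    # Run the build inside a throwaway jail rooted on the clone
    # (assumes the image carries its own /usr/src).
    jail -c name=buildjail path=/images/build ip4=disable ip6=disable persist
    jexec buildjail /bin/sh -c 'cd /usr/src && make -j8 buildworld buildkernel'
    jail -r buildjail

    # See how little the clone actually cost, plus the ratios quoted above.
    zfs get used,referenced,compressratio zroot/images/build
    zpool get dedupratio zroot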
Oh, did I mention that everything except the boot/root/usr/var filesystems (including swap) is geli-encrypted on this machine as well, and that the nightly PC backup jobs bury the gig-E interface they're attached to -- and sustain that performance against the ZFS disks for the duration? (The machine does have AESNI loaded....) Finally, swap allocation remains at zero throughout all of this.

At present, coming off the overnight period -- which has an activity spike for routine in-house backups from connected PCs but is otherwise the "low point" of activity -- the machine shows 1GB of free memory, an "auto-tuned" ARC of 12.9GB (with a maximum size of 22.3GB), and inactive pages have remained stable. Wired memory is almost 19GB, with Postgres using a sizable chunk of it. Cache efficiency is claimed to be 98.9% (!). That'll go down somewhat over the day, but during the busiest part of the day it remains well into the 90s, which I'm sure has a heck of a lot to do with the performance improvements....

Cross-posted over to -STABLE in the hope of expanding review and testing by others.

--
Karl
karl at denninger.net
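[The figures quoted above -- ARC size, auto-tuned maximum, cache efficiency, swap allocation -- can be read from the standard FreeBSD counters. This is only a sketch of how to pull them; the hit-rate arithmetic is an illustration and is not part of the patch under discussion.]

    # Configured ceiling and the ARC's current / auto-tuned sizes (bytes).
    sysctl vfs.zfs.arc_max
    sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max

    # Rough cache-efficiency figure from the hit/miss counters.
    hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
    misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
    echo "scale=1; 100 * $hits / ($hits + $misses)" | bc

    # Swap allocation, page counts, and whether AESNI is loaded for geli.
    swapinfo -h
    vmstat -s | egrep 'inactive|wired|free'
    kldstat | grep aesni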
dteske at FreeBSD.org
2014-Mar-27 18:27 UTC
kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> -----Original Message-----
> From: Karl Denninger [mailto:karl at denninger.net]
> Sent: Thursday, March 27, 2014 4:53 AM
> To: freebsd-fs at freebsd.org; freebsd-stable at freebsd.org
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>
> On 3/27/2014 4:11 AM, mikej wrote:
> > I've been running the latest patch now on r263711 and want to give it a +1.
> >
> > No ZFS knobs set and I must go out of my way to have my system swap.
> >
> > I hope this patch gets a much wider review and can be put into the tree
> > permanently.
> >
> > Karl, thanks for working on this.
> >
> > Regards,
> >
> > Michael Jung
>
> No problem; I was being driven insane by the stalls and related bad
> behavior... and there's that old saw about complaining about something
> without proposing a fix for it (I've done it!) being "less than optimum",
> so.... :-)
>
> Hopefully wider review (and, if the general consensus is similar to what
> I've seen here and what you're reporting as well, inclusion in the
> codebase) will come.
>
> On my sandbox system I have to get truly abusive before I can get the
> system to swap now, but that load is synthetic, and we all know what
> sometimes happens when you try to extrapolate from synthetic loads to
> real production ones.

We (vicor) are currently putting your patch through the wringer for stable/8 in an effort to mass-deploy it to hundreds of servers (dozens of which rely on production ZFS, several of which have been negatively impacted by the current ARC strategy -- tasks that used to finish in 6 hours or less are taking longer than a day due to being swapped out under ARC pressure). We're very excited about your patch and expect to see a kernel running with it start deployment in mid-April and be fully deployed by mid-May.

> What really has my attention is the impact on systems running live
> production loads.

Lots of those here, but it will take a little time for the patch to trickle out to the production machines. Part of the delay was waiting to see when your patch would stop changing ;D (all good changes, btw... like getting rid of sysctl usage from within the kernel). I believe the last thing I merged for our test lab was the March 24th version -- and it changed yet again on March 26th, so I've got another iteration to churn through before we can even start testing in the test lab. (smiles)

NB: The patch violates style(9), so I've been maintaining a modified version of it for our internal keeping. I've attached the modified March 24th patch, which applies against stable/8.

Also, it's uber annoying to have to decode your contextual diff while translating for style(9) appropriateness. If you switch to unified diff, please also pass -p to generate function tags so I know which hunk is where -- merging into stable/8 was unpleasant without that additional context. In my attached stable/8 patch, you can see what I'm referring to at the onset of each hunk.

ASIDE: It's no big deal because your patch touches only one file, but it's almost always preferred to generate the patch with full paths to each file (e.g., generate the patch at the top of the tree *or* modify the patch-file header afterward to reflect full paths). (smiles -- sorry for picking nits)
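[For reference, a unified diff of the kind requested above -- tree-relative paths plus per-hunk function names -- can be produced as sketched below. The ".orig" pristine copy and the output file name are examples, not files from the thread.]

    # Run from the top of the source tree so the header carries full
    # tree-relative paths; -u selects unified format, -p tags each hunk
    # with the name of the enclosing C function.
    cd /usr/src
    diff -up sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.orig \
             sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c \
             > zfs_arc_newreclaim.patch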
> It has entirely changed the character of those machines, working
> equally well for both pure ZFS machines and mixed UFS/ZFS systems. One
> of these systems that gets pounded on pretty hard and has a moderately
> large configuration (~10TB of storage, two quad-core Xeon processors and
> 24GB of RAM serving a combination of Samba users internally, a decently
> large Postgres installation supporting an externally-facing web forum
> and blog application, email and similar things) has been completely
> transformed from being "frequently challenged" by its workload to
> literally loafing 90%+ of the day. DBMS response times have seen their
> standard deviation drop by an order of magnitude, with best-case
> response times for one of the most common query sequences (~30 separate
> ops) down from ~180ms to ~140ms.

This is most excellent. I can't wait to get it into production! Like you, the machines we have that are struggling are:

a. beefy (24-48 cores, 24-48GB of RAM, 6-12TB ZFS)
b. using a combination of UFS and ZFS simultaneously

> This particular machine has a separate pool for the system itself (root,
> usr and var), which was formerly UFS because it had to be in order to
> avoid the worst of the "stall" behavior. It also has two other pools on
> it: one for read-nearly-only data sets comprised of very large files
> that are almost archival in character, and a second that holds the
> system's "working set". The latter has a separate intent log; I had a
> cache SSD on it as well but recently dropped it, as with these changes
> it no longer produces a material improvement in performance. I'm frankly
> not sure the intent log is helping any more either, but I've yet to drop
> it and instrument the results -- it used to be *necessary* to avoid
> nasty problems during busy periods.
>
> [snip]
>
> At present, coming off the overnight period -- which has an activity
> spike for routine in-house backups from connected PCs but is otherwise
> the "low point" of activity -- the machine shows 1GB of free memory, an
> "auto-tuned" ARC of 12.9GB (with a maximum size of 22.3GB), and inactive
> pages have remained stable. Wired memory is almost 19GB, with Postgres
> using a sizable chunk of it. Cache efficiency is claimed to be 98.9% (!).
> That'll go down somewhat over the day, but during the busiest part of
> the day it remains well into the 90s, which I'm sure has a heck of a lot
> to do with the performance improvements....
>
> Cross-posted over to -STABLE in the hope of expanding review and testing
> by others.

I need to produce a new cleaned-up patch from your March 26th changes. Hopefully the stream of changes is complete... or should I wait?

NB: Cross-posting is generally frowned upon. Please create a separate post to each list next time.

--
Devin

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: karld.zfs_arc_newreclaim(cleaned).stable8patch.txt
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20140327/dcae5428/attachment.txt>