I've long argued that the VM system's interaction with ZFS's ARC cache
and UMA has serious, even severe, issues.  12.x appeared to have
addressed some of them, and as such I've yet to roll forward any part of
the patch series found here [
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 ] or the
Phabricator version referenced in the bug thread (which is more complex
and attempts to dig at the root of the issue more effectively,
particularly when UMA is involved, as it usually is).
Yesterday I decided to perform a fairly significant reorganization of
the ZFS pools on one of my personal machines, including the root pool,
which I changed from mirrored SSDs to a RAIDZ2 (also on SSDs).  This of
course required booting single-user from a 12-STABLE memstick.
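
For context, the new pool was created from that memstick environment
with something along these lines (the device names here are made up;
"zsr" is the new pool referenced in the copy below):

    # from the single-user 12-STABLE memstick shell
    # (example partitions; a real layout would use the actual GPT partitions)
    zpool create -o altroot=/mnt zsr raidz2 ada0p4 ada1p4 ada2p4 ada3p4
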
A simple "zfs send -R zs/root-save/R | zfs recv -Fuev zsr/R" should
have
done it, no sweat.? The root that was copied over before I started is
uncomplicated; it's compressed, but not de-duped.? While it has
snapshots on it too it's by no means complex.
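
For anyone reproducing this, the full sequence is roughly the following
(zfs send -R replicates from a recursive snapshot; the "@move" snapshot
name is just an example):

    # snapshot the old root recursively, then replicate it to the new pool
    zfs snapshot -r zs/root-save/R@move
    zfs send -R zs/root-save/R@move | zfs recv -Fuev zsr/R
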
*The system failed to execute that command with an "out of swap space"
error, killing the job; there was indeed no swap configured since I
booted from a memstick.*
Huh?  A simple *filesystem copy* managed to force a 16GB system into
requiring page file backing store?
I was able to complete the copy by temporarily adding the swap space
back on (where it would be when the move was complete), but that
requirement is pure insanity, and it appears, from what I was able to
determine, that it came about from the same root cause that has been
plaguing VM/ZFS interaction since 2014, when I started working on this
issue -- specifically, when RAM gets low, rather than evicting ARC (or
cleaning up UMA that is allocated but unused) the system attempts to
page out working set.  In this case, since there was nowhere to page
working set out to, the process involved got an OOM error and was
terminated.
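
The workaround, for reference, was nothing more than turning the
eventual swap device on early and re-running the copy (the partition
name here is hypothetical):

    # enable the swap partition the finished system will use
    swapon /dev/ada0p2
    swapinfo    # confirm it's visible, then re-run the send/recv
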
*I continue to argue that this decision is ALWAYS wrong.*
It's wrong because if you invalidate cache and reclaim it you *might*
have to perform a physical read in the future to bring that data back
into the cache (since it's no longer in RAM), but in exchange for that
*potential* I/O you perform a GUARANTEED physical I/O (to page out some
amount of working set) and possibly TWO physical I/Os (to page said
working set out and, later, page it back in).
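
Put in rough expected-I/O terms, with p the chance the evicted cache
data is ever wanted again and q the chance the paged-out working set is
touched again (for working set, q is close to 1 by definition):

    evict cache:           p      expected I/Os  (one read, only on a future miss)
    page out working set:  1 + q  expected I/Os  (one write now, one read later)
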
It has always appeared to me to be flat-out nonsensical to trade a
possible physical I/O (if there is a future cache miss) for a guaranteed
physical I/O and a possible second one.  It's even worse if the reason
you make that decision is that UMA is allocated but unused; in that case
you are paging when no physical I/O is required at all, as the "memory
pressure" is a phantom!  While UMA is a very material performance win in
the general case, allowing allocated-but-unused UMA to force paging
appears, from a performance perspective, to be flat-out insanity.  I
find it very difficult to come up with any reasonable scenario where
releasing allocated-but-unused UMA rather than paging out working set is
a net performance loser.
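
For what it's worth, the disproportion is easy to see with stock tools;
a quick look from a shell (sysctl names as they appear on 12.x, to the
best of my recollection):

    # current ARC size and its configured cap
    sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
    # per-zone UMA statistics; FREE is allocated-but-unused items
    vmstat -z
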
In this case, since the system was running in single-user mode with no
swap available, the process selected to be killed when that circumstance
arose was the copy process itself.  The copy itself did not require
anywhere near all of the available non-kernel RAM.
I'm going to dig into this further, but IMHO the base issue still
exists, even though its impact on my workloads with everything "running
normally" has materially decreased with 12.x.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/