thr3ads.net - freebsd stable - All the memory eaten away by ZFS 'solaris' malloc

Mark Martinec

2018-Jul-23 15:12 UTC

All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
(amd64), ZFS is gradually eating up all memory, so that it crashes every
few days when the memory is completely exhausted (after swapping heavily
for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC
at 1.8 GB the machine can now stay up a bit longer, but within four days
all the memory is used up. The machine is lightly loaded: it runs
a bind resolver and a lightly used web server, and the ps output
does not show excessive memory use by any process.

During the last survival period I ran  vmstat -m  every second
and logged results. What caught my eye was the 'solaris' entry,
which seems to explain all the exhaustion.

The MemUse for the solaris entry starts modestly, e.g. after a few
hours of uptime:

$ vmstat -m :
          Type InUse MemUse HighUse Requests  Size(s)
       solaris 3141552 225178K       - 12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

... but this number keeps steadily growing.

After about four days, shortly before a crash, it grew to 2.5 GB,
which gets dangerously close to all the available memory:

       solaris 39359484 2652696K       - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

Plotting the 'solaris' MemUse entry against wall time in seconds shows
a steady linear growth of about 25 MB per hour. At fine resolution
the growth looks like a staircase, with roughly one small step every 6 seconds.
All steps are small, but not all are the same size.
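
The growth rate can be cross-checked from the two `vmstat -m` samples quoted above. The following is only a sketch: the function names are made up for illustration, the `K` suffix is taken to mean KiB, and the 93 hours of elapsed time is an assumption (the post only says "a few hours" and "about four days"), so the result is approximate.

```python
import re

def parse_vmstat_type(text, type_name="solaris"):
    """Return (in_use, mem_use_kib, requests) for one malloc type in `vmstat -m` text."""
    pattern = re.compile(
        r"^\s*" + re.escape(type_name) + r"\s+(\d+)\s+(\d+)K\s+\S+\s+(\d+)",
        re.MULTILINE,
    )
    match = pattern.search(text)
    if match is None:
        raise ValueError(f"no {type_name!r} line found")
    return int(match.group(1)), int(match.group(2)), int(match.group(3))

def growth_mib_per_hour(kib_start, kib_end, elapsed_hours):
    """Average growth between two MemUse samples, in MiB per hour."""
    return (kib_end - kib_start) / 1024 / elapsed_hours

# The two samples quoted in the post.
EARLY = "       solaris 3141552 225178K       - 12066929"
LATE = "       solaris 39359484 2652696K       - 234986296"

_, early_kib, _ = parse_vmstat_type(EARLY)
_, late_kib, _ = parse_vmstat_type(LATE)

# ASSUMPTION: about 93 hours between the two samples; the post gives no exact times.
rate = growth_mib_per_hour(early_kib, late_kib, elapsed_hours=93)
print(f"{rate:.1f} MiB/hour")
```

With that assumed interval the average comes out close to the 25 MB per hour read off the plot.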

The only thing (in my mind) that distinguishes this host from others
running 11.1 seems to be that one of the two ZFS pools is down because
its disk is broken. This is a scratch data pool, not otherwise in use.
The pool with the OS is healthy.

The syslog shows entries like the following periodically:

Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
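
To see how often these events occur and whether they all come from the same vdev, the log can be summarised with a short script. This is a rough sketch, not an established tool: the helper names are hypothetical, the regular expression only covers the message format shown above, and the year is passed in by hand because syslog timestamps omit it.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the "vdev state changed" lines shown above; \s+ also tolerates line wraps.
EVENT_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+ZFS: vdev state changed,"
    r"\s+pool_guid=(\d+)\s+vdev_guid=(\d+)",
    re.MULTILINE,
)

def vdev_state_changes(log_text, year=2018):
    """Return a list of (timestamp, pool_guid, vdev_guid), one per logged event."""
    events = []
    for stamp, pool_guid, vdev_guid in EVENT_RE.findall(log_text):
        # Syslog timestamps carry no year, so the caller supplies one.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, pool_guid, vdev_guid))
    return events

def seconds_between(events):
    """Gaps in seconds between consecutive events."""
    times = [when for when, _, _ in events]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def events_per_vdev(events):
    """Count events for each vdev_guid."""
    return Counter(vdev_guid for _, _, vdev_guid in events)

SAMPLE = """\
Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
"""

events = vdev_state_changes(SAMPLE)
print(seconds_between(events))
print(events_per_vdev(events))
```

Run over a full day of log output, the gaps could be compared against the roughly 6-second steps seen in the 'solaris' MemUse plot.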

The 'zpool status -v' on this pool shows:

   pool: stuff
  state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
         replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
    see: http://illumos.org/msg/ZFS-8000-3C
   scan: none requested
config:

         NAME                    STATE     READ WRITE CKSUM
         stuff                   UNAVAIL      0     0     0
           11732693005294113354  UNAVAIL      0     0     0  was /dev/da2


The same machine with this broken pool could previously survive
indefinitely under FreeBSD 10.3.

So, could this be the reason for the memory depletion?
Are there any fixes for it? Are there any further tests I should run
before I try to get rid of this pool?

   Mark

Mark Martinec

2018-Jul-31 21:54 UTC

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. I also turned off all services.
ZFS is still leaking memory at about 30 MB per hour, until the host
runs out of memory and swap space and crashes, unless I reboot it
first every four days.

Any advice before I try to get rid of that faulted disk with a pool
(or downgrade to 10.3, which was stable)?

   Mark


Shane Ambler

2018-Aug-01 01:29 UTC

On 01/08/2018 07:24, Mark Martinec wrote:
> I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
> and the situation has not improved. Also turned off all services.
> ZFS is still leaking memory about 30 MB per hour, until the host
> runs out of memory and swap space and crashes, unless I reboot it
> first every four days.
>
> Any advise before I try to get rid of that faulted disk with a pool
> (or downgrade to 10.3, which was stable) ?
>
>   Mark
>
>
> 2018-07-23 17:12, myself wrote:
>> After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
>> (amd64), ZFS is gradually eating up all memory, so that it crashes every
>> few days when the memory is completely exhausted (after swapping heavily
>> for a couple of hours).
>>
>> This machine has only 4 GB of memory. After capping up the ZFS ARC
>> to 1.8 GB the machine can now stay up a bit longer, but in four days
>> all the memory is used up. The machine is lightly loaded, it runs
>> a bind resolver and a lightly used web server, the ps output
>> does not show any excessive memory use by any process.

When you say all used up, do you mean the amount of wired RAM goes higher
than about 90% of physical RAM? You can watch the wired amount in top, or
calculate it as vm.stats.vm.v_wire_count * hw.pagesize.
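
As a sketch of that calculation (the helper names are hypothetical, and using `hw.physmem` as the figure for physical RAM is an assumption not taken from this thread), the wired amount and its share of RAM could be computed like this:

```python
import subprocess

def sysctl_int(name):
    """Read one integer sysctl value via `sysctl -n` (needs a FreeBSD host)."""
    result = subprocess.run(
        ["sysctl", "-n", name], check=True, capture_output=True, text=True
    )
    return int(result.stdout.strip())

def wired_fraction(wire_count, page_size, phys_bytes):
    """Return (wired_bytes, wired share of physical RAM)."""
    wired_bytes = wire_count * page_size
    return wired_bytes, wired_bytes / phys_bytes

def read_wired_fraction():
    """Collect the live values; only meaningful when run on FreeBSD."""
    return wired_fraction(
        sysctl_int("vm.stats.vm.v_wire_count"),
        sysctl_int("hw.pagesize"),
        sysctl_int("hw.physmem"),  # ASSUMPTION: used here as "physical RAM"
    )

# Worked example with made-up numbers: 786432 wired pages of 4096 bytes on a 4 GiB box.
wired_bytes, share = wired_fraction(786432, 4096, 4 * 1024**3)
print(f"{wired_bytes / 1024**3:.1f} GiB wired, {share:.0%} of RAM")
```

On the affected host, calling `read_wired_fraction()` periodically would show whether the wired share really climbs past the 90% mark.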

ZFS ARC is marked as wired. There is also vm.max_wired, which limits how
much the kernel can wire; this defaults to 30% of RAM, so about 1.2G for
you. It seems these two wired values don't interact and can add up to
more than physical RAM. I have reported this in bug 229764.

Try the patch at
https://reviews.freebsd.org/D7538
It has given me the best ARC-related memory improvements I have seen
since 10.1; I now see ARC being released instead of swap being used.

-- 
FreeBSD - the place to B...Software Developing

Shane Ambler

Volodymyr Kostyrko

2018-Aug-13 19:48 UTC

23.07.18 18:12, Mark Martinec wrote:
> After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
> (amd64), ZFS is gradually eating up all memory, so that it crashes every
> few days when the memory is completely exhausted (after swapping heavily
> for a couple of hours).

I've been in the same situation. ZFS with a single pool, no ZFS errors.

I think the problem is rather between swapping and the ZFS ARC. This host
has a varying load: sometimes it needs more active memory, sometimes
less. This means that the active zone can expand and shrink by about +-2G of
memory (I have 16 GB installed there). The problem is that when a huge task is
idle it doesn't use much active memory, and other activity pushes
its memory out to swap. When active memory runs low and the ARC takes >50% of
memory, it becomes very hard to make the ARC give some memory back. My host
was even brought to the point where it couldn't get tasks back into memory
from swap: while some pages were being restored from swap, time
passed and other pages were pushed out to swap instead due to some ARC
activity. Finally the active zone shrinks so badly that the host becomes
unresponsive.

About 6 months ago I tried tweaking kernel and swap settings to make things go
the other way. Currently I have `vm.swap_idle_enabled=1` in /etc/loader.conf
and it looks like this solves my problem. Other interesting things to
look at are `vfs.zfs.arc_free_target`, `vfs.zfs.arc_shrink_shift` and
`vfs.zfs.arc_grow_retry`.

Or you can take another route and simply limit the ARC size with
`vfs.zfs.arc_max`.
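
Since `vfs.zfs.arc_max` is given in bytes, a small helper can turn a limit in GiB into a line for /boot/loader.conf. This is a sketch only: the function name is hypothetical, and reading the 1.8 GB cap mentioned earlier in the thread as 1.8 GiB is an assumption.

```python
def arc_max_loader_line(limit_gib):
    """Format a loader.conf line capping the ZFS ARC at limit_gib GiB."""
    limit_bytes = int(limit_gib * 1024**3)  # the tunable is expressed in bytes
    return f'vfs.zfs.arc_max="{limit_bytes}"'

# ASSUMPTION: the 1.8 GB cap from earlier in the thread, read as 1.8 GiB.
print(arc_max_loader_line(1.8))
```

The printed line would go into /boot/loader.conf and take effect at the next boot.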

Hope that helps.

-- 
Sphinx of black quartz judge my vow.
