Mark Martinec
2018-Jul-31 21:54 UTC
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. I have also turned off all services.
ZFS is still leaking memory at about 30 MB per hour, until the host runs
out of memory and swap space and crashes, unless I reboot it first every
four days.

Any advice before I try to get rid of that faulted disk and its pool
(or downgrade to 10.3, which was stable)?

  Mark

2018-07-23 17:12, myself wrote:
> After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
> (amd64), ZFS is gradually eating up all memory, so that the host crashes
> every few days when memory is completely exhausted (after swapping
> heavily for a couple of hours).
>
> This machine has only 4 GB of memory. After capping the ZFS ARC at
> 1.8 GB the machine can now stay up a bit longer, but in four days all
> the memory is used up. The machine is lightly loaded: it runs a BIND
> resolver and a lightly used web server, and the ps output does not show
> any excessive memory use by any process.
>
> During the last survival period I ran vmstat -m every second and logged
> the results. What caught my eye was the 'solaris' entry, which seems to
> account for all of the exhaustion.
>
> The MemUse for the solaris entry starts modestly, e.g. after a few
> hours of uptime:
>
> $ vmstat -m
>      Type    InUse   MemUse  HighUse   Requests  Size(s)
>   solaris  3141552  225178K        -   12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768
>
> ... but this number keeps growing steadily.
>
> After about four days, shortly before a crash, it had grown to 2.5 GB,
> which gets dangerously close to all the available memory:
>
>   solaris 39359484 2652696K        -  234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768
>
> Plotting the 'solaris' MemUse entry against wall time in seconds shows
> a steady linear growth of about 25 MB per hour. At fine resolution
> there is one small step increase roughly every 6 seconds; all steps are
> small, but not all are the same size.
>
> The only thing (in my mind) that distinguishes this host from others
> running 11.1 is that one of its two ZFS pools is down because its disk
> is broken. This is a scratch data pool, not otherwise in use. The pool
> with the OS is healthy.
>
> The syslog periodically shows entries like these:
>
>   Jul 23 16:48:49 xxx ZFS: vdev state changed,
>     pool_guid=15371508659919408885 vdev_guid=11732693005294113354
>   Jul 23 16:49:09 xxx ZFS: vdev state changed,
>     pool_guid=15371508659919408885 vdev_guid=11732693005294113354
>   Jul 23 16:55:34 xxx ZFS: vdev state changed,
>     pool_guid=15371508659919408885 vdev_guid=11732693005294113354
>
> 'zpool status -v' on this pool shows:
>
>     pool: stuff
>    state: UNAVAIL
>   status: One or more devices could not be opened.  There are
>           insufficient replicas for the pool to continue functioning.
>   action: Attach the missing device and online it using 'zpool online'.
>      see: http://illumos.org/msg/ZFS-8000-3C
>     scan: none requested
>   config:
>
>           NAME                    STATE     READ WRITE CKSUM
>           stuff                   UNAVAIL      0     0     0
>             11732693005294113354  UNAVAIL      0     0     0  was /dev/da2
>
> The same machine with this same broken pool could previously survive
> indefinitely under FreeBSD 10.3.
>
> So, could this be the reason for the memory depletion? Any fixes for
> that? Any more tests worth performing before I try to get rid of this
> pool?
>
>   Mark
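The per-second logging mentioned above can be done with a trivial shell
loop; the following is only a sketch (the awk filter and the log file
name are illustrative, not the exact commands used), recording a
timestamp together with the InUse and MemUse columns of the 'solaris'
malloc type:

  $ while true; do
        printf '%s ' "$(date +%s)"
        vmstat -m | awk '$1 == "solaris" { print $2, $3 }'
        sleep 1
    done >> solaris-memuse.log

Each log line then holds epoch seconds, the InUse count and the MemUse
figure, which is enough to plot MemUse against wall time.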
Mark Johnston
2018-Jul-31 22:09 UTC
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:
> I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
> and the situation has not improved. I have also turned off all services.
> ZFS is still leaking memory at about 30 MB per hour, until the host
> runs out of memory and swap space and crashes, unless I reboot it
> first every four days.
>
> Any advice before I try to get rid of that faulted disk and its pool
> (or downgrade to 10.3, which was stable)?

If you're able to use dtrace, it would be useful to try tracking
allocations with the solaris tag:

# dtrace -n 'dtmalloc::solaris:malloc { @allocs[stack(), args[3]] = count(); }
             dtmalloc::solaris:free { @frees[stack(), args[3]] = count(); }'

Try letting that run for one minute, then kill it and paste the output.
Ideally the host will be as close to idle as possible while still
demonstrating the leak.
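If killing dtrace by hand is inconvenient, the same two probe clauses can
be made to stop on their own after a minute; this is only a sketch (the
tick-60s clause is an assumed addition, not part of the suggestion
above), and dtrace prints both aggregations when it exits:

# dtrace -n '
    dtmalloc::solaris:malloc { @allocs[stack(), args[3]] = count(); }
    dtmalloc::solaris:free   { @frees[stack(), args[3]] = count(); }
    tick-60s                 { exit(0); }'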
Mark Martinec
2018-Aug-01 07:12 UTC
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
> On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:
>> I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
>> and the situation has not improved. I have also turned off all services.
>> ZFS is still leaking memory at about 30 MB per hour, until the host
>> runs out of memory and swap space and crashes, unless I reboot it
>> first every four days.
>>
>> Any advice before I try to get rid of that faulted disk and its pool
>> (or downgrade to 10.3, which was stable)?

2018-08-01 00:09, Mark Johnston wrote:
> If you're able to use dtrace, it would be useful to try tracking
> allocations with the solaris tag:
>
> # dtrace -n 'dtmalloc::solaris:malloc { @allocs[stack(), args[3]] = count(); }
>              dtmalloc::solaris:free { @frees[stack(), args[3]] = count(); }'
>
> Try letting that run for one minute, then kill it and paste the output.
> Ideally the host will be as close to idle as possible while still
> demonstrating the leak.

Good and bad news.

The suggested dtrace command bails out:

# dtrace -n 'dtmalloc::solaris:malloc { @allocs[stack(), args[3]] = count(); }
             dtmalloc::solaris:free { @frees[stack(), args[3]] = count(); }'
dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file
  /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
  line 3330.
Abort trap

But I did get one step further and have localized the culprit. I realized
that the "solaris" malloc count goes up in sync with the polls of the
'telegraf' monitoring service, which has a ZFS plugin that monitors the
ZFS pools and the ARC; this plugin runs 'zpool list -Hp' periodically.

So, after stopping telegraf (and the other remaining services),
'vmstat -m' shows that the InUse count for "solaris" goes up by 552 every
time I run "zpool list -Hp":

# (while true; do zpool list -Hp >/dev/null; vmstat -m | \
     fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
6664427
541
552
552
552
552
552
552
552
552
556
548
552
552
552
552
552
552
552
552
552

# zpool list -Hp
floki  68719476736  37354102272  31365374464  -  -  49%  54  1.00x  ONLINE   -
stuff  -            -            -            -  -  -    -   -      UNAVAIL  -

  Mark
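Since a single 'zpool list -Hp' run appears to account for the growth,
one possible next step is to scope the earlier dtmalloc probes to exactly
one invocation of that command with dtrace -c, so that the stacks printed
on exit cover just one poll. This is only a sketch (and it assumes the
assertion failure seen above does not also trip here); the dtmalloc
probes fire system-wide, so on an otherwise idle host most of the
reported activity should come from the zpool run:

# dtrace -c 'zpool list -Hp' -n '
    dtmalloc::solaris:malloc { @allocs[stack(), args[3]] = count(); }
    dtmalloc::solaris:free   { @frees[stack(), args[3]] = count(); }'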