Mark Johnston
2018-Aug-04 17:01 UTC
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
On Fri, Aug 03, 2018 at 09:11:42PM +0200, Mark Martinec wrote:
> More attempts at tracking this down. The suggested dtrace command does
> usually abort with:
>
>   Assertion failed: (buf->dtbd_timestamp >= first_timestamp),
>     file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
>     line 3330.

Hrmm. As a workaround you can add "-x temporal=off" to the dtrace(1)
invocation.

> but with some luck soon after each machine reboot I can leave the dtrace
> running for about 10 or 20 seconds (max) before terminating it with a ^C,
> and succeed in collecting the report. If I miss the opportunity to leave
> dtrace running just long enough to collect useful info, but not long
> enough for it to hit the assertion check, then any further attempt
> to run the dtrace script hits the assertion fault immediately.
>
> Btw, (just in case) I have recompiled kernel from source
> (base/release/11.2.0) with debugging symbols, although the behaviour
> has not changed:
>
>   FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238:
>   Fri Aug 3 17:29:42 CEST 2018
>   mark at xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64
>
> Anyway, after several attempts I was able to collect a useful dtrace
> output from the suggested dtrace script:
>
> # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
>   count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'
>
> while running "zpool list" repeatedly in another terminal screen:

I think running "zpool list" is adding a lot of noise to the output.
Could you retry without doing that?
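For reference, the suggested workaround combined with the dtrace one-liner from the thread might look like the sketch below. The variable and function names (DTRACE_OPTS, D_PROG, trace_solaris_malloc) are made up for illustration. Only the "-x temporal=off" option and the D program itself come from the messages above. The function is defined but not called, since it needs root on a FreeBSD host with DTrace available.

```shell
#!/bin/sh
# Workaround option suggested in the thread for the
# dtbd_timestamp assertion failure in libdtrace.
DTRACE_OPTS="-x temporal=off"

# D program from the thread: count allocations and frees of the
# "solaris" malloc type, keyed by kernel stack and args[3].
D_PROG='dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()}
dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'

# Hypothetical wrapper; run it as root and stop it with ^C to
# print the aggregations.
trace_solaris_malloc() {
    # DTRACE_OPTS is intentionally unquoted so it splits into two words.
    dtrace $DTRACE_OPTS -n "$D_PROG"
}
```

Calling trace_solaris_malloc in one terminal and interrupting it after the workload of interest has run should print both aggregations, assuming the option avoids the assertion as suggested.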
Mark Martinec
2018-Aug-04 18:38 UTC
All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
2018-08-04 19:01, Mark Johnston wrote:
> I think running "zpool list" is adding a lot of noise to the output.
> Could you retry without doing that?

No, like I said previously, the "zpool list" (with one defunct
zfs pool) *is* the sole culprit of the zfs memory leak.
With each invocation of "zpool list" the "solaris" malloc
jumps up by the same amount, and never ever drops. Without
running it (like repeatedly under 'telegraf' monitoring
of zfs), the machine runs normally and never runs out of
memory, the "solaris" malloc count no longer grows steadily.

This leak was introduced sometime between 10.3 and 11.1R-p11,
and is still there with 11.2.

  Mark
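The claim that each "zpool list" run grows the "solaris" malloc by a fixed amount could be checked with something like the sketch below. It assumes that "vmstat -m" prints one row per malloc type with the type name in the first column and MemUse (suffixed with "K") in the third. The function names solaris_memuse and measure_leak are hypothetical. measure_leak is not called automatically, because it needs a FreeBSD host with ZFS loaded.

```shell
#!/bin/sh
# Read `vmstat -m` output on stdin and print the MemUse value
# (in KiB, without the trailing "K") for the "solaris" malloc type.
# Assumes the column order: Type InUse MemUse ...
solaris_memuse() {
    awk '$1 == "solaris" { sub(/K$/, "", $3); print $3 }'
}

# Run `zpool list` N times (default 5) and report how much the
# "solaris" MemUse changed after each invocation.
measure_leak() {
    n=${1:-5}
    prev=$(vmstat -m | solaris_memuse)
    i=1
    while [ "$i" -le "$n" ]; do
        zpool list >/dev/null 2>&1
        cur=$(vmstat -m | solaris_memuse)
        echo "run $i: solaris MemUse ${cur}K (delta $((cur - prev))K)"
        prev=$cur
        i=$((i + 1))
    done
}
```

If the leak behaves as described, "measure_leak 10" should show a roughly constant positive delta on every run, and MemUse should stay flat when "zpool list" is not being run.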