Michael Schuster - Sun Germany
2005-Oct-20 07:44 UTC
[dtrace-discuss] success thx to DTrace
Hi all,
Here's a real-life customer issue I was able to solve (quicker) thanks to
DTrace. Adam and Mike encouraged me to share this with you, so here goes
- please note that the D scripts here are missing the #pragmas etc. to
conserve space.
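(For completeness: the stripped preamble is just the usual boilerplate,
something along these lines - the exact options are a matter of taste:

#!/usr/sbin/dtrace -s
#pragma D option quiet

i.e. save a script to a file and run it with "dtrace -s script.d", or make
the file executable and run it directly.)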
A customer noticed that the machine was using up all its memory and had to
be rebooted about once a week. The customer is a big webhoster in Germany,
the machine is a pop3 server, and the pop3 daemon is their own development.
The customer did some investigation of their own and found that the
ipcl_conn_cache was constantly growing (we have a dump where it's 2.5 GB in
size). My first suspicion was that the pop daemon wasn't cleaning up
properly after itself, so the first D script we used aggregated
kmem_cache_alloc()s out of ipcl_tcpconn_cache by execname, as well as the
corresponding kmem_cache_free()s:
fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
    self->entered = 1;          /* thread-local flag: set here, checked in :return */
    @allocs[execname] = count();
}

fbt::kmem_cache_alloc:return
/self->entered/
{
    self->entered = 0;
    inuse[args[1]]++;           /* args[1] is the buffer just allocated */
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
    inuse[args[1]]--;           /* args[1] is the buffer being freed */
    @frees[execname] = count();
}

tick-10s
{
    printa("allocs: %@8u %s\n", @allocs);
    printa("frees: %@8u %s\n", @frees);
    trunc(@allocs); trunc(@frees);
}
The interesting result was that most allocations were made by "sched",
i.e. by kernel threads (a typical sample):
allocs: 15 pop3d
allocs: 866 sched
frees: 2 picld
frees: 2 sched
allocs: 11 pop3d
allocs: 833 sched
frees: 1 pop3d
frees: 1 sched
frees: 2 snmpd
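(As an aside, and not part of the original investigation: "sched" is what
execname reports for pid 0, i.e. for work done by pure kernel threads; a
quick one-liner along these lines - just a sketch - confirms this by keying
the same allocations by pid as well:

dtrace -n 'fbt::kmem_cache_alloc:entry
    /args[0] == `ipcl_tcpconn_cache/
    { @[pid, execname] = count(); }'

the kernel-thread entries come back with pid 0.)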
So I decided to gather some information about the call stacks when we're in
a kernel thread, plus a meaningful timestamp etc.:
fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
    self->entered = 1;
    @allocs[execname] = count();
}

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache && execname == "sched"/
{
    self->savestack = 1;        /* only collect stacks for kernel threads */
}

fbt::kmem_cache_alloc:return
/self->savestack/
{
    self->savestack = 0;
    @alloc_st[stack()] = count();
}

fbt::kmem_cache_alloc:return
/self->entered/
{
    self->entered = 0;
    inuse[args[1]]++;
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0 && execname == "sched"/
{
    self->save_fr_stack = 1;
}

fbt::kmem_cache_free:return
/self->save_fr_stack/
{
    self->save_fr_stack = 0;
    @frees_st[stack()] = count();
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
    inuse[args[1]]--;
    @frees[execname] = count();
}

tick-10s
{
    printf("----%Y---\n", walltimestamp);   /* show progress */
    printa("kmem_alloc(ipcl_tcpconn_cache) #: %@8u %s\n", @allocs);
    printf("kernel threads alloc(): ");
    printa(@alloc_st);          /* each stack is printed followed by its count */
    printa("kmem_free(ipcl_tcpconn_cache) #: %@8u %s\n", @frees);
    printf("kernel threads free(): ");
    printa(@frees_st);
    trunc(@allocs); trunc(@frees);
    trunc(@alloc_st); trunc(@frees_st);
}
This typically showed something like this:
----2005 Oct 14 09:46:11---
kmem_alloc(ipcl_tcpconn_cache) #: 6 pop3d
kmem_alloc(ipcl_tcpconn_cache) #: 268 sched
kernel threads alloc():
[ .. a few stacks deleted ...]
ip`tcp_timermp_alloc+0x14
ip`ipcl_tcpconn_constructor+0x14
genunix`kmem_cache_alloc+0x1cc
ip`ipcl_conn_create+0x34
ip`tcp_get_conn+0x64
ip`tcp_conn_request+0x1e4
ip`squeue_enter+0x74
ip`ip_tcp_input+0x9d0
ip`ip_rput+0x6f4
unix`putnext+0x218
ce`ce_drain_fifo+0x52e4
unix`thread_start+0x4
108
kmem_free(ipcl_tcpconn_cache) #: 18 sched
kmem_free(ipcl_tcpconn_cache) #: 89 pop3d
kernel threads free():
[ ...]
genunix`pid_rele+0x68
genunix`pid_exit+0xac
genunix`waitid+0x198
genunix`waitsys32+0x10
unix`syscall_trap32+0xcc
3
----2005 Oct 14 09:46:21---
Luckily, the customer got creative and, making use of a load balancer,
reduced the network load on this machine quite a bit, and noticed that all
of a sudden ipcl_conn_cache had gone down in size. This is what the script
showed for "kernel threads free":
ip`tcp_time_wait_collector+0xc0
genunix`callout_execute+0x98
genunix`taskq_thread+0x1a4
unix`thread_start+0x4
11180
dropping "tcp_time_wait_collector" into sunsolve showed up CR 5042924
"too
many connections on tcp_free_list is a memory hog" on the first page, which
describes the behaviour cust sees exactly.
Basically, what tcp_time_wait_collector() does is take the first element
off a "free list" (a list of connections in TIME_WAIT state, as I
understand it), check whether its flag "tcp_in_free_list" is B_TRUE, and,
if so, walk the whole linked list and call CONN_DEC_REF() on every
connection. This flag is only set on the first element of the list, so if
the first element is re-used before tcp_time_wait_collector gets a chance
to clean up the list (which can happen under certain circumstances, such as
the ones the customer was experiencing), the flag is not propagated to the
next element, and the list is therefore never cleaned up.
To confirm this, I created the following script, which counts how often we
see the flag FALSE or TRUE when tcp_time_wait_collector() is called:
fbt::tcp_time_wait_collector:entry
{
    self->tcw = 1;              /* we're inside tcp_time_wait_collector() */
}

/* free list non-empty and the flag on its head is B_TRUE */
fbt::squeue_getprivate:return
/self->tcw && args[1] != 0 &&
    (*(tcp_squeue_priv_t **)args[1])->tcp_free_list &&
    (*(tcp_squeue_priv_t **)args[1])->tcp_free_list->tcp_in_free_list/
{
    @flags1 = count();
    self->tcw = 0;
}

/* free list non-empty, but the flag on its head is B_FALSE */
fbt::squeue_getprivate:return
/self->tcw && args[1] != 0 &&
    (*(tcp_squeue_priv_t **)args[1])->tcp_free_list/
{
    @flags0 = count();
    self->tcw = 0;
}

fbt::squeue_getprivate:return
/self->tcw/
{
    self->tcw = 0;
}

tick-10s
{
    printf("----%Y---\n", walltimestamp);   /* show progress */
    printa("tcp_in_free_list == B_TRUE: %@d\n", @flags1);
    printa("tcp_in_free_list == B_FALSE: %@d\n", @flags0);
    trunc(@flags1);
    trunc(@flags0);
}
And the results were as expected: during the high-load situation, only
"B_FALSE" occurrences were reported; once the load was reduced, we got
this:
tcp_in_free_list == B_FALSE: 4
----2005 Oct 19 09:52:12---
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:22---
tcp_in_free_list == B_TRUE: 1
tcp_in_free_list == B_FALSE: 7
----2005 Oct 19 09:52:32---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:42---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6
For those interested: a fix is in the making ;-)
Comments etc. are welcome.
Michael
--
Michael Schuster (+49 89) 46008-2974 / x62974
visit the online support center: http://www.sun.com/osc/
Recursion, n.: see 'Recursion'