Michael Schuster - Sun Germany
2005-Oct-20  07:44 UTC
[dtrace-discuss] success thx to DTrace
Hi all,
Here's a real-life customer issue I was able to solve (more quickly)
thanks to DTrace. Adam and Mike encouraged me to share this with you, so
here goes - please note that the D scripts below omit the #pragmas etc.
to save space.
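(For completeness, the omitted preamble is just the usual boilerplate,
something along these lines:)

#!/usr/sbin/dtrace -s
/* suppress the default per-probe output - we only want our printf()s */
#pragma D option quiet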
A customer noticed that a machine was using up all its memory and had to
be rebooted about once a week. The customer is a big webhoster in
Germany; the machine is a POP3 server, and the pop3 daemon is developed
in-house.
The customer did some investigating of their own and found that
ipcl_conn_cache was growing constantly (we have a dump where it's 2.5 GB
in size). My first suspicion was that the pop daemon wasn't cleaning up
properly after itself, so the first D script we used aggregated
kmem_cache_alloc()s out of ipcl_conn_cache by execname, as well as
kmem_cache_free()s:
fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
        self->entered = 1;
        @allocs[execname] = count();
}
fbt::kmem_cache_alloc:return
/self->entered == 1/
{
        self->entered = 0;
        inuse[args[1]]++;       /* track buffers we saw being allocated */
}
fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
        /* only count frees of buffers we saw allocated */
        inuse[args[1]]--;
        @frees[execname] = count();
}
tick-10s
{
        printa ("allocs: %@8u %s\n", @allocs);
        printa ("frees: %@8u %s\n", @frees);
        trunc (@allocs); trunc (@frees);
}
The interesting result was that most allocations were made by "sched",
i.e. by kernel threads (a typical sample):
allocs:       15 pop3d
allocs:      866 sched
frees:        2 picld
frees:        2 sched
allocs:       11 pop3d
allocs:      833 sched
frees:        1 pop3d
frees:        1 sched
frees:        2 snmpd
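("sched" is the execname reported for proc 0, which is where kernel
threads live. If you want to convince yourself of that, a sketch like
this one, keyed on the built-in pid variable, makes it obvious -
kernel-thread allocations show up with pid 0:)

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
        @bypid[pid, execname] = count();
}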
So I decided to collect the call stacks for allocations and frees done
from kernel threads, plus a meaningful timestamp to show progress:
fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
        self->entered = 1;
        @allocs[execname] = count();
}
fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache && execname == "sched"/
{
        self->savestack = 1;
}
fbt::kmem_cache_alloc:return
/self->savestack == 1/
{
        self->savestack = 0;
        @alloc_st[stack()] = count();
}
fbt::kmem_cache_alloc:return
/self->entered == 1/
{
        self->entered = 0;
        inuse[args[1]]++;
}
fbt::kmem_cache_free:entry
/inuse[args[1]] > 0 && execname == "sched"/
{
        self->save_fr_stack = 1;
}
fbt::kmem_cache_free:return
/self->save_fr_stack == 1/
{
        self->save_fr_stack = 0;
        @frees_st[stack()] = count();
}
fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
        inuse[args[1]]--;
        @frees[execname] = count();
}
tick-10s
{
        printf ("----%Y---\n", walltimestamp); /* show progress */
        printa ("kmem_alloc(ipcl_tcpconn_cache) #: %@8u %s\n", @allocs);
        printf ("kernel threads alloc(): ");
        printa (@alloc_st);
        printa ("kmem_free(ipcl_tcpconn_cache) #: %@8u %s\n", @frees);
        printf ("kernel threads free(): ");
        printa (@frees_st);
        trunc (@allocs); trunc (@frees);
        trunc (@alloc_st); trunc (@frees_st);
}
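(Aside: the alloc-side stacks could just as well be grabbed directly at
:entry, which saves the flag variable - a minimal equivalent sketch:)

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache && execname == "sched"/
{
        @alloc_st[stack()] = count();
}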
The full script typically showed output like the following:
----2005 Oct 14 09:46:11---
kmem_alloc(ipcl_tcpconn_cache) #:        6 pop3d
kmem_alloc(ipcl_tcpconn_cache) #:      268 sched
kernel threads alloc():
[ .. a few stacks deleted ...]
               ip`tcp_timermp_alloc+0x14
               ip`ipcl_tcpconn_constructor+0x14
               genunix`kmem_cache_alloc+0x1cc
               ip`ipcl_conn_create+0x34
               ip`tcp_get_conn+0x64
               ip`tcp_conn_request+0x1e4
               ip`squeue_enter+0x74
               ip`ip_tcp_input+0x9d0
               ip`ip_rput+0x6f4
               unix`putnext+0x218
               ce`ce_drain_fifo+0x52e4
               unix`thread_start+0x4
               108
 
kmem_free(ipcl_tcpconn_cache) #:       18 sched
kmem_free(ipcl_tcpconn_cache) #:       89 pop3d
kernel threads free():
[ ...]
               genunix`pid_rele+0x68
               genunix`pid_exit+0xac
               genunix`waitid+0x198
               genunix`waitsys32+0x10
               unix`syscall_trap32+0xcc
                 3
----2005 Oct 14 09:46:21---
Luckily, the customer was creative: using a load balancer, they reduced
the network load on this machine quite a bit, and noticed that all of a
sudden ipcl_conn_cache had gone down in size. This is what the script
showed for "kernel threads free":
               ip`tcp_time_wait_collector+0xc0
               genunix`callout_execute+0x98
               genunix`taskq_thread+0x1a4
               unix`thread_start+0x4
             11180
Dropping "tcp_time_wait_collector" into SunSolve turned up CR 5042924,
"too many connections on tcp_free_list is a memory hog", on the first
page; it describes exactly the behaviour the customer was seeing.
Basically, what tcp_time_wait_collector() does is this: it takes the
first element off a "free list" (a list of connections in TIME_WAIT
state, as I understand it) and checks whether its "tcp_in_free_list"
flag is B_TRUE; only if it is does it walk the whole linked list and
call CONN_DEC_REF() on every connection. The flag is only set on the
first element of the list, so if that first element is re-used (which
can happen under certain circumstances, such as those the customer was
experiencing) before tcp_time_wait_collector() gets a chance to clean up
the list, the flag is not propagated to the next element, and therefore
the list is never cleaned up.
To confirm this, I created a script that counts how often we see the
flag as B_TRUE vs. B_FALSE when tcp_time_wait_collector() is called:
fbt::tcp_time_wait_collector:entry
{
        self->tcw = 1;
}
/*
 * The three squeue_getprivate clauses below are evaluated in program
 * order; whichever predicate matches first clears self->tcw, so at most
 * one of them records anything per call. args[1] is the function's
 * return value: a pointer to the squeue's private slot, which in turn
 * holds the tcp_squeue_priv_t pointer.
 */
fbt::squeue_getprivate:return
/self->tcw == 1 && args[1] != 0 &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list->tcp_in_free_list/
{
        @flags1 = count();
        self->tcw = 0;
}
fbt::squeue_getprivate:return
/self->tcw == 1 && args[1] != 0 &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list/
{
        @flags0 = count();
        self->tcw = 0;
}
fbt::squeue_getprivate:return
/self->tcw == 1/
{
        self->tcw = 0;
}
tick-10s
{
        printf ("----%Y---\n", walltimestamp); /* show progress */
        printa ("tcp_in_free_list == B_TRUE: %@d\n", @flags1);
        printa ("tcp_in_free_list == B_FALSE: %@d\n", @flags0);
        trunc (@flags1);
        trunc (@flags0);
}
The results were as expected: during the high-load situation, only
"B_FALSE" occurrences were reported; once the load was reduced, we got
this:
tcp_in_free_list == B_FALSE: 4
----2005 Oct 19 09:52:12---
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:22---
tcp_in_free_list == B_TRUE: 1
tcp_in_free_list == B_FALSE: 7
----2005 Oct 19 09:52:32---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:42---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6
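(If you want to watch the cache recover directly, you could also sample
its buffer count from a tick probe - a rough sketch, assuming
cache_buftotal is the kmem_cache_t member to look at; check kmem_impl.h
on your build:)

tick-10s
{
        /* cache_buftotal: total buffers in the cache (an assumption) */
        printf ("%Y: ipcl_tcpconn_cache buftotal = %d\n", walltimestamp,
            `ipcl_tcpconn_cache->cache_buftotal);
}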
For those interested: a fix is in the making ;-)
Comments etc. are welcome.
Michael
-- 
Michael Schuster                  (+49 89) 46008-2974 / x62974
visit the online support center:  http://www.sun.com/osc/
Recursion, n.: see 'Recursion'