Michael Schuster - Sun Germany
2005-Oct-20 07:44 UTC
[dtrace-discuss] success thx to DTrace
Hi all,

Here's a real-life customer issue I was able to solve (quicker) thanks to DTrace. Adam and Mike encouraged me to share this with you, so here goes. Please note that the D scripts below are missing the #pragmas etc., to save space.

A customer noticed that a machine was using up all of its memory and had to be rebooted about once a week. The customer is a big webhoster in Germany; the machine is a POP3 server, and the pop3 daemon is their own development. They did some investigation of their own and found that the ipcl_conn_cache was constantly growing (we have a dump where it is 2.5 GB in size).

My first suspicion was that the pop daemon wasn't cleaning up properly after itself, so the first D script we used aggregated kmem_cache_alloc() calls out of ipcl_tcpconn_cache by execname, as well as the corresponding kmem_cache_free() calls:

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
        self->entered = 1;
        @allocs[execname] = count();
}

fbt::kmem_cache_alloc:return
/self->entered == 1/
{
        self->entered = 0;
        inuse[args[1]]++;
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
        inuse[args[1]]--;
        @frees[execname] = count();
}

tick-10s
{
        printa("allocs: %@8u %s\n", @allocs);
        printa("frees: %@8u %s\n", @frees);
        trunc(@allocs);
        trunc(@frees);
}

The interesting result was that most allocations were made by "sched", i.e. by kernel threads (a typical sample):

allocs:       15 pop3d
allocs:      866 sched
frees:         2 picld
frees:         2 sched
allocs:       11 pop3d
allocs:      833 sched
frees:         1 pop3d
frees:         1 sched
frees:         2 snmpd

So I decided to also gather the call stacks when running in a kernel thread, plus a meaningful timestamp etc.:

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache/
{
        self->entered = 1;
        @allocs[execname] = count();
}

fbt::kmem_cache_alloc:entry
/args[0] == `ipcl_tcpconn_cache && execname == "sched"/
{
        self->savestack = 1;
}

fbt::kmem_cache_alloc:return
/self->savestack == 1/
{
        self->savestack = 0;
        @alloc_st[stack()] = count();
}

fbt::kmem_cache_alloc:return
/self->entered == 1/
{
        self->entered = 0;
        inuse[args[1]]++;
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0 && execname == "sched"/
{
        self->save_fr_stack = 1;
}

fbt::kmem_cache_free:return
/self->save_fr_stack == 1/
{
        self->save_fr_stack = 0;
        @frees_st[stack()] = count();
}

fbt::kmem_cache_free:entry
/inuse[args[1]] > 0/
{
        inuse[args[1]]--;
        @frees[execname] = count();
}

tick-10s
{
        printf("----%Y---\n", walltimestamp);    /* show progress */
        printa("kmem_alloc(ipcl_tcpconn_cache) #: %@8u %s\n", @allocs);
        printf("kernel threads alloc(): ");
        printa(@alloc_st);
        printa("kmem_free(ipcl_tcpconn_cache) #: %@8u %s\n", @frees);
        printf("kernel threads free(): ");
        printa(@frees_st);
        trunc(@allocs);
        trunc(@frees);
        trunc(@alloc_st);
        trunc(@frees_st);
}

This typically showed something like the following:

----2005 Oct 14 09:46:11---
kmem_alloc(ipcl_tcpconn_cache) #:        6 pop3d
kmem_alloc(ipcl_tcpconn_cache) #:      268 sched
kernel threads alloc():
[ .. a few stacks deleted ...]

              ip`tcp_timermp_alloc+0x14
              ip`ipcl_tcpconn_constructor+0x14
              genunix`kmem_cache_alloc+0x1cc
              ip`ipcl_conn_create+0x34
              ip`tcp_get_conn+0x64
              ip`tcp_conn_request+0x1e4
              ip`squeue_enter+0x74
              ip`ip_tcp_input+0x9d0
              ip`ip_rput+0x6f4
              unix`putnext+0x218
              ce`ce_drain_fifo+0x52e4
              unix`thread_start+0x4
              108

kmem_free(ipcl_tcpconn_cache) #:       18 sched
kmem_free(ipcl_tcpconn_cache) #:       89 pop3d
kernel threads free():
[ ...]

              genunix`pid_rele+0x68
              genunix`pid_exit+0xac
              genunix`waitid+0x198
              genunix`waitsys32+0x10
              unix`syscall_trap32+0xcc
              3
----2005 Oct 14 09:46:21---
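(An aside: to watch the cache grow live instead of waiting for the next dump, a tiny watcher along the following lines would probably do. I haven't run this one, and the member name cache_buftotal comes from the private kmem_cache_t definition, so treat it as a sketch rather than a recipe:

tick-10s
{
        /* sketch, untested: total number of buffers backing the cache;
           assumes kmem_cache_t and its cache_buftotal member are
           visible via CTF */
        printf("%Y ipcl_tcpconn_cache buf_total: %u\n",
            walltimestamp, `ipcl_tcpconn_cache->cache_buftotal);
}
)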
Luckily, the customer was creative: making use of a load balancer, they reduced the network load on this machine quite a bit, and noticed that all of a sudden ipcl_conn_cache had gone down in size. This is what the script showed for "kernel threads free()" at that point:

              ip`tcp_time_wait_collector+0xc0
              genunix`callout_execute+0x98
              genunix`taskq_thread+0x1a4
              unix`thread_start+0x4
              11180

Dropping "tcp_time_wait_collector" into SunSolve turned up CR 5042924, "too many connections on tcp_free_list is a memory hog", on the first page, and it describes exactly the behaviour the customer was seeing.

Basically, what tcp_time_wait_collector() does is take the first element off a "free list" (a list of connections in TIME_WAIT state, as I understand it), check whether the flag tcp_in_free_list is B_TRUE, and, if so, walk the whole linked list and call CONN_DEC_REF() for every connection. This flag is only set on the first element of the list, so if that first element is re-used before tcp_time_wait_collector gets a chance to clean up the list (which can happen under certain circumstances, such as the load this customer was seeing), the flag is not propagated to the next element, and the list is therefore never cleaned up.

To confirm this, I created this script, which counts how often we see the flag FALSE or TRUE when tcp_time_wait_collector() is called:

fbt::tcp_time_wait_collector:entry
{
        self->tcw = 1;
}

fbt::squeue_getprivate:return
/self->tcw == 1 && args[1] != 0 &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list->tcp_in_free_list/
{
        @flags1 = count();
        self->tcw = 0;
}

fbt::squeue_getprivate:return
/self->tcw == 1 && args[1] != 0 &&
 (*(tcp_squeue_priv_t **)args[1])->tcp_free_list/
{
        @flags0 = count();
        self->tcw = 0;
}

fbt::squeue_getprivate:return
/self->tcw == 1/
{
        self->tcw = 0;
}

tick-10s
{
        printf("----%Y---\n", walltimestamp);    /* show progress */
        printa("tcp_in_free_list == B_TRUE: %@d\n", @flags1);
        printa("tcp_in_free_list == B_FALSE: %@d\n", @flags0);
        trunc(@flags1);
        trunc(@flags0);
}

The results were as expected: during the high-load situation, only "B_FALSE" occurrences were reported; once the load was reduced, we got this:

tcp_in_free_list == B_FALSE: 4
----2005 Oct 19 09:52:12---
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:22---
tcp_in_free_list == B_TRUE: 1
tcp_in_free_list == B_FALSE: 7
----2005 Oct 19 09:52:32---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6
----2005 Oct 19 09:52:42---
tcp_in_free_list == B_TRUE: 2
tcp_in_free_list == B_FALSE: 6

For those interested: a fix is in the making ;-)

Comments etc. are welcome.

Michael
--
Michael Schuster                 (+49 89) 46008-2974 / x62974
visit the online support center: http://www.sun.com/osc/
Recursion, n.: see 'Recursion'
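PS: once the fix is installed, one way of checking that the collector is really draining the free list again could be to count, per interval, how many ipcl_tcpconn_cache buffers get freed with tcp_time_wait_collector() on the stack (the free stacks above show that this is where the frees happen). An untested sketch along the lines of the scripts above:

fbt::tcp_time_wait_collector:entry
{
        self->twc = 1;
}

fbt::tcp_time_wait_collector:return
{
        self->twc = 0;
}

fbt::kmem_cache_free:entry
/self->twc && args[0] == `ipcl_tcpconn_cache/
{
        /* a connection handed back to the cache while the collector runs */
        @collected = count();
}

tick-10s
{
        printf("----%Y---\n", walltimestamp);
        printa("conns freed under tcp_time_wait_collector(): %@d\n", @collected);
        trunc(@collected);
}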