I''m trying to debug a performance problem, and I''m seeing what appears to be a weird inconsistency between prstat -m output (or, more specifically, prstat -Lmp) and plockstat. (Note: I''m the sysadmin trying to meaure the performance of a T2000. I''m not one of the Java developers who wrote the application I''m running on the server.) When I run prstat -Lmp on the java process with no traffic being sent to the server, I see a number of processes sitting at 100% in LCK. This isn''t necessarily unexpected, as prstat -Lm occasionally shows threads from other processes sitting at 100% in LCK (nscd, inetd, svc.configd, svc.startd.) However, when I run a dtrace script to measure the total amout of time that process is spending waiting on user locks (derived from plockstat -V -A output), that total is around 0.1% of the wallclock time for the sample. For example: Time spent blocked: 27361964 Total time: 49317563716 I''ve attached the script below. Does anyone have a good explanation for why I''m seeing this discrepancy? The most obvious explanation is that the dtrace script isn''t matching the full set of events being accounted for by the system and reported as LCK by prstat -m. Thanks, Chad Mynhier #!/usr/sbin/dtrace -s #pragma D option quiet BEGIN { starttime=timestamp; blocktime = 0; } plockstat$1:::rw-acquire { self->rwhold[arg0] = timestamp; } plockstat$1:::mutex-acquire /arg1 == 0/ { mtxhold[arg0] = timestamp; } plockstat$1:::rw-release /self->rwhold[arg0] && arg1 == 1/ { @rw_w_hold[arg0, ustack(5)] = sum(timestamp - self->rwhold[arg0]); self->rwhold[arg0] = 0; } plockstat$1:::rw-release /self->rwhold[arg0]/ { @rw_r_hold[arg0, ustack(5)] = sum(timestamp - self->rwhold[arg0]); self->rwhold[arg0] = 0; } plockstat$1:::mutex-release /mtxhold[arg0] && arg1 == 0/ { @mtx_hold[arg0, ustack(5)] = sum(timestamp - mtxhold[arg0]); mtxhold[arg0] = 0; } plockstat$1:::rw-block { self->rwblock[arg0] = timestamp; } plockstat$1:::mutex-block { self->mtxblock[arg0] = timestamp; } plockstat$1:::mutex-spin { self->mtxspin[arg0] = timestamp; } plockstat$1:::rw-acquire /self->rwblock[arg0] && arg1 == 1/ { foo = timestamp - self->rwblock[arg0]; @rw_w_block[arg0, ustack(5)] = sum(foo); self->rwblock[arg0] = 0; blocktime += foo; } plockstat$1:::rw-acquire /self->rwblock[arg0]/ { foo = timestamp - self->rwblock[arg0]; @rw_r_block[arg0, ustack(5)] = sum(foo); self->rwblock[arg0] = 0; blocktime += foo; } plockstat$1:::mutex-block /self->mtxspin[arg0]/ { foo = timestamp - self->mtxspin[arg0]; @mtx_vain_spin[arg0, ustack(5)] = sum(foo); self->mtxspin[arg0] = 0; blocktime += foo; } plockstat$1:::mutex-acquire /self->mtxspin[arg0]/ { foo = timestamp - self->mtxspin[arg0]; @mtx_spin[arg0, ustack(5)] = sum(foo); self->mtxspin[arg0] = 0; blocktime += foo; } plockstat$1:::mutex-acquire /self->mtxblock[arg0]/ { foo = timestamp - self->mtxblock[arg0]; @mtx_block[arg0, ustack(5)] = sum(foo); self->mtxblock[arg0] = 0; blocktime += foo; } plockstat$1:::mutex-error /self->mtxblock[arg0]/ { self->mtxblock[arg0] = 0; } plockstat$1:::rw-error /self->rwblock[arg0]/ { self->rwblock[arg0] = 0; } plockstat$1:::mutex-error /self->mtxspin[arg0]/ { self->mtxspin[arg0] = 0; } END { totaltime = timestamp - starttime; printf("Time spent blocked on mutexes: %d\n",blocktime); printf("Total time: %d\n",totaltime); }
Chad Mynhier''s email at 2/10/2006 11:30 AM, said:> I''m trying to debug a performance problem, and I''m seeing what appears > to be a weird inconsistency between prstat -m output (or, more > specifically, prstat -Lmp) and plockstat. > > (Note: I''m the sysadmin trying to meaure the performance of a T2000. > I''m not one of the Java developers who wrote the application I''m > running on the server.) > > When I run prstat -Lmp on the java process with no traffic being sent > to the server, I see a number of processes sitting at 100% in LCK. > This isn''t necessarily unexpected, as prstat -Lm occasionally shows > threads from other processes sitting at 100% in LCK (nscd, inetd, > svc.configd, svc.startd.) > > However, when I run a dtrace script to measure the total amout of time > that process is spending waiting on user locks (derived from plockstat > -V -A output), that total is around 0.1% of the wallclock time for the > sample. For example: > > Time spent blocked: 27361964 > Total time: 49317563716 > > I''ve attached the script below. > > Does anyone have a good explanation for why I''m seeing this > discrepancy? The most obvious explanation is that the dtrace script > isn''t matching the full set of events being accounted for by the > system and reported as LCK by prstat -m. > >Most likely they are sitting in [pthread_]cond_wait(3C). I personally think this should be reflected as SLP time (in fact, I believe I filed a bug a couple of years back), but apparently, the code disagrees: http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/syscall/lwp_sobj.c#1552 You could use pstack(1) to verify this is the case. Using your nscd example, I show thread 24 in 100% LCK: # pstack 549/24 549: /usr/sbin/nscd ----------------- lwp# 24 / thread# 24 -------------------- fedb7fab lwp_park (0, fd65df68, 0) fedb22d9 cond_wait_queue (fee7f618, fee7f638, fd65df68, 0) + 3e fedb2680 cond_wait_common (fee7f618, fee7f638, fd65df68) + 1e9 fedb290b _cond_timedwait (fee7f618, fee7f638, fd65dfa8) + a6 fee56710 umem_update_thread (0) + 171 fedb7d20 _thr_setup (feab5400) + 51 fedb7f70 _lwp_start (feab5400, 0, 0, 0, 0, 0) Thanks, Jarod
On 2/10/06, Jarod Jenson <jarod at aeysis.com> wrote:> > > Chad Mynhier''s email at 2/10/2006 11:30 AM, said: > > I''m trying to debug a performance problem, and I''m seeing what appears > > to be a weird inconsistency between prstat -m output (or, more > > specifically, prstat -Lmp) and plockstat. > > > > Most likely they are sitting in [pthread_]cond_wait(3C).Yes, this appears to be the case here, too. Thanks for the reply (and my apologies for not also replying to the list the first time), Chad Mynhier
> Most likely they are sitting in [pthread_]cond_wait(3C). I personally > think this should be reflected as SLP time (in fact, I believe I filed a > bug a couple of years back).I also think that the current choice of a microstate to report time spent sleeping on userland condition variables leads to confusion. A quick bugtraq search found at least three bugs filed against this: 6308383 Thread state when sleeping on COND_VAR 6312989 condvar time is reported as waiting on user lock instead of sleeping 6183800 sleeping on CVs is "LCK" time, should be "SLP" Looks like Jonathan Adams was looking at this problem recently (when 6183800 was filed), so maybe he can share some results of his investigation with us. - Andrei
On Fri, Feb 10, 2006 at 10:34:56AM -0800, Andrei Dorofeev wrote:> > Most likely they are sitting in [pthread_]cond_wait(3C). I personally > > think this should be reflected as SLP time (in fact, I believe I filed a > > bug a couple of years back). > > I also think that the current choice of a microstate to report time spent > sleeping on userland condition variables leads to confusion. A quick > bugtraq search found at least three bugs filed against this: > > 6308383 Thread state when sleeping on COND_VAR > 6312989 condvar time is reported as waiting on user lock instead of sleeping > 6183800 sleeping on CVs is "LCK" time, should be "SLP" > > Looks like Jonathan Adams was looking at this problem recently > (when 6183800 was filed), so maybe he can share some results > of his investigation with us.The big problem is that we do "wait morphing"; that is: T1 | T2 | pthread_cond_wait(&cv, &mx) | 1) puts itself on a userland | queue, then "parks" in | the kernel | | | | pthread_mutex_lock(&mx) | | ... does work ... | 2) | pthread_cond_signal(&cv); | moves T1 from the CV queue to | the MX queue | | | ... does more work ... | 3) | pthread_mutex_unlock(&mx); | wakes up T1 | | Right now, all of the time from (1) to (3) is recorded as ''LCK'' time. The simplest change is to record all of the time from (1) to (3) as ''SLP'' time. The problem with that is that you would expect the time from (2) to (3) to count as ''LCK'' time, since it is time spent waiting for a mutex. So if the mutex is held for a long time by T2, or the mutex is heavily contended, you get bogus answers. But getting the time from (2) to (3) recorded differently than the time from (1) to (2) is not trivial; T1 gets *no* notification of the change made at (2). That''s as far as I made it in my investigation. I''ve closed 6308383 and 6312989 as dups of 6183800, and put the above information into 6183800. Cheers, - jonathan -- Jonathan Adams, Solaris Kernel Development