thr3ads.net - dtrace discuss - [dtrace-discuss] Persistent "Abort due to systemic unresponsiveness" [Jun 2006]

Ananth Shrinivas

2006-Jun-06 14:14 UTC

[dtrace-discuss] Persistent "Abort due to systemic unresponsiveness"

Hi DTracers,

I happen to hit the DTrace Systemic Unresponsiveness issue rather too 
frequently on my x86 box running Nevada (b35). I understand that this 
happens when "one of the deadman timers intervenes and kills your dtrace 
process, when it decides that dtrace has put too much load on the system".

The problem is that I encounter this even when running (what I believe 
are) relatively lightweight probes. Probes like 'syscall::write:entry' 
abort immediately after starting. The same happens with a lot of other 
probes. The machine I execute the script on is a dual-CPU AMD64 box 
running at ~2000 MHz with 2 GB RAM (Xorg is running, but it's not that 
bad, is it?). I am able to run the same probes successfully without any 
problems on low-end UltraSPARC boxes running Nevada or s10.

Of course, I am able to work around this issue with dtrace -w, and there 
is absolutely no sign of the smallest performance degradation on the system 
when I do so. But in that case, why is the deadman timer being so paranoid?

It would be great if someone could shed light on the following questions:
1. First of all, does it *sound* sane that this problem happens even for 
just read/write/open syscall probes on a relatively "fast" machine?
2. Is the deadman timer causing DTrace to fail the same as the one in the 
cyclic subsystem? If not, where in the source should I look to 
understand more about these timers and their trigger conditions?

Cheers,
Ananth

Bryan Cantrill

2006-Jun-06 16:29 UTC

> I happen to hit the DTrace Systemic Unresponsiveness issue rather too 
> frequently on my x86 box running Nevada (b35). I understand that this 
> happens when "one of the deadman timers intervenes and kills your dtrace
> process, when it decides that dtrace has put too much load on the system".

That's not quite accurate:  DTrace has decided that the system is not
satisfying some definition of liveness criteria, and errs on the side of
caution by killing itself.

> The problem is that I encounter this even when running (what I believe 
> are) relatively lightweight probes. Probes like 'syscall::write:entry'
> abort immediately after starting. The same happens with a lot of other 
> probes. The machine I execute the script on is a dual-CPU AMD64 box 
> running at ~2000 MHz with 2 GB RAM (Xorg is running, but it's not that 
> bad, is it?). I am able to run the same probes successfully without any 
> problems on low-end UltraSPARC boxes running Nevada or s10.
> 
> Of course, I am able to work around this issue with dtrace -w, and there 
> is absolutely no sign of the smallest performance degradation on the system 
> when I do so. But in that case, why is the deadman timer being so paranoid?
> 
> It would be great if someone could shed light on the following questions:
> 1. First of all, does it *sound* sane that this problem happens even for 
> just read/write/open syscall probes on a relatively "fast" machine?

No.  Is the load on the machine very high?  If you really want to eliminate
the induced probe effect from DTrace as the problem, try just running
"dtrace -n tick-1sec" -- it should suffer from the same problem...

> 2. Is the deadman timer causing DTrace to fail the same as the one in the 
> cyclic subsystem? If not, where in the source should I look to 
> understand more about these timers and their trigger conditions?

DTrace uses two sets of liveness criteria:  a low-level cyclic must be
able to fire in the kernel at least once every ten seconds, and the
user-level process must check in with the kernel at least once every
thirty seconds.  Both of these values are tunable (dtrace_deadman_timeout
and dtrace_deadman_user, in nanoseconds, respectively), but unless you're
putting very high load on the system, it's hard to imagine that you're
going to be able to tune your way out of this problem.
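
To make the two criteria concrete, here is a minimal Python sketch that models them as plain elapsed-time checks. It is only an illustration, not the kernel's implementation: the tunable names and the ten- and thirty-second values come from the message above, while the helper function and its parameters are hypothetical.

```python
NANOSEC = 1_000_000_000  # nanoseconds per second

# Defaults as described in the message above, expressed in nanoseconds.
dtrace_deadman_timeout = 10 * NANOSEC  # kernel cyclic must fire at least this often
dtrace_deadman_user = 30 * NANOSEC     # user-level process must check in at least this often


def liveness_violations(now_ns, last_cyclic_ns, last_user_checkin_ns):
    """Hypothetical helper: return the names of the criteria that are violated."""
    violations = []
    # Criterion 1: the low-level cyclic has not fired recently enough.
    if now_ns - last_cyclic_ns > dtrace_deadman_timeout:
        violations.append("kernel cyclic")
    # Criterion 2: the user-level process has not checked in recently enough.
    if now_ns - last_user_checkin_ns > dtrace_deadman_user:
        violations.append("user-level check-in")
    return violations


if __name__ == "__main__":
    # Cyclic last fired 11 seconds ago, user process checked in 12 seconds ago.
    print(liveness_violations(12 * NANOSEC, 1 * NANOSEC, 0))
    # Cyclic fired 1 second ago, user process last checked in 35 seconds ago.
    print(liveness_violations(35 * NANOSEC, 34 * NANOSEC, 0))
```

Under this simplified model, an abort within the first few seconds of a run would not be explained by either timeout on its own.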

If the machine has no load, my (largely uninformed) guess would be that
you're running into some underlying cyclic backend problem -- perhaps
power management?  Are you running stock B35?

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc

Ananth Shrinivas

2006-Jun-07 15:56 UTC

Hi Bryan,

I tried out your suggestions and did a little more tinkering. I have 
added my observations below.

> No.  Is the load on the machine very high?  If you really want to eliminate
> the induced probe effect from DTrace as the problem, try just running
> "dtrace -n tick-1sec" -- it should suffer from the same problem...

Somehow, all the tick providers seem to work perfectly fine. Even 
profile:::tick-5000 runs persistently without aborting!
That does seem to point the problem towards an induced probe effect.

> DTrace uses two sets of liveness criteria:  a low-level cyclic must be
> able to fire in the kernel at least once every ten seconds, and the
> user-level process must check in with the kernel at least once every
> thirty seconds.  Both of these values are tunable (dtrace_deadman_timeout
> and dtrace_deadman_user, in nanoseconds, respectively), but unless you're
> putting very high load on the system, it's hard to imagine that you're
> going to be able to tune your way out of this problem.

Thanks for the inside info! Using "mdb -kw" and playing around with these 
variables (doubling, tripling, quadrupling, x10) did not seem to have any 
visible effect on this issue.
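
Since the tunables are expressed in nanoseconds, the scaled values get large quickly. As a small sketch, assuming the ten- and thirty-second defaults mentioned in the thread, the following Python snippet computes the values for the multipliers tried above. The helper name is hypothetical, and the snippet only does the arithmetic; it does not touch the kernel.

```python
NANOSEC = 1_000_000_000  # nanoseconds per second

# Default values in seconds, as described in the thread.
defaults_sec = {"dtrace_deadman_timeout": 10, "dtrace_deadman_user": 30}


def scaled_ns(seconds, factor):
    """Hypothetical helper: scale a value in seconds and convert it to nanoseconds."""
    return seconds * factor * NANOSEC


if __name__ == "__main__":
    for name, sec in defaults_sec.items():
        for factor in (2, 3, 4, 10):
            ns = scaled_ns(sec, factor)
            # Show decimal and hexadecimal forms of each scaled value.
            print(f"{name} x{factor}: {ns} ns (0x{ns:x})")
```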

And I have a doubt here. If the default deadman_timeout and deadman_user 
values are 10 and 30 seconds respectively, I would expect dtrace to run at 
least 10 seconds before giving this error and aborting. But running "time" 
on dtrace shows it exiting in under 5 seconds! Is there some flaw in my 
interpretation of these timers?

> If the machine has no load, my (largely uninformed) guess would be that
> you're running into some underlying cyclic backend problem -- perhaps
> power management?  Are you running stock B35?

Yes, it's a stock b35. No BFUs. I have also been able to confirm this 
problem on build 38 and build 40 (planning to try it on build 41 next).
I disabled svc:/system/power:default and made sure powerd was not running, 
but the problem still persists. (Is there something else I need to do to 
stop power management functions?)

And I found another workaround: by disabling one of the processors 
using psradm -f, the problem disappears completely and dtrace starts 
running fine. The moment I do a psradm -n and bring the offlined 
processor back online, the abort happens again. Does that trigger any 
"Eureka"s?

Cheers,
Ananth
