Ananth Shrinivas
2006-Jun-06 14:14 UTC
[dtrace-discuss] Persistent "Abort due to systemic unresponsiveness"
Hi DTracers,

I hit the DTrace "systemic unresponsiveness" abort far too frequently on my x86 box running Nevada (b35). I understand that this happens when "one of the deadman timers intervenes and kills your dtrace process, when it decides that dtrace has put too much load on the system".

The problem is that I encounter this even when running (what I believe are) relatively lightweight probes. Probes like 'syscall::write:entry' abort immediately after starting. The same happens with a lot of other probes. The machine I run the script on is a dual-CPU AMD64 box running at ~2000 MHz with 2 GB RAM (Xorg is running, but it's not sooo bad, is it?). I am able to run the same probes successfully, without any problems, on low-end UltraSPARC boxes running Nevada or S10.

Of course, I am able to work around this issue with dtrace -w, and there is absolutely no sign of the smallest performance degradation on the system when I do so. But in that case, why is the deadman timer being so paranoid?

It would be great if someone could shed light on the following queries:

1. First of all, does it *sound* sane that this problem happens even for just read/write/open syscall probes on a relatively "fast" machine?

2. Is the deadman timer that causes DTrace to fail the same as the one in the cyclic subsystem? If not, where in the source should I look to understand more about these timers and their trigger conditions?

Cheers,
Ananth
Bryan Cantrill
2006-Jun-06 16:29 UTC
[dtrace-discuss] Persistent "Abort due to systemic unresponsiveness"
> I happen to hit the DTrace Systemic Unresponsiveness issue rather too
> frequently on my x86 box running Nevada (b35). I understand that this
> happens when "one of the deadman timers intervenes and kills your dtrace
> process, when it decides that dtrace has put too much load on the system".

That's not quite accurate: DTrace has decided that the system is not satisfying some definition of liveness criteria, and errs on the side of caution by killing itself.

> The problem is that, I encounter this even when running (what I believe
> are) relatively lightweight probes. Probes like 'syscall::write:entry'
> abort immediately after starting. Same happens with a lot of other
> probes. The machine I execute the script on is a Dual CPU AMD64 box
> running at ~2000MHz with 2GB RAM (Xorg is running but it's not sooo bad,
> is it?). I am able to run the same probes successfully without any
> problems on low-end UltraSPARC boxes running Nevada or S10.
>
> Of course I am able to work around this issue by dtrace -w, and there is
> absolutely no sign of the smallest performance degradation on the system
> when I do so. But in that case why is the deadman timer being so paranoid?
>
> It would be great if someone could shed light on the following queries:
> 1. First of all, does it *sound* sane that this problem happens even for
> just read/write/open syscall probes on a relatively "fast" machine?

No. Is the load on the machine very high? If you really want to eliminate the induced probe effect from DTrace as the problem, try just running "dtrace -n tick-1sec" -- it should suffer from the same problem...

> 2. Is the deadman timer causing DTrace to fail the same as the one in the
> Cyclic Subsystem? If not, where in the source should I look to
> understand more about these timers and their trigger conditions?

DTrace uses two sets of liveness criteria: a low-level cyclic must be able to fire in the kernel at least once every ten seconds, and the user-level process must check in with the kernel at least once every thirty seconds. Both of these values are tunable (dtrace_deadman_timeout and dtrace_deadman_user, respectively, both in nanoseconds), but unless you're putting very high load on the system, it's hard to imagine that you're going to be able to tune your way out of this problem.

If the machine has no load, my (largely uninformed) guess would be that you're running into some underlying cyclic backend problem -- perhaps power management? Are you running stock B35?

- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
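Editorial sketch: the two liveness criteria described above can be modelled in a few lines of Python. This is only a toy model of the description in this message, not the kernel implementation; the function name `liveness_violation` and its parameters are hypothetical, and only the two default values (10 and 30 seconds, expressed in nanoseconds) come from the thread.

```python
NANOSEC = 1_000_000_000

# Defaults quoted in the thread; both tunables are expressed in nanoseconds.
DEADMAN_TIMEOUT_NS = 10 * NANOSEC  # kernel cyclic must fire within this window
DEADMAN_USER_NS = 30 * NANOSEC     # user-level process must check in within this window


def liveness_violation(now_ns, last_cyclic_ns, last_checkin_ns,
                       timeout_ns=DEADMAN_TIMEOUT_NS,
                       user_ns=DEADMAN_USER_NS):
    """Return the violated criterion in this toy model, or None if both hold.

    Hypothetical helper: it only restates the two rules from the message and
    makes no claim about how the kernel actually tracks these timestamps.
    """
    if now_ns - last_cyclic_ns > timeout_ns:
        return "cyclic"  # the low-level cyclic has not fired recently enough
    if now_ns - last_checkin_ns > user_ns:
        return "user"    # the consumer process has not checked in recently enough
    return None


if __name__ == "__main__":
    # Five seconds after start, with both timestamps at t=0: no violation.
    print(liveness_violation(5 * NANOSEC, 0, 0))
    # Cyclic silent for 11 s while the consumer checked in 1 s ago.
    print(liveness_violation(11 * NANOSEC, 0, 10 * NANOSEC))
```

Under this simplified reading, a consumer whose timestamps both start at t=0 cannot violate either criterion before ten seconds have elapsed, so an abort that arrives sooner suggests the timestamps or the clock are not behaving the way the model assumes.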
Ananth Shrinivas
2006-Jun-07 15:56 UTC
[dtrace-discuss] Persistent "Abort due to systemic unresponsiveness"
Hi Bryan,

I tried out your suggestions and did a little more tinkering. I have added my observations below.

> No. Is the load on the machine very high? If you really want to eliminate
> the induced probe effect from DTrace as the problem, try just running
> "dtrace -n tick-1sec" -- it should suffer from the same problem...

Somehow, all the tick probes seem to work perfectly fine. Even profile:::tick-5000 runs persistently without aborting! That does seem to point towards an induced probe effect.

> DTrace uses two sets of liveness criteria: a low-level cyclic must be
> able to fire in the kernel at least once every ten seconds, and the
> user-level process must check in with the kernel at least once every
> thirty seconds. Both of these values are tunable (dtrace_deadman_timeout
> and dtrace_deadman_user, in nanoseconds, respectively), but unless you're
> putting very high load on the system, it's hard to imagine that you're
> going to be able to tune your way out of this problem.

Thanks for the inside info! Using "mdb -kw" to play around with these variables (doubling, tripling, quadrupling, x10) did not seem to have any visible effect on this issue.

I also have a doubt here. If the default dtrace_deadman_timeout and dtrace_deadman_user values are 10 and 30 seconds respectively, I would expect dtrace to run for at least 10 seconds before giving this error and aborting. But running "time" on dtrace shows it exiting in less than 5 seconds! Is there some flaw in my interpretation of these timers?

> If the machine has no load, my (largely uninformed) guess would be that
> you're running into some underlying cyclic backend problem -- perhaps
> power management? Are you running stock B35?

Yes, it's a stock b35. No BFUs. I have also been able to confirm this problem on build 38 and build 40 (planning to try it on build 41 next). I disabled svc:/system/power:default and made sure powerd was not running, but the problem still persists. (Is there something else I need to do to stop power management functions?)

I also found another workaround: after disabling one of the processors using psradm -f, the problem disappears completely and dtrace runs fine. The moment I run psradm -n and bring the offlined processor back online, the abort happens again. Does that trigger any "Eureka"s?

Cheers,
Ananth
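Editorial sketch: since both tunables are in nanoseconds, the scaled values tried above (doubling, tripling, quadrupling, x10) are large numbers. The short Python helper below, with the hypothetical name `scaled_tunables`, simply does that arithmetic from the 10-second and 30-second defaults quoted in the thread; it does not touch a running kernel.

```python
NANOSEC = 1_000_000_000

# Default values in seconds, as quoted in the thread.
DEFAULTS_SEC = {
    "dtrace_deadman_timeout": 10,
    "dtrace_deadman_user": 30,
}


def scaled_tunables(factor):
    """Return each tunable's default multiplied by `factor`, in nanoseconds."""
    return {name: sec * factor * NANOSEC for name, sec in DEFAULTS_SEC.items()}


if __name__ == "__main__":
    for factor in (1, 2, 3, 4, 10):
        for name, ns in scaled_tunables(factor).items():
            # Show decimal and hex forms, plus whether the value fits in 32 bits.
            fits32 = ns < 2**32
            print(f"x{factor:<2} {name} = {ns} ns ({hex(ns)}), fits in 32 bits: {fits32}")
```

Even the unscaled 10-second default is 10,000,000,000 ns, which does not fit in 32 bits, so any value written to these variables has to be written as a full 64-bit quantity to take effect as intended.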