Rafael Vanoni
2008-Aug-26 18:24 UTC
[dtrace-discuss] suspend/resume and libdtrace consumers
Hi all,

There were recently a couple of emails on tesla-dev pointing out that PowerTOP(1) and intrstat(1M) show zeroed values after resuming. Both are libdtrace(3LIB) consumers.

I haven't found a bug filed against this issue, but unfortunately I'm having a hard time finding a system that supports suspend/resume, and I wouldn't like to file one without having some evidence to back it up.

Does this sound like a libdtrace(3LIB) issue to you, or should these consumers be a bit smarter?

thanks
Rafael
Chad Mynhier
2008-Sep-17 16:28 UTC
[dtrace-discuss] suspend/resume and libdtrace consumers
On Tue, Aug 26, 2008 at 2:24 PM, Rafael Vanoni <Rafael.Vanoni at sun.com> wrote:
> Hi all
>
> There was recently a couple of emails at tesla-dev pointing out that
> PowerTOP(1) and intrstat(1M) show 0'ed values after resuming - both are
> libdtrace(3LIB) consumers.
>
> I haven't found a bug against this issue, but unfortunately I'm having a
> bit of a hard time finding a system that supports suspend/resume and
> wouldn't like to file one w/o having some content to back it up.
>
> Does this sound like a libdtrace(3LIB) issue to you or should these
> consumers be a bit smarter ?

Rafael,

I have a system that supports suspend/resume, so I took a look at this.

Stripping this down to just the D script that's being run, I see this after a suspend/resume:

# dtrace -s ./events_k.d
dtrace: script './events_k.d' matched 4 probes
dtrace: processing aborted: Abort due to systemic unresponsiveness
#

So what's likely happening is that the underlying DTrace invocation is dying, but the process is unaware of this. Looking at the code for PowerTOP, it appears that it's calling dtrace_status() but ignoring the return value. (Note that you can achieve the same effect by stopping the powertop process for about a minute.)

I've posted a webrev of a fix here: http://cr.opensolaris.org/~cmynhier/powertop/.

It looks like intrstat does something similar. I don't have a code fix for this yet, but I've filed a bug.

Chad
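
The failure mode described in this message can be sketched in C. This is an illustration only, not the PowerTOP source or the posted webrev. The names `fake_session`, `fake_status`, and `run_display_loop` are hypothetical, and `fake_status` merely imitates the convention that dtrace_status() returns -1 once tracing has been aborted.

```c
#include <stdio.h>

/* Hypothetical stand-in for a libdtrace handle: tracing "dies" after a
 * fixed number of status polls, as it might across a suspend/resume. */
struct fake_session {
    int polls_until_abort;
};

/* Hypothetical stand-in for dtrace_status(): returns -1 once tracing has
 * been aborted and a non-negative value while it is still healthy. */
int fake_status(struct fake_session *s)
{
    if (s->polls_until_abort <= 0)
        return -1;
    s->polls_until_abort--;
    return 1;
}

/* Display loop that checks the status before every refresh. Returns the
 * number of refreshes drawn before tracing was found to be dead. */
int run_display_loop(struct fake_session *s, int max_refreshes)
{
    int drawn = 0;

    while (drawn < max_refreshes) {
        if (fake_status(s) == -1) {
            /* Stop here instead of drawing screens full of zeroes. */
            fprintf(stderr, "tracing aborted after %d refreshes\n", drawn);
            break;
        }
        /* ... collect data and redraw the screen here ... */
        drawn++;
    }
    return drawn;
}
```

A consumer that drops the `fake_status` check would keep refreshing after the abort, which matches the zeroed output reported at the start of the thread.
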
Rafael Vanoni
2008-Sep-17 17:06 UTC
[dtrace-discuss] suspend/resume and libdtrace consumers
Chad Mynhier wrote:
> On Tue, Aug 26, 2008 at 2:24 PM, Rafael Vanoni <Rafael.Vanoni at sun.com> wrote:
>> Hi all
>>
>> There was recently a couple of emails at tesla-dev pointing out that
>> PowerTOP(1) and intrstat(1M) show 0'ed values after resuming - both are
>> libdtrace(3LIB) consumers.
>>
>> I haven't found a bug against this issue, but unfortunately I'm having a
>> bit of a hard time finding a system that supports suspend/resume and
>> wouldn't like to file one w/o having some content to back it up.
>>
>> Does this sound like a libdtrace(3LIB) issue to you or should these
>> consumers be a bit smarter ?
>
> Rafael,
>
> I have a system that supports suspend/resume, so I took a look at this.
>
> Stripping this down to just the D script that's being run, I see this
> after a suspend/resume:
>
> # dtrace -s ./events_k.d
> dtrace: script './events_k.d' matched 4 probes
> dtrace: processing aborted: Abort due to systemic unresponsiveness
> #
>
> So what's likely happening is that the underlying DTrace invocation is
> dying, but that the process is unaware of this. And looking at the
> code for PowerTOP, it appears that it's calling dtrace_status() but
> ignoring the return value. (Note that you can achieve the same effect
> by stopping the powertop process for about a minute
>
> I've posted a webrev of a fix here:
> http://cr.opensolaris.org/~cmynhier/powertop/.
>
> It looks like intrstat() does something similar. I don't have a code
> fix for this yet, but I've filed a bug.
>
> Chad

Thanks for looking into this Chad, I'm moving this thread to tesla-dev to discuss the patch.

Rafael
Chad Mynhier
2008-Sep-19 20:28 UTC
[dtrace-discuss] [tesla-dev] PowerTOP and suspend/resume
For dtrace-discuss, the problem mentioned here is that a DTrace process straddling a suspend/resume will get killed because of the deadman timer. This affects powertop and intrstat (at the least) because they ignore the return value of dtrace_status() and proceed to show zeroed values for everything.

On Thu, Sep 18, 2008 at 9:00 PM, Aubrey Li <aubreylee at gmail.com> wrote:
> On Fri, Sep 19, 2008 at 12:20 AM, Chad Mynhier <cmynhier at gmail.com> wrote:
>> On Wed, Sep 17, 2008 at 9:43 PM, Aubrey Li <aubreylee at gmail.com> wrote:
>>>
>>> I didn't dig into the dtrace problem, just wonder is this expected?
>>> Or Is the patch just a workaround temporarily and dtrace problem
>>> will be fixed eventually?
>>
>> This is actually tickling a safety feature of dtrace, the deadman
>> timer. There's more information here:
>> http://blogs.sun.com/jonh/entry/the_dtrace_deadman_mechanism, but it's
>> basically a mechanism to prevent dtrace from rendering the system
>> unresponsive. It's possible that the mechanism could be modified to
>> handle cases like this, but I don't know that it would be a high
>> priority to fix it.
>>
>> I wouldn't say that the patch is just a workaround, though. The basic
>> problem is that it's ignoring the return value of dtrace_status(), and
>> it really shouldn't be doing that, anyway.
>>
> So, all the applications which use libdtrace need this fix for suspend/resume,
> this includes intrstat/lockstat/plockstat and dtrace itself. No object from me
> to commit this patch, but I still think this issue should be fixed in dtrace,
> otherwise all the dtrace applications have to use this trick.

I'd agree that this is a bug in DTrace, that it really should be able to handle all cases. But I'd also argue that DTrace was designed to handle issues like this, because dtrace_status() has a meaningful return value.

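As a rough illustration of why a deadman timer of the kind mentioned in the quoted text can trip across a suspend/resume, here is a generic C sketch of a heartbeat-and-timeout check. It is not DTrace's implementation. The struct, the function names, and the 10-second timeout are all hypothetical, and the sketch assumes that the clock appears to jump forward by the length of the suspend.

```c
#include <stdint.h>

/* Hypothetical timeout for this sketch; not DTrace's actual tunable. */
#define DEADMAN_TIMEOUT_NS (10LL * 1000000000LL)

struct deadman {
    int64_t last_alive_ns;  /* time of the most recent heartbeat */
    int     killed;         /* set once the timeout has been exceeded */
};

/* Periodic heartbeat: records that the system is still responsive.
 * Once the session has been killed, heartbeats no longer revive it. */
void deadman_heartbeat(struct deadman *d, int64_t now_ns)
{
    if (!d->killed)
        d->last_alive_ns = now_ns;
}

/* Check made on the work path: if the heartbeat is too old, assume the
 * system is unresponsive and kill the session. Returns 1 if killed. */
int deadman_check(struct deadman *d, int64_t now_ns)
{
    if (!d->killed && now_ns - d->last_alive_ns > DEADMAN_TIMEOUT_NS)
        d->killed = 1;
    return d->killed;
}
```

Under these assumptions, an hour spent suspended looks to `deadman_check` exactly like an hour of unresponsiveness, so the session is killed on the first check after resume even though nothing was actually wrong.
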
It seems to me that the failure of intrstat and powertop (I haven't looked at lockstat/plockstat yet) to check the return value of dtrace_status() is a bigger bug, though. That return value may be indicating some problem other than the suspend/resume problem, and that might be a problem that isn't a bug in DTrace. If we fix the suspend/resume deadman timer problem, we've only fixed one of the possible problems, and these utilities might have a similar failure mode for the other problems. If we fix those utilities, those failure modes go away.

Chad
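
The point that the return value covers more than the suspend/resume case can be sketched as a single decision function in C. This is a hedged illustration, not code from any of the utilities discussed. The `ST_*` constants are local mirrors of the `DTRACE_STATUS_*` names in `<dtrace.h>` with assumed numeric values, and `decide_next_step` and `next_step` are hypothetical names.

```c
/* Local mirrors of the DTRACE_STATUS_* names from <dtrace.h>. The numeric
 * values are assumptions for this sketch; real code should include the
 * header rather than redefine them. */
enum {
    ST_NONE    = 0,  /* not yet time for a status check */
    ST_OKAY    = 1,  /* tracing is healthy */
    ST_EXITED  = 2,  /* exit() action ran; tracing stopped */
    ST_FILLED  = 3,  /* fill buffer filled; tracing stopped */
    ST_STOPPED = 4   /* tracing was already stopped */
};

enum next_step { KEEP_SAMPLING, FINISH_UP, REPORT_ERROR };

/* Map a status return value to what the consumer should do next. */
enum next_step decide_next_step(int status)
{
    switch (status) {
    case ST_NONE:
    case ST_OKAY:
        return KEEP_SAMPLING;
    case ST_EXITED:
    case ST_FILLED:
    case ST_STOPPED:
        return FINISH_UP;
    default:
        /* -1 (any error, a deadman abort included) or an unknown value. */
        return REPORT_ERROR;
    }
}
```

Because the `default` branch catches every failure rather than one specific cause, a consumer written this way would stop displaying stale data for any abort, whether or not the deadman behavior is ever changed.
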