Lytvyn, Oleksandr (IT)
2008-Apr-18  20:20 UTC
[dtrace-discuss] whatfor.d -- where''s null pointer?
Hi! Anyone seen this? This one buffles me: I run whatfor.d from /usr/demo/dtrace, and here''s what I get: dtrace: script ''/usr/demo/dtrace/whatfor.d'' matched 12 probes dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu): invalid address (0x0) in action #1 at DIF offset 56 dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu): invalid address (0x0) in action #1 at DIF offset 56 dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu): invalid address (0x0) in action #1 at DIF offset 56 dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu): invalid address (0x0) in action #1 at DIF offset 56 ... Multitudes of those. Apparently, action #1 of probe ID 1 is: self->sobj = curlwpsinfo->pr_stype; So, which address is invalid here? The curlwpsinfo is used in predicate, so it cannot be 0x0, because it''d complain about the predicate too. And pr_stype is supposed to be char. What''s wrong here? Found another error report like that on the web, BTW (in German, accidentially). But no responses there, unfortunately. Thanks, Oleksandr Lytvyn Morgan Stanley | Technology 210 Carnegie Center, 4th Floor | Princeton, NJ 08540 Phone: +1 609 936-4026 Mobile: +1 732 773-4145 Oleksandr.Lytvyn at MorganStanley.com -------------------------------------------------------- NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20080418/86bce54b/attachment.html>
Hi Oleksandr,
This turned out to be a rather interesting problem. To investigate, I used
this script:
---8<---
off-cpu
{
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
	this->tmp = curlwpsinfo->pr_stype;
}
ERROR
{
	@[arg2] = count();
}
---8<---
Which resulted in a table like this:
                9                1
               10                1
                5                2
                7                2
                1                3
                3                3
                2                5
                6                5
                4               10
So curlwpsinfo->pr_stype can work and later fail. Looking at the translator
for that field we see that it looks like this:
        pr_stype = T->t_sobj_ops ? T->t_sobj_ops->sobj_type : 0;
This compiles to this DIF code:
OFF OPCODE      INSTRUCTION
00: 29010001    ldgs DT_VAR(256), %r1           ! DT_VAR(256) =
"curthread"
01: 25000002    setx DT_INTEGER[0], %r2         ! 0x0
02: 04010201    sll  %r1, %r2, %r1
03: 05010201    srl  %r1, %r2, %r1
04: 0e010002    mov  %r1, %r2
05: 25000103    setx DT_INTEGER[1], %r3         ! 0x88
06: 07020302    add  %r2, %r3, %r2
07: 22020002    ldx  [%r2], %r2
08: 10020000    tst  %r2
09: 12000011    be   17
10: 0e010002    mov  %r1, %r2
11: 25000103    setx DT_INTEGER[1], %r3         ! 0x88
12: 07020302    add  %r2, %r3, %r2
13: 22020002    ldx  [%r2], %r2
14: 1e020002    ldsw [%r2], %r2
15: 0e020002    mov  %r2, %r2
16: 11000012    ba   18
17: 25000002    setx DT_INTEGER[0], %r2         ! 0x0
18: 25000203    setx DT_INTEGER[2], %r3         ! 0x38
19: 04020302    sll  %r2, %r3, %r2
20: 2e020302    sra  %r2, %r3, %r2
21: 23000002    ret  %r2
We can see that we load the t_sobj_ops member once at offset 07 and then again
at offset 17 (right before we load sobj_type at offset 18). The t_sobj_ops
member can be set to NULL asynchronously from other threads so this double
load introduces a window for the failure that you''re seeing.
Either we need to use some temporary, probe-local variable (one that
can''t
conflict with a user-defined variable), or we need to perform some element of
optimization to the generated DIF.
I''ve filed this bug:
  6691541 curlwpsinfo->pr_stype races
Adam
On Fri, Apr 18, 2008 at 04:20:31PM -0400, Lytvyn, Oleksandr (IT)
wrote:> Hi!
>  
> Anyone seen this? This one buffles me: I run whatfor.d from
> /usr/demo/dtrace, and here''s what I get:
>  
> dtrace: script ''/usr/demo/dtrace/whatfor.d'' matched 12
probes
> dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu):
> invalid address (0x0) in action #1 at DIF offset 56
> dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu):
> invalid address (0x0) in action #1 at DIF offset 56
> dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu):
> invalid address (0x0) in action #1 at DIF offset 56
> dtrace: error on enabled probe ID 1 (ID 681: sched:unix:resume:off-cpu):
> invalid address (0x0) in action #1 at DIF offset 56
> ...
>  
> Multitudes of those. Apparently, action #1 of probe ID 1 is:
>  
> self->sobj = curlwpsinfo->pr_stype;
>  
> So, which address is invalid here? The curlwpsinfo is used in predicate,
> so it cannot be 0x0, because it''d complain about the predicate
too. And
> pr_stype is supposed to be char. 
>  
> What''s wrong here?
>  
> Found another error report like that on the web, BTW (in German,
> accidentially). But no responses there, unfortunately.
>  
> Thanks,
>  
> Oleksandr Lytvyn
> Morgan Stanley | Technology
> 210 Carnegie Center, 4th Floor | Princeton, NJ  08540
> Phone: +1 609 936-4026
> Mobile: +1 732 773-4145
> Oleksandr.Lytvyn at MorganStanley.com
> --------------------------------------------------------
> 
> NOTICE: If received in error, please destroy and notify sender. Sender does
not intend to waive confidentiality or privilege. Use of this email is
prohibited when received in error.
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org
-- 
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
On Apr 21, 2008, at 11:00 AM, Adam Leventhal wrote:> > > Either we need to use some temporary, probe-local variable (one that > can''t > conflict with a user-defined variable), or we need to perform some > element of > optimization to the generated DIF. > > I''ve filed this bug: > > 6691541 curlwpsinfo->pr_stype races > > AdamWow, that is a cool bug. If you choose to do some element of optimization, is it possible to guarantee every case will be correctly handled? I worry that it might be difficult to reason about the thread safety of code without explicit guarantees about how this is handled. James M
Nicolas Williams
2008-Apr-21  18:54 UTC
[dtrace-discuss] whatfor.d -- where''s null pointer?
On Mon, Apr 21, 2008 at 11:00:29AM -0700, Adam Leventhal wrote:> So curlwpsinfo->pr_stype can work and later fail. Looking at the translator > for that field we see that it looks like this: > > pr_stype = T->t_sobj_ops ? T->t_sobj_ops->sobj_type : 0; > > This compiles to this DIF code: > > [...] > We can see that we load the t_sobj_ops member once at offset 07 and then again > at offset 17 (right before we load sobj_type at offset 18). The t_sobj_ops > member can be set to NULL asynchronously from other threads so this double > load introduces a window for the failure that you''re seeing. > > Either we need to use some temporary, probe-local variable (one that can''t > conflict with a user-defined variable), or we need to perform some element of > optimization to the generated DIF.In the world of LISP macros the macro writer would gensym a local variable (probe-local in this case) to deal with this sort of issue. Perhaps the DTrace translator facility needs a probe-local gensym feature. Nico --
On Mon, Apr 21, 2008 at 11:45:25AM -0700, James McIlree wrote:>> Either we need to use some temporary, probe-local variable (one that can''t >> conflict with a user-defined variable), or we need to perform some element >> of optimization to the generated DIF. >> >> I''ve filed this bug: >> >> 6691541 curlwpsinfo->pr_stype races > > If you choose to do some element of optimization, is it possible to guarantee > every case will be correctly handled? I worry that it might be difficult to > reason about the thread safety of code without explicit guarantees about how > this is handled.The optimization would just be to perform the load once -- this is what most C compilers would do absent the volatile keyword. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
On Mon, Apr 21, 2008 at 01:54:16PM -0500, Nicolas Williams wrote:> In the world of LISP macros the macro writer would gensym a local > variable (probe-local in this case) to deal with this sort of issue. > > Perhaps the DTrace translator facility needs a probe-local gensym > feature.Indeed, many languages have a construct to let you create a new scope; adding one to D might be a good idea. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
On Apr 21, 2008, at 1:04 PM, Adam Leventhal wrote:> On Mon, Apr 21, 2008 at 11:45:25AM -0700, James McIlree wrote: >>> Either we need to use some temporary, probe-local variable (one >>> that can''t >>> conflict with a user-defined variable), or we need to perform some >>> element >>> of optimization to the generated DIF. >>> >>> I''ve filed this bug: >>> >>> 6691541 curlwpsinfo->pr_stype races >> >> If you choose to do some element of optimization, is it possible to >> guarantee >> every case will be correctly handled? I worry that it might be >> difficult to >> reason about the thread safety of code without explicit guarantees >> about how >> this is handled. > > The optimization would just be to perform the load once -- this is > what most C > compilers would do absent the volatile keyword.And I absolutely would not trust the C compiler in this case either :-). It''s not required by spec to load once and only once, and there are times when it doesn''t. James M
Lytvyn, Oleksandr (IT)
2008-Apr-22  13:37 UTC
[dtrace-discuss] whatfor.d -- where''s null pointer?
Adam, That''s an interesting analysis. Hope we can have a fix soon. It also makes me wonder: how many races of that kind are there? And is it possible to have a comprehensive fix or approach to handle stuff like this? Thanks, Oleksandr Lytvyn Morgan Stanley | Technology 210 Carnegie Center, 4th Floor | Princeton, NJ 08540 Phone: +1 609 936-4026 Mobile: +1 732 773-4145 Oleksandr.Lytvyn at MorganStanley.com> -----Original Message----- > From: Adam Leventhal [mailto:ahl at eng.sun.com] > Sent: Monday, April 21, 2008 2:00 PM > To: Lytvyn, Oleksandr (IT) > Cc: dtrace-discuss at opensolaris.org > Subject: Re: [dtrace-discuss] whatfor.d -- where''s null pointer? > > Hi Oleksandr, > > This turned out to be a rather interesting problem. To > investigate, I used > this script: > > ---8<--- > off-cpu > { > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > this->tmp = curlwpsinfo->pr_stype; > } > > ERROR > { > @[arg2] = count(); > } > ---8<--- > > Which resulted in a table like this: > > 9 1 > 10 1 > 5 2 > 7 2 > 1 3 > 3 3 > 2 5 > 6 5 > 4 10 > > So curlwpsinfo->pr_stype can work and later fail. Looking at > the translator > for that field we see that it looks like this: > > pr_stype = T->t_sobj_ops ? T->t_sobj_ops->sobj_type : 0; > > This compiles to this DIF code: > > OFF OPCODE INSTRUCTION > 00: 29010001 ldgs DT_VAR(256), %r1 ! DT_VAR(256) > = "curthread" > 01: 25000002 setx DT_INTEGER[0], %r2 ! 0x0 > 02: 04010201 sll %r1, %r2, %r1 > 03: 05010201 srl %r1, %r2, %r1 > 04: 0e010002 mov %r1, %r2 > 05: 25000103 setx DT_INTEGER[1], %r3 ! 0x88 > 06: 07020302 add %r2, %r3, %r2 > 07: 22020002 ldx [%r2], %r2 > 08: 10020000 tst %r2 > 09: 12000011 be 17 > 10: 0e010002 mov %r1, %r2 > 11: 25000103 setx DT_INTEGER[1], %r3 ! 0x88 > 12: 07020302 add %r2, %r3, %r2 > 13: 22020002 ldx [%r2], %r2 > 14: 1e020002 ldsw [%r2], %r2 > 15: 0e020002 mov %r2, %r2 > 16: 11000012 ba 18 > 17: 25000002 setx DT_INTEGER[0], %r2 ! 0x0 > 18: 25000203 setx DT_INTEGER[2], %r3 ! 0x38 > 19: 04020302 sll %r2, %r3, %r2 > 20: 2e020302 sra %r2, %r3, %r2 > 21: 23000002 ret %r2 > > We can see that we load the t_sobj_ops member once at offset > 07 and then again > at offset 17 (right before we load sobj_type at offset 18). > The t_sobj_ops > member can be set to NULL asynchronously from other threads > so this double > load introduces a window for the failure that you''re seeing. > > Either we need to use some temporary, probe-local variable > (one that can''t > conflict with a user-defined variable), or we need to perform > some element of > optimization to the generated DIF. > > I''ve filed this bug: > > 6691541 curlwpsinfo->pr_stype races > > Adam > > On Fri, Apr 18, 2008 at 04:20:31PM -0400, Lytvyn, Oleksandr > (IT) wrote: > > Hi! > > > > Anyone seen this? This one buffles me: I run whatfor.d from > > /usr/demo/dtrace, and here''s what I get: > > > > dtrace: script ''/usr/demo/dtrace/whatfor.d'' matched 12 probes > > dtrace: error on enabled probe ID 1 (ID 681: > sched:unix:resume:off-cpu): > > invalid address (0x0) in action #1 at DIF offset 56 > > dtrace: error on enabled probe ID 1 (ID 681: > sched:unix:resume:off-cpu): > > invalid address (0x0) in action #1 at DIF offset 56 > > dtrace: error on enabled probe ID 1 (ID 681: > sched:unix:resume:off-cpu): > > invalid address (0x0) in action #1 at DIF offset 56 > > dtrace: error on enabled probe ID 1 (ID 681: > sched:unix:resume:off-cpu): > > invalid address (0x0) in action #1 at DIF offset 56 > > ... > > > > Multitudes of those. Apparently, action #1 of probe ID 1 is: > > > > self->sobj = curlwpsinfo->pr_stype; > > > > So, which address is invalid here? The curlwpsinfo is used > in predicate, > > so it cannot be 0x0, because it''d complain about the > predicate too. And > > pr_stype is supposed to be char. > > > > What''s wrong here? > > > > Found another error report like that on the web, BTW (in German, > > accidentially). But no responses there, unfortunately. > > > > Thanks, > > > > Oleksandr Lytvyn > > Morgan Stanley | Technology > > 210 Carnegie Center, 4th Floor | Princeton, NJ 08540 > > Phone: +1 609 936-4026 > > Mobile: +1 732 773-4145 > > Oleksandr.Lytvyn at MorganStanley.com > > -------------------------------------------------------- > > > > NOTICE: If received in error, please destroy and notify > sender. Sender does not intend to waive confidentiality or > privilege. Use of this email is prohibited when received in error. > > > _______________________________________________ > > dtrace-discuss mailing list > > dtrace-discuss at opensolaris.org > > > -- > Adam Leventhal, Fishworks > http://blogs.sun.com/ahl >-------------------------------------------------------- NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.
On Tue, Apr 22, 2008 at 09:37:57AM -0400, Lytvyn, Oleksandr (IT) wrote:> That''s an interesting analysis. Hope we can have a fix soon.I forgot to mention that you can work around the issue by replacing curlwpsinfo->pr_stype with: (((this->tmp = curthread->t_sobj_ops) != NULL) ? this->tmp->sobj_type : 0)> It also makes me wonder: how many races of that kind are there? And is > it possible to have a comprehensive fix or approach to handle stuff like > this?There may be other problems like this, but the same fix can apply to all of them. If we go with a compiler change that will cause the DIF to only load the member once in a given probe firing, then that will clean up similarly affected situations. If we create a language construct to make a new local scope, then we''ll need to vet our translators and fix them individually. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Roman Shaposhnik
2008-Apr-22  17:57 UTC
[dtrace-discuss] whatfor.d -- where''s null pointer?
On Tue, 2008-04-22 at 10:21 -0700, Adam Leventhal wrote:> On Tue, Apr 22, 2008 at 09:37:57AM -0400, Lytvyn, Oleksandr (IT) wrote: > > That''s an interesting analysis. Hope we can have a fix soon. > > I forgot to mention that you can work around the issue by replacing > curlwpsinfo->pr_stype with: > > (((this->tmp = curthread->t_sobj_ops) != NULL) ? this->tmp->sobj_type : 0)Isn''t there a race-condition still (although a much less probable one)? Thanks, Roman.
On Tue, Apr 22, 2008 at 10:57:26AM -0700, Roman Shaposhnik wrote:> On Tue, 2008-04-22 at 10:21 -0700, Adam Leventhal wrote: > > On Tue, Apr 22, 2008 at 09:37:57AM -0400, Lytvyn, Oleksandr (IT) wrote: > > > That''s an interesting analysis. Hope we can have a fix soon. > > > > I forgot to mention that you can work around the issue by replacing > > curlwpsinfo->pr_stype with: > > > > (((this->tmp = curthread->t_sobj_ops) != NULL) ? this->tmp->sobj_type : 0) > > Isn''t there a race-condition still (although a much less probable one)?I don''t think so; can you describe the race condition you see? Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Roman Shaposhnik
2008-Apr-22  18:34 UTC
[dtrace-discuss] whatfor.d -- where''s null pointer?
On Tue, 2008-04-22 at 11:21 -0700, Adam Leventhal wrote:> On Tue, Apr 22, 2008 at 10:57:26AM -0700, Roman Shaposhnik wrote: > > On Tue, 2008-04-22 at 10:21 -0700, Adam Leventhal wrote: > > > On Tue, Apr 22, 2008 at 09:37:57AM -0400, Lytvyn, Oleksandr (IT) wrote: > > > > That''s an interesting analysis. Hope we can have a fix soon. > > > > > > I forgot to mention that you can work around the issue by replacing > > > curlwpsinfo->pr_stype with: > > > > > > (((this->tmp = curthread->t_sobj_ops) != NULL) ? this->tmp->sobj_type : 0) > > > > Isn''t there a race-condition still (although a much less probable one)? > > I don''t think so; can you describe the race condition you see?Without knowing the details of how the structure to which t_sobj_ops is pointing gets managed it seems to me that there''s a tiny window of opportunity between recording the address of the structure into this->tmp and the structure itself dealocated/reused for something else. Of course, the recorded address will still be valid, but once you do this->tmp->sobj_type you might get garbage. Can this happen? Thanks, Roman.
On Tue, Apr 22, 2008 at 11:34:32AM -0700, Roman Shaposhnik wrote:> Without knowing the details of how the structure to which t_sobj_ops is > pointing gets managed it seems to me that there''s a tiny window of > opportunity between recording the address of the structure into > this->tmp and the structure itself dealocated/reused for something > else. Of course, the recorded address will still be valid, but once > you do this->tmp->sobj_type you might get garbage. > > Can this happen?No. t_sobj_ops is always assigned the address of a static structure that is never deallocated. But it''s a fair point for other cases, and I don''t think there''s a generic solution. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl