thr3ads.net - dtrace discuss - [dtrace-discuss] Race when removing user static providers [Aug 2007]

If this information is useful, please help other people find it:
Share via:

James McIlree

2007-Aug-01 08:03 UTC

[dtrace-discuss] Race when removing user static providers

I think I''ve found a race when user static providers are
being removed. What I am observing is that under certain stress
tests, fasttrap meta providers are not being cleaned up. The process
exits, all dtrace consumers exit, but a later dtrace -l still
shows probes from that provider.

	I''ve got a theory and a fix that seems to work so far...

	What I think is happening is this:

	You''ve got a dtrace consumer matching probes named:
"test*:::"

	Process A has a "test" user static provider. They''ve been
enabled.
Process A decides to exit. This means that dtrace_helpers_cleanup()
is called, which synchronously marks the provider as "marked" and
"retired", and schedules an asynchronous cleanup for later.

	Very shortly afterwards, the fasttrap exit cleanup is called, and
starts disabling probes.

	Suppose, though, that you''re on a MP machine, and the asynchronous
cleanup was running on another cpu. It could be already active due
to an earlier exit, or because it immediately got an otherwise idle
cpu.

	In the fasttrap cleanup code, we''ve got these checks:

> 			/*
> 			 * If this provider has consumers actively
> 			 * creating probes (ftp_ccount) or is a USDT
> 			 * provider (ftp_mcount), we can''t unregister
> 			 * or even condense.
> 			 */
> 			if (fp->ftp_ccount != 0 ||
> 			    fp->ftp_mcount != 0) {
> 				lck_mtx_unlock(&fp->ftp_mtx);
> 				fp->ftp_marked = 0;
> 				continue;
> 			}
>
> 			if (!fp->ftp_retired || fp->ftp_rcount != 0)
> 				fp->ftp_marked = 0;

	The first thing I notice is that the ftp_marked value is set outside
the lock in one test. Seems a bit risky :-), but I don''t think that is
the problem.

	If we arrive at this code section while the meta provider has been
"marked", but not all its probes have been disabled, fp->ftp_rcount
will be
non-zero. The "mark" is cleared. However, there is never again a set  
of the
"mark", and so this provider becomes a zombie.

	I''m thinking the fix is to detect if the provider is a meta provider,
and then simply punt to later if ftp_rcount is non-zero. Something  
like this:
> 	if (fp->ftp_rcount != 0 && fp->ftp_provider_type ==  
> DTFTP_PROVIDER_USDT) {
> 		later++;
> 		fpp = &fp->ftp_next;
> 		lck_mtx_unlock(&fp->ftp_mtx);
> 		continue;
> 	}
	James M

Adam Leventhal

2007-Aug-01 21:18 UTC

head link

[dtrace-discuss] Race when removing user static providers

Hi James,

Can you describe the sequence of events that could lead to this in a little
more detail? I''m not sure I agree with all of your analysis.
> 	If we arrive at this code section while the meta provider has been
> "marked", but not all its probes have been disabled,
fp->ftp_rcount  will be
> non-zero. The "mark" is cleared. However, there is never again a
set of the
> "mark", and so this provider becomes a zombie.
Since we''ve set ftp_retired on the provider, disabling probes will
cause us
to attempt to remove the provider:

        ASSERT(provider->ftp_rcount > 0);
        provider->ftp_rcount--;

        if (p != NULL) {
                /*
                 * Even though we may not be able to remove it entirely, we
                 * mark this retired provider to get a chance to remove some
                 * of the associated probes.
                 */
                if (provider->ftp_retired &&
!provider->ftp_marked)
                        whack = provider->ftp_marked = 1;
                mutex_exit(&provider->ftp_mtx);

                mutex_enter(&p->p_lock);
                sprunlock(p);
        } else {
                /*
                 * If the process is dead, we''re just waiting for the
                 * last probe to be disabled to be able to free it.
                 */
                if (provider->ftp_rcount == 0 &&
!provider->ftp_marked)
                        whack = provider->ftp_marked = 1;
                mutex_exit(&provider->ftp_mtx);
        }

        if (whack)
                fasttrap_pid_cleanup();
> 	Suppose, though, that you''re on a MP machine, and the
asynchronous
> cleanup was running on another cpu. It could be already active due
> to an earlier exit, or because it immediately got an otherwise idle
> cpu.
That''s true, but the provider lock -- and the bucket lock -- will
protect
against simultaneous access won''t they?

It''s not a surprise that your fix works since it will just cause the
cleanup
pass to repeat until all USDT provider that were ever marked were gone. The
code is meant to wait until frp_rcount drops to zero. Your modification
addresses the issue, but the underlying problem still seems to remain.

In theory at least, it seems that the modification of ftp_marked outside of
the protection of the provider''s lock:

	thread 1				thread 2
					fasttrap_pid_cleanup_cb()
						...
						ftp_mcount != 0
						drop ftp_mtx
fasttrap_provider_retire()			.
	take ftp_mtx
	ftp_retired = 1;			.
	ftp_marked = 1;
	drop ftp_mtx				.
fasttrap_pid_disable()
	take ftp_mtx				.
	ftp_rcount = 0;
	ftp_marked = 1;				.
	drop ftp_mtx
						ftp_marked = 0;

I think this is possible, but it''s pretty unlikely on Solaris (but
could
happen inside of VMware or Xen for example).

Can you describe another scenario whereby we could miss the chance to
destroy the provider when the last probe is disabled? Are you operating
in an environment where a thread could effectively be off CPU for a
while? Can you reproduce the problem?

In any case, both unprotected uses of ftp_marked in fasttrap_pid_cleanup_cb()
should probably be corrected.

Adam

On Wed, Aug 01, 2007 at 01:03:54AM -0700, James McIlree
wrote:> 	I think I''ve found a race when user static providers are
> being removed. What I am observing is that under certain stress
> tests, fasttrap meta providers are not being cleaned up. The process
> exits, all dtrace consumers exit, but a later dtrace -l still
> shows probes from that provider.
> 
> 	I''ve got a theory and a fix that seems to work so far...
> 
> 	What I think is happening is this:
> 
> 	You''ve got a dtrace consumer matching probes named:
"test*:::"
> 
> 	Process A has a "test" user static provider. They''ve
been enabled.
> Process A decides to exit. This means that dtrace_helpers_cleanup()
> is called, which synchronously marks the provider as "marked" and
> "retired", and schedules an asynchronous cleanup for later.
> 
> 	Very shortly afterwards, the fasttrap exit cleanup is called, and
> starts disabling probes.
> 
> 	Suppose, though, that you''re on a MP machine, and the
asynchronous
> cleanup was running on another cpu. It could be already active due
> to an earlier exit, or because it immediately got an otherwise idle
> cpu.
> 
> 	In the fasttrap cleanup code, we''ve got these checks:
> 
> 
> > 			/*
> > 			 * If this provider has consumers actively
> > 			 * creating probes (ftp_ccount) or is a USDT
> > 			 * provider (ftp_mcount), we can''t unregister
> > 			 * or even condense.
> > 			 */
> > 			if (fp->ftp_ccount != 0 ||
> > 			    fp->ftp_mcount != 0) {
> > 				lck_mtx_unlock(&fp->ftp_mtx);
> > 				fp->ftp_marked = 0;
> > 				continue;
> > 			}
> >
> > 			if (!fp->ftp_retired || fp->ftp_rcount != 0)
> > 				fp->ftp_marked = 0;
> 
> 
> 	The first thing I notice is that the ftp_marked value is set outside
> the lock in one test. Seems a bit risky :-), but I don''t think
that is
> the problem.
> 
> 	If we arrive at this code section while the meta provider has been
> "marked", but not all its probes have been disabled,
fp->ftp_rcount
> will be
> non-zero. The "mark" is cleared. However, there is never again a
set
> of the
> "mark", and so this provider becomes a zombie.
> 
> 	I''m thinking the fix is to detect if the provider is a meta
provider,
> and then simply punt to later if ftp_rcount is non-zero. Something  
> like this:
> 
> > 	if (fp->ftp_rcount != 0 && fp->ftp_provider_type ==  
> > DTFTP_PROVIDER_USDT) {
> > 		later++;
> > 		fpp = &fp->ftp_next;
> > 		lck_mtx_unlock(&fp->ftp_mtx);
> > 		continue;
> > 	}
> 
> 	James M
> 
> 
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org
-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl

James McIlree

2007-Aug-02 00:07 UTC

head link

[dtrace-discuss] Race when removing user static providers

On Aug 1, 2007, at 2:18 PM, Adam Leventhal wrote:
> Hi James,
>
> Can you describe the sequence of events that could lead to this in a  
> little
> more detail? I''m not sure I agree with all of your analysis.
	Drat, I missed the remark in fasttrap_pid_disable. Kind of shoots a
big hole in my theory :-).
>
> Can you describe another scenario whereby we could miss the chance to
> destroy the provider when the last probe is disabled? Are you  
> operating
> in an environment where a thread could effectively be off CPU for a
> while? Can you reproduce the problem?
	I''ll look at it further, and see what I can come up with.

	Let me report my observed data, and maybe you can spot the real
race.

	The problem is 100% reproducible. Of course, this is MP OS X, and
I don''t have a MP Solaris install to test against. SP Solaris does not
show the problem. I really hope its a shared problem, and not some  
subtle
difference in OS behavior :-).

	A test program with ~10 user static probes sits in a loop,
doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast
as it can. When it execs, it execs a copy of itself (with probes)
that immediately exits.

	In another shell, I run "dtrace -n "test*:::", let it observe
a few probe firings, and then I ctrl-c. Do this ~5 - 10 times.

	Stop the test process, and do a dtrace -l. You should observe
USDT probes that belong to processes that have exited. They''re zombies.

	At one point, I walked the provider hash tables by hand in
the debugger. When I reached a zombie, it''s ftp_marked field was unset,
and retired was set. I focused in on the ftp_marked field and have  
forgotten
what the rcount was.

	As you suspected, my proposed fix did work, but I also agree
I didn''t fix the underlying issue.

	I''ll unwind and try to grab another zombie provider snapshot.
>
> In any case, both unprotected uses of ftp_marked in  
> fasttrap_pid_cleanup_cb()
> should probably be corrected.
	I only saw one unprotected set of ftp_marked, and fixing it did
not fix the problem.

	James M

Adam Leventhal

2007-Aug-02 00:45 UTC

head link

[dtrace-discuss] Race when removing user static providers

On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree
wrote:> >Can you describe another scenario whereby we could miss the chance to
> >destroy the provider when the last probe is disabled? Are you  
> >operating
> >in an environment where a thread could effectively be off CPU for a
> >while? Can you reproduce the problem?
> 
> 	I''ll look at it further, and see what I can come up with.
> 
> 	Let me report my observed data, and maybe you can spot the real
> race.
> 
> 	The problem is 100% reproducible. Of course, this is MP OS X, and
> I don''t have a MP Solaris install to test against. SP Solaris does
not
> show the problem. I really hope its a shared problem, and not some  
> subtle difference in OS behavior :-).
I can certainly try it out on a big MP Solaris box. That said, DTrace and
USDT have been used on a wide array of MP machines, and not hit this to my
knowledge.
> 	A test program with ~10 user static probes sits in a loop,
> doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast
> as it can. When it execs, it execs a copy of itself (with probes)
> that immediately exits.
> 
> 	In another shell, I run "dtrace -n "test*:::", let it
observe
> a few probe firings, and then I ctrl-c. Do this ~5 - 10 times.
> 
> 	Stop the test process, and do a dtrace -l. You should observe
> USDT probes that belong to processes that have exited. They''re
zombies.
No luck. This is on a two-way box. Should I try something bigger? Should I
be waiting for these processes that the test program has forked off? If you
could provide the code for the actual test case that would probably help.
The best way to provide a test case would be in a form that could then be
integrated into the DTrace test suite. You might look at this test since it
sounds similar to what you''re trying:
  usr/src/cmd/dtrace/test/tst/common/usdt/*forker*
> 	At one point, I walked the provider hash tables by hand in
> the debugger. When I reached a zombie, it''s ftp_marked field was
unset,
> and retired was set. I focused in on the ftp_marked field and have  
> forgotten what the rcount was.
An aside, next up you guys should port mdb:> fasttrap_provs::print fth_table | ::array fasttrap_bucket_t 100 | ::print
fasttrap_bucket_t ftb_data | ::grep .>0 | ::print fasttrap_provider_t
ftp_marked ftp_retiredftp_marked = 0
ftp_retired = 0
...
> >In any case, both unprotected uses of ftp_marked in  
> >fasttrap_pid_cleanup_cb()
> >should probably be corrected.
> 
> 	I only saw one unprotected set of ftp_marked, and fixing it did
> not fix the problem.
Like I said: pretty unlikely given the amount of code that would need to
be executed on one CPU during a very narrow window on another CPU (though
it''s possible on a VMware or similar config [1]).

Adam

[1] http://blogs.sun.com/ahl/entry/debugging_cross_calls_on_opensolaris

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl

James McIlree

2007-Aug-02 00:54 UTC

head link

[dtrace-discuss] Race when removing user static providers

On Aug 1, 2007, at 5:45 PM, Adam Leventhal wrote:
> On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree wrote:
>>
>> 	The problem is 100% reproducible. Of course, this is MP OS X, and
>> I don''t have a MP Solaris install to test against. SP Solaris
does
>> not
>> show the problem. I really hope its a shared problem, and not some
>> subtle difference in OS behavior :-).
>
> I can certainly try it out on a big MP Solaris box. That said,  
> DTrace and
> USDT have been used on a wide array of MP machines, and not hit this  
> to my
> knowledge.
	Okay, another significant piece of observed data.

	There is one decrement of ftp_rcount that does not reset marked,
in fasttrap_pid_disable():
> 	if (!probe->ftp_enabled) {
> 		lck_mtx_lock(&provider->ftp_mtx);
> 		provider->ftp_rcount--;
> 		lck_mtx_unlock(&provider->ftp_mtx);
> 		return;
> 	}
	I instrumented this, and sure enough, all my zombie providers are
decremented to zero inside this block of code.

	My original theory may be revivable :-).

	Of course, the next question is *why* we''re in this snippet of code...

	James M

Adam Leventhal

2007-Aug-02 01:06 UTC

head link

[dtrace-discuss] Race when removing user static providers

On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree
wrote:> 	Okay, another significant piece of observed data.
> 
> 	There is one decrement of ftp_rcount that does not reset marked,
> in fasttrap_pid_disable():
> 
> >	if (!probe->ftp_enabled) {
> >		lck_mtx_lock(&provider->ftp_mtx);
> >		provider->ftp_rcount--;
> >		lck_mtx_unlock(&provider->ftp_mtx);
> >		return;
> >	}
> 
> 	I instrumented this, and sure enough, all my zombie providers are
> decremented to zero inside this block of code.
How old is the code you''re using? This was removed 4 months ago. (And
did
you guys really rip out the assertions?)
> 	My original theory may be revivable :-).
I''m not sure I follow.
> 	Of course, the next question is *why* we''re in this snippet of 
> 	code...
Right. The only way ftp_enabled won''t be set is if we didn''t
get through
fasttrap_pid_enable() successfully (we''d like to fail, but what
we''d do
in the broader DTrace framework is a bit unclear).

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl

James McIlree

2007-Aug-02 02:06 UTC

head link

[dtrace-discuss] Race when removing user static providers

On Aug 1, 2007, at 6:06 PM, Adam Leventhal wrote:
> On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree wrote:
>> 	Okay, another significant piece of observed data.
>>
>> 	There is one decrement of ftp_rcount that does not reset marked,
>> in fasttrap_pid_disable():
>>
>>> 	if (!probe->ftp_enabled) {
>>> 		lck_mtx_lock(&provider->ftp_mtx);
>>> 		provider->ftp_rcount--;
>>> 		lck_mtx_unlock(&provider->ftp_mtx);
>>> 		return;
>>> 	}
>>
>> 	I instrumented this, and sure enough, all my zombie providers are
>> decremented to zero inside this block of code.
>
> How old is the code you''re using? This was removed 4 months ago.  
> (And did
> you guys really rip out the assertions?)
	Older than 4 months ago, obviously :-).

	"We apologise for the missing ASSERT. Those responsible have been  
sacked."

>
>> 	My original theory may be revivable :-).
>
> I''m not sure I follow.
	The zombie providers were sticking around because ftp_marked was  
cleared
in the cleanup loop, but never reset.
>
>
>> 	Of course, the next question is *why* we''re in this snippet
of
>> 	code...
>
> Right. The only way ftp_enabled won''t be set is if we
didn''t get
> through
> fasttrap_pid_enable() successfully (we''d like to fail, but what
we''d
> do
> in the broader DTrace framework is a bit unclear).
	Some additional digging is needed.

	James M

dtrace discuss - Aug 2007 - Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers

[dtrace-discuss] Race when removing user static providers