James McIlree
2007-Aug-01 08:03 UTC
[dtrace-discuss] Race when removing user static providers
I think I''ve found a race when user static providers are being removed. What I am observing is that under certain stress tests, fasttrap meta providers are not being cleaned up. The process exits, all dtrace consumers exit, but a later dtrace -l still shows probes from that provider. I''ve got a theory and a fix that seems to work so far... What I think is happening is this: You''ve got a dtrace consumer matching probes named: "test*:::" Process A has a "test" user static provider. They''ve been enabled. Process A decides to exit. This means that dtrace_helpers_cleanup() is called, which synchronously marks the provider as "marked" and "retired", and schedules an asynchronous cleanup for later. Very shortly afterwards, the fasttrap exit cleanup is called, and starts disabling probes. Suppose, though, that you''re on a MP machine, and the asynchronous cleanup was running on another cpu. It could be already active due to an earlier exit, or because it immediately got an otherwise idle cpu. In the fasttrap cleanup code, we''ve got these checks:> /* > * If this provider has consumers actively > * creating probes (ftp_ccount) or is a USDT > * provider (ftp_mcount), we can''t unregister > * or even condense. > */ > if (fp->ftp_ccount != 0 || > fp->ftp_mcount != 0) { > lck_mtx_unlock(&fp->ftp_mtx); > fp->ftp_marked = 0; > continue; > } > > if (!fp->ftp_retired || fp->ftp_rcount != 0) > fp->ftp_marked = 0;The first thing I notice is that the ftp_marked value is set outside the lock in one test. Seems a bit risky :-), but I don''t think that is the problem. If we arrive at this code section while the meta provider has been "marked", but not all its probes have been disabled, fp->ftp_rcount will be non-zero. The "mark" is cleared. However, there is never again a set of the "mark", and so this provider becomes a zombie. I''m thinking the fix is to detect if the provider is a meta provider, and then simply punt to later if ftp_rcount is non-zero. Something like this:> if (fp->ftp_rcount != 0 && fp->ftp_provider_type == > DTFTP_PROVIDER_USDT) { > later++; > fpp = &fp->ftp_next; > lck_mtx_unlock(&fp->ftp_mtx); > continue; > }James M
Adam Leventhal
2007-Aug-01 21:18 UTC
[dtrace-discuss] Race when removing user static providers
Hi James, Can you describe the sequence of events that could lead to this in a little more detail? I''m not sure I agree with all of your analysis.> If we arrive at this code section while the meta provider has been > "marked", but not all its probes have been disabled, fp->ftp_rcount will be > non-zero. The "mark" is cleared. However, there is never again a set of the > "mark", and so this provider becomes a zombie.Since we''ve set ftp_retired on the provider, disabling probes will cause us to attempt to remove the provider: ASSERT(provider->ftp_rcount > 0); provider->ftp_rcount--; if (p != NULL) { /* * Even though we may not be able to remove it entirely, we * mark this retired provider to get a chance to remove some * of the associated probes. */ if (provider->ftp_retired && !provider->ftp_marked) whack = provider->ftp_marked = 1; mutex_exit(&provider->ftp_mtx); mutex_enter(&p->p_lock); sprunlock(p); } else { /* * If the process is dead, we''re just waiting for the * last probe to be disabled to be able to free it. */ if (provider->ftp_rcount == 0 && !provider->ftp_marked) whack = provider->ftp_marked = 1; mutex_exit(&provider->ftp_mtx); } if (whack) fasttrap_pid_cleanup();> Suppose, though, that you''re on a MP machine, and the asynchronous > cleanup was running on another cpu. It could be already active due > to an earlier exit, or because it immediately got an otherwise idle > cpu.That''s true, but the provider lock -- and the bucket lock -- will protect against simultaneous access won''t they? It''s not a surprise that your fix works since it will just cause the cleanup pass to repeat until all USDT provider that were ever marked were gone. The code is meant to wait until frp_rcount drops to zero. Your modification addresses the issue, but the underlying problem still seems to remain. In theory at least, it seems that the modification of ftp_marked outside of the protection of the provider''s lock: thread 1 thread 2 fasttrap_pid_cleanup_cb() ... ftp_mcount != 0 drop ftp_mtx fasttrap_provider_retire() . take ftp_mtx ftp_retired = 1; . ftp_marked = 1; drop ftp_mtx . fasttrap_pid_disable() take ftp_mtx . ftp_rcount = 0; ftp_marked = 1; . drop ftp_mtx ftp_marked = 0; I think this is possible, but it''s pretty unlikely on Solaris (but could happen inside of VMware or Xen for example). Can you describe another scenario whereby we could miss the chance to destroy the provider when the last probe is disabled? Are you operating in an environment where a thread could effectively be off CPU for a while? Can you reproduce the problem? In any case, both unprotected uses of ftp_marked in fasttrap_pid_cleanup_cb() should probably be corrected. Adam On Wed, Aug 01, 2007 at 01:03:54AM -0700, James McIlree wrote:> I think I''ve found a race when user static providers are > being removed. What I am observing is that under certain stress > tests, fasttrap meta providers are not being cleaned up. The process > exits, all dtrace consumers exit, but a later dtrace -l still > shows probes from that provider. > > I''ve got a theory and a fix that seems to work so far... > > What I think is happening is this: > > You''ve got a dtrace consumer matching probes named: "test*:::" > > Process A has a "test" user static provider. They''ve been enabled. > Process A decides to exit. This means that dtrace_helpers_cleanup() > is called, which synchronously marks the provider as "marked" and > "retired", and schedules an asynchronous cleanup for later. > > Very shortly afterwards, the fasttrap exit cleanup is called, and > starts disabling probes. > > Suppose, though, that you''re on a MP machine, and the asynchronous > cleanup was running on another cpu. It could be already active due > to an earlier exit, or because it immediately got an otherwise idle > cpu. > > In the fasttrap cleanup code, we''ve got these checks: > > > > /* > > * If this provider has consumers actively > > * creating probes (ftp_ccount) or is a USDT > > * provider (ftp_mcount), we can''t unregister > > * or even condense. > > */ > > if (fp->ftp_ccount != 0 || > > fp->ftp_mcount != 0) { > > lck_mtx_unlock(&fp->ftp_mtx); > > fp->ftp_marked = 0; > > continue; > > } > > > > if (!fp->ftp_retired || fp->ftp_rcount != 0) > > fp->ftp_marked = 0; > > > The first thing I notice is that the ftp_marked value is set outside > the lock in one test. Seems a bit risky :-), but I don''t think that is > the problem. > > If we arrive at this code section while the meta provider has been > "marked", but not all its probes have been disabled, fp->ftp_rcount > will be > non-zero. The "mark" is cleared. However, there is never again a set > of the > "mark", and so this provider becomes a zombie. > > I''m thinking the fix is to detect if the provider is a meta provider, > and then simply punt to later if ftp_rcount is non-zero. Something > like this: > > > if (fp->ftp_rcount != 0 && fp->ftp_provider_type == > > DTFTP_PROVIDER_USDT) { > > later++; > > fpp = &fp->ftp_next; > > lck_mtx_unlock(&fp->ftp_mtx); > > continue; > > } > > James M > > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org-- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
James McIlree
2007-Aug-02 00:07 UTC
[dtrace-discuss] Race when removing user static providers
On Aug 1, 2007, at 2:18 PM, Adam Leventhal wrote:> Hi James, > > Can you describe the sequence of events that could lead to this in a > little > more detail? I''m not sure I agree with all of your analysis.Drat, I missed the remark in fasttrap_pid_disable. Kind of shoots a big hole in my theory :-).> > Can you describe another scenario whereby we could miss the chance to > destroy the provider when the last probe is disabled? Are you > operating > in an environment where a thread could effectively be off CPU for a > while? Can you reproduce the problem?I''ll look at it further, and see what I can come up with. Let me report my observed data, and maybe you can spot the real race. The problem is 100% reproducible. Of course, this is MP OS X, and I don''t have a MP Solaris install to test against. SP Solaris does not show the problem. I really hope its a shared problem, and not some subtle difference in OS behavior :-). A test program with ~10 user static probes sits in a loop, doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast as it can. When it execs, it execs a copy of itself (with probes) that immediately exits. In another shell, I run "dtrace -n "test*:::", let it observe a few probe firings, and then I ctrl-c. Do this ~5 - 10 times. Stop the test process, and do a dtrace -l. You should observe USDT probes that belong to processes that have exited. They''re zombies. At one point, I walked the provider hash tables by hand in the debugger. When I reached a zombie, it''s ftp_marked field was unset, and retired was set. I focused in on the ftp_marked field and have forgotten what the rcount was. As you suspected, my proposed fix did work, but I also agree I didn''t fix the underlying issue. I''ll unwind and try to grab another zombie provider snapshot.> > In any case, both unprotected uses of ftp_marked in > fasttrap_pid_cleanup_cb() > should probably be corrected.I only saw one unprotected set of ftp_marked, and fixing it did not fix the problem. James M
Adam Leventhal
2007-Aug-02 00:45 UTC
[dtrace-discuss] Race when removing user static providers
On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree wrote:> >Can you describe another scenario whereby we could miss the chance to > >destroy the provider when the last probe is disabled? Are you > >operating > >in an environment where a thread could effectively be off CPU for a > >while? Can you reproduce the problem? > > I''ll look at it further, and see what I can come up with. > > Let me report my observed data, and maybe you can spot the real > race. > > The problem is 100% reproducible. Of course, this is MP OS X, and > I don''t have a MP Solaris install to test against. SP Solaris does not > show the problem. I really hope its a shared problem, and not some > subtle difference in OS behavior :-).I can certainly try it out on a big MP Solaris box. That said, DTrace and USDT have been used on a wide array of MP machines, and not hit this to my knowledge.> A test program with ~10 user static probes sits in a loop, > doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast > as it can. When it execs, it execs a copy of itself (with probes) > that immediately exits. > > In another shell, I run "dtrace -n "test*:::", let it observe > a few probe firings, and then I ctrl-c. Do this ~5 - 10 times. > > Stop the test process, and do a dtrace -l. You should observe > USDT probes that belong to processes that have exited. They''re zombies.No luck. This is on a two-way box. Should I try something bigger? Should I be waiting for these processes that the test program has forked off? If you could provide the code for the actual test case that would probably help. The best way to provide a test case would be in a form that could then be integrated into the DTrace test suite. You might look at this test since it sounds similar to what you''re trying: usr/src/cmd/dtrace/test/tst/common/usdt/*forker*> At one point, I walked the provider hash tables by hand in > the debugger. When I reached a zombie, it''s ftp_marked field was unset, > and retired was set. I focused in on the ftp_marked field and have > forgotten what the rcount was.An aside, next up you guys should port mdb:> fasttrap_provs::print fth_table | ::array fasttrap_bucket_t 100 | ::print fasttrap_bucket_t ftb_data | ::grep .>0 | ::print fasttrap_provider_t ftp_marked ftp_retiredftp_marked = 0 ftp_retired = 0 ...> >In any case, both unprotected uses of ftp_marked in > >fasttrap_pid_cleanup_cb() > >should probably be corrected. > > I only saw one unprotected set of ftp_marked, and fixing it did > not fix the problem.Like I said: pretty unlikely given the amount of code that would need to be executed on one CPU during a very narrow window on another CPU (though it''s possible on a VMware or similar config [1]). Adam [1] http://blogs.sun.com/ahl/entry/debugging_cross_calls_on_opensolaris -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
James McIlree
2007-Aug-02 00:54 UTC
[dtrace-discuss] Race when removing user static providers
On Aug 1, 2007, at 5:45 PM, Adam Leventhal wrote:> On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree wrote: >> >> The problem is 100% reproducible. Of course, this is MP OS X, and >> I don''t have a MP Solaris install to test against. SP Solaris does >> not >> show the problem. I really hope its a shared problem, and not some >> subtle difference in OS behavior :-). > > I can certainly try it out on a big MP Solaris box. That said, > DTrace and > USDT have been used on a wide array of MP machines, and not hit this > to my > knowledge.Okay, another significant piece of observed data. There is one decrement of ftp_rcount that does not reset marked, in fasttrap_pid_disable():> if (!probe->ftp_enabled) { > lck_mtx_lock(&provider->ftp_mtx); > provider->ftp_rcount--; > lck_mtx_unlock(&provider->ftp_mtx); > return; > }I instrumented this, and sure enough, all my zombie providers are decremented to zero inside this block of code. My original theory may be revivable :-). Of course, the next question is *why* we''re in this snippet of code... James M
Adam Leventhal
2007-Aug-02 01:06 UTC
[dtrace-discuss] Race when removing user static providers
On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree wrote:> Okay, another significant piece of observed data. > > There is one decrement of ftp_rcount that does not reset marked, > in fasttrap_pid_disable(): > > > if (!probe->ftp_enabled) { > > lck_mtx_lock(&provider->ftp_mtx); > > provider->ftp_rcount--; > > lck_mtx_unlock(&provider->ftp_mtx); > > return; > > } > > I instrumented this, and sure enough, all my zombie providers are > decremented to zero inside this block of code.How old is the code you''re using? This was removed 4 months ago. (And did you guys really rip out the assertions?)> My original theory may be revivable :-).I''m not sure I follow.> Of course, the next question is *why* we''re in this snippet of > code...Right. The only way ftp_enabled won''t be set is if we didn''t get through fasttrap_pid_enable() successfully (we''d like to fail, but what we''d do in the broader DTrace framework is a bit unclear). Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
James McIlree
2007-Aug-02 02:06 UTC
[dtrace-discuss] Race when removing user static providers
On Aug 1, 2007, at 6:06 PM, Adam Leventhal wrote:> On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree wrote: >> Okay, another significant piece of observed data. >> >> There is one decrement of ftp_rcount that does not reset marked, >> in fasttrap_pid_disable(): >> >>> if (!probe->ftp_enabled) { >>> lck_mtx_lock(&provider->ftp_mtx); >>> provider->ftp_rcount--; >>> lck_mtx_unlock(&provider->ftp_mtx); >>> return; >>> } >> >> I instrumented this, and sure enough, all my zombie providers are >> decremented to zero inside this block of code. > > How old is the code you''re using? This was removed 4 months ago. > (And did > you guys really rip out the assertions?)Older than 4 months ago, obviously :-). "We apologise for the missing ASSERT. Those responsible have been sacked."> >> My original theory may be revivable :-). > > I''m not sure I follow.The zombie providers were sticking around because ftp_marked was cleared in the cleanup loop, but never reset.> > >> Of course, the next question is *why* we''re in this snippet of >> code... > > Right. The only way ftp_enabled won''t be set is if we didn''t get > through > fasttrap_pid_enable() successfully (we''d like to fail, but what we''d > do > in the broader DTrace framework is a bit unclear).Some additional digging is needed. James M