thr3ads.net - dtrace discuss - [dtrace-discuss] Potential OpenSolaris hres

If this information is useful, please help other people find it:
Share via:

Bryan Cantrill

2009-May-01 06:50 UTC

[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks

Mathieu,
> First, let me introduce myself : I am the author and maintainer of
> LTTng, a Linux kernel tracer, a project that started back in 2005.
> I have looked at the Dtrace code to figure out how you managed the
> time-base reads in Dtrace so I can talk about it in my forthcoming paper
> about synchronization primitives for OS tracing. Unfortunately, there
> seem to be a few race conditions that I would like to first discuss with
> you before publication. 
Apologies for my delay in getting back to you.  Executive summary:  there
is no race condition here; you appear to have misunderstood our
implementation.  Please see below.
> You will then probably be interested in the
> solutions I propose in my paper. I am referring to the Solaris 20090330
> code base.
> 
> 1)
> 
> sun4/sys/clock.h has a comment above the CLOCK_LOCK which indicates
> this:
> 
>  * (3) Interaction with the high level cyclic handler, hres_tick().
>  *
>  * One unusual property of hres_lock is that it''s acquired in a
high
>  * level cyclic handler, hres_tick().  Thus, hres_lock must be acquired at
>  * CBE_HIGH_PIL or higher to prevent single-CPU deadlock.
> 
> However, CLOCK_LOCK is also taken from :
> 
> ntp_adjtime(), calling clock_update(). This first problem could probably
> be fixed by simply disabling interrupts in clock_update() around
> CLOCK_LOCK usage(fix 1).
Yes, CLOCK_LOCK is doing a lock_set_spl(), which is setting PIL to 15
before acquiring the lock.  There is no race here.
> So this first problem, even when solved, leads us to a more interesting
> potential deadlock related to the tracer time source.
> 
> 2)
> 
> Assuming that code executed in NMI handlers can be instrumented by
> Dtrace, the following race condition could happen :
> 
>         CPU A                            CPU B
> - executes ntp_adjtime
> - disables interrupts (assuming fix (1))
> - takes CLOCK_LOCK
>                                          - cyclic tsc_tick runs
>                                          - takes shadow clock lock
>                                          - interrupted by NMI
>                                          - Dtrace probe called
> 					 - try to read time base/check
> 					   the clock lock twice, fails.
>                                          - try to read shadow time base,
>                                            check shadow clock lock.
>                                              -> deadlock, NMI will wait
>                                                 for shadow clock lock
forever
>                                                 because it is nested over
this
>                                                 lock.
> 
> I used the NMI as an example, but this could also apply to any interrupt
> source that would run with higher priority than the cyclic subsystem (if
> this is ever possible on sparc or i86pc ?).
> 
> This second problem would however require rethinking the locking
> primitives you use to synchronize your time base. This is the topic I
> will discuss in great length in my forthcoming paper.
> 
> Hopefully I am not misunderstanding too much of Solaris-specific
> internals in my study here. Comments would be much appreciated.
Did you read the block comment above dtrace_hres_tick()?  As that explains
things pretty thoroughly, let me repeat it here (minus the C comment
formatting):

   Making available adjustable high-resolution time in DTrace is regrettably
   more complicated than one might think it should be.	The problem is that
   the variables related to adjusted high-resolution time (hrestime,
   hrestime_adj and friends) are adjusted under hres_lock -- and this lock may
   be held when we enter probe context.	 One might think that we could address
   this by having a single snapshot copy that is stored under a different lock
   from hres_tick(), using the snapshot iff hres_lock is locked in probe
   context.  Unfortunately, this too won''t work:  because hres_lock is
grabbed
   in more than just hres_tick() context, we could enter probe context
   concurrently on two different CPUs with both locks (hres_lock and the
   snapshot lock) held.	 As this implies, the fundamental problem is that we
   need to have access to a snapshot of these variables that we _know_ will
   not be locked in probe context.  To effect this, we have two snapshots
   protected by two different locks, and we mandate that these snapshots are
   recorded in succession by a single thread calling dtrace_hres_tick().  (We
   assure this by calling it out of the same CY_HIGH_LEVEL cyclic that calls
   hres_tick().)  A single thread can''t be in two places at once:  one
of the
   snapshot locks is guaranteed to be unheld at all times.  The
   dtrace_gethrestime() algorithm is thus to check first one snapshot and then
   the other to find the unlocked snapshot.

As that comment explains (pretty clearly, I thought), there is not one shadow
lock -- there are two.  And those shadow locks are only acquired by a single
thread and only in succession:  it is not possible for both locks to be held
simultaneously.  You seem to have missed this rather important detail when
you wrote that CPU B "tr[ies] to read time base/check the clock lock
twice, fails" -- one of those _must_ succeed.

So, unless I''ve missed some critical element of your argument, there is
no deadlock here -- and I''d ask that you kindly reflect that fact in
your
work.  Please let me know if you have additional questions -- and should
you wish me to review a draft of the paper that you are preparing, I am
at your service...

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems Fishworks.       http://blogs.sun.com/bmc

Bryan Cantrill

2009-May-01 19:59 UTC

head link

[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks

Hey Mathieu,
> OK, so we are talking about different pieces of code then. You refer to
> common/os/dtrace_subr.c:dtrace_gethrestime() (read-side) and
> dtrace_hres_tick() for update.
Ah.  Yes, that explains it.
> The two sequence locks idea is pretty interesting, since it makes sure
> they are only taken one after the other by the same unique thread, and
> never held at the same time. Having two elements of data updated
> consecutively is clearly something we have in common in our respective
> clock algorithms. They probably differ in term of execution overhead,
> but the double consecutive seqlock idea should definitely work.
The execution overhead doesn''t really matter, as the code that actually
performs the sequential locking is executed once per second.
> There is a concern regarding NTP-related time skews though, which will
> be explained below.
> 
> 
> Now about i86pc/os/timestamp.c:dtrace_gethrtime()
> 
> So given it''s called from dtrace_gethrestime(), you should really,
> really modify it so it uses a double shadow-lock based solution as you
> rightly do in dtrace_gethrestime().
> 
> I also wonder if your clock might be skewed when NTP adjustment is
> performed, given it does not update the shadow variables. I guess you
> will simply use the old NTP adjustment until you get the next tick, but
> this might cause skews between readers using the non-shadow and shadow
> clock during this time-frame. This seems rather odd. The same reasoning
> applies to a reader coming right between the two consecutive shadow
> variables updates in dtrace_gethrestime(). Time could appear to go
> backward.
Time can always appear to go backwards when looking at hrestime --
that''s
why we have hrtime (which is guaranteed to be monotonic).
> I should update the deadlock scenario for x86 dtrace_gethrtime(),
> because we need one more actor than present in the scenario I thought of
> in my original email :
> 
>         CPU A                            CPU B
> - executes ntp_adjtime
> - takes CLOCK_LOCK (disable interrupts)
>                                          - cyclic tsc_tick runs
>                                          - takes shadow clock lock
>                                          - interrupted by NMI
>                                          - Dtrace probe called
> 					 - try to read time base/check
> 					   the clock lock twice, fails.
>                                          - try to read shadow time base,
>                                            check shadow clock lock.
> - interrupted by NMI
> - Dtrace probe called
> - try to read time base/check
>   the clock lock twice, fails.
> - try to read shadow time base,
>   check shadow clock lock.
> 
>                          -> 4-way deadlock, NMIs will wait
>                             for clock and shadow clock locks forever
>                             because they are nested over both of these
>                             locks.
> 
> It should be rare, but it seems possible.
"Rare" is not quite the word for it.  You are requiring an NMI on a
six
instruction window -- one that is executed only once per second -- and
_another_, _concurrent_ NMI on a disjoint CPU, itself in a relatively small
window (hres_lock held).  And then both must of course make calls into
functions instrumented by DTrace.  This is no easy trick, as we use NMI
in Solaris primarily to enter the kernel debugger (speaking for the product
line that I''m working on, we use NMI exclusively to force an operating
system panic).  So while perhaps theoretically possible, the other
elements of the system make this problem pragmatically impossible to
hit.  If we felt that there were a non-zero chance of hitting this, we
could -- as you point out -- easily duplicate the double shadow lock solution
that we used for hrestime...
> > So, unless I''ve missed some critical element of your
argument, there is
> > no deadlock here -- and I''d ask that you kindly reflect that
fact in your
> > work.  Please let me know if you have additional questions -- and
should
> > you wish me to review a draft of the paper that you are preparing, I
am
> > at your service...
> > 
> 
> I guess we were simply talking about two different locks. I look forward
> to your reply. I will be honored to send you a draft of my paper for
> review when I get the last bits and pieces in place.
I''d be happy to; send it my way when you have a draft.

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems Fishworks.       http://blogs.sun.com/bmc

dtrace discuss - May 2009 - Potential OpenSolaris hres_lock deadlocks

[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks

[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks