Bryan Cantrill
2009-May-01 06:50 UTC
[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks
Mathieu,> First, let me introduce myself : I am the author and maintainer of > LTTng, a Linux kernel tracer, a project that started back in 2005. > I have looked at the Dtrace code to figure out how you managed the > time-base reads in Dtrace so I can talk about it in my forthcoming paper > about synchronization primitives for OS tracing. Unfortunately, there > seem to be a few race conditions that I would like to first discuss with > you before publication.Apologies for my delay in getting back to you. Executive summary: there is no race condition here; you appear to have misunderstood our implementation. Please see below.> You will then probably be interested in the > solutions I propose in my paper. I am referring to the Solaris 20090330 > code base. > > 1) > > sun4/sys/clock.h has a comment above the CLOCK_LOCK which indicates > this: > > * (3) Interaction with the high level cyclic handler, hres_tick(). > * > * One unusual property of hres_lock is that it''s acquired in a high > * level cyclic handler, hres_tick(). Thus, hres_lock must be acquired at > * CBE_HIGH_PIL or higher to prevent single-CPU deadlock. > > However, CLOCK_LOCK is also taken from : > > ntp_adjtime(), calling clock_update(). This first problem could probably > be fixed by simply disabling interrupts in clock_update() around > CLOCK_LOCK usage(fix 1).Yes, CLOCK_LOCK is doing a lock_set_spl(), which is setting PIL to 15 before acquiring the lock. There is no race here.> So this first problem, even when solved, leads us to a more interesting > potential deadlock related to the tracer time source. > > 2) > > Assuming that code executed in NMI handlers can be instrumented by > Dtrace, the following race condition could happen : > > CPU A CPU B > - executes ntp_adjtime > - disables interrupts (assuming fix (1)) > - takes CLOCK_LOCK > - cyclic tsc_tick runs > - takes shadow clock lock > - interrupted by NMI > - Dtrace probe called > - try to read time base/check > the clock lock twice, fails. > - try to read shadow time base, > check shadow clock lock. > -> deadlock, NMI will wait > for shadow clock lock forever > because it is nested over this > lock. > > I used the NMI as an example, but this could also apply to any interrupt > source that would run with higher priority than the cyclic subsystem (if > this is ever possible on sparc or i86pc ?). > > This second problem would however require rethinking the locking > primitives you use to synchronize your time base. This is the topic I > will discuss in great length in my forthcoming paper. > > Hopefully I am not misunderstanding too much of Solaris-specific > internals in my study here. Comments would be much appreciated.Did you read the block comment above dtrace_hres_tick()? As that explains things pretty thoroughly, let me repeat it here (minus the C comment formatting): Making available adjustable high-resolution time in DTrace is regrettably more complicated than one might think it should be. The problem is that the variables related to adjusted high-resolution time (hrestime, hrestime_adj and friends) are adjusted under hres_lock -- and this lock may be held when we enter probe context. One might think that we could address this by having a single snapshot copy that is stored under a different lock from hres_tick(), using the snapshot iff hres_lock is locked in probe context. Unfortunately, this too won''t work: because hres_lock is grabbed in more than just hres_tick() context, we could enter probe context concurrently on two different CPUs with both locks (hres_lock and the snapshot lock) held. As this implies, the fundamental problem is that we need to have access to a snapshot of these variables that we _know_ will not be locked in probe context. To effect this, we have two snapshots protected by two different locks, and we mandate that these snapshots are recorded in succession by a single thread calling dtrace_hres_tick(). (We assure this by calling it out of the same CY_HIGH_LEVEL cyclic that calls hres_tick().) A single thread can''t be in two places at once: one of the snapshot locks is guaranteed to be unheld at all times. The dtrace_gethrestime() algorithm is thus to check first one snapshot and then the other to find the unlocked snapshot. As that comment explains (pretty clearly, I thought), there is not one shadow lock -- there are two. And those shadow locks are only acquired by a single thread and only in succession: it is not possible for both locks to be held simultaneously. You seem to have missed this rather important detail when you wrote that CPU B "tr[ies] to read time base/check the clock lock twice, fails" -- one of those _must_ succeed. So, unless I''ve missed some critical element of your argument, there is no deadlock here -- and I''d ask that you kindly reflect that fact in your work. Please let me know if you have additional questions -- and should you wish me to review a draft of the paper that you are preparing, I am at your service... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Sun Microsystems Fishworks. http://blogs.sun.com/bmc
Bryan Cantrill
2009-May-01 19:59 UTC
[dtrace-discuss] Potential OpenSolaris hres_lock deadlocks
Hey Mathieu,> OK, so we are talking about different pieces of code then. You refer to > common/os/dtrace_subr.c:dtrace_gethrestime() (read-side) and > dtrace_hres_tick() for update.Ah. Yes, that explains it.> The two sequence locks idea is pretty interesting, since it makes sure > they are only taken one after the other by the same unique thread, and > never held at the same time. Having two elements of data updated > consecutively is clearly something we have in common in our respective > clock algorithms. They probably differ in term of execution overhead, > but the double consecutive seqlock idea should definitely work.The execution overhead doesn''t really matter, as the code that actually performs the sequential locking is executed once per second.> There is a concern regarding NTP-related time skews though, which will > be explained below. > > > Now about i86pc/os/timestamp.c:dtrace_gethrtime() > > So given it''s called from dtrace_gethrestime(), you should really, > really modify it so it uses a double shadow-lock based solution as you > rightly do in dtrace_gethrestime(). > > I also wonder if your clock might be skewed when NTP adjustment is > performed, given it does not update the shadow variables. I guess you > will simply use the old NTP adjustment until you get the next tick, but > this might cause skews between readers using the non-shadow and shadow > clock during this time-frame. This seems rather odd. The same reasoning > applies to a reader coming right between the two consecutive shadow > variables updates in dtrace_gethrestime(). Time could appear to go > backward.Time can always appear to go backwards when looking at hrestime -- that''s why we have hrtime (which is guaranteed to be monotonic).> I should update the deadlock scenario for x86 dtrace_gethrtime(), > because we need one more actor than present in the scenario I thought of > in my original email : > > CPU A CPU B > - executes ntp_adjtime > - takes CLOCK_LOCK (disable interrupts) > - cyclic tsc_tick runs > - takes shadow clock lock > - interrupted by NMI > - Dtrace probe called > - try to read time base/check > the clock lock twice, fails. > - try to read shadow time base, > check shadow clock lock. > - interrupted by NMI > - Dtrace probe called > - try to read time base/check > the clock lock twice, fails. > - try to read shadow time base, > check shadow clock lock. > > -> 4-way deadlock, NMIs will wait > for clock and shadow clock locks forever > because they are nested over both of these > locks. > > It should be rare, but it seems possible."Rare" is not quite the word for it. You are requiring an NMI on a six instruction window -- one that is executed only once per second -- and _another_, _concurrent_ NMI on a disjoint CPU, itself in a relatively small window (hres_lock held). And then both must of course make calls into functions instrumented by DTrace. This is no easy trick, as we use NMI in Solaris primarily to enter the kernel debugger (speaking for the product line that I''m working on, we use NMI exclusively to force an operating system panic). So while perhaps theoretically possible, the other elements of the system make this problem pragmatically impossible to hit. If we felt that there were a non-zero chance of hitting this, we could -- as you point out -- easily duplicate the double shadow lock solution that we used for hrestime...> > So, unless I''ve missed some critical element of your argument, there is > > no deadlock here -- and I''d ask that you kindly reflect that fact in your > > work. Please let me know if you have additional questions -- and should > > you wish me to review a draft of the paper that you are preparing, I am > > at your service... > > > > I guess we were simply talking about two different locks. I look forward > to your reply. I will be honored to send you a draft of my paper for > review when I get the last bits and pieces in place.I''d be happy to; send it my way when you have a draft. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Sun Microsystems Fishworks. http://blogs.sun.com/bmc