On Thu, 25 Feb 2016 23:37:52 +0000, Olly Betts <olly at survex.com> wrote:
> On Thu, Feb 25, 2016 at 05:21:17PM +0100, Eric J wrote:
> > On Thu, 25 Feb 2016 02:24:51 +0000, Olly Betts <olly at survex.com> wrote:
> > > It's clearly not as simple as execl() always releasing the lock, but I
> > > don't think we've ruled out the OS entirely yet - the above isn't
> > > exactly equivalent to the Tcl code, as the two databases are created by
> > > the same process in Tcl but different processes with simpleindex.
> >
> > but the same problem happens from two different Tcl processes - both
> > succeed because there is no lock.
>
> Ah, OK - I missed that detail.
>
> > Finally, it appears that it does work with Tcl 8.5 (actually a tclkit,
> > but does not work with an 8.6 tclkit).
>
> I'm testing with Tcl 8.6 (Debian package 8.6.4+dfsg-3), and it works for
> me.
>
> So it does seem it must be due to something your Tcl interpreter is
> doing, but I'm struggling to think what that could be.
>
> If O_CLOEXEC was set on the lock fd when execl() was called, the fd
> would get closed and the lock released. But your lsof shows the fd open
> but not locked in the child process after it has exec-ed cat.
>
> If there were a second fd open on the lock file which gets closed
> in the child process after the lock is taken, that would release the
> lock. But we carefully close all other open fds before taking the
> lock to avoid that.

I have tried Tcl 8.6.4 now, and it too has the problem. However with the
very new Tcl 8.6.5rc2 it works! I still intend to try to find out what
the problem was, but I can use the 8.5 tclkit for what I was doing when
this all started, and then move to 8.6.5 when it becomes a real release.

Thanx very much,

Eric
--
ms fnd in a lbry
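The "second fd" case described in the quoted text above can be reproduced in isolation. The following is only a minimal sketch in Python, not Xapian's or Tcl's code, and it assumes a Unix system with classic POSIX fcntl() record locks. The names `lock_is_free` and `second_fd_demo` are made up for illustration. It takes a lock on one descriptor, opens and closes a second descriptor on the same file, and asks a forked child whether the lock is still held.

```python
import fcntl
import os
import tempfile


def lock_is_free(path):
    """Fork a child that attempts a non-blocking exclusive lock on path.

    Returns True if the child got the lock, i.e. no other process holds it.
    (Hypothetical helper for this sketch only.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: a separate process, so any lock held by the parent conflicts.
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            status = 1
        os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def second_fd_demo():
    """Return (held_before, held_after) around closing a second descriptor."""
    with tempfile.NamedTemporaryFile() as tmp:
        lock_fd = os.open(tmp.name, os.O_RDWR)
        fcntl.lockf(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        held_before = not lock_is_free(tmp.name)

        # Open and close a second descriptor on the same file. With
        # classic POSIX locks this drops every lock the process holds
        # on that file, even though lock_fd itself stays open.
        other_fd = os.open(tmp.name, os.O_RDONLY)
        os.close(other_fd)
        held_after = not lock_is_free(tmp.name)

        os.close(lock_fd)
        return held_before, held_after


if __name__ == "__main__":
    print(second_fd_demo())
```

This is the reason for closing every other descriptor before taking the lock: once the lock is held, any later close() of a descriptor on the same file in that process silently releases it.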
On Sat, 27 Feb 2016 19:39:11 +0100 (CET), Eric J <eric at deptj.eu> wrote:
> On Thu, 25 Feb 2016 23:37:52 +0000, Olly Betts <olly at survex.com> wrote:

8>< --------

>> I'm testing with Tcl 8.6 (Debian package 8.6.4+dfsg-3), and it works for
>> me.
>>
>> So it does seem it must be due to something your Tcl interpreter is
>> doing, but I'm struggling to think what that could be.
>>
>> If O_CLOEXEC was set on the lock fd when execl() was called, the fd
>> would get closed and the lock released. But your lsof shows the fd open
>> but not locked in the child process after it has exec-ed cat.
>>
>> If there were a second fd open on the lock file which gets closed
>> in the child process after the lock is taken, that would release the
>> lock. But we carefully close all other open fds before taking the
>> lock to avoid that.
>
> I have tried Tcl 8.6.4 now, and it too has the problem. However with the
> very new Tcl 8.6.5rc2 it works! I still intend to try to find out what
> the problem was, but I can use the 8.5 tclkit for what I was doing when
> this all started, and then move to 8.6.5 when it becomes a real release.

After some more experiments, and some help from the Tcl side, I can now
say that database locks from the Tcl bindings will not function
correctly in the following Tcl versions:

    8.5.18 built with threads (not the default)
    8.6.[1-4] built with threads (default)

but will function correctly in the following Tcl versions:

    8.5.18 built without threads (default)
    8.5.19 built with or without threads
    8.6.[1-4] built without threads (not the default)
    8.6.5 built with or without threads

Earlier 8.5.x are presumably the same as 8.5.18.

This all seems (just my own theory, not proven) to be a collision of
corner cases:

* fork + exec being expected to need to preserve a file lock.
* early creation of a notifier thread expected to be without undesirable
  side-effects

Anyway, the answer is to use Tcl versions as above, or to use
Xapian/kernel combinations where OFD locks are available.

Eric
--
ms fnd in a lbry
On Tue, Mar 01, 2016 at 07:02:03PM +0100, Eric J wrote:
> After some more experiments, and some help from the Tcl side, I can now
> say that database locks from the Tcl bindings will not function
> correctly in the following Tcl versions:
>
> 8.5.18 built with threads (not the default)
> 8.6.[1-4] built with threads (default)
>
> but will function correctly in the following Tcl versions:
>
> 8.5.18 built without threads (default)
> 8.5.19 built with or without threads
> 8.6.[1-4] built without threads (not the default)
> 8.6.5 built with or without threads
>
> Earlier 8.5.x are presumably the same as 8.5.18.
>
> This all seems (just my own theory, not proven) to be a collision of
> corner cases:
>
> * fork + exec being expected to need to preserve a file lock.
> * early creation of a notifier thread expected to be without undesirable
>   side-effects

Looking at the code in tcl 8.5, I notice the notifier thread calls
pipe() and then later sets FD_CLOEXEC on the two fds (where supported,
pipe2(fds, O_CLOEXEC) would achieve that atomically):

http://sources.debian.net/src/tcl8.5/8.5.18-3/unix/tclUnixNotfy.c/#L1090

I'm not seeing exactly how, but I wonder if this interacts badly with
Xapian closing all unwanted fds in the child process, resulting in Tcl's
thread ending up setting FD_CLOEXEC on the lock file fd.

There seem to have been a number of fixes and fixes to those fixes to
this file in the last year, so it's hard to quickly see what's changed
and why, so I'm not sure why 8.6.5 works better, or if it's just that
the problem doesn't manifest as reliably there:

http://core.tcl.tk/tcl/finfo?name=unix/tclUnixNotfy.c

> Anyway, the answer is to use Tcl versions as above, or to use
> Xapian/kernel combinations where OFD locks are available.

OFD locks are a good answer for Linux, but sadly POSIX don't seem to be
steaming ahead with standardising them, and I don't know of any other
platforms offering them as an extension.
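As an aside on the pipe() then FD_CLOEXEC pattern mentioned above, the difference from pipe2(fds, O_CLOEXEC) can be seen from Python on Linux. This is only an illustrative sketch, not the Tcl notifier code. `os.pipe2` is not available on every platform, and the names `cloexec_set`, `two_step_pipe` and `atomic_pipe` are made up here.

```python
import fcntl
import os


def cloexec_set(fd):
    """Report whether FD_CLOEXEC is currently set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def two_step_pipe():
    """Create a pipe, then set FD_CLOEXEC afterwards (the two-step pattern)."""
    # os.pipe2(0) is used to mimic C pipe(): no flags are set at creation.
    # (Python's plain os.pipe() already sets close-on-exec for us.)
    r, w = os.pipe2(0)
    before = (cloexec_set(r), cloexec_set(w))

    # Window: the flag is applied to fd *numbers*. If a number were closed
    # and reused for another file between creation and this loop, the flag
    # would land on whatever now owns that number.
    for fd in (r, w):
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    after = (cloexec_set(r), cloexec_set(w))

    os.close(r)
    os.close(w)
    return before, after


def atomic_pipe():
    """Create a pipe with FD_CLOEXEC set atomically via pipe2()."""
    r, w = os.pipe2(os.O_CLOEXEC)
    state = (cloexec_set(r), cloexec_set(w))
    os.close(r)
    os.close(w)
    return state


if __name__ == "__main__":
    print(two_step_pipe())
    print(atomic_pipe())
```

Whether such a window is what actually bites here is unproven, as noted above. The sketch only shows that the two-step form has a window and the pipe2() form does not.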
We can't fix existing releases of Xapian (or Tcl) but a way to stop this
happening going forwards would be good.

For Tcl the simple fix would be just to document "if Tcl is built with
threads, you need to use Tcl >= 8.6.5" (assuming that 8.6.5 actually
fixes this). But being robust to arbitrary pthread_atfork() handlers
doing unhelpful stuff would be good too.

I don't see a way to kill off any other threads in the child process -
Linux has pthread_kill_other_threads_np() but it only does anything for
LinuxThreads, it's a no-op for NPTL. And it's mostly other platforms
we're concerned about anyway.

All I can really think of is replacing /bin/cat with a custom helper and
so take the lock after exec(). That adds extra overhead to failed
locking attempts, but that's not such a big deal, especially if OFD
locks get standardised since then it only affects older platforms.

Anyone got any better ideas?

Cheers,
    Olly
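The "custom helper which takes the lock after exec()" idea above could look roughly like the following. This is a rough sketch in Python rather than anything from Xapian, and it assumes a Unix system with POSIX fcntl() locks. The names `HELPER`, `take_lock`, `release_lock` and `helper_demo`, and the one-byte "y"/"n" reply, are invented for the example. The helper is a fresh process image, so nothing the parent's threads or atfork handlers did to its descriptors can affect the lock.

```python
import os
import subprocess
import sys
import tempfile

# Source of the hypothetical helper. It runs after exec(), opens the lock
# file itself, reports success or failure with one byte, then holds the
# lock until its stdin reaches EOF.
HELPER = r"""
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o666)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.stdout.write("n")
    sys.stdout.flush()
    sys.exit(1)
sys.stdout.write("y")
sys.stdout.flush()
sys.stdin.read()  # block until the parent closes the pipe
"""


def release_lock(proc):
    """Close the pipes so the helper sees EOF, exits and drops the lock."""
    proc.stdin.close()
    proc.stdout.close()
    proc.wait()


def take_lock(path):
    """Start a helper holding the lock on path; return it, or None on failure."""
    proc = subprocess.Popen(
        [sys.executable, "-c", HELPER, path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    if proc.stdout.read(1) == b"y":
        return proc
    # Failed attempt: this still cost a full process start, which is the
    # extra overhead mentioned above.
    release_lock(proc)
    return None


def helper_demo():
    """Take the lock, fail to take it again, release it, take it once more."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "lockfile")
        first = take_lock(path)
        second = take_lock(path)
        if first is not None:
            release_lock(first)
        third = take_lock(path)
        result = (first is not None, second is not None, third is not None)
        for proc in (second, third):
            if proc is not None:
                release_lock(proc)
        return result


if __name__ == "__main__":
    print(helper_demo())
```

The helper holds the lock for as long as the parent keeps the pipe open, so the lock is also released if the parent dies, which is the same property the /bin/cat approach relies on.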