Samuel Williams
2011-Apr-07 14:56 UTC
[Xapian-devel] Problems with /bin/cat and flintlock?
Hi Guys,

I'm working on some integration project with Ruby, Rack, Apache, Phusion Passenger and Xapian.

I've been having intermittent issues with the flintlock code - it seems that the function FlintLock::lock is never returning and this is locking up the Ruby process. My guess is that Xapian is locking up in a system call and Ruby can't schedule its green threads.

I've done some basic debugging with strace and noticed the following:

    29944 30022 29942 29939 ? -1 Sl 33 0:09 | | \_ Passenger ApplicationSpawner: /srv/www/www.oriontransfer.co.nz
    30022 30041 29942 29939 ? -1 S  33 0:00 | | | \_ /bin/cat

[Using the following source code as a reference http://xapian.org/docs/sourcedoc/html/flint__lock_8cc_source.html]

At this point, using strace I found that the application process seemed to be stuck in on

    00219 ssize_t n = read(fds[0], &ch, 1);

Obviously child process was cat, nothing really interesting about that.

After I killed cat, then the process was freed up and the web application started responding again.

Well, I don't know why this is unreliable. I've briefly looked at the code and noticed a few things:

    00172 // Connect pipe to stdin and stdout.
    00173 dup2(fds[1], 0);
    00174 dup2(fds[1], 1);

Isn't this setting stdin and stdout to the same end of an existing pipe? Does this make sense?

Anyway, I thought I'd mention this because it is a consistent problem. If there is anything you think I should do with strace, gdb, etc on the processes next time it hangs, let me know.

One option to fix the bug without really understanding the real issue would be to use select in the parent thread, rather than read. Then, use a timeout of a few seconds so that if the child doesn't acquire the lock within x seconds, it is as good as failed.

Kind regards,
Samuel
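
The select-with-timeout idea at the end of the message above could be sketched roughly as follows. This is only an illustration of the proposal, not Xapian's code: the helper name `wait_for_byte`, its return convention and the whole-second timeout are assumptions made for the example.

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper (not part of Xapian): wait up to timeout_sec seconds
// for a single byte to arrive on fd.
// Returns 1 if a byte was read into *out, 0 on timeout, -1 on EOF or error.
int wait_for_byte(int fd, int timeout_sec, char *out) {
    // select() can only watch descriptors below FD_SETSIZE.
    if (fd < 0 || fd >= FD_SETSIZE) return -1;

    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);

        struct timeval tv;
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        int r = select(fd + 1, &readable, nullptr, nullptr, &tv);
        if (r < 0) {
            // Simplification: an interrupting signal restarts the full wait.
            if (errno == EINTR) continue;
            return -1;
        }
        if (r == 0) return 0;  // nothing arrived within the timeout

        ssize_t n = read(fd, out, 1);
        if (n == 1) return 1;
        if (n < 0 && errno == EINTR) continue;
        return -1;  // EOF (peer closed) or read error
    }
}
```

In the parent, a call such as `wait_for_byte(fds[0], 5, &ch)` would stand in for the blocking `read(fds[0], &ch, 1)`; a return of 0 would then have to be treated as a failed lock attempt, and the child would still need to be cleaned up.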
On Fri, Apr 08, 2011 at 02:56:22AM +1200, Samuel Williams wrote:
> I've been having intermittent issues with the flintlock code - it
> seems that the function FlintLock::lock is never returning and this is
> locking up the Ruby process.

What OS is this on? That's likely to be highly relevant.

> At this point, using strace I found that the application process
> seemed to be stuck in on
> 00219 ssize_t n = read(fds[0], &ch, 1);
>
> Obviously child process was cat, nothing really interesting about that.

The child process should send a single character before it execs /bin/cat, which is what the parent is waiting to read there. If the write() call in the child fails, then the child exits, so unless the OS fails to transfer the byte across the pipe, I struggle to see how we can end up in this situation.

> 00172 // Connect pipe to stdin and stdout.
> 00173 dup2(fds[1], 0);
> 00174 dup2(fds[1], 1);
>
> Isn't this setting stdin and stdout to the same end of an existing
> pipe? Does this make sense?

It's a bidirectional socket, so that's fine.

> Anyway, I thought I'd mention this because it is a consistent problem.
> If there is anything you think I should do with strace, gdb, etc on
> the processes next time it hangs, let me know.

It would be useful to attach gdb to the parent and child and do a backtrace in each (bt) to see exactly where we are.

> One option to fix the bug without really understanding the real issue
> would be to use select in the parent thread, rather than read. Then,
> use a timeout of a few seconds so that if the child doesn't acquire
> the lock within x seconds, it is as good as failed.

I'd prefer to understand the issue rather than paper over it. Locking is rather a critical operation to get right! Also, it's rather unclear what a suitable threshold is - you can use fcntl locking over NFS if you run the lock daemon, so a few seconds to get a lock is probably not impossible with a busy NFS server.

Cheers,
Olly
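
The two mechanisms described in this reply, the one-byte handshake and the reason `dup2()` of the same descriptor onto both stdin and stdout is fine, can be illustrated with a small standalone sketch. It assumes the "bidirectional socket" is an `AF_UNIX` stream `socketpair()`; the function name `socketpair_handshake_demo` is hypothetical, the fcntl lock itself is left out, and the child runs a small echo loop as a stand-in for exec'ing /bin/cat.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo (not Xapian's code). Returns true if the parent receives
// the handshake byte and a byte it sends is echoed back over the same socket.
bool socketpair_handshake_demo() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return false;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (child == 0) {
        // Child: keep only its end, and use it as both stdin and stdout.
        // This works because each end of a socketpair is readable and writable.
        close(fds[0]);
        dup2(fds[1], 0);
        dup2(fds[1], 1);

        // Handshake: one byte tells the parent the child is ready
        // (in the real code this follows taking the lock).
        if (write(1, "", 1) != 1) _exit(1);

        // Stand-in for exec'ing /bin/cat: echo stdin to stdout until EOF.
        char buf[64];
        ssize_t n;
        while ((n = read(0, buf, sizeof buf)) > 0) {
            if (write(1, buf, n) != n) _exit(1);
        }
        _exit(0);
    }

    // Parent: this read is the call the hung process was blocked in.
    close(fds[1]);
    char ch;
    bool ok = (read(fds[0], &ch, 1) == 1);

    if (ok) {
        // Show both directions work on the single descriptor.
        char reply = 0;
        ok = write(fds[0], "x", 1) == 1 &&
             read(fds[0], &reply, 1) == 1 && reply == 'x';
    }

    // Closing the parent's end gives the child EOF, so it exits.
    close(fds[0]);
    int status;
    waitpid(child, &status, 0);
    return ok;
}
```

If the handshake byte never arrives and the child stays alive, the parent's first `read()` blocks indefinitely, which matches the symptom in the strace output; killing the child closes its end of the socket and lets the read return.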