Mueen Nawaz <mueen at nawaz.org> writes:

> After a lot of poking around, I figured out the problem, and this may be
> of interest to the developers (although not sure if it is a xapian issue
> or a notmuch issue).
>
> Here's why it would freeze:
>
> I have a post-new hook that runs a Python script. Depending on whether
> the new email it is processing matches a rule I have, it will fire off
> an email to the sender using the SMTP library in Python.
>
> I had recently upgraded my MTA (PostFix), and it had a backward
> incompatible change that broke my config. I don't know why, but I could
> still send emails via Emacs, but when I tried to send them via Python,
> Postfix would log an error and it would not send. The Python statement
> would freeze (I guess Postfix doesn't return an appropriate response?
> Not sure why).
>
> I have a cron job to run "notmuch new" 3 times an hour. Since the hook
> was frozen, so was the notmuch new command. I had quite a lot of
> "notmuch new" processes. I assume this meant the DB was locked all this
> time for writing.

notmuch unlocks the database before running the hook, so I don't
understand how a hung hook results in a locked database. If it happens
again (or you're motivated to set up a testbed) I'd be interested in the
output of

    lsof ~/Maildir/.notmuch/xapian/flintlock

Also, is this by chance a network file system? Because those often break
locking.

> Now killing all those jobs did not fix the database. It was still
> broken. And as we saw the second time round, it was /really/ broken - it
> would not even open in read-only mode.

That seems like something the Xapian devs (in copy) might be interested
in fixing, if you could come up with a simple reproducer.

> It is scary that if a post-new hook freezes while the database is
> locked, it could (eventually) clobber the database. I don't know if
> notmuch can do anything to prevent this outcome?

notmuch could be cleverer about timing out on trying to acquire a
lock. I suspect it's a bit delicate to get that right, and I've been
hoping the underlying primitives would get a bit more flexible
w.r.t. locking.

We could also potentially run hooks in the equivalent of "timeout", but
I don't know how much code that would be. A simpler option (once we
understand what the real problem is) would be to suggest that users use
timeout themselves in hooks to be run unattended.
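To illustrate the "use timeout themselves" suggestion above, here is a minimal Python sketch of a wrapper that runs a command with a wall-clock limit. The function name `run_with_timeout` and the limits shown are hypothetical; this is not part of notmuch, only an example of what a user's own hook could do.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and wait at most `seconds` for it to finish.

    Returns the exit code, or None if the limit was hit. On timeout,
    subprocess.run() kills the child before raising TimeoutExpired.
    """
    try:
        completed = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: wrap whatever the hook would otherwise run directly.
    code = run_with_timeout([sys.executable, "-c", "print('hook body')"], 60)
    sys.exit(1 if code is None else code)
```

A shell hook could get a similar effect by prefixing the command with the coreutils `timeout` utility.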
David Bremner <david at tethera.net> writes:

>> Here's why it would freeze:
>>
>> I have a post-new hook that runs a Python script. Depending on
>> whether the new email it is processing matches a rule I have,
>> it will fire off an email to the sender using the SMTP library
>> in Python.
>>
>> I had recently upgraded my MTA (PostFix), and it had a backward
>> incompatible change that broke my config. I don't know why, but
>> I could still send emails via Emacs, but when I tried to send
>> them via Python, Postfix would log an error and it would not
>> send. The Python statement would freeze (I guess Postfix
>> doesn't return an appropriate response? Not sure why).
>>
>> I have a cron job to run "notmuch new" 3 times an hour. Since
>> the hook was frozen, so was the notmuch new command. I had
>> quite a lot of "notmuch new" processes. I assume this meant the
>> DB was locked all this time for writing.
>
> notmuch unlocks the database before running the hook, so I don't
> understand how a hung hook results in a locked database. If it
> happens again (or you're motivated to set up a testbed) I'd be
> interested in the output of

Well, it results in a locked database because I have this in the
(Python) hook:

    DATABASE = notmuch.Database(mode=notmuch.Database.MODE.READ_WRITE)

Soon after that I freeze the new messages. And at the end I thaw them
out. The hang occurs in between the two, I think.

> Also, is this by chance a network file system? Because those
> often break locking.

No - regular hard drive.

>> Now killing all those jobs did not fix the database. It was
>> still broken. And as we saw the second time round, it was
>> /really/ broken - it would not even open in read-only mode.
>
> That seems like something the Xapian devs (in copy) might be
> interested in fixing, if you could come up with a simple
> reproducer.

I can think of two experiments:

1. Write a hook that opens the database as above, and then just does
   nothing (e.g. while True). Let it run, say, for 24 hours. (Not sure
   if the "freeze" part is relevant.)

2. Same as the above, but have a cron job that fires "notmuch new"
   every 20 minutes. Each invocation will block on the database line
   above (all except the first invocation, which will be stuck at
   while True). After a day of this, check if you can open the
   database in READ_WRITE mode.

> notmuch could be cleverer about timing out on trying to acquire
> a lock. I suspect it's a bit delicate to get that right, and
> I've been hoping the underlying primitives would get a bit more
> flexible w.r.t. locking.

I agree having notmuch handle it is not ideal. I was originally
thinking there should be a default timeout that one can adjust as
needed. However, when someone does "notmuch new" to build a new
database, that can take several minutes. And others may have flows
very different from mine.

At the very least, we probably should know why the DB gets clobbered
at all.

-- 
Don't take life so seriously. It won't last.

          /\  /\               /\  /
         /  \/  \ u e e n     /  \/  a w a z
             >>>>>>mueen at nawaz.org<<<<<<
                        anl
Mueen Nawaz <mueen at nawaz.org> writes:

>
> DATABASE = notmuch.Database(mode=notmuch.Database.MODE.READ_WRITE)

OK. So your code is locking the database, and never unlocking it
(because of the hang). So that part is at least not mysterious.

> I can think of two experiments:

I was thinking more along the lines of something that could be part of
the notmuch test suite, i.e. run in a few seconds. Or at worst in 10
minutes or so to be usable to debug.

d
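Since the hang described above happens in the SMTP call while the database is held open for writing, one possible mitigation on the hook side is to give the SMTP connection a socket timeout. The sketch below uses only the standard `smtplib` module; the function name `send_reply` and the default host, port, and timeout are assumptions for illustration, not taken from the actual hook.

```python
import smtplib
from email.message import EmailMessage


def send_reply(msg, host="localhost", port=25, timeout=10.0):
    """Try to send msg via SMTP, giving up after `timeout` seconds.

    The timeout applies to the connection attempt and to each blocking
    socket operation, so a server that accepts the connection but never
    answers cannot stall the caller forever.
    Returns True on success, False on any socket or SMTP failure.
    """
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.send_message(msg)
    except (OSError, smtplib.SMTPException):
        return False
    return True


if __name__ == "__main__":
    reply = EmailMessage()
    reply["From"] = "me@example.org"      # placeholder addresses
    reply["To"] = "sender@example.org"
    reply["Subject"] = "Automatic reply"
    reply.set_content("Received.")
    print("sent" if send_reply(reply) else "not sent")
```

It may also help to send mail only after the database work is finished and the handle is closed, so that a slow mail server never overlaps with the write lock.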
On Mon, Sep 10, 2018 at 08:01:06AM -0300, David Bremner wrote:
> Mueen Nawaz <mueen at nawaz.org> writes:
> > Now killing all those jobs did not fix the database. It was still
> > broken. And as we saw the second time round, it was /really/ broken - it
> > would not even open in read-only mode.
>
> That seems like something the Xapian devs (in copy) might be interested
> in fixing, if you could come up with a simple reproducer.

I'm certainly happy to investigate if someone can provide a way for me
to make it happen on demand.

It doesn't make much sense to me that holding the lock alone could be
causing any sort of corruption - that's just an fcntl() lock.

I would suggest to make sure you're running Xapian 1.4.7 as that fixed
a cursor handling bug which affected notmuch. I didn't find a way to
make it corrupt on-disk data, but it's hard to be completely certain
that it couldn't ever do that, so ruling out that as a cause would be
good.

> notmuch could be cleverer about timing out on trying to acquire a
> lock. I suspect it's a bit delicate to get that right, and I've been
> hoping the underlying primitives would get a bit more flexible
> w.r.t. locking.

You mean in Xapian? If so, a wishlist bug saying what you're hoping for
might help it happen.

Cheers,
    Olly
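As a rough illustration of what "timing out on trying to acquire a lock" could look like for an fcntl() lock, the following Python sketch polls a non-blocking lock until a deadline. It is not Xapian's or notmuch's locking code or API; the function name `lock_with_deadline`, the polling interval, and the lock file path are all hypothetical.

```python
import errno
import fcntl
import time


def lock_with_deadline(path, timeout=5.0, poll=0.1):
    """Try to take an exclusive fcntl lock on `path` within `timeout` seconds.

    Returns the open file object holding the lock (closing it releases
    the lock), or None if another process still held it at the deadline.
    """
    f = open(path, "a+")  # creates the file if it does not exist
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f
        except OSError as exc:
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                f.close()
                raise
            if time.monotonic() >= deadline:
                f.close()
                return None
            time.sleep(poll)


if __name__ == "__main__":
    handle = lock_with_deadline("/tmp/example.lock", timeout=2.0)
    if handle is None:
        print("lock is busy, giving up")
    else:
        print("lock acquired")
        handle.close()
```

Polling is the simplest approach but wastes a little time per attempt; the "delicate" part mentioned in the thread is choosing a timeout that does not break legitimately long operations such as an initial "notmuch new".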