I'm one of the developers of Notmuch, an email client that makes extensive use of Xapian. For some time, full multi-version concurrency control (MVCC) has been on our wish list. Olly mentioned that brass now uses free lists---a key step toward full MVCC---and I offered to pitch in on putting the other pieces into place. Before I wade too deep into this, I wanted to propose my plan. Full MVCC would enable Xapian to keep any database revision valid as long as any reader is using it, thus eliminating the dreaded DatabaseModifiedError and simplifying application logic. The high level idea is to use file range locking to represent locks on database revisions. Readers will acquire a shared lock on the latest revision. Writers will acquire an exclusive lock from revision 0 to the highest revision they can lock, up to but not including the latest committed version. Writers are then prevented from allocating (and hence overwriting) blocks unless they were freed within the writer's locked revision range. A detailed design follows. Locking changes --------------- On Unix platforms, Xapian already uses fcntl byte range locking to lock byte 0 of "flintlock" to protect against concurrent writers. Create a new "brasslock" that extends this to use byte R+1 to represent a lock on revision R. 1) In the writer: Lock byte 0 like usual. Then let RSAFE be the current database revision minus 1 (omitting the latest revision ensures it's available to readers that start during this writer.) Use F_GETLK to test for conflicting shared locks on revision range [0, RSAFE]. If F_GETLK returns a conflicting lock, set RSAFE to one less than the lower bound of the conflicting revision range and retry. (Multiple retries may be necessary because it is unspecified which of multiple conflicting locks F_GETLK will return.) Once a lockable range has been found, use F_SETLK to acquire an exclusive lock on it. This may fail in an unlikely race with a reader; if it does, continue to back off RSAFE with F_GETLK. Remember RSAFE. 2) In the reader: Once the version file has been read, attempt to acquire a shared lock on the current revision. This may fail if a new revision is committed after reading the version file; if it does, re-read the version file and try again. This has the annoying side-effect of requiring readers to start a lock-holder subprocess like writers currently do. I propose also adding support for F_OFD_SETLK, which won't require a lock-holder subprocess for either readers or writers. F_OFD_SETLK was added in Linux 3.12 (which will be in the next Debian stable) and will hopefully appear in the next version of POSIX.1, so with any luck it will become widely available in the future. If F_OFD_SETLK isn't available, Xapian can fall back to what it does currently. Windows has LockFileEx, which is essentially the same as F_OFD_SETLK. I don't know what to do about EMX or if we care. Free list changes ----------------- For free blocks, Xapian will need to keep track of which revision each block was freed at to know when it's safe to reuse that block. The most efficient place to store this is in the on-disk free list itself. The free list is currently a FIFO queue of block numbers ordered from oldest free revision to newest free revision. This is already the desired order; we just have to add revision information. 1) Add an additional free list entry of the form {pack_uint4(-2) pack_uint4(revision)} that indicates all block numbers up to the next revision marker were freed in 'revision'. 2) Track the free revision of the entry at the head of the free list in BrassFreeList and serialize this to the RootInfo metadata. When get_block encounters a revision entry, it updates the head revision and moves on to the next free list entry. 3) In mark_block_unused, if this is the first block added since the free list was loaded or committed, then before adding the block number, add a revision entry with the uncommitted revision number. 4) Modify get_block to take the maximum reusable revision. If the head revision exceeds the reusable revision, return an unused block rather than the next block on the free list. (This supersedes the current fl_end check done in get_block and walk.) 5) When a writer calls get_block, pass RSAFE as the maximum reusable revision. Bonus points ------------ When a writer exhausts free blocks that are available in the locked revisions, but there are more available in revisions it couldn't lock, it could attempt to extend its range of locked revisions. If a reader has since closed the database, this could allow the writer to reuse more blocks rather than extend the database file. This could be important for long-running writers. This would be easy with F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a dedicated lock helper binary). In DANGEROUS mode, the writer could lock the latest committed revision as well, thus preventing readers from even attempting to open the database while it's being stomped on. Double bonus points ------------------- In the above design, a long-running reader could prevent writers from reusing blocks that were both allocated and freed after that reader's locked revision. We could solve this by having the writer lock all of the revision ranges it can, not just the range starting at revision 0. Then, it could reuse any block whose revision and freed revision both lie in a contiguous range of locked revisions. This complicates the locking procedure and substantially complicates the free list, since it would no longer be a simple FIFO queue.
Some brief initial comments, before I get a chance to study the more complex parts: On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements wrote:> This has the annoying side-effect of requiring readers to start a > lock-holder subprocess like writers currently do. I propose also > adding support for F_OFD_SETLK, which won't require a lock-holder > subprocess for either readers or writers. F_OFD_SETLK was added in > Linux 3.12 (which will be in the next Debian stable) and will > hopefully appear in the next version of POSIX.1, so with any luck it > will become widely available in the future. If F_OFD_SETLK isn't > available, Xapian can fall back to what it does currently.F_OFD_SETLK looks like exactly what we've longed wished F_SETLK meant. It'd be good to do that as a separate patch, as I think it would be worth backporting to 1.2.x. The ticket submitting this to POSIX took a little effort to find, so here it is: http://austingroupbugs.net/view.php?id=768> Windows has LockFileEx, which is essentially the same as F_OFD_SETLK.Sounds good. We might want to keep the current locking code for chert there - need to think through if locking between 1.2 using the current approach and trunk using LockFileEx() would work without problems (we particularly want to avoid two writers managing to hold a lock at the same time). But if that's hard, I think just documenting that you shouldn't mix 1.2.x and newer versions on Windows would do.> I don't know what to do about EMX or if we care.While OS/2 seems to live on, EMX looks pretty dead - last release was over 13 years ago. We've already decided that requiring a compiler with good C++11 support for trunk is reasonable, so the one in EMX clearly won't cut it. There's also code to just use flock() (for anything without fcntl() locking), but I think we can probably just not support MVCC on platforms without a suitable locking API. I think we'd ideally want to allow opting out of reader locking (e.g. if your searches run on a database which isn't updated at the same time, the overhead of the reader locks is unnecessary), so if there's no byte range locking API, we can use the same mechanism. I certainly think we shouldn't hold up progress on this because it isn't possible to support everywhere, especially since it can be supported for the vast majority of users.> Bonus points > ------------ > > When a writer exhausts free blocks that are available in the locked > revisions, but there are more available in revisions it couldn't lock, > it could attempt to extend its range of locked revisions. If a reader > has since closed the database, this could allow the writer to reuse > more blocks rather than extend the database file. This could be > important for long-running writers. This would be easy with > F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a > dedicated lock helper binary).It would be beneficial to have a helper in other ways: http://trac.xapian.org/ticket/502 But one option is to just not extend the lock range if we're using F_SETLK. Cheers, Olly
Austin Clements <aclements at csail.mit.edu> writes:> Full MVCC would enable Xapian to keep any database revision valid as > long as any reader is using it, thus eliminating the dreaded > DatabaseModifiedError and simplifying application logic.This would also make it easier to spool a consistent backup to tape of an index that is being used (updated), right? Best regards, Adam -- "(Phew, it's hard to explain with my poor English.)" Adam Sj?gren asjo at koldfront.dk
On Wed, Aug 20, 2014 at 10:22:57AM +0200, Adam Sj?gren wrote:> Austin Clements <aclements at csail.mit.edu> writes: > > > Full MVCC would enable Xapian to keep any database revision valid as > > long as any reader is using it, thus eliminating the dreaded > > DatabaseModifiedError and simplifying application logic. > > This would also make it easier to spool a consistent backup to tape of > an index that is being used (updated), right?If you mean for copying at the filesystem level, not really. If you had a read lock held for all the time the backup was happening, you should ensure the backup contained enough information to recreate a working database from, but there's still no guarantee the backup itself would be a valid database. You could produce a dump of the entries from each table using this feature, and then restore by reinserting those - this would need two special tools, and a way to get data from them to/from the backup (saving it to a file seems clumsy, but would work if you have spare disk space comparable in size to the database). The backup and restore operation would be comparable in speed to compaction + the time to do a file-level backup and restore (it would be very like compaction, except with the data being saved and restored in the middle). For a large and actively updated database, keeping the revision around while it gets backed up could inflate the database size quite a bit. That space would get reused after the backup is taken though. Another approach to allowing backups of a live database would probably be to make use of the replication changesets. If you just backup the database at the filesystem level (without worrying about locks) then also backup any changesets created while the backup was running (and the next one created afterwards), then you should just be able to replay the changesets onto the database to restore. If you're replicating already, this is clearly a nice approach. Not sure how it would compare if you weren't. Cheers, Olly
On Tue, 19 Aug 2014, Olly Betts <olly at survex.com> wrote:> Some brief initial comments, before I get a chance to study the > more complex parts: > > On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements wrote: >> This has the annoying side-effect of requiring readers to start a >> lock-holder subprocess like writers currently do. I propose also >> adding support for F_OFD_SETLK, which won't require a lock-holder >> subprocess for either readers or writers. F_OFD_SETLK was added in >> Linux 3.12 (which will be in the next Debian stable) and will >> hopefully appear in the next version of POSIX.1, so with any luck it >> will become widely available in the future. If F_OFD_SETLK isn't >> available, Xapian can fall back to what it does currently. > > F_OFD_SETLK looks like exactly what we've longed wished F_SETLK meant.I think that goes for anyone who has ever used F_SETLK. :)> It'd be good to do that as a separate patch, as I think it would be > worth backporting to 1.2.x.Since the locking algorithm is rather different for MVCC, I was planning to implement it in a new BrassLock class. However, orthogonal to that, I don't think it would be hard to also add F_OFD_SETLK support in FlintLock and backport that.> The ticket submitting this to POSIX took a little effort to find, so > here it is: > http://austingroupbugs.net/view.php?id=768 > >> Windows has LockFileEx, which is essentially the same as F_OFD_SETLK. > > Sounds good. > > We might want to keep the current locking code for chert there - need to > think through if locking between 1.2 using the current approach and > trunk using LockFileEx() would work without problems (we particularly > want to avoid two writers managing to hold a lock at the same time). > > But if that's hard, I think just documenting that you shouldn't mix > 1.2.x and newer versions on Windows would do.I haven't seen anything saying that LockFileEx interacts with CreateFile locking, but if this goes in a new BrassLock, then it shouldn't be an issue because we'll always use LockFileEx for brass databases.>> I don't know what to do about EMX or if we care. > > While OS/2 seems to live on, EMX looks pretty dead - last release was > over 13 years ago. We've already decided that requiring a compiler > with good C++11 support for trunk is reasonable, so the one in EMX > clearly won't cut it. > > There's also code to just use flock() (for anything without fcntl() > locking), but I think we can probably just not support MVCC on platforms > without a suitable locking API. I think we'd ideally want to allow > opting out of reader locking (e.g. if your searches run on a database > which isn't updated at the same time, the overhead of the reader locks > is unnecessary), so if there's no byte range locking API, we can use the > same mechanism.Okay. If we're on a platform that can't support MVCC, should an attempt to open the database for read *without* opting out fail? Alternatively, it could fall back to old-fashioned read locking (blocking writers) so that readers still get the guarantee that DatabaseModifiedError will never happen (and they could still opt out of all read locking and deal with the consequences).> I certainly think we shouldn't hold up progress on this because it isn't > possible to support everywhere, especially since it can be supported for > the vast majority of users. > >> Bonus points >> ------------ >> >> When a writer exhausts free blocks that are available in the locked >> revisions, but there are more available in revisions it couldn't lock, >> it could attempt to extend its range of locked revisions. If a reader >> has since closed the database, this could allow the writer to reuse >> more blocks rather than extend the database file. This could be >> important for long-running writers. This would be easy with >> F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a >> dedicated lock helper binary). > > It would be beneficial to have a helper in other ways: > > http://trac.xapian.org/ticket/502 > > But one option is to just not extend the lock range if we're using > F_SETLK.I think for a first shot I just won't do lock extension. The next step would be to do it unless we're using F_SETLK. Then the last step would be to add a lock helper. Steps 2 and 3 are backwards-compatible, so they could be done at any point in the future.> Cheers, > Olly
On Wed, 20 Aug 2014, Adam Sj?gren <asjo at koldfront.dk> wrote:> Austin Clements <aclements at csail.mit.edu> writes: > >> Full MVCC would enable Xapian to keep any database revision valid as >> long as any reader is using it, thus eliminating the dreaded >> DatabaseModifiedError and simplifying application logic. > > This would also make it easier to spool a consistent backup to tape of > an index that is being used (updated), right?As Olly mentioned, this wouldn't ensure that a copy straight from the filesystem would be a valid database. However, it would be pretty close. The problem is that, even if you're holding a read lock on a given revision, the database metadata in "iambrass" may no longer be consistent with that revision. However, I think if you were to save the iambrass file as it was when the reader opened and locked its revision, you could then safely copy the remaining files directly from the file system.
On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements wrote:> Before I wade too deep into this, I wanted to propose my plan.[snip] I've just read through this carefully, and it all sounds good to me. I'm currently working on merging some changes into brass, but they shouldn't touch the same areas as this, so don't be concerned if you merge from master and find changes to brass flooding in. Cheers, Olly