thr3ads.net - Xapian devel - [Xapian-devel] Full MVCC in Brass [Aug 2014]

If this information is useful, please help other people find it:
Share via:

Austin Clements

2014-Aug-20 01:19 UTC

[Xapian-devel] Full MVCC in Brass

I'm one of the developers of Notmuch, an email client that makes
extensive use of Xapian.  For some time, full multi-version
concurrency control (MVCC) has been on our wish list.  Olly mentioned
that brass now uses free lists---a key step toward full MVCC---and I
offered to pitch in on putting the other pieces into place.

Before I wade too deep into this, I wanted to propose my plan.


Full MVCC would enable Xapian to keep any database revision valid as
long as any reader is using it, thus eliminating the dreaded
DatabaseModifiedError and simplifying application logic.

The high level idea is to use file range locking to represent locks on
database revisions.  Readers will acquire a shared lock on the latest
revision.  Writers will acquire an exclusive lock from revision 0 to
the highest revision they can lock, up to but not including the latest
committed version.  Writers are then prevented from allocating (and
hence overwriting) blocks unless they were freed within the writer's
locked revision range.

A detailed design follows.

Locking changes
---------------

On Unix platforms, Xapian already uses fcntl byte range locking to
lock byte 0 of "flintlock" to protect against concurrent writers.
Create a new "brasslock" that extends this to use byte R+1 to
represent a lock on revision R.

1) In the writer: Lock byte 0 like usual.  Then let RSAFE be the
current database revision minus 1 (omitting the latest revision
ensures it's available to readers that start during this writer.)  Use
F_GETLK to test for conflicting shared locks on revision range [0,
RSAFE].  If F_GETLK returns a conflicting lock, set RSAFE to one less
than the lower bound of the conflicting revision range and retry.
(Multiple retries may be necessary because it is unspecified which of
multiple conflicting locks F_GETLK will return.)  Once a lockable
range has been found, use F_SETLK to acquire an exclusive lock on it.
This may fail in an unlikely race with a reader; if it does, continue
to back off RSAFE with F_GETLK.  Remember RSAFE.

2) In the reader: Once the version file has been read, attempt to
acquire a shared lock on the current revision.  This may fail if a new
revision is committed after reading the version file; if it does,
re-read the version file and try again.

This has the annoying side-effect of requiring readers to start a
lock-holder subprocess like writers currently do.  I propose also
adding support for F_OFD_SETLK, which won't require a lock-holder
subprocess for either readers or writers.  F_OFD_SETLK was added in
Linux 3.12 (which will be in the next Debian stable) and will
hopefully appear in the next version of POSIX.1, so with any luck it
will become widely available in the future.  If F_OFD_SETLK isn't
available, Xapian can fall back to what it does currently.

Windows has LockFileEx, which is essentially the same as F_OFD_SETLK.

I don't know what to do about EMX or if we care.

Free list changes
-----------------

For free blocks, Xapian will need to keep track of which revision each
block was freed at to know when it's safe to reuse that block.  The
most efficient place to store this is in the on-disk free list itself.
The free list is currently a FIFO queue of block numbers ordered from
oldest free revision to newest free revision.  This is already the
desired order; we just have to add revision information.

1) Add an additional free list entry of the form {pack_uint4(-2)
pack_uint4(revision)} that indicates all block numbers up to the next
revision marker were freed in 'revision'.

2) Track the free revision of the entry at the head of the free list
in BrassFreeList and serialize this to the RootInfo metadata.  When
get_block encounters a revision entry, it updates the head revision
and moves on to the next free list entry.

3) In mark_block_unused, if this is the first block added since the
free list was loaded or committed, then before adding the block
number, add a revision entry with the uncommitted revision number.

4) Modify get_block to take the maximum reusable revision.  If the
head revision exceeds the reusable revision, return an unused block
rather than the next block on the free list.  (This supersedes the
current fl_end check done in get_block and walk.)

5) When a writer calls get_block, pass RSAFE as the maximum reusable
revision.

Bonus points
------------

When a writer exhausts free blocks that are available in the locked
revisions, but there are more available in revisions it couldn't lock,
it could attempt to extend its range of locked revisions.  If a reader
has since closed the database, this could allow the writer to reuse
more blocks rather than extend the database file.  This could be
important for long-running writers.  This would be easy with
F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a
dedicated lock helper binary).

In DANGEROUS mode, the writer could lock the latest committed revision
as well, thus preventing readers from even attempting to open the
database while it's being stomped on.

Double bonus points
-------------------

In the above design, a long-running reader could prevent writers from
reusing blocks that were both allocated and freed after that reader's
locked revision.  We could solve this by having the writer lock all of
the revision ranges it can, not just the range starting at revision 0.
Then, it could reuse any block whose revision and freed revision both
lie in a contiguous range of locked revisions.  This complicates the
locking procedure and substantially complicates the free list, since
it would no longer be a simple FIFO queue.

Olly Betts

2014-Aug-20 03:00 UTC

head link

[Xapian-devel] Full MVCC in Brass

Some brief initial comments, before I get a chance to study the
more complex parts:

On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements
wrote:> This has the annoying side-effect of requiring readers to start a
> lock-holder subprocess like writers currently do.  I propose also
> adding support for F_OFD_SETLK, which won't require a lock-holder
> subprocess for either readers or writers.  F_OFD_SETLK was added in
> Linux 3.12 (which will be in the next Debian stable) and will
> hopefully appear in the next version of POSIX.1, so with any luck it
> will become widely available in the future.  If F_OFD_SETLK isn't
> available, Xapian can fall back to what it does currently.
F_OFD_SETLK looks like exactly what we've longed wished F_SETLK meant.

It'd be good to do that as a separate patch, as I think it would be
worth backporting to 1.2.x.

The ticket submitting this to POSIX took a little effort to find, so
here it is:
http://austingroupbugs.net/view.php?id=768
> Windows has LockFileEx, which is essentially the same as F_OFD_SETLK.
Sounds good.

We might want to keep the current locking code for chert there - need to
think through if locking between 1.2 using the current approach and
trunk using LockFileEx() would work without problems (we particularly
want to avoid two writers managing to hold a lock at the same time).

But if that's hard, I think just documenting that you shouldn't mix
1.2.x and newer versions on Windows would do.
> I don't know what to do about EMX or if we care.
While OS/2 seems to live on, EMX looks pretty dead - last release was
over 13 years ago.  We've already decided that requiring a compiler
with good C++11 support for trunk is reasonable, so the one in EMX
clearly won't cut it.

There's also code to just use flock() (for anything without fcntl()
locking), but I think we can probably just not support MVCC on platforms
without a suitable locking API.  I think we'd ideally want to allow
opting out of reader locking (e.g. if your searches run on a database
which isn't updated at the same time, the overhead of the reader locks
is unnecessary), so if there's no byte range locking API, we can use the
same mechanism.

I certainly think we shouldn't hold up progress on this because it isn't
possible to support everywhere, especially since it can be supported for
the vast majority of users.
> Bonus points
> ------------
> 
> When a writer exhausts free blocks that are available in the locked
> revisions, but there are more available in revisions it couldn't lock,
> it could attempt to extend its range of locked revisions.  If a reader
> has since closed the database, this could allow the writer to reuse
> more blocks rather than extend the database file.  This could be
> important for long-running writers.  This would be easy with
> F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a
> dedicated lock helper binary).
It would be beneficial to have a helper in other ways:

http://trac.xapian.org/ticket/502

But one option is to just not extend the lock range if we're using
F_SETLK.

Cheers,
    Olly

Adam Sjøgren

2014-Aug-20 08:22 UTC

head link

[Xapian-devel] Full MVCC in Brass

Austin Clements <aclements at csail.mit.edu> writes:
> Full MVCC would enable Xapian to keep any database revision valid as
> long as any reader is using it, thus eliminating the dreaded
> DatabaseModifiedError and simplifying application logic.
This would also make it easier to spool a consistent backup to tape of
an index that is being used (updated), right?


  Best regards,

    Adam

-- 
 "(Phew, it's hard to explain with my poor English.)"         Adam
Sj?gren
                                                         asjo at koldfront.dk

Olly Betts

2014-Aug-20 09:12 UTC

head link

[Xapian-devel] Full MVCC in Brass

On Wed, Aug 20, 2014 at 10:22:57AM +0200, Adam Sj?gren
wrote:> Austin Clements <aclements at csail.mit.edu> writes:
> 
> > Full MVCC would enable Xapian to keep any database revision valid as
> > long as any reader is using it, thus eliminating the dreaded
> > DatabaseModifiedError and simplifying application logic.
> 
> This would also make it easier to spool a consistent backup to tape of
> an index that is being used (updated), right?
If you mean for copying at the filesystem level, not really.

If you had a read lock held for all the time the backup was happening,
you should ensure the backup contained enough information to recreate a
working database from, but there's still no guarantee the backup itself
would be a valid database.

You could produce a dump of the entries from each table using
this feature, and then restore by reinserting those - this would
need two special tools, and a way to get data from them to/from the
backup (saving it to a file seems clumsy, but would work if you have
spare disk space comparable in size to the database).  The backup and
restore operation would be comparable in speed to compaction + the time
to do a file-level backup and restore (it would be very like compaction,
except with the data being saved and restored in the middle).

For a large and actively updated database, keeping the revision around
while it gets backed up could inflate the database size quite a bit.
That space would get reused after the backup is taken though.

Another approach to allowing backups of a live database would probably
be to make use of the replication changesets.  If you just backup the
database at the filesystem level (without worrying about locks) then
also backup any changesets created while the backup was running (and the
next one created afterwards), then you should just be able to replay the
changesets onto the database to restore.

If you're replicating already, this is clearly a nice approach.  Not
sure how it would compare if you weren't.

Cheers,
    Olly

Austin Clements

2014-Aug-21 02:40 UTC

head link

[Xapian-devel] Full MVCC in Brass

On Tue, 19 Aug 2014, Olly Betts <olly at survex.com>
wrote:> Some brief initial comments, before I get a chance to study the
> more complex parts:
>
> On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements wrote:
>> This has the annoying side-effect of requiring readers to start a
>> lock-holder subprocess like writers currently do.  I propose also
>> adding support for F_OFD_SETLK, which won't require a lock-holder
>> subprocess for either readers or writers.  F_OFD_SETLK was added in
>> Linux 3.12 (which will be in the next Debian stable) and will
>> hopefully appear in the next version of POSIX.1, so with any luck it
>> will become widely available in the future.  If F_OFD_SETLK isn't
>> available, Xapian can fall back to what it does currently.
>
> F_OFD_SETLK looks like exactly what we've longed wished F_SETLK meant.
I think that goes for anyone who has ever used F_SETLK.  :)
> It'd be good to do that as a separate patch, as I think it would be
> worth backporting to 1.2.x.
Since the locking algorithm is rather different for MVCC, I was planning
to implement it in a new BrassLock class.  However, orthogonal to that,
I don't think it would be hard to also add F_OFD_SETLK support in
FlintLock and backport that.
> The ticket submitting this to POSIX took a little effort to find, so
> here it is:
> http://austingroupbugs.net/view.php?id=768
>
>> Windows has LockFileEx, which is essentially the same as F_OFD_SETLK.
>
> Sounds good.
>
> We might want to keep the current locking code for chert there - need to
> think through if locking between 1.2 using the current approach and
> trunk using LockFileEx() would work without problems (we particularly
> want to avoid two writers managing to hold a lock at the same time).
>
> But if that's hard, I think just documenting that you shouldn't mix
> 1.2.x and newer versions on Windows would do.
I haven't seen anything saying that LockFileEx interacts with CreateFile
locking, but if this goes in a new BrassLock, then it shouldn't be an
issue because we'll always use LockFileEx for brass databases.
>> I don't know what to do about EMX or if we care.
>
> While OS/2 seems to live on, EMX looks pretty dead - last release was
> over 13 years ago.  We've already decided that requiring a compiler
> with good C++11 support for trunk is reasonable, so the one in EMX
> clearly won't cut it.
>
> There's also code to just use flock() (for anything without fcntl()
> locking), but I think we can probably just not support MVCC on platforms
> without a suitable locking API.  I think we'd ideally want to allow
> opting out of reader locking (e.g. if your searches run on a database
> which isn't updated at the same time, the overhead of the reader locks
> is unnecessary), so if there's no byte range locking API, we can use
the
> same mechanism.
Okay.  If we're on a platform that can't support MVCC, should an attempt
to open the database for read *without* opting out fail?  Alternatively,
it could fall back to old-fashioned read locking (blocking writers) so
that readers still get the guarantee that DatabaseModifiedError will
never happen (and they could still opt out of all read locking and deal
with the consequences).
> I certainly think we shouldn't hold up progress on this because it
isn't
> possible to support everywhere, especially since it can be supported for
> the vast majority of users.
>
>> Bonus points
>> ------------
>> 
>> When a writer exhausts free blocks that are available in the locked
>> revisions, but there are more available in revisions it couldn't
lock,
>> it could attempt to extend its range of locked revisions.  If a reader
>> has since closed the database, this could allow the writer to reuse
>> more blocks rather than extend the database file.  This could be
>> important for long-running writers.  This would be easy with
>> F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a
>> dedicated lock helper binary).
>
> It would be beneficial to have a helper in other ways:
>
> http://trac.xapian.org/ticket/502
>
> But one option is to just not extend the lock range if we're using
> F_SETLK.
I think for a first shot I just won't do lock extension.  The next step
would be to do it unless we're using F_SETLK.  Then the last step would
be to add a lock helper.  Steps 2 and 3 are backwards-compatible, so
they could be done at any point in the future.
> Cheers,
>     Olly

Austin Clements

2014-Aug-21 02:46 UTC

head link

[Xapian-devel] Full MVCC in Brass

On Wed, 20 Aug 2014, Adam Sj?gren <asjo at koldfront.dk>
wrote:> Austin Clements <aclements at csail.mit.edu> writes:
>
>> Full MVCC would enable Xapian to keep any database revision valid as
>> long as any reader is using it, thus eliminating the dreaded
>> DatabaseModifiedError and simplifying application logic.
>
> This would also make it easier to spool a consistent backup to tape of
> an index that is being used (updated), right?
As Olly mentioned, this wouldn't ensure that a copy straight from the
filesystem would be a valid database.  However, it would be pretty
close.  The problem is that, even if you're holding a read lock on a
given revision, the database metadata in "iambrass" may no longer be
consistent with that revision.  However, I think if you were to save the
iambrass file as it was when the reader opened and locked its revision,
you could then safely copy the remaining files directly from the file
system.

Olly Betts

2014-Sep-02 04:38 UTC

head link

[Xapian-devel] Full MVCC in Brass

On Tue, Aug 19, 2014 at 09:19:12PM -0400, Austin Clements
wrote:> Before I wade too deep into this, I wanted to propose my plan.[snip]

I've just read through this carefully, and it all sounds good to me.

I'm currently working on merging some changes into brass, but they
shouldn't touch the same areas as this, so don't be concerned if
you merge from master and find changes to brass flooding in.

Cheers,
    Olly

Xapian devel - Aug 2014 - Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass

[Xapian-devel] Full MVCC in Brass