thr3ads.net - Xapian devel - [Xapian-devel] postlist chunking [Aug 2004]

If this information is useful, please help other people find it:
Share via:

Olly Betts

2004-Aug-23 14:03 UTC

[Xapian-devel] postlist chunking

Postlists are split up into chunks, so that skip_to can avoid reading
all the postlist.

Currently the chunk threshold is 2048, but this is checked before adding
an entry, so the postlist chunk can actually grow a little larger.
Something like 2060 at most.  Unfortunately this isn't a good threshold
with the default blocksize (8192 bytes).

Internally the B-tree splits up items with a large tag.  It will always
make sure it can fit at least 4 items in a block.  There's some overhead
for the block header (11 + 2 per item) and there's 2 + 1 + 2 + 2 = 7
bytes of overhead in each item.  The upshot is that the maximum size of
key + tag is (8192 - 11 - 4 * 2) / 4 - 7 = 2036 bytes.

So I guessed at a bound for typical key length and allowed some for
overrunning the threshold and decided 2000 bytes would be a better
threshold.  I tried building a sample database with both versions:

2048 bytes:

real 115m44.698s
user 98m14.880s
sys 3m13.560s
-rw-r--r--    1 olly     olly     322043904 Aug 23 02:50 postlist_DB

2000 bytes:

real 112m5.970s
user 96m35.130s
sys 3m47.350s
-rw-r--r--    1 olly     olly     315490304 Aug 23 11:08 postlist_DB

The machine wasn't totally idle, but it has 2 CPUs so hopefully the
timings are reasonable comparable.  But the DB is 2% smaller, which
isn't bad for a 2 character source change!

I've not got a harness for testing query speed, but I would expect
that will be improved by this change as well.

So far, so good.

But how about calculating the max tag size taking into account the true
key length and imposing the threshold exactly?

Doing this actually appears to be noticeably worse than 2000, though a
bit better than 2048.  Not quite sure why - maybe random factors from
packing shorter tags make as much difference, or maybe I was off by one
(or a few) in my size calculations (I've double checked them, but
perhaps I misunderstood the structure).

Exact

real 114m3.236s
user 96m33.610s
sys 3m43.580s
-rw-r--r--    1 olly     olly     321634304 Aug 23 14:27 postlist_DB

The user and system times are both down, so perhaps the increased wall
clock time is because the box was just busier.  But the DB size is
definitely up from 2000.

Do we have anything which can inspect the B-tree structure at this low
level?  Quartzdump dumps out key/tag pairs, but split up tags have
already been reassembled.

Cheers,
    Olly

Olly Betts

2004-Aug-26 17:08 UTC

head link

[Xapian-devel] postlist chunking

On Mon, Aug 23, 2004 at 03:03:20PM +0100, Olly Betts
wrote:> But how about calculating the max tag size taking into account the true
> key length and imposing the threshold exactly?
> 
> Doing this actually appears to be noticeably worse than 2000, though a
> bit better than 2048.  Not quite sure why - maybe random factors from
> packing shorter tags make as much difference, or maybe I was off by one
> (or a few) in my size calculations (I've double checked them, but
> perhaps I misunderstood the structure).
I talked about this with Richard Boulton (in person) and he suggested
that the problem is that there will typically be lots of small key/tag
pairs and that chunking to exactly fill blocks can end up being
counterproductive because these small pairs can then easily cause a
chunk to just fail to fit at the end of the block.

That seems plausible, and 2048 is definitely a poor threshold, so I
think I'll just go with 2000 for now.  We should revisit this later and
probably get the postlist chunking and the lower level tag splitting to
actual talk to each other.

Cheers,
    Olly

Reasonably Related Threads

Search for more possibly parallel threads

Xapian devel - Aug 2004 - postlist chunking

[Xapian-devel] postlist chunking

[Xapian-devel] postlist chunking

Reasonably Related Threads