thr3ads.net - Xapian discuss - Storing the documents text: data record or value ? [Jan 2018]

Jean-Francois Dockes

2018-Jan-03 15:18 UTC

Storing the documents text: data record or value ?

Hi,

Following the Recoll snippets generation performance problem caused by the
new positions list storage scheme in Xapian 1.4, I am experimenting with
generating snippets from the complete document text stored in the index.

This increases the index size much less than I would have expected (around
10-15% apparently with my home directory data), which is good news
obviously.

I have tried storing the text in the data record, or in a value (after
compressing it). Storing in a value uses a tiny bit more space, I am
guessing because of the co-compression of related data occurring when
storing in the data record.

Seen from the outside, it would appear to make sense to use values, so that
code which needs to access the data record but not the full document text
does not pay a performance penalty.

I am wondering if there are other arguments for using either method ?

Cheers,

jf

Olly Betts

2018-Jan-04 05:42 UTC

On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
> Seen from the outside, it would appear to make sense to use values, so that
> code which needs to access the data record but not the full document text
> does not pay a performance penalty.
> 
> I am wondering if there are other arguments for using either method ?

I wouldn't recommend using a value to store large data - fundamentally
it's not what they're intended for, and that's likely to end up biting
you because design decisions get made based on their intended uses.

A minor current example is that the backend tracks upper and lower
bounds on all the values in a given slot, so you get a pointless (for
you) extra copy of the text of two documents, plus a lot of pointless
comparing of document texts to keep track of which is the largest and
smallest.  We've discussed tracking a binned distribution for each slot,
which would allow optimisations when sorting or doing value ranges, but
would mean more pointless overhead for your case.
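
To make the bounds overhead concrete, here is a toy model in Python. It is not Xapian's implementation; the `SlotBounds` class and the sample texts are made up for illustration, assuming only what is described above: each slot remembers its smallest and largest value, and every new value is compared against both.

```python
class SlotBounds:
    """Toy model of per-slot lower/upper bound tracking (not Xapian code)."""

    def __init__(self):
        self.lower = None
        self.upper = None
        self.comparisons = 0  # how many value-vs-bound comparisons were made

    def add(self, value: bytes) -> None:
        if self.lower is None:
            # First value in the slot is both the lower and the upper bound.
            self.lower = self.upper = value
            return
        self.comparisons += 2
        if value < self.lower:
            self.lower = value
        if value > self.upper:
            self.upper = value


# Three stand-in "document texts" of 4000 bytes each.
texts = [b"mmm " * 1000, b"aaa " * 1000, b"zzz " * 1000]
bounds = SlotBounds()
for text in texts:
    bounds.add(text)

# The two bounds are full copies of two of the stored texts.
extra_bytes = len(bounds.lower) + len(bounds.upper)
```

With whole document texts as values, `extra_bytes` grows with the size of the two extreme documents, and each added document costs two comparisons of large strings that serve no purpose here.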

If you want to store the document text separately, I'd put it in the
user metadata (build a key from the docid, ideally one which sorts in
the same order as the integer docids do so that append works very
efficiently - you could copy Xapian's pack_uint_preserving_sort() for
that).
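
A minimal sketch of such a key, in Python. It does not reproduce pack_uint_preserving_sort(); it uses a fixed-width big-endian encoding instead, which has the same property that byte-wise key order matches numeric docid order. The `doctext:` prefix, the function names and the 32-bit docid limit are assumptions made for this example.

```python
import struct

KEY_PREFIX = b"doctext:"  # hypothetical prefix, chosen for this example


def text_key(docid: int) -> bytes:
    """Build a metadata key whose byte order follows the docid order."""
    if not 0 < docid <= 0xFFFFFFFF:  # assumes 32-bit docids
        raise ValueError("docid out of range")
    # Big-endian, fixed width: most significant byte first, so comparing
    # keys byte by byte gives the same result as comparing the integers.
    return KEY_PREFIX + struct.pack(">I", docid)


def docid_from_key(key: bytes) -> int:
    """Recover the docid from a key built by text_key()."""
    if not key.startswith(KEY_PREFIX):
        raise ValueError("not a document text key")
    return struct.unpack(">I", key[len(KEY_PREFIX):])[0]


docids = [1, 2, 9, 10, 255, 256, 70000]
keys = [text_key(d) for d in docids]
```

A decimal string key would not work as well: "10" sorts before "9", so new entries would not always land at the end of the table. The resulting key would be what gets passed to set_metadata() and get_metadata().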

You'll want to compress the document text yourself (currently at least,
though I wonder if we should support transparent compression of user
metadata entries - mostly they aren't compressed because they're stored
in the postlist table which doesn't have transparent compression on
because it's unhelpful for updating postlist chunks, and currently
transparent compression is either on or off per table, but doing it
based on the type of entry wouldn't be hard).
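
As a sketch of doing the compression on the application side, the following uses Python's standard zlib module. The function names and the choice of UTF-8 and compression level are assumptions for the example, not something Xapian or Recoll prescribes.

```python
import zlib


def pack_text(text: str, level: int = 6) -> bytes:
    """Compress document text before storing it as a metadata entry."""
    return zlib.compress(text.encode("utf-8"), level)


def unpack_text(blob: bytes) -> str:
    """Reverse pack_text() after reading the entry back."""
    return zlib.decompress(blob).decode("utf-8")


sample = "Storing the document text in the index. " * 200
blob = pack_text(sample)
restored = unpack_text(blob)
```

The compressed bytes are what would be stored as the metadata entry; snippet generation would call `unpack_text()` on whatever it reads back.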

We could also add a way to read document data in chunks rather than
all at once, and then if you put the document text last in the document
data you should be able to read the other fields without much penalty.

Cheers,
    Olly

Jean-Francois Dockes

2018-Jan-04 19:02 UTC

Olly Betts writes:
 > On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
 > > Seen from the outside, it would appear to make sense to use values, so that
 > > code which needs to access the data record but not the full document text
 > > does not pay a performance penalty.
 > > 
 > > I am wondering if there are other arguments for using either method ?
 > 
 > I wouldn't recommend using a value to store large data - fundamentally
 > it's not what they're intended for, and that's likely to end up biting
 > you because design decisions get made based on their intended uses.
 > 
 > A minor current example is that the backend tracks upper and lower
 > bounds on all the values in a given slot, so you get a pointless (for
 > you) extra copy of the text of two documents, plus a lot of pointless
 > comparing of document texts to keep track of which is the largest and
 > smallest.  We've discussed tracking a binned distribution for each slot,
 > which would allow optimisations when sorting or doing value ranges, but
 > would mean more pointless overhead for your case.

Ok, no values then...

 > If you want to store the document text separately, I'd put it in the
 > user metadata (build a key from the docid, ideally one which sorts in
 > the same order as the integer docids do so that append works very
 > efficiently - you could copy Xapian's pack_uint_preserving_sort() for
 > that).
 > 
 > You'll want to compress the document text yourself (currently at least,
 > though I wonder if we should support transparent compression of user
 > metadata entries - mostly they aren't compressed because they're stored
 > in the postlist table which doesn't have transparent compression on
 > because it's unhelpful for updating postlist chunks, and currently
 > transparent compression is either on or off per table, but doing it
 > based on the type of entry wouldn't be hard).

The compression is not the problem (already doing it when storing in values).

What makes user metadata records less convenient is that they are not
linked to a Xapian document by Xapian itself. This makes several things
slightly more complicated.
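
One example of the extra bookkeeping, as a toy Python sketch: deleting a document no longer removes its text automatically, so the application has to remove the matching metadata entry itself. This is only a guess at the kind of complication meant here. The `TextStore` class is an in-memory stand-in, not the Xapian API, and its names and key format are hypothetical.

```python
import struct


class TextStore:
    """In-memory stand-in: `docs` plays the documents, `metadata` the user metadata."""

    def __init__(self):
        self.docs = {}
        self.metadata = {}

    @staticmethod
    def _key(docid: int) -> bytes:
        # Hypothetical key format: prefix plus big-endian docid.
        return b"doctext:" + struct.pack(">I", docid)

    def add(self, docid: int, data: str, packed_text: bytes) -> None:
        self.docs[docid] = data
        self.metadata[self._key(docid)] = packed_text

    def delete(self, docid: int) -> None:
        del self.docs[docid]
        # Nothing ties the entry to the document, so remove it explicitly.
        self.metadata.pop(self._key(docid), None)

    def orphans(self) -> list:
        """Metadata keys whose document no longer exists."""
        return [k for k in self.metadata
                if struct.unpack(">I", k[-4:])[0] not in self.docs]


store = TextStore()
store.add(1, "title=a", b"text of a")
store.add(2, "title=b", b"text of b")
store.delete(1)
```

Forgetting the explicit removal in `delete()` would leave an orphaned text entry behind, which `orphans()` would then report.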

 > We could also add a way to read document data in chunks rather than
 > all at once, and then if you put the document text last in the document
 > data you should be able to read the other fields without much penalty.

Thanks. For now, I'll reluctantly take a closer look at using user metadata
records.

jf
