Jean-Francois Dockes
2018-Jan-03 15:18 UTC
Storing the documents text: data record or value ?
Hi,

Following the Recoll snippets generation performance problem caused by the new positions list storage scheme in Xapian 1.4, I am experimenting with generating snippets from the complete document text stored in the index. This increases the index size much less than I would have expected (around 10-15% apparently with my home directory data), which is good news obviously.

I have tried storing the text in the data record, or in a value (after compressing it). Storing in a value uses a tiny bit more space, I am guessing because of the co-compression of related data occurring when storing in the data record.

Seen from the outside, it would appear to make sense to use values, so that code which needs to access the data record but not the full document text does not pay a performance penalty.

I am wondering if there are other arguments for using either method ?

Cheers,
jf
On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
> Seen from the outside, it would appear to make sense to use values, so that
> code which needs to access the data record but not the full document text
> does not pay a performance penalty.
>
> I am wondering if there are other arguments for using either method ?

I wouldn't recommend using a value to store large data - fundamentally it's not what they're intended for, and that's likely to end up biting you because design decisions get made based on their intended uses.

A minor current example is that the backend tracks upper and lower bounds on all the values in a given slot, so you get a pointless (for you) extra copy of the text of two documents, plus a lot of pointless comparing of document texts to keep track of which is the largest and smallest. We've discussed tracking a binned distribution for each slot, which would allow optimisations when sorting or doing value ranges, but would mean more pointless overhead for your case.

If you want to store the document text separately, I'd put it in the user metadata (build a key from the docid, ideally one which sorts in the same order as the integer docids do so that append works very efficiently - you could copy Xapian's pack_uint_preserving_sort() for that).

You'll want to compress the document text yourself (currently at least, though I wonder if we should support transparent compression of user metadata entries - mostly they aren't compressed because they're stored in the postlist table which doesn't have transparent compression on because it's unhelpful for updating postlist chunks, and currently transparent compression is either on or off per table, but doing it based on the type of entry wouldn't be hard).

We could also add a way to read document data in chunks rather than all at once, and then if you put the document text last in the document data you should be able to read the other fields without much penalty.

Cheers,
Olly
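
To make the user metadata suggestion concrete, here is a minimal Python sketch of building a docid-ordered key and compressing the text by hand. It is only an illustration under stated assumptions: the key is a fixed-width big-endian integer, which is not the encoding Xapian's pack_uint_preserving_sort() produces but has the same sort-preserving property for 32-bit docids; the `doctext:` prefix and all function names are hypothetical; zlib stands in for whatever compressor the application already uses.

```python
import struct
import zlib

# Hypothetical prefix to keep these entries apart from other user metadata.
KEY_PREFIX = b"doctext:"


def text_key(docid: int) -> bytes:
    """Build a metadata key whose byte order matches integer docid order.

    Assumption: docids fit in 32 bits. A fixed-width big-endian integer
    compares bytewise in the same order as the integers themselves.
    """
    return KEY_PREFIX + struct.pack(">I", docid)


def pack_text(text: str) -> bytes:
    """Compress the document text before storing it."""
    return zlib.compress(text.encode("utf-8"))


def unpack_text(blob: bytes) -> str:
    """Reverse pack_text()."""
    return zlib.decompress(blob).decode("utf-8")
```

The pair `text_key(docid)`, `pack_text(text)` is what would be handed to the database's set_metadata() call. Because new docids are normally larger than existing ones, keys built this way are appended at the end of the key range rather than inserted in the middle.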
Jean-Francois Dockes
2018-Jan-04 19:02 UTC
Storing the documents text: data record or value ?
Olly Betts writes:
> On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
> > Seen from the outside, it would appear to make sense to use values, so that
> > code which needs to access the data record but not the full document text
> > does not pay a performance penalty.
> >
> > I am wondering if there are other arguments for using either method ?
>
> I wouldn't recommend using a value to store large data - fundamentally
> it's not what they're intended for, and that's likely to end up biting
> you because design decisions get made based on their intended uses.
>
> A minor current example is that the backend tracks upper and lower
> bounds on all the values in a given slot, so you get a pointless (for
> you) extra copy of the text of two documents, plus a lot of pointless
> comparing of document texts to keep track of which is the largest and
> smallest. We've discussed tracking a binned distribution for each slot,
> which would allow optimisations when sorting or doing value ranges, but
> would mean more pointless overhead for your case.

Ok, no values then...

> If you want to store the document text separately, I'd put it in the
> user metadata (build a key from the docid, ideally one which sorts in
> the same order as the integer docids do so that append works very
> efficiently - you could copy Xapian's pack_uint_preserving_sort() for
> that).
>
> You'll want to compress the document text yourself (currently at least,
> though I wonder if we should support transparent compression of user
> metadata entries - mostly they aren't compressed because they're stored
> in the postlist table which doesn't have transparent compression on
> because it's unhelpful for updating postlist chunks, and currently
> transparent compression is either on or off per table, but doing it
> based on the type of entry wouldn't be hard).

The compression is not the problem (already doing it when storing in values).

What makes user metadata records less convenient is that they are not linked to a Xapian document by Xapian itself. This makes several things slightly more complicated.

> We could also add a way to read document data in chunks rather than
> all at once, and then if you put the document text last in the document
> data you should be able to read the other fields without much penalty.

Thanks, for now, I'll reluctantly take a better look at using user metadata records.

jf
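
The missing link between documents and metadata records means the application has to keep the two in step itself. A minimal Python sketch of what that bookkeeping could look like follows. It is not taken from Recoll: the helper names and the zero-padded `doctext:` key scheme are hypothetical, and `db` is only assumed to offer add_document(), delete_document(), set_metadata() and get_metadata() in the style of Xapian's WritableDatabase, with an empty value meaning "no entry" and get_metadata() returning bytes.

```python
import zlib


def metadata_key(docid: int) -> str:
    # Hypothetical key scheme: zero-padded decimal, so string order
    # follows docid order for docids below 10**10.
    return "doctext:%010d" % docid


def add_with_text(db, doc, text: str) -> int:
    """Add the document, then store its compressed text under a docid key."""
    docid = db.add_document(doc)
    db.set_metadata(metadata_key(docid), zlib.compress(text.encode("utf-8")))
    return docid


def delete_with_text(db, docid: int) -> None:
    """Delete the document and its text entry together."""
    db.delete_document(docid)
    # Assumption: storing an empty value removes the metadata entry.
    db.set_metadata(metadata_key(docid), b"")


def get_text(db, docid: int) -> str:
    """Return the stored text, or an empty string if there is none."""
    blob = db.get_metadata(metadata_key(docid))
    return zlib.decompress(blob).decode("utf-8") if blob else ""
```

Every code path that adds, replaces or deletes a document has to go through helpers like these, otherwise orphaned text entries accumulate. That is the extra complication compared with keeping the text in the data record or in a value, where Xapian removes it along with the document.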