thr3ads.net - Xapian devel - [Xapian-devel] term duplication among index tables [Nov 2006]

If this information is useful, please help other people find it:
Share via:

Peter A. Friend

2006-Nov-03 18:20 UTC

[Xapian-devel] term duplication among index tables

Greetings all,

I have been looking into using Xapian to provide search for email. I  
have to be very careful about indexing overhead, and I notice that  
several of the tables use term strings as part of the key or data. I  
was wondering if there is a performance or complexity reason for not  
having a separate table mapping term strings to unique numbers, which  
could then be used in the other tables. Is this something that has  
been considered previously and discarded as unworkable, or do you  
think it may be worth pursuing?

Cheers,

Peter

richard@lemurconsulting.com

2006-Nov-03 19:48 UTC

head link

[Xapian-devel] term duplication among index tables

On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend
wrote:> I have been looking into using Xapian to provide search for email. I  
> have to be very careful about indexing overhead, and I notice that  
> several of the tables use term strings as part of the key or data. I  
> was wondering if there is a performance or complexity reason for not  
> having a separate table mapping term strings to unique numbers, which  
> could then be used in the other tables. Is this something that has  
> been considered previously and discarded as unworkable, or do you  
> think it may be worth pursuing?
It's a perfectly good question, and one which has been discussed several
times in the past (though maybe not on this list).  Indeed, IIRC, one of
the early backend implementations used a central lexicon of terms, and
referenced this lexicon using term IDs in the posting list, position list
and termlist tables.  This idea was discarded at a later date for
performance reasons.

The problem is that while getting the database size as small as possible is
useful, our primary goal with Xapian (at least, with the current backends)
is to make searches as fast as possible; and the speed is very much
dependent on the number of disk reads.  Use of a lexicon risks forcing
searches to perform an extra disk read for each term in the query, to look
up the term in the lexicon.

It is possible that the idea could still be useful for some tables, since
the lexicon should be relatively small, and could therefore get largely
cached in memory, but this would, of course, reduce the amount of space
available for caching other parts of the database.

I would be surprised if it could be used effectively for storing termlists,
since accessing such a list would require lots of lookups in the lexicon.
In any case, these are now rather well compressed (at least with the flint
backend).  It doesn't seem inconceivable that it could be useful for the
posting list or position list tables.

If you want to try any experiments, profiling results of an implementation
with term IDs would be quite interesting.

-- 
Richard

Seemingly Similar Threads

Search for more possibly parallel threads

Xapian devel - Nov 2006 - term duplication among index tables

[Xapian-devel] term duplication among index tables

[Xapian-devel] term duplication among index tables

Seemingly Similar Threads