Greetings all, I have been looking into using Xapian to provide search for email. I have to be very careful about indexing overhead, and I notice that several of the tables use term strings as part of the key or data. I was wondering if there is a performance or complexity reason for not having a separate table mapping term strings to unique numbers, which could then be used in the other tables. Is this something that has been considered previously and discarded as unworkable, or do you think it may be worth pursuing? Cheers, Peter
richard@lemurconsulting.com
2006-Nov-03 19:48 UTC
[Xapian-devel] term duplication among index tables
On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:> I have been looking into using Xapian to provide search for email. I > have to be very careful about indexing overhead, and I notice that > several of the tables use term strings as part of the key or data. I > was wondering if there is a performance or complexity reason for not > having a separate table mapping term strings to unique numbers, which > could then be used in the other tables. Is this something that has > been considered previously and discarded as unworkable, or do you > think it may be worth pursuing?It's a perfectly good question, and one which has been discussed several times in the past (though maybe not on this list). Indeed, IIRC, one of the early backend implementations used a central lexicon of terms, and referenced this lexicon using term IDs in the posting list, position list and termlist tables. This idea was discarded at a later date for performance reasons. The problem is that while getting the database size as small as possible is useful, our primary goal with Xapian (at least, with the current backends) is to make searches as fast as possible; and the speed is very much dependent on the number of disk reads. Use of a lexicon risks forcing searches to perform an extra disk read for each term in the query, to look up the term in the lexicon. It is possible that the idea could still be useful for some tables, since the lexicon should be relatively small, and could therefore get largely cached in memory, but this would, of course, reduce the amount of space available for caching other parts of the database. I would be surprised if it could be used effectively for storing termlists, since accessing such a list would require lots of lookups in the lexicon. In any case, these are now rather well compressed (at least with the flint backend). It doesn't seem inconceivable that it could be useful for the posting list or position list tables. If you want to try any experiments, profiling results of an implementation with term IDs would be quite interesting. -- Richard