Hello,

I am building a personal search tool, based on xapian-core and qt.

I am experimenting with not stemming at indexing time (for a personal system, the database size will not usually be an issue), and handling it at query time.

The idea is to stem the user's query term and find the set of database terms that stem to the same value (more or less like what is in the "Using stemming in IR" paragraph in the stemming page on xapian.org). The query can then be (optionally) expanded to the stem siblings.

Given that the database volumes are not going to be gigantic, it would be easy to build the stem->SetOfWords database at the end of indexing, by extracting and stemming the whole term list from the Xapian db (it takes a few seconds for my 300,000 terms db).

I could then store the result using any indexed file manager like gdbm or whatever. I am wondering though if I could use the xapian backend to handle the storage. Would it be absurd, for example, to have pseudo documents indexed by something like a unique STM:stemvalue term, and to store the word list in the document data ? Or would you suggest another way ?

Or is this all just wrong, and I should stem during indexing like omindex ?

Incidentally, if somebody is interested in taking a look at the software (it is still very incomplete, but may already be somewhat useful in some cases), it is at http://perso.wanadoo.fr/dockes/recoll/.

Regards,
Jean-Francois Dockes
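A minimal sketch of the stem->SetOfWords construction described above, in Python. It makes two assumptions that are not in the message: the term list has already been extracted into a plain list (with xapian-core that would come from iterating Database::allterms_begin() to allterms_end()), and toy_stem is a hypothetical stand-in for a real stemmer such as Xapian::Stem, kept trivial so the example runs on its own.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_map(terms, stem=toy_stem):
    """Group index terms by their stem: stem -> set of terms sharing it."""
    groups = defaultdict(set)
    for term in terms:
        groups[stem(term)].add(term)
    return dict(groups)


# Assumed input: the unstemmed term list pulled out of the index.
terms = ["index", "indexes", "indexed", "indexing",
         "search", "searches", "xapian"]
stem_map = build_stem_map(terms)

for stem_value, words in sorted(stem_map.items()):
    print(stem_value, "->", sorted(words))
```

The resulting dictionary is what would then be written to whatever storage is chosen (gdbm, a second Xapian database, and so on); the storage step is deliberately left out here.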
On Tue, Feb 08, 2005 at 07:40:25PM +0100, Jean-Francois Dockes wrote:

> Given that the database volumes are not going to be gigantic, it would be
> easy to build the stem->SetOfWords database at the end of indexing, by
> extracting and stemming the whole term list from the Xapian db (it takes a
> few seconds for my 300,000 terms db).

Right.

> I am wondering though if I could use the xapian backend to handle
> the storage. Would it be absurd, for example, to have pseudo
> documents indexed by something like a unique STM:stemvalue term, and
> to store the word list in the document data ? Or would you suggest
> another way ?

I'd advise /either/ having a different database for it (so you don't need STM:stemvalue, just 'stemvalue') /or/ just using the stemmed terms to index the documents, but add in another term which you can filter on the /lack/ of for normal searches.

The reason the second one might be worth considering is that putting it within the same database might compress the termlist better - although I can't actually remember how termlist compression works, so it might not. (At the least, it will help where stemmed terms exactly match unstemmed words indexing the 'regular' documents.)

> Or is this all just wrong, and I should stem during indexing like
> omindex ?

It probably depends on what machines these are designed to run on. Stemming at index time will probably chew less disk space, so on low (ish :-) memory machines that will probably work better than the larger database you'll get by not stemming (just because stemming conflates terms, but also the terms will be shorter on average). Particularly important if you see people typing in quick queries regularly, but not constantly (so they use another application in the meantime, pushing some of the Xapian database out of file buffers).

On the other hand, search-time stemming and query expansion gives you advantages in not needing to detect the language of everything you stem right now. For a personal search tool, that might be a big bonus.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
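The second option above (stem pseudo-documents in the same database, tagged with an extra term that normal searches exclude) can be illustrated with a toy in-memory model. This is only a sketch of the filtering logic, not Xapian code: the marker name XSTEMDOC, the docs dictionary and the match helper are all hypothetical. In xapian-core the exclusion would presumably be expressed with Xapian::Query::OP_AND_NOT, and the opposite restriction with OP_FILTER.

```python
MARKER = "XSTEMDOC"  # hypothetical marker term carried only by stem pseudo-documents

# Toy index: docid -> (set of index terms, document data).
docs = {
    1: ({"indexing", "search"}, "a regular document"),
    2: ({"index"}, "another regular document"),
    # Pseudo-document: indexed by the bare stem plus the marker,
    # with the sibling word list kept in the document data.
    3: ({"index", MARKER}, "index indexes indexed indexing"),
}


def match(term, exclude=None, require=None):
    """Return ids of documents containing term, optionally filtered on a second term."""
    hits = []
    for docid, (terms, _data) in sorted(docs.items()):
        if term not in terms:
            continue
        if exclude is not None and exclude in terms:
            continue  # filter on the presence of the marker: drop it
        if require is not None and require not in terms:
            continue  # filter on the lack of the marker: drop it
        hits.append(docid)
    return hits


# Normal search: only regular documents, pseudo-documents are filtered out.
normal_hits = match("index", exclude=MARKER)
# Stem lookup: only the pseudo-document for this stem.
stem_hits = match("index", require=MARKER)
siblings = docs[stem_hits[0]][1].split()

print(normal_hits, stem_hits, siblings)
```

With the separate-database option none of this filtering is needed, which is part of what makes it the simpler of the two.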
(subject: where to store the (stem to words) relationship)

James Aylett writes:

> I'd advise /either/ having a different database for it (so you don't
> need STM:stemvalue, just 'stemvalue') /or/ just using the stemmed
> terms to index the documents, but add in another term which you can
> filter on the /lack/ of for normal searches.

Thanks a lot, I implemented storing in separate databases. Better to keep it simple, as the stem database is very small in practice (many terms stem to themselves, or have no other terms that stem to the same value, and so do not need an entry).

In fact it's so small, I could store precomputed versions for several languages. I guess that cross-language stemming is going to produce strange results at times, but it's more or less bound to happen if the user mixes documents in different languages, which is probably the general case.

> On the other hand, search-time stemming and query expansion gives you
> advantages in not needing to detect the language of everything you
> stem right now. For a personal search tool, that might be a big bonus.

Yes, I am not sure how useful it will be, but it does seem nice to be able to turn stemming on/off or change languages at query time, on the fly.

Regards,
J.F. Dockes
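A rough sketch of the query-time side described above: one small precomputed map per language, entries dropped when a stem has no siblings, and expansion that can be switched off or pointed at another language per query. Everything named here (toy_stem, prune, stem_maps, stemmers, expand) is hypothetical and only illustrates the idea; toy_stem stands in for a real per-language stemmer, and the returned word list is what would be combined with an OR operator in the real query.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prune(stem_map):
    """Keep only stems with at least two sibling words; the rest need no entry."""
    return {stem: words for stem, words in stem_map.items() if len(words) > 1}


# One precomputed (and pruned) map per language; only one language shown here.
stem_maps = {
    "english": prune({
        "index": {"index", "indexes", "indexed", "indexing"},
        "xapian": {"xapian"},  # single word, removed by prune()
    }),
}
stemmers = {"english": toy_stem}


def expand(term, lang=None):
    """Return the words to OR together for one user term.

    lang=None means stemming is switched off for this query.
    """
    if lang is None:
        return [term]
    siblings = stem_maps[lang].get(stemmers[lang](term), set())
    # Always keep the user's own term, even when the map has no entry.
    return sorted(siblings | {term})


expanded = expand("indexing", lang="english")
unchanged = expand("indexing")
no_entry = expand("xapian", lang="english")
print(expanded, unchanged, no_entry)
```

Because the language is just a lookup key at query time, adding another language only means adding one more small map and its stemmer.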