thr3ads.net - Xapian discuss - [Xapian-discuss] Stemming [Feb 2005]

Jean-Francois Dockes

2005-Feb-08 18:41 UTC

[Xapian-discuss] Stemming

Hello,
I am building a personal search tool, based on xapian-core and qt. I am
experimenting with not stemming at indexing time (for a personal system,
the database size will not usually be an issue), and handling it at query
time.

The idea is to stem the user's query term and find the set of database
terms that stem to the same value (more or less like what is in the "Using
stemming in IR" paragraph in the stemming page on xapian.org). The query
can then be (optionally) expanded to the stem siblings.

Given that the database volumes are not going to be gigantic, it would be
easy to build the stem->SetOfWords database at the end of indexing, by
extracting and stemming the whole term list from the Xapian db (it takes a
few seconds for my 300,000 terms db). 
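
A minimal sketch of that post-indexing step, assuming the term list has already been pulled out of the database into a plain Python list. The `toy_stem` function is a hypothetical stand-in for a real stemmer (it only strips a few English suffixes), and `build_stem_map` is an illustrative name, not part of Xapian:

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stem_map(terms, stem=toy_stem):
    """Group every index term under the stem it reduces to."""
    stem_map = defaultdict(set)
    for term in terms:
        stem_map[stem(term)].add(term)
    return dict(stem_map)

# Sample term list; a real one would be read from the index after indexing.
terms = ["walk", "walks", "walked", "walking", "search", "searches", "index"]
stem_map = build_stem_map(terms)
```

The whole pass is a single loop over the term list, which fits the observation that it only takes a few seconds for a few hundred thousand terms.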

I could then store the result using any indexed file manager like gdbm or
whatever.
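
As a rough illustration of the "indexed file manager" option, the sketch below stores one record per stem using Python's standard `dbm.dumb` module in place of gdbm. The record layout (sibling words joined by spaces) and the helper names `save_stem_map` and `load_siblings` are assumptions made for this example:

```python
import dbm.dumb
import os
import tempfile

def save_stem_map(stem_map, path):
    """Write one record per stem; the value is the space-separated word list."""
    with dbm.dumb.open(path, "n") as db:
        for stem, words in stem_map.items():
            db[stem] = " ".join(sorted(words))

def load_siblings(path, stem):
    """Return the set of words stored for a stem, or an empty set if absent."""
    with dbm.dumb.open(path, "r") as db:
        if stem not in db:
            return set()
        return set(db[stem].decode("utf-8").split())

# Sample data written to a throwaway location.
path = os.path.join(tempfile.mkdtemp(), "stemdb")
save_stem_map({"walk": {"walk", "walks"}, "search": {"search", "searches"}}, path)
```

A space-separated value only works if terms never contain spaces; a different separator or encoding would be needed otherwise.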

I am wondering though if I could use the xapian backend to handle the
storage. Would it be absurd, for example, to have pseudo documents indexed
by something like a unique STM:stemvalue term, and to store the word list
in the document data? Or would you suggest another way?

Or is this all just wrong, and I should stem during indexing like omindex?

Incidentally, if somebody is interested in taking a look at the software
(it is still very incomplete, but may already be somewhat useful in some
cases), it is at http://perso.wanadoo.fr/dockes/recoll/.

Regards,
Jean-Francois Dockes

James Aylett

2005-Feb-09 15:28 UTC

[Xapian-discuss] Stemming

On Tue, Feb 08, 2005 at 07:40:25PM +0100, Jean-Francois Dockes wrote:
> Given that the database volumes are not going to be gigantic, it would be
> easy to build the stem->SetOfWords database at the end of indexing, by
> extracting and stemming the whole term list from the Xapian db (it takes a
> few seconds for my 300,000 terms db). 
Right.
> I am wondering though if I could use the xapian backend to handle
> the storage. Would it be absurd, for example, to have pseudo
> documents indexed by something like a unique STM:stemvalue term, and
> to store the word list in the document data? Or would you suggest
> another way?
I'd advise /either/ having a different database for it (so you don't
need STM:stemvalue, just 'stemvalue') /or/ just using the stemmed
terms to index the documents, but add in another term which you can
filter on the /lack/ of for normal searches.

The reason the second one might be worth considering is that putting
it within the same database might compress the termlist better -
although I can't actually remember how termlist compression works, so
it might not. (At the least, it will help where stemmed terms exactly
match unstemmed words indexing the 'regular' documents.)
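
The second option above (pseudo documents indexed by the plain stem, plus a marker term that normal searches exclude) can be sketched with a toy inverted index. The marker name `XSTEMDOC`, the document ids, and both helper functions are made up for illustration; in Xapian itself the exclusion would presumably be expressed with an `OP_AND_NOT` query rather than set arithmetic:

```python
# Toy inverted index: term -> set of document ids.
# Documents 100 and 101 are pseudo documents holding stem word lists;
# they carry the hypothetical marker term XSTEMDOC.
MARKER = "XSTEMDOC"
index = {
    "walk": {1, 2, 100},
    "search": {2, 101},
    MARKER: {100, 101},
}

def normal_search(index, term, marker=MARKER):
    """Regular search: matches for the term, minus anything carrying the marker."""
    return index.get(term, set()) - index.get(marker, set())

def stem_lookup(index, stem, marker=MARKER):
    """Stem lookup: only pseudo documents (those carrying the marker) match."""
    return index.get(stem, set()) & index.get(marker, set())
```

Here `normal_search(index, "walk")` never sees document 100, while `stem_lookup(index, "walk")` sees only that document.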
> Or is this all just wrong, and I should stem during indexing like
> omindex?
It probably depends on what machines these are designed to run
on. Stemming at index time will probably chew less disk space, so on
low (ish :-) memory machines that will probably work better than the
larger database you'll get by not stemming (not just because stemming
conflates terms, but also because the terms will be shorter on
average). This is particularly important if you see people typing in
quick queries regularly, but not constantly (so they use another
application in the meantime, pushing some of the Xapian database out
of file buffers).

On the other hand, search-time stemming and query expansion gives you
advantages in not needing to detect the language of everything you
stem right now. For a personal search tool, that might be a big bonus.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

Jean-Francois Dockes

2005-Feb-10 13:16 UTC

[Xapian-discuss] Stemming

(subject: where to store the stem-to-words relationship)
James Aylett writes:
 > I'd advise /either/ having a different database for it (so you don't
 > need STM:stemvalue, just 'stemvalue') /or/ just using the stemmed
 > terms to index the documents, but add in another term which you can
 > filter on the /lack/ of for normal searches.

Thanks a lot, I implemented storing in separate databases. Better to keep
it simple, as the stem database is very small in practice (many terms
stem to themselves, or have no other terms that stem to the same value, and
so do not need an entry). In fact it's so small, I could store precomputed
versions for several languages. 
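
One possible reading of that pruning rule, as a sketch: an entry is dropped when its stem has only a single word in the index. The function name `prune_stem_map` and the sample data are assumptions, and the lookup side has to fall back to the query term itself when a stem has no entry:

```python
def prune_stem_map(stem_map):
    """Keep only stems that have at least two sibling words in the index."""
    return {stem: words for stem, words in stem_map.items() if len(words) > 1}

# Sample full map, as it might look before pruning.
full_map = {
    "walk": {"walk", "walks", "walking"},
    "index": {"index"},
    "xapian": {"xapian"},
    "search": {"search", "searches"},
}
small_map = prune_stem_map(full_map)
```

With this sample, two of the four entries disappear, which gives a feel for why the stored database stays small.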

I guess that cross-language stemming is going to produce strange results at
times, but it's more or less bound to happen if the user mixes documents
in different languages, which is probably the general case.

 > On the other hand, search-time stemming and query expansion gives you
 > advantages in not needing to detect the language of everything you
 > stem right now. For a personal search tool, that might be a big bonus.

Yes, I am not sure how useful it will be, but it does seem nice to be able to
turn stemming on/off or change languages at query time, on the fly. 
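
A small sketch of what switching at query time could look like, assuming one precomputed stem map per language. The stemmers here are hypothetical toy suffix strippers, and `expand_term`, `STEMMERS` and `STEM_MAPS` are illustrative names only:

```python
def make_toy_stemmer(suffixes):
    """Build a hypothetical stemmer that strips the first matching suffix."""
    def stem(word):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return stem

STEMMERS = {
    "english": make_toy_stemmer(("ing", "ed", "s")),
    "french": make_toy_stemmer(("ons", "ez", "er")),
}

# One precomputed stem -> words map per language (sample data).
STEM_MAPS = {
    "english": {"walk": {"walk", "walks", "walking"}},
    "french": {"cherch": {"chercher", "cherchons", "cherchez"}},
}

def expand_term(term, language=None):
    """Return the index terms to search for; language=None turns stemming off."""
    if language is None:
        return {term}
    stem = STEMMERS[language](term)
    return {term} | STEM_MAPS[language].get(stem, set())
```

The expanded set would then be combined with OR in the actual query, so the choice of language (or of no stemming at all) never touches the index.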

Regards,
J.F. Dockes
