Ben Campbell
2008-Dec-09 17:54 UTC
[Xapian-discuss] Newbie question: ESets, finding similar documents
I'm using ESets to look for similar documents using the following method:

1) build an RSet of example documents (my dataset consists of newspaper articles, and my RSet is a bunch of articles written by a particular journalist)
2) use Enquire::get_eset(20, reldocs) to get an ESet
3) build a query using the terms in the ESet (term OP_OR term OP_OR term etc...)

But get_eset often returns me useless terms, eg:
['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon', 'for', 'Zfor']

(the particular journalist in this example covers environmental issues, so I'm interested in other articles which are about the environment - I'd want terms like "environment", "oil", "climate" etc...)

Now... most of these terms would be considered stopwords - should I be using a stopper to avoid indexing them in the first place? I was under the impression that it was best to leave such words in for positional reasons...

Does anyone have any good ideas on how I could improve my results? I'd have thought that the terms I'm getting back were so frequent that they'd be useless for an ESet... but maybe I don't really understand how ESets are intended to be used... is there any particular documentation I might have missed?

Any suggestions welcome!

Thanks,
Ben Campbell
Olly Betts
2008-Dec-10 01:08 UTC
[Xapian-discuss] Newbie question: ESets, finding similar documents
On Tue, Dec 09, 2008 at 05:54:17PM +0000, Ben Campbell wrote:
> I'm using ESets to look for similar documents using the following method:
>
> 1) build an RSet of example documents (my dataset consists of newspaper
> articles, and my RSet is a bunch of articles written by a particular
> journalist)
> 2) use Enquire::get_eset(20, reldocs) to get an ESet
> 3) build a query using the terms in the ESet (term OP_OR term OP_OR term
> etc...)
>
> But get_eset often returns me useless terms, eg:
> ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear',
> 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon',
> 'for', 'Zfor']

I'm surprised that the list is so bad.

> (the particular journalist in this example covers environmental issues,
> so I'm interested in other articles which are about the environment -
> I'd want terms like "environment", "oil", "climate" etc...)
>
> Now... most of these terms would be considered stopwords - should I be
> using a stopper to avoid indexing them in the first place? I was under
> the impression that it was best to leave such words in for positional
> reasons...

Yes, that's generally best.

> Does anyone have any good ideas on how I could improve my results?

You can provide an ExpandDecider which rejects terms like these.

You probably don't want both stemmed and unstemmed forms - again an ExpandDecider can take care of that.

You might find OmegaExpandDecider::operator() in query.cc useful to look at:

http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2265

Cheers,
Olly
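
For illustration, here is a minimal sketch of the kind of test an ExpandDecider could apply to the term list quoted in this thread. It is plain Python and does not call Xapian, so it can be run on its own. The stopword set and the names keep_term and filter_terms are hypothetical, and comparing a stem against an unstemmed stopword list is a simplifying assumption that only works when the stem equals the word. It is not the logic used by OmegaExpandDecider. In the Python bindings, the same test would normally live in the __call__ method of a xapian.ExpandDecider subclass passed to get_eset.

```python
# Hypothetical stopword list for illustration; a real one would be much longer.
STOPWORDS = {"are", "but", "be", "it", "that", "is", "there", "on", "for"}


def keep_term(term, stopwords=STOPWORDS):
    """Return True if an expansion term should be kept.

    Assumes stemmed terms carry the 'Z' prefix, as in the list quoted above.
    """
    # Keep only the stemmed form, so the query does not end up with both
    # a stemmed and an unstemmed variant of the same word.
    if not term.startswith("Z"):
        return False
    # Reject stems that match a stopword (an approximation: this assumes
    # the stopword's stem is the stopword itself).
    return term[1:] not in stopwords


def filter_terms(terms, stopwords=STOPWORDS):
    """Apply keep_term to a list of candidate terms, preserving order."""
    return [t for t in terms if keep_term(t, stopwords)]


# The term list quoted earlier in the thread.
eset_terms = ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it',
              'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere',
              'on', 'Zon', 'for', 'Zfor']

kept = filter_terms(eset_terms)
print(kept)
```

With a decider like this, the requested ESet size of 20 would be filled with terms that pass the test, instead of being mostly used up by stopwords.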