thr3ads.net - Xapian discuss - [Xapian-discuss] Newbie question: ESets, finding similar documents [Dec 2008]

Ben Campbell

2008-Dec-09 17:54 UTC

[Xapian-discuss] Newbie question: ESets, finding similar documents

I'm using ESets to look for similar documents using the following method:

1) build an RSet of example documents (my dataset consists of newspaper 
articles, and my RSet is a bunch of articles written by a particular 
journalist)
2) use Enquire::get_eset(20, reldocs) to get an ESet
3) build a query using the terms in the ESet (term OP_OR term OP_OR term 
etc...)
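
The three steps above could be sketched roughly as follows with the Xapian Python bindings. This is an illustrative sketch, not the poster's actual code: the function name `build_similarity_query`, its parameters, and the deferred import are assumptions, and `db` is assumed to be an already opened `xapian.Database`.

```python
def build_similarity_query(db, example_docids, max_terms=20):
    """Return an OP_OR query over the top expand terms for the example docs."""
    # Deferred import (an assumption for this sketch): requires the Xapian
    # Python bindings to be installed when the function is called.
    import xapian

    # Step 1: mark the example documents as relevant.
    rset = xapian.RSet()
    for docid in example_docids:
        rset.add_document(docid)

    # Step 2: ask Xapian for the terms that best characterise the RSet.
    enquire = xapian.Enquire(db)
    eset = enquire.get_eset(max_terms, rset)
    terms = [item.term for item in eset]

    # Step 3: OR the suggested terms together into a single query.
    return xapian.Query(xapian.Query.OP_OR, terms)
```

The returned query would then be passed to `Enquire.set_query()` and run with `get_mset()` as usual.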

But get_eset often returns useless terms, e.g.:
['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear',
'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon',
'for', 'Zfor']
(the particular journalist in this example covers environmental issues, 
so I'm interested in other articles which are about the environment - 
I'd want terms like "environment", "oil", "climate" etc...)

Now... most of these terms would be considered stopwords - should I be 
using a stopper to avoid indexing them in the first place? I was under 
the impression that it was best to leave such words in for positional 
reasons...

Does anyone have any good ideas on how I could improve my results?
I'd have thought that the terms I'm getting back were so frequent that 
they'd be useless for an ESet... but maybe I don't really understand how 
ESets are intended to be used... is there any particular documentation I 
might have missed?

Any suggestions welcome!
Thanks,
Ben Campbell

Olly Betts

2008-Dec-10 01:08 UTC

On Tue, Dec 09, 2008 at 05:54:17PM +0000, Ben Campbell wrote:
> I'm using ESets to look for similar documents using the following method:
> 
> 1) build an RSet of example documents (my dataset consists of newspaper 
> articles, and my RSet is a bunch of articles written by a particular 
> journalist)
> 2) use Enquire::get_eset(20, reldocs) to get an ESet
> 3) build a query using the terms in the ESet (term OP_OR term OP_OR term 
> etc...)
> 
> But get_eset often returns me useless terms, eg:
> ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear',
> 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon',
> 'for', 'Zfor']

I'm surprised that the list is so bad.

> (the particular journalist in this example covers environmental issues, 
> so I'm interested in other articles which are about the environment - 
> I'd want terms like "environment", "oil", "climate" etc...)
> 
> Now... most of these terms would be considered stopwords - should I be 
> using a stopper to avoid indexing them in the first place? I was under 
> the impression that it was best to leave such words in for positional 
> reasons...

Yes, that's generally best.

> Does anyone have any good ideas on how I could improve my results?

You can provide an ExpandDecider which rejects terms like these.  You
probably don't want both stemmed and unstemmed forms - again an
ExpandDecider can take care of that.

You might find OmegaExpandDecider::operator() in query.cc useful to
look at:

http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2265
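
As a rough illustration of the kind of test an ExpandDecider might apply (this is not Omega's actual logic), the sketch below rejects stopwords and drops the stemmed, 'Z'-prefixed form of each term so that only unstemmed forms survive. The name `keep_expand_term` and the tiny `STOPWORDS` set are hypothetical, and keeping the unstemmed rather than the stemmed form is an arbitrary choice made for this example.

```python
# Hypothetical stopword list for illustration; a real one would be much longer.
STOPWORDS = frozenset(
    {"are", "be", "but", "for", "is", "it", "on", "that", "there"}
)


def keep_expand_term(term):
    """Return True if `term` should be kept as an expansion term."""
    # Some versions of the bindings hand terms over as bytes (assumption).
    if isinstance(term, bytes):
        term = term.decode("utf-8")
    # Stemmed terms carry a 'Z' prefix; keep only the unstemmed form.
    if term.startswith("Z"):
        return False
    return term not in STOPWORDS


# The term list from the original post.
suggested = [
    "Zsay", "are", "Zare", "says", "but", "Zbut", "be", "it", "Zyear",
    "Zthat", "that", "is", "Zis", "Zit", "Zbe", "Zthere", "on", "Zon",
    "for", "Zfor",
]
kept = [t for t in suggested if keep_expand_term(t)]
```

With the Python bindings, the same check should be usable from the `__call__` method of an `xapian.ExpandDecider` subclass, with an instance passed as an extra argument to `get_eset` so that rejected terms never take up one of the 20 slots.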

Cheers,
    Olly
