thr3ads.net - Xapian discuss - [Xapian-discuss] Getting documents "like" arbitrary text? [Jun 2008]

Ryan Shaw

2008-Jun-29 14:56 UTC

[Xapian-discuss] Getting documents "like" arbitrary text?

Hello,

I am trying to build a service that takes arbitrary text as input, treats
it as a document and returns a list of similar documents from my
index. I'm using the Xapian Python bindings.

My thought was to do something like this:

database = xapian.Database('/opt/index')
querydoc = xapian.Document()
querydoc.set_data(u'test')
indexer = xapian.TermGenerator()
stemmer = xapian.Stem("english")
indexer.set_stemmer(stemmer)
indexer.set_document(querydoc)
indexer.index_text(arbitrary_text)
rset = xapian.RSet()
rset.add_document(querydoc)
enquire = xapian.Enquire(database)
eset = enquire.get_eset(40, rset)

...then use the list of terms in eset to query for a set of matching documents.

Because RSet.add_document takes a docid, it seems I must add my
document to a database before I can include it in a relevance set. I
don't really want to add the arbitrary input text to my index, though.
Should I be going about this a different way?

Thanks,
Ryan

Olly Betts

2008-Jun-30 21:30 UTC

[Xapian-discuss] Getting documents "like" arbitrary text?

On Sun, Jun 29, 2008 at 04:56:16PM +0200, Ryan Shaw wrote:
> Because RSet.add_document takes a docid, it seems I must add my
> document to a database before I can include it in a relevance set. I
> don't really want to add the arbitrary input text to my index, though.
> Should I be going about this a different way?

Look at OP_ELITE_SET, which was added for this sort of thing.  Give
it all the terms from the document and it will pick the best N and make
them into an "OR" query.  It can of course be combined with other query
operators.
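
The OP_ELITE_SET suggestion above might look roughly like the following in the Python bindings. This is a minimal sketch, not Olly's own code: the function name build_similarity_query and the elite_size default are made up for illustration, and it assumes the index was built with the same TermGenerator and stemmer settings, so that the generated terms match those stored in the database. The Document is only a scratch container for the generated terms and is never added to any database.

```python
def build_similarity_query(text, elite_size=10, language="english"):
    """Build an OP_ELITE_SET query from arbitrary text (illustrative helper)."""
    # Imported here so the module still loads where the bindings are missing.
    import xapian

    # Scratch document: only used to collect terms, never added to a database.
    doc = xapian.Document()
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem(language))
    indexer.set_document(doc)
    indexer.index_text(text)

    # Collect every term the TermGenerator produced for the text.
    terms = [item.term for item in doc.termlist()]

    # Ask Xapian to keep the best `elite_size` terms and OR them together.
    return xapian.Query(xapian.Query.OP_ELITE_SET, terms, elite_size)
```

The resulting query could then be passed to enquire.set_query() and run with enquire.get_mset() against the database opened in the original post, with no need for an RSet or a docid.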

You should probably think of "best" as defined by outcome rather than
anything else, but currently it picks the terms with the highest max
termweight (as reported by the current weighting scheme).
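
To make the selection rule concrete, here is a toy illustration in plain Python of "pick the N terms with the highest max termweight, then OR them". It does not call Xapian at all: the helper name pick_elite_terms and the weight values are invented for the example, whereas real weights would come from the weighting scheme and the database statistics.

```python
import heapq


def pick_elite_terms(max_weights, n):
    """Return the n terms with the highest weight, best first."""
    return heapq.nlargest(n, max_weights, key=max_weights.get)


# Hypothetical per-term maximum weights, for illustration only.
weights = {"xapian": 7.2, "the": 0.1, "index": 3.4, "similar": 4.9}

elite = pick_elite_terms(weights, 2)
print(" OR ".join(elite))  # prints: xapian OR similar
```

Very common terms such as "the" get a low weight and drop out, which is why all the terms of the input text can be handed over without filtering them first.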

Cheers,
    Olly
