Hello,

I am trying to build a service that takes arbitrary text as input, treats it as a document, and returns a list of similar documents from my index. I'm using the Xapian Python bindings. My thought was to do something like this:

    database = xapian.Database('/opt/index')

    querydoc = xapian.Document()
    querydoc.set_data(u'test')

    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem("english")
    indexer.set_stemmer(stemmer)
    indexer.set_document(querydoc)
    indexer.index_text(arbitrary_text)

    rset = xapian.RSet()
    rset.add_document(querydoc)

    enquire = xapian.Enquire(database)
    eset = enquire.get_eset(40, rset)

...then use the list of terms in eset to query for a set of matching documents.

Because RSet.add_document takes a docid, it seems I must add my document to a database before I can include it in a relevance set. I don't really want to add the arbitrary input text to my index, though. Should I be going about this a different way?

Thanks,
Ryan
Olly Betts
2008-Jun-30 21:30 UTC
[Xapian-discuss] Getting documents "like" arbitrary text?
On Sun, Jun 29, 2008 at 04:56:16PM +0200, Ryan Shaw wrote:
> Because RSet.add_document takes a docid, it seems I must add my
> document to a database before I can include it in a relevance set. I
> don't really want to add the arbitrary input text to my index, though.
> Should I be going about this a different way?

Look at OP_ELITE_SET, which was added for this sort of thing. Give it all the terms from the document and it will pick the best N and make them into an "OR" query. It can of course be combined with other query operators.

You should probably think of "best" as defined by outcome rather than anything else, but currently it picks the terms with the highest max termweight (as reported by the current weighting scheme).

Cheers,
Olly