Kevin Duraj
2009-May-12 01:44 UTC
[Xapian-discuss] Problem with weight_cutoff (not percent_cutoff)
I have a problem with weight_cutoff (not to be confused with percent_cutoff, which works fine).

A search of mine matches approximately 1 million documents. In order to sort them, I need to cut off documents that I know at indexing time are less important than others. I therefore assign a weight of 50 or more to important documents during indexing, hoping that when the result set gets too big I can cut off all documents with a weight below 50, as in the following example:

    $enq->set_cutoff(0, 50);

But it does not work. When I set the weight cutoff to 50 (not the percent cutoff), I get no results at all, even though I know for sure that many documents matching that search were given a weight of 50 or more. I would like to know whether I am making a mistake somewhere or whether this is a bug in Xapian.

Thanks,
Kevin Duraj
http://myhealthcare.com/
Olly Betts
2009-May-12 15:12 UTC
[Xapian-discuss] Problem with weight_cutoff (not percent_cutoff)
On Mon, May 11, 2009 at 06:44:34PM -0700, Kevin Duraj wrote:
> I have approximately 1 million of documents found by a search criteria
> in order to sort them I am need to cutoff documents that I know during
> indexing are not as important as some other documents. Therefore I
> assign 50+ weight to important documents during indexing and hoping
> that when my result sets gets too big I can cutoff all document with
> weight less than 50 as on the following example.
>
> $enq->set_cutoff(0, 50);

The "weight" you are setting during indexing is the within-document frequency (wdf) of a term. This is used to calculate the weight of a matching document, but the document weight won't simply be equal to the wdf, at least not with the supplied weighting schemes.

If you want the document weight to simply equal the sum of the wdf, you could implement your own weighting scheme where this is true (you'll probably need to use 1.1.x for this, as user weighting schemes are rather restricted in the statistics they can access in 1.0.x). But beware that this will probably give you noticeably worse search results.

A better way to get rid of the unimportant documents would be to add a boolean term to the important ones and filter the results by this term.

Cheers,
Olly
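
To make the point about wdf versus document weight concrete, here is a rough sketch in Python. It uses a textbook BM25-style term weight, which is an assumption for illustration only and not Xapian's exact implementation. The function name, the parameter values (k1, b) and the collection statistics are all hypothetical. It shows that the weight grows with wdf but levels off well below the wdf itself, which would explain why a weight cutoff of 50 can discard every match.

```python
import math


def bm25_term_weight(wdf, doc_len, avg_doc_len, num_docs, term_freq,
                     k1=1.0, b=0.5):
    """Textbook BM25 contribution of one query term to one document.

    Hypothetical helper for illustration; k1 and b are assumed values,
    and this is not a copy of Xapian's own weighting code.
    """
    # Inverse document frequency: rarer terms count for more.
    idf = math.log((num_docs - term_freq + 0.5) / (term_freq + 0.5))
    # Document length relative to the collection average.
    norm_len = doc_len / avg_doc_len
    # The wdf appears in both numerator and denominator, so the weight
    # saturates: it can never exceed idf * (k1 + 1).
    return idf * wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * norm_len))


# Assumed collection statistics, loosely modelled on the thread:
# one million documents, of which 10,000 contain the query term.
NUM_DOCS = 1_000_000
TERM_FREQ = 10_000

for wdf in (1, 50, 500):
    weight = bm25_term_weight(wdf, doc_len=1000, avg_doc_len=1000,
                              num_docs=NUM_DOCS, term_freq=TERM_FREQ)
    print(f"wdf={wdf:4d}  weight={weight:.3f}")
```

Under these assumptions, a term indexed with a wdf of 50 still contributes a weight far below 50, so the raw wdf is not a usable threshold for a weight cutoff. This is consistent with the advice above to mark important documents with a boolean term and filter on it.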